Readability-Alternative 1.0 A robust, configurable alternative to Readability.js for extracting main content from web pages. Supports Node.js (via jsdom) and browsers, intelligently handling multi-column layouts, noise removal, and Markdown-ready text output. Ideal for web scraping, content extraction, and Chrome extensions.

pcontact/Readability-Alternative-v.1.0

Readability-Alternative 1.0

A robust, configurable alternative to Readability.js for extracting main content from web pages. It runs in Node.js (with jsdom) and in browsers, and it handles multi-column layouts, noise removal, and Markdown-ready text output.


Features

  • Extracts main content intelligently from articles, sections, and divs.

  • Handles multi-column layouts and merges content in reading order.

  • Removes noise: script, style, noscript, iframe, .ads, .advertisement.

  • Returns a result object, shaped like the Readability.js parse() result, with:

    • title – Document title
    • textContent – Cleaned main text
    • content – HTML of the main container(s)
    • length – Character count of textContent
    • excerpt – First excerptLength characters of textContent (200 by default)
  • Markdown-ready text formatting (headings, paragraphs, lists, code, figures).

  • Fully configurable: ignored classes, thresholds, tag boosts, excerpt length.

  • Compatible with Node.js (via jsdom) and browser/Chrome extensions.
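
The noise-removal step listed above could look roughly like the following. This is a minimal sketch, not the library's actual code: `stripNoise` and `NOISE_SELECTOR` are hypothetical names, and matching `ignoreClasses` against the `class` attribute is an assumption. It relies only on the standard DOM methods `querySelectorAll`, `getAttribute`, and `remove`.

```javascript
// Hypothetical helper: illustrates the idea, not the library's implementation.
const NOISE_SELECTOR = 'script, style, noscript, iframe, .ads, .advertisement';

function stripNoise(root, ignoreClasses = /nav|sidebar|ads/i) {
  // Remove the fixed noise elements named in the feature list.
  root.querySelectorAll(NOISE_SELECTOR).forEach((el) => el.remove());

  // Assumption: ignoreClasses is tested against each element's class attribute.
  let removed = 0;
  root.querySelectorAll('[class]').forEach((el) => {
    if (ignoreClasses.test(el.getAttribute('class') || '')) {
      el.remove();
      removed += 1;
    }
  });

  // Number of elements removed because of ignoreClasses.
  return removed;
}
```

In a browser this would be called as `stripNoise(document)`. Note that a short pattern such as `ads` also matches class names like `downloads`, so a real pattern may need word boundaries.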


Installation (Node.js)

npm install jsdom

Include the Readability.js file in your project.


Node.js Usage Example

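import Readability from './Readability.js';
import fs from 'fs';

// Load the page to analyse as an HTML string.
const html = fs.readFileSync('example.html', 'utf-8');

// Any option left out falls back to its default (see Configuration Options).
const reader = new Readability(html, {
  ignoreClasses: /nav|sidebar|ads/i,
  minTextLength: 60,
  columnMinText: 40,
  columnThreshold: 0.25,
  tagBoosts: { article: 1.7, main: 1.5 },
  excerptLength: 300
});

const article = reader.parse();

// parse() returns null when extraction fails, so check before reading fields.
if (article) {
  console.log("Title:", article.title);
  console.log("Excerpt:", article.excerpt);
  console.log("Text content:\n", article.textContent);
} else {
  console.error("No main content could be extracted.");
}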

Browser Usage Example

// In a Chrome extension content script or browser console
const reader = new Readability(document, {
  ignoreClasses: /nav|sidebar|ads/i,
  minTextLength: 60,
  columnMinText: 40,
  columnThreshold: 0.25,
  tagBoosts: { article: 1.7, main: 1.5 },
  excerptLength: 300
});

const article = reader.parse();

// parse() returns null when extraction fails, so check before reading fields.
if (article) {
  console.log("Title:", article.title);
  console.log("Excerpt:", article.excerpt);
  console.log("Text content:\n", article.textContent);
} else {
  console.error("No main content could be extracted.");
}

Configuration Options

Option           Type    Default                                               Description
ignoreClasses    RegExp  /aside|nav|footer|header|sidebar|ads|advertisement/i  Regex to ignore unwanted elements
minTextLength    Number  50                                                    Minimum text length for main content candidates
columnMinText    Number  30                                                    Minimum text length for column children
columnThreshold  Number  0.3                                                   Fraction of max column score to include
tagBoosts        Object  { article: 1.5, section: 1.2, main: 1.3 }             Boost multipliers for tags
excerptLength    Number  200                                                   Number of characters in returned excerpt
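
To make the numeric options more concrete, here is one way they might interact. This is a sketch under stated assumptions, not the library's scoring code: `mergeOptions`, `scoreCandidate`, and `selectColumns` are hypothetical names, the score formula (text length times tag boost) is an illustrative guess, and merging `tagBoosts` per tag rather than replacing the whole map is also assumed. The default values are the ones given in the table; `ignoreClasses` is left out for brevity.

```javascript
// Sketch only: every function name here is hypothetical, not the library's API.
const DEFAULTS = {
  minTextLength: 50,
  columnMinText: 30,
  columnThreshold: 0.3,
  tagBoosts: { article: 1.5, section: 1.2, main: 1.3 },
  excerptLength: 200,
};

// Assumption: user tagBoosts are merged per tag instead of replacing the defaults.
function mergeOptions(userOptions = {}) {
  return {
    ...DEFAULTS,
    ...userOptions,
    tagBoosts: { ...DEFAULTS.tagBoosts, ...(userOptions.tagBoosts || {}) },
  };
}

// Illustrative score: text length times the tag's boost (1 when the tag has none).
// Candidates shorter than minTextLength are rejected with a score of 0.
function scoreCandidate({ tag, textLength }, options) {
  if (textLength < options.minTextLength) return 0;
  const boost = options.tagBoosts[tag] ?? 1;
  return textLength * boost;
}

// Keep columns with enough text whose score reaches columnThreshold * best score.
// filter() preserves input order, so columns passed in reading order stay in order.
function selectColumns(columns, options) {
  const eligible = columns.filter((c) => c.textLength >= options.columnMinText);
  if (eligible.length === 0) return [];
  const maxScore = Math.max(...eligible.map((c) => c.score));
  return eligible.filter((c) => c.score >= options.columnThreshold * maxScore);
}

// Usage: with columnThreshold 0.25 the cut-off is 0.25 * 400 = 100.
const options = mergeOptions({ columnThreshold: 0.25, tagBoosts: { article: 1.7 } });
const kept = selectColumns(
  [
    { id: 'left', textLength: 400, score: 400 },
    { id: 'ads', textLength: 20, score: 20 },     // dropped: below columnMinText
    { id: 'right', textLength: 120, score: 120 },
    { id: 'teaser', textLength: 60, score: 60 },  // dropped: below the cut-off
  ],
  options
);
console.log(kept.map((c) => c.id)); // logs the ids "left" and "right"
```

Lowering `columnThreshold` admits weaker columns, which helps on pages with one dominant column and a short but relevant second one, at the cost of letting more sidebar text through.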

Returned Object

The parse() method returns an object:

{
  title: "Page Title",
  textContent: "Cleaned main text of the article...",
  content: "<article>HTML content...</article>",
  length: 1234,
  excerpt: "First 200 characters of main text..."
}

If extraction fails, parse() returns null.
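
Because parse() can return null, callers may want a single place that handles both outcomes. The helper below is a minimal sketch of that idea; `describeArticle` is a hypothetical name, not part of the library, and it only reads the fields documented above.

```javascript
// Hypothetical helper: turns a parse() result (or null) into a one-line summary.
function describeArticle(article) {
  // parse() returns null when no main content could be extracted.
  if (article === null) {
    return 'No main content found';
  }
  const { title, excerpt, length } = article;
  return `${title} (${length} chars): ${excerpt}`;
}

// Usage with a hand-written object shaped like the documented result.
const sample = {
  title: 'Page Title',
  textContent: 'Hello world',
  content: '<article>Hello world</article>',
  length: 11,
  excerpt: 'Hello world',
};
console.log(describeArticle(sample));
console.log(describeArticle(null));
```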


Supported Tags & Markdown Conversion

  • <h1> to <h6> → Markdown headings
  • <p> → Paragraphs
  • <li> → List items (- )
  • <pre> / <code> → Code blocks
  • <figure> → ![caption](src) if an <img> exists; the caption comes from the optional <figcaption>
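
The mapping above can be sketched as a small converter. To stay runnable without a DOM, this version works on plain objects of the form `{ tag, text, src, caption }` rather than real elements; that shape, and the names `blockToMarkdown` and `toMarkdown`, are assumptions made for illustration and may differ from how the library walks the document.

```javascript
// Build the code fence indirectly so this example can itself sit inside a fence.
const FENCE = '`'.repeat(3);

// Convert one simplified block node to Markdown ('' means "skip this node").
function blockToMarkdown(node) {
  const heading = /^h([1-6])$/.exec(node.tag);
  if (heading) {
    // h1..h6 map to one to six leading '#' characters.
    return `${'#'.repeat(Number(heading[1]))} ${node.text}`;
  }
  switch (node.tag) {
    case 'p':
      return node.text;
    case 'li':
      return `- ${node.text}`;
    case 'pre':
    case 'code':
      return `${FENCE}\n${node.text}\n${FENCE}`;
    case 'figure':
      // Emit an image only when the figure has an image source; caption is optional.
      return node.src ? `![${node.caption || ''}](${node.src})` : '';
    default:
      return '';
  }
}

// Join the converted blocks with blank lines, dropping skipped nodes.
function toMarkdown(nodes) {
  return nodes.map(blockToMarkdown).filter(Boolean).join('\n\n');
}

// Usage
console.log(
  toMarkdown([
    { tag: 'h2', text: 'Intro' },
    { tag: 'p', text: 'Hello.' },
    { tag: 'li', text: 'one' },
    { tag: 'figure', src: 'a.png', caption: 'A chart' },
  ])
);
```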

Comparison with Readability.js

Feature / Aspect            Readability.js          Readability-Alternative 1.0
Environment Support         Browser only            Browser + Node.js (via jsdom)
Noise Removal               Basic (scripts/styles)  Enhanced (scripts, styles, ads, iframes, custom ignored classes)
Multi-column Detection      No                      Yes, intelligently merges columns in reading order
Markdown-ready Text Output  No                      Yes, handles headings, lists, code blocks, figures
Configurable Tag Boosts     No                      Yes, supports boosting article, section, main tags
Configurable Thresholds     No                      Yes, minimum text length, column thresholds, excerpt length
Excerpt Generation          No                      Yes, configurable excerpt from main content
API Output                  {title, textContent}    {title, textContent, content, length, excerpt}
Resiliency / Edge Cases     Moderate                High: guards against empty nodes, malformed HTML, missing elements
Installation                Built-in in browser     Node.js: requires jsdom, browser: works directly
Customization               Limited                 High: regex for ignored classes, tag boosts, thresholds, excerpt length

Key Enhancements in Readability-Alternative 1.0

  1. Node.js Support — Extract content server-side or in scripts.
  2. Multi-column & Merge-aware — Preserves reading order.
  3. Robust Noise Removal — Removes scripts, ads, and custom unwanted elements.
  4. Markdown-friendly Output — Ready for export or processing.
  5. Configurable & Extensible — Fine-tune thresholds, boosts, and ignored elements.
  6. Production-Ready — Handles empty or malformed DOMs gracefully.

Notes

  • Node.js version requires jsdom.
  • Browser version works with any Document or HTMLElement (e.g., document).
  • Both versions share the same config structure and API, enabling consistent behavior across environments.

License

MIT License — free to use and modify.
