Readability-Alternative 1.0

A robust, configurable alternative to Readability.js for extracting the main content from web pages. It runs in Node.js (with jsdom) and in the browser, and it handles multi-column layouts, noise removal, and Markdown-ready text output.

Features

Extracts main content intelligently from articles, sections, and divs.
Handles multi-column layouts and merges content in reading order.
Removes noise: script, style, noscript, iframe, .ads, .advertisement.
Returns API-compatible objects with:
- title – Document title
- textContent – Cleaned main text
- content – HTML of the main container(s)
- length – Character count of textContent
- excerpt – First N characters of text
Markdown-ready text formatting (headings, paragraphs, lists, code, figures).
Fully configurable: ignored classes, thresholds, tag boosts, excerpt length.
Compatible with Node.js (via jsdom) and browser/Chrome extensions.
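
The noise-removal feature listed above can be pictured with a short sketch. It is an illustration under assumptions, not the library's actual code: `stripNoise` and `NOISE_SELECTOR` are hypothetical names, and the default pattern is written with `|` alternation on the assumption that this is the intended regex. Only standard DOM calls (`querySelectorAll`, `getAttribute`, `remove`) are used, so it should work on a browser `document` or a jsdom document.

```javascript
// Tags and classes the feature list names as noise.
const NOISE_SELECTOR = "script, style, noscript, iframe, .ads, .advertisement";

// Hypothetical helper: detaches noise elements from `root` and returns how many
// elements had remove() called on them.
function stripNoise(root, ignoreClasses = /aside|nav|footer|header|sidebar|ads|advertisement/i) {
  let removed = 0;
  // Pass 1: fixed list of noisy tags and ad classes.
  for (const el of Array.from(root.querySelectorAll(NOISE_SELECTOR))) {
    el.remove();
    removed++;
  }
  // Pass 2: any element whose class attribute matches the configurable pattern.
  for (const el of Array.from(root.querySelectorAll("[class]"))) {
    if (ignoreClasses.test(el.getAttribute("class") || "")) {
      el.remove();
      removed++;
    }
  }
  return removed;
}
```

One thing to keep in mind with a pattern like this: it matches substrings, so `ads` would also match a class such as `downloads`. Word boundaries or a stricter pattern may be worth considering when tuning `ignoreClasses`.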

Installation (Node.js)

npm install jsdom

Include the Readability.js file in your project.

Node.js Usage Example

import Readability from './Readability.js';
import fs from 'fs';

const html = fs.readFileSync('example.html', 'utf-8');

const reader = new Readability(html, {
  ignoreClasses: /nav|sidebar|ads/i,
  minTextLength: 60,
  columnMinText: 40,
  columnThreshold: 0.25,
  tagBoosts: { article: 1.7, main: 1.5 },
  excerptLength: 300
});

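const article = reader.parse();

// parse() returns null when extraction fails, so guard before reading fields.
if (article) {
  console.log("Title:", article.title);
  console.log("Excerpt:", article.excerpt);
  console.log("Text content:\n", article.textContent);
} else {
  console.error("No main content could be extracted.");
}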

Browser Usage Example

// In a Chrome extension content script or browser console
const reader = new Readability(document, {
  ignoreClasses: /nav|sidebar|ads/i,
  minTextLength: 60,
  columnMinText: 40,
  columnThreshold: 0.25,
  tagBoosts: { article: 1.7, main: 1.5 },
  excerptLength: 300
});

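const article = reader.parse();

// parse() returns null when extraction fails, so guard before reading fields.
if (article) {
  console.log("Title:", article.title);
  console.log("Excerpt:", article.excerpt);
  console.log("Text content:\n", article.textContent);
} else {
  console.error("No main content could be extracted.");
}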

Configuration Options

Option	Type	Default	Description
`ignoreClasses`	RegExp	`/aside|nav|footer|header|sidebar|ads|advertisement/i`	Regex to ignore unwanted elements
`minTextLength`	Number	`50`	Minimum text length for main content candidates
`columnMinText`	Number	`30`	Minimum text length for column children
`columnThreshold`	Number	`0.3`	Fraction of max column score to include
`tagBoosts`	Object	`{ article: 1.5, section: 1.2, main: 1.3 }`	Boost multipliers for tags
`excerptLength`	Number	`200`	Number of characters in returned excerpt
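
To show how the defaults in the table might combine with user-supplied options, here is a minimal sketch. `DEFAULT_OPTIONS` and `resolveOptions` are hypothetical names, not part of the documented API. The `ignoreClasses` default is written with `|` alternation on the assumption that this is the intended pattern. Whether the library merges `tagBoosts` key by key or replaces the whole object is not stated in this README; the sketch assumes a key-by-key merge.

```javascript
// Defaults copied from the Configuration Options table.
const DEFAULT_OPTIONS = {
  ignoreClasses: /aside|nav|footer|header|sidebar|ads|advertisement/i,
  minTextLength: 50,
  columnMinText: 30,
  columnThreshold: 0.3,
  tagBoosts: { article: 1.5, section: 1.2, main: 1.3 },
  excerptLength: 200
};

// Hypothetical helper: overlays user options on the defaults.
function resolveOptions(userOptions = {}) {
  return {
    ...DEFAULT_OPTIONS,
    ...userOptions,
    // Assumption: merge tagBoosts per key so overriding `article` keeps `section`.
    tagBoosts: { ...DEFAULT_OPTIONS.tagBoosts, ...(userOptions.tagBoosts || {}) }
  };
}

const options = resolveOptions({ minTextLength: 60, tagBoosts: { article: 1.7, main: 1.5 } });
console.log(options.minTextLength); // 60
console.log(options.tagBoosts);     // { article: 1.7, section: 1.2, main: 1.5 }
```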

Returned Object

The parse() method returns an object:

{
  title: "Page Title",
  textContent: "Cleaned main text of the article...",
  content: "<article>HTML content...</article>",
  length: 1234,
  excerpt: "First 200 characters of main text..."
}

If extraction fails, parse() returns null.
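
The relationship between the fields can be sketched as follows. This is an illustration only: `buildResult` and `describeArticle` are hypothetical helpers, and treating empty text as "extraction failed" is an assumption, since the README does not say which conditions produce `null`. The sketch also does not trim the excerpt or append an ellipsis; the library may do either.

```javascript
// Hypothetical helper: assembles the documented result shape from extracted parts.
function buildResult(title, textContent, content, excerptLength = 200) {
  // Assumption: no extracted text means extraction failed.
  if (!textContent) return null;
  return {
    title,
    textContent,
    content,
    length: textContent.length,                  // character count of textContent
    excerpt: textContent.slice(0, excerptLength) // first N characters of the text
  };
}

// Hypothetical helper: formats a parse()-style result and handles the null case.
function describeArticle(article) {
  if (article === null) {
    return "No main content found";
  }
  return `${article.title} (${article.length} chars): ${article.excerpt}`;
}

console.log(describeArticle(buildResult("Demo", "Hello world", "<p>Hello world</p>", 5)));
// Demo (11 chars): Hello
```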

Supported Tags & Markdown Conversion

<h1>–<h6> → Markdown headings
<p> → Paragraphs
<li> → List items (- )
<pre> / <code> → Code blocks
<figure> → ![caption](src) when it contains an <img>; the caption is taken from <figcaption> if one exists
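
The mapping above can be sketched as a small function. To keep the example runnable without a DOM, it works on plain objects of the form `{ tag, text, src, caption }`; that shape and the name `toMarkdown` are assumptions made for illustration, and the library's own conversion may differ in detail (for example in how inline `<code>` is handled).

```javascript
// Built with repeat() so the sketch does not embed a literal code fence.
const FENCE = "`".repeat(3);

// Hypothetical helper: converts one simplified node description to Markdown.
function toMarkdown(node) {
  const tag = node.tag.toLowerCase();
  const raw = node.text || "";
  const text = raw.trim();

  // <h1> to <h6>: one "#" per heading level.
  const heading = /^h([1-6])$/.exec(tag);
  if (heading) return "#".repeat(Number(heading[1])) + " " + text;

  switch (tag) {
    case "p":
      return text;
    case "li":
      return "- " + text;
    case "pre":
    case "code":
      // Keep the original whitespace inside code blocks.
      return FENCE + "\n" + raw + "\n" + FENCE;
    case "figure":
      // Emit an image only when a source exists; the caption is optional.
      return node.src ? `![${node.caption || ""}](${node.src})` : "";
    default:
      return text;
  }
}

const blocks = [
  { tag: "h2", text: "Setup" },
  { tag: "p", text: "Install jsdom first." },
  { tag: "li", text: "Node.js" },
  { tag: "figure", src: "diagram.png", caption: "Pipeline" }
];
console.log(blocks.map(toMarkdown).join("\n\n"));
```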

Comparison with Readability.js

Feature / Aspect	Readability.js	Readability-Alternative 1.0
Environment Support	Browser; Node.js with a DOM implementation such as `jsdom`	Browser + Node.js (via `jsdom`)
Noise Removal	Basic (scripts/styles)	Enhanced (scripts, styles, ads, iframes, custom ignored classes)
Multi-column Detection	No	Yes, intelligently merges columns in reading order
Markdown-ready Text Output	No	Yes, handles headings, lists, code blocks, figures
Configurable Tag Boosts	No	Yes, supports boosting `article`, `section`, `main` tags
Configurable Thresholds	No	Yes, minimum text length, column thresholds, excerpt length
Excerpt Generation	Yes (article description or a short excerpt)	Yes, configurable excerpt length from main content
API Output	`{title, content, textContent, length, excerpt, byline, ...}`	`{title, textContent, content, length, excerpt}`
Resiliency / Edge Cases	Moderate	High: guards against empty nodes, malformed HTML, missing elements
Installation	npm package (`@mozilla/readability`) or script include	Node.js: requires `jsdom`; browser: works directly
Customization	Limited	High: regex for ignored classes, tag boosts, thresholds, excerpt length

Key Enhancements in Readability-Alternative 1.0

Node.js Support — Extract content server-side or in scripts.
Multi-column & Merge-aware — Preserves reading order.
Robust Noise Removal — Removes scripts, ads, and custom unwanted elements.
Markdown-friendly Output — Ready for export or processing.
Configurable & Extensible — Fine-tune thresholds, boosts, and ignored elements.
Production-Ready — Handles empty or malformed DOMs gracefully.
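
As an illustration of how `columnMinText` and `columnThreshold` could interact in multi-column merging, here is a minimal sketch. The scoring rule (trimmed text length), the plain-object column shape `{ text }`, and the name `mergeColumns` are all assumptions; the README does not document how the library scores columns. Document order stands in for reading order.

```javascript
// Hypothetical helper: keeps columns that are long enough and score close enough
// to the best column, then joins them in their original order.
function mergeColumns(columns, { columnMinText = 30, columnThreshold = 0.3 } = {}) {
  // Assumption: a column's score is simply the length of its trimmed text.
  const candidates = columns
    .map((col) => ({ text: col.text.trim(), score: col.text.trim().length }))
    .filter((col) => col.score >= columnMinText);

  if (candidates.length === 0) return "";

  const maxScore = Math.max(...candidates.map((col) => col.score));

  // filter() preserves array order, so the surviving columns stay in sequence.
  return candidates
    .filter((col) => col.score >= maxScore * columnThreshold)
    .map((col) => col.text)
    .join("\n\n");
}
```

With the default values, a 100-character column and a 40-character column would both be kept (40 is at least 30 and at least 0.3 × 100), while a 20-character column would be dropped by `columnMinText`. Raising `columnThreshold` to 0.5 would drop the 40-character column as well.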

Notes

The Node.js version requires jsdom.
The browser version works with any Document or HTMLElement (e.g., document).
Both versions share the same config structure and API, enabling consistent behavior across environments.

License

MIT License — free to use and modify.
