scrpr

A fast, clean CLI tool for extracting the main content from websites. Supports browser cookie integration, UNIX pipe operations, and JavaScript rendering (coming soon).

Features

  • Clean Content Extraction: Uses readability algorithms to extract main article content with intelligent newline cleaning
  • JavaScript Support: Handles dynamic websites with ChromeDP (coming soon)
  • Browser Integration: Extract and use cookies from Chrome, Firefox, Safari, and Zen browsers
  • Pipe-Friendly: Full UNIX pipe support for command chaining
  • Multiple Formats: Output as clean text or Markdown
  • Parallel Processing: Process multiple URLs concurrently (coming soon)
  • Cookie Banner Handling: Automatically dismiss cookie banners (coming soon)

Installation

From Source

git clone https://github.com/byteowlz/scrpr.git
cd scrpr
go build -o scrpr cmd/scrpr/main.go
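
The build produces a scrpr binary in the repository root. A quick sketch for installing it (assuming /usr/local/bin is writable and on your PATH):

./scrpr --help
sudo mv scrpr /usr/local/bin/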

Binary Releases (Coming Soon)

Pre-built binaries will be available for Linux, macOS, and Windows.

Quick Start

# Extract content from a single URL (with clean text flow)
scrpr https://openshovelshack.com/blog/the-octopus-and-the-rake

# Output as Markdown
scrpr https://example.com --format markdown

# Process multiple URLs
scrpr https://example.com https://news.ycombinator.com

# Use pipes
echo "https://example.com" | scrpr
cat urls.txt | scrpr --format markdown

Usage

Basic Syntax

scrpr [urls...] [flags]

Arguments

  • urls - Website URLs to extract content from (space-separated)
  • If no URLs are provided, scrpr reads them from stdin

Flags

Input/Output

  • -f, --file string - Read URLs from file (one per line)
  • -o, --output string - Output to file or directory (default: stdout)
  • --format string - Output format: text, markdown (default: text)
  • --separator string - Output separator for multiple URLs (default: "---")
  • --null-separator - Use null byte separator (for xargs -0)

Browser Integration

  • -b, --browser string - Browser for cookie extraction: chrome, firefox, safari, zen (default: auto)

Content Processing

  • --include-metadata - Include page metadata in output
  • --user-agent string - Custom user agent string

Network

  • --timeout int - Request timeout in seconds (default: 30)

System

  • -v, --verbose - Verbose logging
  • --config string - Custom config file path

Examples

Basic Usage

# Simple content extraction
scrpr https://news.ycombinator.com

# Save as markdown file
scrpr https://example.com --format markdown -o article.md

# Include metadata
scrpr https://example.com --include-metadata

Multiple URLs

# Process multiple URLs
scrpr https://example.com https://news.ycombinator.com

# Custom separator
scrpr url1 url2 --separator "==="

# From file
scrpr -f urls.txt --format markdown

Pipe Operations

# Basic pipe input
echo "https://example.com" | scrpr

# Chain with other commands
cat bookmarks.txt | scrpr | grep "important"

# Process API responses
curl -s "api.example.com/articles" | jq -r '.urls[]' | scrpr

# Use with xargs
cat urls.txt | scrpr --null-separator | xargs -0 process_text

# Save individual articles
cat urls.txt | while read -r url; do
  scrpr "$url" --format markdown -o "$(basename "$url").md"
done

Advanced Usage

# Custom timeout and user agent
scrpr https://example.com --timeout 60 --user-agent "MyBot/1.0"

# Verbose output
scrpr https://example.com --verbose

# Use cookies from specific browser
scrpr https://example.com --browser chrome

Configuration

scrpr uses a TOML configuration file located at:

  • $XDG_CONFIG_HOME/scrpr/config.toml
  • ~/.config/scrpr/config.toml (fallback)

Example Configuration

[browser]
default = "auto"  # auto, chrome, firefox, safari, zen

[extraction]
skip_cookie_banners = true
min_content_length = 100
remove_ads = true

[output]
default_format = "text"  # text, markdown
include_metadata = false
preserve_links = true

[network]
timeout = 30
user_agent = ""
follow_redirects = true

[parallel]
max_concurrency = 5
show_progress = true

[logging]
level = "info"

See examples/config.toml for a complete configuration example.
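
A minimal setup sketch, assuming you cloned the repository and want to start from the bundled example:

mkdir -p ~/.config/scrpr
cp examples/config.toml ~/.config/scrpr/config.toml

# or point scrpr at a config elsewhere on disk
scrpr https://example.com --config ./my-config.toml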

Browser Support

scrpr can extract and use cookies from:

  • Chrome/Chromium - ~/.config/google-chrome/Default/Cookies
  • Firefox - ~/.mozilla/firefox/*/cookies.sqlite
  • Safari (macOS only) - ~/Library/Cookies/Cookies.binarycookies
  • Zen Browser - ~/.zen/*/cookies.sqlite

Cookie extraction is read-only: scrpr never modifies or stores your cookies.
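
For example, to reuse an authenticated session (a sketch; assumes you are already logged in to the site in Firefox):

scrpr https://members.example.com/article --browser firefox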

Output Formats

Text Format (default)

Clean, readable text with intelligent newline cleaning. Removes unwanted line breaks that split sentences while preserving paragraph structure.

scrpr https://example.com --format text

Newline Cleaning Features:

  • Automatically joins lines that break mid-sentence
  • Preserves paragraph breaks and intentional formatting
  • Maintains proper spacing between words
  • Recognizes sentence-ending punctuation
  • Handles bullet points and numbered lists correctly
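
For example, given extracted text that is hard-wrapped mid-sentence (hypothetical input, not from a real page):

Before cleaning:

    The quick brown fox jumps
    over the lazy dog.

    A new paragraph starts here.

After cleaning:

    The quick brown fox jumps over the lazy dog.

    A new paragraph starts here.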

Markdown Format

Structured Markdown with preserved formatting and links.

scrpr https://example.com --format markdown

Integration Examples

Content Archiving

# Archive articles from RSS feed
curl -s "https://feeds.example.com/rss.xml" | \
  grep -oE 'https://[^<]+' | \
  scrpr --format markdown -o archive/

Research Pipeline

# Extract and analyze content
cat research_urls.txt | \
  scrpr --format text | \
  grep -i "machine learning" | \
  sort | uniq -c

Documentation Generation

# Convert web pages to documentation
echo -e "url1\nurl2\nurl3" | \
  scrpr --format markdown | \
  pandoc -f markdown -o docs.pdf

Development Status

✅ Completed

  • CLI interface and argument parsing
  • Content extraction with readability
  • Intelligent newline cleaning for better text flow
  • Pipe input/output support
  • Browser cookie extraction
  • Text and Markdown output formats
  • Configuration system

🚧 In Progress

  • JavaScript rendering with ChromeDP
  • Cookie banner dismissal
  • Parallel processing
  • Enhanced metadata extraction

📋 Planned

  • Browser plugin interface
  • Custom extraction rules
  • Output to various formats (JSON, PDF)
  • Web UI for configuration

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Note: the built binary at repo root (/scrpr) is intentionally ignored by git.

License

MIT License - see LICENSE for details.
