rescrape

A personal Reddit scraper using OAuth authentication.

Overview

This project is a comprehensive Reddit scraper that authenticates with Reddit via OAuth. It provides both programmatic and command-line interfaces for scraping posts and comments from subreddits and users, or via specific search queries.

Features

  • Reddit OAuth authentication
  • Configurable scraping parameters
  • Data export in JSON, CSV, and Excel formats
  • Support for multiple sorting methods (hot, new, top, rising)
  • Time filtering for top posts
  • Search functionality across subreddits
  • Command-line interface for easy execution
  • Text User Interface (TUI) for interactive scraping
  • Logging for monitoring operations
  • Data visualization capabilities
  • Use case examples for market research, sentiment analysis, content aggregation, and academic research
  • NEW: Advanced semantic search for finding relevant technical content
  • NEW: Research assistant with intelligent answer retrieval
  • NEW: Authority scoring to prioritize reliable sources
  • NEW: Technical content detection to identify valuable technical posts

Setup

  1. Clone the repository:

    git clone <repository-url>
    cd rescrape
  2. Install dependencies:

    pip install -r requirements.txt
  3. Register a Reddit application:

    • Go to https://www.reddit.com/prefs/apps
    • Click "Create App" or "Create Another App"
    • Choose "script" as the application type
    • Fill in the required fields and create the app
    • Note down your client_id and client_secret
  4. Configure your Reddit app credentials:

    cp config/credentials.env .env
    # Edit .env with your credentials

Usage

Command Line Interface

The easiest way to use the scraper is through the command-line interface:

python scripts/scrape_reddit.py <subreddit> [options]

Examples:

# Scrape 50 hot posts from r/technology
python scripts/scrape_reddit.py technology -l 50

# Scrape 25 top posts from r/programming from the last week
python scripts/scrape_reddit.py programming -l 25 -s top -t week

# Search for "python tutorial" in r/learnpython
python scripts/scrape_reddit.py learnpython -q "python tutorial" -l 10

# Export to CSV
python scripts/scrape_reddit.py python -l 20 --format csv

Text User Interface (TUI)

The interactive experience is now powered by Bubble Tea and ships as a Go program. You can start it either directly via go run or through the provided helper script:

# Preferred
go run ./cmd/tui

# Or use the compatibility wrapper
python scripts/run_tui.py

Highlights:

  • Real-time subreddit scraping with live status feedback
  • Integrated research assistant that streams the top Reddit answers
  • Activity log with timestamps for each operation
  • Keyboard-friendly workflow for editing inputs and triggering actions

Keyboard shortcuts:

  • Tab / Shift+Tab: Move between input fields
  • Ctrl+S: Start scraping with current parameters
  • Ctrl+R: Run the research assistant for the active query
  • Ctrl+C or Esc: Quit the application

Quick Launch Aliases

After installation, you can use these convenient aliases:

# Launch the TUI interface
rescrape

# Use the CLI directly
rescrapectl <subreddit> [options]

The aliases are added to your shell configuration during installation:

  • ~/.bashrc or ~/.zshrc

If they are not yet available in your current shell, reload your configuration:

source ~/.bashrc
# or
source ~/.zshrc

Programmatic Usage

You can also use the scraper directly in your Python code:

from src.advanced_scraper import AdvancedRedditScraper

# Initialize the scraper
scraper = AdvancedRedditScraper()

# Scrape posts from a subreddit
posts = scraper.scrape_posts(
    subreddit_name='python',
    limit=50,
    sort_by='hot'
)

# Export to your preferred format
scraper.export_to_json(posts, 'my_posts.json')
scraper.export_to_csv(posts, 'my_posts.csv')
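
Building on the block above, a small post-processing step might look like this. It assumes each scraped post is a dictionary with title and score fields, which is an assumption about the returned data rather than a documented guarantee:

# Keep only well-received posts and show the ten highest-scoring titles
popular = [p for p in posts if p.get("score", 0) >= 100]
popular.sort(key=lambda p: p.get("score", 0), reverse=True)

for post in popular[:10]:
    print(f"{post.get('score', 0):>6}  {post.get('title', '')}")

scraper.export_to_json(popular, "popular_posts.json")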

Data Visualization

After scraping data, you can visualize it using the included visualization script:

python scripts/visualize_data.py my_posts.json

This will create multiple plots showing:

  • Distribution of post scores
  • Relationship between comments and scores
  • Post frequency over time
  • Top subreddits by post count
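
If you want a custom plot instead of the bundled script, a minimal pandas/matplotlib sketch of the score distribution could look like this; it assumes the exported JSON is a flat list of post records with a score field:

import json

import matplotlib.pyplot as plt
import pandas as pd

# Load the exported posts and plot how scores are distributed
with open("my_posts.json") as f:
    df = pd.DataFrame(json.load(f))

df["score"].plot(kind="hist", bins=30, title="Distribution of post scores")
plt.xlabel("Score")
plt.ylabel("Number of posts")
plt.tight_layout()
plt.savefig("score_distribution.png")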

Export to Obsidian

The enhanced rescrape now includes direct export to Obsidian for note-taking and research organization:

# Export research results via the CLI after running a query
python -m src.cli research "Rust async patterns" --export rust-async --summary

# Or programmatically:
from src.obsidian_export import ObsidianExporter

exporter = ObsidianExporter(vault_path="/path/to/your/obsidian/vault")
exporter.export_research_to_obsidian(
    research_results=my_results,
    query="How to implement LLM fine-tuning",
    filename="llm-fine-tuning-research"
)

This feature allows you to seamlessly integrate your Reddit research into your Obsidian knowledge base with proper metadata and tagging.

Use Case Examples

The project includes example scripts demonstrating various use cases:

python scripts/use_cases_demo.py

This script demonstrates:

  • Market research (analyzing technology mentions; see the sketch after this list)
  • Sentiment analysis preparation
  • Content aggregation across multiple subreddits
  • Academic research data collection
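
As a flavour of the market-research use case, the sketch below counts technology mentions in scraped titles using only the scrape_posts call documented above; the keyword list and the title field name are illustrative:

from collections import Counter

from src.advanced_scraper import AdvancedRedditScraper

TECH_KEYWORDS = ["python", "rust", "javascript", "docker", "linux"]

scraper = AdvancedRedditScraper()
posts = scraper.scrape_posts(subreddit_name="technology", limit=100, sort_by="hot")

# Tally how often each keyword appears in post titles
mentions = Counter()
for post in posts:
    title = post.get("title", "").lower()
    for keyword in TECH_KEYWORDS:
        if keyword in title:
            mentions[keyword] += 1

for keyword, count in mentions.most_common():
    print(f"{keyword}: {count}")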

Testing and Code Coverage

The project includes comprehensive test coverage with multiple test suites:

# Run all tests
python -m pytest tests/

# Run tests with coverage report
python -m pytest --cov=src tests/

# Run the test suite using the provided script
./scripts/run_tests.sh

The current code coverage is:

  • advanced_scraper.py: 58%
  • main.py: 28%
  • sops_integration.py: 56%
  • cli.py: not yet covered (new helper powering the Bubble Tea TUI)
  • Overall: 47%

Environment Variables

The following environment variables can be set in your `.env` file (a sample file follows the list):

  • `REDDIT_CLIENT_ID`: Your Reddit app's client ID
  • `REDDIT_CLIENT_SECRET`: Your Reddit app's client secret
  • `REDDIT_USER_AGENT`: User agent string for the API
  • `REDDIT_SUBREDDIT`: Default subreddit to scrape (default: python)
  • `REDDIT_LIMIT`: Number of posts to retrieve (default: 100)
  • `REDDIT_SORT_BY`: Sorting method (default: hot)
  • `REDDIT_TIME_FILTER`: Time filter for top posts (default: all)
  • `REDDIT_SEARCH_QUERY`: Optional search query
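
For reference, a .env file using these variables might look like the following; every value is a placeholder, and the user agent string is only an example format:

REDDIT_CLIENT_ID=your_client_id_here
REDDIT_CLIENT_SECRET=your_client_secret_here
REDDIT_USER_AGENT=rescrape/0.1 by u/your_reddit_username
REDDIT_SUBREDDIT=python
REDDIT_LIMIT=100
REDDIT_SORT_BY=hot
REDDIT_TIME_FILTER=all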

SOPS Integration

This project supports retrieving credentials from SOPS-encrypted files. The scraper will automatically attempt to load Reddit credentials from the LAB infrastructure's SOPS-encrypted secrets file if available. If SOPS is not available or configured, it will fall back to using the `.env` file (a sketch of this fallback order follows the steps below).

To use SOPS integration:
1. Ensure SOPS is installed on your system
2. Set the `SOPS_AGE_KEY_FILE` environment variable to point to your age key file
3. Store your Reddit credentials in a SOPS-encrypted file (default: `/home/miko/LAB/secrets/global.enc.env`)
4. The scraper will automatically load credentials from SOPS when available
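
A minimal sketch of that fallback order, assuming the sops binary is on PATH and the encrypted file is in dotenv format; the function names here are illustrative and not the project's actual API:

import subprocess
from pathlib import Path

# Default location of the SOPS-encrypted secrets file mentioned above
SOPS_FILE = Path("/home/miko/LAB/secrets/global.enc.env")

def _parse_dotenv(text):
    """Turn KEY=value lines into a dict, skipping comments and blanks."""
    pairs = {}
    for line in text.splitlines():
        if "=" in line and not line.lstrip().startswith("#"):
            key, _, value = line.partition("=")
            pairs[key.strip()] = value.strip()
    return pairs

def load_reddit_credentials():
    """Try SOPS first; fall back to a local .env file."""
    if SOPS_FILE.exists():
        try:
            # `sops -d` prints the decrypted dotenv content to stdout
            result = subprocess.run(
                ["sops", "-d", str(SOPS_FILE)],
                capture_output=True, text=True, check=True,
            )
            return _parse_dotenv(result.stdout)
        except (OSError, subprocess.CalledProcessError):
            pass  # sops missing or decryption failed: use the .env file instead
    return _parse_dotenv(Path(".env").read_text()) if Path(".env").exists() else {}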

Advanced Features

Export Formats

The scraper can export data in multiple formats:

  • JSON: For data analysis and preservation
  • CSV: For spreadsheet applications
  • Excel: For detailed analysis with formatting

Rate Limiting

The scraper respects Reddit's API rate limits. For extensive scraping, consider implementing additional delays between requests.
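
For example, a small loop can add an explicit pause between subreddit requests, using only the scrape_posts and export_to_json calls shown in the Programmatic Usage section:

import time

from src.advanced_scraper import AdvancedRedditScraper

scraper = AdvancedRedditScraper()

# Pause between subreddits to stay comfortably under the rate limit
for name in ["python", "programming", "technology"]:
    posts = scraper.scrape_posts(subreddit_name=name, limit=25, sort_by="hot")
    scraper.export_to_json(posts, f"{name}_posts.json")
    time.sleep(2)  # extra delay; tune to your scraping volume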

Data Fields

The scraper collects comprehensive post data including (an illustrative record follows the list):

  • Post ID, title, author, score, and comment count
  • URLs and permalinks
  • Creation timestamps
  • Post content (selftext)
  • Subreddit name
  • NSFW flags (over_18)
  • Flair information
  • Top-level comments
  • Upvote ratios
  • Crosspost information
  • Gilding information
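
An exported post record might look roughly like this; the exact key names depend on the export schema, so the ones shown here are illustrative:

{
  "id": "abc123",
  "title": "Example post title",
  "author": "example_user",
  "score": 1234,
  "num_comments": 56,
  "upvote_ratio": 0.97,
  "url": "https://example.com/article",
  "permalink": "https://www.reddit.com/r/python/comments/abc123/",
  "created_utc": 1700000000,
  "selftext": "Post body text...",
  "subreddit": "python",
  "over_18": false,
  "flair": "Discussion",
  "comments": ["First top-level comment...", "Second top-level comment..."]
}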

Security and Privacy

  • Store your Reddit credentials securely in the `.env` file
  • Never commit the `.env` file to version control
  • The `.env` file is included in `.gitignore` by default
  • Follow Reddit's API terms of service when scraping

Research Assistant Features

The enhanced rescrape includes a research assistant that can intelligently find answers to technical questions:

Research Assistant Interface

  • Accessible through both the TUI and the command line
  • Understands technical queries and expands them with related terms
  • Uses semantic search to find conceptually related content
  • Scores posts based on authority, engagement, and technical content indicators
  • Combines traditional keyword matching with modern semantic understanding (see the sketch after this list)
  • Provides ranked results with relevance scores
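
The sketch below shows one rough way keyword matching and semantic similarity could be blended into a single relevance score. It is illustrative only, not the project's actual scoring code, and the sentence-transformers model name is an assumption:

from sentence_transformers import SentenceTransformer, util

# Illustrative embedding model; any sentence-embedding model would do
model = SentenceTransformer("all-MiniLM-L6-v2")

def relevance_score(query, post_text, keyword_weight=0.4):
    """Blend simple keyword overlap with embedding cosine similarity."""
    query_terms = set(query.lower().split())
    post_terms = set(post_text.lower().split())
    keyword_score = len(query_terms & post_terms) / max(len(query_terms), 1)

    embeddings = model.encode([query, post_text], convert_to_tensor=True)
    semantic_score = float(util.cos_sim(embeddings[0], embeddings[1]))

    return keyword_weight * keyword_score + (1 - keyword_weight) * semantic_score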

Technical Content Detection

  • Identifies posts with code snippets, technical terms, and implementation details (a simple heuristic is sketched after this list)
  • Prioritizes authoritative sources with high karma in specific domains
  • Uses temporal relevance to surface recent information for fast-evolving fields
  • Analyzes comment threads for additional insights
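
A minimal heuristic in that spirit might look like the sketch below; the term list and thresholds are illustrative, and the real detector may use different signals:

import re

# Small, illustrative vocabulary of technical terms
TECH_TERMS = {"api", "async", "compiler", "thread", "regex", "docker", "kubernetes", "sql"}

def looks_technical(post_text):
    """Rough check for code snippets and technical vocabulary."""
    has_code_block = "```" in post_text or bool(re.search(r"^\s{4,}\S", post_text, re.MULTILINE))
    term_hits = sum(1 for word in re.findall(r"[a-z_]+", post_text.lower()) if word in TECH_TERMS)
    return has_code_block or term_hits >= 3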

Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

License

This project is licensed under the MIT License.
