rescrape

A personal Reddit scraper using OAuth authentication.

Overview

This project is a comprehensive Reddit scraper that authenticates with Reddit via OAuth. It provides both programmatic and command-line interfaces for scraping posts and comments from subreddits and users, or via specific search queries.

Features

  • Reddit OAuth authentication
  • Configurable scraping parameters
  • Data export in JSON, CSV, and Excel formats
  • Support for multiple sorting methods (hot, new, top, rising)
  • Time filtering for top posts
  • Search functionality across subreddits
  • Command-line interface for easy execution
  • Text User Interface (TUI) for interactive scraping
  • Logging for monitoring operations
  • Data visualization capabilities
  • Use case examples for market research, sentiment analysis, content aggregation, and academic research
  • NEW: Advanced semantic search for finding relevant technical content
  • NEW: Research assistant with intelligent answer retrieval
  • NEW: Authority scoring to prioritize reliable sources
  • NEW: Technical content detection to identify valuable technical posts

Setup

  1. Clone the repository:

    git clone <repository-url>
    cd rescrape
  2. Install dependencies:

    pip install -r requirements.txt
  3. Register a Reddit application:

    • Go to https://www.reddit.com/prefs/apps
    • Click "Create App" or "Create Another App"
    • Choose "script" as the application type
    • Fill in the required fields and create the app
    • Note down your client_id and client_secret
  4. Configure your Reddit app credentials:

    cp config/credentials.env .env
    # Edit .env with your credentials

Usage

Command Line Interface

The easiest way to use the scraper is through the command-line interface:

python scripts/scrape_reddit.py <subreddit> [options]

Examples:

# Scrape 50 hot posts from r/technology
python scripts/scrape_reddit.py technology -l 50

# Scrape 25 top posts from r/programming from the last week
python scripts/scrape_reddit.py programming -l 25 -s top -t week

# Search for "python tutorial" in r/learnpython
python scripts/scrape_reddit.py learnpython -q "python tutorial" -l 10

# Export to CSV
python scripts/scrape_reddit.py python -l 20 --format csv

Text User Interface (TUI)

The interactive experience is now powered by Bubble Tea and ships as a Go program. You can start it either directly via go run or through the provided helper script:

# Preferred
go run ./cmd/tui

# Or use the compatibility wrapper
python scripts/run_tui.py

Highlights:

  • Real-time subreddit scraping with live status feedback
  • Integrated research assistant that streams the top Reddit answers
  • Activity log with timestamps for each operation
  • Keyboard-friendly workflow for editing inputs and triggering actions

Keyboard shortcuts:

  • Tab / Shift+Tab: Move between input fields
  • Ctrl+S: Start scraping with current parameters
  • Ctrl+R: Run the research assistant for the active query
  • Ctrl+C or Esc: Quit the application

Quick Launch Aliases

After installation, you can use these convenient aliases:

# Launch the TUI interface
rescrape

# Use the CLI directly
rescrapectl <subreddit> [options]

The aliases are added to your shell configuration during installation:

  • ~/.bashrc or ~/.zshrc

If they are not yet available in your current shell, reload your configuration:

source ~/.bashrc
# or
source ~/.zshrc

Programmatic Usage

You can also use the scraper directly in your Python code:

from src.advanced_scraper import AdvancedRedditScraper

# Initialize the scraper
scraper = AdvancedRedditScraper()

# Scrape posts from a subreddit
posts = scraper.scrape_posts(
    subreddit_name='python',
    limit=50,
    sort_by='hot'
)

# Export to your preferred format
scraper.export_to_json(posts, 'my_posts.json')
scraper.export_to_csv(posts, 'my_posts.csv')
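
Building on the block above, a small post-processing step might look like this. It assumes each scraped post is a dictionary with title and score fields, which is an assumption about the returned data rather than a documented guarantee:

# Keep only well-received posts and show the ten highest-scoring titles
popular = [p for p in posts if p.get("score", 0) >= 100]
popular.sort(key=lambda p: p.get("score", 0), reverse=True)

for post in popular[:10]:
    print(f"{post.get('score', 0):>6}  {post.get('title', '')}")

scraper.export_to_json(popular, "popular_posts.json")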

Data Visualization

After scraping data, you can visualize it using the included visualization script:

python scripts/visualize_data.py my_posts.json

This will create multiple plots showing:

  • Distribution of post scores
  • Relationship between comments and scores
  • Post frequency over time
  • Top subreddits by post count
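
If you want a custom plot instead of the bundled script, a minimal pandas/matplotlib sketch of the score distribution could look like this; it assumes the exported JSON is a flat list of post records with a score field:

import json

import matplotlib.pyplot as plt
import pandas as pd

# Load the exported posts and plot how scores are distributed
with open("my_posts.json") as f:
    df = pd.DataFrame(json.load(f))

df["score"].plot(kind="hist", bins=30, title="Distribution of post scores")
plt.xlabel("Score")
plt.ylabel("Number of posts")
plt.tight_layout()
plt.savefig("score_distribution.png")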

Export to Obsidian

The enhanced rescrape now includes direct export to Obsidian for note-taking and research organization:

# Export research results via the CLI after running a query
python -m src.cli research "Rust async patterns" --export rust-async --summary

# Or programmatically:
from src.obsidian_export import ObsidianExporter

exporter = ObsidianExporter(vault_path="/path/to/your/obsidian/vault")
exporter.export_research_to_obsidian(
    research_results=my_results,
    query="How to implement LLM fine-tuning",
    filename="llm-fine-tuning-research"
)

This feature allows you to seamlessly integrate your Reddit research into your Obsidian knowledge base with proper metadata and tagging.

Use Case Examples

The project includes example scripts demonstrating various use cases:

python scripts/use_cases_demo.py

This script demonstrates:

  • Market research (analyzing technology mentions; see the sketch after this list)
  • Sentiment analysis preparation
  • Content aggregation across multiple subreddits
  • Academic research data collection
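
As a flavour of the market-research use case, the sketch below counts technology mentions in scraped titles using only the scrape_posts call documented above; the keyword list and the title field name are illustrative:

from collections import Counter

from src.advanced_scraper import AdvancedRedditScraper

TECH_KEYWORDS = ["python", "rust", "javascript", "docker", "linux"]

scraper = AdvancedRedditScraper()
posts = scraper.scrape_posts(subreddit_name="technology", limit=100, sort_by="hot")

# Tally how often each keyword appears in post titles
mentions = Counter()
for post in posts:
    title = post.get("title", "").lower()
    for keyword in TECH_KEYWORDS:
        if keyword in title:
            mentions[keyword] += 1

for keyword, count in mentions.most_common():
    print(f"{keyword}: {count}")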

Testing and Code Coverage

The project includes comprehensive test coverage with multiple test suites:

# Run all tests
python -m pytest tests/

# Run tests with coverage report
python -m pytest --cov=src tests/

# Run the test suite using the provided script
./scripts/run_tests.sh

The current code coverage is:

  • advanced_scraper.py: 58%
  • main.py: 28%
  • sops_integration.py: 56%
  • cli.py: not yet covered (new helper powering the Bubble Tea TUI)
  • Overall: 47%

Environment Variables

The following environment variables can be set in your `.env` file (a sample file follows the list):

  • `REDDIT_CLIENT_ID`: Your Reddit app's client ID
  • `REDDIT_CLIENT_SECRET`: Your Reddit app's client secret
  • `REDDIT_USER_AGENT`: User agent string for the API
  • `REDDIT_SUBREDDIT`: Default subreddit to scrape (default: python)
  • `REDDIT_LIMIT`: Number of posts to retrieve (default: 100)
  • `REDDIT_SORT_BY`: Sorting method (default: hot)
  • `REDDIT_TIME_FILTER`: Time filter for top posts (default: all)
  • `REDDIT_SEARCH_QUERY`: Optional search query
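
For reference, a .env file using these variables might look like the following; every value is a placeholder, and the user agent string is only an example format:

REDDIT_CLIENT_ID=your_client_id_here
REDDIT_CLIENT_SECRET=your_client_secret_here
REDDIT_USER_AGENT=rescrape/0.1 by u/your_reddit_username
REDDIT_SUBREDDIT=python
REDDIT_LIMIT=100
REDDIT_SORT_BY=hot
REDDIT_TIME_FILTER=all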

SOPS Integration

This project supports retrieving credentials from SOPS-encrypted files. The scraper will automatically attempt to load Reddit credentials from the LAB infrastructure's SOPS-encrypted secrets file if available. If SOPS is not available or configured, it will fall back to using the `.env` file (a sketch of this fallback order follows the steps below).

To use SOPS integration:
1. Ensure SOPS is installed on your system
2. Set the `SOPS_AGE_KEY_FILE` environment variable to point to your age key file
3. Store your Reddit credentials in a SOPS-encrypted file (default: `/home/miko/LAB/secrets/global.enc.env`)
4. The scraper will automatically load credentials from SOPS when available
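
A minimal sketch of that fallback order, assuming the sops binary is on PATH and the encrypted file is in dotenv format; the function names here are illustrative and not the project's actual API:

import subprocess
from pathlib import Path

# Default location of the SOPS-encrypted secrets file mentioned above
SOPS_FILE = Path("/home/miko/LAB/secrets/global.enc.env")

def _parse_dotenv(text):
    """Turn KEY=value lines into a dict, skipping comments and blanks."""
    pairs = {}
    for line in text.splitlines():
        if "=" in line and not line.lstrip().startswith("#"):
            key, _, value = line.partition("=")
            pairs[key.strip()] = value.strip()
    return pairs

def load_reddit_credentials():
    """Try SOPS first; fall back to a local .env file."""
    if SOPS_FILE.exists():
        try:
            # `sops -d` prints the decrypted dotenv content to stdout
            result = subprocess.run(
                ["sops", "-d", str(SOPS_FILE)],
                capture_output=True, text=True, check=True,
            )
            return _parse_dotenv(result.stdout)
        except (OSError, subprocess.CalledProcessError):
            pass  # sops missing or decryption failed: use the .env file instead
    return _parse_dotenv(Path(".env").read_text()) if Path(".env").exists() else {}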

Advanced Features

Export Formats

The scraper can export data in multiple formats:

  • JSON: For data analysis and preservation
  • CSV: For spreadsheet applications
  • Excel: For detailed analysis with formatting

Rate Limiting

The scraper respects Reddit's API rate limits. For extensive scraping, consider implementing additional delays between requests.
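
For example, a small loop can add an explicit pause between subreddit requests, using only the scrape_posts and export_to_json calls shown in the Programmatic Usage section:

import time

from src.advanced_scraper import AdvancedRedditScraper

scraper = AdvancedRedditScraper()

# Pause between subreddits to stay comfortably under the rate limit
for name in ["python", "programming", "technology"]:
    posts = scraper.scrape_posts(subreddit_name=name, limit=25, sort_by="hot")
    scraper.export_to_json(posts, f"{name}_posts.json")
    time.sleep(2)  # extra delay; tune to your scraping volume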

Data Fields

The scraper collects comprehensive post data including (an illustrative record follows the list):

  • Post ID, title, author, score, and comment count
  • URLs and permalinks
  • Creation timestamps
  • Post content (selftext)
  • Subreddit name
  • NSFW flags (over_18)
  • Flair information
  • Top-level comments
  • Upvote ratios
  • Crosspost information
  • Gilding information
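
An exported post record might look roughly like this; the exact key names depend on the export schema, so the ones shown here are illustrative:

{
  "id": "abc123",
  "title": "Example post title",
  "author": "example_user",
  "score": 1234,
  "num_comments": 56,
  "upvote_ratio": 0.97,
  "url": "https://example.com/article",
  "permalink": "https://www.reddit.com/r/python/comments/abc123/",
  "created_utc": 1700000000,
  "selftext": "Post body text...",
  "subreddit": "python",
  "over_18": false,
  "flair": "Discussion",
  "comments": ["First top-level comment...", "Second top-level comment..."]
}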

Security and Privacy

  • Store your Reddit credentials securely in the `.env` file
  • Never commit the `.env` file to version control
  • The `.env` file is included in `.gitignore` by default
  • Follow Reddit's API terms of service when scraping

Research Assistant Features

The enhanced rescrape includes a research assistant that can intelligently find answers to technical questions:

Research Assistant Interface

  • Accessible through both the TUI and the command line
  • Understands technical queries and expands them with related terms
  • Uses semantic search to find conceptually related content
  • Scores posts based on authority, engagement, and technical content indicators
  • Combines traditional keyword matching with modern semantic understanding (see the sketch after this list)
  • Provides ranked results with relevance scores
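
The sketch below shows one rough way keyword matching and semantic similarity could be blended into a single relevance score. It is illustrative only, not the project's actual scoring code, and the sentence-transformers model name is an assumption:

from sentence_transformers import SentenceTransformer, util

# Illustrative embedding model; any sentence-embedding model would do
model = SentenceTransformer("all-MiniLM-L6-v2")

def relevance_score(query, post_text, keyword_weight=0.4):
    """Blend simple keyword overlap with embedding cosine similarity."""
    query_terms = set(query.lower().split())
    post_terms = set(post_text.lower().split())
    keyword_score = len(query_terms & post_terms) / max(len(query_terms), 1)

    embeddings = model.encode([query, post_text], convert_to_tensor=True)
    semantic_score = float(util.cos_sim(embeddings[0], embeddings[1]))

    return keyword_weight * keyword_score + (1 - keyword_weight) * semantic_score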

Technical Content Detection

  • Identifies posts with code snippets, technical terms, and implementation details (a simple heuristic is sketched after this list)
  • Prioritizes authoritative sources with high karma in specific domains
  • Uses temporal relevance to surface recent information for fast-evolving fields
  • Analyzes comment threads for additional insights
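
A minimal heuristic in that spirit might look like the sketch below; the term list and thresholds are illustrative, and the real detector may use different signals:

import re

# Small, illustrative vocabulary of technical terms
TECH_TERMS = {"api", "async", "compiler", "thread", "regex", "docker", "kubernetes", "sql"}

def looks_technical(post_text):
    """Rough check for code snippets and technical vocabulary."""
    has_code_block = "```" in post_text or bool(re.search(r"^\s{4,}\S", post_text, re.MULTILINE))
    term_hits = sum(1 for word in re.findall(r"[a-z_]+", post_text.lower()) if word in TECH_TERMS)
    return has_code_block or term_hits >= 3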

Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

License

This project is licensed under the MIT License.
