# rescrape

A personal Reddit scraper using OAuth authentication.

This project is a comprehensive Reddit scraper that authenticates against Reddit's API via OAuth to collect data. It provides both programmatic and command-line interfaces for scraping posts and comments from subreddits, from users, or via specific search queries.

## Features
- Reddit OAuth authentication
- Configurable scraping parameters
- Data export in JSON, CSV, and Excel formats
- Support for multiple sorting methods (hot, new, top, rising)
- Time filtering for top posts
- Search functionality across subreddits
- Command-line interface for easy execution
- Text User Interface (TUI) for interactive scraping
- Logging for monitoring operations
- Data visualization capabilities
- Use case examples for market research, sentiment analysis, content aggregation, and academic research
- NEW: Advanced semantic search for finding relevant technical content
- NEW: Research assistant with intelligent answer retrieval
- NEW: Authority scoring to prioritize reliable sources
- NEW: Technical content detection to identify valuable technical posts
## Installation

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd rescrape
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Register a Reddit application:

   - Go to https://www.reddit.com/prefs/apps
   - Click "Create App" or "Create Another App"
   - Choose "script" as the application type
   - Fill in the required fields and create the app
   - Note down your `client_id` and `client_secret`

4. Configure your Reddit app credentials:

   ```bash
   cp config/credentials.env .env
   # Edit .env with your credentials
   ```
## Usage

### Command-Line Interface

The easiest way to use the scraper is through the command-line interface:

```bash
python scripts/scrape_reddit.py <subreddit> [options]
```

Examples:

```bash
# Scrape 50 hot posts from r/technology
python scripts/scrape_reddit.py technology -l 50
# Scrape 25 top posts from r/programming from the last week
python scripts/scrape_reddit.py programming -l 25 -s top -t week
# Search for "python tutorial" in r/learnpython
python scripts/scrape_reddit.py learnpython -q "python tutorial" -l 10
# Export to CSV
python scripts/scrape_reddit.py python -l 20 --format csv
```

### Text User Interface (TUI)

The interactive experience is now powered by Bubble Tea and ships as a Go program. You can start it either directly via `go run` or through the provided helper script:

```bash
# Preferred
go run ./cmd/tui
# Or use the compatibility wrapper
python scripts/run_tui.py
```

Highlights:
- Real-time subreddit scraping with live status feedback
- Integrated research assistant that streams the top Reddit answers
- Activity log with timestamps for each operation
- Keyboard-friendly workflow for editing inputs and triggering actions
Keyboard shortcuts:
- `Tab`/`Shift+Tab`: Move between input fields
- `Ctrl+S`: Start scraping with current parameters
- `Ctrl+R`: Run the research assistant for the active query
- `Ctrl+C` or `Esc`: Quit the application
### Shell Aliases

After installation, you can use these convenient aliases:

```bash
# Launch the TUI interface
rescrape
# Use the CLI directly
rescrapectl <subreddit> [options]
```

To use the aliases, add them to your shell configuration file (`~/.bashrc` or `~/.zshrc`).
After adding them to your shell configuration, reload it:

```bash
source ~/.bashrc
# or
source ~/.zshrc
```

### Python API

You can also use the scraper directly in your Python code:

```python
from src.advanced_scraper import AdvancedRedditScraper
# Initialize the scraper
scraper = AdvancedRedditScraper()
# Scrape posts from a subreddit
posts = scraper.scrape_posts(
subreddit_name='python',
limit=50,
sort_by='hot'
)
# Export to your preferred format
scraper.export_to_json(posts, 'my_posts.json')
scraper.export_to_csv(posts, 'my_posts.csv')
```

## Data Visualization

After scraping data, you can visualize it using the included visualization script:

```bash
python scripts/visualize_data.py my_posts.json
```

This will create multiple plots showing:
- Distribution of post scores
- Relationship between comments and scores
- Post frequency over time
- Top subreddits by post count
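For a rough idea of what the first plot involves, here is a minimal sketch (not the script's actual implementation) that histograms post scores from an exported JSON file, assuming the export is a list of post dicts with a `score` field:

```python
# Minimal sketch: histogram of post scores from an exported JSON file.
import json

import matplotlib.pyplot as plt

with open("my_posts.json") as f:
    posts = json.load(f)  # assumes a list of post dicts with a "score" field

plt.hist([post["score"] for post in posts], bins=30)
plt.xlabel("Post score")
plt.ylabel("Number of posts")
plt.title("Distribution of post scores")
plt.show()
```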
## Obsidian Integration

The enhanced rescrape now includes direct export to Obsidian for note-taking and research organization:

```bash
# Export research results via the CLI after running a query
python -m src.cli research "Rust async patterns" --export rust-async --summary
```

Or programmatically:

```python
from src.obsidian_export import ObsidianExporter
exporter = ObsidianExporter(vault_path="/path/to/your/obsidian/vault")
exporter.export_research_to_obsidian(
research_results=my_results,
query="How to implement LLM fine-tuning",
filename="llm-fine-tuning-research"
)
```

This feature lets you integrate your Reddit research into your Obsidian knowledge base with proper metadata and tagging.
## Use Case Examples

The project includes example scripts demonstrating various use cases:

```bash
python scripts/use_cases_demo.py
```

This script demonstrates:
- Market research (analyzing technology mentions)
- Sentiment analysis preparation
- Content aggregation across multiple subreddits
- Academic research data collection
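As a taste of the market-research use case, a sketch like the following (assuming each scraped post is a dict with a `title` field) counts technology mentions across post titles using the Python API shown earlier:

```python
# Sketch: count mentions of selected technologies across scraped post titles.
from collections import Counter

from src.advanced_scraper import AdvancedRedditScraper

TECHNOLOGIES = ["python", "rust", "go", "javascript"]

scraper = AdvancedRedditScraper()
posts = scraper.scrape_posts(subreddit_name="programming", limit=100, sort_by="hot")

mentions = Counter()
for post in posts:
    title = post["title"].lower()  # assumes each post is a dict with a "title" field
    for tech in TECHNOLOGIES:
        if tech in title:
            mentions[tech] += 1

print(mentions.most_common())
```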
## Testing

The project includes multiple test suites:

```bash
# Run all tests
python -m pytest tests/
# Run tests with coverage report
python -m pytest --cov=src tests/
# Run the test suite using the provided script
./scripts/run_tests.sh
```

The current code coverage is:
- advanced_scraper.py: 58%
- main.py: 28%
- sops_integration.py: 56%
- cli.py: not yet covered (new helper powering the Bubble Tea TUI)
- Overall: 47%
## Configuration

### Environment Variables
The following environment variables can be set in your `.env` file:
- `REDDIT_CLIENT_ID`: Your Reddit app's client ID
- `REDDIT_CLIENT_SECRET`: Your Reddit app's client secret
- `REDDIT_USER_AGENT`: User agent string for the API
- `REDDIT_SUBREDDIT`: Default subreddit to scrape (default: python)
- `REDDIT_LIMIT`: Number of posts to retrieve (default: 100)
- `REDDIT_SORT_BY`: Sorting method (default: hot)
- `REDDIT_TIME_FILTER`: Time filter for top posts (default: all)
- `REDDIT_SEARCH_QUERY`: Optional search query
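For reference, a minimal `.env` might look like this (all values below are placeholders):

```env
REDDIT_CLIENT_ID=your_client_id
REDDIT_CLIENT_SECRET=your_client_secret
REDDIT_USER_AGENT=rescrape/0.1 by your_username
REDDIT_SUBREDDIT=python
REDDIT_LIMIT=100
REDDIT_SORT_BY=hot
```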
### SOPS Integration
This project supports retrieving credentials from SOPS-encrypted files. The scraper will automatically attempt to load Reddit credentials from the LAB infrastructure's SOPS-encrypted secrets file if available. If SOPS is not available or configured, it will fall back to using the `.env` file.
To use SOPS integration:
1. Ensure SOPS is installed on your system
2. Set the `SOPS_AGE_KEY_FILE` environment variable to point to your age key file
3. Store your Reddit credentials in a SOPS-encrypted file (default: `/home/miko/LAB/secrets/global.enc.env`)
4. The scraper will automatically load credentials from SOPS when available
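The following sketch illustrates the SOPS-first loading order described above; the helper name and parsing details are illustrative, not the project's actual API:

```python
# Hypothetical sketch of SOPS-first credential loading with a .env fallback.
import os
import shutil
import subprocess
from io import StringIO

from dotenv import dotenv_values, load_dotenv  # python-dotenv

SOPS_SECRETS_FILE = "/home/miko/LAB/secrets/global.enc.env"

def load_reddit_credentials() -> dict:
    """Try the SOPS-encrypted secrets file first, then fall back to .env."""
    if shutil.which("sops") and os.path.exists(SOPS_SECRETS_FILE):
        result = subprocess.run(
            ["sops", "-d", SOPS_SECRETS_FILE],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            # Parse the decrypted dotenv-style output without writing it to disk
            return dotenv_values(stream=StringIO(result.stdout))
    # Fallback: read credentials from the local .env file
    load_dotenv()
    return {k: v for k, v in os.environ.items() if k.startswith("REDDIT_")}
```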
## Advanced Features
### Export Formats
The scraper can export data in multiple formats:
- JSON: For data analysis and preservation
- CSV: For spreadsheet applications
- Excel: For detailed analysis with formatting
### Rate Limiting
The scraper respects Reddit's API rate limits. For extensive scraping, consider implementing additional delays between requests.
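For example, a fixed pause between successive requests (using the Python API shown earlier) might look like this:

```python
# Illustrative only: spacing out requests to stay well under the rate limits.
import time

from src.advanced_scraper import AdvancedRedditScraper

scraper = AdvancedRedditScraper()
for subreddit in ["python", "programming", "learnpython"]:
    posts = scraper.scrape_posts(subreddit_name=subreddit, limit=50, sort_by="hot")
    scraper.export_to_json(posts, f"{subreddit}_posts.json")
    time.sleep(2)  # pause between subreddits to avoid hammering the API
```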
### Data Fields
The scraper collects comprehensive post data including:
- Post ID, title, author, score, and comment count
- URLs and permalinks
- Creation timestamps
- Post content (selftext)
- Subreddit name
- NSFW flags (over_18)
- Flair information
- Top-level comments
- Upvote ratios
- Crosspost information
- Gilding information
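A single exported post record might look roughly like this; the exact field names below are illustrative, based on the list above:

```python
# Approximate shape of one scraped post record (illustrative field names)
post = {
    "id": "1abc23",
    "title": "Example post title",
    "author": "example_user",
    "score": 1234,
    "num_comments": 56,
    "url": "https://example.com/article",
    "permalink": "/r/python/comments/1abc23/example_post_title/",
    "created_utc": 1700000000,
    "selftext": "Post body text...",
    "subreddit": "python",
    "over_18": False,
    "flair": "Discussion",
    "upvote_ratio": 0.97,
}
```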
## Security and Privacy
- Store your Reddit credentials securely in the `.env` file
- Never commit the `.env` file to version control
- The `.env` file is included in `.gitignore` by default
- Follow Reddit's API terms of service when scraping
## Research Assistant Features
The enhanced rescrape includes a research assistant that can intelligently find answers to technical questions:
### Research Assistant Interface
- Accessible through both TUI and command-line
- Understands technical queries and expands them with related terms
- Uses semantic search to find conceptually related content
- Scores posts based on authority, engagement, and technical content indicators
- Combines traditional keyword matching with modern semantic understanding
- Provides ranked results with relevance scores
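Conceptually, the blended ranking might be sketched as follows; the weights and the `keyword_score`, `semantic_score`, and `authority_score` helpers are hypothetical, not the project's actual API:

```python
# Conceptual sketch of the ranking described above; weights and helper
# functions are hypothetical stand-ins for the real scoring components.
def rank_posts(posts, query, keyword_score, semantic_score, authority_score):
    """Blend keyword, semantic, and authority signals into one relevance score."""
    scored = []
    for post in posts:
        relevance = (
            0.4 * keyword_score(post, query)     # traditional keyword matching
            + 0.4 * semantic_score(post, query)  # conceptual similarity
            + 0.2 * authority_score(post)        # author karma / source reliability
        )
        scored.append((relevance, post))
    return [post for _, post in sorted(scored, key=lambda x: x[0], reverse=True)]
```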
### Technical Content Detection
- Identifies posts with code snippets, technical terms, and implementation details
- Prioritizes authoritative sources with high karma in specific domains
- Uses temporal relevance to surface recent information for fast-evolving fields
- Analyzes comment threads for additional insights
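A simplified detector for these signals might look like this sketch; the term list and thresholds are illustrative:

```python
# Hypothetical detector for the technical-content signals listed above.
import re

TECH_TERMS = {"api", "compiler", "async", "algorithm", "kernel", "regression"}

def looks_technical(post: dict) -> bool:
    """Flag posts that contain code blocks or a density of technical terms."""
    text = f"{post.get('title', '')} {post.get('selftext', '')}"
    # Fenced (```) or 4-space-indented code blocks in the post body
    has_code = bool(re.search(r"`{3}|^ {4}\S", text, re.MULTILINE))
    words = re.findall(r"[a-z]+", text.lower())
    term_hits = sum(1 for word in words if word in TECH_TERMS)
    return has_code or term_hits >= 3
```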
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## License
This project is licensed under the MIT License.