diff --git a/.gitignore b/.gitignore index b8c3ff8..24f1f0d 100644 --- a/.gitignore +++ b/.gitignore @@ -71,3 +71,5 @@ htmlcov/ coverage.xml *.cover /example-site/* +/cache/* +/results/* diff --git a/README-test-environment.md b/README-test-environment.md index a481bcb..6fcf4a5 100644 --- a/README-test-environment.md +++ b/README-test-environment.md @@ -1,5 +1,7 @@ # Web Scraper Test Environment +[← Back to README](README.md) + This directory contains a complete local test environment for testing the web scraper against a controlled website with a known structure. ## Generated Test Site diff --git a/README.md b/README.md index a63fe4a..f36315b 100644 --- a/README.md +++ b/README.md @@ -3,84 +3,19 @@ [![Python Tests](https://github.com/johnburbridge/scraper/actions/workflows/python-package.yml/badge.svg)](https://github.com/johnburbridge/scraper/actions/workflows/python-package.yml) [![Coverage](https://codecov.io/gh/johnburbridge/scraper/branch/main/graph/badge.svg)](https://codecov.io/gh/johnburbridge/scraper) -## Objectives -* Given a URL, recursively crawl its links - * Store the response - * Parse the response extracting new links - * Visit each link and repeat the operations above -* Cache the results to avoid duplicative requests -* Optionally, specify the maximum recursion depth -* Optionally, specify whether to allow requests to other subdomains or domains -* Optimize the process to leverage all available processors +A flexible web crawler that recursively crawls websites, respects robots.txt, and provides various output options. -## Design +## Documentation -### 1. Architecture Components +- [Project Overview and Features](docs/project.md) +- [Development Guide](docs/develop.md) +- [Test Environment Documentation](README-test-environment.md) -The project will be structured with these core components: - -1. **Crawler** - Main component that orchestrates the crawling process -2. **RequestHandler** - Handles HTTP requests with proper headers, retries, and timeouts -3. **ResponseParser** - Parses HTML responses to extract links -4. **Cache** - Stores visited URLs and their responses -5. **LinkFilter** - Filters links based on domain/subdomain rules -6. **TaskManager** - Manages parallel execution of crawling tasks - -### 2. Caching Strategy - -For the caching requirement: - -- **In-memory cache**: Fast but limited by available RAM -- **File-based cache**: Persistent but slower -- **Database cache**: Structured and persistent, but requires setup - -We'll start with a simple in-memory cache using Python's built-in `dict` for development, then expand to a persistent solution like SQLite for production use. - -### 3. Concurrency Model - -For optimizing to leverage all available processors: - -- **Threading**: Good for I/O bound operations like web requests -- **Multiprocessing**: Better for CPU-bound tasks -- **Async I/O**: Excellent for many concurrent I/O operations - -We'll use `asyncio` with `aiohttp` for making concurrent requests, as web scraping is primarily I/O bound. - -### 4. URL Handling and Filtering - -For domain/subdomain filtering: -- Use `urllib.parse` to extract and compare domains -- Implement a configurable rule system (allow/deny lists) -- Handle relative URLs properly by converting them to absolute - -### 5. Depth Management - -For recursion depth: -- Track depth as a parameter passed to each recursive call -- Implement a max depth check before proceeding with crawling -- Consider breadth-first vs. depth-first strategies - -### 6. 
Error Handling & Politeness - -Additional considerations: -- Robust error handling for network issues and malformed HTML -- Rate limiting to avoid overwhelming servers -- Respect for `robots.txt` rules -- User-agent identification - -### 7. Data Storage - -For storing the crawled data: -- Define a clear structure for storing URLs and their associated content -- Consider what metadata to keep (status code, headers, timestamps) - -## User Guide - -### Installation +## Installation 1. Clone the repository: ```bash -git clone https://github.com/your-username/scraper.git +git clone https://github.com/johnburbridge/scraper.git cd scraper ``` @@ -95,7 +30,7 @@ source venv/bin/activate # On Windows: venv\Scripts\activate pip install -r requirements.txt ``` -### Basic Usage +## Basic Usage To start crawling a website: @@ -105,7 +40,7 @@ python main.py https://example.com This will crawl the website with default settings (depth of 3, respecting robots.txt, not following external links). -### Command Line Options +## Command Line Options The scraper supports the following command-line arguments: @@ -128,7 +63,7 @@ The scraper supports the following command-line arguments: | `--max-subsitemaps MAX_SUBSITEMAPS` | Maximum number of sub-sitemaps to process (default: 5) | | `--sitemap-timeout SITEMAP_TIMEOUT` | Timeout in seconds for sitemap processing (default: 30) | -### Examples +## Examples #### Crawl with a specific depth limit: ```bash @@ -159,74 +94,3 @@ python main.py https://example.com --depth 4 --concurrency 20 --ignore-robots ```bash python main.py https://example.com --delay 1.0 ``` - -## Testing - -The project includes a local testing environment based on Docker that generates a controlled website structure for development and testing purposes. - -### Test Environment Features - -- 400+ HTML pages in a hierarchical structure -- Maximum depth of 5 levels -- Navigation links between pages at different levels -- Proper `robots.txt` and `sitemap.xml` files -- Random metadata on pages for testing extraction - -### Setting Up the Test Environment - -1. Make sure Docker and Docker Compose are installed and running. - -2. Generate the test site (if not already done): -```bash -./venv/bin/python generate_test_site.py -``` - -3. Start the Nginx server: -```bash -docker-compose up -d -``` - -4. The test site will be available at http://localhost:8080 - -### Running Tests Against the Test Environment - -#### Basic crawl: -```bash -python main.py http://localhost:8080 --depth 2 -``` - -#### Test with sitemap parsing: -```bash -python main.py http://localhost:8080 --use-sitemap -``` - -#### Test robots.txt handling: -```bash -# Default behavior respects robots.txt -python main.py http://localhost:8080 --depth 4 - -# Ignore robots.txt to crawl all pages -python main.py http://localhost:8080 --depth 4 --ignore-robots -``` - -#### Save the crawled results: -```bash -python main.py http://localhost:8080 --output-dir test_results -``` - -### Stopping the Test Environment - -To stop the Docker container: -```bash -docker-compose down -``` - -### Regenerating the Test Site - -If you need to regenerate the test site with different characteristics, modify the configuration variables at the top of the `generate_test_site.py` file and run: - -```bash -./venv/bin/python generate_test_site.py -``` - -For more details on the test environment, see the [README-test-environment.md](README-test-environment.md) file. 
diff --git a/docs/develop.md b/docs/develop.md new file mode 100644 index 0000000..842eed5 --- /dev/null +++ b/docs/develop.md @@ -0,0 +1,185 @@ +# Development Guide + +[← Back to README](../README.md) + +This guide provides instructions for setting up a development environment, running tests, and contributing to the scraper project. + +## Setting Up a Development Environment + +### Prerequisites + +- Python 3.11 or higher +- Docker and Docker Compose (for integration testing) +- Git + +### Initial Setup + +1. Clone the repository: +```bash +git clone https://github.com/johnburbridge/scraper.git +cd scraper +``` + +2. Create and activate a virtual environment: +```bash +python -m venv venv +source venv/bin/activate # On Windows: venv\Scripts\activate +``` + +3. Install development dependencies: +```bash +pip install -r requirements-dev.txt +pip install -r requirements.txt +``` + +## Running Tests + +### Unit Tests + +To run all unit tests: +```bash +pytest +``` + +To run tests with coverage reporting: +```bash +pytest --cov=scraper --cov-report=term-missing +``` + +To run a specific test file: +```bash +pytest tests/test_crawler.py +``` + +### Integration Tests + +The project includes a Docker-based test environment that generates a controlled website for testing. + +1. Generate the test site: +```bash +python generate_test_site.py +``` + +2. Start the test environment: +```bash +docker-compose up -d +``` + +3. Run the scraper against the test site: +```bash +python main.py http://localhost:8080 --depth 2 +``` + +4. Stop the test environment when done: +```bash +docker-compose down +``` + +### Alternative Test Server + +If Docker is unavailable, you can use the Python-based test server: + +```bash +python serve_test_site.py +``` + +This will start a local HTTP server on port 8080 serving the same test site. + +## Code Quality Tools + +### Linting + +To check code quality with flake8: +```bash +flake8 scraper tests +``` + +### Type Checking + +To run type checking with mypy: +```bash +mypy scraper +``` + +### Code Formatting + +To format code with black: +```bash +black scraper tests +``` + +## Debugging + +### Verbose Output + +To enable verbose logging: +```bash +python main.py https://example.com -v +``` + +### Profiling + +To profile the crawler's performance: +```bash +python -m cProfile -o crawler.prof main.py https://example.com --depth 1 +python -c "import pstats; p = pstats.Stats('crawler.prof'); p.sort_stats('cumtime').print_stats(30)" +``` + +## Test Coverage + +Current test coverage is monitored through CI and displayed as a badge in the README. To increase coverage: + +1. Check current coverage gaps: +```bash +pytest --cov=scraper --cov-report=term-missing +``` + +2. Target untested functions or code paths with new tests +3. 
Verify coverage improvement after adding tests + +## Project Structure + +``` +scraper/ # Main package directory +├── __init__.py # Package initialization +├── cache_manager.py # Cache implementation +├── callbacks.py # Callback functions for crawled pages +├── crawler.py # Main crawler class +├── request_handler.py # HTTP request/response handling +├── response_parser.py # HTML parsing and link extraction +├── robots_parser.py # robots.txt parsing and checking +└── sitemap_parser.py # sitemap.xml parsing + +tests/ # Test suite +├── __init__.py +├── conftest.py # pytest fixtures +├── test_cache.py # Tests for cache_manager.py +├── test_crawler.py # Tests for crawler.py +├── test_request_handler.py +├── test_response_parser.py +├── test_robots_parser.py +└── test_sitemap_parser.py + +docs/ # Documentation +├── project.md # Project overview and features +└── develop.md # Development guide + +.github/workflows/ # CI configuration +``` + +## Contributing + +### Pull Request Process + +1. Create a new branch for your feature or bugfix +2. Implement your changes with appropriate tests +3. Ensure all tests pass and coverage doesn't decrease +4. Submit a pull request with a clear description of the changes + +### Coding Standards + +- Follow PEP 8 style guidelines +- Include docstrings for all functions, classes, and modules +- Add type hints to function signatures +- Keep functions focused on a single responsibility +- Write tests for all new functionality \ No newline at end of file diff --git a/docs/project.md b/docs/project.md new file mode 100644 index 0000000..e362443 --- /dev/null +++ b/docs/project.md @@ -0,0 +1,76 @@ +# Scraper Project Overview + +[← Back to README](../README.md) + +## Objectives +* Given a URL, recursively crawl its links + * Store the response + * Parse the response extracting new links + * Visit each link and repeat the operations above +* Cache the results to avoid duplicative requests +* Optionally, specify the maximum recursion depth +* Optionally, specify whether to allow requests to other subdomains or domains +* Optimize the process to leverage all available processors + +## Architecture Components + +The project is structured with these core components: + +1. **Crawler** - Main component that orchestrates the crawling process +2. **RequestHandler** - Handles HTTP requests with proper headers, retries, and timeouts +3. **ResponseParser** - Parses HTML responses to extract links +4. **Cache** - Stores visited URLs and their responses +5. **LinkFilter** - Filters links based on domain/subdomain rules +6. **SitemapParser** - Extracts URLs from site's sitemap.xml +7. **RobotsParser** - Interprets and follows robots.txt directives +8. **Callbacks** - Processes crawled pages (console output, JSON files, etc.) 
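+
+As a rough illustration of how these pieces fit together, the sketch below walks one crawl step through the main components. It is only a sketch: the class and method names (`fetch`, `get`, `store`, `extract_links`, `allows`) are assumptions for illustration, not the project's actual API.
+
+```python
+class CrawlStep:
+    """Minimal sketch of one crawl iteration (hypothetical component API)."""
+
+    def __init__(self, request_handler, response_parser, cache, link_filter):
+        self.request_handler = request_handler  # issues HTTP requests with retries/timeouts
+        self.response_parser = response_parser  # extracts links from HTML
+        self.cache = cache                      # stores visited URLs and their responses
+        self.link_filter = link_filter          # applies domain/subdomain rules
+
+    def crawl_once(self, url: str) -> list[str]:
+        """Fetch one URL (via the cache when possible) and return the links to visit next."""
+        response = self.cache.get(url)
+        if response is None:
+            response = self.request_handler.fetch(url)
+            self.cache.store(url, response)
+        links = self.response_parser.extract_links(response, base_url=url)
+        return [link for link in links if self.link_filter.allows(link)]
+```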
+
+## Caching Strategy
+
+The scraper implements a persistent SQLite-based cache:
+
+- **Schema**: Stores URLs, content, headers, status codes, and timestamps
+- **Expiry**: Configurable TTL for cache entries
+- **Performance**: Fast lookups and efficient storage
+- **Disk-based**: Persists between runs for incremental crawling
+
+## Concurrency Model
+
+To leverage all available processors:
+
+- **Async I/O**: Uses `asyncio` for many concurrent I/O operations
+- **Task Management**: Dynamic task creation and limiting
+- **Rate Limiting**: Configurable delay between requests
+- **Resource Control**: Respects system limitations
+
+## URL Handling and Filtering
+
+For domain/subdomain filtering:
+- **Domain Isolation**: Restricts crawling to the target domain by default
+- **Subdomain Control**: Configurable inclusion/exclusion of subdomains
+- **URL Normalization**: Resolves relative URLs to absolute paths
+- **URL Filtering**: Skips binary files and unwanted file types
+
+## Depth Management
+
+For recursion depth:
+- **Level Tracking**: Maintains the depth of each page in the crawl
+- **Depth Limiting**: Stops at the configured maximum depth
+- **Breadth-First Approach**: Ensures thorough coverage at each level
+
+## Politeness Features
+
+Implements web crawler etiquette:
+- **Robots.txt Support**: Respects website crawler policies
+- **Rate Limiting**: Configurable delay between requests
+- **Proper User-Agent**: Identifies itself appropriately
+- **Sitemap Usage**: Can use the site's sitemap.xml for discovery
+- **Error Handling**: Backs off on server errors
+
+## Data Processing
+
+Flexible options for handling crawled data:
+- **Console Output**: Displays crawled pages in the terminal
+- **JSON Storage**: Saves pages as structured data files
+- **Custom Callbacks**: Extensible system for custom processing
+- **Statistics**: Provides detailed crawl statistics
\ No newline at end of file
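+
+For example, a custom callback in the spirit of the JSON storage option could look roughly like the sketch below. The signature and metadata keys are assumptions for illustration; the actual interface lives in `scraper/callbacks.py`.
+
+```python
+import json
+from pathlib import Path
+
+
+def record_page(url: str, html: str, metadata: dict) -> None:
+    """Hypothetical callback: append one line of JSON per crawled page."""
+    out = Path("results") / "crawl_log.jsonl"
+    out.parent.mkdir(parents=True, exist_ok=True)
+    record = {
+        "url": url,
+        "status": metadata.get("status_code"),  # assumed metadata key
+        "content_length": len(html),
+    }
+    with out.open("a", encoding="utf-8") as fh:
+        fh.write(json.dumps(record) + "\n")
+```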