2 changes: 2 additions & 0 deletions .gitignore
@@ -71,3 +71,5 @@ htmlcov/
coverage.xml
*.cover
/example-site/*
/cache/*
/results/*
2 changes: 2 additions & 0 deletions README-test-environment.md
@@ -1,5 +1,7 @@
# Web Scraper Test Environment

[← Back to README](README.md)

This directory contains a complete local test environment for testing the web scraper against a controlled website with a known structure.

## Generated Test Site
156 changes: 10 additions & 146 deletions README.md
@@ -3,84 +3,19 @@
[![Python Tests](https://github.com/johnburbridge/scraper/actions/workflows/python-package.yml/badge.svg)](https://github.com/johnburbridge/scraper/actions/workflows/python-package.yml)
[![Coverage](https://codecov.io/gh/johnburbridge/scraper/branch/main/graph/badge.svg)](https://codecov.io/gh/johnburbridge/scraper)

## Objectives
* Given a URL, recursively crawl its links
* Store the response
* Parse the response extracting new links
* Visit each link and repeat the operations above
* Cache the results to avoid duplicative requests
* Optionally, specify the maximum recursion depth
* Optionally, specify whether to allow requests to other subdomains or domains
* Optimize the process to leverage all available processors
A flexible web crawler that recursively crawls websites, respects robots.txt, and provides various output options.

## Design
## Documentation

### 1. Architecture Components
- [Project Overview and Features](docs/project.md)
- [Development Guide](docs/develop.md)
- [Test Environment Documentation](README-test-environment.md)

The project will be structured with these core components (a rough sketch of how they might fit together follows the list):

1. **Crawler** - Main component that orchestrates the crawling process
2. **RequestHandler** - Handles HTTP requests with proper headers, retries, and timeouts
3. **ResponseParser** - Parses HTML responses to extract links
4. **Cache** - Stores visited URLs and their responses
5. **LinkFilter** - Filters links based on domain/subdomain rules
6. **TaskManager** - Manages parallel execution of crawling tasks
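
As a rough illustration of how these components might interact (class and method names here are hypothetical, not the project's actual API):

```python
class Crawler:
    """Hypothetical sketch of the orchestration flow; real interfaces may differ."""

    def __init__(self, request_handler, response_parser, cache, link_filter, task_manager):
        self.request_handler = request_handler
        self.response_parser = response_parser
        self.cache = cache
        self.link_filter = link_filter
        self.task_manager = task_manager

    def crawl(self, url: str, depth: int = 0, max_depth: int = 3) -> None:
        if depth > max_depth or self.cache.contains(url):
            return
        response = self.request_handler.fetch(url)   # HTTP request with retries/timeouts
        self.cache.store(url, response)              # remember the visited URL and body
        for link in self.response_parser.extract_links(url, response):
            if self.link_filter.is_allowed(link):
                self.task_manager.submit(self.crawl, link, depth + 1, max_depth)
```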

### 2. Caching Strategy

For the caching requirement:

- **In-memory cache**: Fast but limited by available RAM
- **File-based cache**: Persistent but slower
- **Database cache**: Structured and persistent, but requires setup

We'll start with a simple in-memory cache using Python's built-in `dict` for development, then expand to a persistent solution like SQLite for production use.
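
A minimal version of that starting point might look like the following (illustrative only; the project's `cache_manager.py` may differ):

```python
class InMemoryCache:
    """Stores visited URLs and their responses in a plain dict."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def contains(self, url: str) -> bool:
        return url in self._store

    def store(self, url: str, response_body: str) -> None:
        self._store[url] = response_body

    def get(self, url: str) -> str | None:
        return self._store.get(url)
```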

### 3. Concurrency Model

To make full use of the available processors:

- **Threading**: Good for I/O bound operations like web requests
- **Multiprocessing**: Better for CPU-bound tasks
- **Async I/O**: Excellent for many concurrent I/O operations

We'll use `asyncio` with `aiohttp` for making concurrent requests, as web scraping is primarily I/O bound.
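
For instance, a minimal concurrent fetcher built on `asyncio` and `aiohttp` could look like this (a sketch, not the project's actual request handler):

```python
import asyncio

import aiohttp


async def fetch(session: aiohttp.ClientSession, url: str, sem: asyncio.Semaphore) -> str:
    async with sem:  # cap the number of in-flight requests
        async with session.get(url) as resp:
            return await resp.text()


async def fetch_all(urls: list[str], concurrency: int = 10) -> list[str]:
    sem = asyncio.Semaphore(concurrency)
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        return await asyncio.gather(*(fetch(session, u, sem) for u in urls))


# Example: asyncio.run(fetch_all(["https://example.com"]))
```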

### 4. URL Handling and Filtering

For domain/subdomain filtering (a small standard-library sketch follows the list):
- Use `urllib.parse` to extract and compare domains
- Implement a configurable rule system (allow/deny lists)
- Handle relative URLs properly by converting them to absolute
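
A sketch of this approach using only the standard library (function names are illustrative):

```python
from urllib.parse import urljoin, urlparse

def to_absolute(base_url: str, href: str) -> str:
    """Resolve a possibly relative link against the page it was found on."""
    return urljoin(base_url, href)

def same_site(url: str, other: str, allow_subdomains: bool = False) -> bool:
    """Compare hostnames, optionally treating subdomains as the same site."""
    a = urlparse(url).hostname or ""
    b = urlparse(other).hostname or ""
    if allow_subdomains:
        return a == b or a.endswith("." + b) or b.endswith("." + a)
    return a == b

# same_site("https://blog.example.com/post", "https://example.com", allow_subdomains=True) -> True
```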

### 5. Depth Management

For recursion depth (a breadth-first sketch follows the list):
- Track depth as a parameter passed to each recursive call
- Implement a max depth check before proceeding with crawling
- Consider breadth-first vs. depth-first strategies
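
For example, a breadth-first variant with an explicit depth check (illustrative; `fetch` and `extract_links` are assumed helpers):

```python
from collections import deque

def crawl_bfs(start_url: str, max_depth: int, fetch, extract_links) -> dict[str, str]:
    """Breadth-first crawl up to max_depth; `fetch` and `extract_links` are assumed helpers."""
    pages: dict[str, str] = {}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in pages or depth > max_depth:
            continue  # skip already-visited URLs and anything beyond the depth limit
        body = fetch(url)
        pages[url] = body
        for link in extract_links(url, body):
            queue.append((link, depth + 1))
    return pages
```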

### 6. Error Handling & Politeness

Additional considerations (a short politeness sketch follows the list):
- Robust error handling for network issues and malformed HTML
- Rate limiting to avoid overwhelming servers
- Respect for `robots.txt` rules
- User-agent identification
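
As an illustration, robots.txt checks and a simple delay can be layered in with the standard library (a sketch; the project's `robots_parser.py` may work differently):

```python
import time
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

USER_AGENT = "scraper-bot/0.1"  # hypothetical user-agent string

def make_robots_checker(base_url: str) -> RobotFileParser:
    rp = RobotFileParser(urljoin(base_url, "/robots.txt"))
    rp.read()  # download and parse robots.txt
    return rp

def polite_fetch(rp: RobotFileParser, url: str, fetch, delay: float = 0.5):
    """Fetch `url` only if robots.txt allows it, then pause before the next request."""
    if not rp.can_fetch(USER_AGENT, url):
        return None
    body = fetch(url)   # `fetch` is an assumed helper that performs the HTTP request
    time.sleep(delay)   # crude rate limiting
    return body
```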

### 7. Data Storage

For storing the crawled data (one possible record shape is sketched after the list):
- Define a clear structure for storing URLs and their associated content
- Consider what metadata to keep (status code, headers, timestamps)
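
One possible record shape, sketched with a dataclass (field names are illustrative, not the project's actual schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CrawlRecord:
    """One crawled page plus the metadata worth keeping."""
    url: str
    status_code: int
    body: str
    headers: dict[str, str] = field(default_factory=dict)
    fetched_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```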

## User Guide

### Installation
## Installation

1. Clone the repository:
```bash
git clone https://github.com/your-username/scraper.git
git clone https://github.com/johnburbridge/scraper.git
cd scraper
```

@@ -95,7 +30,7 @@ source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

### Basic Usage
## Basic Usage

To start crawling a website:

@@ -105,7 +40,7 @@ python main.py https://example.com

This will crawl the website with default settings (depth of 3, respecting robots.txt, not following external links).

### Command Line Options
## Command Line Options

The scraper supports the following command-line arguments:

@@ -128,7 +63,7 @@ The scraper supports the following command-line arguments:
| `--max-subsitemaps MAX_SUBSITEMAPS` | Maximum number of sub-sitemaps to process (default: 5) |
| `--sitemap-timeout SITEMAP_TIMEOUT` | Timeout in seconds for sitemap processing (default: 30) |

### Examples
## Examples

#### Crawl with a specific depth limit:
```bash
@@ -159,74 +94,3 @@ python main.py https://example.com --depth 4 --concurrency 20 --ignore-robots
```bash
python main.py https://example.com --delay 1.0
```

## Testing

The project includes a Docker-based local test environment that serves a generated website with a known structure for development and testing.

### Test Environment Features

- 400+ HTML pages in a hierarchical structure
- Maximum depth of 5 levels
- Navigation links between pages at different levels
- Proper `robots.txt` and `sitemap.xml` files
- Random metadata on pages for testing extraction

### Setting Up the Test Environment

1. Make sure Docker and Docker Compose are installed and running.

2. Generate the test site (if not already done):
```bash
./venv/bin/python generate_test_site.py
```

3. Start the Nginx server:
```bash
docker-compose up -d
```

4. The test site will be available at http://localhost:8080

### Running Tests Against the Test Environment

#### Basic crawl:
```bash
python main.py http://localhost:8080 --depth 2
```

#### Test with sitemap parsing:
```bash
python main.py http://localhost:8080 --use-sitemap
```

#### Test robots.txt handling:
```bash
# Default behavior respects robots.txt
python main.py http://localhost:8080 --depth 4

# Ignore robots.txt to crawl all pages
python main.py http://localhost:8080 --depth 4 --ignore-robots
```

#### Save the crawled results:
```bash
python main.py http://localhost:8080 --output-dir test_results
```

### Stopping the Test Environment

To stop the Docker container:
```bash
docker-compose down
```

### Regenerating the Test Site

If you need to regenerate the test site with different characteristics, modify the configuration variables at the top of the `generate_test_site.py` file and run:

```bash
./venv/bin/python generate_test_site.py
```

For more details on the test environment, see the [README-test-environment.md](README-test-environment.md) file.
185 changes: 185 additions & 0 deletions docs/develop.md
@@ -0,0 +1,185 @@
# Development Guide

[← Back to README](../README.md)

This guide provides instructions for setting up a development environment, running tests, and contributing to the scraper project.

## Setting Up a Development Environment

### Prerequisites

- Python 3.11 or higher
- Docker and Docker Compose (for integration testing)
- Git

### Initial Setup

1. Clone the repository:
```bash
git clone https://github.com/johnburbridge/scraper.git
cd scraper
```

2. Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```

3. Install development dependencies:
```bash
pip install -r requirements-dev.txt
pip install -r requirements.txt
```

## Running Tests

### Unit Tests

To run all unit tests:
```bash
pytest
```

To run tests with coverage reporting:
```bash
pytest --cov=scraper --cov-report=term-missing
```

To run a specific test file:
```bash
pytest tests/test_crawler.py
```

### Integration Tests

The project includes a Docker-based test environment that generates a controlled website for testing.

1. Generate the test site:
```bash
python generate_test_site.py
```

2. Start the test environment:
```bash
docker-compose up -d
```

3. Run the scraper against the test site:
```bash
python main.py http://localhost:8080 --depth 2
```

4. Stop the test environment when done:
```bash
docker-compose down
```

### Alternative Test Server

If Docker is unavailable, you can use the Python-based test server:

```bash
python serve_test_site.py
```

This will start a local HTTP server on port 8080 serving the same test site.

## Code Quality Tools

### Linting

To check code quality with flake8:
```bash
flake8 scraper tests
```

### Type Checking

To run type checking with mypy:
```bash
mypy scraper
```

### Code Formatting

To format code with black:
```bash
black scraper tests
```

## Debugging

### Verbose Output

To enable verbose logging:
```bash
python main.py https://example.com -v
```

### Profiling

To profile the crawler's performance:
```bash
python -m cProfile -o crawler.prof main.py https://example.com --depth 1
python -c "import pstats; p = pstats.Stats('crawler.prof'); p.sort_stats('cumtime').print_stats(30)"
```

## Test Coverage

Current test coverage is monitored through CI and displayed as a badge in the README. To increase coverage:

1. Check current coverage gaps:
```bash
pytest --cov=scraper --cov-report=term-missing
```

2. Target untested functions or code paths with new tests
3. Verify coverage improvement after adding tests

## Project Structure

```
scraper/ # Main package directory
├── __init__.py # Package initialization
├── cache_manager.py # Cache implementation
├── callbacks.py # Callback functions for crawled pages
├── crawler.py # Main crawler class
├── request_handler.py # HTTP request/response handling
├── response_parser.py # HTML parsing and link extraction
├── robots_parser.py # robots.txt parsing and checking
└── sitemap_parser.py # sitemap.xml parsing

tests/ # Test suite
├── __init__.py
├── conftest.py # pytest fixtures
├── test_cache.py # Tests for cache_manager.py
├── test_crawler.py # Tests for crawler.py
├── test_request_handler.py
├── test_response_parser.py
├── test_robots_parser.py
└── test_sitemap_parser.py

docs/ # Documentation
├── project.md # Project overview and features
└── develop.md # Development guide

.github/workflows/ # CI configuration
```

## Contributing

### Pull Request Process

1. Create a new branch for your feature or bugfix
2. Implement your changes with appropriate tests
3. Ensure all tests pass and coverage doesn't decrease
4. Submit a pull request with a clear description of the changes

### Coding Standards

- Follow PEP 8 style guidelines
- Include docstrings for all functions, classes, and modules (see the short example after this list)
- Add type hints to function signatures
- Keep functions focused on a single responsibility
- Write tests for all new functionality
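
A short function in the expected style, purely as an illustration (the function itself is hypothetical, not part of the codebase):

```python
from urllib.parse import urldefrag


def normalize_url(url: str, strip_fragment: bool = True) -> str:
    """Return a canonical form of `url` for caching and comparison.

    Args:
        url: The URL to normalize.
        strip_fragment: Whether to drop the ``#fragment`` part.
    """
    normalized = url.strip()
    if strip_fragment:
        normalized, _ = urldefrag(normalized)
    return normalized.rstrip("/")
```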