2 changes: 2 additions & 0 deletions .gitignore
@@ -71,3 +71,5 @@ htmlcov/
coverage.xml
*.cover
/example-site/*
/cache/*
/results/*
2 changes: 2 additions & 0 deletions README-test-environment.md
@@ -1,5 +1,7 @@
# Web Scraper Test Environment

[← Back to README](README.md)

This directory contains a complete local test environment for testing the web scraper against a controlled website with a known structure.

## Generated Test Site
156 changes: 10 additions & 146 deletions README.md
@@ -3,84 +3,19 @@
[![Python Tests](https://github.com/johnburbridge/scraper/actions/workflows/python-package.yml/badge.svg)](https://github.com/johnburbridge/scraper/actions/workflows/python-package.yml)
[![Coverage](https://codecov.io/gh/johnburbridge/scraper/branch/main/graph/badge.svg)](https://codecov.io/gh/johnburbridge/scraper)

## Objectives
* Given a URL, recursively crawl its links
* Store the response
* Parse the response extracting new links
* Visit each link and repeat the operations above
* Cache the results to avoid duplicative requests
* Optionally, specify the maximum recursion depth
* Optionally, specify whether to allow requests to other subdomains or domains
* Optimize the process to leverage all available processors
A flexible web crawler that recursively crawls websites, respects robots.txt, and provides various output options.

## Design
## Documentation

### 1. Architecture Components
- [Project Overview and Features](docs/project.md)
- [Development Guide](docs/develop.md)
- [Test Environment Documentation](README-test-environment.md)

The project will be structured with these core components (a rough sketch of how they might fit together follows the list):

1. **Crawler** - Main component that orchestrates the crawling process
2. **RequestHandler** - Handles HTTP requests with proper headers, retries, and timeouts
3. **ResponseParser** - Parses HTML responses to extract links
4. **Cache** - Stores visited URLs and their responses
5. **LinkFilter** - Filters links based on domain/subdomain rules
6. **TaskManager** - Manages parallel execution of crawling tasks
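
As a rough illustration of how these components might interact (class and method names here are hypothetical, not the project's actual API):

```python
class Crawler:
    """Hypothetical sketch of the orchestration flow; real interfaces may differ."""

    def __init__(self, request_handler, response_parser, cache, link_filter, task_manager):
        self.request_handler = request_handler
        self.response_parser = response_parser
        self.cache = cache
        self.link_filter = link_filter
        self.task_manager = task_manager

    def crawl(self, url: str, depth: int = 0, max_depth: int = 3) -> None:
        if depth > max_depth or self.cache.contains(url):
            return
        response = self.request_handler.fetch(url)   # HTTP request with retries/timeouts
        self.cache.store(url, response)              # remember the visited URL and body
        for link in self.response_parser.extract_links(url, response):
            if self.link_filter.is_allowed(link):
                self.task_manager.submit(self.crawl, link, depth + 1, max_depth)
```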

### 2. Caching Strategy

For the caching requirement:

- **In-memory cache**: Fast but limited by available RAM
- **File-based cache**: Persistent but slower
- **Database cache**: Structured and persistent, but requires setup

We'll start with a simple in-memory cache using Python's built-in `dict` for development, then expand to a persistent solution like SQLite for production use.
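
A minimal version of that starting point might look like the following (illustrative only; the project's `cache_manager.py` may differ):

```python
class InMemoryCache:
    """Stores visited URLs and their responses in a plain dict."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def contains(self, url: str) -> bool:
        return url in self._store

    def store(self, url: str, response_body: str) -> None:
        self._store[url] = response_body

    def get(self, url: str) -> str | None:
        return self._store.get(url)
```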

### 3. Concurrency Model

To make full use of the available processors:

- **Threading**: Good for I/O bound operations like web requests
- **Multiprocessing**: Better for CPU-bound tasks
- **Async I/O**: Excellent for many concurrent I/O operations

We'll use `asyncio` with `aiohttp` for making concurrent requests, as web scraping is primarily I/O bound.
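
For instance, a minimal concurrent fetcher built on `asyncio` and `aiohttp` could look like this (a sketch, not the project's actual request handler):

```python
import asyncio

import aiohttp


async def fetch(session: aiohttp.ClientSession, url: str, sem: asyncio.Semaphore) -> str:
    async with sem:  # cap the number of in-flight requests
        async with session.get(url) as resp:
            return await resp.text()


async def fetch_all(urls: list[str], concurrency: int = 10) -> list[str]:
    sem = asyncio.Semaphore(concurrency)
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        return await asyncio.gather(*(fetch(session, u, sem) for u in urls))


# Example: asyncio.run(fetch_all(["https://example.com"]))
```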

### 4. URL Handling and Filtering

For domain/subdomain filtering (a small standard-library sketch follows the list):
- Use `urllib.parse` to extract and compare domains
- Implement a configurable rule system (allow/deny lists)
- Handle relative URLs properly by converting them to absolute
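
A sketch of this approach using only the standard library (function names are illustrative):

```python
from urllib.parse import urljoin, urlparse

def to_absolute(base_url: str, href: str) -> str:
    """Resolve a possibly relative link against the page it was found on."""
    return urljoin(base_url, href)

def same_site(url: str, other: str, allow_subdomains: bool = False) -> bool:
    """Compare hostnames, optionally treating subdomains as the same site."""
    a = urlparse(url).hostname or ""
    b = urlparse(other).hostname or ""
    if allow_subdomains:
        return a == b or a.endswith("." + b) or b.endswith("." + a)
    return a == b

# same_site("https://blog.example.com/post", "https://example.com", allow_subdomains=True) -> True
```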

### 5. Depth Management

For recursion depth (a breadth-first sketch follows the list):
- Track depth as a parameter passed to each recursive call
- Implement a max depth check before proceeding with crawling
- Consider breadth-first vs. depth-first strategies
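
For example, a breadth-first variant with an explicit depth check (illustrative; `fetch` and `extract_links` are assumed helpers):

```python
from collections import deque

def crawl_bfs(start_url: str, max_depth: int, fetch, extract_links) -> dict[str, str]:
    """Breadth-first crawl up to max_depth; `fetch` and `extract_links` are assumed helpers."""
    pages: dict[str, str] = {}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in pages or depth > max_depth:
            continue  # skip already-visited URLs and anything beyond the depth limit
        body = fetch(url)
        pages[url] = body
        for link in extract_links(url, body):
            queue.append((link, depth + 1))
    return pages
```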

### 6. Error Handling & Politeness

Additional considerations (a short politeness sketch follows the list):
- Robust error handling for network issues and malformed HTML
- Rate limiting to avoid overwhelming servers
- Respect for `robots.txt` rules
- User-agent identification
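
As an illustration, robots.txt checks and a simple delay can be layered in with the standard library (a sketch; the project's `robots_parser.py` may work differently):

```python
import time
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

USER_AGENT = "scraper-bot/0.1"  # hypothetical user-agent string

def make_robots_checker(base_url: str) -> RobotFileParser:
    rp = RobotFileParser(urljoin(base_url, "/robots.txt"))
    rp.read()  # download and parse robots.txt
    return rp

def polite_fetch(rp: RobotFileParser, url: str, fetch, delay: float = 0.5):
    """Fetch `url` only if robots.txt allows it, then pause before the next request."""
    if not rp.can_fetch(USER_AGENT, url):
        return None
    body = fetch(url)   # `fetch` is an assumed helper that performs the HTTP request
    time.sleep(delay)   # crude rate limiting
    return body
```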

### 7. Data Storage

For storing the crawled data (one possible record shape is sketched after the list):
- Define a clear structure for storing URLs and their associated content
- Consider what metadata to keep (status code, headers, timestamps)
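
One possible record shape, sketched with a dataclass (field names are illustrative, not the project's actual schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CrawlRecord:
    """One crawled page plus the metadata worth keeping."""
    url: str
    status_code: int
    body: str
    headers: dict[str, str] = field(default_factory=dict)
    fetched_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```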

## User Guide

### Installation
## Installation

1. Clone the repository:
```bash
git clone https://github.com/your-username/scraper.git
git clone https://github.com/johnburbridge/scraper.git
cd scraper
```

@@ -95,7 +30,7 @@ source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

### Basic Usage
## Basic Usage

To start crawling a website:

@@ -105,7 +40,7 @@ python main.py https://example.com

This will crawl the website with default settings (depth of 3, respecting robots.txt, not following external links).

### Command Line Options
## Command Line Options

The scraper supports the following command-line arguments:

@@ -128,7 +63,7 @@ The scraper supports the following command-line arguments:
| `--max-subsitemaps MAX_SUBSITEMAPS` | Maximum number of sub-sitemaps to process (default: 5) |
| `--sitemap-timeout SITEMAP_TIMEOUT` | Timeout in seconds for sitemap processing (default: 30) |

### Examples
## Examples

#### Crawl with a specific depth limit:
```bash
@@ -159,74 +94,3 @@ python main.py https://example.com --depth 4 --concurrency 20 --ignore-robots
```bash
python main.py https://example.com --delay 1.0
```

## Testing

The project includes a Docker-based local test environment that serves a generated website with a known structure for development and testing.

### Test Environment Features

- 400+ HTML pages in a hierarchical structure
- Maximum depth of 5 levels
- Navigation links between pages at different levels
- Proper `robots.txt` and `sitemap.xml` files
- Random metadata on pages for testing extraction

### Setting Up the Test Environment

1. Make sure Docker and Docker Compose are installed and running.

2. Generate the test site (if not already done):
```bash
./venv/bin/python generate_test_site.py
```

3. Start the Nginx server:
```bash
docker-compose up -d
```

4. The test site will be available at http://localhost:8080

### Running Tests Against the Test Environment

#### Basic crawl:
```bash
python main.py http://localhost:8080 --depth 2
```

#### Test with sitemap parsing:
```bash
python main.py http://localhost:8080 --use-sitemap
```

#### Test robots.txt handling:
```bash
# Default behavior respects robots.txt
python main.py http://localhost:8080 --depth 4

# Ignore robots.txt to crawl all pages
python main.py http://localhost:8080 --depth 4 --ignore-robots
```

#### Save the crawled results:
```bash
python main.py http://localhost:8080 --output-dir test_results
```

### Stopping the Test Environment

To stop the Docker container:
```bash
docker-compose down
```

### Regenerating the Test Site

If you need to regenerate the test site with different characteristics, modify the configuration variables at the top of the `generate_test_site.py` file and run:

```bash
./venv/bin/python generate_test_site.py
```

For more details on the test environment, see the [README-test-environment.md](README-test-environment.md) file.
185 changes: 185 additions & 0 deletions docs/develop.md
@@ -0,0 +1,185 @@
# Development Guide

[← Back to README](../README.md)

This guide provides instructions for setting up a development environment, running tests, and contributing to the scraper project.

## Setting Up a Development Environment

### Prerequisites

- Python 3.11 or higher
- Docker and Docker Compose (for integration testing)
- Git

### Initial Setup

1. Clone the repository:
```bash
git clone https://github.com/johnburbridge/scraper.git
cd scraper
```

2. Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```

3. Install development dependencies:
```bash
pip install -r requirements-dev.txt
pip install -r requirements.txt
```

## Running Tests

### Unit Tests

To run all unit tests:
```bash
pytest
```

To run tests with coverage reporting:
```bash
pytest --cov=scraper --cov-report=term-missing
```

To run a specific test file:
```bash
pytest tests/test_crawler.py
```

### Integration Tests

The project includes a Docker-based test environment that generates a controlled website for testing.

1. Generate the test site:
```bash
python generate_test_site.py
```

2. Start the test environment:
```bash
docker-compose up -d
```

3. Run the scraper against the test site:
```bash
python main.py http://localhost:8080 --depth 2
```

4. Stop the test environment when done:
```bash
docker-compose down
```

### Alternative Test Server

If Docker is unavailable, you can use the Python-based test server:

```bash
python serve_test_site.py
```

This will start a local HTTP server on port 8080 serving the same test site.

## Code Quality Tools

### Linting

To check code quality with flake8:
```bash
flake8 scraper tests
```

### Type Checking

To run type checking with mypy:
```bash
mypy scraper
```

### Code Formatting

To format code with black:
```bash
black scraper tests
```

## Debugging

### Verbose Output

To enable verbose logging:
```bash
python main.py https://example.com -v
```

### Profiling

To profile the crawler's performance:
```bash
python -m cProfile -o crawler.prof main.py https://example.com --depth 1
python -c "import pstats; p = pstats.Stats('crawler.prof'); p.sort_stats('cumtime').print_stats(30)"
```

## Test Coverage

Current test coverage is monitored through CI and displayed as a badge in the README. To increase coverage:

1. Check current coverage gaps:
```bash
pytest --cov=scraper --cov-report=term-missing
```

2. Target untested functions or code paths with new tests
3. Verify coverage improvement after adding tests

## Project Structure

```
scraper/ # Main package directory
├── __init__.py # Package initialization
├── cache_manager.py # Cache implementation
├── callbacks.py # Callback functions for crawled pages
├── crawler.py # Main crawler class
├── request_handler.py # HTTP request/response handling
├── response_parser.py # HTML parsing and link extraction
├── robots_parser.py # robots.txt parsing and checking
└── sitemap_parser.py # sitemap.xml parsing

tests/ # Test suite
├── __init__.py
├── conftest.py # pytest fixtures
├── test_cache.py # Tests for cache_manager.py
├── test_crawler.py # Tests for crawler.py
├── test_request_handler.py
├── test_response_parser.py
├── test_robots_parser.py
└── test_sitemap_parser.py

docs/ # Documentation
├── project.md # Project overview and features
└── develop.md # Development guide

.github/workflows/ # CI configuration
```

## Contributing

### Pull Request Process

1. Create a new branch for your feature or bugfix
2. Implement your changes with appropriate tests
3. Ensure all tests pass and coverage doesn't decrease
4. Submit a pull request with a clear description of the changes

### Coding Standards

- Follow PEP 8 style guidelines
- Include docstrings for all functions, classes, and modules (see the short example after this list)
- Add type hints to function signatures
- Keep functions focused on a single responsibility
- Write tests for all new functionality
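
A short function in the expected style, purely as an illustration (the function itself is hypothetical, not part of the codebase):

```python
from urllib.parse import urldefrag


def normalize_url(url: str, strip_fragment: bool = True) -> str:
    """Return a canonical form of `url` for caching and comparison.

    Args:
        url: The URL to normalize.
        strip_fragment: Whether to drop the ``#fragment`` part.
    """
    normalized = url.strip()
    if strip_fragment:
        normalized, _ = urldefrag(normalized)
    return normalized.rstrip("/")
```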