1 change: 1 addition & 0 deletions .gitignore
@@ -70,3 +70,4 @@ htmlcov/
.coverage.*
coverage.xml
*.cover
/example-site/*
72 changes: 72 additions & 0 deletions README-test-environment.md
@@ -0,0 +1,72 @@
# Web Scraper Test Environment

This directory contains a complete local test environment for testing the web scraper against a controlled website with a known structure.

## Generated Test Site

A test website with the following characteristics has been generated:
- 400+ HTML pages in a hierarchical structure
- Maximum depth of 5 levels
- Navigation links between pages at different levels
- Proper `robots.txt` and `sitemap.xml` files
- Random metadata on pages for testing extraction

## Directory Structure

- `example-site/` - Contains all the generated HTML files and resources
  - `index.html` - Homepage
  - `page*.html` - Top-level pages
  - `section*/` - Section directories with their own pages
  - `robots.txt` - Contains crawler directives with some intentionally disallowed pages
  - `sitemap.xml` - XML sitemap with all publicly available pages

- `nginx/` - Contains Nginx configuration
  - `nginx.conf` - Server configuration with directory listing enabled (a minimal sketch follows this list)

- `docker-compose.yml` - Docker Compose configuration for running Nginx

- `generate_test_site.py` - Script that generated the test site
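
Directory listing is a standard Nginx feature (`autoindex`). The actual `nginx/nginx.conf` in this repository may differ in detail, but a minimal configuration along these lines would serve the generated site with listings enabled:

```nginx
server {
    listen 80;
    server_name localhost;

    root /usr/share/nginx/html;
    index index.html;

    # Enable directory listing so the structure is browsable without links
    autoindex on;

    location / {
        try_files $uri $uri/ =404;
    }
}
```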

## Running the Test Environment

1. Make sure Docker and Docker Compose are installed and running
2. Start the Nginx server:
```bash
docker-compose up -d
```
3. The test site will be available at http://localhost:8080
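
Once the container is running, you can quickly confirm that the site and its crawler-related files are being served (assuming `curl` is available):

```bash
# Check the homepage headers and the crawler-related files
curl -I http://localhost:8080/
curl http://localhost:8080/robots.txt
curl http://localhost:8080/sitemap.xml | head
```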

## Testing the Scraper

You can test your scraper against this environment with:

```bash
python main.py http://localhost:8080 --depth 3
```

Additional test commands:

- Test with sitemap parsing:
```bash
python main.py http://localhost:8080 --use-sitemap
```

- Test with robots.txt consideration:
```bash
python main.py http://localhost:8080 --respect-robots-txt
```

## Site Characteristics for Testing

- The site contains a mix of pages that link to subpages
- Some deeper pages (depth >= 3) are disallowed in robots.txt (see the example excerpt after this list)
- Pages have consistent navigation but varying depth
- The sitemap includes all non-disallowed pages with metadata
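
As a rough illustration, the generated `robots.txt` uses the standard format; the exact paths depend on the generated structure, so the entries below are hypothetical. The sitemap lists the remaining (non-disallowed) pages in standard sitemap.org format.

```
User-agent: *
# Deeper sections (depth >= 3) are intentionally disallowed for testing
Disallow: /section1/sub1/deep1/
Disallow: /section2/sub3/deep2/

Sitemap: http://localhost:8080/sitemap.xml
```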

## Regenerating the Test Site

If you need to regenerate the test site with different characteristics, modify the configuration variables at the top of the `generate_test_site.py` file and run:

```bash
./venv/bin/python generate_test_site.py
```
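
The variable names below are only illustrative (check the top of `generate_test_site.py` for the actual names), but the configuration typically covers the page count, hierarchy depth, and output location:

```python
# Hypothetical configuration block; the real names in generate_test_site.py may differ
OUTPUT_DIR = "example-site"   # where the generated HTML is written
TOTAL_PAGES = 400             # approximate number of pages to generate
MAX_DEPTH = 5                 # deepest level of the page hierarchy
PAGES_PER_LEVEL = 5           # how many child pages each page links to
DISALLOW_FROM_DEPTH = 3       # pages at this depth or deeper get a robots.txt Disallow entry
```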
157 changes: 157 additions & 0 deletions README.md
@@ -70,3 +70,160 @@ Additional considerations:
For storing the crawled data:
- Define a clear structure for storing URLs and their associated content
- Consider what metadata to keep (status code, headers, timestamps)
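
As one possible shape (a sketch, not the project's actual data model), a record type along these lines captures both the content and the metadata mentioned above:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CrawledPage:
    """One crawled URL and what was fetched from it (illustrative sketch)."""
    url: str
    status_code: int
    fetched_at: str                    # ISO 8601 timestamp
    content: str = ""                  # raw HTML body
    headers: Dict[str, str] = field(default_factory=dict)
    links: List[str] = field(default_factory=list)   # outgoing links found on the page
```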

## User Guide

### Installation

1. Clone the repository:
```bash
git clone https://github.com/your-username/scraper.git
cd scraper
```

2. Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```

3. Install dependencies:
```bash
pip install -r requirements.txt
```

### Basic Usage

To start crawling a website:

```bash
python main.py https://example.com
```

This will crawl the website with default settings (depth of 3, respecting robots.txt, not following external links).

### Command Line Options

The scraper supports the following command-line arguments:

| Option | Description |
|--------|-------------|
| `url` | The URL to start crawling from (required) |
| `-h, --help` | Show help message and exit |
| `-d, --depth DEPTH` | Maximum recursion depth (default: 3) |
| `--allow-external` | Allow crawling external domains |
| `--no-subdomains` | Disallow crawling subdomains |
| `-c, --concurrency CONCURRENCY` | Maximum concurrent requests (default: 10) |
| `--no-cache` | Disable caching |
| `--cache-dir CACHE_DIR` | Directory for cache storage |
| `--delay DELAY` | Delay between requests in seconds (default: 0.1) |
| `-v, --verbose` | Enable verbose logging |
| `--output-dir OUTPUT_DIR` | Directory to save results as JSON files |
| `--print-pages` | Print scraped pages to console |
| `--ignore-robots` | Ignore robots.txt rules |
| `--use-sitemap` | Use sitemap.xml for URL discovery |
| `--max-subsitemaps MAX_SUBSITEMAPS` | Maximum number of sub-sitemaps to process (default: 5) |
| `--sitemap-timeout SITEMAP_TIMEOUT` | Timeout in seconds for sitemap processing (default: 30) |
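
For orientation, a CLI with these options could be declared with `argparse` roughly as follows; this is a sketch of the interface (showing a subset of the flags), not necessarily how `main.py` implements it:

```python
import argparse

parser = argparse.ArgumentParser(description="Recursive web scraper")
parser.add_argument("url", help="The URL to start crawling from")
parser.add_argument("-d", "--depth", type=int, default=3, help="Maximum recursion depth")
parser.add_argument("--allow-external", action="store_true", help="Allow crawling external domains")
parser.add_argument("--no-subdomains", action="store_true", help="Disallow crawling subdomains")
parser.add_argument("-c", "--concurrency", type=int, default=10, help="Maximum concurrent requests")
parser.add_argument("--delay", type=float, default=0.1, help="Delay between requests in seconds")
parser.add_argument("--ignore-robots", action="store_true", help="Ignore robots.txt rules")
parser.add_argument("--use-sitemap", action="store_true", help="Use sitemap.xml for URL discovery")
args = parser.parse_args()
```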

### Examples

#### Crawl with a specific depth limit:
```bash
python main.py https://example.com --depth 5
```

#### Allow crawling external domains:
```bash
python main.py https://example.com --allow-external
```

#### Save crawled pages to a specific directory:
```bash
python main.py https://example.com --output-dir results
```

#### Use sitemap for discovery with a longer timeout:
```bash
python main.py https://example.com --use-sitemap --sitemap-timeout 60
```

#### Maximum performance for a large site:
```bash
python main.py https://example.com --depth 4 --concurrency 20 --ignore-robots
```

#### Crawl site slowly to avoid rate limiting:
```bash
python main.py https://example.com --delay 1.0
```

## Testing

The project includes a Docker-based local test environment that serves a generated website with a known, controlled structure for development and testing.

### Test Environment Features

- 400+ HTML pages in a hierarchical structure
- Maximum depth of 5 levels
- Navigation links between pages at different levels
- Proper `robots.txt` and `sitemap.xml` files
- Random metadata on pages for testing extraction

### Setting Up the Test Environment

1. Make sure Docker and Docker Compose are installed and running.

2. Generate the test site (if not already done):
```bash
./venv/bin/python generate_test_site.py
```

3. Start the Nginx server:
```bash
docker-compose up -d
```

4. The test site will be available at http://localhost:8080

### Running Tests Against the Test Environment

#### Basic crawl:
```bash
python main.py http://localhost:8080 --depth 2
```

#### Test with sitemap parsing:
```bash
python main.py http://localhost:8080 --use-sitemap
```

#### Test robots.txt handling:
```bash
# Default behavior respects robots.txt
python main.py http://localhost:8080 --depth 4

# Ignore robots.txt to crawl all pages
python main.py http://localhost:8080 --depth 4 --ignore-robots
```

#### Save the crawled results:
```bash
python main.py http://localhost:8080 --output-dir test_results
```

### Stopping the Test Environment

To stop the Docker container:
```bash
docker-compose down
```

### Regenerating the Test Site

If you need to regenerate the test site with different characteristics, modify the configuration variables at the top of the `generate_test_site.py` file and run:

```bash
./venv/bin/python generate_test_site.py
```

For more details on the test environment, see the [README-test-environment.md](README-test-environment.md) file.
11 changes: 11 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,11 @@
version: '3'

services:
nginx:
image: nginx:alpine
ports:
- "8080:80"
volumes:
- ./example-site:/usr/share/nginx/html
- ./nginx/nginx.conf:/etc/nginx/conf.d/default.conf
restart: always