1 change: 1 addition & 0 deletions .gitignore
@@ -70,3 +70,4 @@ htmlcov/
.coverage.*
coverage.xml
*.cover
/example-site/*
72 changes: 72 additions & 0 deletions README-test-environment.md
@@ -0,0 +1,72 @@
# Web Scraper Test Environment

This directory contains a complete local test environment for testing the web scraper against a controlled website with a known structure.

## Generated Test Site

A test website with the following characteristics has been generated:
- 400+ HTML pages in a hierarchical structure
- Maximum depth of 5 levels
- Navigation links between pages at different levels
- Proper `robots.txt` and `sitemap.xml` files
- Random metadata on pages for testing extraction

## Directory Structure

- `example-site/` - Contains all the generated HTML files and resources
  - `index.html` - Homepage
  - `page*.html` - Top-level pages
  - `section*/` - Section directories with their own pages
  - `robots.txt` - Contains crawler directives with some intentionally disallowed pages
  - `sitemap.xml` - XML sitemap with all publicly available pages

- `nginx/` - Contains Nginx configuration
  - `nginx.conf` - Server configuration with directory listing enabled (a minimal sketch follows this list)

- `docker-compose.yml` - Docker Compose configuration for running Nginx

- `generate_test_site.py` - Script that generated the test site
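
Directory listing is a standard Nginx feature (`autoindex`). The actual `nginx/nginx.conf` in this repository may differ in detail, but a minimal configuration along these lines would serve the generated site with listings enabled:

```nginx
server {
    listen 80;
    server_name localhost;

    root /usr/share/nginx/html;
    index index.html;

    # Enable directory listing so the structure is browsable without links
    autoindex on;

    location / {
        try_files $uri $uri/ =404;
    }
}
```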

## Running the Test Environment

1. Make sure Docker and Docker Compose are installed and running
2. Start the Nginx server:
```bash
docker-compose up -d
```
3. The test site will be available at http://localhost:8080
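
Once the container is running, you can quickly confirm that the site and its crawler-related files are being served (assuming `curl` is available):

```bash
# Check the homepage headers and the crawler-related files
curl -I http://localhost:8080/
curl http://localhost:8080/robots.txt
curl http://localhost:8080/sitemap.xml | head
```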

## Testing the Scraper

You can test your scraper against this environment with:

```bash
python main.py http://localhost:8080 --depth 3
```

Additional test commands:

- Test with sitemap parsing:
```bash
python main.py http://localhost:8080 --use-sitemap
```

- Test with robots.txt consideration:
```bash
python main.py http://localhost:8080 --respect-robots-txt
```

## Site Characteristics for Testing

- The site contains a mix of pages that link to subpages
- Some deeper pages (depth >= 3) are disallowed in robots.txt (see the example excerpt after this list)
- Pages have consistent navigation but varying depth
- The sitemap includes all non-disallowed pages with metadata
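
As a rough illustration, the generated `robots.txt` uses the standard format; the exact paths depend on the generated structure, so the entries below are hypothetical. The sitemap lists the remaining (non-disallowed) pages in standard sitemap.org format.

```
User-agent: *
# Deeper sections (depth >= 3) are intentionally disallowed for testing
Disallow: /section1/sub1/deep1/
Disallow: /section2/sub3/deep2/

Sitemap: http://localhost:8080/sitemap.xml
```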

## Regenerating the Test Site

If you need to regenerate the test site with different characteristics, modify the configuration variables at the top of the `generate_test_site.py` file and run:

```bash
./venv/bin/python generate_test_site.py
```
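
The variable names below are only illustrative (check the top of `generate_test_site.py` for the actual names), but the configuration typically covers the page count, hierarchy depth, and output location:

```python
# Hypothetical configuration block; the real names in generate_test_site.py may differ
OUTPUT_DIR = "example-site"   # where the generated HTML is written
TOTAL_PAGES = 400             # approximate number of pages to generate
MAX_DEPTH = 5                 # deepest level of the page hierarchy
PAGES_PER_LEVEL = 5           # how many child pages each page links to
DISALLOW_FROM_DEPTH = 3       # pages at this depth or deeper get a robots.txt Disallow entry
```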
157 changes: 157 additions & 0 deletions README.md
@@ -70,3 +70,160 @@ Additional considerations:
For storing the crawled data:
- Define a clear structure for storing URLs and their associated content
- Consider what metadata to keep (status code, headers, timestamps)
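
As one possible shape (a sketch, not the project's actual data model), a record type along these lines captures both the content and the metadata mentioned above:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CrawledPage:
    """One crawled URL and what was fetched from it (illustrative sketch)."""
    url: str
    status_code: int
    fetched_at: str                    # ISO 8601 timestamp
    content: str = ""                  # raw HTML body
    headers: Dict[str, str] = field(default_factory=dict)
    links: List[str] = field(default_factory=list)   # outgoing links found on the page
```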

## User Guide

### Installation

1. Clone the repository:
```bash
git clone https://github.com/your-username/scraper.git
cd scraper
```

2. Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```

3. Install dependencies:
```bash
pip install -r requirements.txt
```

### Basic Usage

To start crawling a website:

```bash
python main.py https://example.com
```

This will crawl the website with default settings (depth of 3, respecting robots.txt, not following external links).

### Command Line Options

The scraper supports the following command-line arguments:

| Option | Description |
|--------|-------------|
| `url` | The URL to start crawling from (required) |
| `-h, --help` | Show help message and exit |
| `-d, --depth DEPTH` | Maximum recursion depth (default: 3) |
| `--allow-external` | Allow crawling external domains |
| `--no-subdomains` | Disallow crawling subdomains |
| `-c, --concurrency CONCURRENCY` | Maximum concurrent requests (default: 10) |
| `--no-cache` | Disable caching |
| `--cache-dir CACHE_DIR` | Directory for cache storage |
| `--delay DELAY` | Delay between requests in seconds (default: 0.1) |
| `-v, --verbose` | Enable verbose logging |
| `--output-dir OUTPUT_DIR` | Directory to save results as JSON files |
| `--print-pages` | Print scraped pages to console |
| `--ignore-robots` | Ignore robots.txt rules |
| `--use-sitemap` | Use sitemap.xml for URL discovery |
| `--max-subsitemaps MAX_SUBSITEMAPS` | Maximum number of sub-sitemaps to process (default: 5) |
| `--sitemap-timeout SITEMAP_TIMEOUT` | Timeout in seconds for sitemap processing (default: 30) |
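
For orientation, a CLI with these options could be declared with `argparse` roughly as follows; this is a sketch of the interface (showing a subset of the flags), not necessarily how `main.py` implements it:

```python
import argparse

parser = argparse.ArgumentParser(description="Recursive web scraper")
parser.add_argument("url", help="The URL to start crawling from")
parser.add_argument("-d", "--depth", type=int, default=3, help="Maximum recursion depth")
parser.add_argument("--allow-external", action="store_true", help="Allow crawling external domains")
parser.add_argument("--no-subdomains", action="store_true", help="Disallow crawling subdomains")
parser.add_argument("-c", "--concurrency", type=int, default=10, help="Maximum concurrent requests")
parser.add_argument("--delay", type=float, default=0.1, help="Delay between requests in seconds")
parser.add_argument("--ignore-robots", action="store_true", help="Ignore robots.txt rules")
parser.add_argument("--use-sitemap", action="store_true", help="Use sitemap.xml for URL discovery")
args = parser.parse_args()
```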

### Examples

#### Crawl with a specific depth limit:
```bash
python main.py https://example.com --depth 5
```

#### Allow crawling external domains:
```bash
python main.py https://example.com --allow-external
```

#### Save crawled pages to a specific directory:
```bash
python main.py https://example.com --output-dir results
```

#### Use sitemap for discovery with a longer timeout:
```bash
python main.py https://example.com --use-sitemap --sitemap-timeout 60
```

#### Maximum performance for a large site:
```bash
python main.py https://example.com --depth 4 --concurrency 20 --ignore-robots
```

#### Crawl site slowly to avoid rate limiting:
```bash
python main.py https://example.com --delay 1.0
```

## Testing

The project includes a Docker-based local test environment that serves a generated website with a known, controlled structure for development and testing.

### Test Environment Features

- 400+ HTML pages in a hierarchical structure
- Maximum depth of 5 levels
- Navigation links between pages at different levels
- Proper `robots.txt` and `sitemap.xml` files
- Random metadata on pages for testing extraction

### Setting Up the Test Environment

1. Make sure Docker and Docker Compose are installed and running.

2. Generate the test site (if not already done):
```bash
./venv/bin/python generate_test_site.py
```

3. Start the Nginx server:
```bash
docker-compose up -d
```

4. The test site will be available at http://localhost:8080

### Running Tests Against the Test Environment

#### Basic crawl:
```bash
python main.py http://localhost:8080 --depth 2
```

#### Test with sitemap parsing:
```bash
python main.py http://localhost:8080 --use-sitemap
```

#### Test robots.txt handling:
```bash
# Default behavior respects robots.txt
python main.py http://localhost:8080 --depth 4

# Ignore robots.txt to crawl all pages
python main.py http://localhost:8080 --depth 4 --ignore-robots
```

#### Save the crawled results:
```bash
python main.py http://localhost:8080 --output-dir test_results
```

### Stopping the Test Environment

To stop the Docker container:
```bash
docker-compose down
```

### Regenerating the Test Site

If you need to regenerate the test site with different characteristics, modify the configuration variables at the top of the `generate_test_site.py` file and run:

```bash
./venv/bin/python generate_test_site.py
```

For more details on the test environment, see the [README-test-environment.md](README-test-environment.md) file.
11 changes: 11 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,11 @@
version: '3'

services:
nginx:
image: nginx:alpine
ports:
- "8080:80"
volumes:
- ./example-site:/usr/share/nginx/html
- ./nginx/nginx.conf:/etc/nginx/conf.d/default.conf
restart: always