@nipsysdev
Owner

Closes #17

Pull Request Summary: Access Now Crawling Configuration Implementation

Overview

This PR implements crawling configuration for Access Now (AN), adding it as a new supported source for the Ethos crawler. The implementation follows the existing pattern of source configurations and includes both listing and content extraction capabilities.

Key Changes

1. New Source Configuration: Access Now

  • File: src/config/sources/an.ts
  • Added complete configuration for Access Now website crawling
  • Configured listing page extraction
  • Configured content page extraction with:
    • Container selector for main content
    • Field extraction for title, content, and author

2. Source Integration

  • File: src/config/sources/index.ts
  • Added Access Now source to the main sources array
  • Available as source ID "an"
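How a source registered in the sources array might be looked up by its ID can be sketched as follows. This is a minimal illustration only: the `SourceConfig` shape, the `accessNowSource` constant, and the `getSourceById` helper are hypothetical names and are not taken from the repository. Only the ID "an" comes from this PR.

```typescript
// Hypothetical, simplified source shape; the real type lives in src/core/types.ts.
interface SourceConfig {
  id: string;
  name: string;
}

// Assumed stand-in for the configuration exported from src/config/sources/an.ts.
const accessNowSource: SourceConfig = { id: "an", name: "Access Now" };

// Assumed stand-in for the main sources array in src/config/sources/index.ts.
const sources: SourceConfig[] = [accessNowSource];

// Hypothetical helper: resolve a source by its ID, or undefined if unknown.
function getSourceById(id: string): SourceConfig | undefined {
  return sources.find((source) => source.id === id);
}
```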

3. Core Type Updates

  • File: src/core/types.ts
  • Extended type definitions to support the new source configuration

4. Comprehensive Testing

  • Files: Multiple test files updated
  • Added integration tests in src/tests/integration/an-integration.test.ts
  • Added fixture data for content validation
  • Updated exclusion logic testing

5. Documentation Update

  • File: README.md
  • Updated supported sources list to include Access Now

Technical Details

Listing Page Extraction Improvements

The implementation includes enhanced error handling and validation in the listing page extraction process:

  • Better field extraction with exclusion selector support
  • Improved error reporting for missing required fields
  • Enhanced filtering logic with detailed reason tracking
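The required-field reporting and reason tracking described above could look roughly like the sketch below. It works on already-extracted plain records rather than on the DOM, and it does not cover exclusion selectors. All names (`FieldSpec`, `RawItem`, `FilterOutcome`, `validateItems`) are hypothetical and chosen for illustration; the crawler's actual types and function names may differ.

```typescript
// Hypothetical description of a field to extract from a listing item.
interface FieldSpec {
  name: string;
  required: boolean;
}

// An extracted item: field name mapped to text, or undefined when nothing matched.
type RawItem = Record<string, string | undefined>;

interface FilterOutcome {
  accepted: RawItem[];
  // Each filtered item keeps its position and a human-readable reason.
  filtered: { index: number; reason: string }[];
}

// Split items into accepted and filtered, recording why each one was dropped.
function validateItems(items: RawItem[], fields: FieldSpec[]): FilterOutcome {
  const outcome: FilterOutcome = { accepted: [], filtered: [] };
  items.forEach((item, index) => {
    // A required field counts as missing when absent or whitespace-only.
    const missing = fields
      .filter((field) => field.required && !item[field.name]?.trim())
      .map((field) => field.name);
    if (missing.length > 0) {
      outcome.filtered.push({
        index,
        reason: `missing required field(s): ${missing.join(", ")}`,
      });
    } else {
      outcome.accepted.push(item);
    }
  });
  return outcome;
}
```

Keeping the reason next to the item index makes it possible to report exactly which listing entries were skipped and why, instead of silently dropping them.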

Content Exclusion Logic

The Access Now configuration includes specific exclusion rules that filter out:

  • External content (identified by post-grid-item--external-icon)
  • Press releases (URLs containing "accessnow.org/press-release")
  • Guides (URLs containing "accessnow.org/guide")
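The three rules above can be sketched as a single predicate. This is an illustration under stated assumptions, not the code in src/config/sources/an.ts: the `ListingItem` shape, the `classNames` field, and the `checkExclusion` function are hypothetical, and the sketch assumes that `post-grid-item--external-icon` is a CSS class name found somewhere in the item's markup. The class name and the two URL fragments are the only values taken from this PR.

```typescript
// Hypothetical listing item; assumes the extractor collects class names from the item's markup.
interface ListingItem {
  url: string;
  classNames: string[];
}

interface ExclusionResult {
  excluded: boolean;
  reason?: string;
}

// Values taken from the PR description.
const EXTERNAL_ICON_CLASS = "post-grid-item--external-icon";
const EXCLUDED_URL_FRAGMENTS = [
  "accessnow.org/press-release",
  "accessnow.org/guide",
];

// Decide whether an item should be skipped, and report why.
function checkExclusion(item: ListingItem): ExclusionResult {
  if (item.classNames.includes(EXTERNAL_ICON_CLASS)) {
    return { excluded: true, reason: "external content" };
  }
  const fragment = EXCLUDED_URL_FRAGMENTS.find((f) => item.url.includes(f));
  if (fragment !== undefined) {
    return { excluded: true, reason: `URL contains "${fragment}"` };
  }
  return { excluded: false };
}
```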

Rate Limiting

Added a 10-second delay between pagination steps to avoid IP blocking by Access Now's anti-crawling mechanisms.
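A fixed pagination delay of this kind can be sketched as below. The `crawlPages` function, its `fetchPage` callback, and the stop-on-empty-page rule are assumptions made for illustration; only the 10-second value comes from this PR, and the crawler's real pagination code may be structured differently.

```typescript
// 10-second delay, as stated in the PR description.
const PAGINATION_DELAY_MS = 10 * 1000;

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical pagination loop: fetches pages in order, pausing before every
// page after the first, and stops at the first empty page or at maxPages.
async function crawlPages<T>(
  fetchPage: (page: number) => Promise<T[]>,
  maxPages: number,
  delayMs: number = PAGINATION_DELAY_MS,
): Promise<T[]> {
  const items: T[] = [];
  for (let page = 1; page <= maxPages; page++) {
    if (page > 1) {
      await sleep(delayMs);
    }
    const pageItems = await fetchPage(page);
    if (pageItems.length === 0) {
      break;
    }
    items.push(...pageItems);
  }
  return items;
}
```

Taking the delay as a parameter lets tests pass `0` while production code keeps the 10-second default.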

Test Coverage

  • Added integration tests for both listing and content page crawling
  • Created fixture data for content validation
  • Verified successful extraction of titles, URLs, dates, and content
  • Tested pagination navigation

@nipsysdev nipsysdev self-assigned this Sep 15, 2025
@nipsysdev nipsysdev added the enhancement New feature or request label Sep 15, 2025
@nipsysdev nipsysdev merged commit 6d2d601 into main Sep 15, 2025
5 checks passed
@nipsysdev nipsysdev deleted the feat/access_now_crawling_config branch September 15, 2025 20:20


Development

Successfully merging this pull request may close these issues.

Add AccessNow listing crawler configuration
