Conversation

@johnburbridge
Owner

No description provided.

…mory and persistent cache

- Modified clear_expired() to track unique URLs using a set

- Changed SQL query to fetch URLs instead of just count

- Updated test to verify cache state without relying on has() method

- Ensures consistent behavior across Python versions
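The set-based counting described in this commit can be sketched roughly as follows. This is a minimal illustration and not the PR's code: the `TwoTierCache` class, its `cache` table schema, and the `ttl_seconds`, `stored_at`, and `now` parameters are all assumptions made for the example. The point it shows is that a URL held in both the in-memory and the persistent layer is counted once, because the SQL query fetches URLs rather than a row count.

```python
import sqlite3
import time


class TwoTierCache:
    """Hypothetical cache: an in-memory dict backed by a SQLite table."""

    def __init__(self, ttl_seconds, db_path=":memory:"):
        self.ttl = ttl_seconds
        self.memory = {}  # url -> (stored_at, content)
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(url TEXT PRIMARY KEY, stored_at REAL, content TEXT)"
        )

    def set(self, url, content, stored_at=None):
        # Write the entry to both layers with the same timestamp.
        ts = time.time() if stored_at is None else stored_at
        self.memory[url] = (ts, content)
        self.conn.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)", (url, ts, content)
        )
        self.conn.commit()

    def clear_expired(self, now=None):
        """Remove expired entries from both layers; return the number of unique URLs removed."""
        now = time.time() if now is None else now
        cutoff = now - self.ttl
        expired = set()

        # In-memory layer: collect the expired URLs first, then delete them.
        for url in [u for u, (ts, _) in self.memory.items() if ts < cutoff]:
            del self.memory[url]
            expired.add(url)

        # Persistent layer: fetch the URLs (not just COUNT(*)) so that an
        # entry present in both layers is only counted once.
        rows = self.conn.execute(
            "SELECT url FROM cache WHERE stored_at < ?", (cutoff,)
        ).fetchall()
        expired.update(url for (url,) in rows)
        self.conn.execute("DELETE FROM cache WHERE stored_at < ?", (cutoff,))
        self.conn.commit()

        return len(expired)
```

A test in the spirit of the third bullet would then inspect `memory` and the table directly instead of going through a `has()` method.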
- Add RobotsParser class for parsing robots.txt files

- Add SitemapParser class for parsing sitemap.xml files

- Update Crawler to respect robots.txt and use sitemaps

- Add command line options for robots.txt and sitemaps

- Add unit tests for both parsers

- Add lxml dependency for XML parsing
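To illustrate what respecting robots.txt involves, here is a small sketch that uses Python's standard-library `urllib.robotparser` in place of the PR's own `RobotsParser` class, whose interface is not shown on this page. The robots.txt content, the `my-crawler` user agent, and the `build_parser` helper are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (an assumption for illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
"""


def build_parser(text):
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A crawler would check each URL before fetching it.
allowed = parser.can_fetch("my-crawler", "https://example.com/index.html")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data")

# Crawl delay and sitemap locations declared in robots.txt.
delay = parser.crawl_delay("my-crawler")
sitemaps = parser.site_maps()  # available in Python 3.8+
```

The sitemap URLs returned by `site_maps()` are the natural starting point for a sitemap parser, which is presumably how the two parsers in this commit fit together.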
- Add max_subsitemaps parameter to limit number of subsitemaps processed

- Add overall_timeout parameter to control maximum processing time

- Implement concurrent processing of subsitemaps using asyncio

- Update command line options to control sitemap processing

- Update tests to work with enhanced sitemap parser
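The combination of a subsitemap limit, an overall timeout, and concurrent processing might look roughly like the sketch below. It is not the PR's implementation: it uses the standard-library `xml.etree.ElementTree` rather than lxml to stay self-contained, and `collect_urls`, `extract_locs`, and the injected `fetch` coroutine are hypothetical names. Only the `max_subsitemaps` and `overall_timeout` parameter names come from the commit message, and their default values here are assumptions.

```python
import asyncio
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_text):
    """Return the text of every <loc> element in a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.findall(".//sm:loc", NS) if el.text]


async def collect_urls(index_xml, fetch, max_subsitemaps=5, overall_timeout=10.0):
    """Fetch subsitemaps concurrently and return the page URLs they list.

    `fetch` is an async callable taking a URL and returning XML text.
    """
    # Limit how many subsitemaps are processed at all.
    sub_urls = extract_locs(index_xml)[:max_subsitemaps]
    if not sub_urls:
        return []

    async def process(url):
        return extract_locs(await fetch(url))

    tasks = [asyncio.create_task(process(u)) for u in sub_urls]

    # Wait for all tasks, but no longer than the overall timeout.
    done, pending = await asyncio.wait(tasks, timeout=overall_timeout)
    for task in pending:
        task.cancel()

    # Keep results from tasks that finished without error, in index order.
    urls = []
    for task in tasks:
        if task in done and not task.cancelled() and task.exception() is None:
            urls.extend(task.result())
    return urls
```

Using `asyncio.wait` with a timeout keeps whatever subsitemaps finished in time instead of discarding everything when the deadline passes. Whether the PR keeps partial results in the same way is not stated here.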
@johnburbridge merged commit 6e4036a into main on Mar 18, 2025
2 checks passed
johnburbridge added a commit that referenced this pull request Mar 21, 2025
Update repository references from johnburbridge/scraper to spiralhous…