@Agamya-Samuel Agamya-Samuel commented Apr 15, 2025

Pull Request Summary

Reference: #5

Historical Quote Backfill Tool

Summary

This PR adds a utility application to backfill historical quotes from Wikiquote's archives (dating back to 2007) into the Quote of the Day database. The tool handles multiple HTML formats from different time periods, processes quotes concurrently using asyncio, and ensures proper date formatting.

Features

  • Scrapes historical quotes from Wikiquote's monthly archives through the Wikiquote API (2007-present)
  • Handles three different HTML parsing formats (pre-2012, Feb-Mar 2012, Apr 2012+)
  • Processes URLs concurrently with asyncio for improved performance
  • Creates quote entries with standardized date format (YYYY-MM-DD)
  • Handles edge cases (missing authors/quotes, URL validation)
  • Includes detailed configuration file with archive URLs
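The concurrent processing listed above could look roughly like the sketch below. It is an illustration only, not the code in this PR: `fetch_quote`, `process_urls`, and the `max_concurrency` limit are hypothetical names, and the network call is replaced by a stand-in so the example runs offline.

```python
import asyncio


async def fetch_quote(url: str) -> dict:
    """Hypothetical stand-in for the real fetch-and-parse step."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    if not url.startswith("https://"):
        raise ValueError(f"invalid URL: {url}")
    return {"url": url, "quote": "placeholder text"}


async def process_urls(urls, max_concurrency=5):
    """Process all URLs concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url):
        async with semaphore:
            return await fetch_quote(url)

    # return_exceptions=True keeps one bad URL from cancelling the rest.
    results = await asyncio.gather(
        *(worker(u) for u in urls), return_exceptions=True
    )
    ok = [r for r in results if not isinstance(r, Exception)]
    failed = [u for u, r in zip(urls, results) if isinstance(r, Exception)]
    return ok, failed
```

Bounding concurrency with a semaphore is one way to stay polite toward the upstream API; whether the tool limits concurrency this way is an assumption.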

Implementation Details

  • Created modular structure with clear separation of concerns
  • Implemented robust error handling and logging
  • Added thorough documentation with detailed comments
  • Included comprehensive configuration file with all archive URLs
  • Used intelligent parser selection based on date to handle format changes
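A minimal sketch of date-based parser selection, assuming the three format periods named under Features. The parser labels, the constant names, and the exact cut-over days (1 February and 1 April 2012) are assumptions made for illustration; in particular, grouping January 2012 with the oldest format is a guess, since the description does not say which format that month uses.

```python
from datetime import date

# Assumed cut-over dates; the PR only names the periods, not exact days.
FEB_2012_FORMAT_START = date(2012, 2, 1)
APR_2012_FORMAT_START = date(2012, 4, 1)


def select_parser(day: date) -> str:
    """Return a (hypothetical) parser label for the archive page of `day`."""
    if day < FEB_2012_FORMAT_START:
        return "legacy"        # oldest archive layout
    if day < APR_2012_FORMAT_START:
        return "transitional"  # Feb-Mar 2012 layout
    return "modern"            # Apr 2012 onward
```

Keeping the boundaries in named constants means a newly discovered format change needs only one more constant and one more branch.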

Testing

  • Verified extraction from different time periods with different HTML formats
  • Ensured all quotes are properly formatted with correct dates
  • Confirmed error handling for invalid or inaccessible URLs

The tool currently extracts every quote from 2007 through 2012-05-06.
It also extracts quotes for later dates, but some are missed; each change in the page's HTML format still needs its own handling.

…y directly from the WikiQuote API. Updated the function name for clarity and added detailed documentation explaining the rationale behind this approach. The new implementation ensures users receive the latest quote based on UTC time, accommodating different timezones. If the quote is not found in the database, it will be added automatically.
…king of quote entries.

- Updated the add_quote_to_db function to set these fields to the current UTC time upon creation. This change enhances the database schema and ensures accurate timestamps for each quote entry.
…hance structure.

- Updated Quote and QuoteCreate schemas for clarity and maintainability.
… and update docstring for improved understanding of the API behavior.
… for improved timestamp tracking.

- Updated the extract_quote function to include created_at and updated_at fields, both set to the current UTC time. This change enhances the data model by providing accurate timestamps for when quotes are created and updated, ensuring better tracking and management of quote entries.
- Created core functionality to process and extract quotes from Wikiquote.
- Added configuration file for quote URLs by year and month.
- Implemented main application entry point and asynchronous processing of quote URLs.
- Introduced utility functions for loading configuration, validating URLs, and parsing quote data.
- Set up logging for better debugging and error tracking.

This commit lays the foundation for the Historical Featured Quotes App, enabling the extraction and storage of quotes from specified URLs.
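The configuration loading and URL validation mentioned in this commit might be sketched as follows. The config layout (year, then month, then URL), the function names, and the `example.org` address are all hypothetical; the actual file format in the PR is not shown here.

```python
import json
from urllib.parse import urlparse


def is_valid_url(url) -> bool:
    """Accept only absolute http(s) URLs that include a host."""
    if not isinstance(url, str):
        return False
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def iter_archive_urls(config: dict):
    """Yield (year, month, url) in chronological order, skipping bad URLs."""
    for year in sorted(config):
        for month in sorted(config[year]):
            url = config[year][month]
            if is_valid_url(url):
                yield int(year), int(month), url


# Assumed layout: {"<year>": {"<zero-padded month>": "<archive URL>"}}
sample_config = json.loads(
    '{"2007": {"02": "not a url", "01": "https://example.org/archive/2007-01"}}'
)
```

Zero-padded month keys are assumed so that plain string sorting gives chronological order.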
…ions and features for historical quotes population
Merge pull request indictechcom#6 from Agamya-Samuel/feature/db-integration
@kcvelaga kcvelaga requested a review from Jayprakash-SE June 5, 2025 14:11
- Introduced API_HEADERS to include User-Agent and Accept headers, preventing 403 errors during API requests.
- Added extract_quotes_from_api_response function to process MediaWiki API responses, extracting quotes and determining the year from the title or HTML content.
- Improved error handling for API responses, including status code checks and JSON parsing validation.
- Improved create_multiple_quotes function to handle string date conversion, duplicate checking, and error logging.
- Added detailed logging for created, skipped, and errored quotes during batch processing.
- Refactored quote processing in backfill_historical_featured_quotes_app to ensure proper session management and error handling.
- Ensured that missing fields in quotes are handled gracefully before database insertion.
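The batch behaviour described in the last few commits (string date conversion, duplicate checking, created/skipped/errored counts, graceful handling of missing fields, UTC timestamps) can be illustrated with an in-memory sketch. This is not the PR's `create_multiple_quotes`: `insert_batch`, `normalize_date`, the `featured_date` key, the `"Unknown"` default author, and the dict used in place of a database session are all assumptions.

```python
from datetime import date, datetime, timezone


def normalize_date(value) -> date:
    """Accept a date object or a YYYY-MM-DD string; raise on anything else."""
    if isinstance(value, datetime):
        return value.date()
    if isinstance(value, date):
        return value
    return datetime.strptime(value, "%Y-%m-%d").date()


def insert_batch(quotes, store: dict) -> dict:
    """Insert quotes keyed by date into `store`, counting each outcome."""
    summary = {"created": 0, "skipped": 0, "errors": 0}
    for raw in quotes:
        try:
            day = normalize_date(raw["featured_date"])
        except (KeyError, TypeError, ValueError):
            summary["errors"] += 1   # missing or malformed date
            continue
        if day in store:
            summary["skipped"] += 1  # duplicate for this date
            continue
        now = datetime.now(timezone.utc)
        store[day] = {
            "quote": raw.get("quote") or "",
            "author": raw.get("author") or "Unknown",  # assumed default
            "created_at": now,
            "updated_at": now,
        }
        summary["created"] += 1
    return summary
```

Returning the counts as a dictionary makes it straightforward to log one summary line per batch.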