
Conversation

@m26dvd
Contributor

@m26dvd m26dvd commented Dec 10, 2025

BREAKING CHANGES
fix: #1500 - Renfrewshire Council - This scraper has been redeveloped and now requires only the UPRN value of your address. It also no longer requires Selenium

Fixes
fix: #1771 - South Lanarkshire Council
fix: #1777 - Newark and Sherwood District Council
fix: #1760 - Folkestone and Hythe District Council

Summary by CodeRabbit

  • New Features

    • Expanded UPRN-based lookup support for Renfrewshire, Newark and Sherwood, and additional councils.
  • Bug Fixes

    • Improved collection data retrieval for Folkestone and Hythe, Renfrewshire, and South Lanarkshire councils.
  • Documentation

    • Updated usage guides and command-line parameters for multiple councils; some now require UPRN instead of postcode. Added Isle of Anglesey council documentation.


@codecov

codecov bot commented Dec 10, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.81%. Comparing base (83cce53) to head (4b293ef).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1772   +/-   ##
=======================================
  Coverage   86.81%   86.81%           
=======================================
  Files           9        9           
  Lines        1138     1138           
=======================================
  Hits          988      988           
  Misses        150      150           


fix: robbrad#1771 -  South Lanarkshire Council
@coderabbitai
Contributor

coderabbitai bot commented Dec 10, 2025

📝 Walkthrough

Walkthrough

This PR addresses issues across four council scrapers by replacing legacy approaches with modern data retrieval methods: removing Selenium web driver configuration from Folkestone and Hythe, introducing UPRN-based API fetching for Newark and Sherwood, replacing Selenium with direct HTTP and ICS parsing for Renfrewshire, adding error guards for South Lanarkshire, and updating documentation for affected councils.

Changes

Cohort / File(s) Summary
Council Scraper: Folkestone and Hythe
uk_bin_collection/uk_bin_collection/councils/FolkstoneandHytheDistrictCouncil.py
Replaced HTML section-based parsing with GET requests to two API endpoints; now iterates over service-card articles to extract bin types and collection dates; simplified data extraction logic.
Council Scraper: Renfrewshire
uk_bin_collection/uk_bin_collection/councils/RenfrewshireCouncil.py
Eliminated Selenium-based automation and Cloudflare workarounds; introduced direct HTTP approach with UPRN-based view URL construction; replaced HTML parsing with ICS file retrieval and icalevents parsing; extracts events within 365 days and maps to bin types.
Council Scraper: Newark and Sherwood
uk_bin_collection/uk_bin_collection/councils/NewarkAndSherwoodDC.py
Added UPRN validation via check_uprn(); constructs remote URI with pid parameter; performs HTTP GET on constructed URL instead of parsing provided page content; downstream date extraction logic preserved.
Council Scraper: South Lanarkshire
uk_bin_collection/uk_bin_collection/councils/SouthLanarkshireCouncil.py
Introduced guard clause to skip rows where schedule_type is "Area"; prevents index out of range error when area names lack spaces; minor date formatting adjustments.
Test Configuration
uk_bin_collection/tests/input.json
Removed web_driver field from Folkestone and Hythe entry; migrated Renfrewshire from address-based fields (house_number, paon, postcode) to UPRN-based (uprn: "123118848"); updated URLs and wiki notes.
Documentation
wiki/Councils.md
Added Isle of Anglesey entry; updated Cumberland, Epsom and Ewell, Merton, Renfrewshire, and North Hertfordshire entries with new URLs and parameter configurations (shifted from postcode/web_driver to UPRN-based parameters where applicable).
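The Newark and Sherwood change summarized above (UPRN validation, then a GET against a constructed URL with a pid parameter) can be sketched roughly as follows. The endpoint URL and the inline validation are illustrative placeholders, not the council's real API or the codebase's check_uprn helper:

```python
# Hypothetical sketch of the UPRN-based fetch pattern described in the
# walkthrough; the real endpoint and parameter handling may differ.
import requests


def fetch_collections(uprn: str, timeout: int = 10) -> str:
    # The real scraper validates the UPRN via check_uprn() from common.py;
    # a simple digit check stands in for it here.
    if not uprn.isdigit():
        raise ValueError(f"Invalid UPRN: {uprn}")
    # Construct the remote URI with the pid parameter, as the summary
    # describes (the base URL below is a placeholder).
    url = f"https://example.newark-sherwooddc.gov.uk/bins?pid={uprn}"
    # Perform the HTTP GET on the constructed URL instead of parsing
    # caller-provided page content.
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text
```

Downstream date extraction would then run on the returned page content, as before.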

Sequence Diagrams

sequenceDiagram
    participant Client
    participant CouncilSite as Council Website
    participant ICSServer as ICS Server
    participant Parser as ICS Parser

    rect rgb(220, 245, 220)
    Note over Client,Parser: New Renfrewshire Flow
    Client->>CouncilSite: GET view URL<br/>(with UPRN)
    CouncilSite-->>Client: HTML page with ICS link
    Client->>Client: Locate ICS link on page
    Client->>ICSServer: GET ICS file
    ICSServer-->>Client: ICS calendar data
    Client->>Parser: Parse ICS events<br/>(next 365 days)
    Parser-->>Client: Bin type + collection date pairs
    Client->>Client: Sort by collectionDate
    end

    rect rgb(245, 220, 220)
    Note over Client: Old Renfrewshire Flow (Selenium + Cloudflare bypass)
    Client->>CouncilSite: Initialize browser via Selenium
    CouncilSite-->>Client: Cloudflare challenge
    Client->>Client: Cloudflare bypass logic
    Client->>CouncilSite: Dynamic interaction + wait
    CouncilSite-->>Client: Rendered HTML
    Client->>Client: Parse HTML sections
    end
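The new Renfrewshire flow in the diagram can be sketched as below. The page URL, the ICS link selector, and the exact field names are assumptions for illustration; only the overall shape (page fetch with the UPRN, ICS lookup via icalevents, a 365-day window, sort by collectionDate) follows the walkthrough:

```python
from datetime import datetime, timedelta


def sort_bins(bins: list[dict]) -> list[dict]:
    # Final step of the flow: sort entries by collectionDate (dd/mm/yyyy).
    return sorted(
        bins, key=lambda b: datetime.strptime(b["collectionDate"], "%d/%m/%Y")
    )


def get_bins(uprn: str) -> list[dict]:
    # Third-party imports are kept local so sort_bins stays importable alone.
    import requests
    from bs4 import BeautifulSoup
    from icalevents.icalevents import events

    # Step 1: GET the view URL built from the UPRN (placeholder URL, not the
    # council's real endpoint).
    page = requests.get(
        f"https://example.renfrewshire.gov.uk/bins?uprn={uprn}", timeout=10
    )
    page.raise_for_status()

    # Step 2: locate the ICS link on the page (the selector is a guess at
    # the real markup).
    soup = BeautifulSoup(page.text, "html.parser")
    ics_url = soup.find("a", href=lambda h: h and h.endswith(".ics"))["href"]

    # Step 3: parse ICS events within the next 365 days.
    now = datetime.now()
    found = events(ics_url, start=now, end=now + timedelta(days=365))

    # Step 4: map events to bin type + collection date pairs, then sort.
    return sort_bins(
        [
            {"type": ev.summary, "collectionDate": ev.start.strftime("%d/%m/%Y")}
            for ev in found
        ]
    )
```

Compared with the old flow, there is no browser, no Cloudflare bypass, and no rendered-HTML parsing: two plain HTTP GETs and a calendar parse.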

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • dp247

Poem

🐰 Four councils fixed with nimble code,
No more Selenium's heavy load!
ICS files hop and skip with glee,
UPRNs fetch with API spree,
Bin schedules hopping ever free! 🗑️✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 25.00%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
  • Title check (❓ Inconclusive): The title 'fix: Council Fix Pack - January 2026' is vague and generic; 'Fix Pack' doesn't convey specific information about the changeset. Consider a more specific title that highlights the primary change, such as 'fix: Update council scrapers for Renfrewshire, South Lanarkshire, Newark and Sherwood, and Folkestone and Hythe'.
✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Linked Issues check (✅ Passed): The code changes directly address all four linked issues: Renfrewshire scraper reworked to remove Selenium (#1500), South Lanarkshire parsing logic fixed (#1771), Newark and Sherwood scraper updated (#1777), and Folkestone and Hythe scraper rewritten (#1760).
  • Out of Scope Changes check (✅ Passed): All changes are scoped to fixing the four mentioned councils and updating related documentation; no out-of-scope modifications are present.


m26dvd and others added 3 commits December 15, 2025 12:01
@m26dvd m26dvd marked this pull request as ready for review December 26, 2025 23:52
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
wiki/Councils.md (1)

85-86: Remove duplicate Cumberland entry from contents list and clarify the two council sections.

The contents list at lines 85–86 contains a duplicate - [Cumberland](#cumberland) entry. Additionally, the file has two separate sections both titled "### Cumberland" (one for CumberlandAllerdaleCouncil using postcode/house number parameters, another for CumberlandCouncil using UPRN). Since Allerdale merged into Cumberland, clarify which entry users should use—either remove the legacy CumberlandAllerdaleCouncil entry, add a deprecation note, or use distinct section headings to differentiate them.

🧹 Nitpick comments (3)
uk_bin_collection/uk_bin_collection/councils/RenfrewshireCouncil.py (2)

39-55: Consider error handling for ICS parsing.

The events() function from icalevents could fail if the ICS file is malformed or the URL is unreachable. While the current implementation will naturally propagate exceptions, consider whether more specific error messages would improve debugging.

Potential improvement
-        # Parse ICS calendar
-        upcoming_events = events(ics_url, start=now, end=future)
+        # Parse ICS calendar
+        try:
+            upcoming_events = events(ics_url, start=now, end=future)
+        except Exception as e:
+            raise ValueError(
+                f"Failed to parse ICS calendar for UPRN {user_uprn}: {str(e)}"
+            ) from e

This would provide clearer error messages if ICS parsing fails, though the existing behavior of propagating the original exception is also acceptable.


57-59: Use date_format constant for consistency.

The hardcoded date format string "%d/%m/%Y" should ideally reference the date_format constant to ensure consistency across the codebase.

🔎 Suggested improvement
         bindata["bins"].sort(
-            key=lambda x: datetime.strptime(x.get("collectionDate"), "%d/%m/%Y")
+            key=lambda x: datetime.strptime(x.get("collectionDate"), date_format)
         )

This ensures that if date_format is ever changed, the sorting logic remains consistent.

uk_bin_collection/uk_bin_collection/councils/FolkstoneandHytheDistrictCouncil.py (1)

4-4: Static analysis flagged star imports.

Static analysis tools flag datetime and date_format as potentially undefined because they're imported via star import from common. While this is a codebase-wide pattern, explicit imports would improve clarity and help static analysis tools.

Optional refactor to explicit imports
+from datetime import datetime
 from uk_bin_collection.uk_bin_collection.common import *

Note: This assumes date_format is intentionally provided by the common module's star export.

Also applies to: 38-38, 42-42

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 83cce53 and f14840a.

📒 Files selected for processing (6)
  • uk_bin_collection/tests/input.json
  • uk_bin_collection/uk_bin_collection/councils/FolkstoneandHytheDistrictCouncil.py
  • uk_bin_collection/uk_bin_collection/councils/NewarkAndSherwoodDC.py
  • uk_bin_collection/uk_bin_collection/councils/RenfrewshireCouncil.py
  • uk_bin_collection/uk_bin_collection/councils/SouthLanarkshireCouncil.py
  • wiki/Councils.md
🧰 Additional context used
🧬 Code graph analysis (1)
uk_bin_collection/uk_bin_collection/councils/NewarkAndSherwoodDC.py (1)
uk_bin_collection/uk_bin_collection/common.py (1)
  • check_uprn (67-78)
🪛 Ruff (0.14.10)
uk_bin_collection/uk_bin_collection/councils/NewarkAndSherwoodDC.py

20-20: check_uprn may be undefined, or defined from star imports

(F405)


25-25: requests may be undefined, or defined from star imports

(F405)

uk_bin_collection/uk_bin_collection/councils/RenfrewshireCouncil.py

22-22: check_uprn may be undefined, or defined from star imports

(F405)


28-28: Probable use of requests call without timeout

(S113)


35-37: Avoid specifying long messages outside the exception class

(TRY003)


53-53: date_format may be undefined, or defined from star imports

(F405)

uk_bin_collection/uk_bin_collection/councils/SouthLanarkshireCouncil.py

96-96: date_format may be undefined, or defined from star imports

(F405)

uk_bin_collection/uk_bin_collection/councils/FolkstoneandHytheDistrictCouncil.py

38-38: datetime may be undefined, or defined from star imports

(F405)


42-42: date_format may be undefined, or defined from star imports

(F405)

🔇 Additional comments (11)
wiki/Councils.md (6)

160-160: ✓ Isle of Anglesey entry is well integrated.

The new council entry is properly alphabetized in the contents list and follows the documentation format. The note clarifying that either UPRN or postcode/house number can be passed, and that the URL is not used, is helpful.

Also applies to: 2148-2158


1646-1654: ✓ Epsom and Ewell parameters and Selenium requirement properly documented.

The updated command, parameter list, and explicit note about Selenium requirement are clear and consistent with other Selenium-dependent councils in the documentation.


3141-3147: ✓ Renfrewshire scraper modernization properly documented.

The updated URL and parameter set (UPRN-only, no Selenium) align with the stated PR objective. The removal of -p and -w parameters and the updated note clearly communicate the breaking change. Users transitioning from the old approach will need to adjust their setup.


1753-1762: ✓ Folkestone and Hythe entry is properly documented.

The UPRN-based approach without Selenium is clearly presented and consistent with other modernized council entries.


2526-2532: ✓ Merton and North Hertfordshire entries updated consistently.

Both councils are now documented with the UPRN-based approach (-s -u parameters) and updated URLs. The changes align with the PR's modernization strategy and are properly documented.

Also applies to: 2809-2815


1-2: Verify that input.json has been updated to match these documentation changes.

This file is marked as auto-generated from uk_bin_collection/tests/input.json. The marked changes should be reflected in input.json to ensure they persist after regeneration. If these changes were made directly to Councils.md without updating the source file, they will be lost during the next regeneration.

Confirm that all council entry updates (Isle of Anglesey, Cumberland, Epsom and Ewell, Merton, North Hertfordshire, Renfrewshire) have corresponding updates in input.json.

uk_bin_collection/uk_bin_collection/councils/SouthLanarkshireCouncil.py (2)

74-75: LGTM! Guard prevents index error on "Area" rows.

This addition directly addresses issue #1771 by skipping rows where schedule_type is "Area", which likely lack the expected table cell structure that would cause the "list index out of range" error at line 77 when accessing .split(" ")[1].
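A minimal sketch of that guard, with hypothetical row and field names (the real scraper operates on parsed HTML table rows, not dicts):

```python
# Sketch of the guard described above: skip "Area" rows before splitting
# the cell text, so area names without spaces cannot raise IndexError.
def parse_rows(rows: list[dict]) -> list[str]:
    results = []
    for row in rows:
        if row.get("schedule_type") == "Area":
            # Guard clause: these rows lack the "<day> <date>" cell the
            # split below expects, so skip them entirely.
            continue
        # Without the guard, a single-word area name like "Hamilton"
        # would make [1] raise "list index out of range".
        results.append(row["cell"].split(" ")[1])
    return results
```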


95-97: LGTM! Improved readability.

The reformatting of the date formatting statement improves code readability with no functional changes.

uk_bin_collection/tests/input.json (1)

1973-1980: LGTM! Test data updated for uprn-based flow.

The test configuration correctly reflects the migration from Selenium-based scraping to the uprn-based ICS retrieval approach:

  • Replaced house_number, paon, and postcode with uprn
  • Removed web_driver requirement
  • Updated wiki_note to reference FindMyAddress for UPRN lookup
uk_bin_collection/uk_bin_collection/councils/RenfrewshireCouncil.py (2)

1-5: LGTM! Imports updated for ICS-based approach.

The imports correctly reflect the migration from Selenium to a direct HTTP + ICS parsing approach:

  • requests for HTTP calls
  • icalevents for ICS calendar parsing
  • datetime utilities for date handling

21-23: LGTM! UPRN parameter handling is correct.

The code properly extracts and validates the UPRN parameter using the check_uprn helper function from the common module.

Comment on lines +26 to +28
session = requests.session()
response = session.get(URI1)
response = session.get(URI2)
Contributor

@coderabbitai coderabbitai bot Dec 26, 2025


⚠️ Potential issue | 🟠 Major

Add HTTP error handling and clarify the purpose of URI1.

The response from the first GET request to URI1 is immediately overwritten by the second request to URI2. If URI1 is required to establish session state (e.g., cookies), this should be documented. If it's unnecessary, remove it to reduce latency.

Additionally, both GET requests lack error handling—network failures or non-2xx responses will raise unhandled exceptions and break the integration.

🔎 Proposed fix with error handling
 # Make the GET request
 session = requests.session()
-response = session.get(URI1)
-response = session.get(URI2)
+session.get(URI1)  # Initialize session state if needed
+response = session.get(URI2)
+response.raise_for_status()  # Raise HTTPError for bad status codes
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-session = requests.session()
-response = session.get(URI1)
-response = session.get(URI2)
+session = requests.session()
+session.get(URI1)  # Initialize session state if needed
+response = session.get(URI2)
+response.raise_for_status()  # Raise HTTPError for bad status codes
🤖 Prompt for AI Agents
In
uk_bin_collection/uk_bin_collection/councils/FolkstoneandHytheDistrictCouncil.py
around lines 26 to 28, the first session.get(URI1) call is immediately
overwritten by session.get(URI2) (losing the first response) and neither request
has any HTTP/network error handling; decide whether URI1 is required for session
state (if so, add a comment explaining its purpose and preserve its response or
at least its effect on session cookies) or remove the unnecessary call to reduce
latency, and add robust error handling for both requests by wrapping each
session.get in a try/except for requests.exceptions.RequestException, using a
timeout, and calling response.raise_for_status() (or handling non-2xx responses)
so network errors and HTTP errors are caught and handled gracefully instead of
crashing.

Collaborator


I'm guessing the first response is discarded because that's what the website requires to function?

Contributor


@dp247 Yes, you're correct! The first request to URI1 is setting up the session state (likely cookies or other session data) that the second request to URI2 requires to function properly. The response from the first request isn't used—only its side effect on the session object matters.

This is a common pattern when dealing with web applications where an API endpoint requires prior session initialization from visiting the main page first.

However, I'd still recommend:

  1. Adding a comment to document this behavior for future maintainers:
# Make the GET request
session = requests.session()
session.get(URI1)  # Initialize session state (cookies) required by URI2
response = session.get(URI2)
  1. Adding error handling to make the integration more robust against network failures or HTTP errors:
session = requests.session()
session.get(URI1)  # Initialize session state (cookies) required by URI2
response = session.get(URI2)
response.raise_for_status()  # Raise HTTPError for bad status codes

The raise_for_status() call will help catch issues early if the API returns error responses, making debugging easier if the integration breaks again in the future.

Comment on lines +33 to +44
for collection in collections:
bin_type = collection.find("h3")
bin_type = bin_type.text
next_collection = collection.find("p", {"class": "service-next"})
date_str = next_collection.text.split(":")[1].split("(")[0].strip()
dt = datetime.strptime(date_str, "%A, %d %B %Y")

dict_data = {
"type": bin_type.strip(),
"collectionDate": dt.strftime(date_format),
}
bindata["bins"].append(dict_data)
Copy link
Contributor

@coderabbitai coderabbitai bot Dec 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Add defensive error handling for HTML parsing and date extraction.

The extraction loop assumes the HTML structure is always present and correctly formatted. If the structure changes or elements are missing, unhandled exceptions will occur:

  • Lines 34-35: collection.find("h3") can return None, causing AttributeError when accessing .text
  • Lines 36-37: collection.find("p", ...) can return None, and the string split operations assume specific format
  • Line 38: datetime.strptime() can raise ValueError if the date format differs

Since this fix addresses issue #1760 where the integration became unavailable, robust error handling is essential to prevent similar failures.

🔎 Proposed fix with defensive checks
 collections = soup.find_all("article", {"class": "service-card"})
 for collection in collections:
-    bin_type = collection.find("h3")
-    bin_type = bin_type.text
-    next_collection = collection.find("p", {"class": "service-next"})
-    date_str = next_collection.text.split(":")[1].split("(")[0].strip()
-    dt = datetime.strptime(date_str, "%A, %d %B %Y")
-
-    dict_data = {
-        "type": bin_type.strip(),
-        "collectionDate": dt.strftime(date_format),
-    }
-    bindata["bins"].append(dict_data)
+    bin_type_elem = collection.find("h3")
+    if not bin_type_elem:
+        continue
+    
+    next_collection = collection.find("p", {"class": "service-next"})
+    if not next_collection or ":" not in next_collection.text:
+        continue
+    
+    try:
+        date_str = next_collection.text.split(":")[1].split("(")[0].strip()
+        dt = datetime.strptime(date_str, "%A, %d %B %Y")
+    except (ValueError, IndexError):
+        continue
+
+    dict_data = {
+        "type": bin_type_elem.text.strip(),
+        "collectionDate": dt.strftime(date_format),
+    }
+    bindata["bins"].append(dict_data)
🧰 Tools
🪛 Ruff (0.14.10)

38-38: datetime may be undefined, or defined from star imports

(F405)


42-42: date_format may be undefined, or defined from star imports

(F405)

🤖 Prompt for AI Agents
In
uk_bin_collection/uk_bin_collection/councils/FolkstoneandHytheDistrictCouncil.py
around lines 33 to 44, the loop assumes HTML nodes and date format always exist;
add defensive parsing: null-check the result of collection.find("h3") and
collection.find("p", {"class":"service-next"}) before accessing .text, validate
the split result has the expected parts (guard against missing ":" or "("),
parse the date inside a try/except catching ValueError, and if any step fails
log a warning/debug with context (collection index or raw HTML) and skip
appending that bin entry instead of letting an exception bubble up; only build
and append dict_data when type text and parsed date are confirmed valid.

Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't hurt to add in the suggested error handling

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dp247, here's the complete code with defensive error handling to prevent failures when the HTML structure changes:

collections = soup.find_all("article", {"class": "service-card"})
for collection in collections:
    bin_type_elem = collection.find("h3")
    if not bin_type_elem:
        continue
    
    next_collection = collection.find("p", {"class": "service-next"})
    if not next_collection or ":" not in next_collection.text:
        continue
    
    try:
        date_str = next_collection.text.split(":")[1].split("(")[0].strip()
        dt = datetime.strptime(date_str, "%A, %d %B %Y")
    except (ValueError, IndexError):
        continue

    dict_data = {
        "type": bin_type_elem.text.strip(),
        "collectionDate": dt.strftime(date_format),
    }
    bindata["bins"].append(dict_data)

This approach:

  • Checks if h3 and p elements exist before accessing their text
  • Validates the expected string format (presence of :)
  • Wraps date parsing in try/except to handle format mismatches
  • Silently skips malformed entries rather than crashing the entire collection
