@Mayureshpawar29 Mayureshpawar29 commented Dec 4, 2025

Description

  • When scraping multiple URLs, different pages may have different HTML structures. Previously, if the container XPath matched nothing on one page, the entire pipeline failed. This made it impossible to process mixed content where only some pages contain the container.

  • Test pipeline

tasks:
  - name: read_urls
    type: file
    path: test/pipelines/urls.txt
    
  - name: split
    type: split
    
  - name: prepare_request
    type: jq
    path: |
      {
        endpoint: .,
        method: "GET"
      }
      
  - name: fetch_html
    type: http
    
  - name: extract_data
    type: xpath
    ignore_missing: true  # <-- Test this flag
    container: "//*[@id='wrap']/div/div/div/div[3]/div[2]/div/div"
    fields:
      title: "./div"
      
  - name: output
    type: echo
    only_data: true

Test URLs (test/pipelines/urls.txt):

"https://www.example.com/"
"https://publicwww.com/"

  • Expected result: with ignore_missing: true, the pipeline continues when the container is missing on a page.
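
The behaviour described above can be illustrated with a small, hedged Python sketch. It is not the project's implementation: the extract function, the ContainerNotFound exception, and the choice to return an empty list for a page without the container are all assumptions made for illustration. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and needs well-formed markup, so the container expression is simplified compared with the test pipeline.

```python
import xml.etree.ElementTree as ET


class ContainerNotFound(Exception):
    """Hypothetical error: the container XPath matched nothing."""


def extract(doc, container, fields, ignore_missing=False):
    """Return one dict per container match, mapping field names to text.

    Assumption: with ignore_missing=True, a page without the container
    yields an empty list instead of raising.
    """
    root = ET.fromstring(doc)
    nodes = root.findall(container)
    if not nodes:
        if ignore_missing:
            return []  # skip this page, let the pipeline continue
        raise ContainerNotFound(container)

    rows = []
    for node in nodes:
        row = {}
        for name, path in fields.items():
            match = node.find(path)  # field paths are relative to the container
            row[name] = match.text if match is not None else None
        rows.append(row)
    return rows


# Two sample pages: only the first one has the container.
PAGE_WITH = "<html><body><div id='wrap'><div>Hello</div></div></body></html>"
PAGE_WITHOUT = "<html><body><p>nothing here</p></body></html>"

CONTAINER = ".//div[@id='wrap']"
FIELDS = {"title": "./div"}

results = [
    extract(page, CONTAINER, FIELDS, ignore_missing=True)
    for page in (PAGE_WITH, PAGE_WITHOUT)
]
```

With ignore_missing left at its default of False, the second page would raise ContainerNotFound and stop the loop, which mirrors the failure mode this change addresses.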

Types of changes

  • Docs change / refactoring / dependency upgrade
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • My code follows the code style of this project.
  • My change requires a change to the documentation and I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • I have checked downstream dependencies (e.g. ExternalTaskSensors) by searching for DAG name elsewhere in the repo

@Mayureshpawar29 Mayureshpawar29 merged commit 846788c into main Dec 5, 2025
6 checks passed
@Mayureshpawar29 Mayureshpawar29 deleted the xpath-container-missing-fix branch December 5, 2025 06:17