This script efficiently downloads all dataset metadata from Brazil's open data portal, dados.gov.br.
The metadata has plenty of goodies, such as direct links to the dataset downloads, file formats, tags, full descriptions, etc.
It works around the API's 9999-item pagination limit by sequentially scraping smaller categories based on license type (cc-by, cc-zero, etc.). This recovers (almost) all available metadata: 11600 out of 14666 total datasets at the time of writing this README.
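As a rough illustration of that workaround, here is a minimal sketch of per-license pagination. It assumes a CKAN-style `package_search` endpoint and plain synchronous `requests` calls; the endpoint URL, parameter names, and license IDs are illustrative assumptions, not the scraper's actual implementation (which runs requests concurrently, see the options below).

```python
import requests

# Illustrative only: the endpoint, query parameters, and license IDs are
# assumptions (standard CKAN-style package_search), not taken from the scraper.
API_URL = "https://dados.gov.br/api/3/action/package_search"
LICENSE_IDS = ["cc-by", "cc-zero", "odc-by", "notspecified"]  # hypothetical subset
PAGE_SIZE = 500

def fetch_license(license_id: str) -> list[dict]:
    """Page through a single license facet so no query ever hits the ~9999-item cap."""
    results: list[dict] = []
    start = 0
    while True:
        resp = requests.get(
            API_URL,
            params={"fq": f"license_id:{license_id}", "rows": PAGE_SIZE, "start": start},
            timeout=90,
        )
        resp.raise_for_status()
        batch = resp.json()["result"]["results"]
        if not batch:  # facet exhausted
            return results
        results.extend(batch)
        start += PAGE_SIZE

if __name__ == "__main__":
    all_metadata = [pkg for lic in LICENSE_IDS for pkg in fetch_license(lic)]
    print(f"Fetched metadata for {len(all_metadata)} datasets")
```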
- Prerequisites:
  - Python 3.11+
  - uv
- Installation: Clone this repository and use `uv sync` to create the venv and install the necessary packages:

  ```bash
  git clone https://github.com/pedrolabonia/dadosabertos-scraper.git
  cd dadosabertos-scraper
  uv sync
  ```
- Execution: Run the scraper using the `scrape` command. All files will be saved to a single output directory.
  - Run with defaults (recommended):

    ```bash
    uv run scrape
    ```

  - Run with custom arguments:

    ```bash
    uv run scrape --page_size 500 --concurrency 20 --output_dir ./my_data
    ```

  - See all options:

    ```bash
    uv run scrape --help
    ```
A 90-second timeout is recommended, since the API can take a while to respond.
| Argument | Default | Description |
|---|---|---|
| `--page_size` | `500` | Records to fetch per API request. |
| `--concurrency` | `10` | Max number of parallel download requests. |
| `--timeout` | `90` | Timeout in seconds for each HTTP request. |
| `--output_dir` | `scraped_data` | Directory to save the output `.json` files. |
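Once a run finishes, the metadata can be consumed straight from the output directory. Below is a small, hedged sketch of loading the results; it assumes each file in `scraped_data` is a standalone JSON document, which may differ from the scraper's exact file layout.

```python
import json
from pathlib import Path

def load_metadata(output_dir: str = "scraped_data") -> list[dict]:
    """Load every .json file from the output directory.

    The one-document-per-file layout is an assumption; adjust to whatever
    the scraper actually writes.
    """
    records = []
    for path in sorted(Path(output_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            records.append(json.load(f))
    return records

if __name__ == "__main__":
    datasets = load_metadata()
    print(f"Loaded {len(datasets)} metadata files")
```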