Author: Garin Wally; 2024-12-09. Developed with love and support from the Missoula Urban Transit District.
BoltETL is a data-processing utility and ETL framework intended for solo data analysts and small teams. It is currently built with pandas (until I'm more comfortable with polars) and Python (one of the world's most popular programming languages).
Transform diverse data sources into standardized, Pythonic objects through custom Datasource classes and a simple TOML configuration file, whether you're working with Excel reports, CSV files, or spatial data. Define exactly how your data is retrieved, updated, validated, and exported. Leverage high-performance "feather" and DuckDB filetypes for storing and caching both spatial and non-spatial data. Enjoy the simplicity of running your data tasks using a well-documented and dead-simple command line utility. No containers, no web servers, no enterprise infrastructure, just Python.
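For instance, a datasource's entry in the configuration file might look something like the snippet below. The key names here are hypothetical, not BoltETL's actual schema; the sketch just shows the idea, parsed with Python's built-in tomllib (Python 3.11+):

```python
# Hypothetical config entry; the actual keys are defined by each Datasource class.
import tomllib

CONFIG = """
[parcels]
url = "https://example.com/parcels.zip"  # where raw data is downloaded from (assumed key)
cache = "data/cache/parcels.feather"     # where processed data is cached (assumed key)
"""

config = tomllib.loads(CONFIG)
print(config["parcels"]["cache"])  # -> data/cache/parcels.feather
```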
BoltETL is:
- A flexible, single-tool data solution
- Highly customizable
- Scalable to any data quantity or complexity
- Free and open source
- Capable of handling both tabular and spatial data (via geopandas)
- Adaptable to your unique data challenges
BoltETL is not:
- A replacement for enterprise data pipelining
- An out-of-the-box solution
If you are unfamiliar with git, download this repository from the "Releases" page.
...
...
invoke no-shows 2024-01-01 2024-01-31
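Commands like this are plain invoke tasks. The repository's real task definitions aren't shown here, so the following is only a sketch of how such a task could be wired up; the `no_shows` body is a placeholder:

```python
# Hypothetical sketch of a tasks.py entry; not BoltETL's actual task definition.
from invoke import task


@task
def no_shows(c, start_date, end_date):
    """Build the no-shows report for the given date range (YYYY-MM-DD)."""
    print(f"Building no-shows report from {start_date} to {end_date}")
    # ...load the relevant datasource, filter to the range, and export here...
```

invoke translates the underscore in `no_shows` to a dash, so the task is callable as `invoke no-shows 2024-01-01 2024-01-31`.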
A datasource is a Python class (a minimal sketch appears after this list) that encapsulates:
- The path to, and the loaded, raw data (one or more inputs)
  - Loaded as `pandas.DataFrame` or `geopandas.GeoDataFrame` objects
- The processed data
  - Loaded as `pandas.DataFrame` or `geopandas.GeoDataFrame` (spatial) objects
- The logic for how to process the data, make fixes, and prepare it
- Writing and loading of "cached" or preprocessed data files for quick loading later
  - e.g. "feather" files
- And other data-specific rules (WIP)
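To make this concrete, here is a minimal sketch of what such a class might look like. This is hypothetical: the `MyReport` name, the file paths, and the method bodies are placeholders, though the method names match the workflow described below.

```python
# Hypothetical sketch of a datasource; BoltETL's actual base class is not shown here.
from pathlib import Path

import pandas as pd


class MyReport:  # placeholder datasource name
    source_path = Path("data/raw/report.csv")       # raw input (assumed location)
    cache_path = Path("data/cache/report.feather")  # preprocessed cache (assumed location)

    def __init__(self):
        self.raw = None   # raw data, populated by extract()
        self.data = None  # processed data, populated by transform()

    def extract(self):
        """Load the raw file into memory as a DataFrame."""
        self.raw = pd.read_csv(self.source_path)

    def transform(self):
        """Codify the manual pre-processing steps for this source."""
        self.data = self.raw.dropna(how="all")  # placeholder fix-up

    def write_cache(self):
        """Cache the processed data as a feather file for quick loading later."""
        self.data.to_feather(self.cache_path)

    def read_cache(self):
        """Load previously processed data without re-running transform()."""
        self.data = pd.read_feather(self.cache_path)
```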
For example, let's look at Missoula County Parcels:
- We first import the `Parcels` datasource, and instantiate it with `parcels = Parcels()`
- We could call `parcels.download()` to update the raw data from the source website
  - NOTE: this `download` method hasn't been implemented yet
- Then call `parcels.extract()` to pull data out of the shapefile and into the `parcels.raw` class attribute, which is a `geopandas.GeoDataFrame` for in-memory processing
- Then the `transform()` method is used, which codifies and executes what a human might otherwise have to do manually to pre-process the data (a sketch of this method follows the list). For example:
  - Since shapefiles truncate column names to 11 characters, we might want to rename them
    - e.g. `parcels.rename({"PARCELID": "ParcelID"}, axis=1, inplace=True)`
  - We might want to reproject the "geometry" column to use feet rather than meters
    - e.g. `parcels.to_crs("epsg:6515")`
  - We might want to re-cast the data types to something that works better and faster
    - e.g. string columns like "PARCELID" are stored as numpy object arrays by default, which don't handle null values; casting them to pyarrow strings not only yields an 84% speed increase but also requires less type-casting boilerplate code
  - Other processing, etc.
- At the end of this method, the processed data is made available as the `parcels.data` attribute (so, for those following along, both the raw data and the processed data are available from the object)
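Putting those example fixes together, the Parcels `transform()` might look roughly like this. Treat it as a hypothetical sketch rather than BoltETL's actual implementation:

```python
# Hypothetical sketch of Parcels.transform(); not the actual implementation.
def transform(self):
    df = self.raw.copy()
    # Undo the shapefile's 11-character column-name truncation
    df = df.rename({"PARCELID": "ParcelID"}, axis=1)
    # Reproject the geometry to a feet-based CRS (per the example above)
    df = df.to_crs("epsg:6515")
    # Re-cast object-dtype string columns to pyarrow-backed strings,
    # which handle nulls natively and are much faster to work with
    df["ParcelID"] = df["ParcelID"].astype("string[pyarrow]")
    self.data = df
    return self.data
```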
In summary, downloading and processing Missoula County Parcels would look like this:
```python
from datasources import Parcels

parcels = Parcels()    # Initialize or instantiate object
parcels.download()     # Get data (file) from source
parcels.extract()      # Load raw data into memory
parcels.transform()    # Do the processing
parcels.write_cache()  # Cache the processed data for later

# Or
# parcels.update()     # Which could handle the calls to download, extract, transform, and cache
```

and loading the processed and cached data would then look like this:
```python
from datasources import Parcels

parcels = Parcels()   # Initialize or instantiate object
parcels.read_cache()  # Load the cached, processed data into parcels.data
# Then do whatever with it here...
```