
Cleaning up data

  • Starting with all of the scraped data: scrape_combined.csv
  • Clean.py: Duplicates were removed by reading each entire line as a string, converting the collection of lines to a set (which removes exact duplicates), and writing the result to clean.csv (sketched after this list).
  • clean_addr.py: Given clean.csv, the '/' in addresses was replaced with 'and' (sketched after this list). Output = clean_addr.csv. We were not able to find a quick and inexpensive way of geocoding a database of over 1 million addresses, so we leveraged the Socrata data, which already contains geocoded addresses.
  • t_geocode.py: From the Socrata data, t_geocode builds a dictionary keyed by address, with latitude and longitude as the values (sketched after this list). Using this dictionary we were able to geocode 897,406 of 1,032,429 entries, roughly 87% of our database. Output = mapped.csv
  • split.py: Splits the mapped.csv output into two files (sketched after this list): mapped_non_gc.csv, which contains the non-geocoded entries, and mapped_gc.csv, which contains the geocoded entries.
  • The header was added to mapped_gc.csv, and the final output is named scraped_geocoded.csv.
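
The deduplication step in Clean.py can be sketched as below. This is a minimal illustration of the set-based approach described above, assuming scrape_combined.csv fits in memory; it is not the script itself.

```python
# Minimal sketch of the set-based deduplication described for Clean.py.
# Assumes scrape_combined.csv fits in memory; a plain set does not preserve row order.
with open("scrape_combined.csv") as src:
    unique_lines = set(src)  # each whole line is one string, so exact duplicates collapse

with open("clean.csv", "w") as dst:
    dst.writelines(unique_lines)
```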
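A rough sketch of the clean_addr.py step follows; the address column index is a placeholder, since the actual layout of clean.csv isn't documented here.

```python
import csv

ADDR_COL = 2  # placeholder: the real index of the address column in clean.csv may differ

with open("clean.csv", newline="") as src, open("clean_addr.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        # e.g. "MAIN ST / 1ST AVE" -> "MAIN ST and 1ST AVE"
        row[ADDR_COL] = row[ADDR_COL].replace("/", "and")
        writer.writerow(row)
```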
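The t_geocode.py lookup might look roughly like this; the Socrata file name and its column names ("address", "latitude", "longitude") are assumptions, as is the address column index.

```python
import csv

ADDR_COL = 2  # placeholder index of the address column, as above

# Build address -> (lat, long) from the already-geocoded Socrata export.
# File name and column names here are assumptions.
lookup = {}
with open("socrata.csv", newline="") as f:
    for row in csv.DictReader(f):
        lookup[row["address"].strip().upper()] = (row["latitude"], row["longitude"])

# Append lat/long to every row we can match; unmatched rows get blank fields.
with open("clean_addr.csv", newline="") as src, open("mapped.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        lat, lon = lookup.get(row[ADDR_COL].strip().upper(), ("", ""))
        writer.writerow(row + [lat, lon])
```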
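Finally, a sketch of the routing done by split.py, assuming the last two columns of mapped.csv hold latitude and longitude as in the sketch above.

```python
import csv

with open("mapped.csv", newline="") as src, \
     open("mapped_gc.csv", "w", newline="") as gc, \
     open("mapped_non_gc.csv", "w", newline="") as non_gc:
    gc_writer, non_gc_writer = csv.writer(gc), csv.writer(non_gc)
    for row in csv.reader(src):
        # Rows with both lat and long filled in are geocoded; the rest are not.
        (gc_writer if row[-1] and row[-2] else non_gc_writer).writerow(row)
```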

Geocoding

Lots and lots and lots.

Geocoding as a service

Pelias

Postgres/PostGIS
