Processes
- Starting with all of the scraped data: scrape_combined.csv. (A sketch of the full pipeline follows this list.)
- Clean.py: Duplicates were removed by converting each entire line into a string, adding the lines to a set (which removes duplicates), and writing the result to clean.csv.
- clean_addr.py: Given clean.csv, the '/' in addresses was replaced with 'and'. Output = clean_addr.csv. We were not able to find a quick and cheap way to geocode a database with over 1 million addresses, so we leveraged the Socrata data, which already contains geocoded addresses.
- t_geocode.py: From the Socrata data, t_geocode builds a dictionary with addresses as keys and latitude/longitude pairs as values. Using this dictionary we were able to geocode 897,406 of 1,032,429 entries, or 87% of our database. Output = mapped.csv
- split.py: Splits mapped.csv into two files: mapped_non_gc.csv, containing the non-geocoded entries, and mapped_gc.csv, containing the geocoded entries.
- The header was added to mapped_gc.csv, and the final output is named scraped_geocoded.csv.
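
A minimal sketch of the steps above, assuming plain CSV files and an address column; the column position and the Socrata column names (address, latitude, longitude) are assumptions for illustration, not code from this repo:

```python
import csv

ADDRESS_COL = 2  # assumed position of the address field in the scraped rows

# Clean.py: de-duplicate by treating each full line as a string and
# collecting the lines in a set.
with open('scrape_combined.csv') as f:
    unique_lines = set(f.readlines())
with open('clean.csv', 'w') as f:
    f.writelines(unique_lines)

# clean_addr.py: replace '/' in addresses with 'and' (applied to the whole
# line here for brevity; the spacing around 'and' is assumed).
with open('clean.csv') as f, open('clean_addr.csv', 'w') as out:
    for line in f:
        out.write(line.replace('/', ' and '))

# t_geocode.py: build an address -> (lat, long) lookup from the Socrata
# export, then join it onto the cleaned rows.
lookup = {}
with open('socrata.csv', newline='') as f:
    for row in csv.DictReader(f):
        lookup[row['address'].strip().upper()] = (row['latitude'], row['longitude'])

with open('clean_addr.csv', newline='') as f, open('mapped.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for row in csv.reader(f):
        lat, lng = lookup.get(row[ADDRESS_COL].strip().upper(), ('', ''))
        writer.writerow(row + [lat, lng])

# split.py: separate geocoded from non-geocoded rows by checking whether
# the appended latitude field is empty.
with open('mapped.csv', newline='') as f, \
        open('mapped_gc.csv', 'w', newline='') as gc, \
        open('mapped_non_gc.csv', 'w', newline='') as non_gc:
    gc_writer, non_gc_writer = csv.writer(gc), csv.writer(non_gc)
    for row in csv.reader(f):
        (gc_writer if row[-2] else non_gc_writer).writerow(row)
```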
Lots and lots and lots of geocoding options are available (a sample request is sketched after this list):
- Mapbox: Up to 50 queries may be included in a single batch geocoding request.
- Geocoding in the browser
- US Census geocoder
- Data Science Toolkit
- OpenCage Geocoder
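
As one concrete example, here is a minimal sketch of a single-address lookup against the US Census geocoder; the endpoint, benchmark name, and response shape are assumptions based on the public API and are not code from this project:

```python
import requests

def census_geocode(address):
    """Return (lat, long) for an address, or None if the geocoder finds no match."""
    resp = requests.get(
        'https://geocoding.geo.census.gov/geocoder/locations/onelineaddress',
        params={'address': address, 'benchmark': 'Public_AR_Current', 'format': 'json'},
        timeout=30,
    )
    resp.raise_for_status()
    matches = resp.json()['result']['addressMatches']
    if not matches:
        return None
    coords = matches[0]['coordinates']
    return coords['y'], coords['x']  # y is latitude, x is longitude

print(census_geocode('400 Broad St, Seattle, WA'))
```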
Tools — Processes — Ideas — [Database](https://github.com/jacquestardie/seattle/wiki/Database) — [RDS](https://github.com/jacquestardie/seattle/wiki/RDS-Database) — [Deploy](https://github.com/jacquestardie/seattle/wiki/Deploy)