
Cleaning up data

  • Starting with all of the scraped data: scrape_combined.csv
  • Clean.py: Duplicates were removed by reading each entire line as a string, converting the collection of lines to a set (which removes exact duplicates), and writing the result to clean.csv (sketched after this list).
  • clean_addr.py: Given clean.csv, the '/' in addresses was replaced with 'and' (sketched after this list). Output = clean_addr.csv. We were not able to find a quick and inexpensive way of geocoding a database of over 1 million addresses, so we leveraged the Socrata data, which already contains geocoded addresses.
  • t_geocode.py: From the Socrata data, t_geocode builds a dictionary keyed by address, with latitude and longitude as the values (sketched after this list). Using this dictionary we were able to geocode 897,406 of 1,032,429 entries, roughly 87% of our database. Output = mapped.csv
  • split.py: Splits the mapped.csv output into two files (sketched after this list): mapped_non_gc.csv, which contains the non-geocoded entries, and mapped_gc.csv, which contains the geocoded entries.
  • The header was added to mapped_gc.csv, and the final output is named scraped_geocoded.csv.
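
The deduplication step in Clean.py can be sketched as below. This is a minimal illustration of the set-based approach described above, assuming scrape_combined.csv fits in memory; it is not the script itself.

```python
# Minimal sketch of the set-based deduplication described for Clean.py.
# Assumes scrape_combined.csv fits in memory; a plain set does not preserve row order.
with open("scrape_combined.csv") as src:
    unique_lines = set(src)  # each whole line is one string, so exact duplicates collapse

with open("clean.csv", "w") as dst:
    dst.writelines(unique_lines)
```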
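A rough sketch of the clean_addr.py step follows; the address column index is a placeholder, since the actual layout of clean.csv isn't documented here.

```python
import csv

ADDR_COL = 2  # placeholder: the real index of the address column in clean.csv may differ

with open("clean.csv", newline="") as src, open("clean_addr.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        # e.g. "MAIN ST / 1ST AVE" -> "MAIN ST and 1ST AVE"
        row[ADDR_COL] = row[ADDR_COL].replace("/", "and")
        writer.writerow(row)
```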
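The t_geocode.py lookup might look roughly like this; the Socrata file name and its column names ("address", "latitude", "longitude") are assumptions, as is the address column index.

```python
import csv

ADDR_COL = 2  # placeholder index of the address column, as above

# Build address -> (lat, long) from the already-geocoded Socrata export.
# File name and column names here are assumptions.
lookup = {}
with open("socrata.csv", newline="") as f:
    for row in csv.DictReader(f):
        lookup[row["address"].strip().upper()] = (row["latitude"], row["longitude"])

# Append lat/long to every row we can match; unmatched rows get blank fields.
with open("clean_addr.csv", newline="") as src, open("mapped.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        lat, lon = lookup.get(row[ADDR_COL].strip().upper(), ("", ""))
        writer.writerow(row + [lat, lon])
```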
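Finally, a sketch of the routing done by split.py, assuming the last two columns of mapped.csv hold latitude and longitude as in the sketch above.

```python
import csv

with open("mapped.csv", newline="") as src, \
     open("mapped_gc.csv", "w", newline="") as gc, \
     open("mapped_non_gc.csv", "w", newline="") as non_gc:
    gc_writer, non_gc_writer = csv.writer(gc), csv.writer(non_gc)
    for row in csv.reader(src):
        # Rows with both lat and long filled in are geocoded; the rest are not.
        (gc_writer if row[-1] and row[-2] else non_gc_writer).writerow(row)
```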

Geocoding

Lots and lots and lots.

Geocoding as a service

Pelias

Postgres/PostGIS
