
Write data extractor task for airflow #4

@tdunning

Description


This task should extract precipitation information from the GRIB files and produce:

  • an index file for finding neighboring grid points. This file should be in feather format and should map H3 indexes at resolutions 6 ... 9 to a list of the grid points within each index cell. The list should contain the resolution-15 H3 index of each grid point.
  • a daily set of data files in feather format that contain hourly precipitation information for multiple grid points. Data should be allocated to files by sorting on grid point H3 index and then on time.
  • a metadata file that records which grid points each data file contains

Questions:

  • How many grid points should be assigned to each file to achieve the desired retrieval time for 100 days of data for a single point?
  • Should different grid points be partitioned by row group to improve read times?
  • How can data integrity be verified?
  • Can the metadata be replaced with a deterministic mapping from nearest grid point to file name (something like mod of a hash)?
  • How can we best maintain a single index file that merges all observed grid points?
  • Should we be merging many days of data into single data files?
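The deterministic-mapping question above could be sketched as follows, assuming grid points are identified by their resolution-15 H3 cell id; the shard count and file-name pattern are hypothetical, not decided.

```python
import hashlib

N_FILES = 256  # hypothetical shard count; the real value depends on retrieval-time targets

def file_for_cell(h3_cell: str, n_files: int = N_FILES) -> str:
    """Deterministically map a grid point's H3 cell id to a data-file name.

    Uses hashlib rather than Python's built-in hash(), which is salted
    per process and therefore not stable across runs.
    """
    digest = hashlib.sha256(h3_cell.encode("ascii")).digest()
    shard = int.from_bytes(digest[:4], "big") % n_files
    return f"precip-{shard:04d}.feather"

# The same cell always maps to the same file, so no metadata lookup is needed.
assert file_for_cell("8fa8100000001") == file_for_cell("8fa8100000001")
```

One caveat: unlike allocation by sorted H3 index, hashing scatters spatially adjacent grid points across files, so reading a neighborhood of points may touch many files.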

Links:

https://arrow.apache.org/docs/python/
https://github.com/agstack/weather-server/tree/main/experiments/s2-geohash
