Description
We should consider using a more efficient storage format, such as Parquet, for the host and domain-level rankings. The tooling to read Parquet files has improved in recent years, and readers for this format are now available for almost all programming languages.
Requirements (at least nice to have):
- smaller storage footprint
- easy analysis and quick lookups by domain name using big data tools (e.g. Amazon Athena, a wish expressed on the CC group)
  - note: this will probably require sorting the data by reverse domain name
- still fast to get the top-n ranking domains
- well-defined table schema including column descriptions
- example code showing how to use the new data format
- (optionally) also store the column holding the node IDs
  - this would make the vertex file(s) obsolete
  - the textual files holding the edges could then also be dropped, because the edges (the unlabeled graph) are already stored, and more efficiently, in the webgraph (.graph) format
- would allow adding more columns, e.g. indegrees and outdegrees, with little overhead
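
To illustrate why sorting by reverse domain name helps with lookups, here is a minimal sketch in plain Python. It is not tied to Parquet or to the actual ranking files: the sample rows, the rank values and the helper names (`reverse_domain`, `lookup_domain`, `top_n`) are all invented for illustration. The idea is that once names are stored as `com.example.www`, a domain and all of its subdomains fall into narrow key ranges that can be found by binary search. In a Parquet file sorted the same way, a query engine could presumably use the per-row-group min/max statistics in a similar fashion to skip most of the data.

```python
import bisect
import heapq


def reverse_domain(name: str) -> str:
    """Reverse the labels of a host name: 'www.example.com' -> 'com.example.www'."""
    return ".".join(reversed(name.lower().split(".")))


# Hypothetical sample rows (host name, rank); not taken from the real rankings.
rows = [
    ("www.example.com", 3),
    ("example.org", 1),
    ("blog.example.com", 4),
    ("example.com", 2),
    ("example.net", 5),
    ("example-shop.com", 6),
]

# Store the rows sorted by reversed host name, as proposed in the note above.
table = sorted((reverse_domain(host), rank) for host, rank in rows)
keys = [key for key, _ in table]


def lookup_domain(domain: str) -> list:
    """Return the rows for `domain` itself and for all of its subdomains."""
    prefix = reverse_domain(domain)
    hits = []
    # Exact match for the domain itself.
    i = bisect.bisect_left(keys, prefix)
    if i < len(keys) and keys[i] == prefix:
        hits.append(table[i])
    # Subdomains start with prefix + '.'; '/' is the character directly after
    # '.' in ASCII, so [prefix + '.', prefix + '/') covers exactly that range.
    # A name such as 'com.example-shop' sorts before it and is excluded.
    lo = bisect.bisect_left(keys, prefix + ".")
    hi = bisect.bisect_left(keys, prefix + "/")
    hits.extend(table[lo:hi])
    return hits


def top_n(n: int) -> list:
    """Return the n best-ranked rows; needs a scan because rows are sorted by name."""
    return heapq.nsmallest(n, table, key=lambda row: row[1])


if __name__ == "__main__":
    print(lookup_domain("example.com"))
    print(top_n(2))
```

Note the trade-off this makes visible: with name-sorted data, the top-n query is no longer a simple "read the first n lines" but a scan over the rank values. In a columnar format that scan would only need to touch the rank column, which may be cheap enough to satisfy the "still fast" requirement, but that would have to be measured.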
ColeMurray