Explore columnar storage format for webgraph rankings and node labels #7

@sebastian-nagel

We should consider using a more efficient storage format, such as Parquet, for the host and domain-level rankings. The tooling to read Parquet files has improved in recent years, and readers for this format are now available for almost all programming languages.

Requirements (at least nice to have):

  • smaller storage footprint
  • easy analysis and quick lookups by domain name using big data tools (e.g. Amazon Athena - a wish expressed on the CC group)
    • note: this will probably require sorting the data by reverse domain name
  • still fast to get the top-n ranking domains
  • well-defined table schema including column descriptions
  • example code showing how to use the new data format
  • (optionally) also store the column holding the node IDs
    • this would make the vertex file(s) obsolete
    • the textual edge files could also be dropped, because the edges (the unlabeled graph) are already stored, and more efficiently, in the webgraph (.graph) format
  • would allow adding more columns, e.g. indegrees and outdegrees, with little overhead
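
To illustrate why sorting by reverse domain name would help with lookups while top-n stays cheap, here is a minimal sketch in plain Python. It is not tied to Parquet or to the actual ranking files: the sample rows, the scores, and the helper names (`reverse_host`, `lookup_domain`, `top_n`) are all made up for illustration. The idea is that once names are reversed (`www.example.com` becomes `com.example.www`) and sorted, a domain and all of its subdomains can be located by binary search instead of a full scan. In a columnar file the same ordering should let readers skip most of the data using per-chunk min/max statistics, though that is an assumption to verify against the tools we actually use.

```python
import bisect
import heapq


def reverse_host(name: str) -> str:
    """Reverse the dot-separated labels: 'www.example.com' -> 'com.example.www'."""
    return ".".join(reversed(name.lower().split(".")))


# Hypothetical sample rows: (host name, rank score). All values are made up.
rows = [
    ("www.example.com", 0.12),
    ("blog.example.com", 0.03),
    ("example.org", 0.40),
    ("example.com", 0.25),
    ("news.example.net", 0.08),
]

# Sort by reversed name so that lookups can use binary search.
table = sorted((reverse_host(host), score) for host, score in rows)
keys = [key for key, _ in table]


def lookup_domain(domain: str) -> list:
    """Return the row for a domain followed by the rows of its subdomains."""
    prefix = reverse_host(domain)
    result = []
    # Exact match for the domain itself.
    i = bisect.bisect_left(keys, prefix)
    if i < len(keys) and keys[i] == prefix:
        result.append(table[i])
    # Subdomains share the prefix plus a dot and sit next to each other.
    sub = prefix + "."
    j = bisect.bisect_left(keys, sub)
    while j < len(keys) and keys[j].startswith(sub):
        result.append(table[j])
        j += 1
    return result


def top_n(n: int) -> list:
    """Top-n rows by score, without needing the data sorted by score."""
    return heapq.nlargest(n, rows, key=lambda row: row[1])


if __name__ == "__main__":
    print(lookup_domain("example.com"))
    print(top_n(2))
```

Note the trade-off this hints at: data sorted by name is no longer sorted by rank, so top-n needs either a scan over the score column (cheap in a columnar format, since only that column is read) or a separate rank-ordered copy.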
