Skip to content

Design: JSONB Repository

Eric Hanson edited this page Oct 28, 2024 · 4 revisions

This page discusses why it's better to represent commits as JSONB instead of normalized relational data.

Why JSONB?

The initial approach with pg_delta was to model the entire repository, track, stage and commit pipeline as relational data. This approach works functionally, but has a drawback: Performance. Each commit contains rows, which contains fields, which reference blob literal values. If the repository data structure is modeled relationally, a row has to be inserted into the database every time you track, stage and commit -- not just a row for every row in the commit, but a row for every field of every row in the commit. A table that has 10 columns and 100 rows would need to insert 100 rows for every row, plus 100 x 10 = 1000 rows for every field. Inserting this many rows is a hard bottleneck on performance and pg_delta will never be fast unless we change approaches.

So, I'm starting a new branch that handles things totally differently -- it stores commit contents in a jsonb column called manifest, so that creating a commit needs to only insert one row, with a complex jsonb data structure describing its contents.

Data Structures

1. delta.repository Table - contains all repositories.

Columns

  • name text - Repository name
  • head_commit_id uuid - The id of the commit that is currently being developed against. Next commit will use this commit as its parent, and all track, stage and diff functions compare the live db against the contents of this commit.
  • checkout_commit_id uuid - The commit that is currently checked out into the database, maintained by the delta.checkout() function.
  • tracked_rows_added jsonb - Array of row_id::text values
  • stage_rows_to_add jsonb - Array of row_id::text values? See Stage-by-Reference vs. Value
  • stage_rows_to_remove jsonb - Array of row_id::text values
  • stage_fields_to_change jsonb - Array of field_id::text values

stage_rows_added

Rows that have been staged for inclusion in the next commit, along with their value at stage time.

JSONB Object with key row_id::text and value of jsonb object with key column name and value of field value at stage time.

{
    "(shakespeare,character,{id},{Aaron})": {
        "id": "Aaron",
        "name": "Aaron",
        "abbrev": "AARON",
        "description": "a Moor, beloved by Tamora",
        "speech_count": 57
    },
    "(shakespeare,character,{id},{Bagot})": {
        "id": "Bagot",
        "name": "Bagot",
        "abbrev": "BAGOT",
        "description": "servant to King Richard II",
        "speech_count": 6
    }
}

stage_rows_deleted

JSONB Array of row_id::text values - rows that have been staged for deletion from the previous commit, to be excluded in the next commit

[
    "(shakespeare,character,{id},{Aaron})",
    "(shakespeare,character,{id},{Bagot})"
]

stage_fields_changed

JSONB Object, with key field_id::text and value of field's value at stage time.

{
    "(shakespeare,character,{id},{Aaron},name)": "Sally",
    "(shakespeare,character,{id},{Bagot},name)": "Joe"
}

2. delta.commit Table

Columns

  • commit_time text
  • message text
  • author_name text
  • author_email text
  • parent_id uuid - foreign key to the previous commit, or NULL for first commit.
  • merge_parent_id uuid - If this commit is a merge, references the second commit that is being merged with the parent_id commit.
  • manifest jsonb - commit contents

Manifest Structure

JSONB data structure that holds the following information:

  • meta.row_ids of the rows in the commit
  • the order in which rows should be inserted on checkout, honoring dependencies
  • for each row, the columns that were committed and their values / value hashes
  • hash values

Designing this data structure is tricky. Open questions are:

a) How is row order preserved?

  • In an array, which preserves order (because jsonb "does not preserve the order of object keys.")
  • As explicit integer value

b) How are column values stored?

  • As a jsonb object with keys for column names and hashed field values for the key val. Simple but inefficient, lots of redundancy.
  • As an array of value hashes, whose position corresponds to some kind of table/column manifest.

c) Where are blobs stored?

  • Manifest has a max size of 255MB (hard limit on max json size) so we want to keep this mostly metadata.
  • Storing them in the blob table means a row-insert for every field of every row in the commit. Slow!
  • JSONB object's key-val structure is a natural fit, but max size might require some chunking

Clone this wiki locally