Design: JSONB Repository

This page discusses why it's better to represent commits as JSONB instead of normalized relational data.

Why JSONB?

The initial approach with pg_delta was to model the entire repository, track, stage and commit pipeline as relational data. This approach works functionally, but has a drawback: Performance. Each commit contains rows, which contains fields, which reference blob literal values. If the repository data structure is modeled relationally, a row has to be inserted into the database every time you track, stage and commit -- not just a row for every row in the commit, but a row for every field of every row in the commit. A table that has 10 columns and 100 rows would need to insert 100 rows for every row, plus 100 x 10 = 1000 rows for every field. Inserting this many rows is a hard bottleneck on performance and pg_delta will never be fast unless we change approaches.

So, I'm starting a new branch that handles things totally differently -- it stores commit contents in a jsonb column called manifest, so that creating a commit needs to only insert one row, with a complex jsonb data structure describing its contents.

Data Structures

1. `delta.repository` Table - contains all repositories.

Columns

name text - Repository name
head_commit_id uuid - The id of the commit that is currently being developed against. Next commit will use this commit as its parent, and all track, stage and diff functions compare the live db against the contents of this commit.
checkout_commit_id uuid - The commit that is currently checked out into the database, maintained by the delta.checkout() function.
tracked_rows_added jsonb - Array of row_id::text values
stage_rows_to_add jsonb - Array of row_id::text values? See Stage-by-Reference vs. Value
stage_rows_to_remove jsonb - Array of row_id::text values
stage_fields_to_change jsonb - Array of field_id::text values

`stage_rows_added`

Rows that have been staged for inclusion in the next commit, along with their value at stage time.

JSONB Object with key row_id::text and value of jsonb object with key column name and value of field value at stage time.

{
    "(shakespeare,character,{id},{Aaron})": {
        "id": "Aaron",
        "name": "Aaron",
        "abbrev": "AARON",
        "description": "a Moor, beloved by Tamora",
        "speech_count": 57
    },
    "(shakespeare,character,{id},{Bagot})": {
        "id": "Bagot",
        "name": "Bagot",
        "abbrev": "BAGOT",
        "description": "servant to King Richard II",
        "speech_count": 6
    }
}

`stage_rows_deleted`

JSONB Array of row_id::text values - rows that have been staged for deletion from the previous commit, to be excluded in the next commit

[
    "(shakespeare,character,{id},{Aaron})",
    "(shakespeare,character,{id},{Bagot})"
]

`stage_fields_changed`

JSONB Object, with key field_id::text and value of field's value at stage time.

{
    "(shakespeare,character,{id},{Aaron},name)": "Sally",
    "(shakespeare,character,{id},{Bagot},name)": "Joe"
}

2. `delta.commit` Table

Columns

commit_time text
message text
author_name text
author_email text
parent_id uuid - foreign key to the previous commit, or NULL for first commit.
merge_parent_id uuid - If this commit is a merge, references the second commit that is being merged with the parent_id commit.
manifest jsonb - commit contents

Manifest Structure

JSONB data structure that holds the following information:

meta.row_ids of the rows in the commit
the order in which rows should be inserted on checkout, honoring dependencies
for each row, the columns that were committed and their values / value hashes
hash values

Designing this data structure is tricky. Open questions are:

a) How is row order preserved?

In an array, which preserves order (because jsonb "does not preserve the order of object keys.")
As explicit integer value

b) How are column values stored?

As a jsonb object with keys for column names and hashed field values for the key val. Simple but inefficient, lots of redundancy.
As an array of value hashes, whose position corresponds to some kind of table/column manifest.

c) Where are blobs stored?

Manifest has a max size of 255MB (hard limit on max json size) so we want to keep this mostly metadata.
Storing them in the blob table means a row-insert for every field of every row in the commit. Slow!
JSONB object's key-val structure is a natural fit, but max size might require some chunking

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Design: JSONB Repository

Why JSONB?

Data Structures

1. `delta.repository` Table - contains all repositories.

Columns

`stage_rows_added`

`stage_rows_deleted`

`stage_fields_changed`

2. `delta.commit` Table

Columns

Manifest Structure

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally

Design: JSONB Repository

Why JSONB?

Data Structures

1. delta.repository Table - contains all repositories.

Columns

stage_rows_added

stage_rows_deleted

stage_fields_changed

2. delta.commit Table

Columns

Manifest Structure

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally

1. `delta.repository` Table - contains all repositories.

`stage_rows_added`

`stage_rows_deleted`

`stage_fields_changed`

2. `delta.commit` Table