Design: Repository Data Structure

This page discusses design decisions around a repository's data structures.

I. Field Literal Values

Field values are hashed and stored in the blob table. The idea is that this saves space because the same literal value never needs to be stored twice: space is saved when a field doesn't change across many commits, and when many commits contain the same value. However, it is actually less efficient when the hash is larger than the value, and it adds a layer of indirection that comes with a (fairly minimal) performance cost.
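
As a rough sketch (table and column names here are just illustrative, and sha256() assumes PostgreSQL 11+), content-addressed storage of field values could look something like this:

-- Hypothetical blob table: each distinct literal value is stored once, keyed by its hash.
create table blob (
    hash  bytea primary key,    -- e.g. sha256 of the value
    value text
);

-- Writing a field value is idempotent: an identical value hashes to the same key,
-- so re-inserting it is a no-op and the literal is never stored twice.
insert into blob (hash, value)
select sha256(convert_to(t.v, 'UTF8')), t.v
from (values ('some field value')) as t(v)
on conflict (hash) do nothing;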

It is also sub-optimal when the database contains large values (say, a huge source-code file) that receive many small changes. There's room for improvement by looking into chunking and/or diffing. Both incur a complexity and performance cost but could potentially save on disk space.

II. Tracking

III. Stage

Do we a) stage literal values, or b) keep the literal value in the db and just stage the fact that something has changed?

The answer is b). We can't stage values because dependency sequencing needs to happen in the live database, so what is in the db needs to be exactly what goes into the commit. So the stage doesn't contain any values, just a notation that whatever value is in the database at commit-time will be flagged for inclusion in the commit.

This diverges from git behavior around staging, but I don't know a way around it.

Another important point here is that whenever a row has an external dependency, it's required that that dependency be unchanged from its commit (which will be a dependency of the repository), or this could screw up dependency ordering as well.

Conclusion: Stage by reference, not by value.
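
A minimal sketch of what stage-by-reference could look like (stage_row_to_add is a hypothetical table name): the stage records which rows will be committed, never their values, and the values are read from the live database at commit() time.

-- Hypothetical stage table: references rows by identity, not by value.
create table stage_row_to_add (
    repository_id uuid not null,
    schema_name   text not null,
    relation_name text not null,
    pk_value      text not null,
    primary key (repository_id, schema_name, relation_name, pk_value)
);

-- Staging a row is just a notation; no literal values are copied into the stage.
insert into stage_row_to_add (repository_id, schema_name, relation_name, pk_value)
values ('00000000-0000-0000-0000-000000000001', 'widget', 'widget',
        '018cfab5-70ad-7dd6-8f71-55e5a9ae8e8c');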

IV. Commit: Full-Snapshot vs Additive

When a new commit is created, what is stored in the database? The two main options are a) store a full snapshot of the commit's rows and fields, or b) store only a delta, what changed since the previous commit. Here is a description of each approach and their advantages and disadvantages.

a) Full-Snapshot

A commit is a full snapshot of every tracked row in the repository at commit-time.

commit()

  1. Create a new commit record, recording the commit message, author, etc., and pointing its parent_id to the repo's current head_commit_id.
  2. Compute the dep_sequence of the staged rows
  3. Create a commit_row record for every staged row, recording its position in the dep_sequence
  4. Insert a blob record for every field's value (or, alternately, check which changed fields' values do not have blob records and only insert those)
  5. Create a commit_field record for every field
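
Roughly, in SQL, against hypothetical tables (repository, commit, commit_row, commit_field, blob, plus assumed staged_row and staged_field views that already carry the dep_sequence computed in step 2):

-- 1. Create the commit record, pointing at the repo's current head.
insert into commit (id, repository_id, parent_id, message, author)
select gen_random_uuid(), r.id, r.head_commit_id, 'commit message', 'author name'
from repository r
where r.id = '00000000-0000-0000-0000-000000000001';

-- 3. One commit_row per staged row, recording its position in the dep_sequence.
insert into commit_row (commit_id, row_id, dep_sequence)
select s.commit_id, s.row_id, s.dep_sequence
from staged_row s;

-- 4. One blob per distinct field value; values that already have a blob are skipped.
insert into blob (hash, value)
select distinct s.value_hash, s.value
from staged_field s
on conflict (hash) do nothing;

-- 5. One commit_field per field, referencing its value by hash.
insert into commit_field (commit_id, row_id, column_name, value_hash)
select s.commit_id, s.row_id, s.column_name, s.value_hash
from staged_field s;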

checkout()

If the repository is not already checked out, then simply:

  1. Insert every row and its field values into the database, in dep_sequence order

If the repository has already been checked out, then we have two options, the naive way or the complex way:

Naive:

  1. Delete the checkout
  2. Checkout all the rows

Complex:

  1. Compute the diff between the commit that is already checked out and the commit that is about to be checked out
  2. Insert any new rows. Update only the rows that have changed in the diff. Do this while honoring dep_sequence. Phew!

This approach is architecturally simple, and nothing needs to be computed to perform the checkout(): simply insert all the rows in the correct order. However, it can be inefficient when a commit changes just a few rows in a repository with many rows, because the entire commit structure is rewritten even though most of it duplicates the previous commit.
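
For the simple "not already checked out" case, the read side is just one ordered query over the hypothetical tables above; actually inserting the rows into the live tables would be driven by code iterating over this result:

-- Every row and field of the commit, in dependency order.
select cr.row_id, cr.dep_sequence, cf.column_name, b.value
from commit_row cr
join commit_field cf on cf.commit_id = cr.commit_id and cf.row_id = cr.row_id
join blob b on b.hash = cf.value_hash
where cr.commit_id = '00000000-0000-0000-0000-000000000002'   -- commit being checked out
order by cr.dep_sequence, cf.column_name;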

b) Additive/Delta-based

Each commit records the difference between it and its parent, storing only what has changed instead of a full snapshot.

If a commit only records changes, no commit (except the first one) actually contains a fully-sequenced contents snapshot, so we have to answer two questions:

  1. What is the contents of each commit?
  2. What order should the contents be checked out in, to honor dependencies?

commit()

With additive commits, the full contents of the commit don't need to be computed. Just record any rows added or deleted, and any fields changed.

Commits also need to record the dep_sequence in which rows should be checked out, to honor dependencies. However, since additive commits only record what changed, not the entire commit, this significantly complicates things because new rows need to be interjected into the right position in the dep_sequence, and field changes can change a row's dep_sequence if they change e.g. foreign keys or meta rows.

The commit's dep_sequence is reconstructed for checkout as follows:

  1. The first commit only contains newly added rows, so its dep_sequence is recorded and complete.
  2. Subsequent commits will need to interject their changes into the previous commit's dep_sequence:
  • commit_row_added rows are inserted with a dep_sequence that fits with previous rows (yeesh)
  • commit_row_deleted rows don't affect dep_sequence (unless we use a linked list!), they just disappear
  • commit_field_changed fields can affect dep_sequence, if they change any foreign-keys or are in a cyclical meta row

dep_sequence would need to be some kind of sequence that can be added to indefinitely without having to re-number previous elements. An integer sequence wouldn't work because when you need to insert something between 2 and 3, you have to renumber everything after it. Maybe gank some ideas from geohash, or a trie, or use a linked list.
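
One possible direction, sketched below: make dep_sequence a numeric rather than an integer and insert new rows at the midpoint of their neighbors, so nothing ever needs renumbering. numeric has arbitrary precision, so repeated midpoint inserts never run out of room, at the cost of ever-longer keys.

-- Inserting between dep_sequence 2 and 3:
select (2::numeric + 3::numeric) / 2 as new_dep_sequence;   -- 2.5
-- Inserting again between 2 and 2.5:
select (2::numeric + 2.5) / 2 as new_dep_sequence;          -- 2.25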

checkout()

The commit's contents are reconstructed for checkout as follows:

  1. Build a commit ancestry chain, beginning with the commit to be checked out and traversing each commit's parent_id until reaching the trunk commit (see the sketch after this list).
  2. Aggregate all rows created at every level of the ancestry chain, except rows that were deleted by a more recent commit. These are the rows that are in the commit.
  3. Reconstruct the fields on these rows by taking the most recent commit_field_change record, per-row, per-column.
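
Step 1 is straightforward with a recursive CTE over the hypothetical commit table, walking parent_id from the commit being checked out back to the trunk:

with recursive ancestry as (
    select c.id, c.parent_id, 0 as depth
    from commit c
    where c.id = '00000000-0000-0000-0000-000000000002'   -- commit to check out
    union all
    select p.id, p.parent_id, a.depth + 1
    from commit p
    join ancestry a on p.id = a.parent_id
)
select id
from ancestry
order by depth desc;   -- trunk commit first, checked-out commit last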

How the commit's dep_sequence is reconstructed for checkout is TBD, depending on how commit() works.

Conclusion:

With this approach, commit size varies based on the size of the change rather than the size of the commit as a whole. Commits are faster to create, because they only record the change rather than the full state, which means less disk IO. They are, however, potentially slower on checkout: rather than having a full snapshot of the commit's rows, the snapshot must be constructed by aggregating the changes in its ancestry, basically replaying every change since the beginning of time.

c) Hybrid

A middle-ground between the above two approaches is as follows:

Rows are full snapshot, but field changes are additive. This makes recording a new dep_sequence for each commit easy, because at commit-time you just recompute the whole thing. Each commit contains the full manifest of row existence, but only fields that have changed are stored. This is pretty nice, because the vast majority of the manifest is made up of field records, not row records. Anecdotally, in the old bundle system's IDE bundle, rowset_row_field is 10x the size of rowset_row (and 2x the size of blob!):

dev@aquameta:~/dev/aquameta/bundles/org.aquameta.core.ide$ ls -l
total 18012
-rw-rw-rw- 1 postgres postgres  5626774 Dec  3 14:34 blob.csv
-rw-r--r-- 1 postgres postgres       99 Dec  3 18:21 bundle.csv
-rw-r--r-- 1 postgres postgres    15056 Dec  3 18:21 commit.csv
-rw-r--r-- 1 postgres postgres     2368 Dec  3 18:21 rowset.csv
-rw-r--r-- 1 postgres postgres  1188392 Dec  3 18:21 rowset_row.csv
-rw-r--r-- 1 postgres postgres 11598468 Dec  3 18:21 rowset_row_field.csv

This approach is correct, or at least headed in the right direction. Field changes need to be additive, for sure. Recording dep_sequence per-commit in an additive way is very challenging to get right, and unless it is additive, we have to record the sequence of every row_id anyway, which means we already have to save every row_id, so we might as well record the entire row_id manifest.
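
A sketch of the hybrid shape (hypothetical names): every commit gets a full row manifest with dep_sequence, but only changed fields are recorded.

-- Full row manifest per commit.
create table commit_row (
    commit_id    uuid    not null,
    row_id       text    not null,       -- reference to schema/relation/pk
    dep_sequence integer not null,
    primary key (commit_id, row_id)
);

-- Only fields that changed in this commit.
create table commit_field_changed (
    commit_id   uuid  not null,
    row_id      text  not null,
    column_name text  not null,
    value_hash  bytea not null,          -- points into the blob table
    primary key (commit_id, row_id, column_name)
);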

There are still some potential refinements of this idea. It just seems so wasteful to record the entire row_id manifest for every commit when it only changed slightly. If it weren't for dep_sequence, it would be fairly easy to do rows in an additive way as well, but the combination of the two makes things pretty rough.

Another thing that could be looked at in this space is not using a single row per row_id in the commit. It wouldn't be so horrible to record the entire row manifest if it were just a huge array or data structure that could be inserted as a single row. I think this is the next place to investigate.

d) Abandon Rows, Embrace JSONB

The root of the performance issues here is that we're writing a commit_row for every row in the repository. What if we didn't?

A commit becomes a single row in the commit table that contains a jsonb object with the full manifest of the repository, including field value hashes that reference blob. This would be so much faster to create, and would probably have minimal impact on status.
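
The commit table could then look something like this (a sketch; gen_random_uuid() assumes PostgreSQL 13+ or pgcrypto):

create table commit (
    id        uuid primary key default gen_random_uuid(),
    parent_id uuid references commit(id),
    message   text,
    author    text,
    manifest  jsonb not null    -- full row manifest + field value hashes, structured as below
);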

What is the structure of the JSON manifest?

Nested JSONB

schema -> relation -> rows -> fields

{
    "shakespeare": {
        "chapter": {
            "pk_column_names": ["a","b"],
            "rows": {
            }
        },
        "character": {
        },
        "character_work": {
        }
    },

    "widget": {
        "widget": {
            "pk_column_names": ["id"],
            "rows": [
                {
                    "pk_values": [ "018cfab5-70ad-7dd6-8f71-55e5a9ae8e8c" ],
                    "dep_sequence": 1,
                    "field_hashes": {
                        "id": "\\xf0e4c2f76c58916ec258f246851bea091d14d4247a2fc3e18694461b1816e13b",
                        "name": "\\xf0e4c2f76c58916ec258f246851bea091d14d4247a2fc3e18694461b1816e13b",
                        "css": "\\xb82eef4c7445be869ea581a6bcf9da58524f2c17784c1f63f0577bc4331b5f50",
                        "post_js": "\\xb82eef4c7445be869ea581a6bcf9da58524f2c17784c1f63f0577bc4331b5f50",
                        "help": "\\xb82eef4c7445be869ea581a6bcf9da58524f2c17784c1f63f0577bc4331b5f50"
                    }
                },
                {
                    "pk_values": [ "018cfab5-70ad-7dd6-8f71-55e5a9ae8e8c" ],
                    "dep_sequence": 27,
                    "field_hashes": {
                        "id": "\\xf0e4c2f76c58916ec258f246851bea091d14d4247a2fc3e18694461b1816e13b",
                        "name": "\\xf0e4c2f76c58916ec258f246851bea091d14d4247a2fc3e18694461b1816e13b",
                        "css": "\\xb82eef4c7445be869ea581a6bcf9da58524f2c17784c1f63f0577bc4331b5f50",
                        "post_js": "\\xb82eef4c7445be869ea581a6bcf9da58524f2c17784c1f63f0577bc4331b5f50",
                        "help": "\\xb82eef4c7445be869ea581a6bcf9da58524f2c17784c1f63f0577bc4331b5f50"
                    }
                }
            ]
        }
    }
}

Flat Per-Row JSONB

Record commits as an array of rows. Record dep_sequence as a literal integer, rather than by array order. This seems easier to work with.

[
    {
        "schema_name": "widget",
        "relation_name": "widget",
        "pk_column_names": ["id"],
        "pk_values": ["018cfab5-70ad-7dd6-8f71-55e5a9ae8e8c"],
        "dep_sequence": 19,
        "columns": {
            "id":      "\\xf0e4c2f76c58916ec258f246851bea091d14d4247a2fc3e18694461b1816e13b",
            "name":    "\\xf0e4c2f76c58916ec258f246851bea091d14d4247a2fc3e18694461b1816e13b",
            "css":     "\\xb82eef4c7445be869ea581a6bcf9da58524f2c17784c1f63f0577bc4331b5f50",
            "post_js": "\\xb82eef4c7445be869ea581a6bcf9da58524f2c17784c1f63f0577bc4331b5f50",
            "help":    "\\xb82eef4c7445be869ea581a6bcf9da58524f2c17784c1f63f0577bc4331b5f50"
        }
    },
    {
        "schema_name": "widget",
        "relation_name": "widget",
        "pk_column_names": ["id"],
        "pk_values": ["018cfab5-70ad-7dd6-8f71-55e5a9ae8e8c"],
        "dep_sequence": 2,
        "columns": {
            "id":      "\\xf0e4c2f76c58916ec258f246851bea091d14d4247a2fc3e18694461b1816e13b",
            "name":    "\\xf0e4c2f76c58916ec258f246851bea091d14d4247a2fc3e18694461b1816e13b",
            "css":     "\\xb82eef4c7445be869ea581a6bcf9da58524f2c17784c1f63f0577bc4331b5f50",
            "post_js": "\\xb82eef4c7445be869ea581a6bcf9da58524f2c17784c1f63f0577bc4331b5f50",
            "help":    "\\xb82eef4c7445be869ea581a6bcf9da58524f2c17784c1f63f0577bc4331b5f50"
        }
    },
    ...
]

One of these json objects is written for every commit, which is then just a single row insert (albeit a big one) instead of one row per commit_row and one per commit_field, which in retrospect is just so inefficient I can't believe it.

Thoughts:

  • Can this be queried quickly? Jsonb is awesome, pretty sure the answer will be a heck yes (see the query sketch after this list).
  • We could do additive at the json level. This is neat: jsonb_diff. Probably not that necessary though; it's just one field, and writing a few kB per commit is not a big deal.
  • How is this exported to disk? The .csv approach is, I guess, fine if we must, but reading about git internals, there's a very tempting avenue: just export to a git repository and hijack its blob store.
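
A sketch of the query side, assuming the flat per-row format and the commit table sketched above: a GIN index supports containment queries on the manifest, and checkout can expand the manifest back into one record per tracked row, in dependency order.

create index commit_manifest_idx on commit using gin (manifest jsonb_path_ops);

-- Which commits touch widget.widget?
select c.id
from commit c
where c.manifest @> '[{"schema_name": "widget", "relation_name": "widget"}]';

-- Expand a commit's manifest for checkout, in dep_sequence order.
select r.value
from commit c,
     jsonb_array_elements(c.manifest) as r(value)
where c.id = '00000000-0000-0000-0000-000000000002'
order by (r.value ->> 'dep_sequence')::int;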

I'm settled on jsonb, which reimagines pretty much everything in this document. Moving this train of thought over to JSONB Repository.

e) Other Ideas

  1. Compute dep_sequence at checkout-time instead of at commit-time, so that once we have all the rows in the commit (which is plenty complex just on its own), we recompute dep_sequence. This doesn't work for meta rows because they aren't yet in pg_depend. For table data, dep_sequence could be computed by inspecting the rows, but it would be crazy complex.

  2. Each commit can either be full-snapshot or additive. Let the user decide, potentially? Periodic full-snapshots are nice because every time one is created, it can essentially act as the root commit for checkout: it's a milestone, and there's no need to look further back in history.
