No compression of timestamps (t.tdb) after consolidation #2231

@peterculviner

Description

I noticed there's no (default) compression on timestamps after consolidation, at least via the Python API for sparse arrays. This wouldn't be an issue on its own, but I also can't find a way to set a compression filter on these values, unlike user-defined attributes, dimensions, coordinates, and offsets. Am I missing a global default compression argument? I can't find any reference to one.
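
For comparison, this is roughly how filters can be attached everywhere else in the Python API (a minimal sketch; the Zstd level here is arbitrary):

import tiledb
import numpy as np

zstd = [tiledb.ZstdFilter(level=9)]

# filters can be set per-dimension and per-attribute...
dim = tiledb.Dim(name="d", domain=(0, 100), dtype=np.uint64, filters=zstd)
attr = tiledb.Attr(name="a", dtype=np.uint64, filters=zstd)

# ...and on offsets and coordinates at the schema level, but there is
# no analogous argument covering the timestamp values
schema = tiledb.ArraySchema(
    domain=tiledb.Domain(dim), attrs=[attr], sparse=True,
    offsets_filters=zstd, coords_filters=zstd)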

This becomes a significant contributor to on-disk size with large sparse arrays.

For example:

import tiledb
import numpy as np
from itertools import product

array_path = 'test_array'

# two uint64 dimensions spanning [0, 100]
dim1 = tiledb.Dim(name="d1", domain=(0, 100), dtype=np.uint64)
dim2 = tiledb.Dim(name="d2", domain=(0, 100), dtype=np.uint64)
domain = tiledb.Domain(dim1, dim2)

# a single uint64 attribute
attributes = [tiledb.Attr(name='attr1', dtype=np.dtype('uint64'), fill=0)]

# sparse schema with Zstd-compressed coordinates; there is no
# analogous argument for timestamps
schema = tiledb.ArraySchema(
    domain=domain, attrs=attributes, sparse=True, allows_duplicates=True,
    coords_filters=[tiledb.filter.ZstdFilter(9)])
tiledb.Array.create(array_path, schema)

# all 10,000 coordinate pairs on the 100 x 100 grid
d1, d2 = np.asarray(list(product(range(0, 100), range(0, 100)))).T

array = tiledb.open(array_path, 'w')
# write 1
array[d1, d2] = {'attr1': np.full(10000, 1, dtype=np.uint64)}
# write 2
array[d1, d2] = {'attr1': np.full(10000, 2, dtype=np.uint64)}
array.close()

# merge the two fragments; the consolidated fragment stores a
# timestamp per cell in t.tdb
tiledb.consolidate(array_path)

Compare the file sizes of a0.tdb and t.tdb in the consolidated fragment: they match, which suggests both are stored as uncompressed uint64 values.
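
One way to check (a sketch using only the standard library; the exact fragment directory layout varies by TileDB version):

import os

# walk the array directory and print per-file sizes; in the consolidated
# fragment, a0.tdb (attribute data) and t.tdb (timestamps) come out equal
for dirpath, _, filenames in os.walk('test_array'):
    for name in filenames:
        path = os.path.join(dirpath, name)
        print(path, os.path.getsize(path))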
