bplatz commented Aug 26, 2025

This PR sits on top of #1095, which in turn sits on top of #1096.

Treat all three PRs as one package: without this PR, garbage collection could delete index segments that other branches still rely on.

Summary

Introduces per-branch cuckoo filters to prevent garbage collection from deleting index nodes that other branches still reference. Uses 16-bit fingerprints taken from SHA-256 hashes, with a chain of filters for dynamic growth. Filters are updated during indexing and consulted during GC.

Problem

  • Index segments are shared across branches via content-addressing
  • GC could delete segments still needed by other branches
  • Need a fast, memory-efficient way to check cross-branch usage

Solution

Cuckoo filter implementation (fluree.db.indexer.cuckoo):

  • 16-bit fingerprints taken from the first 2 bytes of the base32-decoded SHA-256 hash
  • Chain growth: 100K segment capacity per filter; a new filter is added once the current one is 90% full
  • FNV-1a 32-bit hash gives deterministic results across CLJ and CLJS
  • CBOR binary storage at ledger/index/cuckoo/<branch>.cbor
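
As a rough illustration of the fingerprinting and hashing bullets above, here is a minimal Python sketch (the PR itself is Clojure). The function names, the RFC 4648 base32 alphabet, the big-endian byte order, and the way FNV-1a is used to derive two candidate buckets are all assumptions for illustration, not the code in fluree.db.indexer.cuckoo.

```python
import base64
import hashlib

FNV_OFFSET_BASIS = 0x811C9DC5  # standard FNV-1a 32-bit offset basis
FNV_PRIME = 0x01000193         # standard FNV 32-bit prime


def fnv1a_32(data: bytes) -> int:
    """FNV-1a over bytes, masked to 32 bits so every platform agrees."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h


def fingerprint16(b32_sha256: str) -> int:
    """16-bit fingerprint: first 2 bytes of the base32-decoded hash.

    Assumes an unpadded RFC 4648 base32 string (hypothetical input format).
    """
    padded = b32_sha256.upper() + "=" * (-len(b32_sha256) % 8)
    raw = base64.b32decode(padded)
    return int.from_bytes(raw[:2], "big")


def candidate_buckets(b32_sha256: str, num_buckets: int):
    """Partial-key cuckoo hashing; num_buckets is assumed a power of two."""
    fp = fingerprint16(b32_sha256)
    i1 = fnv1a_32(b32_sha256.encode("utf-8")) % num_buckets
    i2 = (i1 ^ fnv1a_32(fp.to_bytes(2, "big"))) % num_buckets
    return fp, i1, i2


# Usage: build a sample hash string, then locate its two candidate buckets.
digest = hashlib.sha256(b"example index segment").digest()
address_hash = base64.b32encode(digest).decode("ascii").rstrip("=").lower()
fp, i1, i2 = candidate_buckets(address_hash, num_buckets=1 << 15)
```

Because the alternate bucket is derived by XOR with a hash of the fingerprint, either bucket can be recomputed from the other without the original hash, which is what lets a cuckoo filter relocate and delete entries.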

Integration:

  • Indexing: Updates branch filter with new segments after index refresh
  • GC: Removes garbage from current branch filter, checks other branches before deletion
  • Caching: Loads other-branch filters once per GC run to minimize I/O
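
The GC flow described above could be sketched roughly as follows. This is a hypothetical Python outline, not the PR's code: `run_gc`, `load_filter`, and `delete_segment` are made-up names, and plain Python sets stand in for the cuckoo filters (a real filter would answer "possibly present" rather than giving exact membership).

```python
from typing import Callable, Iterable, List, Set, Tuple


def run_gc(
    current_branch: str,
    garbage: Iterable[str],
    branches: Iterable[str],
    load_filter: Callable[[str], Set[str]],
    delete_segment: Callable[[str], None],
) -> Tuple[List[str], List[str]]:
    """Delete garbage segments unless another branch's filter still holds them."""
    current = load_filter(current_branch)
    # Load every other branch's filter once, up front, instead of per segment.
    others = [load_filter(b) for b in branches if b != current_branch]

    deleted, retained = [], []
    for segment in garbage:
        # The current branch no longer references this segment.
        current.discard(segment)
        if any(segment in other for other in others):
            retained.append(segment)  # possibly still needed elsewhere
        else:
            delete_segment(segment)
            deleted.append(segment)
    return deleted, retained


# Usage with in-memory stand-ins for storage and filters.
store = {"main": {"a", "b", "c"}, "dev": {"b"}}
loads, removed = [], []


def load(branch: str) -> Set[str]:
    loads.append(branch)
    return store[branch]


deleted, retained = run_gc("main", ["a", "b"], ["main", "dev"], load, removed.append)
```

With a real cuckoo filter, a false positive in the cross-branch check would only keep a segment around longer than necessary; it would not cause a referenced segment to be deleted.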

Performance

  • False positive rate: ~0.012% (1 in ~8,200) with 16-bit fingerprints
  • Memory efficiency: ~4.6 bytes per segment with realistic hash distribution
  • Scalability: Chain design avoids expensive rebuilds; batch operations minimize overhead
  • Filter size examples:
    • 10GB database (~50K segments): ~224KB
    • 1TB database (~5M segments): ~22MB
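
The figures above can be sanity-checked with a few lines of arithmetic. The sketch below uses the standard cuckoo filter approximation, false positive rate ≈ 2b / 2^f for bucket size b and fingerprint width f. A bucket size of 4 is an assumption (the description does not state it), chosen because it reproduces the quoted 1-in-~8,200 rate; the helper names are hypothetical.

```python
def false_positive_rate(fingerprint_bits: int, bucket_size: int) -> float:
    """Approximate cuckoo filter false positive rate: 2b / 2^f."""
    return 2 * bucket_size / 2 ** fingerprint_bits


def estimated_filter_bytes(segments: int, bytes_per_segment: float = 4.6) -> float:
    """Filter size from the ~4.6 bytes per segment quoted above."""
    return segments * bytes_per_segment


# Assumed bucket size of 4 with the 16-bit fingerprints from the description.
fpr = false_positive_rate(fingerprint_bits=16, bucket_size=4)
one_in = 1 / fpr

small_kib = estimated_filter_bytes(50_000) / 1024          # 10GB database case
large_mib = estimated_filter_bytes(5_000_000) / 1024 ** 2  # 1TB database case
```

Under these assumptions the rate works out to 8/65,536, or 1 in 8,192 (about 0.012%), and the size estimates land at roughly 224 KiB and 22 MiB, in line with the examples listed.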

Documentation

See docs/cuckoo-filter-gc-strategy.md for detailed implementation notes.

bplatz changed the base branch from main to feature/branching August 26, 2025 21:23
bplatz changed the base branch from feature/branching to feature/rebase August 26, 2025 21:24
bplatz force-pushed the feature/rebase branch 2 times, most recently from 495b4b5 to e4d3906 September 2, 2025 15:57
- add cuckoo filter implementation with chain support for 100K+ segments
- integrate filters into index refresh and garbage collection processes
- use FNV-1a 32-bit hash for cross-platform determinism (CLJ/CLJS)
- implement proactive filter growth at 90% capacity threshold
- cache other-branch filters during GC to reduce I/O operations
- exclude garbage files from filter, only track actual index segments
- add comprehensive test suite for filter and chain operations
bplatz force-pushed the feature/cuckoo-index-check branch from 7b8734d to f8b0d71 September 2, 2025 16:55
- Integrated cuckoo filter operations with updated branch metadata structure
- Kept cuckoo filter copying on branch creation and deletion on branch delete
- Fixed typo: :db/unkown-ledger -> :db/unknown-ledger
- Added cuckoo and psot to index file lists in tests
- Preserved branch metadata flattening for index optimization
- Removed outdated cuckoo chain test suite and replaced it with integration tests for garbage collection and round-trip serialization.
- Added new tests for CBOR encoding/decoding to ensure data integrity during storage operations.
- Updated existing tests to use the new filter chain structure, keeping them compatible with recent changes in the cuckoo filter implementation.
- Enhanced edge case handling and collision detection tests to improve robustness.
- Adjusted assertions in the main test suite to reflect changes in filter structure and statistics.
bplatz requested a review from a team October 2, 2025 23:19
bplatz marked this pull request as ready for review October 2, 2025 23:19

zonotope commented Oct 7, 2025

> See docs/cuckoo-filter-gc-strategy.md for detailed implementation notes.

This link is broken.


bplatz commented Oct 7, 2025

> > See docs/cuckoo-filter-gc-strategy.md for detailed implementation notes.
>
> This link is broken.

Not really, it will work once the PR is merged and it is on the main branch... but the doc is part of the PR.

bplatz marked this pull request as draft January 9, 2026 03:03