@Fengzdadi
Contributor

Related Issue

Follow-up to #90

Implementation

Files Added/Modified

| File | Description |
| --- | --- |
| sampling/serde.go | SerDe interface and built-in implementations |
| sampling/serde_test.go | Serialization tests |
| sampling/reservoir_items_sketch.go | Added ToByteArray() and FromSlice() |

SerDe Interface

type ItemsSerDe[T any] interface {
    SerializeToBytes(items []T) []byte
    DeserializeFromBytes(data []byte, numItems int) ([]T, error)
    SizeOfItem() int
}

Built-in SerDe

| Type | SerDe | Size |
| --- | --- | --- |
| int64 | Int64SerDe | 8 bytes |
| int32 | Int32SerDe | 4 bytes |
| float64 | Float64SerDe | 8 bytes |
| string | StringSerDe | length prefix + content |

Usage

// Serialize
bytes, _ := sketch.ToByteArray(sampling.Int64SerDe{})

// Deserialize
restored, _ := sampling.NewReservoirItemsSketchFromSlice[int64](bytes, sampling.Int64SerDe{})

Notes

  • Uses little-endian byte order for cross-language compatibility
  • Custom types: users implement ItemsSerDe[T] interface
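
As an illustration of the custom-type note above, here is a minimal sketch of a user-defined SerDe. It assumes the interface exactly as quoted in this description; `Point` and `PointSerDe` are hypothetical names that do not exist in the library, and the fixed 8-byte little-endian layout is only one possible choice.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// ItemsSerDe mirrors the interface quoted in the PR description.
type ItemsSerDe[T any] interface {
	SerializeToBytes(items []T) []byte
	DeserializeFromBytes(data []byte, numItems int) ([]T, error)
	SizeOfItem() int
}

// Point is a hypothetical user-defined item type (not part of the library).
type Point struct {
	X, Y int32
}

// PointSerDe is a hypothetical fixed-size SerDe: two little-endian int32 values per item.
type PointSerDe struct{}

// SizeOfItem reports the encoded size of one Point in bytes.
func (PointSerDe) SizeOfItem() int { return 8 }

// SerializeToBytes writes X then Y for each item, little-endian.
func (s PointSerDe) SerializeToBytes(items []Point) []byte {
	out := make([]byte, 0, len(items)*s.SizeOfItem())
	for _, p := range items {
		out = binary.LittleEndian.AppendUint32(out, uint32(p.X))
		out = binary.LittleEndian.AppendUint32(out, uint32(p.Y))
	}
	return out
}

// DeserializeFromBytes reads numItems points and rejects inputs that are too short.
func (s PointSerDe) DeserializeFromBytes(data []byte, numItems int) ([]Point, error) {
	if numItems < 0 || len(data) < numItems*s.SizeOfItem() {
		return nil, fmt.Errorf("cannot read %d items from %d bytes", numItems, len(data))
	}
	items := make([]Point, numItems)
	for i := range items {
		off := i * s.SizeOfItem()
		items[i].X = int32(binary.LittleEndian.Uint32(data[off:]))
		items[i].Y = int32(binary.LittleEndian.Uint32(data[off+4:]))
	}
	return items, nil
}

// Compile-time check that PointSerDe satisfies the interface.
var _ ItemsSerDe[Point] = PointSerDe{}

func main() {
	var serde ItemsSerDe[Point] = PointSerDe{}
	raw := serde.SerializeToBytes([]Point{{X: 1, Y: -2}})
	back, err := serde.DeserializeFromBytes(raw, 1)
	fmt.Println(len(raw), back, err)
}
```

A value such as `PointSerDe{}` would then presumably be passed wherever the built-in `sampling.Int64SerDe{}` is passed in the usage example.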

Testing

All tests pass:

go test ./sampling/
ok      github.com/apache/datasketches-go/sampling

Member

@proost proost left a comment

Can you add compatibility test cases? C++ and Java test cases should be included. You can find the code in the datasketches-cpp and datasketches-java repos.

// Built-in implementations are provided for common types (int64, int32, string, float64).
type ItemsSerDe[T any] interface {
// SerializeToBytes converts items to a byte slice.
SerializeToBytes(items []T) []byte
Member

@proost proost Jan 6, 2026

If "T" can be any data type, I would prefer returning ([]byte, error), so that users can handle invalid input when a custom data type is used.

Contributor Author

Agreed, returning an error is better for custom types. I'll update the interface.
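
For reference, a rough sketch of what the error-returning variant discussed here could look like. The final interface in the merged code may differ; `ItemsSerDeWithError` and `SmallIntSerDe` are hypothetical names used only for illustration, and the one-byte encoding is chosen simply because it gives serialization a natural way to fail.

```go
package main

import "fmt"

// ItemsSerDeWithError is a hypothetical variant of the interface in which
// serialization can report invalid input instead of silently encoding it.
type ItemsSerDeWithError[T any] interface {
	SerializeToBytes(items []T) ([]byte, error)
	DeserializeFromBytes(data []byte, numItems int) ([]T, error)
	SizeOfItem() int
}

// SmallIntSerDe is a hypothetical custom SerDe that stores each int in one byte.
type SmallIntSerDe struct{}

// SizeOfItem reports the encoded size of one item in bytes.
func (SmallIntSerDe) SizeOfItem() int { return 1 }

// SerializeToBytes fails if any value does not fit into a single byte.
func (SmallIntSerDe) SerializeToBytes(items []int) ([]byte, error) {
	out := make([]byte, len(items))
	for i, v := range items {
		if v < 0 || v > 255 {
			return nil, fmt.Errorf("item %d: value %d does not fit in one byte", i, v)
		}
		out[i] = byte(v)
	}
	return out, nil
}

// DeserializeFromBytes reads numItems single-byte values.
func (SmallIntSerDe) DeserializeFromBytes(data []byte, numItems int) ([]int, error) {
	if numItems < 0 || len(data) < numItems {
		return nil, fmt.Errorf("cannot read %d items from %d bytes", numItems, len(data))
	}
	items := make([]int, numItems)
	for i := range items {
		items[i] = int(data[i])
	}
	return items, nil
}

// Compile-time check that SmallIntSerDe satisfies the variant interface.
var _ ItemsSerDeWithError[int] = SmallIntSerDe{}

func main() {
	if _, err := (SmallIntSerDe{}).SerializeToBytes([]int{1, 300}); err != nil {
		fmt.Println("serialization rejected:", err)
	}
}
```

With this shape, a caller can surface bad input from a custom type as an ordinary error rather than discovering it only when deserialization fails.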

@Fengzdadi
Contributor Author

Thanks for the review!

Great suggestion! I'll add test cases using binary data generated from Java/C++ implementations to verify cross-language compatibility. Will look into the test data from datasketches-cpp and datasketches-java repos.

@Fengzdadi
Contributor Author

Hi @proost,

I've added compatibility tests, but have a few questions:

  1. Test data approach
    I created test data based on Java's PreambleUtil.java format specification (manually constructed hex strings rather than actual Java-generated binary files). For example:
// Empty sketch: k=10
hexData := "01020d000a000000"  // preamble_longs=1, serVer=2, familyID=13, k=10

Is this approach acceptable, or would you prefer tests using actual binary files generated by Java?


  2. C++ compatibility
    I noticed that C++ does not have ReservoirItemsSketch implemented yet. Found this TODO in var_opt_union_impl.hpp:
// TODO: extend to handle reservoir sampling

Should I proceed with Java-only compatibility tests for now?
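
To make the hand-built hex string above easier to check, here is a small sketch that decodes the first eight bytes into the fields named in its comment. The field offsets are read off that comment (k as a little-endian 32-bit value in bytes 4 to 7); masking the low six bits of byte 0 is an assumption based on the resize-factor note in a later commit message in this thread. `preamble` and `parsePreamble` are hypothetical helpers, not library functions.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// preamble holds the fields of the first preamble long (hypothetical helper type).
type preamble struct {
	PreambleLongs int
	SerVer        int
	FamilyID      int
	Flags         byte
	K             uint32
}

// parsePreamble decodes a hex string and extracts the first 8 bytes.
func parsePreamble(hexData string) (preamble, error) {
	raw, err := hex.DecodeString(hexData)
	if err != nil {
		return preamble{}, err
	}
	if len(raw) < 8 {
		return preamble{}, fmt.Errorf("need at least 8 bytes, got %d", len(raw))
	}
	return preamble{
		PreambleLongs: int(raw[0] & 0x3F), // assumption: high 2 bits carry the resize factor
		SerVer:        int(raw[1]),
		FamilyID:      int(raw[2]),
		Flags:         raw[3],
		K:             binary.LittleEndian.Uint32(raw[4:8]),
	}, nil
}

func main() {
	p, err := parsePreamble("01020d000a000000")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", p)
}
```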

@proost
Member

proost commented Jan 7, 2026

@Fengzdadi

  1. Can you generate sketch files, including ones from Go? You can put the sketches under the https://github.com/apache/datasketches-go/tree/main/serialization_test_data/ directory.

  2. OK, then let's skip the C++ compatibility test!

@Fengzdadi
Contributor Author

Hi @proost,

I've added Go-generated reservoir test data files and updated the tests.

Current Status

Go side (this PR)

  • ✅ Generated reservoir_long_*.sk files in serialization_test_data/go_generated_files/
  • ✅ Added serialization tests that read these files

Java side

I noticed that Java doesn't have ReservoirCrossLanguageTest.java yet.

Looking at the Java repository, cross-language tests exist for:

  • VarOpt ✅
  • KLL ✅
  • HLL ✅
  • CPC ✅
  • Reservoir ❌ (missing)

Proposal

  1. Merge this PR with Go-generated test data
  2. I can submit a PR to datasketches-java to add ReservoirCrossLanguageTest.java
  3. After Java generates reservoir files, I'll update Go to add Java→Go compatibility tests

Does this approach work for you?

Member

@proost proost left a comment

@freakyzoidberg When the Java and C++ cross-language test cases are missing, how can we guarantee compatibility? Can you give us some advice?

I think adding test cases in C++ or Java first and then finishing the Go version is reasonable, because it avoids additional work if compatibility turns out to be broken.

@proost
Member

proost commented Jan 9, 2026

@Fengzdadi

I think adding compatibility test cases in this PR is the better direction, so I added cases in apache/datasketches-java#714.

Thank you for waiting!

cc. @freakyzoidberg

@Fengzdadi
Contributor Author

Thanks @proost! Really appreciate you adding the Java compatibility tests in #714!

I'll wait for that PR to be merged and the Java-generated .sk files to be available. Then I'll update this Go PR to add tests that read the Java-generated files for full cross-language compatibility.

Let me know if there's anything you'd like me to adjust in the meantime!

Member

@proost proost left a comment

Can you add test cases for Java compatibility?

@Fengzdadi Fengzdadi force-pushed the feat-reservoir-items-sketch branch from 9d3b491 to f9f53d3 Compare January 12, 2026 21:40
@Fengzdadi
Contributor Author

@proost Thanks for the review feedback! I've made the following updates:

  1. ✅ Renamed ToByteArray → ToSlice for naming consistency with NewReservoirItemsSketchFromSlice
  2. ✅ Added Java compatibility tests (TestReservoirItemsSketch_JavaCompat) with 36 test cases matching the files from your Java PR, apache/datasketches-java#714 ("test: cross language test cases for reservoir sampling sketch")

The tests currently skip (file not found) because the Java-generated .sk files haven't been synced to the Go repo yet. Once they're available, the tests will automatically run and validate cross-language compatibility.
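
The skip-on-missing-file behaviour described above could be structured roughly as follows. This is only a sketch: `readSketchFile` is a hypothetical helper, and the directory and file name in `main` are placeholders rather than the actual test data names.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// readSketchFile loads a serialized sketch from dir. A missing file is not
// treated as an error: it returns found=false so the caller can skip the case.
func readSketchFile(dir, name string) (data []byte, found bool, err error) {
	data, err = os.ReadFile(filepath.Join(dir, name))
	if errors.Is(err, fs.ErrNotExist) {
		return nil, false, nil
	}
	if err != nil {
		return nil, false, err
	}
	return data, true, nil
}

func main() {
	// Placeholder directory and file name, used only to show the missing-file path.
	_, found, err := readSketchFile("no_such_test_data_dir", "example.sk")
	fmt.Println(found, err)
}
```

Inside a test, a `found == false` result would map to `t.Skipf(...)`, while any other read error would still fail the test, so a genuinely unreadable file is not hidden behind a skip.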

- Generate 27 cross-language compatibility test files:
  - reservoir_items_long_*_go.sk (9 files)
  - reservoir_items_double_*_go.sk (9 files)
  - reservoir_items_string_*_go.sk (9 files)
- Use k=128 for empty/exact, k=32/64/128 for sampling (n=1000)
- Skip reservoir_longs_* (per issue apache#90: longs is Java legacy)
- Update compatibility tests to match new file naming

// TestReservoirItemsSketch_JavaCompat tests deserialization of Java-generated reservoir sketch files.
// These tests verify cross-language compatibility with files generated by datasketches-java.
func TestReservoirItemsSketch_JavaCompat(t *testing.T) {
Member

Almost done. Can you include the Java sketch files in this PR?

…e compatibility

- Update ToSlice to produce Java-compatible binary format:
  - Use 2-long (16 bytes) preamble instead of 3-long (24 bytes)
  - Encode ResizeFactor X8 in high 2 bits of byte 0
  - Set EMPTY flag (0x04) in byte 3 for empty sketches
  - Remove explicit numSamples storage (implicit as min(n, k))

- Simplify NewReservoirItemsSketchFromSlice to only support Java format
- Add 27 Java-generated .sk files for cross-language compatibility tests
- Update Go-generated .sk files to match Java format exactly
- Update test generator to use 0-based indexing like Java

Verified: Go and Java produce identical bytes for empty and exact mode sketches
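
The preamble layout described in this commit message can be sketched as below. Several details are assumptions and are not confirmed by the thread: that an empty sketch uses a single preamble long (suggested by the hex example earlier in the conversation), that the item count n occupies the second long, and that ResizeFactor X8 is encoded as the value 3 in the two high bits. The serVer and familyID values are taken from that same hex example. `encodePreamble` and `numSamples` are hypothetical helpers.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	serVer         = 2         // from the hex example earlier in the thread
	familyID       = 13        // from the hex example earlier in the thread
	emptyFlag      = 0x04      // EMPTY flag stored in byte 3
	resizeX8InByte = 0x03 << 6 // assumption: X8 is encoded as 3 in the high 2 bits
)

// encodePreamble builds the preamble for a sketch of capacity k that has seen n items.
func encodePreamble(k uint32, n uint64) []byte {
	preLongs := 2
	var flags byte
	if n == 0 {
		preLongs = 1 // assumption: empty sketches carry only the first long
		flags = emptyFlag
	}
	buf := make([]byte, preLongs*8)
	buf[0] = byte(preLongs) | resizeX8InByte
	buf[1] = serVer
	buf[2] = familyID
	buf[3] = flags
	binary.LittleEndian.PutUint32(buf[4:8], k)
	if n > 0 {
		binary.LittleEndian.PutUint64(buf[8:16], n) // assumption: n lives in the second long
	}
	return buf
}

// numSamples is not stored explicitly; it is implied as min(n, k).
func numSamples(k uint32, n uint64) uint64 {
	if n < uint64(k) {
		return n
	}
	return uint64(k)
}

func main() {
	fmt.Printf("% x\n", encodePreamble(128, 0))
	fmt.Printf("% x\n", encodePreamble(128, 1000))
	fmt.Println(numSamples(128, 1000))
}
```

Dropping the explicit sample count keeps the header at 16 bytes, since a reader can recompute it from n and k before decoding the items that follow.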
Member

@proost proost left a comment

@freakyzoidberg I think it is ready. Can you check?

Member

@proost proost left a comment

Thank you!

@proost proost merged commit 0dd2e23 into apache:main Jan 14, 2026
1 check passed
@Fengzdadi
Contributor Author

Thanks @proost for the review and guidance on the cross-language test alignment! :)

@Fengzdadi Fengzdadi deleted the feat-reservoir-items-sketch branch January 15, 2026 04:34