-
Notifications
You must be signed in to change notification settings - Fork 11
feat: add serialization support for ReservoirItemsSketch #94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
proost
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add compatability test cases? C++ and Java test cases should be included. you can find code in datasketches-cpp repo and datasket-java repo.
sampling/serde.go
Outdated
| // Built-in implementations are provided for common types (int64, int32, string, float64). | ||
| type ItemsSerDe[T any] interface { | ||
| // SerializeToBytes converts items to a byte slice. | ||
| SerializeToBytes(items []T) []byte |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If "T" can be any data type, returning ([]byte, error) is more good to me. So some users can handle if wrong input given when custom data type is used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, returning an error is better for custom types. I'll update the interface.
Thanks for the review! Great suggestion! I'll add test cases using binary data generated from Java/C++ implementations to verify cross-language compatibility. Will look into the test data from datasketches-cpp and datasketches-java repos. |
|
Hi @proost, I've added compatibility tests, but have a few questions:
// Empty sketch: k=10
hexData := "01020d000a000000" // preamble_longs=1, serVer=2, familyID=13, k=10Is this approach acceptable, or would you prefer tests using actual binary files generated by Java?
// TODO: extend to handle reservoir samplingShould I proceed with Java-only compatibility tests for now? |
|
|
Hi @proost, I've added Go-generated reservoir test data files and updated the tests. Current StatusGo side (this PR)
Java sideI noticed that Java doesn't have Looking at the Java repository, cross-language tests exist for:
Proposal
Does this approach work for you? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@freakyzoidberg When Java, C++ cross language test cases are missing, how guarantee compatibility? can you give advice to us?
I think adding test cases in the C++ or Java and then finishing go version is reasonable to me. because we can avoid additional work if compatibility is broken.
8aa0875 to
25583d8
Compare
|
I think adding compatability test cases in this PR is more good direction. So i add cases in the apache/datasketches-java#714 Thank you for waiting! cc. @freakyzoidberg |
Thanks @proost! Really appreciate you adding the Java compatibility tests in #714! I'll wait for that PR to be merged and the Java-generated [.sk] files to be available. Then I'll update this Go PR to add tests that read the Java-generated files for full cross-language compatibility. Let me know if there's anything you'd like me to adjust in the meantime! |
proost
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add test cases for java compatibility?
9d3b491 to
f9f53d3
Compare
|
@proost Thanks for the review feedback! I've made the following updates:
The tests currently skip (file not found) because the Java-generated |
- Generate 27 cross-language compatibility test files: - reservoir_items_long_*_go.sk (9 files) - reservoir_items_double_*_go.sk (9 files) - reservoir_items_string_*_go.sk (9 files) - Use k=128 for empty/exact, k=32/64/128 for sampling (n=1000) - Skip reservoir_longs_* (per issue apache#90: longs is Java legacy) - Update compatibility tests to match new file naming
|
|
||
| // TestReservoirItemsSketch_JavaCompat tests deserialization of Java-generated reservoir sketch files. | ||
| // These tests verify cross-language compatibility with files generated by datasketches-java. | ||
| func TestReservoirItemsSketch_JavaCompat(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost done. can you include java sketch files in this pr?
…e compatibility - Update ToSlice to produce Java-compatible binary format: - Use 2-long (16 bytes) preamble instead of 3-long (24 bytes) - Encode ResizeFactor X8 in high 2 bits of byte 0 - Set EMPTY flag (0x04) in byte 3 for empty sketches - Remove explicit numSamples storage (implicit as min(n, k)) - Simplify NewReservoirItemsSketchFromSlice to only support Java format - Add 27 Java-generated .sk files for cross-language compatibility tests - Update Go-generated .sk files to match Java format exactly - Update test generator to use 0-based indexing like Java Verified: Go and Java produce identical bytes for empty and exact mode sketches
proost
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@freakyzoidberg I think it is ready. can you check?
proost
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
|
Thanks @proost for the review and guidance on the cross-language test alignment! :) |
Related Issue
Follow-up to #90
Implementation
Files Added/Modified
sampling/serde.gosampling/serde_test.gosampling/reservoir_items_sketch.goSerDe Interface
Built-in SerDe
int64Int64SerDeint32Int32SerDefloat64Float64SerDestringStringSerDeUsage
Notes
ItemsSerDe[T]interfaceTesting
All tests pass: