-
Notifications
You must be signed in to change notification settings - Fork 8
feat: add xxhash64 #39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds an xxhash64 implementation to the hash module, matching the DataSketches Java behavior. The implementation provides a fast, non-cryptographic 64-bit hash function with seeded hashing support.
- Implements
XxHash64struct with the standardHashertrait - Provides both streaming API (
write/finish64) and convenience method (hash_u64) - Includes comprehensive test coverage with reference test vectors for verification
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/hash/xxhash.rs | New complete implementation of xxhash64 with constants, hasher struct, trait implementations, helper functions, and test suite |
| src/hash/mod.rs | Adds module declaration and public export for XxHash64 (with unused_imports attribute as hash module is currently internal-only) |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| pub fn with_seed(seed: u64) -> Self { | ||
| XxHash64 { | ||
| seed, | ||
| total_len: 0, | ||
| v1: seed.wrapping_add(P1).wrapping_add(P2), | ||
| v2: seed.wrapping_add(P2), | ||
| v3: seed, | ||
| v4: seed.wrapping_sub(P1), | ||
| buffer: [0; 32], | ||
| buffer_len: 0, | ||
| } | ||
| } | ||
|
|
||
| pub fn finish64(&self) -> u64 { | ||
| let mut hash = if self.total_len >= 32 { | ||
| let mut acc = self.v1.rotate_left(1) | ||
| .wrapping_add(self.v2.rotate_left(7)) | ||
| .wrapping_add(self.v3.rotate_left(12)) | ||
| .wrapping_add(self.v4.rotate_left(18)); | ||
| acc = merge_round(acc, self.v1); | ||
| acc = merge_round(acc, self.v2); | ||
| acc = merge_round(acc, self.v3); | ||
| acc = merge_round(acc, self.v4); | ||
| acc | ||
| } else { | ||
| self.seed.wrapping_add(P5) | ||
| }; | ||
|
|
||
| hash = hash.wrapping_add(self.total_len); | ||
|
|
||
| let mut idx = 0; | ||
| let buf = &self.buffer[..self.buffer_len]; | ||
| while idx + 8 <= buf.len() { | ||
| let mut k1 = LE::read_u64(&buf[idx..idx + 8]); | ||
| k1 = k1.wrapping_mul(P2); | ||
| k1 = k1.rotate_left(31); | ||
| k1 = k1.wrapping_mul(P1); | ||
| hash ^= k1; | ||
| hash = hash.rotate_left(27).wrapping_mul(P1).wrapping_add(P4); | ||
| idx += 8; | ||
| } | ||
|
|
||
| if idx + 4 <= buf.len() { | ||
| let k1 = LE::read_u32(&buf[idx..idx + 4]) as u64; | ||
| hash ^= k1.wrapping_mul(P1); | ||
| hash = hash.rotate_left(23).wrapping_mul(P2).wrapping_add(P3); | ||
| idx += 4; | ||
| } | ||
|
|
||
| while idx < buf.len() { | ||
| let k1 = buf[idx] as u64; | ||
| hash ^= k1.wrapping_mul(P5); | ||
| hash = hash.rotate_left(11).wrapping_mul(P1); | ||
| idx += 1; | ||
| } | ||
|
|
||
| finalize(hash) | ||
| } | ||
|
|
||
| #[allow(dead_code)] | ||
| pub fn hash_u64(input: u64, seed: u64) -> u64 { | ||
| let mut hash = seed.wrapping_add(P5).wrapping_add(8); | ||
| let mut k1 = input; | ||
| k1 = k1.wrapping_mul(P2); | ||
| k1 = k1.rotate_left(31); | ||
| k1 = k1.wrapping_mul(P1); | ||
| hash ^= k1; | ||
| hash = hash.rotate_left(27).wrapping_mul(P1).wrapping_add(P4); | ||
| finalize(hash) | ||
| } |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The public methods with_seed, finish64, and hash_u64 lack documentation comments. Public API methods should be documented to explain their purpose, parameters, and return values. Consider adding doc comments similar to how the struct itself is documented.
| } | ||
|
|
||
| #[test] | ||
| fn test_long_check() { |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test name test_long_check is not descriptive. Consider renaming it to better describe what it tests, such as test_hash_u64_matches_write or test_hash_u64_consistency to make it clear that it's verifying the hash_u64 convenience method produces the same result as the write-based approach.
| fn test_long_check() { | |
| fn test_hash_u64_matches_write() { |
| fn merge_round(mut acc: u64, val: u64) -> u64 { | ||
| let mut v = val; | ||
| v = v.wrapping_mul(P2); | ||
| v = v.rotate_left(31); | ||
| v = v.wrapping_mul(P1); | ||
| acc ^= v; | ||
| acc.wrapping_mul(P1).wrapping_add(P4) |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intermediate mutable variable v is unnecessary here. The operations could be chained directly on the val parameter by making it mutable, or the operations could be chained in a single expression. This would simplify the code and improve readability.
| fn merge_round(mut acc: u64, val: u64) -> u64 { | |
| let mut v = val; | |
| v = v.wrapping_mul(P2); | |
| v = v.rotate_left(31); | |
| v = v.wrapping_mul(P1); | |
| acc ^= v; | |
| acc.wrapping_mul(P1).wrapping_add(P4) | |
| fn merge_round(acc: u64, val: u64) -> u64 { | |
| (acc ^ val.wrapping_mul(P2).rotate_left(31).wrapping_mul(P1)) | |
| .wrapping_mul(P1) | |
| .wrapping_add(P4) |
I think we should wait for sketches that need xxhash before shipping it ourselves. cc @tisonkun |
Signed-off-by: Chojan Shang <psiace@apache.org>
Agree. We can hold this PR for now. Whether or not publish the hash function is still under discussion, somehow datasketches-java and dataskecthes-cpp use them internally and datasketches-go clearly define the hash function internally. Seems mainly BloomFilter uses XxHash. @PsiACE you can take a look at #5 to see where we can have those sketches and xxhash as part of its implementation. |
|
I am working on a implementation of bloom filters at the moment. I can rework it to use this PR. |
Great! When it is desired, you can use |
I just found that we've implemented murmur3hash ourselves, while Java also implements xxhash.