Cuvs integration main #1

noblepaul · 2025-01-08T02:55:29Z

DO NOT merge, created to make review easier

…apache#14109) This commit fixes a bug whereby prefetch may select the wrong memory segment for multi-segment slices. The issue was discovered when debugging a large test scenario, where the index input was backed by several memory segments. When sliced, a multi-segment index input uses an offset into the initial memory segment. This offset should be added to the prefetch offset to determine the absolute offset.
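The offset arithmetic described above can be sketched as follows. This is a minimal illustration and not the code from the commit: the class name, method names, and the assumption that every segment has a power-of-two size (`chunkSizePower`) are hypothetical.

```java
/** Hypothetical illustration of slice-relative to absolute offset mapping. */
final class SliceOffsetSketch {

  /** Index of the memory segment that holds a slice-relative position. */
  static int segmentIndex(long sliceOffset, long pos, int chunkSizePower) {
    // Add the slice's offset into the initial segment before locating the segment.
    long absolute = sliceOffset + pos;
    return (int) (absolute >>> chunkSizePower);
  }

  /** Offset of the same position inside that segment. */
  static long offsetInSegment(long sliceOffset, long pos, int chunkSizePower) {
    long mask = (1L << chunkSizePower) - 1;
    return (sliceOffset + pos) & mask;
  }
}
```

With 16-byte segments (`chunkSizePower = 4`), a slice starting 10 bytes into the first segment, and a prefetch at slice position 8, the absolute offset is 18, which falls in the second segment. Ignoring the slice offset would select the first segment instead.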

…ntroduce `Bits#applyMask`. (apache#14134) Most `DocIdSetIterator` implementations can no longer implement `#intoBitSet` efficiently as soon as there are live docs. So this commit removes this argument and instead introduces a new `Bits#applyMask` API that helps clear bits in a bit set when the corresponding doc ID is not live. Relates apache#14133
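A minimal sketch of the masking idea, assuming a window of matches whose bit `i` stands for doc ID `offset + i`. It uses `java.util.BitSet` and an `IntPredicate` as stand-ins; the class name and signature are hypothetical and are not the Lucene API.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

/** Hypothetical stand-in for the masking step; not the Lucene signature. */
final class ApplyMaskSketch {

  /** Clears bit i of the window whenever doc (offset + i) is not live. */
  static void applyMask(IntPredicate liveDocs, BitSet window, int offset) {
    for (int i = window.nextSetBit(0); i >= 0; i = window.nextSetBit(i + 1)) {
      if (!liveDocs.test(offset + i)) {
        window.clear(i);
      }
    }
  }
}
```

The iterator can then fill the window without knowing about deletions, and the live docs are applied once afterwards.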

Bit sets can be faster at advancing and more storage-efficient on dense blocks of postings. This is not a new idea: @mkhludnev proposed something similar a long time ago in apache#6116. @msokolov recently brought up (apache#14080) that such an encoding has become especially appealing with the introduction of the `DocIdSetIterator#loadIntoBitSet` API, and the fact that non-scoring disjunctions and dense conjunctions now take advantage of it. Indeed, if postings are stored in a bit set, `#loadIntoBitSet` would just need to OR the postings bits into the bits that are used as an intermediate representation of matches of the query.
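As a rough sketch of why this is cheap, assume a dense block is stored as 64-bit words and is word-aligned with the window of matches (an assumption made here for simplicity; the names are hypothetical):

```java
/** Hypothetical sketch: merging a bit-set-encoded postings block into a match window. */
final class DenseBlockSketch {

  /** ORs the block's words into the window, starting at the given word index. */
  static void orInto(long[] blockWords, long[] windowWords, int wordOffset) {
    for (int i = 0; i < blockWords.length; i++) {
      // One OR handles 64 doc IDs at once, with no per-document decoding.
      windowWords[wordOffset + i] |= blockWords[i];
    }
  }
}
```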

### Description

In some vector search cases, users may already know some documents that are likely related to a query. Let's support seeding HNSW's scoring stage with these documents, rather than using HNSW's hierarchical stage.

An example use case is hybrid search, where both a traditional and vector search are performed. The top results from the traditional search are likely reasonable seeds for the vector search. Even when not performing hybrid search, traditional matching can often be faster than traversing the hierarchy, which can be used to speed up the vector search process (up to 2x faster for the same effectiveness), as was demonstrated in [this article](https://arxiv.org/abs/2307.16779) (full disclosure: seanmacavaney is an author of the article).

The main changes are:

- A new "seeded" focused knn collector and collector manager
- Two new basic knn queries that expose using these specialized collectors for seeded entrypoint
- `HnswGraphSearcher`, which bypasses the `findBestEntryPoint` step if seeds are provided.

//cc @seanmacavaney

Co-authored-by: Sean MacAvaney <smacavaney@bloomberg.com>
Co-authored-by: Sean MacAvaney <sean.macavaney@gmail.com>
Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
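The idea of starting the bottom-layer search from seeds can be sketched as a plain best-first graph search. This is a simplified illustration, not `HnswGraphSearcher`: the adjacency-array graph, the `score` function (higher is better), and the class name are all assumptions.

```java
import java.util.BitSet;
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.function.IntToDoubleFunction;

/** Hypothetical sketch of a seeded best-first search over a single graph layer. */
final class SeededSearchSketch {

  /** Returns up to k node ids, best first, starting the search from the given seeds. */
  static int[] search(int[][] neighbors, IntToDoubleFunction score, int[] seeds, int k) {
    Comparator<Integer> byScore = Comparator.comparingDouble(n -> score.applyAsDouble(n));
    PriorityQueue<Integer> candidates = new PriorityQueue<>(byScore.reversed()); // best on top
    PriorityQueue<Integer> results = new PriorityQueue<>(byScore); // worst on top
    BitSet visited = new BitSet(neighbors.length);

    // Seeds replace the entry point that a hierarchical descent would otherwise find.
    for (int seed : seeds) {
      if (!visited.get(seed)) {
        visited.set(seed);
        candidates.add(seed);
        results.add(seed);
        if (results.size() > k) {
          results.poll();
        }
      }
    }

    while (!candidates.isEmpty()) {
      int current = candidates.poll();
      // Stop once the best remaining candidate cannot improve a full result set.
      if (results.size() >= k
          && score.applyAsDouble(current) < score.applyAsDouble(results.peek())) {
        break;
      }
      for (int n : neighbors[current]) {
        if (visited.get(n)) {
          continue;
        }
        visited.set(n);
        if (results.size() < k || score.applyAsDouble(n) > score.applyAsDouble(results.peek())) {
          candidates.add(n);
          results.add(n);
          if (results.size() > k) {
            results.poll();
          }
        }
      }
    }

    int[] out = new int[results.size()];
    for (int i = out.length - 1; i >= 0; i--) {
      out[i] = results.poll(); // worst is polled first, so fill from the back
    }
    return out;
  }
}
```

In a hybrid setup the seeds would be the top hits of the traditional query. The search quality then depends on how close those seeds already are to the true neighbors.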

…ache#14138) Implement IntersectVisitor#visit(IntsRef) in many of the current implementations and add BulkAdder#add(IntsRef) method. They should provide better performance due to fewer virtual method calls and more efficient bulk processing.
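The shape of such a bulk method can be sketched with a simplified visitor. The interface and class below are hypothetical stand-ins that take a plain `int[]` range rather than Lucene's `IntsRef`.

```java
import java.util.BitSet;

/** Hypothetical visitor with a per-document method and a bulk default. */
interface DocVisitorSketch {
  void visit(int docID);

  /** Default bulk path: one virtual call per document. */
  default void visit(int[] docs, int offset, int length) {
    for (int i = offset; i < offset + length; i++) {
      visit(docs[i]);
    }
  }
}

/** Overrides the bulk path so a whole run of doc IDs costs a single virtual call. */
final class BitSetVisitorSketch implements DocVisitorSketch {
  final BitSet matches = new BitSet();

  @Override
  public void visit(int docID) {
    matches.set(docID);
  }

  @Override
  public void visit(int[] docs, int offset, int length) {
    for (int i = offset; i < offset + length; i++) {
      matches.set(docs[i]); // direct call on a final class, no per-doc dispatch
    }
  }
}
```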

…ther when the merge is below the min merge size. (apache#14166) This is essentially porting apache#266 to `LogMergePolicy`. By allowing more than `mergeFactor` segments to be merged together for small merges, the merge policy gets a lower write amplification and indexes have fewer small segments.
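One way to picture the widening rule is the sketch below. It is only an illustration of the description above and not `LogMergePolicy`'s selection code; the method name, the parameters, and the exact stopping condition are assumptions.

```java
/** Hypothetical sketch: widen a merge past mergeFactor while it stays small. */
final class SmallMergeSketch {

  /** Number of consecutive segments, starting at start, to merge together. */
  static int mergeWidth(long[] sizes, int start, int mergeFactor, long minMergeSize) {
    int end = Math.min(sizes.length, start + mergeFactor);
    long total = 0;
    for (int i = start; i < end; i++) {
      total += sizes[i];
    }
    // Keep adding segments only while the merged result stays below the min merge size.
    while (end < sizes.length && total + sizes[end] < minMergeSize) {
      total += sizes[end];
      end++;
    }
    return end - start;
  }
}
```

With five segments of size 1 followed by one of size 100, `mergeFactor = 2` and a min merge size of 10, this picks all five small segments in one merge instead of two.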

After the rewrite all the BaseKnnVectorsFormatTestCase tests pass. There are still some lurking intermittent failures, but the tests pass the majority of the time.

Summary of the most significant changes:

1. Use the flat vectors reader/writer to support the raw float32 vectors and ordinal to docId mapping. This is similar to how HNSW is supported in Lucene, and keeps the code aligned with how other formats are layered atop each other.
2. The cuVS indices (Cagra, brute force, and HNSW) are stored directly in the format, so can be mmap'ed directly.
3. Merges are physical: all raw vectors are retrieved and used to create new cuVS indices.
4. A standard KnnCollector is used; there is no need for a special one for cuVS, unless one wants to customise some very specific parameters.

A number of workarounds have been put in place, which will eventually be lifted:

1. Pre-filter and deleted docs oversample the topK, since cuvs-java does not yet support a pre-filter.
2. Ignore Cagra failures when indexing small numbers of docs, and fail over to brute force.
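The oversampling workaround can be sketched as "ask for more, then filter". The class name, the `factor` parameter, and the helper names are hypothetical; the source does not say how much the topK is oversampled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical sketch of oversampling followed by post-filtering. */
final class OversampleSketch {

  /** Larger topK to request from the index, capped at the number of indexed vectors. */
  static int oversampledK(int k, int factor, int indexSize) {
    return (int) Math.min((long) k * factor, (long) indexSize);
  }

  /** Keeps the first k ordinals, in ranked order, that pass the filter and live-docs check. */
  static List<Integer> postFilter(int[] rankedOrds, IntPredicate accepted, int k) {
    List<Integer> kept = new ArrayList<>(k);
    for (int ord : rankedOrds) {
      if (kept.size() == k) {
        break;
      }
      if (accepted.test(ord)) {
        kept.add(ord);
      }
    }
    return kept;
  }
}
```

A fixed factor can still return fewer than k hits when the filter is very selective, which is one reason a native pre-filter is preferable.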

Ishan Chattopadhyaya and others added 30 commits January 7, 2025 21:17

Initial cut of CuVS into Lucene as a Codec in sandbox

b8a1162

Test fixes

0e9f6d4

fix for getFloatVectorValues

a95f084

Add some basic HNSW graph checks to CheckIndex (apache#13984)

11eb2c8

Add CHANGES entry for CheckIndex HNSW work (apache#14120)

5fd2e70

Fix test that was implicitly assuming simple writer config that would…

0169c1e

… not rearrange docids (apache#14122)

Updated releaseWizard.py to use timezone-aware objects to represent d…

3fad719

…atetimes in UTC (apache#14102) Co-authored-by: Shubham Sharma <shubhamvibrantiit@gmail.com>

Preserve max-conn when merging onto existing graph Fixes gh#14118 (ap…

2afc0a0

…ache#14121)

fix for gh#14110: improve BpVectorReordered heuristic to make it more…

b7c7fe0

… stable (apache#14117)

DirectIOIndexInput - add overloads for primitive access (apache#14107)

7c64217

This commit adds overloads for primitive access to DirectIOIndexInput. Existing tests in TestDirectIOIndexInput already provide sufficient coverage for the changes in this PR.

DirectIOIndexInput - add overloads for bulk retrieval (apache#14124)

60efc4a

This commit adds overloads for bulk retrieval to DirectIOIndexInput. The implementation of these methods is identical to that of BufferedIndexInput, and is already covered by existing tests.

Fix urls describing why NIOFS is not recommended for Windows (apache#…

2756cd9

…14081) This patch fixes incorrect URL links in NIOFSDirectory and FSDirectory.

Use CDL to block threads to avoid flaky tests. (apache#14116)

ee65e8f

* Use CDL to block threads to avoid flaky tests. * Update CHANGES.txt
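A minimal sketch of the general technique, assuming "CDL" refers to `java.util.concurrent.CountDownLatch`. The class and method names are hypothetical and this is not the test changed in the commit.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical sketch: block a worker on a latch instead of relying on timing. */
final class LatchSketch {

  /** Returns true if the worker was still blocked when checked and finished after release. */
  static boolean runBlockedWorker() {
    CountDownLatch started = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);
    Thread worker =
        new Thread(
            () -> {
              started.countDown();
              try {
                release.await(); // stays here until the test decides otherwise
              } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
              }
            });
    try {
      worker.start();
      started.await();
      // The worker cannot exit before release.countDown(), so this check is deterministic.
      boolean blocked = worker.isAlive();
      release.countDown();
      worker.join();
      return blocked && !worker.isAlive();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }
}
```

Because the test controls exactly when the worker proceeds, its outcome no longer depends on sleeps or scheduler timing.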

fix apachegh-14123: Add null checks to SortingCodecReader (apache#14125)

1778377

Removing unnecessary ByteArrayDataInput allocations by resetting inpl…

6f9702e

…ace (apache#14113) Removing unnecessary ByteArrayDataInput allocations by resetting inplace Signed-off-by: Ankit Jain <akjain@amazon.com>

Cover all DataType (apache#14091)

c20e09e

Cover all DataType

Fixing precommit, ECJ, Rat, spotless, forbiddenApis etc.

9f0d3dd

Add CHANGES for apache#14133.

245acc8

Remove SingleValueDocValuesFieldUpdates abstract class (apache#14059)

c1cbb22

This abstract class has currently one implementation so this removes this indirection.

Complete the javadoc for DirectoryReader#indexExists (apache#14136)

b87757c

Thank you!

Publish build scans to develocity.apache.org (apache#14140)

34a732f

* Publish build scans to develocity.apache.org * Update Develocity plugin versions * Use `DEVELOCITY_ACCESS_KEY` to authenticate to `develocity.apache.org`

Revert "Publish build scans to develocity.apache.org (apache#14140)"

905efa9

This reverts commit 34a732f.

Temporarily skip tasks that point at datasets previously hosted at ho…

df7170e

…me.apache.org apache#13647 apache#14144

Publish build scans to develocity.apache.org (apache#14141)

16cd779

Fix TestFeatureField.testStoreTermVectors failure. (apache#14146)

5c91f15

The error message is a bit different depending on whether you append to an existing `IndexingChain.PerField` object or to a new one.

jpountz and others added 29 commits February 1, 2025 17:41

Fix refill logic in nextDoc(). (apache#14185)

b429c43

The recent optimization from apache#14164 interfered in a bad way with a prior optimization.

minimal update for the new cuvs-java api modifications

834e560

add filter cuvs service provider

3772c4c

Use github wf to add module labels for PR based on file changes (apac…

b13d37f

…he#14101)

Merge branch 'main' into cuvs-integration-main

80255fd

cleanup

8453bb1

itr : remove dep on commons lang3, fix visibility issues

2bce954

tidy

e62112e

expose knn format and update test

349c7aa

fix initialization of cuvSResources

8d8db0b

add CuVSVectorsFormat test

c9d454d

fix testWriterRamEstimate

ab6beae

add bug URLs

30206d6

Make CuVSKnnFloatVectorQuery public

8e9fe16

assertion and test

34afa24

plumb infoStream, and add indexType

8ae2515

fix default index TYPE

e04c2e7

fix workaround for tiny Cagra index

fbb0407

tidy

7f39c0c

fix bug where docs are deleted or empty

67ec96b

clamp intermediate graph degree

6e86c21

comment out log mesg

b1a84c2

make 32 the default GPU index threads

4dd1f88

remove LibraryException from the API, so consumers don't need cuvs-java

c4b5c29

De-allocate indexes once serialized.

8cf5087

de-allocate indices on the read side, when closed

3837a10

Fixing scoring normalization for search spanning multiple segments

e4e1b15

punAhuja force-pushed the cuvs-integration-main branch from d6c2def to e4e1b15 April 9, 2025 08:34

20 participants
