@noblepaul
DO NOT merge. Created to make review easier.

Ishan Chattopadhyaya and others added 30 commits January 7, 2025 21:17
…apache#14109)

This commit fixes a bug whereby prefetch may select the wrong memory segment for multi-segment slices.

The issue was discovered when debugging a large test scenario, where the index input was backed by several memory segments. When sliced, a multi-segment index input uses an offset into the initial memory segment. This offset should be added to the prefetch offset to determine the absolute offset.
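
A minimal sketch of the offset arithmetic described above, not the actual Lucene code. The class name, the power-of-two segment size, and all method names are assumptions made for illustration; only the rule "add the slice offset to the prefetch offset" comes from the commit message.

```java
/** Sketch: locating the memory segment for a prefetch inside a sliced, multi-segment input. */
final class SlicePrefetchSketch {
  private final int chunkSizePower; // assumption: every segment holds 2^chunkSizePower bytes
  private final long sliceOffset;   // where the slice starts inside its first segment

  SlicePrefetchSketch(int chunkSizePower, long sliceOffset) {
    this.chunkSizePower = chunkSizePower;
    this.sliceOffset = sliceOffset;
  }

  /** Absolute offset = slice-relative prefetch offset + the slice's own offset. */
  long absoluteOffset(long prefetchOffset) {
    return prefetchOffset + sliceOffset;
  }

  /** Segment chosen when the slice offset is taken into account. */
  int segmentIndex(long prefetchOffset) {
    return (int) (absoluteOffset(prefetchOffset) >>> chunkSizePower);
  }

  /** Segment chosen when the slice offset is ignored, which is the kind of mistake described. */
  int segmentIndexIgnoringSliceOffset(long prefetchOffset) {
    return (int) (prefetchOffset >>> chunkSizePower);
  }

  public static void main(String[] args) {
    // 16-byte segments, slice starting at byte 10 of the first segment.
    SlicePrefetchSketch slice = new SlicePrefetchSketch(4, 10);
    System.out.println(slice.segmentIndex(8));
    System.out.println(slice.segmentIndexIgnoringSliceOffset(8));
  }
}
```

With 16-byte segments and a slice that starts at byte 10, a prefetch at slice-relative offset 8 lands at absolute offset 18, which belongs to the second segment rather than the first.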
…atetimes in UTC (apache#14102)


Co-authored-by: Shubham Sharma <shubhamvibrantiit@gmail.com>
This commit adds overloads for primitive access to DirectIOIndexInput.

Existing tests in TestDirectIOIndexInput already provide sufficient coverage for the changes in this PR.
This commit adds overloads for bulk retrieval to DirectIOIndexInput. The implementation of these methods is identical to that of BufferedIndexInput, and it is already covered by existing tests.
…14081)

This patch fixes incorrect URL links in NIOFSDirectory and FSDirectory.
* Use CDL to block threads to avoid flaky tests.

* Update CHANGES.txt
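
For readers unfamiliar with the abbreviation, CDL refers to `java.util.concurrent.CountDownLatch`. The following is a rough sketch of the general pattern, not the test that was changed: the class and method names are hypothetical, and the worker body is a placeholder.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

final class LatchBlockingSketch {
  /** Starts n workers that stay parked on a latch until released; returns how many ran. */
  static int runBlockedWorkers(int n) {
    CountDownLatch release = new CountDownLatch(1);
    AtomicInteger ran = new AtomicInteger();
    Thread[] workers = new Thread[n];
    for (int i = 0; i < n; i++) {
      workers[i] = new Thread(() -> {
        try {
          release.await(); // blocks until countDown(), no sleep-based timing involved
          ran.incrementAndGet();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      workers[i].start();
    }
    // No worker can have passed the latch yet, regardless of scheduling.
    if (ran.get() != 0) {
      throw new IllegalStateException("workers ran before release");
    }
    release.countDown();
    for (Thread worker : workers) {
      try {
        worker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while joining", e);
      }
    }
    return ran.get();
  }

  public static void main(String[] args) {
    System.out.println(runBlockedWorkers(4));
  }
}
```

Because the workers cannot proceed before `countDown()`, the state observed while they are blocked does not depend on thread scheduling, which is what typically removes the flakiness.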
…ace (apache#14113)

Removing unnecessary ByteArrayDataInput allocations by resetting in place

Signed-off-by: Ankit Jain <akjain@amazon.com>
…ntroduce `Bits#applyMask`. (apache#14134)

Most `DocIdSetIterator` implementations can no longer implement `#intoBitSet`
efficiently as soon as there are live docs. So this commit removes this argument
and instead introduces a new `Bits#applyMask` API that helps clear bits in a
bit set when the corresponding doc ID is not live.

Relates apache#14133
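
To illustrate the idea of masking, here is a small sketch that uses `java.util.BitSet` rather than Lucene's own bit set classes. The class name and the `applyMask(liveDocs, matches, offset)` signature are assumptions for illustration and are not claimed to match the real `Bits#applyMask` signature.

```java
import java.util.BitSet;

final class ApplyMaskSketch {
  /**
   * Clears every bit in matches whose doc ID is not live.
   * Bit i of matches is assumed to stand for doc ID (offset + i).
   */
  static void applyMask(BitSet liveDocs, BitSet matches, int offset) {
    for (int i = matches.nextSetBit(0); i >= 0; i = matches.nextSetBit(i + 1)) {
      if (!liveDocs.get(offset + i)) {
        matches.clear(i);
      }
    }
  }

  public static void main(String[] args) {
    BitSet liveDocs = new BitSet();
    liveDocs.set(0, 10);
    liveDocs.clear(5); // doc 5 has been deleted

    BitSet matches = new BitSet();
    matches.set(0, 3); // window covering docs 4, 5 and 6 when offset is 4
    applyMask(liveDocs, matches, 4);
    System.out.println(matches);
  }
}
```

The iterator can then fill the bit set without knowing about deletions, and the caller applies the live-docs mask once afterwards.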
Bit sets can be faster at advancing and more storage-efficient on dense blocks
of postings. This is not a new idea: @mkhludnev proposed something similar a
long time ago in apache#6116.

@msokolov recently brought up (apache#14080) that such an encoding has become
especially appealing with the introduction of the
`DocIdSetIterator#loadIntoBitSet` API, and the fact that non-scoring
disjunctions and dense conjunctions now take advantage of it. Indeed, if
postings are stored in a bit set, `#loadIntoBitSet` would just need to OR the
postings bits into the bits that are used as an intermediate representation of
matches of the query.
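
A minimal sketch of the OR step, assuming a dense postings block and the query's intermediate matches are both stored as arrays of 64-bit words covering the same doc ID window. The class and method names are hypothetical.

```java
final class BitSetPostingsSketch {
  /** ORs a dense block of postings into the query's match bits, one 64-bit word at a time. */
  static void orInto(long[] postingBits, long[] matchBits) {
    if (matchBits.length < postingBits.length) {
      throw new IllegalArgumentException("match window is smaller than the postings block");
    }
    for (int i = 0; i < postingBits.length; i++) {
      matchBits[i] |= postingBits[i];
    }
  }

  public static void main(String[] args) {
    long[] postings = {0b0101L};
    long[] matches = {0b0011L};
    orInto(postings, matches);
    System.out.println(Long.toBinaryString(matches[0]));
  }
}
```

Each iteration handles up to 64 doc IDs, whereas a delta-encoded block would have to be decoded and set bit by bit.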
This abstract class currently has only one implementation, so this commit removes the indirection.
* Publish build scans to develocity.apache.org

* Update Develocity plugin versions

* Use `DEVELOCITY_ACCESS_KEY` to authenticate to `develocity.apache.org`
### Description

In some vector search cases, users may already know some documents that are likely related to a query. Let's support seeding HNSW's scoring stage with these documents, rather than using HNSW's hierarchical stage.

An example use case is hybrid search, where both a traditional and vector search are performed. The top results from the traditional search are likely reasonable seeds for the vector search. Even when not performing hybrid search, traditional matching can often be faster than traversing the hierarchy, which can be used to speed up the vector search process (up to 2x faster for the same effectiveness), as was demonstrated in [this article](https://arxiv.org/abs/2307.16779) (full disclosure: seanmacavaney is an author of the article).

The main changes are:
 - A new "seeded" focused knn collector and collector manager
 - Two new basic knn queries that expose these specialized collectors for seeded entry points
 - `HnswGraphSearcher` now bypasses the `findBestEntryPoint` step if seeds are provided.
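
The seeding idea can be sketched as a best-first search over a single graph layer that starts from caller-supplied nodes instead of an entry point found by descending the hierarchy. This is an illustration only: the class name, the adjacency-array graph, and the precomputed `distToQuery` array are assumptions, and none of it is the `HnswGraphSearcher` implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class SeededGraphSearchSketch {
  /** Returns up to k nodes closest to the query, exploring outward from the given seeds. */
  static List<Integer> search(int[][] neighbors, float[] distToQuery, int[] seeds, int k) {
    Comparator<Integer> byDist = Comparator.comparingDouble(n -> distToQuery[n]);
    PriorityQueue<Integer> candidates = new PriorityQueue<>(byDist);         // closest first
    PriorityQueue<Integer> results = new PriorityQueue<>(byDist.reversed()); // farthest first
    boolean[] visited = new boolean[neighbors.length];

    // The seeds play the role that the hierarchy's entry point would otherwise play.
    for (int seed : seeds) {
      if (!visited[seed]) {
        visited[seed] = true;
        candidates.add(seed);
        results.add(seed);
        if (results.size() > k) {
          results.poll();
        }
      }
    }

    while (!candidates.isEmpty()) {
      int current = candidates.poll();
      // Stop once the closest unexplored candidate is worse than the worst kept result.
      if (results.size() >= k && distToQuery[current] > distToQuery[results.peek()]) {
        break;
      }
      for (int next : neighbors[current]) {
        if (visited[next]) {
          continue;
        }
        visited[next] = true;
        if (results.size() < k || distToQuery[next] < distToQuery[results.peek()]) {
          candidates.add(next);
          results.add(next);
          if (results.size() > k) {
            results.poll();
          }
        }
      }
    }

    List<Integer> ordered = new ArrayList<>(results);
    ordered.sort(byDist);
    return ordered;
  }

  public static void main(String[] args) {
    int[][] chain = {{1}, {0, 2}, {1, 3}, {2, 4}, {3}};
    float[] dist = {4f, 3f, 2f, 1f, 0f};
    System.out.println(search(chain, dist, new int[] {1}, 2));
  }
}
```

In a hybrid search setup, the seeds would be the top hits of the traditional query, so the graph walk begins near the relevant region without traversing the upper layers.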


//cc @seanmacavaney

Co-authored-by: Sean MacAvaney <smacavaney@bloomberg.com>
Co-authored-by: Sean MacAvaney <sean.macavaney@gmail.com>
Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
…ache#14138)

Implement IntersectVisitor#visit(IntsRef) in many of the current implementations and add
BulkAdder#add(IntsRef) method. They should provide better performance due to fewer virtual
method calls and more efficient bulk processing.
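
The difference between per-document and bulk visiting can be sketched with a simplified visitor interface. The `DocVisitor` interface and the plain `int[]` plus offset and length arguments are stand-ins chosen for this example; they are not Lucene's `IntersectVisitor` or `IntsRef`.

```java
import java.util.BitSet;

final class BulkVisitSketch {
  /** Simplified stand-in for a visitor that receives matching doc IDs. */
  interface DocVisitor {
    void visit(int docID);

    /** Bulk variant: one call per batch; the default falls back to per-doc calls. */
    default void visit(int[] docIDs, int offset, int length) {
      for (int i = offset; i < offset + length; i++) {
        visit(docIDs[i]);
      }
    }
  }

  /** Collects doc IDs into a bit set and overrides the bulk method with a tight loop. */
  static final class BitSetVisitor implements DocVisitor {
    final BitSet bits = new BitSet();

    @Override
    public void visit(int docID) {
      bits.set(docID);
    }

    @Override
    public void visit(int[] docIDs, int offset, int length) {
      // One interface call for the whole batch instead of one per doc ID.
      for (int i = offset; i < offset + length; i++) {
        bits.set(docIDs[i]);
      }
    }
  }

  public static void main(String[] args) {
    BitSetVisitor visitor = new BitSetVisitor();
    visitor.visit(new int[] {9, 3, 7, 1}, 1, 2);
    System.out.println(visitor.bits);
  }
}
```

Implementations that do not override the bulk method keep working through the default, while those that do pay the virtual dispatch cost once per batch.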
The error message is a bit different depending on whether you append to
an existing `IndexingChain.PerField` object or to a new one.
jpountz and others added 29 commits February 1, 2025 17:41
The recent optimization from apache#14164 interfered in a bad way with a prior
optimization.
…ther when the merge is below the min merge size. (apache#14166)

This is essentially porting apache#266 to `LogMergePolicy`. By allowing more than
`mergeFactor` segments to be merged together for small merges, the merge policy
gets a lower write amplification and indexes have fewer small segments.
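
One way to picture the rule is a selection loop that normally stops at `mergeFactor` segments but keeps going while the merged size would stay below the minimum merge size. This is a hedged sketch under that reading of the commit message; the class, the method name, and the exact stopping condition are assumptions, not the `LogMergePolicy` code.

```java
final class SmallMergeSketch {
  /**
   * Returns the exclusive end index of a merge that starts at start.
   * Up to mergeFactor segments are always eligible; beyond that, segments are
   * added only while the merged size would stay below minMergeSize.
   */
  static int mergeEnd(long[] sizes, int start, int mergeFactor, long minMergeSize) {
    long total = 0;
    int end = start;
    while (end < sizes.length) {
      boolean underFactor = end - start < mergeFactor;
      boolean stillSmall = total + sizes[end] < minMergeSize;
      if (!underFactor && !stillSmall) {
        break;
      }
      total += sizes[end];
      end++;
    }
    return end;
  }

  public static void main(String[] args) {
    long[] sizes = {1, 1, 1, 1, 1, 100};
    System.out.println(mergeEnd(sizes, 0, 2, 5)); // small merge may exceed mergeFactor
    System.out.println(mergeEnd(sizes, 0, 2, 0)); // no minimum size: capped at mergeFactor
  }
}
```

Merging many tiny segments in one pass means their bytes are rewritten once rather than at several merge levels, which is where the lower write amplification comes from.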
After the rewrite all the BaseKnnVectorsFormatTestCase tests pass. There are still some lurking intermittent failures, but the tests pass successfully the majority of the time.

Summary of the most significant changes:

1. Use the flat vectors reader/writer to support the raw float32 vectors and the ordinal to docId mapping. This is similar to how HNSW is supported in Lucene, and it keeps the code aligned with how other formats are layered atop each other.
2. The cuVS indices (Cagra, brute force, and HNSW) are stored directly in the format, so they can be mmap'ed directly.
3. Merges are physical: all raw vectors are retrieved and used to create new cuVS indices.
4. A standard KnnCollector is used. There is no need for a special one for cuVS, unless one wants to customise some very specific parameters.

A number of workarounds have been put in place, which will eventually be lifted.

1. Pre-filters and deleted docs are handled by oversampling the topK, since cuvs-java does not yet support a pre-filter.
2. Cagra failures when indexing small numbers of docs are ignored, and the format falls back to brute force.
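
The oversampling workaround can be sketched as follows: ask the unfiltered index for more than `k` candidates, then drop those rejected by the filter or deleted. Everything here is hypothetical, including the `UnfilteredSearch` interface and the fixed `oversample` multiplier; it does not reflect the cuvs-java API or the factor actually used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

final class OversampleSketch {
  /** Hypothetical index that returns up to n doc IDs, best first, with no filter support. */
  interface UnfilteredSearch {
    int[] topK(int n);
  }

  /** Emulates a pre-filter by over-fetching and filtering afterwards. */
  static List<Integer> filteredTopK(
      UnfilteredSearch index, IntPredicate accept, int k, int oversample) {
    int[] candidates = index.topK(Math.multiplyExact(k, oversample));
    List<Integer> hits = new ArrayList<>();
    for (int doc : candidates) {
      if (hits.size() == k) {
        break;
      }
      if (accept.test(doc)) {
        hits.add(doc);
      }
    }
    return hits; // may hold fewer than k hits if the filter rejected too many candidates
  }

  public static void main(String[] args) {
    int[] ranked = {5, 2, 8, 1, 9, 4};
    UnfilteredSearch index = n -> Arrays.copyOf(ranked, Math.min(n, ranked.length));
    System.out.println(filteredTopK(index, doc -> doc % 2 == 0, 2, 2));
  }
}
```

The trade-off is that a very selective filter can leave fewer than `k` hits, which is one reason such a workaround is expected to be temporary.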
@punAhuja punAhuja force-pushed the cuvs-integration-main branch from d6c2def to e4e1b15 Compare April 9, 2025 08:34