forked from apache/lucene
Cuvs integration main #1
Open
noblepaul wants to merge 89 commits into main from cuvs-integration-main
…apache#14109) This commit fixes a bug whereby prefetch may select the wrong memory segment for multi-segment slices. The issue was discovered while debugging a large test scenario in which the index input was backed by several memory segments. When sliced, a multi-segment index input uses an offset into the initial memory segment. This offset must be added to the prefetch offset to determine the absolute offset.
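The offset arithmetic described in this commit message can be sketched as follows. This is a minimal illustration, not the code from the commit: the class name, method names, and the assumption that all segments share a power-of-two size (`chunkSizePower`) are hypothetical.

```java
// Hypothetical sketch: locating the memory segment for a prefetch on a sliced input.
// Assumes every segment has the same size, 2^chunkSizePower bytes.
final class PrefetchOffsetSketch {

  /** Index of the segment that holds the byte at the given slice-relative prefetch offset. */
  static int segmentIndex(long sliceOffset, long prefetchOffset, int chunkSizePower) {
    // The slice starts sliceOffset bytes into the initial segment, so that
    // offset has to be added before choosing a segment.
    long absolute = sliceOffset + prefetchOffset;
    return (int) (absolute >>> chunkSizePower);
  }

  /** Position of that byte inside the selected segment. */
  static long offsetInSegment(long sliceOffset, long prefetchOffset, int chunkSizePower) {
    long mask = (1L << chunkSizePower) - 1;
    return (sliceOffset + prefetchOffset) & mask;
  }
}
```

With 16-byte segments, a slice offset of 10 and a prefetch offset of 8 land in the second segment; ignoring the slice offset would wrongly select the first one.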
… not rearrange docids (apache#14122)
…atetimes in UTC (apache#14102) Co-authored-by: Shubham Sharma <shubhamvibrantiit@gmail.com>
This commit adds overloads for primitive access to DirectIOIndexInput. Existing tests in TestDirectIOIndexInput already provide sufficient coverage for the changes in this PR.
This commit adds overloads for bulk retrieval to DirectIOIndexInput. The implementation of these methods is identical to that of BufferedIndexInput, and it is already covered by existing tests.
…14081) This patch fixes incorrect URL links in NIOFSDirectory and FSDirectory.
* Use CDL to block threads to avoid flaky tests.
* Update CHANGES.txt
…ace (apache#14113) Removing unnecessary ByteArrayDataInput allocations by resetting in place. Signed-off-by: Ankit Jain <akjain@amazon.com>
Cover all DataType
…ntroduce `Bits#applyMask`. (apache#14134) Most `DocIdSetIterator` implementations can no longer implement `#intoBitSet` efficiently as soon as there are live docs. So this commit removes this argument and instead introduces a new `Bits#applyMask` API that helps clear bits in a bit set when the corresponding doc ID is not live. Relates apache#14133
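The masking step this commit describes can be illustrated roughly as below. This is a sketch under assumptions: it uses `java.util.BitSet` and an `IntPredicate` in place of Lucene's own types, and the method signature (including the `offset` parameter) is hypothetical rather than the API added by the commit.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

// Hypothetical sketch: clear bits whose corresponding doc ID is not live.
final class ApplyMaskSketch {

  /**
   * Bit i of window stands for doc ID (offset + i). Every set bit whose
   * doc ID is rejected by liveDocs is cleared in place.
   */
  static void applyMask(IntPredicate liveDocs, BitSet window, int offset) {
    for (int i = window.nextSetBit(0); i >= 0; i = window.nextSetBit(i + 1)) {
      if (!liveDocs.test(offset + i)) {
        window.clear(i);
      }
    }
  }
}
```

Separating the mask from the iterator means an iterator can fill the bit set without consulting live docs, and deletions are applied in a single pass afterwards.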
Bit sets can be faster at advancing and more storage-efficient on dense blocks of postings. This is not a new idea: @mkhludnev proposed something similar a long time ago in apache#6116. @msokolov recently brought up (apache#14080) that such an encoding has become especially appealing with the introduction of the `DocIdSetIterator#loadIntoBitSet` API, and the fact that non-scoring disjunctions and dense conjunctions now take advantage of it. Indeed, if postings are stored in a bit set, `#loadIntoBitSet` would just need to OR the postings bits into the bits that are used as an intermediate representation of matches of the query.
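The OR step mentioned at the end of this message amounts to a word-by-word loop. The sketch below assumes both the postings block and the intermediate match set are plain `long[]` words covering the same doc ID window; the class and method names are hypothetical.

```java
// Hypothetical sketch: merging a bit-set-encoded postings block into the
// bit set that accumulates the matches of a query.
final class BitSetPostingsSketch {

  /** ORs each postings word into the matching word of the accumulator. */
  static void orInto(long[] postingsWords, long[] matchWords) {
    int n = Math.min(postingsWords.length, matchWords.length);
    for (int i = 0; i < n; i++) {
      matchWords[i] |= postingsWords[i]; // 64 doc IDs handled per iteration
    }
  }
}
```

Compared with decoding doc IDs one by one and setting individual bits, this touches each 64-doc word once, which is why dense blocks benefit most.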
This abstract class currently has only one implementation, so this commit removes the indirection.
* Publish build scans to develocity.apache.org
* Update Develocity plugin versions
* Use `DEVELOCITY_ACCESS_KEY` to authenticate to `develocity.apache.org`
This reverts commit 34a732f.
### Description

In some vector search cases, users may already know some documents that are likely related to a query. Let's support seeding HNSW's scoring stage with these documents, rather than using HNSW's hierarchical stage.

An example use case is hybrid search, where both a traditional and vector search are performed. The top results from the traditional search are likely reasonable seeds for the vector search. Even when not performing hybrid search, traditional matching can often be faster than traversing the hierarchy, which can be used to speed up the vector search process (up to 2x faster for the same effectiveness), as was demonstrated in [this article](https://arxiv.org/abs/2307.16779) (full disclosure: seanmacavaney is an author of the article).

The main changes are:

- A new "seeded" focused knn collector and collector manager
- Two new basic knn queries that expose using these specialized collectors for seeded entrypoint
- `HnswGraphSearcher`, which bypasses the `findBestEntryPoint` step if seeds are provided.

//cc @seanmacavaney

Co-authored-by: Sean MacAvaney <smacavaney@bloomberg.com>
Co-authored-by: Sean MacAvaney <sean.macavaney@gmail.com>
Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
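To make the seeding idea concrete, here is a simplified best-first search over a single-layer neighbor graph that starts from caller-supplied seeds instead of descending a hierarchy. It is only a sketch: the class, the array-based graph, and the squared Euclidean distance are assumptions for illustration, and it does not reproduce `HnswGraphSearcher` or the new collectors.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical sketch: graph search seeded with known-relevant nodes.
final class SeededGraphSearchSketch {

  /** Returns up to k node ids, closest to the query first. */
  static List<Integer> search(
      float[][] vectors, int[][] neighbors, float[] query, int[] seeds, int k) {
    Comparator<Integer> byDistance =
        Comparator.comparingDouble(n -> distance(vectors[n], query));
    PriorityQueue<Integer> candidates = new PriorityQueue<>(byDistance); // closest first
    PriorityQueue<Integer> results = new PriorityQueue<>(byDistance.reversed()); // farthest first
    Set<Integer> visited = new HashSet<>();

    // Seeds replace the entry point that the hierarchy would normally provide.
    for (int seed : seeds) {
      if (visited.add(seed)) {
        candidates.add(seed);
        results.add(seed);
        if (results.size() > k) results.poll();
      }
    }

    while (!candidates.isEmpty()) {
      int current = candidates.poll();
      // Stop once the best remaining candidate is worse than the worst kept result.
      if (results.size() >= k
          && distance(vectors[current], query) > distance(vectors[results.peek()], query)) {
        break;
      }
      for (int next : neighbors[current]) {
        if (visited.add(next)) {
          candidates.add(next);
          results.add(next);
          if (results.size() > k) results.poll();
        }
      }
    }

    List<Integer> ordered = new ArrayList<>(results);
    ordered.sort(byDistance);
    return ordered;
  }

  /** Squared Euclidean distance; chosen here only for simplicity. */
  static double distance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
```

In a hybrid search setting, the `seeds` array would hold the ordinals of the top hits from the traditional query.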
…ache#14138) Implement IntersectVisitor#visit(IntsRef) in many of the current implementations and add a BulkAdder#add(IntsRef) method. They should provide better performance due to fewer virtual method calls and more efficient bulk processing.
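The benefit of a bulk overload can be seen in a small stand-in. The interface below is hypothetical and uses an `int[]` with offset and length where the commit uses `IntsRef`; it is not the `IntersectVisitor` or `BulkAdder` API itself.

```java
import java.util.BitSet;

// Hypothetical sketch: a per-doc visitor with a bulk overload.
final class BulkVisitSketch {

  interface Visitor {
    void visit(int docID);

    /** Default bulk path: falls back to one virtual call per doc ID. */
    default void visit(int[] docIDs, int offset, int length) {
      for (int i = offset; i < offset + length; i++) {
        visit(docIDs[i]);
      }
    }
  }

  /** Overrides the bulk path so a whole batch costs a single virtual call. */
  static final class BitSetVisitor implements Visitor {
    final BitSet bits = new BitSet();

    @Override
    public void visit(int docID) {
      bits.set(docID);
    }

    @Override
    public void visit(int[] docIDs, int offset, int length) {
      for (int i = offset; i < offset + length; i++) {
        bits.set(docIDs[i]); // direct call on a concrete field, no dispatch per doc
      }
    }
  }
}
```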
The error message is a bit different depending on whether you append to an existing `IndexingChain.PerField` object or to a new one.
The recent optimization from apache#14164 interfered in a bad way with a prior optimization.
…ther when the merge is below the min merge size. (apache#14166) This is essentially porting apache#266 to `LogMergePolicy`. By allowing more than `mergeFactor` segments to be merged together for small merges, the merge policy gets a lower write amplification and indexes have fewer small segments.
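One way to read the rule above is sketched here. The selection loop, the names, and the exact stopping condition are assumptions made for illustration; they are not taken from `LogMergePolicy`.

```java
// Hypothetical sketch: let a merge exceed mergeFactor segments while the
// merged size would still be below the minimum merge size.
final class SmallMergeSketch {

  /** Number of leading segments (by array order) to merge together. */
  static int segmentsToMerge(long[] segmentSizes, int mergeFactor, long minMergeSize) {
    long total = 0;
    int count = 0;
    while (count < segmentSizes.length) {
      boolean underFactor = count < mergeFactor;
      boolean stillSmall = total + segmentSizes[count] < minMergeSize;
      if (!underFactor && !stillSmall) {
        break; // mergeFactor reached and the merge is no longer "small"
      }
      total += segmentSizes[count];
      count++;
    }
    return count;
  }
}
```

With six segments of size 1, a `mergeFactor` of 3 and a minimum merge size of 5, this picks four segments rather than three, so the small segments are rewritten fewer times overall.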
After the rewrite all the BaseKnnVectorsFormatTestCase tests pass. There are still some lurking intermittent failures, but the tests pass successfully the majority of the time.

Summary of the most significant changes:

1. Use the flat vectors reader/writer to support the raw float32 vectors and the ordinal to docId mapping. This is similar to how HNSW is supported in Lucene, and keeps the code aligned with how other formats are layered atop each other.
2. The cuVS indices (Cagra, brute force, and HNSW) are stored directly in the format, so they can be mmap'ed directly.
3. Merges are physical: all raw vectors are retrieved and used to create new cuVS indices.
4. A standard KnnCollector is used; there is no need for a special one for cuVS, unless one wants to customise some very specific parameters.

A number of workarounds have been put in place, which will eventually be lifted:

1. Pre-filter and deleted docs over-sample the topK, since cuvs-java does not yet support a pre-filter.
2. Ignore Cagra failures when indexing small numbers of docs, and fail over to just brute force.
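The over-sampling workaround can be pictured as follows. This is a sketch only: the functional interfaces stand in for the real search call and for the filter or live-docs check, and the fixed `overSampleFactor` is an assumption, not a value taken from the change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;
import java.util.function.IntPredicate;

// Hypothetical sketch: emulate a pre-filter by asking for more hits than
// needed and dropping rejected docs afterwards.
final class OverSampleSketch {

  /**
   * unfilteredSearch returns the best n doc IDs, best first, with no filtering.
   * accepted is true for docs that pass the filter and are not deleted.
   */
  static List<Integer> topKWithPostFilter(
      IntFunction<int[]> unfilteredSearch, IntPredicate accepted, int k, int overSampleFactor) {
    int[] candidates = unfilteredSearch.apply(k * overSampleFactor);
    List<Integer> hits = new ArrayList<>(k);
    for (int doc : candidates) {
      if (accepted.test(doc)) {
        hits.add(doc);
        if (hits.size() == k) break;
      }
    }
    return hits; // may hold fewer than k docs if too many candidates were rejected
  }
}
```

The trade-off is that a very selective filter can leave fewer than k hits, which is one reason a native pre-filter would be preferable once available.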
d6c2def to e4e1b15
DO NOT merge; created to make review easier.