-
Notifications
You must be signed in to change notification settings - Fork 502
feat(python): expose search_filter in scanner #5506
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
2f3a37d to
5d27211
Compare
|
hi @BubbleCal , could you help review this when you have time, thanks very much~ |
BubbleCal
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
just need to add some doc strings then feel free to merge
| def filter( | ||
| self, | ||
| filter: Union[ | ||
| str, pa.compute.Expression, FullTextQuery, VectorSearchQuery, dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add some doc strings for this param, esp for dict case
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we call this a "search filter" instead of a "query filter"? I think of queries as both "filtered scans" and "searches". This filter both:
- Only applies to searches (can't use this on a filtered scan)
- Is a search itself (e.g. a VectorSearchQuery or FullTextQuery)
2740551 to
2d0dc31
Compare
2d0dc31 to
12ed9b0
Compare
|
Hi @westonpace , pr has been updated, please review when you have time, thanks very much! |
| filter=VectorSearchQuery( | ||
| "vector", | ||
| np.array([12, 17, 300, 10], dtype=np.float32), | ||
| 5, | ||
| 20, | ||
| True, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't actually know what happens in this case. Is the full text search applied after the vector search? Basically just re-ranking the vector search results?
Or is there some equivalent to "at least one token matches" that is being used here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the full text search applied after the vector search? Basically just re-ranking the vector search results?
Yes, in this case the vector search is used as a filter for full text search. But it doesn't mean the full text search is always applied after the vector search. The Scanner::prefilter determines whether the vector search filter is used before/after full text search.
If prefiter=false, we first do a full text search and get rows with score > 0, then we do refine filter using flat knn to get the top 5 nearest rows.
If prefilter=true, we first do a vector search and get 5 rows, then do a flat full text search based on the 5 rows. The full text search will drop the unmatched rows and re-rank the results by score.
Adding python api to support using fts as a filter for vector search, or using vector_query as a filter for fts search. Related to #4928.