Conversation

@philtweir commented Apr 3, 2025

I only recently came to pagefind and it is amazing, thank you!

As in #371, my use-case is geospatial, and using pagefind for text search works really well, except that I need to be able to filter by a map window. In short, we have 30-60k records to search, and while leaflet.js cluster markers can handle that many, geographically windowing the search is critical for the tool to be usable.

Is this an XY problem? The two ideas I had were: (1) adding a (taxonomy-based) locality filter, but given how many localities there are, that would be hard to do in a useful way and unintuitive; (2) searching the full result set against the geographic window (thanks @indus for highlighting flatbush) and intersecting the results. The second does work, but without being able to map pagefind hashes to any lookup outside pagefind, it means loading every matching result via result.data() and using its metadata. I cannot just take the first N results, because they may not contain O(N) matches in the geospatial window: users are likely to search very generic terms in a very small area, so loading every fragment to filter down to 5 results generates a huge amount of traffic and kills the browser. As such, the minimum viable option seems to be adding GetIndexCatalogue so that I can intersect flatbush results with pagefind results before loading any fragments. I am happy to hear simpler alternatives.
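
For concreteness, here is a rough sketch of the intersection approach described above, in the shape it takes today without a catalogue. The `record_id` metadata key, the `records` array, and the page size of 5 are illustrative assumptions on my part, not pagefind API:

```js
import Flatbush from "flatbush";

// Build a spatial index over the app's own records once, up front.
// `records` is assumed to be an array of { id, lon, lat } objects.
const spatial = new Flatbush(records.length);
for (const r of records) {
  spatial.add(r.lon, r.lat, r.lon, r.lat); // point records: min === max
}
spatial.finish();

const pagefind = await import("/pagefind/pagefind.js");

async function searchInWindow(term, [minLon, minLat, maxLon, maxLat]) {
  // IDs of records inside the current map window, from flatbush.
  const inWindow = new Set(
    spatial
      .search(minLon, minLat, maxLon, maxLat)
      .map((i) => String(records[i].id))
  );

  const { results } = await pagefind.search(term);

  // Without a catalogue, intersecting means loading fragments one by one
  // just to read their metadata: the traffic problem described above.
  const matches = [];
  for (const result of results) {
    const data = await result.data();
    if (inWindow.has(data.meta.record_id)) matches.push(data);
    if (matches.length >= 5) break; // may still have loaded most fragments
  }
  return matches;
}
```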

I saw in #371 that this is the same direction @bglw intends to take, but as I could not see a PR and it was a blocker for me, I went ahead and did a minimal patch. There is clearly more work:

  • currently, it returns the entire (encoded) fragment for every page (may or may not be desirable)
  • other index information should perhaps be returned (?)
  • only the calls needed to add it to the node wrapper are present
  • I could not get main to run (fragment requests get sent as single-element arrays containing the fragment hash and then 404, which I hoped someone might have a quick explanation for), so this is tested on top of the v1.3.0 tag
  • currently, there is an issue with lindera (lindera/lindera#467, "Old dictionnary links are no more available breaking the old versions") where the old dictionary links no longer exist, so I could not cargo build --features extended in pagefind/pagefind
  • tests and documentation required

However, I am happy to tidy this up if the direction is good. If there is equivalent work in progress or a better approach, feel free to close this and I will keep using my patch only until it lands!

@philtweir marked this pull request as draft on April 3, 2025, 19:19
@bglw (Member) commented Apr 7, 2025

👋 @philtweir — welcome!

No work is in progress yet for this, so I would love to work with you to get this tidied up and merged in 🙂

Notes on your notes:

  • currently, it returns the entire (encoded) fragment for every page (may or may not be desirable)
    • Probably not desirable, mainly for size. The fragments contain all indexed text for a page, so text-heavy sites will find this particularly large. What data do you need returned for each record? We might want to change the PagefindIndexes type to contain the IntermediaryPageData struct rather than just the encoded_data.
  • other index information should perhaps be returned (?)
    • Indeed! My first thought would be information on the number of index chunks, and what the word boundaries are. I don't know exactly what the utility would be, but someone will, I'm sure 😄
  • only the calls needed to add it to the node wrapper are present
    • We will need this added to the Python wrapper for parity. I'm happy to chip in for some of these tasks also, since updating multiple wrappers is a big ask. (A rough sketch of how the call might look from the Node wrapper follows these notes.)
  • I could not get main to run (fragment requests get sent as single-element arrays containing the fragment hash and then 404, which I hoped someone might have a quick explanation for), so this is tested on top of the v1.3.0 tag
    • Hmm, main should be running. A few commits ago the playground work did change all of the WASM<>JS communication, so my first recommendation would be to clear out all build artifacts and ensure everything is rebuilt on the latest commit. Let me know if you hit this again after that.
  • currently, there is an issue with lindera (lindera/lindera#467, "Old dictionnary links are no more available breaking the old versions") where the old dictionary links no longer exist, so I could not cargo build --features extended in pagefind/pagefind
    • #798 — but feel free to mirror that change (updating charabia to v0.9.3) on this branch.
  • tests and documentation required
    • Happy to help with guidance or implementation for this 🙂
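
For the wrapper discussion, here is a hedged sketch of how the catalogue call might look from the Node wrapper. The method name getIndexCatalogue and the returned shape are speculative, chosen to mirror the Rust-side GetIndexCatalogue rather than to describe a merged API:

```js
import * as pagefind from "pagefind";

const { index } = await pagefind.createIndex();
await index.addDirectory({ path: "public" });

// Hypothetical method: the name and return shape here are assumptions
// mirroring the Rust-side GetIndexCatalogue, not the final wrapper API.
const { catalogue } = await index.getIndexCatalogue();
for (const page of catalogue.pages) {
  // Metadata keyed by page hash would be enough to join against external
  // data (e.g. a flatbush spatial index) before any fragments are fetched.
  console.log(page.hash, page.meta);
}

await index.writeFiles({ outputPath: "public/pagefind" });
```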

@philtweir (Author) commented

Thanks @bglw!

> Probably not desirable, mainly for size. The fragments contain all indexed text for a page, so text-heavy sites will find this particularly large. What data do you need returned for each record? We might want to change the PagefindIndexes type to contain the IntermediaryPageData struct rather than just the encoded_data.

Only the metadata, so this was simply to be least invasive on the Rust end (and this is for batch runs, so I am less concerned about the size). However, keeping the IntermediaryPageData and passing that back instead of the encoded_data seems to address the primary use-case with much less overhead. And given that the metadata can hold arbitrary key-value pairs, the user can store their own ID for looking up the original data if they want.
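
As a minimal sketch of that metadata round-trip: assuming the site declares its own ID at indexing time using pagefind's static metadata syntax (the record_id key is my own naming, not anything pagefind prescribes):

```js
// At indexing time, the page embeds its own ID with a static value, e.g.:
//   <article data-pagefind-body data-pagefind-meta="record_id:1234">
// so the fragment's metadata carries a key the app controls.
const recordsById = new Map(records.map((r) => [String(r.id), r]));

const data = await result.data(); // result from pagefind.search(...)
const original = recordsById.get(data.meta.record_id);
if (original) {
  // e.g. place original.lon / original.lat back on the leaflet map
}
```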

> Indeed! My first thought would be information on the number of index chunks, and what the word boundaries are. I don't know exactly what the utility would be, but someone will, I'm sure 😄

Sounds good!

> We will need this added to the Python wrapper for parity. I'm happy to chip in for some of these tasks also, since updating multiple wrappers is a big ask.

Of course, and thanks. I have no objection to doing that, but it went beyond my immediate need, so I left it as a todo for this discussion (having assumed it would be essential for merge).

> Hmm, main should be running. A few commits ago the playground work did change all of the WASM<>JS communication, so my first recommendation would be to clear out all build artifacts and ensure everything is rebuilt on the latest commit. Let me know if you hit this again after that.

Grand, will do. I did do a build from a clean repo clone the same day that I submitted this, so it is perhaps more likely that I got some steps out of sequence (again, I was keen to check before spending more time, but I know to take another crack at it now).

> #798 — but feel free to mirror that change (updating charabia to v0.9.3) on this branch.

Great, not sure how I missed that. Classic case of searching everywhere except for the problem package name in the issue tracker 🙄.

> tests and documentation required
> Happy to help with guidance or implementation for this 🙂

Thank you!
