
Conversation

@chenghao-guo
Collaborator

close #66

Depends on this PR: lance-format/lance#5117

  • New Public API: lance_ray.create_index is introduced as the primary entry point for building distributed vector indices; it currently supports distributed IVF_FLAT, IVF_SQ, and IVF_PQ indices.

The new create_index function orchestrates a multi-phase workflow:

  1. Global Training: It uses the existing lance.IndicesBuilder to train IVF centroids and, if applicable, PQ codebooks on a sample of the dataset.
  2. Distributed Task Execution: Per-fragment index building tasks are distributed across a pool of Ray workers. Each worker receives the pre-trained models and processes a subset of the data fragments.
  3. Metadata Finalization: After all fragment-level indices are built, the main process merges the metadata and commits the new index to the dataset manifest.
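The three phases above can be sketched with a toy driver. This is purely illustrative control flow using stand-in function names, not the real implementation (which lives in lance_ray.create_index, lance.IndicesBuilder, and the Ray workers); the fragment count (253) and num_partitions (1000) mirror the test setup described below.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def train_ivf_centroids(sample, num_partitions):
    # Phase 1 (global training): derive centroids from a sample of rows.
    # Stand-in for lance.IndicesBuilder's IVF training.
    return random.sample(sample, num_partitions)

def build_fragment_index(fragment_id, centroids):
    # Phase 2 (per-fragment task): each worker receives the pre-trained
    # model and returns metadata for its partial index.
    return {"fragment": fragment_id, "partitions": len(centroids)}

def create_index_sketch(fragments, sample, num_partitions, workers=4):
    centroids = train_ivf_centroids(sample, num_partitions)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(
            lambda f: build_fragment_index(f, centroids), fragments))
    # Phase 3 (finalization): merge per-fragment metadata and "commit" it.
    return {"fragments_indexed": len(partial), "num_partitions": num_partitions}

manifest = create_index_sketch(fragments=range(253),
                               sample=list(range(10_000)),
                               num_partitions=1000)
print(manifest)  # {'fragments_indexed': 253, 'num_partitions': 1000}
```

The key design point is that only phase 2 fans out to workers; training and the final commit stay in the driver process, which matters for the IVF_PQ results discussed later.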

@github-actions github-actions bot added the enhancement New feature or request label Dec 30, 2025
@chenghao-guo chenghao-guo self-assigned this Dec 30, 2025
@chenghao-guo chenghao-guo marked this pull request as ready for review December 30, 2025 07:04
@chenghao-guo
Collaborator Author

chenghao-guo commented Dec 30, 2025

Testing Environment

We conducted tests using the imagenet_train dataset, which consists of 1,281,167 images. The images were embedded with a 2048-dimension doubao-embedding model, yielding a dataset of approximately 140GB after conversion to Lance format.

  • Single-machine index building

    • Machine: 4 cores, 16GB RAM
    • Storage: object-store for original dataset
  • Distributed setup

    • Topology: 1 head node + 4 worker nodes
    • Each node: 4 cores, 16GB RAM
    • Purpose: use inexpensive machines so that runtimes are long enough to measure, and evaluate how much distributed index building improves performance on low-spec hardware
  • Index building parameters

    • num_partitions = 1000
    • Data split into 253 fragments
    • Distributed processing: 4 workers in parallel

Testing Results

  • Unit: m = minute

IVF_SQ Group (Tested on S3)

| Configuration | Global Train IVF Time | Index Builder Time | Total Time |
|---|---|---|---|
| Distributed (1 + 4) * 4c-16GB | 11.25m | 25.7m | 36.95m |
| Single Machine 4c-16GB | 11.5m | 137.5m | 149.0m |

IVF_FLAT Group

| Configuration | Global Train IVF Time | Index Builder Time | Total Time |
|---|---|---|---|
| Distributed (1 + 4) * 4c-16GB | 10.68m | 25.82m | 36.5m |
| Single Machine 4c-16GB | 11.8m | OOM Killed | OOM Killed |

IVF_PQ Group

| Configuration | Global Train IVF Time | Global PQ Training Time | Index Builder Time | Total Time |
|---|---|---|---|---|
| Distributed (1 + 4) * 4c-16GB | 10.53m | 180.4m | 48m | 238.9m |
| Single Machine 4c-16GB | 11.48m | 171m | 460.5m | 642.98m |

Overall Observation

Because the workload runs against S3, network speed and S3 I/O throughput vary during the process, and single-machine runs can take hours, mainly for the reasons below.

The distributed index was built using 4-core, 16GB machines (4 workers in total). We also checked whether performance could scale linearly. The main reasons for the performance acceleration are:

  1. Avoiding "backpressure throttle exceeded in IO" errors on a single machine.
    We observed that when building an index on a low-performance machine, the lance process repeatedly reported:

    "backpressure throttle exceeded in IO"

  2. Distributed assignment parallelizes the work across the CPU resources of the worker nodes.

Conclusion

The test was carried out with 4-core, 16GB machines and an S3 object store. The results clearly show that the distributed setup significantly outperforms the single-machine setup in index-building time; the Index Builder Time metric in particular scales linearly or better.

We can estimate the speedup as the single-machine total time divided by the distributed total time. For example, IVF_SQ: 149.0m / 36.95m ≈ 4.03×, which is nearly linear scaling with 4 workers.
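The speedup figures follow directly from the total times in the tables above (IVF_FLAT is excluded because the single-machine run was OOM-killed):

```python
# Total times in minutes, taken from the result tables above.
single = {"IVF_SQ": 149.0, "IVF_PQ": 642.98}       # single machine 4c-16GB
distributed = {"IVF_SQ": 36.95, "IVF_PQ": 238.9}   # (1 + 4) * 4c-16GB

# Speedup = single-machine total / distributed total.
speedup = {k: round(single[k] / distributed[k], 2) for k in single}
print(speedup)  # {'IVF_SQ': 4.03, 'IVF_PQ': 2.69}
```

IVF_SQ approaches the ideal 4× for 4 workers, while IVF_PQ falls short for the reason discussed under Limitation.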


Limitation

Since PQ (Product Quantization) depends on:

  • global training of the Inverted File (IVF), and
  • PQ codebook training,

both of which are carried out on a single machine, the acceleration effect of IVF_PQ may be unsatisfactory when PQ codebook training takes a long time. It remains better suited to larger tables, where the parallelized fragment-building phase dominates.
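This limitation is a direct instance of Amdahl's law: the single-machine training phases bound the achievable speedup no matter how many workers are added. Plugging in the IVF_PQ numbers from the table above:

```python
# Single-machine IVF_PQ times in minutes, from the table above.
serial = 11.48 + 171.0   # global IVF training + PQ codebook training
total = 642.98           # single-machine total time

# Amdahl's law: with unlimited workers the serial phases remain,
# so the speedup can never exceed total / serial.
max_speedup = round(total / serial, 2)
print(max_speedup)  # 3.52
```

The observed 2.69× for IVF_PQ is therefore not far from the ~3.52× ceiling imposed by the serial training phases.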

The most well-balanced method appears to be IVF_SQ (Inverted File with Scalar Quantization): it generally guarantees good recall while fully benefiting from distributed parallelism.


Development

Successfully merging this pull request may close these issues.

support build IVF index distributively in ray