
Conversation

@chenghao-guo
Collaborator

close #66

Depends on this PR: lance-format/lance#5117

  • New Public API: lance_ray.create_index is introduced as the primary entry point for building distributed vector indices; it currently supports distributed IVF_FLAT, IVF_SQ, and IVF_PQ indices.

The new create_index function orchestrates a multi-phase workflow:

  1. Global Training: It uses the existing lance.IndicesBuilder to train IVF centroids and, if applicable, PQ codebooks on a sample of the dataset.
  2. Distributed Task Execution: Per-fragment index building tasks are distributed across a pool of Ray workers. Each worker receives the pre-trained models and processes a subset of the data fragments.
  3. Metadata Finalization: After all fragment-level indices are built, the main process merges the metadata and commits the new index to the dataset manifest.
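The three phases above can be sketched with a toy driver. This is purely illustrative control flow using stand-in function names, not the real implementation (which lives in lance_ray.create_index, lance.IndicesBuilder, and the Ray workers); the fragment count (253) and num_partitions (1000) mirror the test setup described below.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def train_ivf_centroids(sample, num_partitions):
    # Phase 1 (global training): derive centroids from a sample of rows.
    # Stand-in for lance.IndicesBuilder's IVF training.
    return random.sample(sample, num_partitions)

def build_fragment_index(fragment_id, centroids):
    # Phase 2 (per-fragment task): each worker receives the pre-trained
    # model and returns metadata for its partial index.
    return {"fragment": fragment_id, "partitions": len(centroids)}

def create_index_sketch(fragments, sample, num_partitions, workers=4):
    centroids = train_ivf_centroids(sample, num_partitions)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(
            lambda f: build_fragment_index(f, centroids), fragments))
    # Phase 3 (finalization): merge per-fragment metadata and "commit" it.
    return {"fragments_indexed": len(partial), "num_partitions": num_partitions}

manifest = create_index_sketch(fragments=range(253),
                               sample=list(range(10_000)),
                               num_partitions=1000)
print(manifest)  # {'fragments_indexed': 253, 'num_partitions': 1000}
```

The key design point is that only phase 2 fans out to workers; training and the final commit stay in the driver process, which matters for the IVF_PQ results discussed later.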

@github-actions github-actions bot added the enhancement New feature or request label Dec 30, 2025
@chenghao-guo chenghao-guo self-assigned this Dec 30, 2025
@chenghao-guo chenghao-guo marked this pull request as ready for review December 30, 2025 07:04
@chenghao-guo
Collaborator Author

chenghao-guo commented Dec 30, 2025

Testing Environment

We conducted tests using the imagenet_train dataset, which consists of 1,281,167 images. The images were embedded with a 2048-dimension doubao-embedding model, yielding a dataset of approximately 140GB after conversion to Lance format.

  • Single-machine index building

    • Machine: 4 cores, 16GB RAM
    • Storage: object-store for original dataset
  • Distributed setup

    • Topology: 1 head node + 4 worker nodes
    • Each node: 4 cores, 16GB RAM
    • Purpose: use inexpensive machines so that runtimes are long enough to measure, and evaluate how much distributed index building improves performance on low-spec hardware
  • Index building parameters

    • num_partitions = 1000
    • Data split into 253 fragments
    • Distributed processing: 4 workers in parallel

Testing Results

  • Unit: m = minute

IVF_SQ Group (Tested on S3)

| Configuration | Global Train IVF Time | Index Builder Time | Total Time |
|---|---|---|---|
| Distributed (1 + 4) * 4c-16GB | 11.25m | 25.7m | 36.95m |
| Single Machine 4c-16GB | 11.5m | 137.5m | 149.0m |

IVF_FLAT Group

| Configuration | Global Train IVF Time | Index Builder Time | Total Time |
|---|---|---|---|
| Distributed (1 + 4) * 4c-16GB | 10.68m | 25.82m | 36.5m |
| Single Machine 4c-16GB | 11.8m | OOM Killed | OOM Killed |

IVF_PQ Group

| Configuration | Global Train IVF Time | Global PQ Training Time | Index Builder Time | Total Time |
|---|---|---|---|---|
| Distributed (1 + 4) * 4c-16GB | 10.53m | 180.4m | 48m | 238.9m |
| Single Machine 4c-16GB | 11.48m | 171m | 460.5m | 642.98m |

Overall Observation

Because the workload runs against S3, network speed and S3 I/O throughput vary during the process, and single-machine runs can take hours, mainly for the reasons below.

The distributed index was built using 4-core, 16GB machines (4 workers in total). We also checked whether performance could scale linearly. The main reasons for the performance acceleration are:

  1. Avoiding "backpressure throttle exceeded in IO" errors on a single machine.
    We observed that when building an index on a low-performance machine, the lance process repeatedly reported:

    "backpressure throttle exceeded in IO"

  2. Distributed assignment parallelizes the work across the CPU resources of the worker nodes.

Conclusion

The test was carried out with 4-core, 16GB machines and an S3 object store. The results clearly show that the distributed setup significantly outperforms the single-machine setup in index-building time; the Index Builder Time metric in particular scales linearly or better.

We can estimate the speedup as the single-machine total time divided by the distributed total time. For example, IVF_SQ: 149.0m / 36.95m ≈ 4.03×, which is nearly linear scaling with 4 workers.
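The speedup figures follow directly from the total times in the tables above (IVF_FLAT is excluded because the single-machine run was OOM-killed):

```python
# Total times in minutes, taken from the result tables above.
single = {"IVF_SQ": 149.0, "IVF_PQ": 642.98}       # single machine 4c-16GB
distributed = {"IVF_SQ": 36.95, "IVF_PQ": 238.9}   # (1 + 4) * 4c-16GB

# Speedup = single-machine total / distributed total.
speedup = {k: round(single[k] / distributed[k], 2) for k in single}
print(speedup)  # {'IVF_SQ': 4.03, 'IVF_PQ': 2.69}
```

IVF_SQ approaches the ideal 4× for 4 workers, while IVF_PQ falls short for the reason discussed under Limitation.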


Limitation

Since PQ (Product Quantization) depends on:

  • global training of the Inverted File (IVF), and
  • PQ codebook training,

both of which are carried out on a single machine, the acceleration effect of IVF_PQ may be unsatisfactory when PQ codebook training takes a long time. It remains better suited to larger tables, where the parallelized fragment-building phase dominates.
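This limitation is a direct instance of Amdahl's law: the single-machine training phases bound the achievable speedup no matter how many workers are added. Plugging in the IVF_PQ numbers from the table above:

```python
# Single-machine IVF_PQ times in minutes, from the table above.
serial = 11.48 + 171.0   # global IVF training + PQ codebook training
total = 642.98           # single-machine total time

# Amdahl's law: with unlimited workers the serial phases remain,
# so the speedup can never exceed total / serial.
max_speedup = round(total / serial, 2)
print(max_speedup)  # 3.52
```

The observed 2.69× for IVF_PQ is therefore not far from the ~3.52× ceiling imposed by the serial training phases.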

The most well-balanced method appears to be IVF_SQ (Inverted File with Scalar Quantization): it generally guarantees good recall while fully benefiting from distributed parallelism.


Development

Successfully merging this pull request may close these issues.

support build IVF index distributively in ray