Project status and future challenges #30
Open
Goal
Implement an OpenVINO backend for llama.cpp.
Current Status
The following tools work with the OpenVINO backend on CPU and GPU: `llama-simple`, `llama-run`, `llama-cli`, `llama-server`, `llama-bench`, `llama-perplexity`.

Performance: On GPU, `llama-bench` results are close to Vulkan performance for larger models. See llama.cpp-ov bench (backend buffer).
Quantization:
NPU limitations:

- `llama-server -np > 1` (multiple parallel sequences)
- `llama-perplexity -b 512` or smaller

Key Problems
Other backends operate at the kernel level, while the OpenVINO backend operates at the graph level.
Root cause: OpenVINO is an AOT (ahead-of-time) framework, but llama.cpp doesn't have a graph compilation step.
Problem 1: Static ggml cgraph vs. Dynamic OpenVINO IR
For each token, llama.cpp builds a cgraph and delegates execution to backends. See appendix for an example.
Each cgraph has fixed shapes, but the graph structure changes every inference step:

- `VIEW` offsets for the KV cache depend on the past token count (see appendix for details)

Other backends execute the ops in the cgraph, reading/writing directly to tensor data pointers (addresses in buffers allocated by the backend). The OpenVINO backend must convert the cgraph to an OpenVINO IR graph. Since OpenVINO is AOT, the IR must be compiled before execution, and compilation is expensive: we can't afford to recompile at every inference step.
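To make the contrast concrete, here is a heavily simplified, hypothetical model of kernel-level execution (the names `Tensor`, `Op`, and `execute_cgraph` are illustrative, not llama.cpp's real types): each op reads and writes tensor data pointers directly, so no compilation step is needed and a changing graph costs nothing extra.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of kernel-level execution. Real ggml backends are far richer,
// but the essential property is the same: ops operate directly on data
// pointers inside backend-allocated buffers.
struct Tensor {
    float* data;  // address inside a buffer allocated by the backend
    size_t n;     // number of elements
};

enum class OpType { ADD, SCALE };

struct Op {
    OpType type;
    Tensor src0, src1, dst;
    float  param = 0.0f;  // used by SCALE
};

// A kernel-level backend walks the op list once per inference step and runs
// one kernel per op. A graph-level backend like OpenVINO must instead
// translate the whole list into its own IR before it can execute anything.
void execute_cgraph(const std::vector<Op>& ops) {
    for (const Op& op : ops) {
        switch (op.type) {
            case OpType::ADD:
                for (size_t i = 0; i < op.dst.n; ++i)
                    op.dst.data[i] = op.src0.data[i] + op.src1.data[i];
                break;
            case OpType::SCALE:
                for (size_t i = 0; i < op.dst.n; ++i)
                    op.dst.data[i] = op.src0.data[i] * op.param;
                break;
        }
    }
}
```

Because nothing is compiled, this style tolerates a different op list every step; that is exactly the flexibility an AOT graph compiler gives up in exchange for whole-graph optimization.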
Current solution: Build a dynamic IR with symbolic shapes (e.g., `inp_tokens` with shape `[1,1,1,-1]`) for CPU and GPU, and extract changing values as extra inputs (e.g., `attention_size` for slicing the KV cache). The compiled graph is then cached.

Limitations:
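The caching idea above can be sketched as follows (hypothetical `GraphCache`/`CompiledGraph` names, not the backend's real code): because the sequence dimension is symbolic (`-1`), every decode step produces the same structural signature, so the expensive compile happens once and per-step values travel as runtime inputs instead of baked-in shapes.

```cpp
#include <map>
#include <string>

// Stand-in for an OpenVINO compiled model; only the bookkeeping matters here.
struct CompiledGraph {
    std::string signature;     // structural key; dynamic dims encoded as -1
    int         compile_index; // which compilation produced this graph
};

class GraphCache {
public:
    // Returns a cached compiled graph, "compiling" only on the first miss
    // for a given structural signature.
    const CompiledGraph& get_or_compile(const std::string& signature) {
        auto it = cache_.find(signature);
        if (it == cache_.end()) {
            ++compiles_;  // in the real backend this is the expensive step
            it = cache_.emplace(signature,
                                CompiledGraph{signature, compiles_}).first;
        }
        return it->second;
    }
    int compiles() const { return compiles_; }

private:
    std::map<std::string, CompiledGraph> cache_;
    int compiles_ = 0;
};
```

A signature such as `"inp_tokens:[1,1,1,-1]"` is shared by every token count, so one hundred decode steps trigger one compile; only a genuinely different graph structure pays the compilation cost again.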
Problem 2: Buffer Management
Other backends allocate buffers for weights, KV cache, and compute tensors; kernels read/write directly to these buffers.
In the OpenVINO backend:
- `ov::Constant` nodes pointing to the weight buffers are created and used in the IR graph

Appendix
cgraph example
View dimensions of `cache_k`/`cache_v` change based on the past token length.

When the past token length crosses 256 tokens, the shapes of `cache_k_l0` and `cache_v_l0` change from `[64, 8, 256, 1]` to `[64, 8, 512, 1]`.
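The 256-to-512 jump is consistent with the KV view being padded to the next multiple of a fixed step; a step of 256 is an assumption inferred from the shapes above, and the helper names below are illustrative:

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper reproducing the shape transition described above:
// the view length of cache_k/cache_v is the past token length rounded up
// to the next multiple of `step` (256, inferred from the
// [64, 8, 256, 1] -> [64, 8, 512, 1] jump).
constexpr int64_t kv_view_len(int64_t n_past, int64_t step = 256) {
    return ((n_past + step - 1) / step) * step;
}

// Full view shape [head_dim, n_head_kv, padded_len, 1] for one layer,
// using the head_dim=64 / n_head_kv=8 values from the example.
constexpr std::array<int64_t, 4> kv_view_shape(int64_t n_past) {
    return {64, 8, kv_view_len(n_past), 1};
}
```

Each new padded length changes a static view shape, which is exactly the kind of structural change that invalidates a previously compiled graph and forces another compilation under the caching scheme described in Problem 1.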