(FOR CI) Xsn/server data race #63
base: master
Conversation
Walkthrough

Replaced `server_context::get_info()` with an expanded `server_context::get_meta()`; moved response generation to queue-based constructors and a `create_response()` factory; changed LoRA handling from a vector to `std::map<int, float>`; made `tokens_to_str` vocab-centric; introduced request ownership for streaming HTTP responses and priority posting for tasks.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant HTTP as HTTP Handler
    participant Routes as server_routes
    participant Queue as Task Queue
    participant Worker as Background Worker
    participant ResGen as server_res_generator
    Client->>HTTP: POST /v1/completions (streaming)
    HTTP->>HTTP: allocate request (unique_ptr)
    HTTP->>Routes: routes.create_response(bypass_sleep)
    Routes->>ResGen: create response generator (queues + sleep params)
    HTTP->>ResGen: post_task(task, front=false) -- moves request into handler
    HTTP->>Client: send initial headers (200)
    Worker->>Queue: fetch next task
    Worker->>Worker: process task (use model/vocab, per-request LORA)
    Worker->>Queue: post_result(result with index)
    ResGen->>Queue: wait_for_all / has_next
    Queue-->>ResGen: deliver ordered results by index
    ResGen->>Client: stream chunk(s)
    alt more chunks
        ResGen->>Queue: wait_for_next
        Queue-->>ResGen: next result
        ResGen->>Client: stream chunk
    end
    Note right of ResGen: request shared ownership captured by streaming lambda<br/>and reset on completion
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tools/server/server-task.cpp (1)
1151-1157: Static variable in `to_json_anthropic` causes state leakage across requests.

The `static bool text_block_started` variable at line 1154 persists across different requests/invocations, which can cause incorrect behavior when handling multiple concurrent streaming requests. Each request should have its own state.

🔎 Proposed fix

Consider moving this state into the result object or tracking it differently. The state should be per-request, not global. Since this is in a partial result's `to_json`, the tracking should be part of the result state or derived from `n_decoded` and `oaicompat_msg_diffs` content.

```diff
 json server_task_result_cmpl_partial::to_json_anthropic() {
     json events = json::array();
     bool first = (n_decoded == 1);
-    static bool text_block_started = false;
+    // Determine if text block was started by checking if we have prior content
+    // This needs to be tracked in the result state, not as a static variable
+    bool text_block_started = false; // TODO: track this in task_result_state
     if (first) {
         text_block_started = false;
```

Note: A complete fix requires tracking `text_block_started` in the per-request state (e.g., in `task_result_state`).
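As a rough illustration of that direction, here is a minimal sketch — the struct and member names (`task_result_state`, `state`) are assumptions for illustration, not the actual patch — in which the flag lives in per-request state:

```cpp
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Hypothetical per-request state object, created once per request.
struct task_result_state {
    bool text_block_started = false; // resets naturally with each new request
};

struct server_task_result_cmpl_partial {
    int                 n_decoded = 0;
    task_result_state * state     = nullptr; // owned by the request, not global

    json to_json_anthropic() {
        json events = json::array();
        if (n_decoded == 1) {
            state->text_block_started = false; // first chunk of this request
        }
        // ... append events, setting state->text_block_started = true once the
        //     first text delta for this request has been emitted ...
        return events;
    }
};
```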
🧹 Nitpick comments (3)
tools/server/server-queue.cpp (1)
390-408: Correct batch result handling with index validation.

The `wait_for_all` implementation:
- Pre-allocates the results vector to the correct size (line 393)
- Uses `result->index` for proper placement (line 404)
- Includes bounds check and duplicate detection assertions (lines 405-406)

The `batch_res.results.clear()` on line 392 is redundant since `results` is default-initialized empty, but it's harmless.

🔎 Optional: Remove redundant clear()

```diff
 server_response_reader::batch_response server_response_reader::wait_for_all(const std::function<bool()> & should_stop) {
     batch_response batch_res;
-    batch_res.results.clear();
     batch_res.results.resize(id_tasks.size());
```

tools/server/server-common.cpp (1)
118-129: Consider validating lora id before insertion.

The function accepts `id = -1` as a default when parsing fails, but inserts it into the map without validation. A negative id might indicate a parsing error or invalid input.

🔎 Proposed validation

```diff
 std::map<int, float> parse_lora_request(const json & data) {
     std::map<int, float> lora;
     // set value
     for (const auto & entry : data) {
         int id = json_value(entry, "id", -1);
+        if (id < 0) {
+            throw std::invalid_argument("lora entry must have a valid 'id' field");
+        }
         float scale = json_value(entry, "scale", 0.0f);
         lora[id] = scale;
     }
     return lora;
 }
```

tools/server/server-context.cpp (1)
3071-3082: Consider a cleaner pattern for preventing accidental ctx_server access.

The `bool server_ctx;` shadowing trick (lines 3076-3078) is creative but fragile. A future maintainer might remove it thinking it's dead code. Consider alternative approaches:
- Extract sleeping-safe handlers into a separate helper that doesn't capture `ctx_server`
- Use a static analysis annotation or `[[maybe_unused]]` with a more descriptive name

The current approach works but could be more robust.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- tools/cli/cli.cpp
- tools/server/server-common.cpp
- tools/server/server-common.h
- tools/server/server-context.cpp
- tools/server/server-context.h
- tools/server/server-http.cpp
- tools/server/server-queue.cpp
- tools/server/server-queue.h
- tools/server/server-task.cpp
- tools/server/server-task.h
- tools/server/server.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following the .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
- tools/server/server.cpp
- tools/server/server-queue.h
- tools/server/server-http.cpp
- tools/cli/cli.cpp
- tools/server/server-common.cpp
- tools/server/server-task.h
- tools/server/server-common.h
- tools/server/server-context.cpp
- tools/server/server-queue.cpp
- tools/server/server-task.cpp
- tools/server/server-context.h
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
- tools/server/server.cpp
- tools/server/server-queue.h
- tools/server/server-http.cpp
- tools/cli/cli.cpp
- tools/server/server-common.cpp
- tools/server/server-task.h
- tools/server/server-common.h
- tools/server/server-context.cpp
- tools/server/server-queue.cpp
- tools/server/server-task.cpp
- tools/server/server-context.h
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-12-20T21:18:14.768Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 62
File: tools/server/server-context.cpp:591-603
Timestamp: 2025-12-20T21:18:14.768Z
Learning: In ngxson/llama.cpp server implementation, holding mutex_tasks during callback_sleeping_state() (which calls handle_sleeping_state) is intentional behavior. No tasks should be allowed to be pushed into the queue during sleeping state transition, even though this blocks all queue operations (post, defer, wait_until_no_sleep) for the duration of model unload/reload (potentially minutes).
Applied to files:
- tools/server/server-queue.h
- tools/server/server-context.cpp
- tools/server/server-queue.cpp
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility
Applied to files:
- tools/server/server-common.cpp
- tools/server/server-common.h
- tools/server/server-task.cpp
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to include/llama.h : Exercise careful consideration when making changes to the public API in `include/llama.h`, as API stability is critical
Applied to files:
tools/server/server-common.cpp
🧬 Code graph analysis (4)
tools/server/server-queue.h (1)
tools/server/server-queue.cpp (5)
- `post_task` (332-339)
- `post_task` (332-332)
- `task` (421-421)
- `post_tasks` (341-351)
- `post_tasks` (341-341)
tools/server/server-common.h (1)
tools/server/server-common.cpp (8)
- `parse_lora_request` (118-129)
- `parse_lora_request` (118-118)
- `tokens_to_str` (1426-1433)
- `tokens_to_str` (1426-1426)
- `tokens_to_str` (1435-1438)
- `tokens_to_str` (1435-1435)
- `tokens_to_str` (1440-1442)
- `tokens_to_str` (1440-1440)
tools/server/server-context.cpp (3)
tools/server/server-common.cpp (8)
- `are_lora_equal` (131-144)
- `are_lora_equal` (131-133)
- `lora_should_clear_cache` (104-116)
- `lora_should_clear_cache` (104-106)
- `oaicompat_chat_params_parse` (832-1071)
- `oaicompat_chat_params_parse` (832-835)
- `format_prompt_rerank` (1639-1681)
- `format_prompt_rerank` (1639-1644)

common/common.cpp (6)
- `lora` (1171-1173)
- `lora` (1171-1171)
- `common_token_to_piece` (1550-1554)
- `common_token_to_piece` (1550-1550)
- `common_token_to_piece` (1556-1570)
- `common_token_to_piece` (1556-1556)

tools/server/server-common.h (1)
- `json_value` (35-47)
tools/server/server-task.cpp (1)
tools/server/server-common.cpp (2)
- `parse_lora_request` (118-129)
- `parse_lora_request` (118-118)
🔇 Additional comments (49)
tools/server/server-common.h (2)
110-110: LGTM! Simplified LoRA parsing API.

The signature change from `std::vector<common_adapter_lora_info>` to `std::map<int, float>` simplifies the API by directly returning an id-to-scale mapping. This aligns with the PR's goal of streamlining LoRA handling and is consistent with the implementation in `server-common.cpp` (lines 117-128).
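For illustration, a self-contained sketch of how the map-based API is consumed — the body mirrors the reviewed implementation, but `entry.value(...)` from `nlohmann::json` stands in for the project's `json_value` helper:

```cpp
#include <map>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Mirrors the reviewed signature; entry.value(...) stands in for json_value.
std::map<int, float> parse_lora_request(const json & data) {
    std::map<int, float> lora;
    for (const auto & entry : data) {
        int   id    = entry.value("id", -1);
        float scale = entry.value("scale", 0.0f);
        lora[id] = scale;
    }
    return lora;
}

int main() {
    json body = json::parse(R"([ {"id": 0, "scale": 0.5}, {"id": 2, "scale": 1.0} ])");
    std::map<int, float> lora = parse_lora_request(body);
    // lora == { {0, 0.5f}, {2, 1.0f} }; adapter 1 is simply absent from the map
}
```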
326-326: LGTM! Vocab-centric token conversion overload.

Adding this overload that accepts `const llama_vocab *` directly enables token-to-string conversion without requiring a full context pointer. This supports the PR's thread-safety improvements by allowing conversion using vocabulary metadata rather than context state.

tools/server/server-queue.h (4)
8-8: LGTM!

Required include for the new `std::vector<std::function<void(bool)>>` callback collection.
29-31: LGTM! Multiple sleeping state callbacks support.

Changing from a single callback to a vector allows multiple components to register for sleeping state notifications. The comment at lines 84-85 correctly notes these must be registered before `start_loop()` is called, which ensures thread-safety during iteration.
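A minimal sketch of the pattern — member and method names here are assumptions based on the review text:

```cpp
#include <functional>
#include <vector>

struct server_queue {
    // one entry per registered observer; iterated when the sleep state changes
    std::vector<std::function<void(bool)>> callback_sleeping_state;

    // must be called before start_loop(): the vector is only read afterwards
    void on_sleeping_state(std::function<void(bool)> cb) {
        callback_sleeping_state.push_back(std::move(cb));
    }

    void notify_sleeping_state(bool sleeping) {
        for (const auto & cb : callback_sleeping_state) {
            cb(sleeping); // every registered component observes the transition
        }
    }
};
```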
100-102: LGTM!

Using `push_back` to accumulate callbacks aligns with the vector-based storage.
177-180: LGTM! Priority task posting support.

The new `front` parameter enables high-priority task insertion, useful for time-sensitive operations. The implementation in `server-queue.cpp` correctly handles index assignment for batched tasks.

tools/server/server-task.h (9)
9-9: LGTM!

Required include for `std::map` usage in LoRA adapter handling.

27-27: LGTM!

The new task type `SERVER_TASK_TYPE_GET_LORA` complements the existing `SERVER_TASK_TYPE_SET_LORA` for read operations.

65-65: LGTM! Map-based LoRA configuration.

Using `std::map<int, float>` for adapter ID to scale mapping is cleaner than the previous vector-based approach and aligns with the simplified `parse_lora_request` API.

110-113: Acknowledged TODO for future refactoring.

The `size_t` type is appropriate for array indexing. The TODO noting plans to move index mapping to `response_reader` is a reasonable future improvement.

145-145: LGTM!

Consistent with the `task_params.lora` type change to `std::map<int, float>`.

155-159: Good change for thread-safety: decoupling from context.

Replacing `llama_context * ctx` with `const llama_vocab * vocab` plus `int n_ctx_slot` reduces dependency on the full context object. This supports the PR's goal of preventing data races from HTTP threads by using stable vocabulary data rather than mutable context state.

170-178: LGTM! Simplified create_child signature.

Removing the `idx` parameter simplifies the API since index management is now handled elsewhere (at task posting time, per the `server-queue.cpp` implementation).

219-221: Acknowledged TODO for future refactoring.

Moving the `index` field to the base `server_task_result` struct centralizes batched task mapping. The TODO indicates this is a stepping stone toward a cleaner mapping approach in `response_reader`.

445-454: LGTM! New result type for GET_LORA tasks.

The `server_task_result_get_lora` struct properly encapsulates LoRA adapter information, including the info, invocation string, and tokens. The nested `lora` struct provides good organization.

tools/server/server-context.h (6)
12-41: LGTM! Expanded metadata structure.

The renamed `server_context_meta` struct (from `server_context_info`) provides comprehensive model and vocabulary metadata. This expansion supports the meta-driven routing approach and reduces the need to access context state directly from HTTP handlers.

60-61: Good thread-safety documentation.

The comment clarifying that `get_llama_context()` is not thread-safe and should only be used from the main thread is valuable for preventing accidental misuse.

66-68: LGTM! Thread-safety documented.

The `get_meta()` accessor replaces `get_info()` with appropriate thread-safety documentation.

76-83: Simplified constructor and guarded update pattern.

The simplified constructor and `update_meta()` method provide a clear pattern for metadata initialization. The comment at line 80 correctly documents the thread-safety constraint.

110-115: LGTM!

Updated signature to include the `files` parameter for multimodal support.

121-129: LGTM! Proper encapsulation with const correctness.

- Using `std::unique_ptr<const server_context_meta>` allows late initialization while preserving const semantics for the metadata.
- The `const server_context_impl &` reference enforces read-only access to the server context from routes.
- Adding `queue_tasks` and `queue_results` references enables proper task/response flow without exposing the full context.

tools/cli/cli.cpp (1)
219-219: LGTM!

Updated to use the new `get_meta()` API. All accessed fields (`has_inp_image`, `has_inp_audio`, `build_info`, `model_name`) are available in the expanded `server_context_meta` structure.

tools/server/server.cpp (2)
122-122: LGTM - Cleaner routes initialization.

Removing the readiness predicate from the constructor simplifies the initialization and decouples route creation from HTTP readiness state. The route handlers will be protected by the middleware that checks `is_ready` before dispatching requests.
255-256: Good fix for data race prevention.

Calling `routes.update_meta(ctx_server)` after a successful model load ensures that metadata is populated only when the model context is fully initialized. Combined with `is_ready.store(true)` on the next line, this guarantees clients won't access incomplete metadata.
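A compilable sketch of that ordering — the type shapes are stand-ins; only the call sequence is taken from the review:

```cpp
#include <atomic>

struct server_context { bool load_model(const char * path); };
struct server_routes  { void update_meta(const server_context & ctx); };

bool startup(server_context & ctx_server, server_routes & routes, std::atomic<bool> & is_ready) {
    if (!ctx_server.load_model("model.gguf")) {
        return false;                // never publish metadata for a failed load
    }
    routes.update_meta(ctx_server);  // snapshot metadata while still single-threaded
    is_ready.store(true);            // HTTP threads may only read `meta` after this
    return true;
}
```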
tools/server/server-task.cpp (5)

35-37: LGTM - Clean map-based lora serialization.

The iteration over `this->lora` as a map with `{first, second}` pairs is consistent with the new `std::map<int, float>` representation from `parse_lora_request`.
147-151: Good API change to vocab-centric approach.

Changing from `llama_context *` to `llama_vocab *` with an explicit `n_ctx_slot` parameter:
- Reduces coupling to full context object
- Prevents potential data races by not accessing context during parameter parsing
- Makes the slot-specific context size explicit rather than derived
This aligns with the PR objective of preventing data races from HTTP threads.
222-230: Simplified lora parsing.

The new `parse_lora_request` returns a map directly from the JSON data without requiring the base adapter vector. The empty-map fallback on line 229 is appropriate when no lora is specified.
242-248: Correct use of n_ctx_slot for penalty defaults.

Using `n_ctx_slot` instead of `llama_n_ctx(ctx)` ensures the penalty window defaults are based on the slot's context size, which is semantically correct for per-slot processing.
1325-1347: LGTM - Well-structured lora result serialization.

The `to_json()` implementation properly:
- Iterates through loras with correct indexing
- Includes all required fields (id, path, scale, task_name, prompt_prefix)
- Conditionally adds alora fields only when tokens are present
tools/server/server-http.cpp (3)
181-195: Correct 503 handling during server loading.

Returning 503 with a JSON error for non-HTML endpoints when the server is not ready is the correct behavior. The comment explains the rationale: preventing data races and inconsistent states by blocking all endpoint access during loading.
336-371: Good request lifetime management for streaming.

The changes properly address the request/response lifecycle during streaming:
- `server_http_req_ptr` (unique_ptr) provides clear ownership semantics
- Both request and response are converted to `shared_ptr` for the streaming path (lines 345-346)
- The `on_complete` lambda captures both and resets them to ensure proper destruction order

This fixes the data race where the httplib request object could be destroyed before the response stream completes.
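A sketch of that hand-off — the type definitions here are simplified stand-ins for the real `server_http_req`/`server_http_res`:

```cpp
#include <functional>
#include <memory>

struct server_http_req { /* headers, body, should_stop, ... */ };
struct server_http_res { /* status, streaming callback, ... */ };

using server_http_req_ptr = std::unique_ptr<server_http_req>;
using server_http_res_ptr = std::unique_ptr<server_http_res>;

void process_handler_response(server_http_req_ptr req, server_http_res_ptr res) {
    // promote to shared_ptr so streaming callbacks can extend both lifetimes
    std::shared_ptr<server_http_req> req_sp = std::move(req);
    std::shared_ptr<server_http_res> res_sp = std::move(res);

    auto on_complete = [req_sp, res_sp]() mutable {
        res_sp.reset(); // destroy the response first ...
        req_sp.reset(); // ... then the request, fixing the ordering race
    };
    // ... register the chunked content provider and on_complete with the HTTP
    //     library; the captures keep req/res alive until streaming finishes ...
    (void) on_complete;
}
```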
373-398: LGTM - Consistent request allocation pattern.

Both `get` and `post` handlers now:
- Allocate the request as `unique_ptr`
- Dereference it for the handler call
- Move ownership into `process_handler_response`

This ensures the request outlives any streaming response.
tools/server/server-queue.cpp (3)
166-181: LGTM - Multi-callback sleeping state handling.

Iterating over the `callback_sleeping_state` collection allows multiple observers to be notified when entering/exiting the sleep state. This is cleaner than a single callback when multiple components need to react to state changes.
332-351: Proper task index initialization and front parameter.

Key improvements:
- `task.index = 0` for a single task (line 334)
- `tasks[i].index = i` for batch tasks (line 346)
- The `front` parameter enables priority posting for cancellation tasks

The explicit index assignment ensures results can be correctly associated with their originating tasks.
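The mechanics can be sketched like this — the queue layout and names are assumptions; only the index/front behavior is taken from the review:

```cpp
#include <deque>
#include <iterator>
#include <utility>
#include <vector>

struct server_task { int id = 0; size_t index = 0; };

struct task_queue {
    std::deque<server_task> pending;

    void post_task(server_task && task, bool front = false) {
        task.index = 0;  // a single task always maps to result slot 0
        front ? pending.push_front(std::move(task))  // priority (e.g. cancellation)
              : pending.push_back(std::move(task));
    }

    void post_tasks(std::vector<server_task> && tasks, bool front = false) {
        for (size_t i = 0; i < tasks.size(); ++i) {
            tasks[i].index = i;  // result i is later placed by this index
        }
        auto pos = front ? pending.begin() : pending.end();
        pending.insert(pos,
                       std::make_move_iterator(tasks.begin()),
                       std::make_move_iterator(tasks.end()));
    }
};
```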
376-378: Index-based result association.

Using `result->index` to locate the correct state for updating is consistent with the task index assignment. The `GGML_ASSERT(idx < states.size())` provides a safety check.

tools/server/server-common.cpp (1)
1426-1442: LGTM - Clean vocab-centric token-to-string conversion.

The changes properly:
- Update the private template helper to accept `const llama_vocab *`
- Add a public overload that extracts the vocab from the context via `llama_get_model` + `llama_model_get_vocab`
- Add a direct vocab-based public overload for callers that already have the vocab
This aligns with the broader vocab-centric refactoring in the PR.
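A hypothetical call-site fragment, assuming the overloads described above; `llama_get_model` and `llama_model_get_vocab` are the public llama.cpp accessors:

```cpp
// Old path (context-based overload): needs the mutable context.
// std::string text = tokens_to_str(ctx, tokens);

// New path: the vocab is immutable model data, so this avoids touching
// llama_context state from HTTP threads.
const llama_model * model = llama_get_model(ctx);
const llama_vocab * vocab = llama_model_get_vocab(model);
std::string text = tokens_to_str(vocab, tokens);
```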
tools/server/server-context.cpp (13)
509-532: LGTM - well-documented thread safety constraints.

The friend declaration enables proper pimpl pattern access, and the destructor correctly handles the sleeping state to prevent double-free. The comments on lines 513-515 clearly document the thread-safety requirements for accessing the public pointers.
1051-1062: LGTM - per-request LoRA configuration implementation.

The function correctly constructs a per-request LoRA list by copying the base adapters and applying scale overrides from the configuration map. Adapters not specified in the config are disabled (scale = 0.0f), which is the expected behavior for per-request LoRA.
2820-2829: LGTM - proper sleep state synchronization.

The constructor correctly handles the sleeping state to prevent data races. When `sleep_idle_seconds < 0`, sleeping is disabled so bypass is set. Otherwise, `wait_until_no_sleep()` ensures the server is awake before proceeding, addressing the data race concerns mentioned in the PR objectives.
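In outline — a sketch under assumed names, not the actual constructor:

```cpp
struct server_queue { void wait_until_no_sleep(); };

struct server_res_generator {
    server_res_generator(server_queue & queue_tasks, int sleep_idle_seconds, bool bypass_sleep) {
        if (sleep_idle_seconds < 0) {
            bypass_sleep = true;                // sleeping disabled: nothing to wait for
        }
        if (!bypass_sleep) {
            queue_tasks.wait_until_no_sleep();  // block until the server is awake
        }
    }
};
```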
3055-3065: LGTM - clean factory pattern and constructor initialization.

The `create_response()` factory properly passes the sleep parameters, and the constructor correctly initializes references to the internal components. This design decouples route handlers from the public `server_context` interface while maintaining proper access to the task and result queues.
3253-3295: LGTM - props endpoint correctly uses meta for thread-safe access.

The handler properly uses the `meta` struct for model/context information instead of accessing `ctx_server` directly, which aligns with the goal of preventing data races during the sleeping state. The `queue_tasks.is_sleeping()` call should be thread-safe.
3695-3716: LGTM - data race fix for LoRA endpoint.

This handler correctly addresses the data race by using the task queue to retrieve LoRA adapter information instead of directly accessing `params_base.lora_adapters`. The `SERVER_TASK_TYPE_GET_LORA` task ensures the read happens on the main processing thread, preventing concurrent access issues from HTTP threads.
3718-3745: LGTM - consistent task-based pattern for setting LoRA adapters.

The handler follows the same thread-safe pattern as `get_lora_adapters`, using the task queue to modify LoRA settings. The `parse_lora_request` call converts the request body to the expected `std::map<int, float>` format used by `construct_lora_list`.
2846-2857: LGTM - completions handler properly synchronizes with sleeping state.

The `create_response()` call ensures the server is awake before proceeding, making subsequent `ctx_server` access safe. The signature change to pass `server_http_req` by const reference is appropriate.
3565-3566: LGTM - consistent use of vocab-based token-to-piece API.

This confirms the `common_token_to_piece(vocab, token)` signature is the standard API. The usage is correct and consistent with the function signature shown in the relevant code snippets.
3600-3601: LGTM - vocab-based detokenization.

The usage of `tokens_to_str(ctx_server.vocab, tokens)` is consistent with the vocab-centric API design throughout the refactored code.
3840-3841: LGTM - uses meta for pooling type check.

The check correctly uses `meta->pooling_type` instead of directly querying the context, consistent with the meta-centric design for accessing model properties.
2801-2802: No action needed.

Both overloads of `common_token_to_piece` exist in `common/common.h`: one accepting `const llama_context *` and another accepting `const llama_vocab *`. Lines 2801-2802 correctly use the context-based overload, while line 1825 correctly uses the vocab-based overload. Both usages are valid.

Likely an incorrect or invalid review comment.
2972-2973: Request lifetime is safely managed during streaming, but the reference capture pattern is fragile.

The lambda on line 2972 captures `&req` by reference and is stored in `res->next`. While this appears risky at first glance, the request object remains valid during streaming because:
- The `server_http_req` object is converted to a `shared_ptr` in `process_handler_response()` (line ~343)
- Both the request and response are kept alive as shared pointers during streaming via `chunked_content_provider()` and `on_complete()`
- The streaming callback is only invoked while these shared pointers are in scope

However, this pattern relies on implicit lifetime guarantees from cpp-httplib's streaming mechanism. To improve robustness and clarity, consider capturing the request by value or using `req.should_stop` in a way that doesn't depend on reference-capture semantics.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tools/server/server-context.cpp (1)
1051-1062: Consider validating LoRA adapter indices or documenting the behavior.

The `construct_lora_list` function silently ignores adapter indices in the `config` map that are out of bounds (>= `params_base.lora_adapters.size()`). While this might be intentional for API flexibility, it could lead to silent failures that are hard to debug.

Consider either:
- Adding validation to return an error if invalid indices are provided
- Adding a warning log when indices are ignored
- Documenting this behavior clearly in the function comment
🔎 Example: Add validation
```diff
 std::vector<common_adapter_lora_info> construct_lora_list(const std::map<int, float> & config) {
     std::vector<common_adapter_lora_info> output = params_base.lora_adapters; // copy
+    // Validate config indices
+    for (const auto & [idx, scale] : config) {
+        if (idx < 0 || idx >= (int) output.size()) {
+            SRV_WRN("ignoring invalid lora adapter index %d (valid range: 0-%d)\n", idx, (int) output.size() - 1);
+        }
+    }
     for (size_t i = 0; i < output.size(); ++i) {
         auto it = config.find(i);
         if (it != config.end()) {
             output[i].scale = it->second;
         } else {
             output[i].scale = 0.0f;
         }
     }
     return output;
 }
```
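For context, a hypothetical end-to-end fragment combining the two helpers discussed in this review (`parse_lora_request` and `construct_lora_list`):

```cpp
json body = json::parse(R"([ {"id": 0, "scale": 0.75} ])");

std::map<int, float> config = parse_lora_request(body);   // {0: 0.75f}
std::vector<common_adapter_lora_info> per_request = construct_lora_list(config);
// adapters absent from `config` come back with scale 0.0f (disabled), so every
// request carries an explicit, complete adapter list
```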
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tools/server/server-context.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following the .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
tools/server/server-context.cpp
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
tools/server/server-context.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-12-20T21:18:14.768Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 62
File: tools/server/server-context.cpp:591-603
Timestamp: 2025-12-20T21:18:14.768Z
Learning: In ngxson/llama.cpp server implementation, holding mutex_tasks during callback_sleeping_state() (which calls handle_sleeping_state) is intentional behavior. No tasks should be allowed to be pushed into the queue during sleeping state transition, even though this blocks all queue operations (post, defer, wait_until_no_sleep) for the duration of model unload/reload (potentially minutes).
Applied to files:
tools/server/server-context.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: ubuntu-24-wasm-webgpu
- GitHub Check: android-ndk-build (arm64-snapdragon, --preset arm64-android-snapdragon-release)
- GitHub Check: ubuntu-24-cmake-webgpu
- GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
- GitHub Check: ios-xcode-build
- GitHub Check: ggml-ci-x64-cpu-low-perf
- GitHub Check: android-build
- GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: ubuntu-24-cmake-vulkan
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: ubuntu-24-cmake-vulkan-deb
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: server-windows
🔇 Additional comments (2)
tools/server/server-context.cpp (2)
2822-2829: LGTM: Well-structured refactoring of response generation.

The refactoring to centralize response generation through `create_response()` and the bypass_sleep mechanism is well-designed:
- Clear separation: Endpoints that only need metadata use `bypass_sleep=true`, while those needing context access use `bypass_sleep=false`
- Consistent pattern: All route handlers follow the same pattern of calling `create_response()` first
- Safety: The `wait_until_no_sleep()` call ensures `ctx_server` members aren't accessed during the sleeping state
- Documentation: The dummy variable pattern (lines 3077-3078, 3258-3259) helps document sleeping state constraints
The approach aligns with your preference for simplification (based on learnings) while maintaining safety.
Also applies to: 3055-3057, 3067-3082
3059-3065: No action required: meta initialization and thread-safety are properly designed.

The `meta` member is intentionally not initialized in the constructor. It uses `std::unique_ptr` for late initialization via the `update_meta()` method, which is called only when `ctx_http.is_ready` is false. Thread-safety is ensured by the atomic `is_ready` flag: handlers (which access `meta`) are only invoked after `is_ready` is set to true, establishing proper happens-before semantics. The documented comments and atomic synchronization primitive confirm this design is intentional and correct.

Likely an incorrect or invalid review comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tools/server/server-context.cpp (1)
2970-3014: Capturing `req` by reference in the streaming lambda may cause use-after-free.

The lambda at line 2972 captures `&req`, but this lambda is stored in `res->next` and may be invoked after the original `handle_completions_impl` function returns. If `req` goes out of scope before the lambda is called, accessing `req.should_stop()` (lines 2985, 3014) would be undefined behavior.

Verify that `server_http_req` outlives the streaming response. If the HTTP framework guarantees `req` remains valid for the duration of streaming, this is safe. Otherwise, capture the necessary state by value or use a different mechanism.

```bash
#!/bin/bash
# Check the lifetime of server_http_req and how streaming responses are handled
rg -n "server_http_req" --type cpp -A 5 -B 2 | head -100
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- tools/server/server-context.cpp
- tools/server/server-queue.cpp
- tools/server/server-queue.h
🚧 Files skipped from review as they are similar to previous changes (1)
- tools/server/server-queue.h
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following the .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
- tools/server/server-context.cpp
- tools/server/server-queue.cpp
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
- tools/server/server-context.cpp
- tools/server/server-queue.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-12-20T21:18:14.768Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 62
File: tools/server/server-context.cpp:591-603
Timestamp: 2025-12-20T21:18:14.768Z
Learning: In ngxson/llama.cpp server implementation, holding mutex_tasks during callback_sleeping_state() (which calls handle_sleeping_state) is intentional behavior. No tasks should be allowed to be pushed into the queue during sleeping state transition, even though this blocks all queue operations (post, defer, wait_until_no_sleep) for the duration of model unload/reload (potentially minutes).
Applied to files:
- tools/server/server-context.cpp
- tools/server/server-queue.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: build-cmake-pkg / linux
- GitHub Check: ggml-ci-arm64-cpu-high-perf
- GitHub Check: windows-2022-cmake-cuda (12.4)
- GitHub Check: ggml-ci-arm64-cpu-high-perf-sve
- GitHub Check: ggml-ci-arm64-cpu-kleidiai
- GitHub Check: ggml-ci-arm64-cpu-low-perf
- GitHub Check: ubuntu-22-cmake-sycl-fp16
- GitHub Check: ios-xcode-build
- GitHub Check: windows-latest-cmake-hip
- GitHub Check: macOS-latest-cmake-arm64-webgpu
- GitHub Check: ubuntu-22-cmake-hip
- GitHub Check: ubuntu-24-wasm-webgpu
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: ubuntu-cpu-cmake (x64, ubuntu-22.04)
- GitHub Check: macOS-latest-cmake-x64
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: ubuntu-24-cmake-webgpu
- GitHub Check: ubuntu-24-cmake-vulkan
- GitHub Check: server-windows
🔇 Additional comments (20)
tools/server/server-queue.cpp (4)
328-334: LGTM!

The `post_task` function correctly initializes `task.index = 0` before posting and properly forwards the `front` parameter to the queue. The assertion ensures single-use semantics per reader.
337-346: LGTM!

The `post_tasks` function correctly assigns sequential indices to each task and forwards the `front` parameter. The index assignment before state creation ensures proper ordering for batch result collection.
370-375: LGTM!

Using `result->index` directly is cleaner than a getter method. The bounds check at line 373 ensures safe access to the states vector.
386-406: LGTM!

The batch result handling correctly:
- Pre-sizes the results vector to match `id_tasks.size()`
- Uses `result->index` for placement with a proper bounds assertion
- Detects duplicate results with the nullptr check
The assertions at lines 401-402 provide good runtime validation for invariants.
tools/server/server-context.cpp (16)
509-532: LGTM!

The restructuring of `server_context_impl` with an explicit `friend` declaration and clear public/private separation improves encapsulation. The destructor correctly avoids double-free by checking the sleeping state before calling `destroy()`.
1051-1062: LGTM!

The `construct_lora_list` function correctly:
- Creates a copy of the base LoRA adapters
- Applies per-request scale overrides from the config map
- Defaults unspecified adapters to scale 0.0f
This is cleaner than the previous vector-based approach.
1064-1082: LGTM!

The updated LoRA handling in `launch_slot_with_task` correctly uses the new `construct_lora_list` helper and maintains the cache-clearing logic when LoRA configurations change.
1811-1836: Variable shadowing issue has been addressed.

The inner loop variable is now correctly named `j` (line 1824) instead of `i`, which fixes the variable shadowing bug flagged in the previous review. The outer loop's `i` at line 1817 is no longer shadowed.
1837-1849: LGTM!

The `SERVER_TASK_TYPE_SET_LORA` case correctly uses `construct_lora_list` with the new map-based request format and provides useful logging.
2785-2813: LGTM!

The `get_meta()` function provides a clean, consolidated way to expose server metadata. The structure initialization is complete and properly accesses internal state through the impl pointer.
2818-2829: LGTM!

The `server_res_generator` constructor properly handles the bypass_sleep optimization by short-circuiting when sleeping is disabled (`sleep_idle_seconds < 0`).
3055-3065: LGTM!

The `create_response` helper and `server_routes` constructor properly initialize the routing infrastructure with references to the queue and server context.
3071-3082: Good defensive practice with the dummy variable.

The pattern of declaring `bool server_ctx;` with the comment "do NOT delete this line" is a clever way to prevent accidental use of `ctx_server` in endpoints that should work during the sleeping state. This provides compile-time safety.
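The technique can be illustrated as follows — the identifiers here are assumptions (the review only quotes `bool server_ctx;`); the idea is that a local declared at the top of the lambda shadows the captured context, so any accidental member access fails to compile:

```cpp
struct server_context_impl { int n_ctx = 0; };

auto make_sleeping_safe_handler(server_context_impl & ctx_server) {
    return [&]() {
        bool ctx_server;     // do NOT delete: shadows the outer context so any
        (void) ctx_server;   // accidental `ctx_server.` access fails to compile
        // ctx_server.n_ctx; // <- would now be a compile error
        // ... body may only use `meta` and the task/result queues ...
    };
}
```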
3253-3293: LGTM!

The `get_props` endpoint properly accesses metadata through the `meta` pointer and correctly uses `queue_tasks.is_sleeping()` to report the sleeping state. The conditional addition of `chat_template_tool_use` is clean.
3565-3566: Prefer vocab-based tokenization for consistency.

The call to `common_token_to_piece(ctx_server.vocab, token)` correctly uses the vocab directly rather than going through the context, which aligns with the refactoring mentioned in the AI summary.
3600-3601: Consistent with vocab-centric approach.

`tokens_to_str(ctx_server.vocab, tokens)` correctly uses the vocabulary directly, maintaining consistency with the refactoring.
3695-3716: LGTM!

The `get_lora_adapters` handler correctly posts a task and waits for the result using the new task-based approach for LoRA information retrieval.
3718-3745: LGTM!

The `post_lora_adapters` handler correctly uses `parse_lora_request` to convert the request body and posts the task. The dynamic cast assertion ensures type safety.
2846-2857: The initialization flow is correct; no action needed.

`meta` is guaranteed to be initialized before route handlers execute. `update_meta()` is called immediately after `load_model()` succeeds (line 255 of server.cpp) and before `ctx_http.is_ready` is set to true (line 256). Route handlers won't be called until `is_ready` is true, ensuring `meta->slot_n_ctx` and `meta->model_name` are always available when accessed.
3833-3840: Ensure consistent pooling_type checks between reranking and embeddings endpoints.

Line 3617 checks `params.pooling_type` for the reranking endpoint, while line 3840 checks `meta->pooling_type` for embeddings. These represent different values (user-configured vs. the model's native type). Verify this intentional difference is correct: reranking validates user configuration, while embeddings validates model capability. If both should check the same value, or if user-provided pooling overrides should be considered, update for consistency.
Make sure to read the contributing guidelines before submitting a PR
Summary by CodeRabbit
- Bug Fixes
- Refactor