
Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17487


This PR adds a Model Context Protocol (MCP) client to the llama.cpp WebUI (TypeScript/Svelte), consisting of:

  • multi-transport MCP client
  • full agentic orchestrator
  • isolated, idempotent singleton initialization
  • typed SSE client
  • normalized tool-call accumulation pipeline (see the sketch after this list)
  • integrated reasoning, timings, previews, and turn-limit handling
  • complete UI section for MCP configuration
  • dedicated controls for relevant parameters
  • opt-in ChatService integration that does not interfere with existing flows
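
For context, the accumulation step conceptually merges partial tool-call deltas streamed over SSE into complete calls before they are dispatched to an MCP server. The sketch below is illustrative only: the type and class names (`ToolCallDelta`, `ToolCallAccumulator`, etc.) are assumptions, not the PR's actual code, and it presumes OpenAI-style streaming deltas.

```ts
// Illustrative sketch only: accumulate OpenAI-style streamed tool_call deltas
// into complete calls. Names are assumptions, not the PR's actual code.

interface ToolCallDelta {
  index: number;
  id?: string;
  function?: { name?: string; arguments?: string };
}

interface AccumulatedToolCall {
  id: string;
  name: string;
  arguments: string; // JSON string built up chunk by chunk
}

class ToolCallAccumulator {
  private calls = new Map<number, AccumulatedToolCall>();

  // Merge one streamed delta into the call at the same index.
  push(delta: ToolCallDelta): void {
    const call = this.calls.get(delta.index) ?? { id: '', name: '', arguments: '' };
    if (delta.id) call.id = delta.id;
    if (delta.function?.name) call.name += delta.function.name;
    if (delta.function?.arguments) call.arguments += delta.function.arguments;
    this.calls.set(delta.index, call);
  }

  // Return completed calls in index order with parsed arguments.
  finalize(): { id: string; name: string; args: unknown }[] {
    return [...this.calls.entries()]
      .sort(([a], [b]) => a - b)
      .map(([, c]) => ({ id: c.id, name: c.name, args: JSON.parse(c.arguments || '{}') }));
  }
}
```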

TODO: tighter integration with the UI for structured tool-call result rendering, including dedicated display components and support for sending out-of-context images (persistence/storage still to be defined).

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary

Analysis Scope: PR #316 - MCP Client Integration for llama.cpp WebUI
Versions Compared: 930f177b-2868-453d-809a-8c06d2215f50 vs d55f4145-0a3a-4b89-9c31-ba206b13d74b


Summary

This PR introduces MCP client functionality exclusively in the WebUI frontend layer (TypeScript/Svelte). Analysis of the actual performance data shows zero measurable impact on core inference functions. All changes are isolated to browser-side JavaScript code with no modifications to the C++ inference engine. Power consumption measurements across all binaries show 0.0% change, confirming no performance regression in the compiled artifacts.

The code review identified 2,338 lines of new frontend code implementing agentic tool-calling workflows. The integration point in ChatService uses an opt-in pattern that bypasses the new code path when MCP is not configured, preserving existing behavior. No performance-critical functions from the project summary (llama_decode, llama_tokenize, llama_model_load_from_file, ggml_backend_graph_compute) were modified.
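
As a rough illustration of that opt-in shape (the method and field names below are assumptions, not the PR's actual API), the guard might look like:

```ts
// Illustrative sketch of the opt-in pattern; names are assumed, not taken from
// the PR. The MCP path is entered only when at least one MCP server is
// configured, so the existing flow is untouched otherwise.

interface McpConfig {
  servers: { name: string; url: string }[];
}

class ChatService {
  constructor(private mcpConfig?: McpConfig) {}

  async sendMessage(prompt: string): Promise<string> {
    // Opt-in guard: with no MCP servers configured, use the existing path.
    if (!this.mcpConfig || this.mcpConfig.servers.length === 0) {
      return this.sendPlainCompletion(prompt);
    }
    // New MCP-backed path: agentic loop with tool calls up to a turn limit.
    return this.runAgenticLoop(prompt);
  }

  private async sendPlainCompletion(prompt: string): Promise<string> {
    return `completion for: ${prompt}`; // existing behavior, stubbed here
  }

  private async runAgenticLoop(prompt: string): Promise<string> {
    return `agentic result for: ${prompt}`; // orchestrator, stubbed here
  }
}
```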

Function-level metrics for llama_decode show throughput of 69 ns in both versions with response time of 44,722,748 ns vs 44,722,492 ns (256 ns difference, 0.0006% change). The llama_tokenize function maintains 22 ns throughput with response time of 898,714 ns vs 898,716 ns (2 ns difference). These sub-microsecond variations are within measurement noise and indicate no functional changes to the inference pipeline.
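
For reference, the quoted percentage is consistent with the raw numbers:

```ts
// Sanity check of the quoted llama_decode delta (numbers from the report above).
const before = 44_722_748; // ns
const after  = 44_722_492; // ns
const deltaNs  = before - after;            // 256 ns
const deltaPct = (deltaNs / before) * 100;  // ≈ 0.00057 %, i.e. the quoted ~0.0006 %
console.log(deltaNs, deltaPct.toFixed(5));
```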


Tokens per Second Impact: None. No inference functions modified.

Power Consumption: All binaries show 0.0% change (libllama.so: 228,744 nJ both versions).

Conclusion: This PR adds optional frontend functionality with zero performance impact on core inference operations.

@loci-dev force-pushed the main branch 26 times, most recently from eec18ea to 7475023 on November 29, 2025 at 16:09
@loci-dev force-pushed the main branch 30 times, most recently from 6b83243 to fa01de0 on December 23, 2025 at 21:08