Conversation

@Chrisischris

Implements the extra_args feature for /models/load that is already documented in the server README but was not yet implemented.

From the docs:

extra_args: (optional) an array of additional arguments to be passed to the model instance. Note: you must start the server with --models-allow-extra-args to enable this feature.

Changes

  • common/common.h: Add models_allow_extra_args param
  • common/arg.cpp: Add --models-allow-extra-args / --no-models-allow-extra-args flag
  • tools/server/server-models.h: Update load() signature to accept extra_args
  • tools/server/server-models.cpp: Parse extra_args from JSON body and append to child process args

See server README for usage documentation.
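
For illustration, a client call using the documented extra_args field might look like the sketch below. The field names follow the README quote above and the JSON body parsed in server-models.cpp; the host/port, model name, and argument values are assumptions, not defaults.

    # Sketch only: load a model instance with extra CLI arguments via
    # /models/load. Assumes llama-server runs on localhost:8080 in
    # multi-model mode and was started with --models-allow-extra-args.
    import requests

    resp = requests.post(
        "http://localhost:8080/models/load",
        json={
            "model": "my-model",
            # extra_args as documented in the README: an array of additional
            # arguments passed to the spawned model instance.
            "extra_args": ["-c", "8192"],
        },
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json())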

@ngxson
Collaborator

ngxson commented Dec 21, 2025

We don't allow this function because it introduces too many security risks. Instead, please remove this from the docs.

Ref discussion: #17470

@ngxson ngxson closed this Dec 21, 2025
@Chrisischris
Author

We don't allow this function because it introduces too many security risks. Instead, please remove this from the docs.

Ref discussion: #17470

I understand the security concern, which is why this implementation requires an explicit opt-in flag (--models-allow-extra-args). It's disabled by default - operators must consciously enable it when starting the server in trusted environments.

My use case: I'm building a system where I need to dynamically load models with different context sizes on demand. The preset file approach (--models-preset) only allows static, predefined context sizes per model - it can't handle loading the same model with 4k context for one request and 32k for another at runtime.

Without extra_args, there's no way to specify -c per model when calling /models/load. The only alternative would be restarting the entire server for each context size change, which defeats the purpose of multi-model mode.

Would you be open to keeping the feature with the opt-in flag, or is there an alternative approach you'd suggest for setting per-model context sizes dynamically?

@ngxson
Collaborator

ngxson commented Dec 21, 2025

I don't think the function is worth the risk, even when hidden behind a flag. We will soon be flooded by security reports that nitpick this functionality.

Instead, we should allow only some selected params as @ServeurpersoCom suggested, with proper validations.

@Chrisischris
Author

Got it, I understand the concern. I'd be happy to submit a more scoped PR that only allows context_size (or a small set of validated parameters) in the /models/load request body, rather than arbitrary extra_args. Something like:

  {
    "model": "my-model",
    "context_size": 8192
  }

Would that approach be acceptable? And would you still want it behind an opt-in flag, or would validated/scoped parameters be safe enough to enable by default?

Happy to follow whatever direction you think is best for the project.

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Dec 23, 2025

I'd be happy to submit a more scoped PR that only allows context_size

I'd certainly like to see this.

Candidates I'd be interested in personally, in order of preference:

  • ctx
  • n-gpu-layers
  • lora
  • jinja on/off

In the meantime I may just merge this PR locally, since I never run my LLMs accessible outside my localhost anyhow.
presets.ini could be edited programmatically and llama-server rebooted as an alternative, I suppose.

@ServeurpersoCom
Collaborator

In the meantime I may just merge this PR locally, since I never run my LLMs accessible outside my localhost anyhow. presets.ini could be edited programmatically and llama-server rebooted as an alternative, I suppose.

Just a thought on the security architecture here. Those params (ctx, n-gpu-layers, lora) are what I'd call "cold-start" parameters: they need to be set when spawning the child process, not changed at runtime. This is different from hot params like temperature that a running process can absorb.

I think there's value in following the Apache/nginx pattern here: the router reads a preset file at startup (or on a reload signal), validates it, then spawns children with those cold parameters. If a config is bad, it fails at spawn time in an isolated child, not mid-request via the API.

The challenge with exposing cold params directly through the API is that validation becomes infrastructure-specific. Your VRAM limits, filesystem policies, and GPU topology are different from mine. llama.cpp can't realistically validate every admin's unique constraints.

Maybe the cleaner approach is a thin custom backend that validates cold params against your specific infrastructure rules, writes them to the preset file, then signals the router to reload. That way llama.cpp stays generic and your security policy lives where it belongs: in your own infrastructure layer.
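
As a rough illustration of that split, such a thin backend could look something like the sketch below. Everything here is hypothetical: the preset key name, file path, and restart command are placeholders for whatever your infrastructure actually uses, not real llama-server interfaces.

    # Hypothetical infrastructure-side helper: validate a cold param against
    # local policy, write it into the preset file, then hand it to the router.
    import configparser
    import subprocess

    MAX_CTX = 32768                 # site-specific policy limit (assumption)
    PRESET_PATH = "presets.ini"

    def set_model_ctx(model_name: str, ctx: int) -> None:
        # Validate against this site's own rules before touching the preset.
        if not (512 <= ctx <= MAX_CTX):
            raise ValueError(f"ctx {ctx} is outside the allowed range")

        config = configparser.ConfigParser()
        config.read(PRESET_PATH)
        if model_name not in config:
            config[model_name] = {}
        config[model_name]["ctx"] = str(ctx)   # placeholder key, not the real schema
        with open(PRESET_PATH, "w") as f:
            config.write(f)

        # A reload signal doesn't exist yet, so this stand-in simply restarts
        # the service (the unit name is a placeholder).
        subprocess.run(["systemctl", "restart", "llama-server"], check=True)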

@ngxson
Collaborator

ngxson commented Dec 23, 2025

I don't get why we need to allow lora to be set dynamically from the router. Don't we already support per-request lora config for this?

If we add the feature asked for in this PR, I suspect people will soon flood us with PRs/issues just to add params specific to their personal use cases, even when they're redundant and/or insecure. And when we add them, security researchers will start nitpicking them.

So I don't think such a feature is worth maintaining; the human effort isn't worth it.


We could probably support a (manual) hot-reload of the preset file like Pascal said, which is pretty much aligned with the nginx/apache approach.

There will be some edge cases, like what if a model is removed from the ini file, what if the global config changed, etc. Probably worth opening an issue to list out all of these cases before writing the code.

@ngxson
Collaborator

ngxson commented Dec 23, 2025

On second thought, we can transfer the risk by asking users to specify themselves the set of parameters that can be overwritten via the API. This way it's completely up to the user to decide which parameters are allowed to be changed via the API.

I will have a look into this feature.
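
As a sketch of that idea (the names below are placeholders only; no such option exists yet), the server-side check could amount to something like:

    # Operator-defined allowlist: only parameters explicitly listed by
    # whoever started the server may be overridden through the API.
    ALLOWED_OVERRIDES = {"ctx", "n_gpu_layers"}   # placeholder operator choice

    def validate_overrides(overrides: dict) -> dict:
        rejected = set(overrides) - ALLOWED_OVERRIDES
        if rejected:
            raise ValueError(f"params not overridable via API: {sorted(rejected)}")
        return overrides

    # validate_overrides({"ctx": 8192})            -> accepted
    # validate_overrides({"mmproj": "/etc/foo"})   -> ValueError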
