5 changes: 2 additions & 3 deletions README.md
@@ -125,9 +125,8 @@ Lightspeed Core Stack (LCS) supports the large language models from the provider
| OpenAI | gpt-5, gpt-4o, gpt4-turbo, gpt-4.1, o1, o3, o4 | Yes | remote::openai | [1](examples/openai-faiss-run.yaml) [2](examples/openai-pgvector-run.yaml) |
| OpenAI | gpt-3.5-turbo, gpt-4 | No | remote::openai | |
| RHAIIS (vLLM)| meta-llama/Llama-3.1-8B-Instruct | Yes | remote::vllm | [1](tests/e2e/configs/run-rhaiis.yaml) |
| RHEL AI (vLLM)| meta-llama/Llama-3.1-8B-Instruct | Yes | remote::vllm | [1](tests/e2e/configs/run-rhelai.yaml) |
| Azure | gpt-5, gpt-5-mini, gpt-5-nano, gpt-5-chat, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, o3-mini, o4-mini | Yes | remote::azure | [1](examples/azure-run.yaml) |
| Azure | o1, o1-mini | No | remote::azure | |
| Azure | gpt-5, gpt-5-mini, gpt-5-nano, gpt-4o-mini, o3-mini, o4-mini, o1| Yes | remote::azure | [1](examples/azure-run.yaml) |
| Azure | gpt-5-chat, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, o1-mini | No or limited | remote::azure | |

The "provider_type" is used in the llama stack configuration file when referring to the provider.

43 changes: 42 additions & 1 deletion docs/providers.md
@@ -36,7 +36,7 @@ The tables below summarize each provider category, containing the following atri
| meta-reference | inline | `accelerate`, `fairscale`, `torch`, `torchvision`, `transformers`, `zmq`, `lm-format-enforcer`, `sentence-transformers`, `torchao==0.8.0`, `fbgemm-gpu-genai==1.1.2` | ❌ |
| sentence-transformers | inline | `torch torchvision torchao>=0.12.0 --extra-index-url https://download.pytorch.org/whl/cpu`, `sentence-transformers --no-deps` | ❌ |
| anthropic | remote | `litellm` | ❌ |
| azure | remote | | ✅ |
| azure | remote | `litellm` | ✅ |
| bedrock | remote | `boto3` | ❌ |
| cerebras | remote | `cerebras_cloud_sdk` | ❌ |
| databricks | remote | — | ❌ |
@@ -64,6 +64,47 @@ Red Hat providers:
| RHAIIS (vllm) | 3.2.3 (on RHEL 9.20250429.0.4) | remote | `openai` | ✅ |
| RHEL AI (vllm) | 1.5.2 | remote | `openai` | ✅ |

### Azure Provider - Entra ID Authentication Guide

Lightspeed Core supports secure authentication using Microsoft Entra ID (formerly Azure Active Directory) for the Azure Inference Provider. This allows you to connect to Azure OpenAI without using API keys, by authenticating through your organization’s Azure identity.

#### Lightspeed Core configuration requirements

To enable Entra ID authentication, the `azure_entra_id` block must be included in your LCS configuration, and all three attributes (`tenant_id`, `client_id`, and `client_secret`) are required. Authentication will not work if any of them is missing:

```yaml
azure_entra_id:
  tenant_id: ${env.AZURE_TENANT_ID}
  client_id: ${env.AZURE_CLIENT_ID}
  client_secret: ${env.AZURE_CLIENT_SECRET}
```
**Note:** We strongly recommend loading these values from environment variables or a secrets manager rather than hardcoding them in the configuration file.

#### Llama Stack Configuration Requirements

Because Lightspeed builds on top of Llama Stack, the `config` block for the Azure provider must include `api_key`, `api_base`, and `api_version` to satisfy the base Llama Stack schema.

While `api_key` is not used in Entra ID mode, it must still be set to a dummy value because Llama Stack validates its presence. The `api_base` and `api_version` fields remain required and are used during Entra ID authentication.

```yaml
inference:
  - provider_id: azure
    provider_type: remote::azure
    config:
      api_key: ${env.AZURE_API_KEY:=} # Required but not used for Entra ID
      api_base: ${env.AZURE_API_BASE}
      api_version: 2025-01-01-preview
```
**Note:** Llama Stack currently supports only static API key authentication through the LiteLLM SDK. Lightspeed extends this behavior by dynamically injecting Entra ID access tokens into each request, enabling full compatibility while maintaining schema compliance with the Llama Stack configuration.
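The injection described in the note works by rewriting the `X-LlamaStack-Provider-Data` request header. A minimal sketch, mirroring the header handling added in this PR's `src/client.py` (the helper name `with_azure_token` is hypothetical):

```python
import json


def with_azure_token(
    headers: dict, access_token: str, api_base: str, api_version: str
) -> dict:
    """Merge an Entra ID access token into the X-LlamaStack-Provider-Data header."""
    # Preserve any provider data already present in the header.
    provider_data = json.loads(headers.get("X-LlamaStack-Provider-Data") or "{}")
    provider_data.update(
        {
            "azure_api_key": access_token,  # the short-lived token stands in for a static key
            "azure_api_base": api_base,
            "azure_api_version": api_version,
        }
    )
    # Return a new header dict; all non-Azure headers are left untouched.
    return {**headers, "X-LlamaStack-Provider-Data": json.dumps(provider_data)}
```

Because LiteLLM reads the Azure key from provider data on every request, swapping the header value per request is enough to keep the Llama Stack schema unchanged.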

#### Access Token Lifecycle and Management

When the service starts or an inference request is made:
1. The system reads your Entra ID configuration.
1. It checks whether a valid access token already exists:
   - If no token exists or the current token has expired, the system automatically requests a new one from Microsoft Entra ID using your credentials.
   - If a valid token is still active, it is reused and no new request is made.
1. The access token grants access to Azure OpenAI Services.
1. Tokens are automatically refreshed as needed before they expire. Access tokens are typically valid for 1 hour, and this process happens entirely in the background without any manual action.
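The lifecycle above can be sketched as follows (hypothetical `TokenCache` and `fetch` names; in Lightspeed the real implementation is the `AzureEntraIDTokenManager` singleton):

```python
import time

TOKEN_EXPIRATION_LEEWAY = 30  # refresh slightly before the real expiry (seconds)


class TokenCache:
    """Minimal sketch of the token lifecycle; fetch() stands in for the Entra ID call."""

    def __init__(self) -> None:
        self._token = None
        self._expires_on = 0.0  # epoch seconds

    def get_token(self, fetch) -> str:
        # Reuse a still-valid cached token; no new request is made.
        if self._token and time.time() < self._expires_on:
            return self._token
        # Otherwise request a fresh token; fetch() returns (token, expires_on).
        token, expires_on = fetch()
        self._token = token
        self._expires_on = expires_on - TOKEN_EXPIRATION_LEEWAY
        return token
```

The 30-second leeway mirrors `TOKEN_EXPIRATION_LEEWAY` in this PR, so a token is refreshed slightly before Azure would reject it.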

---

5 changes: 2 additions & 3 deletions examples/azure-run.yaml
@@ -72,10 +72,9 @@ providers:
  - provider_id: azure
    provider_type: remote::azure
    config:
      api_key: ${env.AZURE_API_KEY}
      api_key: ${env.AZURE_API_KEY:=}
      api_base: https://ols-test.openai.azure.com/
      api_version: 2024-02-15-preview
      api_type: ${env.AZURE_API_TYPE:=}
      api_version: 2025-01-01-preview
  post_training:
  - provider_id: huggingface
    provider_type: inline::huggingface-gpu
30 changes: 30 additions & 0 deletions examples/lightspeed-stack-azure-entraid.yaml
@@ -0,0 +1,30 @@
name: Lightspeed Core Service (LCS)
service:
  host: 0.0.0.0
  port: 8080
  auth_enabled: false
  workers: 1
  color_log: true
  access_log: true
llama_stack:
  # Uses a remote llama-stack service
  # The instance would have already been started with a llama-stack-run.yaml file
  use_as_library_client: false
  # Alternative for "as library use"
  # use_as_library_client: true
  # library_client_config_path: <path-to-llama-stack-run.yaml-file>
  url: http://localhost:8321
  api_key: xyzzy
user_data_collection:
  feedback_enabled: true
  feedback_storage: "/tmp/data/feedback"
  transcripts_enabled: true
  transcripts_storage: "/tmp/data/transcripts"

authentication:
  module: "noop"

azure_entra_id:
  tenant_id: ${env.TENANT_ID}
  client_id: ${env.CLIENT_ID}
  client_secret: ${env.CLIENT_SECRET}
2 changes: 2 additions & 0 deletions pyproject.toml
@@ -51,6 +51,8 @@ dependencies = [
    # Used by authorization resolvers
    "jsonpath-ng>=1.6.1",
    "psycopg2-binary>=2.9.10",
    "azure-core",
    "azure-identity",
]


26 changes: 25 additions & 1 deletion src/app/endpoints/query.py
@@ -62,6 +62,8 @@
from utils.transcripts import store_transcript
from utils.types import TurnSummary
from utils.token_counter import extract_and_update_token_metrics, TokenCounter
from authorization.azure_token_manager import AzureEntraIDTokenManager


logger = logging.getLogger("app.endpoints.handlers")
router = APIRouter(tags=["query"])
@@ -287,13 +289,35 @@ async def query_endpoint_handler_base( # pylint: disable=R0914
    try:
        check_tokens_available(configuration.quota_limiters, user_id)
        # try to get Llama Stack client
        client = AsyncLlamaStackClientHolder().get_client()
        client_holder = AsyncLlamaStackClientHolder()
        client = client_holder.get_client()
        llama_stack_model_id, model_id, provider_id = select_model_and_provider_id(
            await client.models.list(),
            *evaluate_model_hints(
                user_conversation=user_conversation, query_request=query_request
            ),
        )

        azure_token_manager = AzureEntraIDTokenManager()
        if (
            provider_id == "azure"
            and azure_token_manager.is_entra_id_configured
            and azure_token_manager.is_token_expired
        ):
            azure_token_manager.refresh_token()
            azure_config = next(
                p.config
                for p in await client.providers.list()
                if p.provider_type == "remote::azure"
            )

            client = client_holder.get_client_with_updated_azure_headers(
                access_token=azure_token_manager.access_token,
                api_base=str(azure_config.get("api_base")),
                api_version=str(azure_config.get("api_version")),
            )
            client_holder.set_client(client)

        summary, conversation_id, referenced_documents, token_usage = (
            await retrieve_response_func(
                client,
25 changes: 24 additions & 1 deletion src/app/endpoints/streaming_query.py
@@ -39,6 +39,7 @@
from authentication import get_auth_dependency
from authentication.interface import AuthTuple
from authorization.middleware import authorize
from authorization.azure_token_manager import AzureEntraIDTokenManager
from client import AsyncLlamaStackClientHolder
from configuration import configuration
from constants import DEFAULT_RAG_TOOL, MEDIA_TYPE_JSON, MEDIA_TYPE_TEXT
@@ -757,13 +758,35 @@ async def streaming_query_endpoint_handler( # pylint: disable=too-many-locals,t

    try:
        # try to get Llama Stack client
        client = AsyncLlamaStackClientHolder().get_client()
        client_holder = AsyncLlamaStackClientHolder()
        client = client_holder.get_client()
        llama_stack_model_id, model_id, provider_id = select_model_and_provider_id(
            await client.models.list(),
            *evaluate_model_hints(
                user_conversation=user_conversation, query_request=query_request
            ),
        )

        azure_token_manager = AzureEntraIDTokenManager()
        if (
            provider_id == "azure"
            and azure_token_manager.is_entra_id_configured
            and azure_token_manager.is_token_expired
        ):
            azure_token_manager.refresh_token()
            azure_config = next(
                p.config
                for p in await client.providers.list()
                if p.provider_type == "remote::azure"
            )

            client = client_holder.get_client_with_updated_azure_headers(
                access_token=azure_token_manager.access_token,
                api_base=str(azure_config.get("api_base")),
                api_version=str(azure_config.get("api_version")),
            )
            client_holder.set_client(client)

        response, conversation_id = await retrieve_response(
            client,
            llama_stack_model_id,
94 changes: 94 additions & 0 deletions src/authorization/azure_token_manager.py
@@ -0,0 +1,94 @@
"""Authentication module for Azure Entra ID Credentials."""

import logging
import time
from typing import Optional

from azure.core.credentials import AccessToken
from azure.core.exceptions import ClientAuthenticationError
from azure.identity import ClientSecretCredential

from configuration import AzureEntraIdConfiguration
from utils.types import Singleton

logger = logging.getLogger(__name__)

TOKEN_EXPIRATION_LEEWAY = 30  # seconds


class AzureEntraIDTokenManager(metaclass=Singleton):
    """Microsoft Token cache for Azure OpenAI provider.

    Manages and caches Microsoft Entra ID access tokens for the Azure OpenAI provider.
    Handles token storage, expiration checks, and refreshing tokens when an Entra ID
    configuration is provided. Designed for temporary tokens passed via request headers.
    """

    def __init__(self) -> None:
        """Construct Azure token manager."""
        self._access_token: Optional[str] = None
        self._expires_on: int = 0
        self._entra_id_config: Optional[AzureEntraIdConfiguration] = None

    def set_config(self, azure_config: AzureEntraIdConfiguration) -> None:
        """Set the Azure Entra ID configuration."""
        self._entra_id_config = azure_config

    @property
    def is_entra_id_configured(self) -> bool:
        """Check whether an Entra ID configuration has been set."""
        return self._entra_id_config is not None

    @property
    def is_token_expired(self) -> bool:
        """Check if the current token has expired (observer only)."""
        return self._expires_on == 0 or time.time() > self._expires_on

    @property
    def access_token(self) -> str:
        """Return the currently cached access token (no refresh logic)."""
        if not self._access_token:
            logger.debug("Access token requested but not yet available.")
        return self._access_token or ""

    def refresh_token(self) -> None:
        """Refresh and cache a new Azure access token if configuration is set."""
        if self._entra_id_config is None:
            raise ValueError("Azure configuration is not set for token retrieval")

        logger.info("Refreshing Azure access token...")
        token_obj = self._retrieve_access_token()
        if token_obj:
            self._update_access_token(token_obj.token, token_obj.expires_on)
            logger.info("Azure access token successfully refreshed.")
        else:
            raise RuntimeError("Failed to retrieve Azure access token")

    def _update_access_token(self, token: str, expires_on: int) -> None:
        """Update token and expiration time (private helper)."""
        self._access_token = token
        self._expires_on = expires_on - TOKEN_EXPIRATION_LEEWAY
        logger.info(
            "Access token updated; expires at %s",
            time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(self._expires_on)),
        )

    def _retrieve_access_token(self) -> Optional[AccessToken]:
        """Retrieve access token to call Azure OpenAI."""
        if not self._entra_id_config:
            return None
        tenant_id = self._entra_id_config.tenant_id.get_secret_value()
        client_id = self._entra_id_config.client_id.get_secret_value()
        client_secret = self._entra_id_config.client_secret.get_secret_value()

        try:
            credential = ClientSecretCredential(
                tenant_id=tenant_id,
                client_id=client_id,
                client_secret=client_secret,
            )
            token = credential.get_token("https://cognitiveservices.azure.com/.default")
            return token
        except ClientAuthenticationError as e:
            logger.error("Error retrieving access token: %s", e, exc_info=True)
            return None
46 changes: 46 additions & 0 deletions src/client.py
@@ -1,6 +1,7 @@
"""Llama Stack client retrieval class."""

import logging
import json

from typing import Optional

@@ -53,3 +54,48 @@ def get_client(self) -> AsyncLlamaStackClient:
                "AsyncLlamaStackClient has not been initialised. Ensure 'load(..)' has been called."
            )
        return self._lsc

    def set_client(self, new_client: AsyncLlamaStackClient) -> None:
        """
        Replace the currently stored AsyncLlamaStackClient instance.

        This method allows updating the client reference when
        configuration or runtime attributes have changed.
        """
        self._lsc = new_client

    def get_client_with_updated_azure_headers(
        self,
        access_token: str,
        api_base: str,
        api_version: str,
    ) -> AsyncLlamaStackClient:
        """Return a new client with updated Azure headers, preserving other headers."""
        if not self._lsc:
            raise RuntimeError(
                "AsyncLlamaStackClient has not been initialised. Ensure 'load(..)' has been called."
            )

        current_headers = self._lsc.default_headers if self._lsc else {}
        provider_data_json = current_headers.get("X-LlamaStack-Provider-Data")

        try:
            provider_data = json.loads(provider_data_json) if provider_data_json else {}
        except (json.JSONDecodeError, TypeError):
            provider_data = {}

        # Update only Azure-specific fields
        provider_data.update(
            {
                "azure_api_key": access_token,
                "azure_api_base": api_base,
                "azure_api_version": api_version,
                "azure_api_type": None,  # deprecated attribute
            }
        )

        updated_headers = {
            **current_headers,
            "X-LlamaStack-Provider-Data": json.dumps(provider_data),
        }
        return self._lsc.copy(set_default_headers=updated_headers)  # type: ignore
8 changes: 8 additions & 0 deletions src/configuration.py
@@ -10,6 +10,7 @@
import yaml
from models.config import (
    AuthorizationConfiguration,
    AzureEntraIdConfiguration,
    Configuration,
    Customization,
    LlamaStackConfiguration,
@@ -180,5 +181,12 @@ def quota_limiters(self) -> list[QuotaLimiter]:
        )
        return self._quota_limiters

    @property
    def azure_entra_id(self) -> Optional[AzureEntraIdConfiguration]:
        """Return Azure Entra ID configuration, or None if not provided."""
        if self._configuration is None:
            raise LogicError("logic error: configuration is not loaded")
        return self._configuration.azure_entra_id


configuration: AppConfig = AppConfig()