5 changes: 2 additions & 3 deletions README.md
@@ -125,9 +125,8 @@ Lightspeed Core Stack (LCS) supports the large language models from the provider
| OpenAI | gpt-5, gpt-4o, gpt4-turbo, gpt-4.1, o1, o3, o4 | Yes | remote::openai | [1](examples/openai-faiss-run.yaml) [2](examples/openai-pgvector-run.yaml) |
| OpenAI | gpt-3.5-turbo, gpt-4 | No | remote::openai | |
| RHAIIS (vLLM)| meta-llama/Llama-3.1-8B-Instruct | Yes | remote::vllm | [1](tests/e2e/configs/run-rhaiis.yaml) |
| RHEL AI (vLLM)| meta-llama/Llama-3.1-8B-Instruct | Yes | remote::vllm | [1](tests/e2e/configs/run-rhelai.yaml) |
| Azure | gpt-5, gpt-5-mini, gpt-5-nano, gpt-5-chat, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, o3-mini, o4-mini | Yes | remote::azure | [1](examples/azure-run.yaml) |
| Azure | o1, o1-mini | No | remote::azure | |
| Azure | gpt-5, gpt-5-mini, gpt-5-nano, gpt-4o-mini, o3-mini, o4-mini, o1| Yes | remote::azure | [1](examples/azure-run.yaml) |
| Azure | gpt-5-chat, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, o1-mini | No or limited | remote::azure | |

The "provider_type" is used in the llama stack configuration file when referring to the provider.

43 changes: 42 additions & 1 deletion docs/providers.md
@@ -36,7 +36,7 @@ The tables below summarize each provider category, containing the following atri
| meta-reference | inline | `accelerate`, `fairscale`, `torch`, `torchvision`, `transformers`, `zmq`, `lm-format-enforcer`, `sentence-transformers`, `torchao==0.8.0`, `fbgemm-gpu-genai==1.1.2` | ❌ |
| sentence-transformers | inline | `torch torchvision torchao>=0.12.0 --extra-index-url https://download.pytorch.org/whl/cpu`, `sentence-transformers --no-deps` | ❌ |
| anthropic | remote | `litellm` | ❌ |
| azure | remote | | ✅ |
| azure | remote | `litellm` | ✅ |
| bedrock | remote | `boto3` | ❌ |
| cerebras | remote | `cerebras_cloud_sdk` | ❌ |
| databricks | remote | — | ❌ |
@@ -64,6 +64,47 @@ Red Hat providers:
| RHAIIS (vllm) | 3.2.3 (on RHEL 9.20250429.0.4) | remote | `openai` | ✅ |
| RHEL AI (vllm) | 1.5.2 | remote | `openai` | ✅ |

### Azure Provider - Entra ID Authentication Guide

Lightspeed Core supports secure authentication using Microsoft Entra ID (formerly Azure Active Directory) for the Azure Inference Provider. This allows you to connect to Azure OpenAI without using API keys, by authenticating through your organization’s Azure identity.

#### Lightspeed Core configuration requirements

To enable Entra ID authentication, the `azure_entra_id` block must be included in your LCS configuration, and all three attributes (`tenant_id`, `client_id`, and `client_secret`) are required. Authentication will not work if any of them is missing:

```yaml
azure_entra_id:
  tenant_id: ${env.AZURE_TENANT_ID}
  client_id: ${env.AZURE_CLIENT_ID}
  client_secret: ${env.AZURE_CLIENT_SECRET}
```
**Note:** We strongly recommend loading these values from environment variables or a secrets manager rather than hardcoding them in the configuration file.

#### Llama Stack Configuration Requirements

Because Lightspeed builds on top of Llama Stack, the `config` block for the Azure provider must include `api_key`, `api_base`, and `api_version` to satisfy the base Llama Stack schema.

While `api_key` is not used in Entra ID mode, it must still be set to a dummy value because Llama Stack validates its presence. The `api_base` and `api_version` fields remain required and are used during Entra ID authentication.

```yaml
inference:
  - provider_id: azure
    provider_type: remote::azure
    config:
      api_key: ${env.AZURE_API_KEY:=} # Required but not used for Entra ID
      api_base: ${env.AZURE_API_BASE}
      api_version: 2025-01-01-preview
```
**Note:** Llama Stack currently supports only static API key authentication through the LiteLLM SDK. Lightspeed extends this behavior by dynamically injecting Entra ID access tokens into each request, enabling full compatibility while maintaining schema compliance with the Llama Stack configuration.
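The injection described in the note works by rewriting the `X-LlamaStack-Provider-Data` request header. A minimal sketch, mirroring the header handling added in this PR's `src/client.py` (the helper name `with_azure_token` is hypothetical):

```python
import json


def with_azure_token(
    headers: dict, access_token: str, api_base: str, api_version: str
) -> dict:
    """Merge an Entra ID access token into the X-LlamaStack-Provider-Data header."""
    # Preserve any provider data already present in the header.
    provider_data = json.loads(headers.get("X-LlamaStack-Provider-Data") or "{}")
    provider_data.update(
        {
            "azure_api_key": access_token,  # the short-lived token stands in for a static key
            "azure_api_base": api_base,
            "azure_api_version": api_version,
        }
    )
    # Return a new header dict; all non-Azure headers are left untouched.
    return {**headers, "X-LlamaStack-Provider-Data": json.dumps(provider_data)}
```

Because LiteLLM reads the Azure key from provider data on every request, swapping the header value per request is enough to keep the Llama Stack schema unchanged.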

#### Access Token Lifecycle and Management

When the service starts or an inference request is made:
1. The system reads your Entra ID configuration.
1. It checks whether a valid access token already exists:
   - If no token exists or the current token has expired, the system automatically requests a new one from Microsoft Entra ID using your credentials.
   - If a valid token is still active, it is reused and no new request is made.
1. The access token grants access to Azure OpenAI Services.
1. Tokens are automatically refreshed as needed before they expire. Access tokens are typically valid for 1 hour, and this process happens entirely in the background without any manual action.
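The lifecycle above can be sketched as follows (hypothetical `TokenCache` and `fetch` names; in Lightspeed the real implementation is the `AzureEntraIDTokenManager` singleton):

```python
import time

TOKEN_EXPIRATION_LEEWAY = 30  # refresh slightly before the real expiry (seconds)


class TokenCache:
    """Minimal sketch of the token lifecycle; fetch() stands in for the Entra ID call."""

    def __init__(self) -> None:
        self._token = None
        self._expires_on = 0.0  # epoch seconds

    def get_token(self, fetch) -> str:
        # Reuse a still-valid cached token; no new request is made.
        if self._token and time.time() < self._expires_on:
            return self._token
        # Otherwise request a fresh token; fetch() returns (token, expires_on).
        token, expires_on = fetch()
        self._token = token
        self._expires_on = expires_on - TOKEN_EXPIRATION_LEEWAY
        return token
```

The 30-second leeway mirrors `TOKEN_EXPIRATION_LEEWAY` in this PR, so a token is refreshed slightly before Azure would reject it.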

---

5 changes: 2 additions & 3 deletions examples/azure-run.yaml
@@ -72,10 +72,9 @@ providers:
  - provider_id: azure
    provider_type: remote::azure
    config:
      api_key: ${env.AZURE_API_KEY}
      api_key: ${env.AZURE_API_KEY:=}
      api_base: https://ols-test.openai.azure.com/
      api_version: 2024-02-15-preview
      api_type: ${env.AZURE_API_TYPE:=}
      api_version: 2025-01-01-preview
  post_training:
  - provider_id: huggingface
    provider_type: inline::huggingface-gpu
30 changes: 30 additions & 0 deletions examples/lightspeed-stack-azure-entraid.yaml
@@ -0,0 +1,30 @@
name: Lightspeed Core Service (LCS)
service:
  host: 0.0.0.0
  port: 8080
  auth_enabled: false
  workers: 1
  color_log: true
  access_log: true
llama_stack:
  # Uses a remote llama-stack service
  # The instance would have already been started with a llama-stack-run.yaml file
  use_as_library_client: false
  # Alternative for "as library use"
  # use_as_library_client: true
  # library_client_config_path: <path-to-llama-stack-run.yaml-file>
  url: http://localhost:8321
  api_key: xyzzy
user_data_collection:
  feedback_enabled: true
  feedback_storage: "/tmp/data/feedback"
  transcripts_enabled: true
  transcripts_storage: "/tmp/data/transcripts"

authentication:
  module: "noop"

azure_entra_id:
  tenant_id: ${env.TENANT_ID}
  client_id: ${env.CLIENT_ID}
  client_secret: ${env.CLIENT_SECRET}
2 changes: 2 additions & 0 deletions pyproject.toml
@@ -51,6 +51,8 @@ dependencies = [
    # Used by authorization resolvers
    "jsonpath-ng>=1.6.1",
    "psycopg2-binary>=2.9.10",
    "azure-core",
    "azure-identity",
]


26 changes: 25 additions & 1 deletion src/app/endpoints/query.py
@@ -62,6 +62,8 @@
from utils.transcripts import store_transcript
from utils.types import TurnSummary
from utils.token_counter import extract_and_update_token_metrics, TokenCounter
from authorization.azure_token_manager import AzureEntraIDTokenManager


logger = logging.getLogger("app.endpoints.handlers")
router = APIRouter(tags=["query"])
@@ -287,13 +289,35 @@ async def query_endpoint_handler_base( # pylint: disable=R0914
    try:
        check_tokens_available(configuration.quota_limiters, user_id)
        # try to get Llama Stack client
        client = AsyncLlamaStackClientHolder().get_client()
        client_holder = AsyncLlamaStackClientHolder()
        client = client_holder.get_client()
        llama_stack_model_id, model_id, provider_id = select_model_and_provider_id(
            await client.models.list(),
            *evaluate_model_hints(
                user_conversation=user_conversation, query_request=query_request
            ),
        )

        azure_token_manager = AzureEntraIDTokenManager()
        if (
            provider_id == "azure"
            and azure_token_manager.is_entra_id_configured
            and azure_token_manager.is_token_expired
        ):
            azure_token_manager.refresh_token()
            azure_config = next(
                p.config
                for p in await client.providers.list()
                if p.provider_type == "remote::azure"
            )

            client = client_holder.get_client_with_updated_azure_headers(
                access_token=azure_token_manager.access_token,
                api_base=str(azure_config.get("api_base")),
                api_version=str(azure_config.get("api_version")),
            )
            client_holder.set_client(client)

        summary, conversation_id, referenced_documents, token_usage = (
            await retrieve_response_func(
                client,
25 changes: 24 additions & 1 deletion src/app/endpoints/streaming_query.py
@@ -39,6 +39,7 @@
from authentication import get_auth_dependency
from authentication.interface import AuthTuple
from authorization.middleware import authorize
from authorization.azure_token_manager import AzureEntraIDTokenManager
from client import AsyncLlamaStackClientHolder
from configuration import configuration
from constants import DEFAULT_RAG_TOOL, MEDIA_TYPE_JSON, MEDIA_TYPE_TEXT
@@ -757,13 +758,35 @@ async def streaming_query_endpoint_handler( # pylint: disable=too-many-locals,t

    try:
        # try to get Llama Stack client
        client = AsyncLlamaStackClientHolder().get_client()
        client_holder = AsyncLlamaStackClientHolder()
        client = client_holder.get_client()
        llama_stack_model_id, model_id, provider_id = select_model_and_provider_id(
            await client.models.list(),
            *evaluate_model_hints(
                user_conversation=user_conversation, query_request=query_request
            ),
        )

        azure_token_manager = AzureEntraIDTokenManager()
        if (
            provider_id == "azure"
            and azure_token_manager.is_entra_id_configured
            and azure_token_manager.is_token_expired
        ):
            azure_token_manager.refresh_token()
            azure_config = next(
                p.config
                for p in await client.providers.list()
                if p.provider_type == "remote::azure"
            )

            client = client_holder.get_client_with_updated_azure_headers(
                access_token=azure_token_manager.access_token,
                api_base=str(azure_config.get("api_base")),
                api_version=str(azure_config.get("api_version")),
            )
            client_holder.set_client(client)

        response, conversation_id = await retrieve_response(
            client,
            llama_stack_model_id,
94 changes: 94 additions & 0 deletions src/authorization/azure_token_manager.py
@@ -0,0 +1,94 @@
"""Authentication module for Azure Entra ID Credentials."""

import logging
import time
from typing import Optional

from azure.core.credentials import AccessToken
from azure.core.exceptions import ClientAuthenticationError
from azure.identity import ClientSecretCredential

from configuration import AzureEntraIdConfiguration
from utils.types import Singleton

logger = logging.getLogger(__name__)

TOKEN_EXPIRATION_LEEWAY = 30  # seconds


class AzureEntraIDTokenManager(metaclass=Singleton):
    """Microsoft Token cache for Azure OpenAI provider.

    Manages and caches Microsoft Entra ID access tokens for the Azure OpenAI provider.
    Handles token storage, expiration checks, and refreshing tokens when an Entra ID
    configuration is provided. Designed for temporary tokens passed via request headers.
    """

    def __init__(self) -> None:
        """Construct Azure token manager."""
        self._access_token: Optional[str] = None
        self._expires_on: int = 0
        self._entra_id_config: Optional[AzureEntraIdConfiguration] = None

    def set_config(self, azure_config: AzureEntraIdConfiguration) -> None:
        """Set the Azure Entra ID configuration."""
        self._entra_id_config = azure_config

    @property
    def is_entra_id_configured(self) -> bool:
        """Check whether an Entra ID configuration has been set."""
        return self._entra_id_config is not None

    @property
    def is_token_expired(self) -> bool:
        """Check if the current token has expired (observer only)."""
        return self._expires_on == 0 or time.time() > self._expires_on

    @property
    def access_token(self) -> str:
        """Return the currently cached access token (no refresh logic)."""
        if not self._access_token:
            logger.debug("Access token requested but not yet available.")
        return self._access_token or ""

    def refresh_token(self) -> None:
        """Refresh and cache a new Azure access token if configuration is set."""
        if self._entra_id_config is None:
            raise ValueError("Azure configuration is not set for token retrieval")

        logger.info("Refreshing Azure access token...")
        token_obj = self._retrieve_access_token()
        if token_obj:
            self._update_access_token(token_obj.token, token_obj.expires_on)
            logger.info("Azure access token successfully refreshed.")
        else:
            raise RuntimeError("Failed to retrieve Azure access token")

    def _update_access_token(self, token: str, expires_on: int) -> None:
        """Update token and expiration time (private helper)."""
        self._access_token = token
        self._expires_on = expires_on - TOKEN_EXPIRATION_LEEWAY
        logger.info(
            "Access token updated; expires at %s",
            time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(self._expires_on)),
        )

    def _retrieve_access_token(self) -> Optional[AccessToken]:
        """Retrieve access token to call Azure OpenAI."""
        if not self._entra_id_config:
            return None
        tenant_id = self._entra_id_config.tenant_id.get_secret_value()
        client_id = self._entra_id_config.client_id.get_secret_value()
        client_secret = self._entra_id_config.client_secret.get_secret_value()

        try:
            credential = ClientSecretCredential(
                tenant_id=tenant_id,
                client_id=client_id,
                client_secret=client_secret,
            )
            token = credential.get_token("https://cognitiveservices.azure.com/.default")
            return token
        except ClientAuthenticationError as e:
            logger.error("Error retrieving access token: %s", e, exc_info=True)
            return None
46 changes: 46 additions & 0 deletions src/client.py
@@ -1,6 +1,7 @@
"""Llama Stack client retrieval class."""

import logging
import json

from typing import Optional

@@ -53,3 +54,48 @@ def get_client(self) -> AsyncLlamaStackClient:
                "AsyncLlamaStackClient has not been initialised. Ensure 'load(..)' has been called."
            )
        return self._lsc

    def set_client(self, new_client: AsyncLlamaStackClient) -> None:
        """
        Replace the currently stored AsyncLlamaStackClient instance.

        This method allows updating the client reference when
        configuration or runtime attributes have changed.
        """
        self._lsc = new_client

    def get_client_with_updated_azure_headers(
        self,
        access_token: str,
        api_base: str,
        api_version: str,
    ) -> AsyncLlamaStackClient:
        """Return a new client with updated Azure headers, preserving other headers."""
        if not self._lsc:
            raise RuntimeError(
                "AsyncLlamaStackClient has not been initialised. Ensure 'load(..)' has been called."
            )

        current_headers = self._lsc.default_headers if self._lsc else {}
        provider_data_json = current_headers.get("X-LlamaStack-Provider-Data")

        try:
            provider_data = json.loads(provider_data_json) if provider_data_json else {}
        except (json.JSONDecodeError, TypeError):
            provider_data = {}

        # Update only Azure-specific fields
        provider_data.update(
            {
                "azure_api_key": access_token,
                "azure_api_base": api_base,
                "azure_api_version": api_version,
                "azure_api_type": None,  # deprecated attribute
            }
        )

        updated_headers = {
            **current_headers,
            "X-LlamaStack-Provider-Data": json.dumps(provider_data),
        }
        return self._lsc.copy(set_default_headers=updated_headers)  # type: ignore
8 changes: 8 additions & 0 deletions src/configuration.py
@@ -10,6 +10,7 @@
import yaml
from models.config import (
    AuthorizationConfiguration,
    AzureEntraIdConfiguration,
    Configuration,
    Customization,
    LlamaStackConfiguration,
@@ -180,5 +181,12 @@ def quota_limiters(self) -> list[QuotaLimiter]:
        )
        return self._quota_limiters

    @property
    def azure_entra_id(self) -> Optional[AzureEntraIdConfiguration]:
        """Return Azure Entra ID configuration, or None if not provided."""
        if self._configuration is None:
            raise LogicError("logic error: configuration is not loaded")
        return self._configuration.azure_entra_id


configuration: AppConfig = AppConfig()