Description
Bug Description
The Mistral STT plugin does not transcribe user voice input.
With the default config, printing the Mistral STT client's responses inside the LiveKit plugin gives the following output: the text is empty even though tokens were consumed:
model='voxtral-mini-latest' text='' usage=UsageInfo(prompt_tokens=8, completion_tokens=4, total_tokens=387, prompt_audio_seconds=1)
language=None segments=[] finish_reason=None
The behavior is the same with any language.
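For anyone trying to confirm the same symptom, here is a minimal sketch of a check for it: audio was billed but no text came back. The `Usage` and `Transcription` dataclasses are hypothetical stand-ins that only mirror the fields printed above; they are not the Mistral SDK's types. Treat the check as a heuristic, since genuinely silent audio would also produce empty text.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    # Hypothetical stand-in mirroring the fields shown in the printed response.
    prompt_audio_seconds: int
    total_tokens: int


@dataclass
class Transcription:
    # Hypothetical stand-in, not the SDK's response class.
    text: str
    usage: Usage


def looks_like_empty_transcript(resp: Transcription) -> bool:
    """Return True when audio was processed but the transcript is empty."""
    return resp.text.strip() == "" and resp.usage.prompt_audio_seconds > 0


# Values taken from the response printed in this report.
observed = Transcription(text="", usage=Usage(prompt_audio_seconds=1, total_tokens=387))
print(looks_like_empty_transcript(observed))
```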
My analysis:
The plugin sets both timestamp_granularities=["segment"] and language in the transcription request (language is either set in the Agent implementation or defaults to en). However, Mistral's audio transcription documentation states that timestamp_granularities and language can't be used together.
Expected Behavior
A transcription should be returned.
Reproduction Steps
You can reproduce the behavior with this minimal Livekit Agent: https://gist.github.com/AntoineDrt/95ffcd2e026917564b34698994598dfd
Setup and run:
uv venv
uv pip install livekit-plugins-mistralai livekit "livekit-agents[silero]~=1.3"
export LIVEKIT_URL=wss:// LIVEKIT_API_KEY= LIVEKIT_API_SECRET= MISTRAL_API_KEY=
uv run python ./livekit_mistral_stt_empty_transcript.py console
Try talking to the Agent:
- the logs in the user_state_changed event callback show that speech is detected
- the callback associated with user_input_transcribed never gets called
Operating System
macOS 15.2 (24C101)
Models Used
Voxtral mini latest, Mistral Medium latest, Cartesia Sonic-3
Package Versions
livekit==1.3.10
livekit_agents==1.3.10
livekit_plugins_mistral==1.3.10
livekit_plugins_silero==1.3.10
Proposed Solution
Mistral's audio transcription documentation states that:
timestamp_granularities is currently not compatible with language, please use either one or the other.
I tested the Mistral STT plugin without timestamp_granularities=["segment"], and removing it does seem to fix the issue.
Judging from this comment, timestamp_granularities=["segment"] does not return timestamps anyway, so I suggest we remove it.
This solution would keep language usable.
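As a rough illustration of the idea, the sketch below builds the request arguments so the two options are never sent together. `build_transcription_kwargs` and its `want_segments` flag are hypothetical names made up for this example; this is not the plugin's actual code. It assumes only what the documentation quoted above says, namely that language and timestamp_granularities are mutually exclusive.

```python
from typing import Any, Optional


def build_transcription_kwargs(
    model: str,
    language: Optional[str] = None,
    want_segments: bool = False,
) -> dict[str, Any]:
    """Build request arguments without combining the two conflicting options.

    Hypothetical helper: language takes priority, and segment timestamps
    are only requested when no language is set.
    """
    kwargs: dict[str, Any] = {"model": model}
    if language is not None:
        kwargs["language"] = language
    elif want_segments:
        kwargs["timestamp_granularities"] = ["segment"]
    return kwargs


# With a language set, timestamp_granularities is left out of the request.
print(build_transcription_kwargs("voxtral-mini-latest", language="en", want_segments=True))
```

Simply deleting timestamp_granularities from the request, as proposed, is the simpler option; the conditional form only matters if segment timestamps become useful later.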