@basnijholt (Owner)
Summary

  • Add speaker diarization as a post-processing step for transcription using pyannote-audio
  • Identifies and labels different speakers in the transcript (useful for meetings, interviews, multi-speaker audio)
  • Works with any ASR provider (Wyoming, OpenAI, Gemini)
  • New optional dependency: `pip install "agent-cli[diarization]"`

New CLI Options

| Option | Description |
| --- | --- |
| `--diarize` / `--no-diarize` | Enable or disable speaker diarization |
| `--diarize-format` | Output format: `inline` (default) or `json` |
| `--hf-token` | HuggingFace token for pyannote models (required) |
| `--min-speakers` | Hint for the minimum number of speakers |
| `--max-speakers` | Hint for the maximum number of speakers |

Output Formats

Inline (default):

```
[SPEAKER_00]: Hello, how are you?
[SPEAKER_01]: I'm doing well, thanks!
```

JSON:

```json
{
  "segments": [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5, "text": "Hello, how are you?"}
  ]
}
```
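The JSON segments above pair each piece of transcript text with the diarization speaker active at that time. As a rough illustration of how such an alignment can work, the sketch below assigns each ASR segment the speaker whose diarization turn overlaps it most; `assign_speakers` and its input shapes are illustrative assumptions, not the PR's actual `align_transcript_with_speakers` API:

```python
# Hypothetical sketch of speaker alignment, not the PR's actual implementation:
# each transcript segment gets the speaker whose diarization turn has the
# largest time overlap with it.

def assign_speakers(transcript_segments, speaker_turns):
    """transcript_segments: [{"start": s, "end": e, "text": t}, ...]
    speaker_turns: [{"speaker": "SPEAKER_00", "start": s, "end": e}, ...]
    Returns segments shaped like the JSON example above."""
    aligned = []
    for seg in transcript_segments:
        best_speaker, best_overlap = "SPEAKER_00", 0.0  # default if nothing overlaps
        for turn in speaker_turns:
            # Overlap of [seg.start, seg.end] with [turn.start, turn.end]
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = turn["speaker"], overlap
        aligned.append({"speaker": best_speaker, **seg})
    return aligned
```

Max-overlap assignment keeps the ASR segmentation intact, which matters because ASR and diarization segment boundaries rarely coincide exactly.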

Usage Examples

```bash
# Install the diarization extra (quoted so zsh doesn't expand the brackets)
pip install "agent-cli[diarization]"

# Basic diarization
agent-cli transcribe --diarize --hf-token YOUR_HF_TOKEN

# Diarize a meeting recording with known participant bounds
agent-cli transcribe --from-file meeting.wav --diarize \
  --min-speakers 2 --max-speakers 4 --hf-token YOUR_HF_TOKEN
```
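The `json` output format is convenient for downstream scripting. A minimal sketch of consuming it, assuming only the JSON shape shown above (the inline string here stands in for the command's captured stdout):

```python
import json

# Parse --diarize-format json output and render it in the inline style.
# The raw string below stands in for captured command output.
raw = '{"segments": [{"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5, "text": "Hello, how are you?"}]}'
data = json.loads(raw)
lines = [f"[{seg['speaker']}]: {seg['text']}" for seg in data["segments"]]
print("\n".join(lines))  # [SPEAKER_00]: Hello, how are you?
```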

Test plan

  • Unit tests for DiarizedSegment dataclass
  • Unit tests for align_transcript_with_speakers function
  • Unit tests for format_diarized_output (inline and JSON)
  • Unit tests for SpeakerDiarizer class with mocked pyannote
  • Updated existing transcribe recovery tests with new parameters
  • All 513 tests passing
  • Pre-commit hooks passing
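To give a flavor of the listed formatting tests, here is a self-contained sketch; `format_inline` is a stub standing in for the PR's `format_diarized_output`, whose real signature isn't shown here:

```python
# Sketch of a unit test for inline formatting. format_inline is a stub,
# not the PR's actual format_diarized_output.
def format_inline(segments):
    return "\n".join(f"[{s['speaker']}]: {s['text']}" for s in segments)

def test_format_inline():
    segments = [
        {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5, "text": "Hello, how are you?"},
        {"speaker": "SPEAKER_01", "start": 2.6, "end": 4.8, "text": "I'm doing well, thanks!"},
    ]
    expected = "[SPEAKER_00]: Hello, how are you?\n[SPEAKER_01]: I'm doing well, thanks!"
    assert format_inline(segments) == expected

test_format_inline()
```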
