Whisk is a command-line tool for transcribing audio files using OpenAI's Whisper model, with a focus on Brazilian Portuguese content. It provides a simple, efficient way to convert speech to text from various audio formats.
- Transcribe single audio files or entire directories
- Support for multiple audio formats (.mp3, .wav, .m4a, .opus)
- Multiple output formats (plain text, SRT subtitles, VTT subtitles, JSON)
- Language selection with auto-detection capability
- Choice of Whisper model sizes for different accuracy levels
- Option to skip already transcribed files
- Customizable output directory
- Python 3.10 or later (Whisper works most reliably with Python 3.10; avoid Python 3.13)
- FFmpeg installed and accessible in your PATH
1. Clone this repository:

   ```bash
   git clone https://github.com/mattfelber/whisk.git
   cd whisk
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv

   # On Windows
   .\venv\Scripts\activate

   # On macOS/Linux
   source venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Make sure FFmpeg is installed:

   - Windows: Download from ffmpeg.org or install using Chocolatey:

     ```bash
     choco install ffmpeg
     ```

   - macOS: Install using Homebrew:

     ```bash
     brew install ffmpeg
     ```

   - Linux: Install using your package manager, e.g.:

     ```bash
     apt install ffmpeg
     ```
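After installing, you can optionally sanity-check the setup before running Whisk. This is not part of Whisk itself, just a quick sketch that confirms the `whisper` package imports and FFmpeg is reachable on your PATH:

```python
# Optional post-install check (not part of Whisk): verifies that the whisper
# package imports and that FFmpeg can be found on PATH.
import shutil

import whisper

assert shutil.which("ffmpeg") is not None, "FFmpeg was not found on PATH"

# "tiny" is the smallest model, so the download is quick; loading it just
# confirms the installation works end to end.
model = whisper.load_model("tiny")
print("Whisper and FFmpeg look ready to use.")
```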
Transcribe a single audio file:
```bash
python whisk.py --file "path/to/audio.opus"
```

Transcribe all audio files in a directory:

```bash
python whisk.py --directory "path/to/directory"
```

By default, Whisk assumes the audio is in Portuguese. You can specify a different language:

```bash
python whisk.py --file "path/to/audio.mp3" --language en
```

Available language options:

- `pt` - Portuguese (default)
- `en` - English
- `es` - Spanish
- `fr` - French
- `de` - German
- `it` - Italian
- `nl` - Dutch
- `ja` - Japanese
- `zh` - Chinese
- `auto` - Auto-detect language
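Whisk's own source isn't reproduced here, but a language choice like the one above typically maps onto Whisper's Python API along these lines (a sketch; the file path and model size are placeholders):

```python
# Sketch of how a --language value maps to Whisper's Python API
# (illustrative only; Whisk's internal wiring may differ).
import whisper

model = whisper.load_model("base")

# Explicit language, e.g. Portuguese:
result = model.transcribe("path/to/audio.mp3", language="pt")
print(result["text"])

# Auto-detection: omit the language and Whisper infers it from the audio.
detected = model.transcribe("path/to/audio.mp3")
print(detected["language"])
```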
Whisper offers different model sizes with varying accuracy and performance trade-offs:
```bash
python whisk.py --file "path/to/audio.mp3" --model medium
```

Available models:
| Model | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|---|---|---|---|---|---|
| tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x |
| base | 74 M | base.en | base | ~1 GB | ~16x |
| small | 244 M | small.en | small | ~2 GB | ~6x |
| medium | 769 M | medium.en | medium | ~5 GB | ~2x |
| large | 1550 M | N/A | large | ~10 GB | 1x |
The larger models provide better accuracy but require more computational resources and time. For most purposes, the base or small models offer a good balance between accuracy and speed.
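If you want to measure this trade-off on your own hardware, a rough timing loop over a few model sizes is enough. This is a sketch, not part of Whisk; the audio path is a placeholder and results will vary with your machine:

```python
# Rough speed comparison across model sizes (a sketch; timings depend on
# hardware and audio length).
import time

import whisper

for size in ("tiny", "base", "small"):
    model = whisper.load_model(size)
    start = time.perf_counter()
    result = model.transcribe("path/to/audio.mp3", language="pt")
    elapsed = time.perf_counter() - start
    print(f"{size}: {elapsed:.1f}s, {len(result['text'])} characters transcribed")
```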
Whisk supports multiple output formats:
```bash
python whisk.py --file "path/to/audio.mp3" --output-format srt
```

Available formats:

- `txt` - Plain text (default)
- `srt` - SubRip subtitle format with timestamps
- `vtt` - WebVTT subtitle format with timestamps
- `json` - JSON format with detailed information including timestamps
The SRT and VTT formats include timestamps that align with the audio content. Whisper's timestamp generation is quite accurate, especially with the larger models. Each subtitle segment includes:
- A sequence number
- The start and end time of the segment
- The transcribed text for that segment
Example SRT format:
```
1
00:00:00,000 --> 00:00:05,000
Hello, this is an example of a subtitle.

2
00:00:05,100 --> 00:00:08,500
This is the second subtitle segment.
```
These formats are perfect for creating subtitles for videos or for analyzing the timing of speech in audio recordings.
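For reference, here is a sketch of how Whisper's segment output can be turned into SRT entries like the example above. It uses Whisper's Python API directly rather than Whisk's own code, and the audio path is a placeholder:

```python
# Sketch: converting Whisper's segment list into SRT text (illustrative only,
# not Whisk's implementation).
import whisper

def srt_timestamp(seconds: float) -> str:
    # SRT timestamps use the form HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

model = whisper.load_model("base")
result = model.transcribe("path/to/audio.mp3", language="pt")

lines = []
for index, segment in enumerate(result["segments"], start=1):
    lines.append(str(index))  # sequence number
    lines.append(f"{srt_timestamp(segment['start'])} --> {srt_timestamp(segment['end'])}")
    lines.append(segment["text"].strip())  # transcribed text for the segment
    lines.append("")  # blank separator between entries
print("\n".join(lines))
```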
Skip files that already have transcriptions:
```bash
python whisk.py --directory "path/to/directory" --skip-existing
```

Save transcriptions to a specific directory:

```bash
python whisk.py --directory "path/to/directory" --output-dir "path/to/output"
```

Enable verbose logging:

```bash
python whisk.py --file "path/to/audio.mp3" --verbose
```

Specify a custom FFmpeg path:

```bash
python whisk.py --file "path/to/audio.mp3" --ffmpeg-path "path/to/ffmpeg"
```

Full command-line reference:

```text
usage: whisk.py [-h] (-f FILE | -d DIRECTORY) [-m {tiny,base,small,medium,large}]
                [-l {pt,en,es,fr,de,it,nl,ja,zh,auto}] [--skip-existing]
                [-o OUTPUT_DIR] [--output-format {txt,srt,vtt,json}]
                [--ffmpeg-path FFMPEG_PATH] [--verbose]

Whisk - Audio Transcription Tool using OpenAI's Whisper model

options:
  -h, --help            show this help message and exit

Input Options:
  -f FILE, --file FILE  Path to a single audio file to transcribe
  -d DIRECTORY, --directory DIRECTORY
                        Path to a directory containing audio files to transcribe

Processing Options:
  -m {tiny,base,small,medium,large}, --model {tiny,base,small,medium,large}
                        Whisper model to use for transcription (default: base)
  -l {pt,en,es,fr,de,it,nl,ja,zh,auto}, --language {pt,en,es,fr,de,it,nl,ja,zh,auto}
                        Language of the audio content (default: pt)
  --skip-existing       Skip files that already have a transcription

Output Options:
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Directory to save transcriptions (defaults to same as input)
  --output-format {txt,srt,vtt,json}
                        Format for the transcription output (default: txt)

Advanced Options:
  --ffmpeg-path FFMPEG_PATH
                        Path to FFmpeg executable (if not in PATH)
  --verbose             Enable verbose logging
```
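whisk.py's actual source is not shown here, but if you want to script a similar interface yourself, a minimal argparse setup that approximates the options above might look like this (a sketch; the named argument groups from the real help output are omitted for brevity):

```python
# Minimal argparse sketch approximating Whisk's CLI (not the actual source).
import argparse

parser = argparse.ArgumentParser(
    description="Whisk - Audio Transcription Tool using OpenAI's Whisper model")

# Exactly one input source: a single file or a directory.
source = parser.add_mutually_exclusive_group(required=True)
source.add_argument("-f", "--file", help="Path to a single audio file to transcribe")
source.add_argument("-d", "--directory", help="Path to a directory of audio files")

parser.add_argument("-m", "--model", default="base",
                    choices=["tiny", "base", "small", "medium", "large"])
parser.add_argument("-l", "--language", default="pt",
                    choices=["pt", "en", "es", "fr", "de", "it", "nl", "ja", "zh", "auto"])
parser.add_argument("--skip-existing", action="store_true")
parser.add_argument("-o", "--output-dir")
parser.add_argument("--output-format", default="txt",
                    choices=["txt", "srt", "vtt", "json"])
parser.add_argument("--ffmpeg-path")
parser.add_argument("--verbose", action="store_true")

args = parser.parse_args()
print(args)
```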
- Audio Quality: Whisper works best with clear audio. If possible, use audio with minimal background noise and good recording quality.
- Model Selection:
  - For short, clear audio in common languages, the `base` model is often sufficient.
  - For longer or more complex audio, consider using the `small` or `medium` models.
  - The `large` model provides the best accuracy but requires significant computational resources.
- Language Selection: Specifying the correct language can improve transcription accuracy. If you're unsure, you can use the `auto` option.
- File Format: While Whisk supports various audio formats, using WAV format with a 16 kHz sample rate can sometimes yield better results (see the conversion sketch after this list).
- Segmentation: For very long audio files, consider splitting them into smaller segments for better accuracy and easier processing.
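The File Format tip above can be automated with FFmpeg. Below is a small sketch (not part of Whisk; the paths are placeholders) that resamples an audio file to 16 kHz mono WAV, which matches the format Whisper uses internally:

```python
# Sketch: preprocess audio to 16 kHz mono WAV with FFmpeg (requires FFmpeg on PATH).
import subprocess

def to_16khz_wav(src: str, dst: str) -> None:
    # -ar 16000 resamples to 16 kHz; -ac 1 downmixes to mono; -y overwrites dst.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst],
                   check=True)

to_16khz_wav("path/to/audio.mp3", "path/to/audio_16k.wav")
```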
If you see an error about FFmpeg not being found, make sure it's installed and in your system PATH, or specify the path explicitly:
```bash
python whisk.py --file "path/to/audio.mp3" --ffmpeg-path "path/to/ffmpeg"
```

If you encounter memory errors when using larger models, try:

- Using a smaller model (e.g., `base` instead of `medium`)
- Processing shorter audio files
- Closing other memory-intensive applications
If the transcription accuracy is poor:
- Try a larger model
- Ensure you've specified the correct language
- Check the audio quality and consider preprocessing to reduce noise
If you encounter issues with file paths in Windows, especially with spaces or special characters:
- Use double quotes around file paths
- Try using absolute paths instead of relative paths
- Avoid special characters in file and directory names
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI Whisper for the incredible speech recognition model
- FFmpeg for audio processing capabilities