Whisk Audio Transcription Tool

Whisk is a command-line tool for transcribing audio files using OpenAI's Whisper model, with a focus on Brazilian Portuguese content. It provides a simple, efficient way to convert speech to text from various audio formats.

Features

  • Transcribe single audio files or entire directories
  • Support for multiple audio formats (.mp3, .wav, .m4a, .opus)
  • Multiple output formats (plain text, SRT subtitles, VTT subtitles, JSON)
  • Language selection with auto-detection capability
  • Choice of Whisper model sizes for different accuracy levels
  • Option to skip already transcribed files
  • Customizable output directory

Installation

Prerequisites

  • Python 3.10 or later (Whisper works best with Python 3.10 and may not work reliably on 3.13)
  • FFmpeg installed and accessible in your PATH

Setup

  1. Clone this repository:

    git clone https://github.com/mattfelber/whisk.git
    cd whisk
  2. Create and activate a virtual environment:

    python -m venv venv
    # On Windows
    .\venv\Scripts\activate
    # On macOS/Linux
    source venv/bin/activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Make sure FFmpeg is installed:

    • Windows: Download from ffmpeg.org or install using Chocolatey: choco install ffmpeg
    • macOS: Install using Homebrew: brew install ffmpeg
    • Linux: Install using your package manager, e.g., apt install ffmpeg
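
    You can verify that FFmpeg is available on your PATH by running:

    ffmpeg -version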

Usage

Basic Usage

Transcribe a single audio file:

python whisk.py --file "path/to/audio.opus"

Transcribe all audio files in a directory:

python whisk.py --directory "path/to/directory"

Language Options

By default, Whisk assumes the audio is in Portuguese. You can specify a different language:

python whisk.py --file "path/to/audio.mp3" --language en

Available language options:

  • pt - Portuguese (default)
  • en - English
  • es - Spanish
  • fr - French
  • de - German
  • it - Italian
  • nl - Dutch
  • ja - Japanese
  • zh - Chinese
  • auto - Auto-detect language
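
For example, to let Whisper auto-detect the language:

python whisk.py --file "path/to/audio.opus" --language auto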

Model Selection

Whisper offers different model sizes with varying accuracy and performance trade-offs:

python whisk.py --file "path/to/audio.mp3" --model medium

Available models:

Model    Parameters   English-only model   Multilingual model   Required VRAM   Relative speed
tiny     39 M         tiny.en              tiny                 ~1 GB           ~32x
base     74 M         base.en              base                 ~1 GB           ~16x
small    244 M        small.en             small                ~2 GB           ~6x
medium   769 M        medium.en            medium               ~5 GB           ~2x
large    1550 M       N/A                  large                ~10 GB          1x

The larger models provide better accuracy but require more computational resources and time. For most purposes, the base or small models offer a good balance between accuracy and speed.

Output Formats

Whisk supports multiple output formats:

python whisk.py --file "path/to/audio.mp3" --output-format srt

Available formats:

  • txt - Plain text (default)
  • srt - SubRip subtitle format with timestamps
  • vtt - WebVTT subtitle format with timestamps
  • json - JSON format with detailed information including timestamps

About SRT and VTT Formats

The SRT and VTT formats include timestamps that align with the audio content. Whisper's timestamp generation is quite accurate, especially with the larger models. Each subtitle segment includes:

  1. A sequence number
  2. The start and end time of the segment
  3. The transcribed text for that segment

Example SRT format:

1
00:00:00,000 --> 00:00:05,000
Hello, this is an example of a subtitle.

2
00:00:05,100 --> 00:00:08,500
This is the second subtitle segment.

These formats are perfect for creating subtitles for videos or for analyzing the timing of speech in audio recordings.

Other Options

Skip files that already have transcriptions:

python whisk.py --directory "path/to/directory" --skip-existing

Save transcriptions to a specific directory:

python whisk.py --directory "path/to/directory" --output-dir "path/to/output"

Enable verbose logging:

python whisk.py --file "path/to/audio.mp3" --verbose

Specify a custom FFmpeg path:

python whisk.py --file "path/to/audio.mp3" --ffmpeg-path "path/to/ffmpeg"
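
These options can be combined. For example, to transcribe a directory with the small model, write SRT files to a separate output folder, and skip files that already have transcriptions:

python whisk.py --directory "path/to/directory" --model small --output-format srt --output-dir "path/to/output" --skip-existing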

Full Command Reference

usage: whisk.py [-h] (-f FILE | -d DIRECTORY) [-m {tiny,base,small,medium,large}]
                [-l {pt,en,es,fr,de,it,nl,ja,zh,auto}] [--skip-existing]
                [-o OUTPUT_DIR] [--output-format {txt,srt,vtt,json}]
                [--ffmpeg-path FFMPEG_PATH] [--verbose]

Whisk - Audio Transcription Tool using OpenAI's Whisper model

options:
  -h, --help            show this help message and exit

Input Options:
  -f FILE, --file FILE  Path to a single audio file to transcribe
  -d DIRECTORY, --directory DIRECTORY
                        Path to a directory containing audio files to transcribe

Processing Options:
  -m {tiny,base,small,medium,large}, --model {tiny,base,small,medium,large}
                        Whisper model to use for transcription (default: base)
  -l {pt,en,es,fr,de,it,nl,ja,zh,auto}, --language {pt,en,es,fr,de,it,nl,ja,zh,auto}
                        Language of the audio content (default: pt)
  --skip-existing       Skip files that already have a transcription

Output Options:
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Directory to save transcriptions (defaults to same as input)
  --output-format {txt,srt,vtt,json}
                        Format for the transcription output (default: txt)

Advanced Options:
  --ffmpeg-path FFMPEG_PATH
                        Path to FFmpeg executable (if not in PATH)
  --verbose             Enable verbose logging

Tips for Best Results

  1. Audio Quality: Whisper works best with clear audio. If possible, use audio with minimal background noise and good recording quality.

  2. Model Selection:

    • For short, clear audio in common languages, the base model is often sufficient.
    • For longer or more complex audio, consider using the small or medium models.
    • The large model provides the best accuracy but requires significant computational resources.
  3. Language Selection: Specifying the correct language can improve transcription accuracy. If you're unsure, you can use the auto option.

  4. File Format: While Whisk supports various audio formats, using WAV format with a 16 kHz sample rate can sometimes yield better results (see the example FFmpeg commands after this list).

  5. Segmentation: For very long audio files, consider splitting them into smaller segments for better accuracy and easier processing (see the example FFmpeg commands after this list).
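
For tips 4 and 5, FFmpeg (which Whisk already requires) can handle both the conversion and the splitting. For example, to convert an audio file to 16 kHz mono WAV, and to split a long recording into 10-minute chunks (the output paths here are just placeholders):

ffmpeg -i "path/to/audio.mp3" -ar 16000 -ac 1 "path/to/audio.wav"
ffmpeg -i "path/to/audio.wav" -f segment -segment_time 600 -c copy "path/to/audio_%03d.wav"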

Troubleshooting

FFmpeg Not Found

If you see an error about FFmpeg not being found, make sure it's installed and in your system PATH, or specify the path explicitly:

python whisk.py --file "path/to/audio.mp3" --ffmpeg-path "path/to/ffmpeg"

Memory Issues

If you encounter memory errors when using larger models, try:

  • Using a smaller model (e.g., base instead of medium)
  • Processing shorter audio files
  • Closing other memory-intensive applications

Transcription Accuracy

If the transcription accuracy is poor:

  • Try a larger model
  • Ensure you've specified the correct language
  • Check the audio quality and consider preprocessing to reduce noise (see the example command below)
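
As one simple preprocessing option, FFmpeg's afftdn filter applies a basic denoise pass before transcription (the output name is a placeholder; results depend on the recording):

ffmpeg -i "path/to/audio.mp3" -af afftdn "path/to/audio_denoised.wav"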

Windows File Path Issues

If you encounter issues with file paths in Windows, especially with spaces or special characters:

  • Use double quotes around file paths
  • Try using absolute paths instead of relative paths
  • Avoid special characters in file and directory names

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • OpenAI Whisper for the incredible speech recognition model
  • FFmpeg for audio processing capabilities
