Whisk Audio Transcription Tool

Whisk is a command-line tool for transcribing audio files using OpenAI's Whisper model, with a focus on Brazilian Portuguese content. It provides a simple, efficient way to convert speech to text from various audio formats.

Features

  • Transcribe single audio files or entire directories
  • Support for multiple audio formats (.mp3, .wav, .m4a, .opus)
  • Multiple output formats (plain text, SRT subtitles, VTT subtitles, JSON)
  • Language selection with auto-detection capability
  • Choice of Whisper model sizes for different accuracy levels
  • Option to skip already transcribed files
  • Customizable output directory

Installation

Prerequisites

  • Python 3.10 or later (Whisper works best with Python 3.10 and may not work reliably on 3.13)
  • FFmpeg installed and accessible in your PATH

Setup

  1. Clone this repository:

    git clone https://github.com/mattfelber/whisk.git
    cd whisk
  2. Create and activate a virtual environment:

    python -m venv venv
    # On Windows
    .\venv\Scripts\activate
    # On macOS/Linux
    source venv/bin/activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Make sure FFmpeg is installed:

    • Windows: Download from ffmpeg.org or install using Chocolatey: choco install ffmpeg
    • macOS: Install using Homebrew: brew install ffmpeg
    • Linux: Install using your package manager, e.g., apt install ffmpeg
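
    You can verify that FFmpeg is available on your PATH by running:

    ffmpeg -version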

Usage

Basic Usage

Transcribe a single audio file:

python whisk.py --file "path/to/audio.opus"

Transcribe all audio files in a directory:

python whisk.py --directory "path/to/directory"

Language Options

By default, Whisk assumes the audio is in Portuguese. You can specify a different language:

python whisk.py --file "path/to/audio.mp3" --language en

Available language options:

  • pt - Portuguese (default)
  • en - English
  • es - Spanish
  • fr - French
  • de - German
  • it - Italian
  • nl - Dutch
  • ja - Japanese
  • zh - Chinese
  • auto - Auto-detect language
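
For example, to let Whisper auto-detect the language:

python whisk.py --file "path/to/audio.opus" --language auto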

Model Selection

Whisper offers different model sizes with varying accuracy and performance trade-offs:

python whisk.py --file "path/to/audio.mp3" --model medium

Available models:

Model    Parameters   English-only model   Multilingual model   Required VRAM   Relative speed
tiny     39 M         tiny.en              tiny                 ~1 GB           ~32x
base     74 M         base.en              base                 ~1 GB           ~16x
small    244 M        small.en             small                ~2 GB           ~6x
medium   769 M        medium.en            medium               ~5 GB           ~2x
large    1550 M       N/A                  large                ~10 GB          1x

The larger models provide better accuracy but require more computational resources and time. For most purposes, the base or small models offer a good balance between accuracy and speed.

Output Formats

Whisk supports multiple output formats:

python whisk.py --file "path/to/audio.mp3" --output-format srt

Available formats:

  • txt - Plain text (default)
  • srt - SubRip subtitle format with timestamps
  • vtt - WebVTT subtitle format with timestamps
  • json - JSON format with detailed information including timestamps

About SRT and VTT Formats

The SRT and VTT formats include timestamps that align with the audio content. Whisper's timestamp generation is quite accurate, especially with the larger models. Each subtitle segment includes:

  1. A sequence number
  2. The start and end time of the segment
  3. The transcribed text for that segment

Example SRT format:

1
00:00:00,000 --> 00:00:05,000
Hello, this is an example of a subtitle.

2
00:00:05,100 --> 00:00:08,500
This is the second subtitle segment.

These formats are perfect for creating subtitles for videos or for analyzing the timing of speech in audio recordings.

Other Options

Skip files that already have transcriptions:

python whisk.py --directory "path/to/directory" --skip-existing

Save transcriptions to a specific directory:

python whisk.py --directory "path/to/directory" --output-dir "path/to/output"

Enable verbose logging:

python whisk.py --file "path/to/audio.mp3" --verbose

Specify a custom FFmpeg path:

python whisk.py --file "path/to/audio.mp3" --ffmpeg-path "path/to/ffmpeg"
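
These options can be combined. For example, to transcribe a directory with the small model, write SRT files to a separate output folder, and skip files that already have transcriptions:

python whisk.py --directory "path/to/directory" --model small --output-format srt --output-dir "path/to/output" --skip-existing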

Full Command Reference

usage: whisk.py [-h] (-f FILE | -d DIRECTORY) [-m {tiny,base,small,medium,large}]
                [-l {pt,en,es,fr,de,it,nl,ja,zh,auto}] [--skip-existing]
                [-o OUTPUT_DIR] [--output-format {txt,srt,vtt,json}]
                [--ffmpeg-path FFMPEG_PATH] [--verbose]

Whisk - Audio Transcription Tool using OpenAI's Whisper model

options:
  -h, --help            show this help message and exit

Input Options:
  -f FILE, --file FILE  Path to a single audio file to transcribe
  -d DIRECTORY, --directory DIRECTORY
                        Path to a directory containing audio files to transcribe

Processing Options:
  -m {tiny,base,small,medium,large}, --model {tiny,base,small,medium,large}
                        Whisper model to use for transcription (default: base)
  -l {pt,en,es,fr,de,it,nl,ja,zh,auto}, --language {pt,en,es,fr,de,it,nl,ja,zh,auto}
                        Language of the audio content (default: pt)
  --skip-existing       Skip files that already have a transcription

Output Options:
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Directory to save transcriptions (defaults to same as input)
  --output-format {txt,srt,vtt,json}
                        Format for the transcription output (default: txt)

Advanced Options:
  --ffmpeg-path FFMPEG_PATH
                        Path to FFmpeg executable (if not in PATH)
  --verbose             Enable verbose logging

Tips for Best Results

  1. Audio Quality: Whisper works best with clear audio. If possible, use audio with minimal background noise and good recording quality.

  2. Model Selection:

    • For short, clear audio in common languages, the base model is often sufficient.
    • For longer or more complex audio, consider using the small or medium models.
    • The large model provides the best accuracy but requires significant computational resources.
  3. Language Selection: Specifying the correct language can improve transcription accuracy. If you're unsure, you can use the auto option.

  4. File Format: While Whisk supports various audio formats, using WAV format with a 16 kHz sample rate can sometimes yield better results (see the example FFmpeg commands after this list).

  5. Segmentation: For very long audio files, consider splitting them into smaller segments for better accuracy and easier processing (see the example FFmpeg commands after this list).
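
For tips 4 and 5, FFmpeg (which Whisk already requires) can handle both the conversion and the splitting. For example, to convert an audio file to 16 kHz mono WAV, and to split a long recording into 10-minute chunks (the output paths here are just placeholders):

ffmpeg -i "path/to/audio.mp3" -ar 16000 -ac 1 "path/to/audio.wav"
ffmpeg -i "path/to/audio.wav" -f segment -segment_time 600 -c copy "path/to/audio_%03d.wav"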

Troubleshooting

FFmpeg Not Found

If you see an error about FFmpeg not being found, make sure it's installed and in your system PATH, or specify the path explicitly:

python whisk.py --file "path/to/audio.mp3" --ffmpeg-path "path/to/ffmpeg"

Memory Issues

If you encounter memory errors when using larger models, try:

  • Using a smaller model (e.g., base instead of medium)
  • Processing shorter audio files
  • Closing other memory-intensive applications

Transcription Accuracy

If the transcription accuracy is poor:

  • Try a larger model
  • Ensure you've specified the correct language
  • Check the audio quality and consider preprocessing to reduce noise (see the example command below)
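
As one simple preprocessing option, FFmpeg's afftdn filter applies a basic denoise pass before transcription (the output name is a placeholder; results depend on the recording):

ffmpeg -i "path/to/audio.mp3" -af afftdn "path/to/audio_denoised.wav"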

Windows File Path Issues

If you encounter issues with file paths in Windows, especially with spaces or special characters:

  • Use double quotes around file paths
  • Try using absolute paths instead of relative paths
  • Avoid special characters in file and directory names

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • OpenAI Whisper for the incredible speech recognition model
  • FFmpeg for audio processing capabilities
