This project demonstrates how to convert MP3 audio files to text using OpenAI’s Whisper model (openai/whisper-tiny). It includes implementations for both Google Colab (via a Jupyter notebook) and local execution on Linux (via a Python script run in the terminal).
The goal of this project is to transcribe MP3 audio files into text using the Whisper model from OpenAI, provided through Hugging Face’s transformers library. The project supports running in Google Colab and running locally on Linux.
The project was developed iteratively, addressing issues like long-form audio transcription and deprecated API warnings.
Speech_to_Text.ipynb: Jupyter notebook for running the STT project in Google Colab.
stt_whisper.py: Python script for running the STT project locally in a Linux terminal.
dataset2.mp3: Sample MP3 file (an excerpt from Edgar Allan Poe’s The Raven).
dataset1.mp3: Additional sample MP3 file for testing.

Open Speech_to_Text.ipynb in Google Colab by clicking the “Open in Colab” badge at the top of the notebook, or upload it manually via colab.research.google.com.
Install transformers and ffmpeg:
!pip install transformers
!apt-get update && apt-get install -y ffmpeg
from google.colab import files
uploaded = files.upload()
Upload the-raven-100-bpm-71717.mp3 or another MP3 file when prompted.
The notebook refers to dataset2.mp3, which isn’t in the repository, so update audio_path to match your uploaded file if needed:
audio_path = "/content/the-raven-100-bpm-71717.mp3"
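Instead of editing audio_path by hand, the path can be derived from what files.upload() returns (a dict mapping each uploaded filename to its bytes). The following is a minimal sketch, assuming the upload lands in Colab's default working directory /content; the helper name first_mp3_path is hypothetical and not part of the notebook.

```python
import posixpath


def first_mp3_path(uploaded, base_dir="/content"):
    """Return the path of the first uploaded .mp3 file.

    `uploaded` is the dict returned by files.upload(), keyed by filename.
    `base_dir` is an assumption: Colab's default working directory.
    """
    for name in uploaded:
        if name.lower().endswith(".mp3"):
            return posixpath.join(base_dir, name)
    raise ValueError("No .mp3 file found in the upload")


# In the notebook, after `uploaded = files.upload()`:
# audio_path = first_mp3_path(uploaded)
```

This keeps the notebook working when a differently named MP3 is uploaded, at the cost of silently picking the first match if several files are uploaded at once.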
sudo apt-get update
sudo apt-get install python3.9 python3-pip python3-venv
python3 -m venv whisper_env
source whisper_env/bin/activate
Install ffmpeg:
sudo apt-get install ffmpeg
Install the Python dependencies:
pip install transformers torch
Place the-raven-100-bpm-71717.mp3 (or another MP3) in your project directory (e.g., ~/stt-whisper-project/).
In stt_whisper.py, update audio_path to the full path of your MP3 file:
audio_path = "/home/yourusername/stt-whisper-project/the-raven-100-bpm-71717.mp3"
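A mistyped path is an easy mistake at this step, and it is cheaper to catch before the model is loaded. The sketch below shows one way to validate the path first; resolve_audio_path is a hypothetical helper, not something stt_whisper.py is known to contain.

```python
from pathlib import Path


def resolve_audio_path(raw_path):
    """Expand '~', make the path absolute, and check it points to an MP3 file."""
    path = Path(raw_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Audio file not found: {path}")
    if path.suffix.lower() != ".mp3":
        raise ValueError(f"Expected an .mp3 file, got: {path.name}")
    return str(path)


# Example (assumed location):
# audio_path = resolve_audio_path("~/stt-whisper-project/the-raven-100-bpm-71717.mp3")
```

Using expanduser() also means a path starting with ~ can be used instead of spelling out /home/yourusername.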
Open Speech_to_Text.ipynb in Colab.
Upload an MP3 file (e.g., the-raven-100-bpm-71717.mp3).
To run locally, activate the environment and run the script:
cd ~/stt-whisper-project
source whisper_env/bin/activate
python stt_whisper.py
Using dataset2.mp3 (an excerpt from The Raven by Edgar Allan Poe), the transcription output is:
Transcribing audio...
Transcription:
Once upon a midnight dreary, while I pondered weakened weary, over many acquaintance curious volume of forgotten lore, while I nodded nearly napping suddenly there came a tapping as of someone gently wrapping, wrapping at my chamber door to some visitor I muttered, tapping at my chamber door, only this and nothing more. When the raven never flitting still is sitting, still is sitting on the pallet bust of palace just above my chamber door, and his eyes have all the seeming of demons that is dreaming and the lamplight or him streaming throws his shadow on floor, and my soul from out that shadow that lies floating on the floor shall be lifted, never more. the Raven never more.
transformers: For the Whisper model.
torch: For model inference (CPU or GPU).
ffmpeg: For audio processing.

Install locally with:
pip install transformers torch
sudo apt-get install ffmpeg
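To confirm that all three dependencies are actually available in the active environment, a quick check along these lines may help. This is a sketch using only the Python standard library; the function name missing_dependencies is hypothetical.

```python
import importlib.util
import shutil


def missing_dependencies(modules=("transformers", "torch"), tools=("ffmpeg",)):
    """Return the names of Python packages and command-line tools that cannot be found."""
    # find_spec returns None when a top-level package is not importable.
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which returns None when an executable is not on PATH.
    missing += [t for t in tools if shutil.which(t) is None]
    return missing


if __name__ == "__main__":
    problems = missing_dependencies()
    if problems:
        print("Missing: " + ", ".join(problems))
    else:
        print("All dependencies found.")
```

Run it inside the activated whisper_env so that it inspects the same environment the script will use.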
Long-form audio (longer than 30 seconds) is transcribed with return_timestamps=True, which resolves the ValueError: You have passed more than 3000 mel input features error.
A FutureWarning about the deprecated inputs parameter in transformers is present but doesn’t affect functionality. It can be suppressed by adding:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
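The two notes above can be combined into one short sketch. It assumes the script uses the transformers pipeline API, which the text here does not show, so treat it as an illustration rather than the contents of stt_whisper.py; the names needs_long_form and transcribe are hypothetical. The 3000-frame limit in the error corresponds to 30 seconds because Whisper's feature extractor produces 100 mel frames per second of audio.

```python
import warnings

# Whisper's feature extractor uses 16 kHz audio with a hop length of
# 160 samples, i.e. 100 mel frames per second, so 3000 frames = 30 seconds.
SAMPLE_RATE = 16_000
HOP_LENGTH = 160
MAX_MEL_FRAMES = 3000


def needs_long_form(duration_seconds):
    """Return True if audio of this length exceeds the 3000-frame window."""
    frames = duration_seconds * SAMPLE_RATE / HOP_LENGTH
    return frames > MAX_MEL_FRAMES


def transcribe(audio_path, model_name="openai/whisper-tiny"):
    """Transcribe an audio file (assumes the transformers pipeline API)."""
    # Imported here so needs_long_form works without transformers installed.
    from transformers import pipeline

    # Silence the FutureWarning mentioned above.
    warnings.filterwarnings("ignore", category=FutureWarning)
    asr = pipeline("automatic-speech-recognition", model=model_name)
    # return_timestamps=True enables long-form transcription.
    result = asr(audio_path, return_timestamps=True)
    return result["text"]


# Example (downloads the model on first use):
# print(transcribe("/home/yourusername/stt-whisper-project/the-raven-100-bpm-71717.mp3"))
```

Passing return_timestamps=True unconditionally is the simpler choice; needs_long_form is only there to make the 30-second boundary explicit.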
Update audio_path in the code to match your MP3 file’s location.
The Whisper model is provided through Hugging Face’s transformers library.
dataset2.mp3 is from Freesound.org (public domain).