diff --git a/domestic_robotics/speech_recognition.ipynb b/domestic_robotics/speech_recognition.ipynb
new file mode 100644
index 0000000..c3c32e3
--- /dev/null
+++ b/domestic_robotics/speech_recognition.ipynb
@@ -0,0 +1,325 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from IPython import display"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Speech recognition tutorial\n",
+ "\n",
+ "Our speech module is much smaller than other components such as perception. However, it plays an important role in enabling human-machine interaction.\n",
+ "\n",
+ "This tutorial will focus on the implementation of speech recognition that is currently being used by the @Home team, which relies on [SpeechRecognition](https://github.com/Uberi/speech_recognition), a Python library that supports several online and offline speech recognition engines and APIs. This tutorial will also cover __Kaldi__ and how it was integrated into the speech recognition pipeline.\n",
+ "\n",
+ "This tutorial depends on the following:\n",
+ "\n",
+ "- [SpeechRecognition](https://github.com/Uberi/speech_recognition)\n",
+ "  - CMU's [PocketSphinx](https://pypi.org/project/pocketsphinx/) for offline speech recognition (optional, in case you do not want to use Google Speech)\n",
+ "- [Kaldi](https://github.com/kaldi-asr/kaldi) - speech recognition toolkit\n",
+ "- [py-kaldi-asr](https://github.com/gooofy/py-kaldi-asr) - a Python wrapper for Kaldi\n",
+ "\n",
+ "The basic idea of speech recognition is to record speech in the form of a sound wave and convert it into a digital representation of the wave. Once we have this audio data, we can use it as input to a model that transcribes the audio into text. I will not delve into the types of models used, since that is not the focus of this tutorial, but keep in mind that there are several existing APIs that use state-of-the-art methods (Google for online and Kaldi for offline recognition in this case).\n",
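+ "\n",
+ "To make this concrete, the minimal sketch below uses Python's built-in ``wave`` module to inspect a recording; ``example.wav`` is only a placeholder for a WAV file you record yourself:\n",
+ "\n",
+ "```python\n",
+ "import wave\n",
+ "\n",
+ "# 'example.wav' is a placeholder path, not a file shipped with this tutorial\n",
+ "wav_file = wave.open('example.wav', 'rb')\n",
+ "sample_rate = wav_file.getframerate()       # samples per second\n",
+ "n_samples = wav_file.getnframes()           # total number of samples\n",
+ "raw_bytes = wav_file.readframes(n_samples)  # the digitized waveform itself\n",
+ "wav_file.close()\n",
+ "\n",
+ "print('Sample rate: {} Hz'.format(sample_rate))\n",
+ "print('Duration: {:.2f} seconds'.format(n_samples / float(sample_rate)))\n",
+ "print('First bytes of the waveform: {!r}'.format(raw_bytes[:10]))\n",
+ "```\n",
+ "\n",
+ "This byte string (or an equivalent array of samples) is exactly the audio data that the recognizers covered below take as input.\n",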
+ "\n",
+ "## Contents\n",
+ "\n",
+ "- [Installation](#Installation)\n",
+ "  - [Requirements](#Requirements)\n",
+ "  - [SpeechRecognition](#SpeechRecognition)\n",
+ "  - [Kaldi](#Kaldi)\n",
+ "  - [py-kaldi-asr](#py-kaldi-asr)\n",
+ "  - [Kaldi pre-trained models](#Kaldi-pre-trained-models)\n",
+ "- [Python SpeechRecognition](#Python-SpeechRecognition)\n",
+ "  - [Recognizer Class](#Recognizer-Class)\n",
+ "  - [Working with a microphone](#Working-with-a-microphone)\n",
+ "  - [Working Example](#Working-Example)\n",
+ "- [mdr_speech_recognition](#mdr_speech_recognition)\n",
+ "\n",
+ "## Installation\n",
+ "\n",
+ "### Requirements\n",
+ "\n",
+ "* **Python** 2.7 or 3.3+ (required)\n",
+ "* **PyAudio** 0.2.11+ (required only if you need to use microphone input, ``Microphone``)\n",
+ "* **PocketSphinx** (required only if you need to use the Sphinx recognizer, ``recognizer_instance.recognize_sphinx``)\n",
+ "* **Google API Client Library for Python** (required only if you need to use the Google Cloud Speech API, ``recognizer_instance.recognize_google_cloud``)\n",
+ "* **FLAC encoder** (required only if the system is not x86-based Windows/Linux/OS X)\n",
+ "* **wget** for downloading additional non-Kaldi packages\n",
+ "* **Standard UNIX utilities**: bash, perl, awk, grep, and make\n",
+ "* A linear-algebra package such as **ATLAS**, **CLAPACK**, or **OpenBLAS**\n",
+ "* **Cython**\n",
+ "\n",
+ "\n",
+ "### SpeechRecognition\n",
+ "\n",
+ "Installing the SpeechRecognition API is as easy as typing:\n",
+ "\n",
+ "``pip install SpeechRecognition``\n",
+ "\n",
+ "However, this API currently does not support Kaldi. For the purpose of this tutorial, install it from here instead:\n",
+ "\n",
+ "``git clone -b feature/py-kaldi-asr_support https://github.com/robertocaiwu/speech_recognition.git``\n",
+ "\n",
+ "``cd speech_recognition && python setup.py install --user``\n",
+ "\n",
+ "\n",
+ "### Kaldi\n",
+ "\n",
+ "1. ``git clone https://github.com/kaldi-asr/kaldi.git``\n",
+ "2. Navigate into the ``kaldi/tools`` folder and run ``./extras/check_dependencies.sh``\n",
+ "to check for additional dependencies that are needed for installation.\n",
+ "3. After installing the necessary dependencies, compile by running ``make -j <num_cores>``\n",
+ "4. Navigate into the ``kaldi/src`` folder and run ``./configure --shared``\n",
+ "5. ``make depend -j <num_cores>``\n",
+ "6. ``make check -j <num_cores>``\n",
+ "7. ``make -j <num_cores>``\n",
+ "\n",
+ "### py-kaldi-asr\n",
+ "\n",
+ "``pip install py-kaldi-asr``\n",
+ "\n",
+ "### Kaldi pre-trained models\n",
+ "\n",
+ "You can find pre-trained models in various languages on the official [Kaldi documentation](https://kaldi-asr.org/models.html) page and in [zamia-speech](https://goofy.zamia.org/zamia-speech/asr-models/).\n",
+ "\n",
+ "The specific one used in this tutorial is a model trained on 1200 hours of English audio, which can be downloaded from [here](https://goofy.zamia.org/zamia-speech/asr-models/kaldi-generic-en-tdnn_f-r20190227.tar.xz). This model has decent background noise resistance and can also be used on phone recordings.\n",
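+ "\n",
+ "Once the archive is extracted, a quick way to check that Kaldi, py-kaldi-asr, and the model work together is the sketch below. The paths are placeholders (point ``MODELDIR`` at the extracted model directory and ``WAVFILE`` at a 16 kHz mono WAV recording), and depending on your py-kaldi-asr version the model constructor may also expect an explicit model name:\n",
+ "\n",
+ "```python\n",
+ "from kaldiasr.nnet3 import KaldiNNet3OnlineModel, KaldiNNet3OnlineDecoder\n",
+ "\n",
+ "MODELDIR = 'models/kaldi-generic-en-tdnn_f-r20190227'  # placeholder: extracted model directory\n",
+ "WAVFILE = 'example.wav'                                # placeholder: 16 kHz mono WAV recording\n",
+ "\n",
+ "model = KaldiNNet3OnlineModel(MODELDIR)\n",
+ "decoder = KaldiNNet3OnlineDecoder(model)\n",
+ "\n",
+ "if decoder.decode_wav_file(WAVFILE):\n",
+ "    text, likelihood = decoder.get_decoded_string()\n",
+ "    print('Transcription: {}'.format(text))\n",
+ "    print('Likelihood: {}'.format(likelihood))\n",
+ "else:\n",
+ "    print('Decoding failed.')\n",
+ "```\n",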
+ "\n",
+ "\n",
+ "## Python SpeechRecognition\n",
+ "\n",
+ "### Recognizer Class\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import speech_recognition as sr\n",
+ "\n",
+ "rec = sr.Recognizer()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Calling ``sr.Recognizer()`` creates a new ``Recognizer`` instance, which represents a collection of speech recognition functionality. This class supports several APIs for recognizing speech:\n",
+ "\n",
+ "\n",
+ "* ``recognize_sphinx()`` [CMU Sphinx](http://cmusphinx.sourceforge.net/wiki/) (works offline)\n",
+ "* ``recognize_google()`` Google Speech Recognition\n",
+ "* ``recognize_google_cloud()`` [Google Cloud Speech API](https://cloud.google.com/speech/)\n",
+ "* ``recognize_wit()`` [Wit.ai](https://wit.ai)\n",
+ "* ``recognize_bing()`` [Microsoft Bing Voice Recognition](https://www.microsoft.com/cognitive-services/en-us/speech-api)\n",
+ "* ``recognize_houndify()`` [Houndify API](https://houndify.com/)\n",
+ "* ``recognize_ibm()`` [IBM Speech to Text](http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/speech-to-text.html)\n",
+ "\n",
+ "Out of the box, we can only use Google Speech Recognition. For the rest (with the exception of PocketSphinx, which works offline), we need an API key or a username/password combination to use the online service.\n",
+ "\n",
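+ "\n",
+ "As a quick preview of how the ``Recognizer`` is used (``example.wav`` is a placeholder file; a complete working example follows later in this tutorial), transcribing a short WAV recording looks roughly like this:\n",
+ "\n",
+ "```python\n",
+ "import speech_recognition as sr\n",
+ "\n",
+ "rec = sr.Recognizer()\n",
+ "\n",
+ "# read the placeholder recording into an AudioData instance\n",
+ "with sr.AudioFile('example.wav') as source:\n",
+ "    audio = rec.record(source)\n",
+ "\n",
+ "try:\n",
+ "    # recognize_google() works without credentials thanks to a default API key\n",
+ "    print('Google thinks you said: ' + rec.recognize_google(audio))\n",
+ "except sr.UnknownValueError:\n",
+ "    print('Google Speech Recognition could not understand the audio')\n",
+ "except sr.RequestError as e:\n",
+ "    print('Could not request results from Google Speech Recognition; {}'.format(e))\n",
+ "```\n",
+ "\n",
+ "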