Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. Its developer, OpenAI, shows that the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English. The nine models are open source and can be downloaded.

To my knowledge, Whisper is currently among the best ASR engines available, and it can serve as a foundation for building useful applications and for further research on robust speech processing. It was launched in September 2022 and has received a lot of positive response.


In many cases where AV recordings are made, privacy can be an important issue. Especially when the recordings are "sensitive", the interviewer (or owner of the recordings) must take care to handle them carefully. Whisper can run on a fast, powerful server, on your own small laptop, and on any device in between; the recognition result will be the same. The only real difference is the processing speed: the better your computer (especially when it has a graphics card), the faster the recognition.
So, certainly for people who have a fast computer and who occasionally work with sensitive data, we recommend installing Whisper on your own system as well, in order to avoid the risk of a data breach.

Is ASR ready?

No! At this moment (June 2023) there are still a few drawbacks to Whisper, such as the lack of diarization (knowing which speaker is speaking) and the sometimes overly polished result ("I um I, I thought I'd do that for a moment" is usually recognised by Whisper as "I thought I'd do that for a moment").
This last effect is probably caused by the fact that Whisper uses a ChatGPT-like language model to "translate" the recognition into a well-formed sentence. This is excellent for transcribing most speech, but may not always be desirable for research on speech and/or dialogue where hesitations, pauses, repetitions and other disfluencies are the topic of study.
Several researchers are working hard on diarization and on a more literal transcription of speech. When more is known about these capabilities and how to use them yourself, we will let you know.

Set-up Whisper

Whisper was released as a Python package. After installing Python (version 3.9 or 3.10) you need to install PyTorch (1.10.1) and FFmpeg.
Once done, you can download and install (or update to) the latest release of Whisper with the following command:
pip install -U openai-whisper
For more information about this, see here.
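Once everything is in place, Whisper can be used directly from the command line. A minimal sketch is shown below; the file name interview.mp3 and the language choice are placeholders for your own recording, while the --model, --language and --output_format flags come from the Whisper command-line tool itself:

```shell
# Install or upgrade to the latest release of Whisper
pip install -U openai-whisper

# Transcribe a recording with the 'small' multilingual model;
# this writes interview.txt next to the audio file.
# 'interview.mp3' is a placeholder for your own recording.
whisper interview.mp3 --model small --language Dutch --output_format txt
```

Note that the first run downloads the chosen model to ~/.cache/whisper, so an internet connection is needed once per model; after that, recognition runs entirely on your own machine.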

Available models and languages

There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and relative speed.

Size     Parameters   English-only model   Multilingual model   Required VRAM   Relative speed
tiny     39 M         tiny.en              tiny                 ~1 GB           ~32x
base     74 M         base.en              base                 ~1 GB           ~16x
small    244 M        small.en             small                ~2 GB           ~6x
medium   769 M        medium.en            medium               ~5 GB           ~2x
large    1550 M       N/A                  large                ~10 GB          1x

The .en models for English-only applications tend to perform better, especially the tiny.en and base.en models. OpenAI observed that the difference becomes less significant for the small.en and medium.en models.
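As a rough illustration of this tradeoff, the table above can be turned into a small helper that picks the largest model fitting in the available VRAM. Note that choose_model is a hypothetical helper written for this page, not part of the whisper package:

```python
# Model sizes from the table above (figures as published by OpenAI).
# Ordered from smallest to largest.
MODELS = [
    # (name, parameters, required VRAM in GB, relative speed)
    ("tiny",   "39 M",    1, 32),
    ("base",   "74 M",    1, 16),
    ("small",  "244 M",   2,  6),
    ("medium", "769 M",   5,  2),
    ("large",  "1550 M", 10,  1),
]

def choose_model(vram_gb, english_only=False):
    """Pick the largest Whisper model that fits in the given VRAM.

    English-only '.en' variants exist for every size except 'large'.
    This is an illustrative helper, not part of the whisper package.
    """
    fitting = [m for m in MODELS if m[2] <= vram_gb]
    if not fitting:
        raise ValueError(f"No Whisper model fits in {vram_gb} GB of VRAM")
    name = fitting[-1][0]  # list is ordered small -> large
    if english_only and name != "large":
        name += ".en"
    return name

print(choose_model(4))                     # small
print(choose_model(6, english_only=True))  # medium.en
```

On a laptop without a graphics card the same logic applies to ordinary RAM, only the recognition will be (much) slower.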


Whisper's performance varies widely depending on the language and the model. The figure below shows a WER (Word Error Rate) breakdown by language on the Fleurs dataset using the large-v2 model (the smaller the number, the better the performance). Additional WER scores for the other models and datasets can be found on the Whisper website. For more information, see here.


Important: Whisper's code and model weights are released under the MIT License. See LICENSE for further details.


  • Last website update: Wednesday 19 June 2024, 15:00:14.
  • Copyright ©2023 Arjan van Hessen