EMLAR24
The information on the official UU website about the tutorial Automatic Speech Recognition (ASR) is unfortunately outdated and does not match the content of the tutorial.
Below is a short update.

Arjan


Automatic Speech Recognition (ASR)

ASR nowadays uses machine learning, a branch of artificial intelligence (AI), to convert human speech into readable text. The field has grown rapidly over the past decade, and ASR systems now appear in applications for transcription, real-time captioning and more.

Since the arrival of Whisper (OpenAI, September 2022), speech recognition for clear sound recordings performs at about human level: the error rate of the transcriptions is between 3% and 8%.
Of course, this depends on the manner of speaking, on whether the speech contains a strong accent or dialect, on the presence of background noise, and on the use of slang. One of the reasons Whisper performs so well is probably that a dedicated language model is part of the recogniser.
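
The error rate quoted above is usually reported as word error rate (WER): the number of substituted, deleted and inserted words, divided by the number of words in a reference transcript. The following is a minimal sketch in Python, assuming simple lower-casing and whitespace tokenisation (real evaluations normally also normalise punctuation and numbers). The function name is illustrative and not part of any ASR package.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the quick brown fox jumps over the lazy dog today"
    hyp = "the quick brown fox jumped over the lazy dog today"
    print(f"WER: {word_error_rate(ref, hyp):.0%}")
```

In this example one wrong word in a ten-word sentence gives a WER of 10%, so an error rate of 3% to 8% corresponds to roughly one wrong word in every 12 to 33 words.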

Tutorial

This partly non-technical tutorial is aimed at students and researchers who work with (large quantities of) spoken narratives and want to use Automatic Speech Recognition for transcript generation, phonetic research and/or other research in which the relation between what was said and when it was said is relevant.

We will discuss the following topics:

  1. ASR: a status update of the current technology.
  2. Making your own audio content suitable for ASR.
  3. Recognising your own, sometimes sensitive, AV recordings on your own computer, on the faculty computer or in the cloud.
  4. The ASR result: a fully timed text (or a table of words, times and confidence scores). What to do next?
  5. Correcting the ASR results: into what form?
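
To give an idea of what the result in topic 4 can look like, here is a small sketch that turns a list of recognised words into a tab-separated table and marks low-confidence words for manual checking. The sample words and the threshold of 0.5 are invented for illustration. The keys `word`, `start`, `end` and `probability` mirror what the openai-whisper package returns for word-level output, but treat them as an assumption about your own tool.

```python
def format_word_table(words, threshold=0.5):
    """Return tab-separated lines: start, end, word, confidence, check flag."""
    lines = ["start\tend\tword\tconfidence\tcheck"]
    for w in words:
        # Mark words the recogniser was unsure about, so a human can review them.
        flag = "?" if w["probability"] < threshold else ""
        lines.append(
            f"{w['start']:.2f}\t{w['end']:.2f}\t{w['word'].strip()}"
            f"\t{w['probability']:.2f}\t{flag}"
        )
    return lines


if __name__ == "__main__":
    # Invented sample data, not real recogniser output.
    sample = [
        {"word": " hello", "start": 0.00, "end": 0.42, "probability": 0.98},
        {"word": " wurld", "start": 0.42, "end": 0.90, "probability": 0.31},
    ]
    print("\n".join(format_word_table(sample)))
```

Such a table can be opened in a spreadsheet, where the flagged rows are a natural starting point for correction.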

Whisper ASR-engine

In this tutorial we will concentrate on the Whisper ASR engine, which can recognise more than 90 different languages!
Whisper is distributed as a set of Python scripts that turn your AV files into text. Because the software is open source, it has also been ported to C++. Moreover, developers around the world build on Whisper to add features such as faster recognition, more precise estimates of the start and end times of the recognised words, speaker diarization and more.
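
As a rough illustration of the Python route, the sketch below transcribes one file with the openai-whisper package and writes the segments as SubRip (SRT) subtitles. It assumes that the package and ffmpeg are installed. The helper names (`to_timestamp`, `segments_to_srt`, `transcribe_to_srt`) and the default model size `"base"` are choices made for this example, not part of Whisper itself.

```python
def to_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Turn Whisper-style segments (dicts with start, end, text) into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{to_timestamp(seg['start'])} --> {to_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)


def transcribe_to_srt(audio_path: str, model_name: str = "base") -> str:
    """Transcribe one file with openai-whisper and return the result as SRT text."""
    import whisper  # imported here so the helpers above work without the package

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)  # language is detected automatically
    return segments_to_srt(result["segments"])
```

Calling `transcribe_to_srt("fragment.wav")` on your own short recording returns subtitle text that can be saved with an `.srt` extension. Larger models are generally more accurate but slower, especially without a GPU.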

For Windows, macOS and Linux computers there is also dedicated software to do the recognition.

  • aTrain: a simple tool for running the recognition on Windows machines with or without a GPU. For more information, see here.
  • SubtitleEdit: an open-source transcription tool for Windows that can create subtitles via Whisper.
  • MacWhisper: an application for your Mac that uses the C++ software package.

DIY

Participants are invited to process their own AV recordings. However, to avoid long waiting times, everyone is kindly requested to use a short fragment of at most 5 minutes during this tutorial. Once you know how it works, you can process your longer files yourself later. So, bring a 5-minute AV recording with you.
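
If your recording is longer than 5 minutes, one possible way to cut off the first part is sketched below with Python's standard `wave` module. This only works for uncompressed WAV files; for other formats (mp3, mp4 and so on) you would need a tool such as ffmpeg or an audio editor instead. The function name and file names are made up for the example.

```python
import wave


def cut_wav(src_path, dst_path, max_seconds=300):
    """Copy at most the first max_seconds of a WAV file; return the kept duration in seconds."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        # Never ask for more frames than the file contains.
        n_frames = min(src.getnframes(), int(max_seconds * src.getframerate()))
        frames = src.readframes(n_frames)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)    # same channels, sample width and sample rate
        dst.writeframes(frames)  # the header is corrected to the real frame count
    return n_frames / params.framerate
```

For example, `cut_wav("interview.wav", "fragment.wav")` would keep the first 300 seconds and leave the original file untouched.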

Language of the recording

Feel free to make the recording in any language you speak (reasonably well). Whisper should be able to detect the language used.

Presentation

The PowerPoint presentation can be downloaded here.

Questions

Of course you may ask anything during (or after) the tutorial, but if you have an urgent question beforehand and/or want me to pay attention to particular ASR-related topics, please mail me.

 

  • Website last updated: Sunday 5 January 2025, 09:27:45.
  • Copyright © 2023 Arjan van Hessen