During the successful and enjoyable workshop in Arezzo (May 2017), it became clear that automatic transcription of interviews, if done properly, could be useful for getting a quick overview of what was said. The participants in Arezzo were aware of the imperfections of automatic speech recognition (ASR) and knew that recognition quality drops when the audio quality is low, when there is background noise, or when the speech is in (heavy) dialect.
The explicit request in Arezzo was to keep the portal as simple as possible: make no or only a few demands on the audio input, give clear instructions and use as little technical jargon as possible.
After the request for making the portal was approved in autumn 2017, the team of Christoph Draxler at the LMU in München started to build the first version of the OH-portal. An upgraded beta version (1.0.0) of the portal was presented to participants of the 2018 workshop in München.
The idea behind the OH-portal is simple. You go to the website (https://www.phonetik.uni-muenchen.de/apps/oh-portal), select one or more sound files, upload them and download the automatically generated transcription. Currently, audio files must be in wav format, but in the near future the portal itself will convert a range of submitted audio formats into the correct wav format. Already, it does not matter at which sample rate the files were recorded or whether they are mono or stereo. For stereo files, the portal asks the user whether the two audio channels should be processed separately or together (i.e. added into one signal). If the user chooses separate processing, the channels are processed one after the other. So if you recorded an interview with two speakers, each on their own channel, you can separate the speakers more easily, determine turn-taking and get an even better recognition result.
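To illustrate the two stereo options, here is a minimal Python sketch. It is not the portal's own code: the function names are made up for this example, 16-bit samples are assumed, and the "together" option averages the two channels instead of literally adding them, to avoid clipping.

```python
import sys
import wave
from array import array


def read_stereo_samples(path):
    """Read a 16-bit stereo wav file and return its interleaved samples."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected a 16-bit stereo wav file")
        frames = wav.readframes(wav.getnframes())
    samples = array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # wav data is stored little-endian
    return samples


def split_channels(interleaved):
    """'Separately': return the left and right channel as two mono signals."""
    # Stereo samples are interleaved as L, R, L, R, ...
    return interleaved[0::2], interleaved[1::2]


def mix_to_mono(interleaved):
    """'Together': combine both channels into one mono signal (averaged)."""
    left, right = split_channels(interleaved)
    return array("h", ((l + r) // 2 for l, r in zip(left, right)))
```

With one speaker per channel, each result of split_channels could be sent to the recogniser on its own, which is what makes the turn-taking easier to recover.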
Within the portal, use the button to select the wav files to be recognised from your own computer. Then click on the button, after which a selection window opens (see below) where you can set the different options.
At the moment, the choices made in the "Verify Files" window apply to all selected files, so you cannot choose different languages or recognisers when you select more than one audio file.
Once the choices have been made, you can start the process via the button.
The audio-files are uploaded and then processed. As said, if a stereo file is included, you will be asked how you want to process the stereo file.
It is widely recognised that speech recognition hardly ever works flawlessly. Depending on the quality of the recordings, the way of speaking, the words and jargon used by the various speakers, their accents and the presence of background noise, speech recognition will be more or less successful. With good recordings and clear, coherent speech, an error rate of less than 10% is possible for the four languages in the current portal (En, Nl, It, and De).
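For readers who want to check such a figure on their own material: assuming the error rate meant here is the usual word error rate (WER), it is the number of substituted, deleted and inserted words divided by the number of words in a correct reference transcription. The sketch below is only an illustration with a made-up function name; real evaluations usually also normalise case and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("the reference transcription is empty")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "the interview was recorded in arezzo",
    "the interview is recorded arezzo",
)
print(f"WER: {wer:.1%}")
```

In this example one word is substituted and one is missing, so two errors on six reference words.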
But even with very good recognition, something can go wrong. The Manual Transcription button lets you correct the recognition results. However, editing the recognized text breaks the connection between each recognized word and the time at which it is spoken in the audio file. After the automatically obtained transcription has been corrected manually, you can restore this connection by choosing Word alignment. The ASR engine will redo the job, but this time it knows exactly what was said. The result is a corrected transcription in which the time at which each word was pronounced is known exactly. This makes it possible to generate subtitles automatically, or to make a karaoke version in which each word is highlighted as it is played.
During the first day of the 2018 workshop, the audience was told how automatic speech recognition works, which choices were made and which problems appeared while building the portal. The current OH-portal is a web service that "collects" the audio files and then, depending on the choices made, forwards them to the different speech recognizers (WebASR in Sheffield, LST-NL in Nijmegen, LST-En in Enschede, EML-D/It in Germany).
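To show how word timings lead to subtitles, here is a minimal Python sketch that groups aligned words into cues in the common SRT subtitle format. The input layout (word, start, end in seconds) and the function names are assumptions made for this example; they are not the portal's actual export format.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, max_words=7):
    """Group (word, start, end) tuples into numbered SRT cues."""
    cues = []
    for index in range(0, len(words), max_words):
        chunk = words[index:index + max_words]
        start = chunk[0][1]   # start time of the first word in the cue
        end = chunk[-1][2]    # end time of the last word in the cue
        text = " ".join(word for word, _, _ in chunk)
        number = index // max_words + 1
        cues.append(
            f"{number}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text}\n"
        )
    return "\n".join(cues)


aligned = [("hello", 0.0, 0.4), ("world", 0.5, 1.0), ("again", 1.2, 1.7)]
print(words_to_srt(aligned, max_words=2))
```

A karaoke view needs the same information, only per word instead of per cue.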
Each recognizer then returns its output in a particular output format. So, to get a uniform output result, the results must be re-written by the OH-portal to one of the selected standards. When additional languages are added in the near future, this rewriting process has to be done over and over again.
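The rewriting step can be pictured as one small converter per recognizer, all producing the same internal representation. The two input formats below are invented for illustration (they are not the real output of any of the recognizers mentioned), as are all the names in the code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Uniform representation: a word with start and end time in seconds."""
    text: str
    start: float
    end: float


def from_format_a(items):
    # Hypothetical format A: start time plus duration, both in seconds.
    return [
        Word(item["word"], item["start"], item["start"] + item["duration"])
        for item in items
    ]


def from_format_b(items):
    # Hypothetical format B: begin and end time in milliseconds.
    return [
        Word(item["text"], item["begin_ms"] / 1000, item["end_ms"] / 1000)
        for item in items
    ]


# One converter per recognizer output format.
CONVERTERS = {"format_a": from_format_a, "format_b": from_format_b}


def normalise(source, items):
    """Convert recognizer output to the uniform Word representation."""
    try:
        convert = CONVERTERS[source]
    except KeyError:
        raise ValueError(f"no converter for format {source!r}") from None
    return convert(items)
```

In such a design, adding a language or recognizer means writing one new converter and registering it, while the export formats only have to deal with the uniform representation.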
Commercial versus Open Source
There are many more recognizers available (and for more languages than the current four), but almost all of them are (semi-)commercial. It is very easy to connect the excellent Google recognizer, and in the beta versions of the OH-portal this was done. But there is a price to pay.
Paying with money is usually not a problem, because it is almost always just a few euros per hour of speech. But almost always the audio data is also stored on the disks of the commercial parties for extra training, testing or something else. That is often a problem, because in many interviews the content is sensitive and likely subject to the GDPR. Even in our situation with a (reliable) portal - where all user data are removed 24 hours after they have been processed - it may be a problem, because collection owners may expressly state that the data cannot leave the "building" without permission, or may not yet have put GDPR-compliant processing agreements in place.
As a safety measure, it was therefore decided to remove the commercial recognizers as an option in the OH-portal during the workshop, and open collections were used as far as possible for testing purposes.
At the first evaluation on Wednesday afternoon, however, participants asked whether it might be useful to restore this "commercial" option and to state explicitly that recognizers X, Y and Z are "commercial" and will probably keep your data on their disks. It is then up to the users to decide whether or not to use these recognizers. This is something we will consider for the next version(s).
On Thursday morning, following a short demonstration of how the portal could be used, participants were invited to upload a short sound fragment (their own sound file or one available via the workshop portal) and to recognize, edit, align and finally download the results.
In most cases this worked well, but the systems of the LMU were unable to handle 20 users in parallel, so error messages appeared and some participants had to wait a very long time for the results of a 5-minute fragment.
The biggest problems were solved overnight by Christoph's team, but scalability is certainly something to look at for the next version. Fortunately, most participants were very pleased with the simplicity of the portal. The only thing that turned out to be tricky was extracting the audio from video interviews and/or converting special formats (e.g. *.wma or *.mp3) into the prescribed *.wav format.
Technically this transformation is a piece of cake, but where users do not know how to do it and do not have the right software on their computer, this may be a barrier. The future option to do it in the portal was therefore greeted with enthusiasm.
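For those who do want to convert files themselves, one common route is the free command-line tool ffmpeg. The Python sketch below only wraps a single ffmpeg call; it assumes ffmpeg is installed, the function names are made up for this example, and 16-bit PCM is chosen as a widely supported wav encoding, not as a documented requirement of the portal.

```python
import shutil
import subprocess
from pathlib import Path


def build_ffmpeg_command(source, target=None):
    """Build the ffmpeg command that extracts the audio as a 16-bit PCM wav."""
    if target is None:
        target = str(Path(source).with_suffix(".wav"))
    if target == source:
        raise ValueError("source and target are the same file")
    return [
        "ffmpeg",
        "-i", source,            # input: audio or video file (mp3, wma, mp4, ...)
        "-vn",                   # drop any video stream, keep audio only
        "-acodec", "pcm_s16le",  # uncompressed 16-bit PCM, the usual wav encoding
        target,
    ]


def convert_to_wav(source, target=None):
    """Run ffmpeg; raises if ffmpeg is missing or the conversion fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on this computer")
    subprocess.run(build_ffmpeg_command(source, target), check=True)


print(" ".join(build_ffmpeg_command("interview.mp3")))
```

The same command works for video files, because the -vn option simply discards the picture.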
Most participants were more than satisfied with the recognition results and understood that automatic speech recognition of barely audible sound fragments is a difficult, if not impossible, task.
Participants asked whether additional output formats could be included so that they could import the results of the entire process directly into their own systems (Zwangsarbeit Archiv, ELAN), and whether XML files, marked up in, say, TEI (Text Encoding Initiative), could be exported for use in onward tools. Technically this is no problem, but we cannot support all formats of all OH-projects. The portal builders have indicated that in the short term they will look at potentially interesting export formats and add these to the current ones.
In general, the participants were satisfied with the opportunities presented by the OH-portal. Everyone could, after some help with converting the sound files, process their files, correct the automatic transcription manually, re-align it and download the final results. The only thing that really went wrong during the hands-on session was that the simultaneous use by 20+ participants put too high a load on the services, which caused the systems to fail. For the builders of the portal, however, it was a useful wake-up call.
In the coming months the scaling problem will be solved and several other recognizers (both commercial and non-commercial) will be added. During the CLARIN conference days in Pisa we will see which other CLARIN participants have a recognizer available and would like to participate in this OH-portal.
Finally, given that participants in Munich were testing other stand-alone CLARIN and non-CLARIN speech processing and analysis tools, extending the idea of a TChain to an "AChain" (annotation and analysis) might be useful, thereby offering a more seamless journey from audio recording to an annotated (knowledge-rich) interview.
At the moment Kaldi is the most popular platform for Deep Neural Network (DNN) based speech recognition. The Dutch and English recognizers are already working with Kaldi, and in both Germany and Italy scholars are working on a Kaldi-based recognizer for their own language. Because it would be a shame to reinvent the wheel several times, it was agreed to investigate to what extent we can join forces and work together on Kaldi-based recognizers.
Arjan van Hessen