WEBVTT 00:00:01.283 --> 00:00:06.013 Okay. I hope everyone can see my screen. 00:00:08.845 --> 00:00:11.734 Okay, well, good morning. I'm Arjan van Hessen. 00:00:13.203 --> 00:00:16.874 I will tell a little bit more about myself in the next slide. 00:00:17.060 --> 00:00:22.915 But, well, just to inform you, I'm working at the University of Twente, the University of Utrecht, and also at Telecats. 00:00:23.020 --> 00:00:26.612 It's a software company in Enschede, and I'm living in Utrecht. 00:00:27.661 --> 00:00:32.833 And my work for the University of Utrecht is, well, bound up with CLARIAH. 00:00:34.281 --> 00:00:41.154 And, well, so it's not scientific, but it's more, well, organizational stuff. 00:00:42.722 --> 00:00:48.855 In Twente, I'm working on speech technology and, more or less, now also on the understanding of speech. 00:00:49.000 --> 00:00:52.050 But that will be part of this meeting. 00:00:53.000 --> 00:01:06.257 Okay. Yeah, here is an interview from four years ago, I believe, for people who are blind or 00:01:06.400 --> 00:01:14.576 have difficulty seeing well. So they are helped by modern speech 00:01:14.640 --> 00:01:22.316 and language technology. So there was an interview with me for their own radio or, well, podcast, 00:01:22.381 --> 00:01:23.124 or whatever you call it. 00:01:24.521 --> 00:01:28.133 And I got the recording and I thought, 00:01:28.180 --> 00:01:31.132 well, let's see how speech recognition is doing. 00:02:32.209 --> 00:02:37.557 Well, you can see, and I hope that you could also see, the English translation, which was generated 00:02:37.660 --> 00:02:39.889 automatically based on this recognition result. 00:02:40.960 --> 00:02:43.710 And so it's a nearly 100% correct recognition. 00:02:44.600 --> 00:02:51.035 There were two smaller errors. Telecats, I mean, the company is Telecats, so it's 00:02:51.340 --> 00:03:01.377 one word, but it wasn't in the vocabulary at that time.
So it is recognized as Telen 00:03:01.540 --> 00:03:09.356 & Kets. Okay. And there is the muZIEum. The muZIEum is a special museum in Nijmegen, dedicated 00:03:09.440 --> 00:03:15.955 to people who are blind or can't see very well. But for the rest, it is perfect. And 00:03:16.180 --> 00:03:19.972 if you say, well, what is missing in this transcription? 00:03:20.540 --> 00:03:24.673 It is the periods and the commas and possibly a question mark. 00:03:25.280 --> 00:03:31.893 So apart from that, it was already a perfect recognition result. 00:03:33.700 --> 00:03:36.752 This was done with the Kaldi recognizer, 00:03:36.860 --> 00:03:40.232 available for all researchers at the University of Nijmegen. 00:03:40.420 --> 00:03:44.994 So you can go there, register yourself, upload the file, 00:03:45.080 --> 00:03:45.925 and you get the results. 00:03:46.480 --> 00:03:48.410 However, there's a new engine. 00:03:48.660 --> 00:03:51.491 We will talk about it today, 00:03:52.221 --> 00:03:55.652 and it is outperforming the Kaldi recognizer. 00:03:56.381 --> 00:03:56.602 Okay. 00:04:03.248 --> 00:04:03.489 Okay. 00:04:04.660 --> 00:04:06.227 So if we are looking at the, 00:04:07.540 --> 00:04:09.430 well, the dissemination of human knowledge 00:04:09.500 --> 00:04:11.911 over the last 3000 years, I mean, so well, 00:04:12.161 --> 00:04:14.030 what has that to do with speech recognition? 00:04:14.080 --> 00:04:19.735 Well, a lot. So eventually, we started, of course, with talking 00:04:19.880 --> 00:04:24.234 to each other. Later, it became text. So it was possible from generation 00:04:24.300 --> 00:04:27.653 to generation to pass your knowledge to the next 00:04:27.760 --> 00:04:31.934 generations. But now we are in the speech mode and text mode 00:04:32.020 --> 00:04:35.773 again. So you can say, well, we started with the Druids, for 00:04:35.840 --> 00:04:40.374 example, who had a very oral history.
And later, in the 00:04:40.400 --> 00:04:44.874 medieval time, or well, before that, to be honest, it was the 00:04:45.000 --> 00:04:50.795 written context: information was written down in books. And in 00:04:51.060 --> 00:04:55.774 the 14th century, it became printed. So it was in, well, 00:04:55.860 --> 00:05:03.956 textual form. And in the late, no, the early 20th 00:05:04.240 --> 00:05:12.516 century, recording devices became available. And the 00:05:12.640 --> 00:05:18.255 first, well, oral history recordings, you may say, you can see 00:05:18.300 --> 00:05:22.832 here; it is in America in, I believe, 1916 or something like 00:05:23.640 --> 00:05:31.115 that. And so you can say: from oral, to written versions, to, 00:05:31.420 --> 00:05:36.793 nowadays, the podcasts, the videos, and all that stuff. 00:05:37.981 --> 00:05:41.912 And we are back in an oral environment, you may say. 00:05:43.101 --> 00:05:45.752 And for example, you can see here the Netherlands Institute 00:05:45.620 --> 00:05:47.086 for Sound and Vision in Hilversum. 00:05:48.381 --> 00:05:49.407 It has a museum. 00:05:49.640 --> 00:05:50.847 It has working offices. 00:05:51.581 --> 00:05:53.829 And underground, so 30 meters deep, 00:05:55.181 --> 00:05:57.851 is the archive, where they have, more or less, 00:05:58.060 --> 00:06:00.128 at this time, 800,000 hours of audiovisual material. 00:06:01.700 --> 00:06:04.612 However, they don't know what, I mean, for part of it, 00:06:04.660 --> 00:06:08.233 they know what it is about, but for huge parts, 00:06:08.340 --> 00:06:10.827 they know it's a broadcast from 1937, 00:06:15.165 --> 00:06:18.974 but what it is about, what is said, is unknown. 00:06:19.902 --> 00:06:23.373 At the same time, we see at the universities 00:06:23.620 --> 00:06:30.414 that there is an increasing use of videos, especially since COVID, but also before. 00:06:31.380 --> 00:06:38.954 We have more and more teachers who are giving a video presentation.
00:06:40.320 --> 00:06:45.334 If it is a video for educational stuff, it has to be stored in the archives for seven 00:06:45.580 --> 00:06:49.773 years, but we don't know what it is about. 00:06:50.000 --> 00:06:54.214 I mean, it is an interview, or it's a lecture from Professor 00:06:54.320 --> 00:06:56.811 Janssen, and he is talking about biology, 00:06:56.940 --> 00:06:58.145 but that's all we know. 00:07:00.420 --> 00:07:05.094 So we are missing, we do have some metadata, 00:07:05.280 --> 00:07:06.908 but it is partly missing. 00:07:07.500 --> 00:07:10.830 And the transcriptions normally are not available. 00:07:12.561 --> 00:07:14.490 And so that means that if we want 00:07:14.680 --> 00:07:17.410 to know what the lectures are about, 00:07:18.080 --> 00:07:24.534 we need to listen and also to look at it, and that can be nice, but I mean, with 150,000 hours, 00:07:25.820 --> 00:07:37.037 it's not doable to give a good overview of what is available. Okay, it's not news that we are 00:07:37.180 --> 00:07:41.773 now, well, you can say, in the decade, or the era, of artificial intelligence. 00:07:42.220 --> 00:07:47.434 It is quite popular in the newspapers, and we see each half year there are new developments, 00:07:47.840 --> 00:07:54.856 and the latest was ChatGPT, and we will talk about that later. But we may say that artificial 00:07:54.940 --> 00:08:01.255 intelligence is leaving the laboratories and becoming part of our, well, history. And 00:08:02.784 --> 00:08:08.015 that means that we have a question: what is artificial intelligence? Well, you may say 00:08:08.140 --> 00:08:14.876 something magic that gives you the opportunity to use the software for all kinds of smart things. 00:08:15.561 --> 00:08:26.278 But, well, if you look at the past, you see that in the 50s, when it started, the term was coined, and it was more or less rule-based. 00:08:26.181 --> 00:08:30.955 Look at chess players, for example.
And that was considered artificial intelligence at that time. 00:08:32.023 --> 00:08:37.274 And in the 80s, it was machine learning. It became popular in the laboratories. 00:08:38.581 --> 00:08:50.058 And nowadays we have machine learning with deep neural networks, and the deep neural networks are a kind of copy of the way we are, well, thinking. 00:08:50.782 --> 00:08:55.635 So you may say it is coming closer to the way people think. 00:08:56.381 --> 00:09:05.917 And if you can, if you're successful in making software that is copying at least part of our behavior, or part of our, yeah, 00:09:06.882 --> 00:09:16.455 you may call it artificial intelligence. But you see it is also a changing vision from humanity about artificial intelligence. 00:09:18.302 --> 00:09:20.750 And if you say, well, what is AI? 00:09:21.520 --> 00:09:25.773 Of course, we have the famous games. 00:09:26.100 --> 00:09:29.731 It was first with chess in the 90s, I believe, Deep Blue. 00:09:30.780 --> 00:09:35.413 And in 2016, so seven years ago, 00:09:36.360 --> 00:09:42.013 it was Google who started with the game Go. 00:09:43.140 --> 00:09:44.827 And they developed DeepMind. 00:09:46.160 --> 00:09:51.374 And to train DeepMind, they used all the available games 00:09:51.981 --> 00:09:54.331 that were recorded, and things like that. 00:09:54.420 --> 00:09:56.630 And they tried to train the computer 00:09:57.000 --> 00:09:58.688 with the existing games. 00:09:59.460 --> 00:10:01.789 OK, it became very good. 00:10:02.620 --> 00:10:05.191 However, then there was a clever one at Google who said, 00:10:05.240 --> 00:10:08.531 well, if we have two versions of DeepMind, 00:10:09.160 --> 00:10:13.173 we can make them play against each other. 00:10:13.480 --> 00:10:14.987 And then one of them will win. 00:10:15.900 --> 00:10:22.495 And that game is considered a good starting point for the next training, and so on and so on.
00:10:23.121 --> 00:10:30.795 So they started, for three weeks, I believe, and it was continuous playing, updating the results, and then playing again, etc. 00:10:31.940 --> 00:10:37.533 And in the end, I mean, DeepMind is so good that it is absolutely the world-class winner. 00:10:39.181 --> 00:10:46.594 However, I'm not talking about artificial intelligence as such, but more about the language-dependent 00:10:47.920 --> 00:10:48.563 artificial intelligence. 00:10:51.803 --> 00:10:59.333 In 2011, I believe, it was Watson, or IBM with Watson, who started to join the game. 00:11:02.480 --> 00:11:08.235 It's a kind of quiz, the inverse of the European version: instead of getting a question and 00:11:08.300 --> 00:11:13.113 having to give the answer, the answer is given and you have to invent the question. 00:11:13.960 --> 00:11:23.255 But that's it. And they trained the computer, and then there was an official presentation. 00:11:24.020 --> 00:11:30.635 Ken was the world champion of that year. Brad was the guy who won the most money in 00:11:30.680 --> 00:11:37.995 the year before, and they both played against Watson. And in the end, Watson absolutely overwhelmed 00:11:38.160 --> 00:11:41.110 the other two. I mean, I believe that Ken had $12,000, 00:11:42.280 --> 00:11:43.484 Brad had $14,000, and Watson had 40 or $50,000. 00:11:47.722 --> 00:11:50.171 So it was absolutely an easy game. 00:11:51.603 --> 00:11:53.510 How was the system developed? 00:11:54.520 --> 00:11:56.489 They used, of course, all the American, 00:11:57.120 --> 00:12:00.292 it was only in American English, text available. 00:12:00.420 --> 00:12:03.571 So the Wikipedia, books, newspaper articles, 00:12:04.100 --> 00:12:06.831 internet stuff, et cetera, et cetera; 00:12:07.320 --> 00:12:09.170 that was given to the computer for training. 00:12:09.220 --> 00:12:13.852 The computer learned to argue about the results. 00:12:14.780 --> 00:12:17.949 And then, in the end, it beat them, and that was the game.
00:12:19.540 --> 00:12:19.641 OK. 00:12:20.962 --> 00:12:24.013 And then in 2017, so five years, six years 00:12:24.220 --> 00:12:29.954 after the previous one, we got the transformer models. 00:12:30.720 --> 00:12:33.691 And if you ask me exactly what a transformer model is, 00:12:34.340 --> 00:12:34.904 I'm not 100% sure. 00:12:35.780 --> 00:12:39.030 I'm studying it at the moment, but it's also quite new for me. 00:12:40.720 --> 00:12:43.552 But here you have some nice views. 00:12:43.620 --> 00:12:45.428 You will get the presentation afterward. 00:12:46.201 --> 00:12:51.034 And it is a neural network, of course, that learns context, 00:12:51.200 --> 00:12:53.310 and thus meaning, by tracking relationships 00:12:53.460 --> 00:12:55.767 in sequential data, like the words in a sentence. 00:12:57.820 --> 00:13:00.231 Transformer models apply an evolving set 00:13:00.320 --> 00:13:03.212 of mathematical techniques, called attention or self- 00:13:03.300 --> 00:13:08.454 attention, to detect subtle ways in which even distant data elements 00:13:08.520 --> 00:13:10.569 in a series influence and depend on each other. 00:13:11.140 --> 00:13:15.413 OK, well, that's a more or less nice definition 00:13:15.620 --> 00:13:16.746 of the transformer model. 00:13:17.660 --> 00:13:21.589 And it started in 2017 with a paper from Google. 00:13:24.601 --> 00:13:27.090 And it was absolutely new. 00:13:28.241 --> 00:13:32.213 And then Stanford researchers 00:13:32.580 --> 00:13:35.933 called them foundation models instead of transformer models. 00:13:35.960 --> 00:13:39.032 So, well, it's a different terminology 00:13:39.120 --> 00:13:40.165 for the same technique. 00:13:41.540 --> 00:13:46.814 But it turns out that these transformer models are very, 00:13:46.960 --> 00:13:52.192 very strong in the current, well, AI revolution. 00:13:53.940 --> 00:13:58.934 OK, transformers can translate text. 00:13:59.060 --> 00:13:59.806 So what can you do?
00:13:59.940 --> 00:14:01.850 You can give it a Dutch text and say, 00:14:01.900 --> 00:14:03.126 well, give me the English version. 00:14:04.161 --> 00:14:07.533 And speech, also possible. 00:14:07.640 --> 00:14:10.050 So for people who are hearing impaired, 00:14:10.500 --> 00:14:12.046 it can be very useful. 00:14:13.480 --> 00:14:15.650 They can detect trends and anomalies 00:14:15.760 --> 00:14:17.529 to prevent fraud and things like that. 00:14:17.660 --> 00:14:21.992 So it is important for health care and banking and that sort of thing. 00:14:22.961 --> 00:14:25.571 And here is a nice overview. 00:14:25.740 --> 00:14:27.810 You have the data, the foundation, 00:14:27.960 --> 00:14:29.167 or the transformer models. 00:14:29.640 --> 00:14:32.949 And here you have a couple of, well, resulting tools. 00:14:34.980 --> 00:14:39.894 And I suppose that most of you will know this video, 00:14:40.100 --> 00:14:41.167 but, well, we can see. 00:15:00.885 --> 00:15:04.093 OK, well, it isn't, I mean, you can find it on the Internet. 00:15:05.120 --> 00:15:08.252 It is an astonishing conversation 00:15:08.660 --> 00:15:11.711 where the computer is calling a hairdresser or hair salon 00:15:12.421 --> 00:15:15.130 and makes an appointment for, well, her boss, 00:15:16.000 --> 00:15:17.709 who's not involved at the moment. 00:15:18.260 --> 00:15:22.614 And especially this human-like hesitation and this humming 00:15:22.700 --> 00:15:25.211 and things like that is, well, it 00:15:25.320 --> 00:15:27.569 makes it very natural-sounding. 00:15:28.661 --> 00:15:29.706 And it is, yeah. 00:15:30.641 --> 00:15:30.761 OK.
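The "attention" quoted in the definition a moment ago can be illustrated with a toy example. This is a minimal sketch only, under simplifying assumptions: a real transformer uses learned query/key/value projection matrices and multiple attention heads, which are omitted here, and the token vectors are invented for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax: the resulting weights sum to 1.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(seq):
    """Toy scaled dot-product self-attention over a list of token vectors.

    For simplicity the queries, keys, and values are the token vectors
    themselves (no learned projections, as a real transformer would have).
    """
    d = len(seq[0])
    out = []
    for q in seq:
        # Similarity of this token to every token in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in seq]
        weights = softmax(scores)
        # Each output mixes *all* positions, which is how even distant
        # elements influence and depend on each other.
        out.append([sum(w * v[i] for w, v in zip(weights, seq))
                    for i in range(d)])
    return out

# Three invented 2-dimensional "word" vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = self_attention(tokens)
```

Each output vector is a weighted average of the whole sequence, with the weights determined by dot-product similarity; stacking such layers (plus the learned projections) is what the transformer papers build on.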
00:15:33.100 --> 00:15:35.989 Then, more or less at the same time, 2017 or '18, 00:15:37.860 --> 00:15:45.956 OpenAI started, and, well, as the name promises, they develop AI, and it will be open, well, partly 00:15:46.000 --> 00:15:51.555 at least. And is it an evolution or a revolution? Well, that's a discussion, but 00:15:54.283 --> 00:16:04.476 they started, well, they made a very impactful entrance in December 2022, so last year, 00:16:07.062 --> 00:16:12.953 with the introduction of ChatGPT, and, well, I know that most of you will know ChatGPT. 00:16:14.901 --> 00:16:21.456 And yesterday I asked the computer to explain quantum computing in simple terms, so that 00:16:21.500 --> 00:16:26.634 was just a question from me, and then you get a nice overview in which, well, more or less, 00:16:27.060 --> 00:16:31.690 quantum computing is explained, at least to me. 00:16:34.241 --> 00:16:37.291 And even, well, last week, I believe, 00:16:38.020 --> 00:16:42.854 they updated their language model, 00:16:42.980 --> 00:16:45.288 you may say, to GPT-4. 00:16:46.600 --> 00:16:51.071 So ChatGPT 3 and 3.5 stopped in 2021, 00:16:52.920 --> 00:16:55.027 because they used the data from before 2021. 00:16:57.221 --> 00:17:00.773 And that means that all current events, the war in Ukraine 00:17:00.900 --> 00:17:05.492 or things like that, you couldn't ask ChatGPT about. 00:17:06.481 --> 00:17:09.690 And now they have updated it to version 4, 00:17:10.980 --> 00:17:11.623 GPT-4. 00:17:13.401 --> 00:17:16.453 And if you have a paid account, then you 00:17:16.540 --> 00:17:18.289 can already access it. 00:17:18.440 --> 00:17:20.931 And I suppose it will become available 00:17:21.000 --> 00:17:22.889 for the free version as well. 00:17:23.380 --> 00:17:26.031 But, well, it is, again, better. 00:17:26.481 --> 00:17:28.871 And you can give images to it. 00:17:29.080 --> 00:17:31.430 It can do a kind of humor.
00:17:32.421 --> 00:17:36.230 And it is absolutely amazing what is possible with ChatGPT. 00:17:40.461 --> 00:17:44.189 Then in 2022, in September, end of September, 00:17:47.660 --> 00:17:50.890 they silently introduced Whisper. 00:17:52.181 --> 00:17:54.530 And that is a speech recognition engine, 00:17:55.702 --> 00:17:58.550 more or less also based on the transformer models. 00:17:59.880 --> 00:18:03.993 So what is Whisper? It is an automatic speech recognition 00:18:04.640 --> 00:18:09.575 engine trained on nearly 700,000 hours of multilingual and 00:18:09.602 --> 00:18:11.314 multitask data. 00:18:11.124 --> 00:18:13.636 And to give you an idea, 00:18:12.761 --> 00:18:15.047 700,000 hours is more speech than you and I will hear in 00:18:18.402 --> 00:18:21.792 our lives. So it's an enormous amount of data. 00:18:23.101 --> 00:18:30.196 And, well, they used, 60% I believe was English, and 40% were other languages. 00:18:32.804 --> 00:18:39.396 And they showed that the use of such a large and diverse data set leads to an improved robustness 00:18:39.640 --> 00:18:43.293 to accents, background noise, and also technical language. 00:18:44.884 --> 00:18:49.175 Moreover, it enables transcription in multiple languages. So it is one model, 00:18:50.183 --> 00:18:54.615 and you can give it a Dutch recording or a Chinese one or an Italian one, and 00:18:54.680 --> 00:19:00.534 it will give you the transcription in that language. 00:19:01.320 --> 00:19:03.631 And you can also translate it to English. 00:19:03.760 --> 00:19:08.813 So if you hear an interesting Chinese conversation and you don't know what it is about, you can 00:19:09.300 --> 00:19:13.774 give it to Whisper, make the transcription and then the translation into English, and you 00:19:14.000 --> 00:19:15.465 can see what it is about. 00:19:18.300 --> 00:19:25.275 And what is very nice of OpenAI is that this time it was a really open-source model.
00:19:25.460 --> 00:19:30.775 So they developed seven models, I believe, or eight or nine models, and you can download 00:19:30.800 --> 00:19:33.268 them from their site and use them. 00:19:35.180 --> 00:19:44.655 OK, if we look at speech recognition over the past two decades, I mean longer: in the 70s it 00:19:45.680 --> 00:19:51.554 started slowly, slowly, and it was more or less based on the Fourier transform. Then around 2000, 00:19:52.220 --> 00:19:54.925 well, 1995, '96, there were the first initiatives with the HMMs. And then in 2010, 00:20:01.521 --> 00:20:04.933 this is the paper from Microsoft at the Interspeech conference in Florence, 00:20:05.661 --> 00:20:12.270 they started with the deep neural networks. And, well, in 2019, I believe, it was the first time 00:20:17.801 --> 00:20:26.497 that, for correctly recorded American English conversations, it was at the level of human 00:20:26.600 --> 00:20:32.415 accuracy, outperformed it a little bit, but let's say it was more or less at human accuracy. 00:20:32.802 --> 00:20:33.907 And that's quite recent. 00:20:35.120 --> 00:20:42.874 And this is, well, something I made myself. 00:20:44.060 --> 00:20:47.673 But I believe that with the coming of the transformer 00:20:47.740 --> 00:20:51.693 models, it will increase a little bit. 00:20:52.000 --> 00:20:54.188 I mean, yeah, more than 100% is not possible. 00:20:55.400 --> 00:20:58.111 But we will see that in the coming years 00:20:58.240 --> 00:21:01.352 it will reach 00:21:01.400 --> 00:21:08.695 this human accuracy for other languages than American English, and it will also increase further. 00:21:08.940 --> 00:21:19.157 So in the end it will outperform us humans in correct recognition. And that said, you have 00:21:19.180 --> 00:21:26.115 to remember that this is for correctly, well-recorded audio.
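The accuracy being compared with humans here is conventionally measured as word error rate (WER): substitutions, deletions, and insertions divided by the number of reference words, computed with a word-level Levenshtein alignment. A minimal sketch (the example sentences are invented for illustration):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    via the classic Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One wrong word out of ten reference words gives a WER of 0.1,
# i.e. the "one error in ten words" level mentioned in the talk.
ref = "i am going by train from utrecht to zwolle today"
hyp = "i am going by train from utrecht to zwelle today"
wer = word_error_rate(ref, hyp)
```

"Human accuracy" claims for ASR are usually statements that the system's WER on a benchmark matches the WER of professional human transcribers on the same audio.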
00:21:26.740 --> 00:21:30.854 If you have a conversation where people talk over each other, or there is a train passing 00:21:30.900 --> 00:21:37.075 by, or other background noises, it will be different. However, it is performing very 00:21:37.140 --> 00:21:43.635 well. Whether it is outperforming humans is not sure at the moment. But for nice, cleanly 00:21:43.880 --> 00:21:49.112 recorded audio, it will absolutely beat us in the coming years. 00:21:51.020 --> 00:21:57.836 OK, however, speech recognition is not perfect. But even if it is perfect, I mean, what is 00:21:57.900 --> 00:22:11.476 going wrong? Well, it is still the case that we use a fixed list of words, our vocabulary, 00:22:12.600 --> 00:22:21.073 and in the Kaldi, so the current, old version of the ASR, we can recognize 260,000 different 00:22:24.141 --> 00:22:30.615 words, but there are always words that are not in that list, so they cannot be recognized. 00:22:31.180 --> 00:22:38.056 An example was the marsupilami. With the current version of Whisper, I do believe that marsupilami 00:22:38.160 --> 00:22:46.656 will be recognized, but I have to test it. And then there is the use of the language 00:22:46.741 --> 00:22:51.414 model: you can say something that is between Zwelle and Zwolle, and you mean Zwelle. I'm 00:22:51.500 --> 00:22:59.876 going by train from Utrecht to Zwelle. And we saw that Zwelle is not that popular and Zwolle is 00:23:00.080 --> 00:23:08.296 very popular, so the language model is replacing the recognized Zwelle by Zwolle, because it's more 00:23:08.540 --> 00:23:13.214 likely that people will say Zwolle, and, more or less, the train station in Zwelle doesn't exist 00:23:13.320 --> 00:23:20.696 anymore. So it makes sense, but it is not faithful to what was recognized, and it will be a discussion 00:23:20.760 --> 00:23:30.077 whether you want that or not. But okay. And what we do see is that people do say other words 00:23:31.362 --> 00:23:40.977 than they want to say. Well, I'll give an example as well.
And if you listen to it, you absolutely hear it, 00:23:42.563 --> 00:23:48.816 and it is just a mistake by the speaker. Yeah. Okay. That can be the case. And with our human brains, 00:23:48.900 --> 00:23:54.194 we can say, no, but we know what he means, because that word has nothing to do with this conversation. 00:23:54.660 --> 00:24:00.015 But that is something else: that is the interpretation of the recognition, and not the recognition itself. 00:24:02.163 --> 00:24:11.037 Okay, and speech AI is used for, well, this is an example from some five years ago, 00:24:12.202 --> 00:24:20.296 with the old recognition engines, and it is an overview of the, well, the, oh, yeah, it 00:24:20.420 --> 00:24:28.296 is an overview of, well, what we believe is the normal workflow for researchers. You do have 00:24:28.340 --> 00:24:34.253 an interview. It can be an analog interview, or nowadays you will have a digital interview. 00:24:35.480 --> 00:24:38.790 The analog one needs to be digitized, and you need storage. 00:24:40.100 --> 00:24:44.493 OK, we have stored the data, and then we need the transcription, 00:24:45.280 --> 00:24:48.771 and you can do it by yourself, listen to it and type it out. 00:24:50.141 --> 00:24:53.833 Or you can use, at least initially, automatic speech 00:24:53.920 --> 00:24:56.230 recognition, and then you get the timed text, 00:24:56.620 --> 00:24:57.887 and the only thing you have to do 00:24:58.140 --> 00:25:01.172 is check whether it was recognized well, 00:25:01.340 --> 00:25:03.791 find the errors and replace them, 00:25:03.920 --> 00:25:14.697 things like that. And this of course was in my field always a popular item, because, well, 00:25:14.720 --> 00:25:23.116 we really believe that speech recognition will help you. However, with an error rate where one out of 00:25:24.483 --> 00:25:30.835 ten words is wrong, it sometimes was too much work to check and improve the 00:25:31.280 --> 00:25:36.495 recognition results. Instead of doing that, you'd just transcribe it yourself.
00:25:37.642 --> 00:25:44.616 However, with the current version of Whisper, I absolutely believe that it is very useful to first 00:25:46.363 --> 00:25:53.677 run the speech recognizer and then check the results, because it absolutely improves your 00:25:53.740 --> 00:26:04.637 speed. So this is a little bit of old stuff. And, yeah, okay, and then you of course have 00:26:04.700 --> 00:26:09.074 to add your metadata, and then you have a searchable, transcribed audiovisual document, 00:26:09.140 --> 00:26:14.294 and you can say, well, give me the interviews where they talk about hunger, and then you get 00:26:14.841 --> 00:26:17.028 all the interviews where hunger is spoken. 00:26:19.140 --> 00:26:24.132 OK, this is more or less 10 years ago. 00:26:26.021 --> 00:26:27.807 I'm a little bit less gray there. 00:26:29.520 --> 00:26:32.710 And it was a question from the Dutch courts. 00:26:34.141 --> 00:26:35.488 And they said, well, at the moment 00:26:35.620 --> 00:26:40.273 it's not allowed to make recordings in the courtroom. 00:26:40.640 --> 00:26:44.553 However, we are interested to see if it works well 00:26:44.921 --> 00:26:46.830 and to see what the possibilities are. 00:26:46.920 --> 00:26:49.411 So they asked us, and we developed it. 00:26:49.721 --> 00:26:51.710 And it was quite a success. 00:26:51.780 --> 00:26:52.505 There's a video. 00:26:52.660 --> 00:26:55.552 If you look for Rechtspraakherkenning on YouTube, 00:26:55.620 --> 00:26:56.566 you will find this video. 00:26:56.720 --> 00:26:59.972 And it will explain a little bit, well, the performance 00:27:00.120 --> 00:27:01.808 and how it was working.
00:27:02.740 --> 00:27:05.150 A couple of years later, so five years ago, 00:27:05.460 --> 00:27:07.430 the field came with the same question 00:27:07.560 --> 00:27:12.014 and said, well, our research, our researchers, 00:27:12.140 --> 00:27:15.132 our people are recording more and more interviews, 00:27:15.260 --> 00:27:17.590 but once they record it, they need to give it 00:27:18.200 --> 00:27:20.068 to the courtroom with a transcription. 00:27:21.120 --> 00:27:23.070 So can speech recognition help us 00:27:23.260 --> 00:27:24.587 in speeding up this process? 00:27:25.681 --> 00:27:26.285 And yeah. 00:27:26.980 --> 00:27:28.006 Oh, no. 00:27:31.067 --> 00:27:33.374 And hey. 00:27:37.184 --> 00:27:48.738 OK, OK, so we did the recordings, we simulated them, and then we started a test where we compared 00:27:48.940 --> 00:27:56.356 the classic way, that's the blue line, with the new one, and that's the red one. And what 00:27:56.420 --> 00:28:02.375 you see here is that, I mean, if you have an interview with someone and you need to make 00:28:02.460 --> 00:28:10.415 some notes, it takes time, and sometimes, well, you need some pauses: here, here, here, 00:28:11.260 --> 00:28:20.477 and here, and here. You can see that the time has increased, but they are not 00:28:20.560 --> 00:28:26.753 speaking. So that means they need some time to make their notes. And then they say, OK, go 00:28:28.000 --> 00:28:35.836 on. However, this decreases the quality of the interview, because once someone starts talking, 00:28:35.860 --> 00:28:41.414 you want them to continue talking, and you don't want to interfere with, well, stop a moment, I 00:28:41.880 --> 00:28:51.155 need to make some notes. So in the end you see that in the same time a lot more words, so 30 to 00:28:53.102 --> 00:28:59.496 50% more words, were spoken, so we have more material in the same time, and this was quite 00:28:59.580 --> 00:29:07.816 convincing for the field to start using speech recognition.
We did a test for the NIOD, 00:29:08.060 --> 00:29:17.096 the witness stories, getuigenverhalen.nl. They have 600,000 hours of oral history about World War II, 00:29:18.222 --> 00:29:23.033 and, well, I mean, no one is going to listen to 600,000 hours of interviews, 00:29:24.220 --> 00:29:27.893 but you want the interviews that talk about, well, some particular topics you 00:29:27.960 --> 00:29:35.516 are interested in, so you can search in the spoken content. And we did a 00:29:35.540 --> 00:29:42.636 project for the foreign ministry, the Ministry of Foreign Affairs, and it was in 00:29:42.800 --> 00:29:49.334 Croatia and Bosnia, so Croatian memories and Bosnian memories, and we recorded 700 interviews. 00:29:50.200 --> 00:29:54.471 However, at that time, the speech recognition was not working for those languages. 00:29:55.780 --> 00:30:01.493 And so this was hand-transcribed, but the translation into English was done automatically. 00:30:03.301 --> 00:30:09.916 CLARIN, the national, or well, the European infrastructure for language and speech technology, 00:30:12.264 --> 00:30:18.796 supported this, and they said, well, can you start an OH portal? And it's now a transcription portal 00:30:19.280 --> 00:30:28.356 at the University of Munich, and if you have an account you can go there and upload your files, 00:30:28.580 --> 00:30:31.190 select the language, and download the results. 00:30:32.040 --> 00:30:35.693 This will probably be replaced by Whisper in the coming months, 00:30:35.880 --> 00:30:39.071 but at the moment it is the old version. 00:30:39.940 --> 00:30:41.528 There is the Google version. 00:30:41.920 --> 00:30:43.709 And here the Dutch version. 00:30:44.180 --> 00:30:47.692 Well, those are the languages supported at the moment. 00:30:49.704 --> 00:30:55.017 And okay, here are some shots from projects we did. 00:30:54.982 --> 00:30:56.951 This was the forced alignment. 00:30:56.980 --> 00:31:02.614 So we got the text from the Second Chamber, the Tweede Kamer.
00:31:03.380 --> 00:31:07.193 They are required to provide a correct transcription. 00:31:07.260 --> 00:31:07.985 They gave it to us. 00:31:08.120 --> 00:31:09.749 And what we did was the forced alignment. 00:31:10.722 --> 00:31:14.191 So the subtitles are automatically generated. 00:31:16.221 --> 00:31:20.154 And then the Flemish government was quite enthusiastic 00:31:20.220 --> 00:31:21.488 about our Dutch effort. 00:31:21.700 --> 00:31:23.608 And they asked us to do the same for them. 00:31:24.460 --> 00:31:35.097 We did it, and the result was a collaboration between the Flemish universities and the Dutch universities, and what they also built was a speaker recognition engine. 00:31:35.200 --> 00:31:38.430 So we know exactly who is speaking at each moment. 00:31:40.281 --> 00:31:42.728 And here are the partners. 00:31:44.660 --> 00:31:45.946 Here are the results. 00:31:47.540 --> 00:31:53.695 However, it is a verbatim transcription, so it is a, more or less, 100% correct transcription. 00:31:54.140 --> 00:31:57.352 And it turned out that it's too much for them. 00:31:57.440 --> 00:31:58.707 So they want more of a summary. 00:31:59.220 --> 00:32:00.906 And that's not possible at the moment. 00:32:03.460 --> 00:32:04.766 And here you see, again, the 150,000 hours. 00:32:06.000 --> 00:32:08.110 So that will be more by now, I suppose. 00:32:08.200 --> 00:32:10.206 But a couple of years ago, it was 150,000 hours at SURF. 00:32:12.921 --> 00:32:15.612 And SURF is also experimenting together with us 00:32:15.740 --> 00:32:21.374 to see if they can recognize all their material. 00:32:21.620 --> 00:32:23.525 It is more or less 50% English, 50% Dutch. 00:32:27.160 --> 00:32:31.433 OK, then a project from Utrecht University. 00:32:32.080 --> 00:32:37.595 Can you use it with patients, for the care report? 00:32:37.700 --> 00:32:41.813 So you're going to your GP, and you have some questions. 00:32:42.000 --> 00:32:43.368 And normally, they are typing a lot.
00:32:43.700 --> 00:32:46.311 Can you replace the typing with speech recognition? 00:32:46.440 --> 00:32:49.230 And then, well, you have more or less an overview. 00:32:50.320 --> 00:32:52.128 And it turned out that it works well. 00:32:52.941 --> 00:32:57.413 And they are starting now some real-life showcases 00:32:57.900 --> 00:32:59.830 where they will show that the recognition will 00:32:59.920 --> 00:33:02.689 help the GPs in their daily work. 00:33:03.921 --> 00:33:05.508 And of course, at the University of Twente, 00:33:05.880 --> 00:33:07.930 we are focusing on robots: the interaction 00:33:08.020 --> 00:33:10.650 of children and elderly people with robots. 00:33:11.501 --> 00:33:13.530 And we are using the speech recognition 00:33:13.620 --> 00:33:14.926 to understand what they are saying. 00:33:16.081 --> 00:33:19.152 And at the same time, we're also developing some software 00:33:19.180 --> 00:33:25.215 to see how they are saying it. So the emotion in the speech is as important as what they 00:33:25.220 --> 00:33:26.584 are saying. So how and what. 00:33:30.360 --> 00:33:37.615 Then we have people with difficulties. I mean, the brain doesn't always cooperate. 00:33:38.620 --> 00:33:49.077 Here I will show you an example of someone who has Parkinson's. I'm not sure if you can 00:33:49.161 --> 00:33:49.443 hear it. 00:34:06.466 --> 00:34:09.193 So that's, I mean, you can understand it, 00:34:09.700 --> 00:34:11.989 but you have to listen carefully. 00:34:13.061 --> 00:34:15.832 Even the modern speech engines do not always 00:34:16.161 --> 00:34:18.672 recognize these speakers 00:34:18.842 --> 00:34:20.329 as well as we would want. 00:34:20.680 --> 00:34:24.694 Yeah, we couldn't hear it. No? 00:34:25.002 --> 00:34:27.351 Well, you will get the presentation and you can listen to it yourself. 00:34:28.402 --> 00:34:32.735 Yeah, I forgot to arrange the audio as well.
00:34:33.442 --> 00:34:38.395 And this was a project we did two years ago. 00:34:38.902 --> 00:34:39.967 So in the COVID period. 00:34:41.302 --> 00:34:46.455 And well, I can show you, or see if you can hear it. 00:35:02.391 --> 00:35:06.614 Well, you can see the difference between him and also his helper. 00:35:13.060 --> 00:35:15.330 I mean, if his helper speaks, the recognition is easy, 00:35:15.961 --> 00:35:17.709 but he himself is nearly impossible to recognize. 00:35:17.880 --> 00:35:20.450 And what we did, we developed a special 00:35:21.420 --> 00:35:22.947 speech recognition engine for him 00:35:23.740 --> 00:35:28.494 to help him using the engine for work, 00:35:30.005 --> 00:35:31.772 for school, for traveling, 00:35:31.980 --> 00:35:33.709 and to tell something about himself. 00:35:34.381 --> 00:35:37.913 It worked, but I mean, it can be much, much better. 00:35:41.626 --> 00:35:48.157 And this will be, well, I will skip this one, but this is in the English Parliament, and you 00:35:48.260 --> 00:35:53.114 hear so much noise in the background that, okay, you can hear it, but that is very 00:35:53.180 --> 00:35:55.428 difficult even for a modern engine. 00:35:58.223 --> 00:36:03.233 Okay, as said before, emotion is important. 00:36:04.600 --> 00:36:08.112 We know we are now more or less at the level that, with well- 00:36:08.700 --> 00:36:11.550 recorded audio, we are at the level of human recognition. 00:36:12.340 --> 00:36:17.157 However, we need some emotion inside, and the question is, can 00:36:17.204 --> 00:36:20.617 we detect the emotion in the conversation? Because if you 00:36:20.142 --> 00:36:24.576 ask me, do you like football? I can say yes, and that isn't 00:36:24.642 --> 00:36:26.691 convincing. Yes. Or I can say yeah. 00:36:27.583 --> 00:36:29.972 And it is a yes, but I mean no. 00:36:31.102 --> 00:36:36.915 So can you use that kind of emotion detection inside modern conversations? 00:36:37.882 --> 00:36:43.215 And then there's the question, which emotion?
I mean, we have the big five: sadness, anger, fear, joy and neutral. 00:36:43.722 --> 00:36:46.051 I mean, they are more or less universally human. 00:36:46.761 --> 00:36:52.315 And that means that all the people in the world are having these five emotions. 00:36:52.761 --> 00:37:00.916 However, sarcasm and other more subtle emotions are culturally dependent. 00:37:01.701 --> 00:37:09.072 And that means that it depends on the speaker and also on the listener how the emotion is perceived. 00:37:12.360 --> 00:37:19.874 So that will be a difficult question, but well, it is worthwhile working on it. 00:37:21.981 --> 00:37:28.955 Okay, so that brings us, more or less at the end of this talk, to the next step. 00:37:29.820 --> 00:37:32.892 And that is, what do we mean by going from recognition to understanding? 00:37:33.080 --> 00:37:35.708 So can you understand what is said? 00:37:37.701 --> 00:37:46.336 In collaboration with Nijmegen University, Twente University and two companies, we had 00:37:46.400 --> 00:37:53.136 a question from the Dutch police force, and they had a database with 45 hours of verbatim 00:37:53.220 --> 00:37:57.114 transcription. Well, they had the audio; we needed to give them the transcription, 00:37:57.741 --> 00:38:04.495 the part-of-speech tagging and also the emotion annotation. And one of the companies was Pandora 00:38:05.000 --> 00:38:13.836 in Amersfoort, and they made a nice film that is in English, and I tried to figure out what was said. 00:38:15.424 --> 00:38:19.496 And so I'm using the modern engine to do the recognition. 00:38:20.526 --> 00:38:22.072 Let's see if it is working. 00:38:39.474 --> 00:38:45.098 Yeah, that's the wrong version; if I send it to you, you will 00:38:45.120 --> 00:38:49.854 get the English and eventually also the Dutch transcription below it. 00:38:50.883 --> 00:38:55.375 But the recognition is more or less a hundred percent.
00:38:55.480 --> 00:38:59.493 And that's really, I mean, of course it's a good recording, but 00:38:59.721 --> 00:39:02.792 still, that it is so good is astonishing. 00:39:04.228 --> 00:39:04.469 Okay. 00:39:04.460 --> 00:39:08.974 If you look at the future of artificial-intelligence-led speech technology, 00:39:10.002 --> 00:39:18.137 we see the next step, or the step in the coming years, will be from recognizing what was said to understanding what was meant. 00:39:19.563 --> 00:39:25.557 So that will be an absolutely important part of our research. 00:39:25.421 --> 00:39:32.276 And what is someone's emotional state? And so how can you deal with that 00:39:32.060 --> 00:39:38.675 person? And of course we have to figure out how to use the speech recognition for the 00:39:38.780 --> 00:39:45.656 smaller languages, and even Dutch is a small language. I mean, given the Chinese and the 00:39:45.740 --> 00:39:53.056 Indian and the American English population, we are doing well. But I mean, Frisian, or 00:39:53.140 --> 00:39:59.292 some heavy dialects, or languages like Icelandic that are spoken by only 300,000 people. Can we 00:40:01.360 --> 00:40:11.117 do the recognition for those languages as well? And we need to speed it up so you can use 00:40:11.160 --> 00:40:19.776 it in all kinds of real-time situations. And there is the kind of dream that you are talking 00:40:19.800 --> 00:40:23.774 with a Chinese engineer, and he's speaking in Chinese and you're speaking in Dutch; 00:40:24.622 --> 00:40:29.835 his Chinese will be translated automatically to subtitles in Dutch, and my Dutch will 00:40:29.760 --> 00:40:33.553 be automatically translated into Chinese. 00:40:33.640 --> 00:40:36.170 So, I mean, it is more or less possible, 00:40:37.424 --> 00:40:39.010 but we need to speed it up a little bit, 00:40:39.600 --> 00:40:42.470 so we can have those conversations in the coming years.
00:40:43.600 --> 00:40:47.833 And okay, this was more or less the end of, 00:40:48.020 --> 00:40:51.713 well, a smaller version of my presentation 00:40:51.900 --> 00:40:53.909 about AI and speech technology. 00:40:54.600 --> 00:40:56.348 And now we're going back to Whisper. 00:40:57.160 --> 00:41:00.212 And Whisper, as said before, is open source, 00:41:00.280 --> 00:41:02.608 so you can download the models, at least at the moment. 00:41:03.920 --> 00:41:06.150 It works more or less, at least for American English, 00:41:06.260 --> 00:41:08.991 at the human level, so it outperforms it a little bit, 00:41:09.100 --> 00:41:10.225 or not, well, it depends. 00:41:11.501 --> 00:41:14.752 And you can ask yourself, why use Whisper? 00:41:15.040 --> 00:41:18.472 Well, here are some statements, but, 00:41:20.143 --> 00:41:24.934 I mean, the basic answer is it works absolutely 00:41:25.400 --> 00:41:34.616 gorgeously. Yeah, it's from September, so it's five, six months old. And 00:41:36.123 --> 00:41:41.075 here are some slides where it's more or less explained. However, 00:41:41.140 --> 00:41:47.695 you can look it up yourself at the GitHub repository. And they 00:41:47.800 --> 00:41:53.775 say that it more or less outperforms most speech 00:41:53.880 --> 00:41:55.568 recognition engines available. 00:41:56.400 --> 00:41:58.950 However, if you have some dedicated, 00:41:59.620 --> 00:42:03.452 very specific conversations, 00:42:04.141 --> 00:42:07.211 then some other engines are doing better. 00:42:07.660 --> 00:42:10.972 But overall, Whisper is the winner at this moment. 00:42:12.644 --> 00:42:16.634 Okay, installing Whisper, is it difficult? 00:42:16.861 --> 00:42:16.961 No. 00:42:17.660 --> 00:42:21.112 Well, you first need to install Python on your computer, 00:42:21.621 --> 00:42:23.626 and it has to be a version from 3.8 through 3.10. 00:42:26.801 --> 00:42:28.547 I started first with 3.11. 00:42:30.281 --> 00:42:31.588 And then there were some problems.
00:42:31.800 --> 00:42:33.366 So I went back to 3.9. 00:42:35.261 --> 00:42:36.286 And now it works. 00:42:37.540 --> 00:42:39.149 You need to install PyTorch. 00:42:39.420 --> 00:42:42.632 And there's a lot of information on the internet 00:42:42.740 --> 00:42:44.508 on how to do it, or you may have it already. 00:42:45.040 --> 00:42:47.269 You have to install FFmpeg. 00:42:48.280 --> 00:42:52.032 And once you have installed these three basic packages, 00:42:53.463 --> 00:42:57.034 you need to install Whisper from GitHub. 00:42:57.140 --> 00:42:59.930 So the Whisper Git repository, I mean, and that's all. 00:43:01.222 --> 00:43:04.172 It takes a couple of minutes, and then it's installed 00:43:04.220 --> 00:43:06.350 and you have the engine on your computer. 00:43:09.227 --> 00:43:10.893 Using Whisper, there's a good, well, 00:43:11.020 --> 00:43:15.714 help facility for the parameters 00:43:15.780 --> 00:43:17.086 that you can add to it. 00:43:18.140 --> 00:43:19.987 However, the most important is the model. 00:43:21.900 --> 00:43:27.133 And with a fast computer and a GPU, 00:43:27.840 --> 00:43:30.589 the tiny model is 32 times faster than real time. 00:43:32.160 --> 00:43:35.329 So one minute will be done in about two seconds. 00:43:37.060 --> 00:43:38.006 This is the base version. 00:43:38.160 --> 00:43:38.825 It is slower. 00:43:39.421 --> 00:43:43.153 And the large model, well, it has a relative speed of one. 00:43:43.561 --> 00:43:44.908 But I doubt it. 00:43:45.220 --> 00:43:46.648 It is slower, I believe. 00:43:47.040 --> 00:43:49.551 But I mean, these are the models that you can load. 00:43:49.680 --> 00:43:51.526 And of course, the large model is 1.5 gigabytes. 00:43:54.342 --> 00:43:54.946 And that's a lot. 00:43:55.241 --> 00:43:57.931 And the tiny model is only 40 megabytes. 00:43:58.220 --> 00:44:01.331 So normally, I'm using the medium version.
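The relative speeds just mentioned translate directly into processing time. A minimal sketch of that arithmetic, using the illustrative figures from the talk (tiny roughly 32 times real time on a GPU, medium roughly twice, large roughly real time); actual throughput depends entirely on your hardware:

```python
# Rough processing-time estimate per Whisper model, using the relative
# speeds quoted in the talk (illustrative numbers, assuming a GPU machine).
RELATIVE_SPEED = {  # multiples of real time
    "tiny": 32.0,
    "medium": 2.0,
    "large": 1.0,
}

def processing_minutes(audio_minutes: float, model: str) -> float:
    """How long a recording of `audio_minutes` takes to transcribe."""
    return audio_minutes / RELATIVE_SPEED[model]

print(round(processing_minutes(1, "tiny") * 60, 2))  # seconds for 1 min of audio: 1.88
print(processing_minutes(60, "medium"))              # 1 hour of audio: 30.0 minutes
```

So a one-hour recording with the medium model takes on the order of half an hour, matching the rule of thumb below.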
00:44:02.160 --> 00:44:05.973 And because I have a lot of recordings in English, 00:44:06.080 --> 00:44:08.530 but also in Dutch, in Italian, and other languages, 00:44:09.040 --> 00:44:12.711 I'm not using medium.en, but medium as such. 00:44:13.840 --> 00:44:22.876 It is five gigabytes, and, well, you need that on your computer, and it runs at twice 00:44:23.041 --> 00:44:28.215 real time. So one hour takes you half an hour, a little bit more, also depending 00:44:28.300 --> 00:44:36.656 on your computer. But I mean, you just pass the model to your recognition, and if you 00:44:36.800 --> 00:44:40.872 want to do it with English only, because the speakers are native English speakers, 00:44:41.400 --> 00:44:44.590 you type in medium.en. 00:44:45.941 --> 00:44:48.591 And if it's not available, it will be downloaded. 00:44:48.740 --> 00:44:50.088 It takes some minutes, and then you 00:44:50.220 --> 00:44:53.670 have the medium.en model on your computer as well. 00:44:55.341 --> 00:44:56.546 And you can do what you want. 00:44:58.460 --> 00:44:59.647 Then there are some tools. 00:45:00.400 --> 00:45:07.775 I'm particularly fond of WhisperX, because in the results of Whisper 00:45:08.341 --> 00:45:10.831 they are taking frames of 30 seconds 00:45:10.960 --> 00:45:12.949 and giving the recognition results per frame. 00:45:13.641 --> 00:45:15.610 Those 30 seconds are sometimes, for subtitles 00:45:15.800 --> 00:45:19.031 or for all kinds of other research, too coarse. 00:45:19.980 --> 00:45:22.531 And with WhisperX, you get a word-level, 00:45:22.860 --> 00:45:26.532 more or less accurate recognition. 00:45:26.940 --> 00:45:28.228 So they do the recognition 00:45:28.340 --> 00:45:30.770 and then a kind of forced alignment based on it. 00:45:31.481 --> 00:45:34.993 And you get subtitles with per-word timeframes. 00:45:36.304 --> 00:45:37.890 And it's very useful.
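Once you have timed segments, turning them into subtitles is mostly formatting. A minimal sketch, assuming segment dictionaries with "start", "end" and "text" keys, which is the shape Whisper's Python API returns in result["segments"]; WhisperX produces the same shape with much finer, word-level timings:

```python
# Turn Whisper-style segments into WebVTT cues (a sketch; the demo
# segments below are made up for illustration).

def vtt_timestamp(seconds: float) -> str:
    """Format seconds as an HH:MM:SS.mmm WebVTT timestamp."""
    ms = round(seconds * 1000)
    h, rest = divmod(ms, 3_600_000)
    m, rest = divmod(rest, 60_000)
    s, ms = divmod(rest, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def to_vtt(segments) -> str:
    lines = ["WEBVTT", ""]
    for seg in segments:
        lines.append(f"{vtt_timestamp(seg['start'])} --> {vtt_timestamp(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")
    return "\n".join(lines)

demo = [{"start": 0.0, "end": 2.5, "text": " Goedemorgen."},
        {"start": 2.5, "end": 30.0, "text": " Ik ben Arjan van Hessen."}]
print(to_vtt(demo))
```

With WhisperX output, each "segment" can be a single word, which is exactly what gives you the per-word subtitle timing mentioned above.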
00:45:38.640 --> 00:45:42.573 For the Macintosh, of course, you can use it with Python 00:45:43.201 --> 00:45:45.309 on your computer, but there's also MacWhisper. 00:45:46.520 --> 00:45:48.750 I believe it's a Dutch guy who developed it, 00:45:48.900 --> 00:45:52.451 but it is based on the C version of the Whisper engine. 00:45:53.581 --> 00:45:59.313 And the free version can use the small and the tiny models. 00:46:00.520 --> 00:46:02.349 But if you want medium or large, you 00:46:02.460 --> 00:46:04.047 have to pay 15 euros, one time. 00:46:04.980 --> 00:46:06.387 So it's not that much. 00:46:07.401 --> 00:46:09.310 And well, here's the interview. 00:46:09.440 --> 00:46:10.547 You have just a screen. 00:46:10.880 --> 00:46:13.892 You drop in the video or the audio 00:46:13.980 --> 00:46:15.066 you want to have recognized. 00:46:15.581 --> 00:46:18.711 And it starts to produce the recognized text. 00:46:19.481 --> 00:46:20.547 And here you see the results. 00:46:20.620 --> 00:46:24.913 And there are some small things, like changing some words; 00:46:25.220 --> 00:46:27.731 for example, Janssen, that can be written 00:46:27.820 --> 00:46:28.945 in very different ways. 00:46:30.301 --> 00:46:31.568 And you say, no, this is Janssen. 00:46:31.680 --> 00:46:32.285 It's with two S's. 00:46:32.400 --> 00:46:34.751 And then, well, you can do a search 00:46:34.780 --> 00:46:36.449 and replace on the results. 00:46:36.660 --> 00:46:41.213 But you will see, if you use it, it's easy to see. 00:46:42.304 --> 00:46:42.525 Okay. 00:46:43.120 --> 00:46:45.230 I mean, that was the conclusion. 00:46:45.340 --> 00:46:48.312 So for Whisper: Python, PyTorch and FFmpeg are 00:46:48.400 --> 00:46:51.111 the basic packages you need to install. 00:46:51.280 --> 00:46:53.811 And once you have done that, pip install, 00:46:54.982 --> 00:46:58.493 git+https, well, you can see it there. 00:46:59.020 --> 00:47:03.173 And you have the best engine on your computer.
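The search-and-replace step just described is easy to script once you collect the recurring misspellings. A minimal sketch; the correction table is hypothetical, and you would fill in the variants you actually encounter (the two entries echo the examples from the talk, the Janssen spelling and the out-of-vocabulary company name):

```python
# Post-hoc corrections on a recognition result: fix recurring
# misspellings of names and out-of-vocabulary terms in one pass.
import re

CORRECTIONS = {  # hypothetical table: recognized variant -> correct form
    "Jansen": "Janssen",        # the name with two S's
    "Telen & Kets": "Telecats", # company name missing from the vocabulary
}

def apply_corrections(text: str) -> str:
    for wrong, right in CORRECTIONS.items():
        # \b keeps us from touching words that merely contain the variant
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text)
    return text

print(apply_corrections("Meneer Jansen werkt bij Telen & Kets."))
# → Meneer Janssen werkt bij Telecats.
```

The word boundaries matter: without them, a replacement could corrupt longer words that happen to contain the variant.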
00:47:03.680 --> 00:47:07.171 And again, on the computer I'm giving this presentation on, 00:47:07.900 --> 00:47:10.109 I do have a graphical card, an NVIDIA. 00:47:10.820 --> 00:47:12.770 So that means I do have a GPU, 00:47:13.080 --> 00:47:17.413 so I can use the GPU version of Whisper. 00:47:18.703 --> 00:47:22.574 If you don't have it, it will take much more time, 00:47:22.660 --> 00:47:26.953 because the relative speed is based on a fast GPU. 00:47:27.480 --> 00:47:29.791 If you don't have a GPU, yeah, it takes you more time, 00:47:29.860 --> 00:47:31.027 but you can do it in the evening 00:47:31.100 --> 00:47:32.487 and the next morning it will be ready. 00:47:33.300 --> 00:47:35.717 So, yeah. 00:47:35.727 --> 00:47:36.530 Okay, conclusion. 00:47:38.184 --> 00:47:41.654 Automatic speech recognition is working well at this moment. 00:47:41.740 --> 00:47:45.292 I mean, it's absolutely astonishing how good it is. 00:47:47.386 --> 00:47:49.092 We do need, and I mean 00:47:49.380 --> 00:47:52.011 the community of speech technologists, 00:47:52.100 --> 00:47:56.954 we do need to give a little bit more attention 00:47:57.060 --> 00:47:59.130 to people who are less well represented: 00:47:59.180 --> 00:48:02.652 people with a heavy dialect, or people coming from outside 00:48:02.780 --> 00:48:07.072 and trying to speak Dutch, people with diseases. 00:48:08.162 --> 00:48:10.350 So there is of course something to do, 00:48:11.040 --> 00:48:12.929 but I do believe that in the coming years 00:48:13.140 --> 00:48:17.533 we will probably solve this. 00:48:19.546 --> 00:48:21.052 And that's the next step, 00:48:21.220 --> 00:48:22.768 but that is more the research topic. 00:48:22.940 --> 00:48:24.910 I mean, we need to go from the recognition 00:48:25.060 --> 00:48:25.846 to the understanding.
00:48:25.920 --> 00:48:30.113 We do need to know what someone's 00:48:30.340 --> 00:48:35.672 intention was when we did the recognition, and not only a verbatim transcription. 00:48:37.540 --> 00:48:43.591 This is the end, and I have the question: what questions do you have for me? 00:48:47.000 --> 00:48:53.752 Before you start with this, there will be, if you like it, a follow-up. 00:48:56.641 --> 00:49:04.956 Utrecht University will organize that. Reinike will send you mails. But I can imagine that 00:49:05.060 --> 00:49:13.596 initially the first steps are always more difficult than you want them to be. So if there are five or more 00:49:13.680 --> 00:49:18.615 people who want to continue trying to do the recognition with their results and on their own 00:49:18.660 --> 00:49:27.476 computer, we will organize some classes, I mean meetings, where I will be there, eventually with other 00:49:28.704 --> 00:49:32.655 technically involved people, trying to help you install Python and the other stuff, 00:49:33.381 --> 00:49:40.976 and then also help you with the recognition, and show you which model to use when, things like that. 00:49:41.520 --> 00:49:47.715 But it depends heavily on you. I mean, if you are technically quite well versed, you can do it yourself. 00:49:48.261 --> 00:49:51.533 If you want us to help you, let us know and we will organize that. 00:49:52.564 --> 00:49:54.933 And now it's open for questions. 00:50:02.327 --> 00:50:04.714 And if there are no questions, that's fine for me as well. 00:50:04.780 --> 00:50:07.271 But yeah, Daphne. 00:50:08.225 --> 00:50:08.566 Hello. 00:50:12.426 --> 00:50:15.014 Wait, let me turn my camera on so you can see me. 00:50:15.802 --> 00:50:20.394 So, I'm not a researcher, but I'm actually a privacy officer at the university. 00:50:21.403 --> 00:50:25.234 So, my interest in this is, well, privacy-related.
00:50:25.981 --> 00:50:34.036 And I was wondering how you feel, or how you want to deal with the privacy issues that come up with this. 00:50:34.120 --> 00:50:41.736 Because obviously, if you transcribe interviews, then it's more easily accessible to lots of people. 00:50:41.880 --> 00:50:45.733 Absolutely so, but first we have to go one step back. 00:50:46.100 --> 00:50:50.914 I mean, the current version, well, a more or less outdated 00:50:51.000 --> 00:50:54.352 but still working version, is that we developed a server 00:50:54.460 --> 00:50:55.626 at the Nijmegen University. 00:50:56.640 --> 00:50:59.151 You go there, you make your login and your password, 00:50:59.560 --> 00:51:02.452 and you need to show that you are not a commercial company 00:51:02.540 --> 00:51:05.029 but working at a university or that kind of organization. 00:51:06.000 --> 00:51:08.729 And you upload a file, you select, 00:51:10.121 --> 00:51:12.270 you push the button, start. 00:51:12.860 --> 00:51:15.571 And then after a couple of minutes, 00:51:15.760 --> 00:51:17.890 depending on the length of your recording, 00:51:17.940 --> 00:51:18.704 you get the results. 00:51:19.740 --> 00:51:24.915 However, if it is quite a sensitive recording, 00:51:24.820 --> 00:51:28.933 it means that it's going to another organization, 00:51:29.120 --> 00:51:30.508 and that's not doing any harm. 00:51:30.660 --> 00:51:34.773 But still, given the GDPR, it is a little bit tricky 00:51:35.200 --> 00:51:36.848 whether that is allowed, yes or no. 00:51:37.740 --> 00:51:40.011 If you have very sensitive material, 00:51:40.120 --> 00:51:43.853 you can make a contract, and then it is not done 00:51:43.920 --> 00:51:46.370 via the internet, but you send a USB stick to them 00:51:46.900 --> 00:51:49.089 and they will handle it with all the privacy- 00:51:49.900 --> 00:51:51.648 related items involved.
00:51:52.440 --> 00:51:55.211 However, with the current version of Whisper, 00:51:56.444 --> 00:51:58.512 I mean, you can do it on your own computer. 00:51:58.680 --> 00:52:00.449 So that means that you can listen to it 00:52:00.560 --> 00:52:03.752 and make a handwritten transcription, 00:52:04.200 --> 00:52:06.450 or you can give it to Whisper on your own computer. 00:52:06.840 --> 00:52:09.209 So there's no, I mean, you can do it without any internet 00:52:09.940 --> 00:52:11.167 and you get the results. 00:52:11.640 --> 00:52:13.650 And okay, so there is no difference 00:52:13.760 --> 00:52:15.629 between a handmade transcription 00:52:16.080 --> 00:52:19.852 or a transcription done on your own computer with Whisper. 00:52:20.760 --> 00:52:23.150 What you do after that, and then of course, 00:52:23.500 --> 00:52:24.988 it's a sensible point that you mentioned. 00:52:25.420 --> 00:52:26.848 I mean, once you have the transcription 00:52:26.980 --> 00:52:28.569 and you place that somewhere on a website 00:52:28.660 --> 00:52:29.746 or things like that, yeah. 00:52:30.320 --> 00:52:33.933 But there's no difference between the use of Whisper 00:52:34.160 --> 00:52:36.370 on your own computer or doing 00:52:36.721 --> 00:52:38.609 the transcription by hand yourself. 00:52:39.280 --> 00:52:43.232 So yeah, the privacy is still very important, 00:52:43.880 --> 00:52:48.233 but Whisper is not going to change that, so to say. 00:52:49.564 --> 00:52:51.570 Is that more or less what you wanted to hear? 00:52:52.620 --> 00:52:53.867 Yeah, very much. Yeah, thank you. 00:52:54.420 --> 00:52:54.621 Okay. 00:52:58.103 --> 00:53:06.656 Hi, I was wondering, you mentioned the difference between specific languages, like Dutch, and 00:53:07.680 --> 00:53:09.408 that more attention needs to go towards that.
00:53:09.940 --> 00:53:14.674 But what would you say are the biggest differences in the models currently used, for example, 00:53:14.840 --> 00:53:16.506 at Sound and Vision, which has ASR? 00:53:18.300 --> 00:53:23.955 Do you see patterns in the mistakes it makes, or in what kind of ways does that differ from 00:53:24.121 --> 00:53:25.027 Whisper, for example? 00:53:25.361 --> 00:53:31.135 Ah, well, the error rate at Sound and Vision, which is using the Kaldi 00:53:31.220 --> 00:53:37.295 recognizer, is, I mean, more dependent on the quality of the speech and things like 00:53:38.003 --> 00:53:38.144 that. 00:53:38.480 --> 00:53:41.187 But for Dutch, it is between 12 and 15 percent. 00:53:44.440 --> 00:53:50.573 So it means that 85% of the words are recognized well, and for the other 15%, 00:53:51.540 --> 00:53:52.726 you absolutely need to check them. 00:53:54.782 --> 00:54:01.815 Whisper is at the human level, so that is between three and five percent. So only one-third 00:54:02.220 --> 00:54:09.856 of the errors made with the Kaldi recognizer are made by Whisper. Moreover, if you 00:54:09.900 --> 00:54:15.113 say, well, I'm Arjan and I'm living in Utrecht, Arjan and also Utrecht are written with a capital, 00:54:16.020 --> 00:54:21.455 and with the Kaldi recognizer all the words are lowercase, and there are no question marks, 00:54:21.480 --> 00:54:27.074 there are no commas, there are no dots, and all that reading stuff is not available. 00:54:27.320 --> 00:54:28.544 With Whisper, it is available. 00:54:31.140 --> 00:54:33.828 And I will show you. 00:54:36.060 --> 00:54:39.966 But that's, yeah, I'm still in. 00:54:47.780 --> 00:54:50.530 It's my web. Can you see what I'm doing? 00:54:51.761 --> 00:54:52.405 Yeah, OK. 00:55:00.707 --> 00:55:02.954 And here is an example. It is in Dutch, but well.
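The percentages quoted here are word error rates (WER). A minimal sketch of how such a rate is computed, as word-level edit distance over a reference transcription; real evaluations also normalize punctuation and casing first, which this sketch skips:

```python
# Word error rate: (substitutions + insertions + deletions) / reference length,
# via the classic dynamic-programming edit distance over words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("ik ben arjan en ik woon in utrecht",
          "ik ben arjen en ik woon in utrecht"))  # 1 error in 8 words → 0.125
```

On this scale, the difference between roughly 15% (Kaldi) and roughly 5% (Whisper) is the factor of three mentioned in the answer.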
00:55:08.608 --> 00:55:09.792 And here you do see it: 00:55:10.000 --> 00:55:13.671 the sentence starts with a capital I, and then there's a comma, 00:55:14.781 --> 00:55:17.211 and another comma, and a dot at the end. 00:55:17.400 --> 00:55:22.915 The Taalunie is written with capitals T and U, et cetera, et cetera. 00:55:23.020 --> 00:55:25.852 So this is a very nice conversation. 00:55:25.940 --> 00:55:28.151 Well, and even Kaldi was doing it quite well. 00:55:28.802 --> 00:55:32.954 But the recognition in Kaldi is only in lowercase. 00:55:33.521 --> 00:55:38.274 And this one, it knows how to write the Taalunie, it knows how to write 00:55:38.360 --> 00:55:39.989 the Dutch, and things like that. 00:55:40.501 --> 00:55:45.334 So, I mean, this is more or less it, and you don't need to repair this. 00:55:45.921 --> 00:55:53.436 So, coming back to your question on Sound and Vision: even Sound and Vision will move to Whisper as soon as possible. 00:55:54.722 --> 00:56:04.557 I mean, Roeland and I, we are both working at the University of Twente, and we know this, and they will start as quickly as possible. 00:56:05.162 --> 00:56:09.374 I mean, we need a couple of weeks, months to make it; 00:56:09.962 --> 00:56:14.895 we need to build it. And yeah, it is Taalunie, that's right, it's written 00:56:14.940 --> 00:56:22.796 differently, but anyway. And, I mean, it's now March, so I believe 00:56:22.860 --> 00:56:29.876 that before the summer Whisper will be the default recognizer for Dutch that we 00:56:30.000 --> 00:56:35.875 offer. So that means that you can do it at Sound and Vision; you can redo all the 00:56:35.940 --> 00:56:39.313 material that you did in the past, or you can say, well, only the new material will be 00:56:39.461 --> 00:56:42.913 done with Whisper, but that's something Sound and Vision has to deal with. 00:56:44.785 --> 00:56:48.735 Is this more or less answering your question? Yeah, it does. Thank you. Okay.
00:56:52.229 --> 00:56:52.852 Other people? 00:57:04.134 --> 00:57:05.596 No questions at all? 00:57:07.260 --> 00:57:07.361 OK. 00:57:09.183 --> 00:57:13.534 Well, either it was completely not understandable, or it was 100% understandable. 00:57:14.843 --> 00:57:16.709 Reinike, may I give the word back to you? 00:57:19.344 --> 00:57:19.946 Yes, you may. 00:57:22.681 --> 00:57:29.055 As Arjan already indicated, we will schedule two follow-up workshops with sufficient interest, 00:57:29.240 --> 00:57:31.469 so from four people on. 00:57:32.480 --> 00:57:40.316 If you're interested, you can send an email to CDH at uu.nl; I also already put it in 00:57:41.124 --> 00:57:41.485 the chat. 00:57:42.680 --> 00:57:50.396 And we will try our best to schedule two workshops that work for everyone, on location in the 00:57:50.520 --> 00:57:51.383 Utrecht city center.