WEBVTT 00:00:01.283 --> 00:00:06.013 Okay. I hope everyone can see my screen. 00:00:08.845 --> 00:00:11.734 Okay, well, good morning. I'm Arjan van Hessen. 00:00:13.203 --> 00:00:16.874 I will tell a little bit more about myself in the next slide. 00:00:17.060 --> 00:00:22.915 But, well, just to inform you, I'm working at the University of Twente, the University of Utrecht, and also at Telecats. 00:00:23.020 --> 00:00:26.612 It's a software company in Enschede, and I'm living in Utrecht. 00:00:27.661 --> 00:00:32.833 And my work for the University of Utrecht is, well, bound up with CLARIAH. 00:00:34.281 --> 00:00:41.154 And, well, so it's not scientific, but it's more, well, organizational stuff. 00:00:42.722 --> 00:00:48.855 In Twente, I'm working on speech technology and, more or less, now also on the understanding of speech. 00:00:49.000 --> 00:00:52.050 But that will be part of this meeting. 00:00:53.000 --> 00:01:06.257 Okay. Yeah, here is an interview from four years ago, I believe, for people who are blind or 00:01:06.400 --> 00:01:14.576 have difficulty seeing well. So they are helped by modern speech 00:01:14.640 --> 00:01:22.316 and language technology. So there was an interview with me for their own radio or, well, podcast, 00:01:22.381 --> 00:01:23.124 or whatever you call it. 00:01:24.521 --> 00:01:28.133 And I got the recording and I thought, 00:01:28.180 --> 00:01:31.132 well, let's see how speech recognition is doing. 00:02:32.209 --> 00:02:37.557 Well, you can see, and I hope that you could also see, the English translation, which was generated 00:02:37.660 --> 00:02:39.889 automatically based on this recognition result. 00:02:40.960 --> 00:02:43.710 And so it's a nearly 100% correct recognition. 00:02:44.600 --> 00:02:51.035 There were two smaller errors. Telecats, I mean, the company is Telecats, so it's 00:02:51.340 --> 00:03:01.377 one word, but it wasn't in the vocabulary at that time.
So it is recognized as Telen 00:03:01.540 --> 00:03:09.356 & Kets. Okay. And there is the muZIEum. The muZIEum is a special museum in Nijmegen, dedicated 00:03:09.440 --> 00:03:15.955 to people who are blind or can't see very well. But for the rest, it is perfect. And 00:03:16.180 --> 00:03:19.972 if you say, well, what is missing in this transcription? 00:03:20.540 --> 00:03:24.673 It is the periods and the commas and possibly a question mark. 00:03:25.280 --> 00:03:31.893 So apart from that, it was already a perfect recognition result. 00:03:33.700 --> 00:03:36.752 This was done with the Kaldi recognizer, 00:03:36.860 --> 00:03:40.232 available for all researchers at the University of Nijmegen. 00:03:40.420 --> 00:03:44.994 So you can go there, register yourself, upload the file, 00:03:45.080 --> 00:03:45.925 and you get the results. 00:03:46.480 --> 00:03:48.410 However, there's a new engine. 00:03:48.660 --> 00:03:51.491 We will talk about it today, 00:03:52.221 --> 00:03:55.652 and it is outperforming the Kaldi recognizer. 00:03:56.381 --> 00:03:56.602 Okay. 00:04:03.248 --> 00:04:03.489 Okay. 00:04:04.660 --> 00:04:06.227 So if we are looking at the, 00:04:07.540 --> 00:04:09.430 well, the dissemination of human knowledge 00:04:09.500 --> 00:04:11.911 over the last 3000 years, I mean, so well, 00:04:12.161 --> 00:04:14.030 what has that to do with speech recognition? 00:04:14.080 --> 00:04:19.735 Well, a lot. So eventually, we started, of course, with talking 00:04:19.880 --> 00:04:24.234 to each other. Later, it became text. So it was possible from generation 00:04:24.300 --> 00:04:27.653 to generation to pass your knowledge to the next 00:04:27.760 --> 00:04:31.934 generations. But now we are in the speech mode and text mode 00:04:32.020 --> 00:04:35.773 again. So you can say, well, we started with the Druids, for 00:04:35.840 --> 00:04:40.374 example, who had a very oral history.
And later, in the 00:04:40.400 --> 00:04:44.874 medieval time, or well, before that, to be honest, it was the 00:04:45.000 --> 00:04:50.795 written context: information was written down in books. And in 00:04:51.060 --> 00:04:55.774 the 14th century, it became printed. So it was in, well, 00:04:55.860 --> 00:05:03.956 textual form. And in the late, no, the early 20th 00:05:04.240 --> 00:05:12.516 century, recording devices became available. And the 00:05:12.640 --> 00:05:18.255 first, well, oral history recordings, you may say, you can see 00:05:18.300 --> 00:05:22.832 here; it is in America in, I believe, 1916 or something like 00:05:23.640 --> 00:05:31.115 that. And so you can say: from oral, to written versions, to, 00:05:31.420 --> 00:05:36.793 nowadays, the podcasts, the videos, and all that stuff. 00:05:37.981 --> 00:05:41.912 And we are back in an oral environment, you may say. 00:05:43.101 --> 00:05:45.752 And for example, you can see here the Netherlands Institute 00:05:45.620 --> 00:05:47.086 for Sound and Vision in Hilversum. 00:05:48.381 --> 00:05:49.407 It has a museum. 00:05:49.640 --> 00:05:50.847 It has working offices. 00:05:51.581 --> 00:05:53.829 And underground, so 30 meters deep, 00:05:55.181 --> 00:05:57.851 is the archive, where they have, more or less, 00:05:58.060 --> 00:06:00.128 at this time, 800,000 hours of audiovisual material. 00:06:01.700 --> 00:06:04.612 However, they don't know what, I mean, for part of it, 00:06:04.660 --> 00:06:08.233 they know what it is about, but for huge parts, 00:06:08.340 --> 00:06:10.827 they know it's a broadcast from 1937, 00:06:15.165 --> 00:06:18.974 but what it is about, what is said, is unknown. 00:06:19.902 --> 00:06:23.373 At the same time, we see at the universities 00:06:23.620 --> 00:06:30.414 that there is an increasing use of videos, especially since COVID, but also before. 00:06:31.380 --> 00:06:38.954 We have more and more teachers who are giving a video presentation.
00:06:40.320 --> 00:06:45.334 If it is a video for educational stuff, it has to be stored in the archives for seven 00:06:45.580 --> 00:06:49.773 years, but we don't know what it is about. 00:06:50.000 --> 00:06:54.214 I mean, it is an interview, or it's a lecture from Professor 00:06:54.320 --> 00:06:56.811 Janssen, and he is talking about biology, 00:06:56.940 --> 00:06:58.145 but that's all we know. 00:07:00.420 --> 00:07:05.094 So we are missing, we do have some metadata, 00:07:05.280 --> 00:07:06.908 but it is partly missing. 00:07:07.500 --> 00:07:10.830 And the transcriptions normally are not available. 00:07:12.561 --> 00:07:14.490 And so that means that if we want 00:07:14.680 --> 00:07:17.410 to know what the lectures are about, 00:07:18.080 --> 00:07:24.534 we need to listen and also to look at it, and that can be nice, but I mean, with 150,000 hours, 00:07:25.820 --> 00:07:37.037 it's not doable to give a good overview of what is available. Okay, it's not news that we are 00:07:37.180 --> 00:07:41.773 now, well, you can say, in the decade, or the era, of artificial intelligence. 00:07:42.220 --> 00:07:47.434 It is quite popular in the newspapers, and we see each half year there are new developments, 00:07:47.840 --> 00:07:54.856 and the latest was ChatGPT, and we will talk about that later. But we may say that artificial 00:07:54.940 --> 00:08:01.255 intelligence is leaving the laboratories and becoming part of our, well, history. And 00:08:02.784 --> 00:08:08.015 that means that we have a question: what is artificial intelligence? Well, you may say 00:08:08.140 --> 00:08:14.876 something magic that gives you the opportunity to use the software for all kinds of smart things. 00:08:15.561 --> 00:08:26.278 But, well, if you look at the past, you see that in the 50s, when it started, the term was coined, and it was more or less rule-based. 00:08:26.181 --> 00:08:30.955 Look at chess players, for example.
And that was considered artificial intelligence at that time. 00:08:32.023 --> 00:08:37.274 And in the 80s, it was machine learning. It became popular in the laboratories. 00:08:38.581 --> 00:08:50.058 And nowadays we have machine learning with deep neural networks, and the deep neural networks are a kind of copy of the way we are, well, thinking. 00:08:50.782 --> 00:08:55.635 So you may say it is coming closer to the way people think. 00:08:56.381 --> 00:09:05.917 And if you can, if you're successful in making software that is copying at least part of our behavior, or part of our, yeah, 00:09:06.882 --> 00:09:16.455 you may call it artificial intelligence. But you see it is also a changing vision from humanity about artificial intelligence. 00:09:18.302 --> 00:09:20.750 And if you say, well, what is AI? 00:09:21.520 --> 00:09:25.773 Of course, we have the famous games. 00:09:26.100 --> 00:09:29.731 It was first with chess in the 90s, I believe, Deep Blue. 00:09:30.780 --> 00:09:35.413 And in 2016, so seven years ago, 00:09:36.360 --> 00:09:42.013 it was Google who started with the game Go. 00:09:43.140 --> 00:09:44.827 And they developed DeepMind. 00:09:46.160 --> 00:09:51.374 And to train DeepMind, they used all the available games 00:09:51.981 --> 00:09:54.331 that were recorded, and things like that. 00:09:54.420 --> 00:09:56.630 And they tried to train the computer 00:09:57.000 --> 00:09:58.688 with the existing games. 00:09:59.460 --> 00:10:01.789 OK, it became very good. 00:10:02.620 --> 00:10:05.191 However, then there was a clever one at Google who said, 00:10:05.240 --> 00:10:08.531 well, if we have two versions of DeepMind, 00:10:09.160 --> 00:10:13.173 we can make them play against each other. 00:10:13.480 --> 00:10:14.987 And then one of them will win. 00:10:15.900 --> 00:10:22.495 And that game is considered a good starting point for the next training, and so on and so on.
00:10:23.121 --> 00:10:30.795 So they started, for three weeks, I believe, and it was continuous playing, updating the results, and then playing again, etc. 00:10:31.940 --> 00:10:37.533 And in the end, I mean, DeepMind is so good that it is absolutely the world-class winner. 00:10:39.181 --> 00:10:46.594 However, I'm not talking about artificial intelligence as such, but more about the language-dependent 00:10:47.920 --> 00:10:48.563 artificial intelligence. 00:10:51.803 --> 00:10:59.333 In 2011, I believe, it was Watson, or IBM with Watson, who started to join the game. 00:11:02.480 --> 00:11:08.235 It's a kind of quiz, the inverse of the European version: instead of getting a question and 00:11:08.300 --> 00:11:13.113 having to give the answer, the answer is given and you have to invent the question. 00:11:13.960 --> 00:11:23.255 But that's it. And they trained the computer, and then there was an official presentation. 00:11:24.020 --> 00:11:30.635 Ken was the world champion of that year. Brad was the guy who won the most money in 00:11:30.680 --> 00:11:37.995 the year before, and they both played against Watson. And in the end, Watson absolutely overwhelmed 00:11:38.160 --> 00:11:41.110 the other two. I mean, I believe that Ken had $12,000, 00:11:42.280 --> 00:11:43.484 Brad had $14,000, and Watson had 40 or $50,000. 00:11:47.722 --> 00:11:50.171 So it was absolutely an easy game. 00:11:51.603 --> 00:11:53.510 How was the system developed? 00:11:54.520 --> 00:11:56.489 They used, of course, all the American, 00:11:57.120 --> 00:12:00.292 it was only in American English, text available. 00:12:00.420 --> 00:12:03.571 So the Wikipedia, books, newspaper articles, 00:12:04.100 --> 00:12:06.831 internet stuff, et cetera, et cetera; 00:12:07.320 --> 00:12:09.170 that was given to the computer for training. 00:12:09.220 --> 00:12:13.852 The computer learned to argue about the results. 00:12:14.780 --> 00:12:17.949 And then, in the end, it beat them, and that was the game.
00:12:19.540 --> 00:12:19.641 OK. 00:12:20.962 --> 00:12:24.013 And then in 2017, so five years, six years 00:12:24.220 --> 00:12:29.954 after the previous one, we got the transformer models. 00:12:30.720 --> 00:12:33.691 And if you ask me exactly what a transformer model is, 00:12:34.340 --> 00:12:34.904 I'm not 100% sure. 00:12:35.780 --> 00:12:39.030 I'm studying it at the moment, but it's also quite new for me. 00:12:40.720 --> 00:12:43.552 But here you have some nice views. 00:12:43.620 --> 00:12:45.428 You will get the presentation afterward. 00:12:46.201 --> 00:12:51.034 And it is a neural network, of course, that learns context, 00:12:51.200 --> 00:12:53.310 and thus meaning, by tracking relationships 00:12:53.460 --> 00:12:55.767 in sequential data, like the words in a sentence. 00:12:57.820 --> 00:13:00.231 Transformer models apply an evolving set 00:13:00.320 --> 00:13:03.212 of mathematical techniques, called attention or self- 00:13:03.300 --> 00:13:08.454 attention, to detect subtle ways in which even distant data elements 00:13:08.520 --> 00:13:10.569 in a series influence and depend on each other. 00:13:11.140 --> 00:13:15.413 OK, well, that's a more or less nice definition 00:13:15.620 --> 00:13:16.746 of the transformer model. 00:13:17.660 --> 00:13:21.589 And it started in 2017 with a paper from Google. 00:13:24.601 --> 00:13:27.090 And it was absolutely new. 00:13:28.241 --> 00:13:32.213 And then Stanford researchers 00:13:32.580 --> 00:13:35.933 called them foundation models instead of transformer models. 00:13:35.960 --> 00:13:39.032 So, well, it's a different terminology 00:13:39.120 --> 00:13:40.165 for the same technique. 00:13:41.540 --> 00:13:46.814 But it turns out that these transformer models are very, 00:13:46.960 --> 00:13:52.192 very strong in the current, well, AI revolution. 00:13:53.940 --> 00:13:58.934 OK, transformers can translate text. 00:13:59.060 --> 00:13:59.806 So what can you do?
00:13:59.940 --> 00:14:01.850 You can give it a Dutch text and say, 00:14:01.900 --> 00:14:03.126 well, give me the English version. 00:14:04.161 --> 00:14:07.533 And speech, also possible. 00:14:07.640 --> 00:14:10.050 So for people who are hearing impaired, 00:14:10.500 --> 00:14:12.046 it can be very useful. 00:14:13.480 --> 00:14:15.650 They can detect trends and anomalies 00:14:15.760 --> 00:14:17.529 to prevent fraud and things like that. 00:14:17.660 --> 00:14:21.992 So it is important for health care and banking and that sort of thing. 00:14:22.961 --> 00:14:25.571 And here is a nice overview. 00:14:25.740 --> 00:14:27.810 You have the data, the foundation, 00:14:27.960 --> 00:14:29.167 or the transformer models. 00:14:29.640 --> 00:14:32.949 And here you have a couple of, well, resulting tools. 00:14:34.980 --> 00:14:39.894 And I suppose that most of you will know this video, 00:14:40.100 --> 00:14:41.167 but, well, we can see. 00:15:00.885 --> 00:15:04.093 OK, well, it isn't, I mean, you can find it on the Internet. 00:15:05.120 --> 00:15:08.252 It is an astonishing conversation 00:15:08.660 --> 00:15:11.711 where the computer is calling a hairdresser or hair salon 00:15:12.421 --> 00:15:15.130 and makes an appointment for, well, her boss, 00:15:16.000 --> 00:15:17.709 who's not involved at the moment. 00:15:18.260 --> 00:15:22.614 And especially this human-like hesitation and this humming 00:15:22.700 --> 00:15:25.211 and things like that is, well, it 00:15:25.320 --> 00:15:27.569 makes it very natural-sounding. 00:15:28.661 --> 00:15:29.706 And it is, yeah. 00:15:30.641 --> 00:15:30.761 OK.
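The "attention" quoted in the definition a moment ago can be illustrated with a toy example. This is a minimal sketch only, under simplifying assumptions: a real transformer uses learned query/key/value projection matrices and multiple attention heads, which are omitted here, and the token vectors are invented for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax: the resulting weights sum to 1.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(seq):
    """Toy scaled dot-product self-attention over a list of token vectors.

    For simplicity the queries, keys, and values are the token vectors
    themselves (no learned projections, as a real transformer would have).
    """
    d = len(seq[0])
    out = []
    for q in seq:
        # Similarity of this token to every token in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in seq]
        weights = softmax(scores)
        # Each output mixes *all* positions, which is how even distant
        # elements influence and depend on each other.
        out.append([sum(w * v[i] for w, v in zip(weights, seq))
                    for i in range(d)])
    return out

# Three invented 2-dimensional "word" vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = self_attention(tokens)
```

Each output vector is a weighted average of the whole sequence, with the weights determined by dot-product similarity; stacking such layers (plus the learned projections) is what the transformer papers build on.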
00:15:33.100 --> 00:15:35.989 Then, more or less at the same time, 2017 or '18, 00:15:37.860 --> 00:15:45.956 OpenAI started, and, well, as the name promises, they develop AI, and it will be open, well, partly 00:15:46.000 --> 00:15:51.555 at least. And is it an evolution or a revolution? Well, that's a discussion, but 00:15:54.283 --> 00:16:04.476 they started, well, they made a very impactful entrance in December 2022, so last year, 00:16:07.062 --> 00:16:12.953 with the introduction of ChatGPT, and, well, I know that most of you will know ChatGPT. 00:16:14.901 --> 00:16:21.456 And yesterday I asked the computer to explain quantum computing in simple terms, so that 00:16:21.500 --> 00:16:26.634 was just a question from me, and then you get a nice overview in which, well, more or less, 00:16:27.060 --> 00:16:31.690 quantum computing is explained, at least to me. 00:16:34.241 --> 00:16:37.291 And even, well, last week, I believe, 00:16:38.020 --> 00:16:42.854 they updated their language model, 00:16:42.980 --> 00:16:45.288 you may say, to GPT-4. 00:16:46.600 --> 00:16:51.071 So ChatGPT 3 and 3.5 stopped in 2021, 00:16:52.920 --> 00:16:55.027 because they used the data from before 2021. 00:16:57.221 --> 00:17:00.773 And that means that all current events, the war in Ukraine 00:17:00.900 --> 00:17:05.492 or things like that, you couldn't ask ChatGPT about. 00:17:06.481 --> 00:17:09.690 And now they have updated it to version 4, 00:17:10.980 --> 00:17:11.623 GPT-4. 00:17:13.401 --> 00:17:16.453 And if you have a paid account, then you 00:17:16.540 --> 00:17:18.289 can already access it. 00:17:18.440 --> 00:17:20.931 And I suppose it will become available 00:17:21.000 --> 00:17:22.889 for the free version as well. 00:17:23.380 --> 00:17:26.031 But, well, it is, again, better. 00:17:26.481 --> 00:17:28.871 And you can give images to it. 00:17:29.080 --> 00:17:31.430 It can do a kind of humor.
00:17:32.421 --> 00:17:36.230 And it is absolutely amazing what is possible with ChatGPT. 00:17:40.461 --> 00:17:44.189 Then in 2022, in September, end of September, 00:17:47.660 --> 00:17:50.890 they silently introduced Whisper. 00:17:52.181 --> 00:17:54.530 And that is a speech recognition engine, 00:17:55.702 --> 00:17:58.550 more or less also based on the transformer models. 00:17:59.880 --> 00:18:03.993 So what is Whisper? It is an automatic speech recognition 00:18:04.640 --> 00:18:09.575 engine trained on nearly 700,000 hours of multilingual and 00:18:09.602 --> 00:18:11.314 multitask data. 00:18:11.124 --> 00:18:13.636 And to give you an idea, 00:18:12.761 --> 00:18:15.047 700,000 hours is more speech than you and I will hear in 00:18:18.402 --> 00:18:21.792 our lives. So it's an enormous amount of data. 00:18:23.101 --> 00:18:30.196 And, well, they used, 60% I believe was English, and 40% were other languages. 00:18:32.804 --> 00:18:39.396 And they showed that the use of such a large and diverse data set leads to an improved robustness 00:18:39.640 --> 00:18:43.293 to accents, background noise, and also technical language. 00:18:44.884 --> 00:18:49.175 Moreover, it enables transcription in multiple languages. So it is one model, 00:18:50.183 --> 00:18:54.615 and you can give it a Dutch recording or a Chinese one or an Italian one, and 00:18:54.680 --> 00:19:00.534 it will give you the transcription in that language. 00:19:01.320 --> 00:19:03.631 And you can also translate it to English. 00:19:03.760 --> 00:19:08.813 So if you hear an interesting Chinese conversation and you don't know what it is about, you can 00:19:09.300 --> 00:19:13.774 give it to Whisper, make the transcription and then the translation into English, and you 00:19:14.000 --> 00:19:15.465 can see what it is about. 00:19:18.300 --> 00:19:25.275 And what is very nice of OpenAI is that this time it was a really open-source model.
00:19:25.460 --> 00:19:30.775 So they developed seven models, I believe, or eight or nine models, and you can download 00:19:30.800 --> 00:19:33.268 them from their site and use them. 00:19:35.180 --> 00:19:44.655 OK, if we look at speech recognition over the past two decades, I mean longer: in the 70s it 00:19:45.680 --> 00:19:51.554 started slowly, slowly, and it was more or less based on the Fourier transform. Then around 2000, 00:19:52.220 --> 00:19:54.925 well, 1995, '96, there were the first initiatives with the HMMs. And then in 2010, 00:20:01.521 --> 00:20:04.933 this is the paper from Microsoft at the Interspeech conference in Florence, 00:20:05.661 --> 00:20:12.270 they started with the deep neural networks. And, well, in 2019, I believe, it was the first time 00:20:17.801 --> 00:20:26.497 that, for correctly recorded American English conversations, it was at the level of human 00:20:26.600 --> 00:20:32.415 accuracy, outperformed it a little bit, but let's say it was more or less at human accuracy. 00:20:32.802 --> 00:20:33.907 And that's quite recent. 00:20:35.120 --> 00:20:42.874 And this is, well, something I made myself. 00:20:44.060 --> 00:20:47.673 But I believe that with the coming of the transformer 00:20:47.740 --> 00:20:51.693 models, it will increase a little bit. 00:20:52.000 --> 00:20:54.188 I mean, yeah, more than 100% is not possible. 00:20:55.400 --> 00:20:58.111 But we will see that in the coming years 00:20:58.240 --> 00:21:01.352 it will reach 00:21:01.400 --> 00:21:08.695 this human accuracy for other languages than American English, and it will also increase further. 00:21:08.940 --> 00:21:19.157 So in the end it will outperform us humans in correct recognition. And that said, you have 00:21:19.180 --> 00:21:26.115 to remember that this is for correctly, well-recorded audio.
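The accuracy being compared with humans here is conventionally measured as word error rate (WER): substitutions, deletions, and insertions divided by the number of reference words, computed with a word-level Levenshtein alignment. A minimal sketch (the example sentences are invented for illustration):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    via the classic Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One wrong word out of ten reference words gives a WER of 0.1,
# i.e. the "one error in ten words" level mentioned in the talk.
ref = "i am going by train from utrecht to zwolle today"
hyp = "i am going by train from utrecht to zwelle today"
wer = word_error_rate(ref, hyp)
```

"Human accuracy" claims for ASR are usually statements that the system's WER on a benchmark matches the WER of professional human transcribers on the same audio.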
00:21:26.740 --> 00:21:30.854 If you have a conversation where people talk over each other, or there is a train passing 00:21:30.900 --> 00:21:37.075 by, or other background noises, it will be different. However, it is performing very 00:21:37.140 --> 00:21:43.635 well. Whether it is outperforming humans is not sure at the moment. But for nice, cleanly 00:21:43.880 --> 00:21:49.112 recorded audio, it will absolutely beat us in the coming years. 00:21:51.020 --> 00:21:57.836 OK, however, speech recognition is not perfect. But even if it is perfect, I mean, what is 00:21:57.900 --> 00:22:11.476 going wrong? Well, it is still the case that we use a fixed list of words, our vocabulary, 00:22:12.600 --> 00:22:21.073 and in the Kaldi, so the current, old version of the ASR, we can recognize 260,000 different 00:22:24.141 --> 00:22:30.615 words, but there are always words that are not in that list, so they cannot be recognized. 00:22:31.180 --> 00:22:38.056 An example was the marsupilami. With the current version of Whisper, I do believe that marsupilami 00:22:38.160 --> 00:22:46.656 will be recognized, but I have to test it. And then there is the use of the language 00:22:46.741 --> 00:22:51.414 model: you can say something that is between Zwelle and Zwolle, and you mean Zwelle. I'm 00:22:51.500 --> 00:22:59.876 going by train from Utrecht to Zwelle. And we saw that Zwelle is not that popular and Zwolle is 00:23:00.080 --> 00:23:08.296 very popular, so the language model is replacing the recognized Zwelle by Zwolle, because it's more 00:23:08.540 --> 00:23:13.214 likely that people will say Zwolle, and, more or less, the train station in Zwelle doesn't exist 00:23:13.320 --> 00:23:20.696 anymore. So it makes sense, but it is not faithful to what was recognized, and it will be a discussion 00:23:20.760 --> 00:23:30.077 whether you want that or not. But okay. And what we do see is that people do say other words 00:23:31.362 --> 00:23:40.977 than they want to say. Well, I'll give an example as well.
And if you listen to it, you absolutely hear it, 00:23:42.563 --> 00:23:48.816 and it is just a mistake by the speaker. Yeah. Okay. That can be the case. And with our human brains, 00:23:48.900 --> 00:23:54.194 we can say, no, but we know what he means, because that word has nothing to do with this conversation. 00:23:54.660 --> 00:24:00.015 But that is something else: that is the interpretation of the recognition, and not the recognition itself. 00:24:02.163 --> 00:24:11.037 Okay, and speech AI is used for, well, this is an example from some five years ago, 00:24:12.202 --> 00:24:20.296 with the old recognition engines, and it is an overview of the, well, the, oh, yeah, it 00:24:20.420 --> 00:24:28.296 is an overview of, well, what we believe is the normal workflow for researchers. You do have 00:24:28.340 --> 00:24:34.253 an interview. It can be an analog interview, or nowadays you will have a digital interview. 00:24:35.480 --> 00:24:38.790 The analog one needs to be digitized, and you need storage. 00:24:40.100 --> 00:24:44.493 OK, we have stored the data, and then we need the transcription, 00:24:45.280 --> 00:24:48.771 and you can do it by yourself, listen to it and type it out. 00:24:50.141 --> 00:24:53.833 Or you can use, at least initially, automatic speech 00:24:53.920 --> 00:24:56.230 recognition, and then you get the timed text, 00:24:56.620 --> 00:24:57.887 and the only thing you have to do 00:24:58.140 --> 00:25:01.172 is check whether it was recognized well, 00:25:01.340 --> 00:25:03.791 find the errors and replace them, 00:25:03.920 --> 00:25:14.697 things like that. And this of course was in my field always a popular item, because, well, 00:25:14.720 --> 00:25:23.116 we really believe that speech recognition will help you. However, with an error rate where one out of 00:25:24.483 --> 00:25:30.835 ten words is wrong, it sometimes was too much work to check and improve the 00:25:31.280 --> 00:25:36.495 recognition results. Instead of doing that, you'd just transcribe it yourself.
00:25:37.642 --> 00:25:44.616 However, with the current version of Whisper, I absolutely believe that it is very useful to first 00:25:46.363 --> 00:25:53.677 run the speech recognizer and then check the results, because it absolutely improves your 00:25:53.740 --> 00:26:04.637 speed. So this is a little bit of old stuff. And, yeah, okay, and then you of course have 00:26:04.700 --> 00:26:09.074 to add your metadata, and then you have a searchable, transcribed audiovisual document, 00:26:09.140 --> 00:26:14.294 and you can say, well, give me the interviews where they talk about hunger, and then you get 00:26:14.841 --> 00:26:17.028 all the interviews where hunger is spoken. 00:26:19.140 --> 00:26:24.132 OK, this is more or less 10 years ago. 00:26:26.021 --> 00:26:27.807 I'm a little bit less gray there. 00:26:29.520 --> 00:26:32.710 And it was a question from the Dutch courts. 00:26:34.141 --> 00:26:35.488 And they said, well, at the moment 00:26:35.620 --> 00:26:40.273 it's not allowed to make recordings in the courtroom. 00:26:40.640 --> 00:26:44.553 However, we are interested to see if it works well 00:26:44.921 --> 00:26:46.830 and to see what the possibilities are. 00:26:46.920 --> 00:26:49.411 So they asked us, and we developed it. 00:26:49.721 --> 00:26:51.710 And it was quite a success. 00:26:51.780 --> 00:26:52.505 There's a video. 00:26:52.660 --> 00:26:55.552 If you look for Rechtspraakherkenning on YouTube, 00:26:55.620 --> 00:26:56.566 you will find this video. 00:26:56.720 --> 00:26:59.972 And it will explain a little bit, well, the performance 00:27:00.120 --> 00:27:01.808 and how it was working.
00:27:02.740 --> 00:27:05.150 A couple of years later, so five years ago, 00:27:05.460 --> 00:27:07.430 the field came with the same question 00:27:07.560 --> 00:27:12.014 and said, well, our research, our researchers, 00:27:12.140 --> 00:27:15.132 our people are recording more and more interviews, 00:27:15.260 --> 00:27:17.590 but once they record it, they need to give it 00:27:18.200 --> 00:27:20.068 to the courtroom with a transcription. 00:27:21.120 --> 00:27:23.070 So can speech recognition help us 00:27:23.260 --> 00:27:24.587 in speeding up this process? 00:27:25.681 --> 00:27:26.285 And yeah. 00:27:26.980 --> 00:27:28.006 Oh, no. 00:27:31.067 --> 00:27:33.374 And hey. 00:27:37.184 --> 00:27:48.738 OK, OK, so we did the recordings, we simulated them, and then we started a test where we compared 00:27:48.940 --> 00:27:56.356 the classic way, that's the blue line, with the new one, and that's the red one. And what 00:27:56.420 --> 00:28:02.375 you see here is that, I mean, if you have an interview with someone and you need to make 00:28:02.460 --> 00:28:10.415 some notes, it takes time, and sometimes, well, you need some pauses: here, here, here, 00:28:11.260 --> 00:28:20.477 and here, and here. You can see that the time has increased, but they are not 00:28:20.560 --> 00:28:26.753 speaking. So that means they need some time to make their notes. And then they say, OK, go 00:28:28.000 --> 00:28:35.836 on. However, this decreases the quality of the interview, because once someone starts talking, 00:28:35.860 --> 00:28:41.414 you want them to continue talking, and you don't want to interfere with, well, stop a moment, I 00:28:41.880 --> 00:28:51.155 need to make some notes. So in the end you see that in the same time a lot more words, so 30 to 00:28:53.102 --> 00:28:59.496 50% more words, were spoken, so we have more material in the same time, and this was quite 00:28:59.580 --> 00:29:07.816 convincing for the field to start using speech recognition.
We did a test for the NIOD, 00:29:08.060 --> 00:29:17.096 the witness stories, getuigenverhalen.nl. They have 600,000 hours of oral history about World War II, 00:29:18.222 --> 00:29:23.033 and, well, I mean, no one is going to listen to 600,000 hours of interviews, 00:29:24.220 --> 00:29:27.893 but you want the interviews that talk about, well, some particular topics you 00:29:27.960 --> 00:29:35.516 are interested in, so you can search in the spoken content. And we did a 00:29:35.540 --> 00:29:42.636 project for the foreign ministry, the Ministry of Foreign Affairs, and it was in 00:29:42.800 --> 00:29:49.334 Croatia and Bosnia, so Croatian memories and Bosnian memories, and we recorded 700 interviews. 00:29:50.200 --> 00:29:54.471 However, at that time, the speech recognition was not working for those languages. 00:29:55.780 --> 00:30:01.493 And so this was hand-transcribed, but the translation into English was done automatically. 00:30:03.301 --> 00:30:09.916 CLARIN, the national, or well, the European infrastructure for language and speech technology, 00:30:12.264 --> 00:30:18.796 supported this, and they said, well, can you start an OH portal? And it's now a transcription portal 00:30:19.280 --> 00:30:28.356 at the University of Munich, and if you have an account you can go there and upload your files, 00:30:28.580 --> 00:30:31.190 select the language, and download the results. 00:30:32.040 --> 00:30:35.693 This will probably be replaced by Whisper in the coming months, 00:30:35.880 --> 00:30:39.071 but at the moment it is the old version. 00:30:39.940 --> 00:30:41.528 There is the Google version. 00:30:41.920 --> 00:30:43.709 And here the Dutch version. 00:30:44.180 --> 00:30:47.692 Well, those are the languages supported at the moment. 00:30:49.704 --> 00:30:55.017 And okay, here are some shots from projects we did. 00:30:54.982 --> 00:30:56.951 This was the forced alignment. 00:30:56.980 --> 00:31:02.614 So we got the text from the Second Chamber, the Tweede Kamer.
00:31:03.380 --> 00:31:07.193 They are required to provide a correct transcription. 00:31:07.260 --> 00:31:07.985 They gave it to us. 00:31:08.120 --> 00:31:09.749 And what we did was the forced alignment. 00:31:10.722 --> 00:31:14.191 So the subtitles are automatically generated. 00:31:16.221 --> 00:31:20.154 And then the Flemish government was quite enthusiastic 00:31:20.220 --> 00:31:21.488 about our Dutch effort. 00:31:21.700 --> 00:31:23.608 And they asked us to do the same for them. 00:31:24.460 --> 00:31:35.097 We did it, and the result was a collaboration between the Flemish universities and the Dutch universities, and what they also built was a speaker recognition engine. 00:31:35.200 --> 00:31:38.430 So we know exactly who is speaking at each moment. 00:31:40.281 --> 00:31:42.728 And here are the partners. 00:31:44.660 --> 00:31:45.946 Here are the results. 00:31:47.540 --> 00:31:53.695 However, it is a verbatim transcription, so it is a, more or less, 100% correct transcription. 00:31:54.140 --> 00:31:57.352 And it turned out that it's too much for them. 00:31:57.440 --> 00:31:58.707 So they want more of a summary. 00:31:59.220 --> 00:32:00.906 And that's not possible at the moment. 00:32:03.460 --> 00:32:04.766 And here you see, again, the 150,000 hours. 00:32:06.000 --> 00:32:08.110 So that will be more by now, I suppose. 00:32:08.200 --> 00:32:10.206 But a couple of years ago, it was 150,000 hours at SURF. 00:32:12.921 --> 00:32:15.612 And SURF is also experimenting together with us 00:32:15.740 --> 00:32:21.374 to see if they can recognize all their material. 00:32:21.620 --> 00:32:23.525 It is more or less 50% English, 50% Dutch. 00:32:27.160 --> 00:32:31.433 OK, then a project from Utrecht University. 00:32:32.080 --> 00:32:37.595 Can you use it with patients, for the care report? 00:32:37.700 --> 00:32:41.813 So you're going to your GP, and you have some questions. 00:32:42.000 --> 00:32:43.368 And normally, they are typing a lot.
00:32:43.700 --> 00:32:46.311 Can you replace the typing with speech recognition? 00:32:46.440 --> 00:32:49.230 And then, well, you have more or less an overview. 00:32:50.320 --> 00:32:52.128 And it turned out that it works well. 00:32:52.941 --> 00:32:57.413 And they are starting now some real-life showcases 00:32:57.900 --> 00:32:59.830 where they will show that the recognition will 00:32:59.920 --> 00:33:02.689 help the GPs in their daily work. 00:33:03.921 --> 00:33:05.508 And of course, at the University of Twente, 00:33:05.880 --> 00:33:07.930 we are focusing on robots: the interaction 00:33:08.020 --> 00:33:10.650 of children and elderly people with robots. 00:33:11.501 --> 00:33:13.530 And we are using the speech recognition 00:33:13.620 --> 00:33:14.926 to understand what they are saying. 00:33:16.081 --> 00:33:19.152 And at the same time, we're also developing some software 00:33:19.180 --> 00:33:25.215 to see how they are saying it. So the emotion in the speech is as important as what they 00:33:25.220 --> 00:33:26.584 are saying. So how and what. 00:33:30.360 --> 00:33:37.615 Then we have people with difficulties. I mean, the brain doesn't always cooperate. 00:33:38.620 --> 00:33:49.077 Here I will show you an example of someone who has Parkinson's. I'm not sure if you can 00:33:49.161 --> 00:33:49.443 hear it. 00:34:06.466 --> 00:34:09.193 So that's, I mean, you can understand it, 00:34:09.700 --> 00:34:11.989 but you have to listen carefully. 00:34:13.061 --> 00:34:15.832 Even the modern speech engines do not always 00:34:16.161 --> 00:34:18.672 recognize these speakers 00:34:18.842 --> 00:34:20.329 as well as we would want. 00:34:20.680 --> 00:34:24.694 Yeah, we couldn't hear it. No? 00:34:25.002 --> 00:34:27.351 Well, you will get the presentation and you can listen to it yourself. 00:34:28.402 --> 00:34:32.735 Yeah, I forgot to arrange the audio as well.
00:34:33.442 --> 00:34:38.395 And this was a project we did two years ago. 00:34:38.902 --> 00:34:39.967 So in the COVID period. 00:34:41.302 --> 00:34:46.455 And well, I can show you, or see if you can hear it. 00:35:02.391 --> 00:35:06.614 Well, you can see the difference between him and also his helper. 00:35:13.060 --> 00:35:15.330 I mean, if his helper speaks, the recognition is easy, 00:35:15.961 --> 00:35:17.709 but he himself is nearly impossible to recognize. 00:35:17.880 --> 00:35:20.450 And what we did, we developed a special 00:35:21.420 --> 00:35:22.947 speech recognition engine for him 00:35:23.740 --> 00:35:28.494 to help him using the engine for work, 00:35:30.005 --> 00:35:31.772 for school, for traveling, 00:35:31.980 --> 00:35:33.709 and to tell something about himself. 00:35:34.381 --> 00:35:37.913 It worked, but I mean, it can be much, much better. 00:35:41.626 --> 00:35:48.157 And this will be, well, I will skip this one, but this is in the English Parliament, and you 00:35:48.260 --> 00:35:53.114 hear so much noise in the background that, okay, you can hear it, but that is very 00:35:53.180 --> 00:35:55.428 difficult even for a modern engine. 00:35:58.223 --> 00:36:03.233 Okay, as said before, emotion is important. 00:36:04.600 --> 00:36:08.112 We know we are now more or less at the level that, with well- 00:36:08.700 --> 00:36:11.550 recorded audio, we are at the level of human recognition. 00:36:12.340 --> 00:36:17.157 However, we need some emotion inside, and the question is, can 00:36:17.204 --> 00:36:20.617 we detect the emotion in the conversation? Because if you 00:36:20.142 --> 00:36:24.576 ask me, do you like football? I can say yes, and that isn't 00:36:24.642 --> 00:36:26.691 convincing. Yes. Or I can say yeah. 00:36:27.583 --> 00:36:29.972 And it is a yes, but I mean no. 00:36:31.102 --> 00:36:36.915 So can you use that kind of emotion detection inside modern conversations? 00:36:37.882 --> 00:36:43.215 And then there's the question, which emotion?
I mean, we have the big five: sadness, anger, fear, joy and neutral. 00:36:43.722 --> 00:36:46.051 I mean, they are more or less universally human. 00:36:46.761 --> 00:36:52.315 And that means that all the people in the world are having these five emotions. 00:36:52.761 --> 00:37:00.916 However, sarcasm and other more subtle emotions are culturally dependent. 00:37:01.701 --> 00:37:09.072 And that means that it depends on the speaker and also on the listener how the emotion is perceived. 00:37:12.360 --> 00:37:19.874 So that will be a difficult question, but well, it is worthwhile working on it. 00:37:21.981 --> 00:37:28.955 Okay, so that brings us, more or less at the end of this talk, to the next step. 00:37:29.820 --> 00:37:32.892 And that is, what do we mean by going from recognition to understanding? 00:37:33.080 --> 00:37:35.708 So can you understand what is said? 00:37:37.701 --> 00:37:46.336 In collaboration with Nijmegen University, Twente University and two companies, we had 00:37:46.400 --> 00:37:53.136 a question from the Dutch police force, and they had a database with 45 hours of verbatim 00:37:53.220 --> 00:37:57.114 transcription. Well, they had the audio; we needed to give them the transcription, 00:37:57.741 --> 00:38:04.495 the part-of-speech tagging and also the emotion annotation. And one of the companies was Pandora 00:38:05.000 --> 00:38:13.836 in Amersfoort, and they made a nice film that is in English, and I tried to figure out what was said. 00:38:15.424 --> 00:38:19.496 And so I'm using the modern engine to do the recognition. 00:38:20.526 --> 00:38:22.072 Let's see if it is working. 00:38:39.474 --> 00:38:45.098 Yeah, that's the wrong version; if I send it to you, you will 00:38:45.120 --> 00:38:49.854 get the English and eventually also the Dutch transcription below it. 00:38:50.883 --> 00:38:55.375 But the recognition is more or less a hundred percent.
00:38:55.480 --> 00:38:59.493 And that's really, I mean, of course it's a good recording, but 00:38:59.721 --> 00:39:02.792 still, that it is so good is astonishing. 00:39:04.228 --> 00:39:04.469 Okay. 00:39:04.460 --> 00:39:08.974 If you look at the future of artificial-intelligence-led speech technology, 00:39:10.002 --> 00:39:18.137 we see the next step, or the step in the coming years, will be from recognizing what was said to understanding what was meant. 00:39:19.563 --> 00:39:25.557 So that will be an absolutely important part of our research. 00:39:25.421 --> 00:39:32.276 And what is someone's emotional state? And so how can you deal with that 00:39:32.060 --> 00:39:38.675 person? And of course we have to figure out how to use the speech recognition for the 00:39:38.780 --> 00:39:45.656 smaller languages, and even Dutch is a small language. I mean, given the Chinese and the 00:39:45.740 --> 00:39:53.056 Indian and the American English population, we are doing well. But I mean, Frisian, or 00:39:53.140 --> 00:39:59.292 some heavy dialects, or languages like Icelandic that are spoken by only 300,000 people. Can we 00:40:01.360 --> 00:40:11.117 do the recognition for those languages as well? And we need to speed it up so you can use 00:40:11.160 --> 00:40:19.776 it in all kinds of real-time situations. And there is the kind of dream that you are talking 00:40:19.800 --> 00:40:23.774 with a Chinese engineer, and he's speaking in Chinese and you're speaking in Dutch; 00:40:24.622 --> 00:40:29.835 his Chinese will be translated automatically to subtitles in Dutch, and my Dutch will 00:40:29.760 --> 00:40:33.553 be automatically translated into Chinese. 00:40:33.640 --> 00:40:36.170 So, I mean, it is more or less possible, 00:40:37.424 --> 00:40:39.010 but we need to speed it up a little bit, 00:40:39.600 --> 00:40:42.470 so we can have those conversations in the coming years.
00:40:43.600 --> 00:40:47.833 And okay, this was more or less the end of, 00:40:48.020 --> 00:40:51.713 well, a smaller version of my presentation 00:40:51.900 --> 00:40:53.909 about AI and speech technology. 00:40:54.600 --> 00:40:56.348 And now we're going back to Whisper. 00:40:57.160 --> 00:41:00.212 And Whisper, as said before, is open source, 00:41:00.280 --> 00:41:02.608 so you can download the models, at least at the moment. 00:41:03.920 --> 00:41:06.150 It works more or less, at least for American English, 00:41:06.260 --> 00:41:08.991 at the human level, so it outperforms it a little bit, 00:41:09.100 --> 00:41:10.225 or not, well, it depends. 00:41:11.501 --> 00:41:14.752 And you can ask yourself, why use Whisper? 00:41:15.040 --> 00:41:18.472 Well, here are some statements, but, 00:41:20.143 --> 00:41:24.934 I mean, the basic answer is it works absolutely 00:41:25.400 --> 00:41:34.616 gorgeously. Yeah, it's from September, so it's five, six months old. And 00:41:36.123 --> 00:41:41.075 here are some slides where it's more or less explained. However, 00:41:41.140 --> 00:41:47.695 you can look it up yourself at the GitHub repository. And they 00:41:47.800 --> 00:41:53.775 say that it more or less outperforms most speech 00:41:53.880 --> 00:41:55.568 recognition engines available. 00:41:56.400 --> 00:41:58.950 However, if you have some dedicated, 00:41:59.620 --> 00:42:03.452 very specific conversations, 00:42:04.141 --> 00:42:07.211 then some other engines are doing better. 00:42:07.660 --> 00:42:10.972 But overall, Whisper is the winner at this moment. 00:42:12.644 --> 00:42:16.634 Okay, installing Whisper, is it difficult? 00:42:16.861 --> 00:42:16.961 No. 00:42:17.660 --> 00:42:21.112 Well, you first need to install Python on your computer, 00:42:21.621 --> 00:42:23.626 and it has to be a version from 3.8 through 3.10. 00:42:26.801 --> 00:42:28.547 I started first with 3.11. 00:42:30.281 --> 00:42:31.588 And then there were some problems.
00:42:31.800 --> 00:42:33.366 So I went back to 3.9. 00:42:35.261 --> 00:42:36.286 And now it works. 00:42:37.540 --> 00:42:39.149 You need to install PyTorch. 00:42:39.420 --> 00:42:42.632 And there's a lot of information on the internet 00:42:42.740 --> 00:42:44.508 on how to do it, or you may have it already. 00:42:45.040 --> 00:42:47.269 You have to install FFmpeg. 00:42:48.280 --> 00:42:52.032 And once you have installed these three basic packages, 00:42:53.463 --> 00:42:57.034 you need to install Whisper from GitHub. 00:42:57.140 --> 00:42:59.930 So the Whisper Git repository, I mean, and that's all. 00:43:01.222 --> 00:43:04.172 It takes a couple of minutes, and then it's installed 00:43:04.220 --> 00:43:06.350 and you have the engine on your computer. 00:43:09.227 --> 00:43:10.893 Using Whisper, there's a good, well, 00:43:11.020 --> 00:43:15.714 help facility for the parameters 00:43:15.780 --> 00:43:17.086 that you can add to it. 00:43:18.140 --> 00:43:19.987 However, the most important is the model. 00:43:21.900 --> 00:43:27.133 And with a fast computer and a GPU, 00:43:27.840 --> 00:43:30.589 the tiny model is 32 times faster than real time. 00:43:32.160 --> 00:43:35.329 So one minute will be done in about two seconds. 00:43:37.060 --> 00:43:38.006 This is the base version. 00:43:38.160 --> 00:43:38.825 It is slower. 00:43:39.421 --> 00:43:43.153 And the large model, well, it has a relative speed of one. 00:43:43.561 --> 00:43:44.908 But I doubt it. 00:43:45.220 --> 00:43:46.648 It is slower, I believe. 00:43:47.040 --> 00:43:49.551 But I mean, these are the models that you can load. 00:43:49.680 --> 00:43:51.526 And of course, the large model is 1.5 gigabytes. 00:43:54.342 --> 00:43:54.946 And that's a lot. 00:43:55.241 --> 00:43:57.931 And the tiny model is only 40 megabytes. 00:43:58.220 --> 00:44:01.331 So normally, I'm using the medium version.
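The relative speeds just mentioned translate directly into processing time. A minimal sketch of that arithmetic, using the illustrative figures from the talk (tiny roughly 32 times real time on a GPU, medium roughly twice, large roughly real time); actual throughput depends entirely on your hardware:

```python
# Rough processing-time estimate per Whisper model, using the relative
# speeds quoted in the talk (illustrative numbers, assuming a GPU machine).
RELATIVE_SPEED = {  # multiples of real time
    "tiny": 32.0,
    "medium": 2.0,
    "large": 1.0,
}

def processing_minutes(audio_minutes: float, model: str) -> float:
    """How long a recording of `audio_minutes` takes to transcribe."""
    return audio_minutes / RELATIVE_SPEED[model]

print(round(processing_minutes(1, "tiny") * 60, 2))  # seconds for 1 min of audio: 1.88
print(processing_minutes(60, "medium"))              # 1 hour of audio: 30.0 minutes
```

So a one-hour recording with the medium model takes on the order of half an hour, matching the rule of thumb below.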
00:44:02.160 --> 00:44:05.973 And because I have a lot of recordings in English, 00:44:06.080 --> 00:44:08.530 but also in Dutch, in Italian, and other languages, 00:44:09.040 --> 00:44:12.711 I'm not using medium.en, but medium as such. 00:44:13.840 --> 00:44:22.876 It is five gigabytes, and, well, you need that on your computer, and it runs at twice 00:44:23.041 --> 00:44:28.215 real time. So one hour takes you half an hour, a little bit more, also depending 00:44:28.300 --> 00:44:36.656 on your computer. But I mean, you just pass the model to your recognition, and if you 00:44:36.800 --> 00:44:40.872 want to do it with English only, because the speakers are native English speakers, 00:44:41.400 --> 00:44:44.590 you type in medium.en. 00:44:45.941 --> 00:44:48.591 And if it's not available, it will be downloaded. 00:44:48.740 --> 00:44:50.088 It takes some minutes, and then you 00:44:50.220 --> 00:44:53.670 have the medium.en model on your computer as well. 00:44:55.341 --> 00:44:56.546 And you can do what you want. 00:44:58.460 --> 00:44:59.647 Then there are some tools. 00:45:00.400 --> 00:45:07.775 I'm particularly fond of WhisperX, because in the results of Whisper 00:45:08.341 --> 00:45:10.831 they are taking frames of 30 seconds 00:45:10.960 --> 00:45:12.949 and giving the recognition results per frame. 00:45:13.641 --> 00:45:15.610 Those 30 seconds are sometimes, for subtitles 00:45:15.800 --> 00:45:19.031 or for all kinds of other research, too coarse. 00:45:19.980 --> 00:45:22.531 And with WhisperX, you get a word-level, 00:45:22.860 --> 00:45:26.532 more or less accurate recognition. 00:45:26.940 --> 00:45:28.228 So they do the recognition 00:45:28.340 --> 00:45:30.770 and then a kind of forced alignment based on it. 00:45:31.481 --> 00:45:34.993 And you get subtitles with per-word timeframes. 00:45:36.304 --> 00:45:37.890 And it's very useful.
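Once you have timed segments, turning them into subtitles is mostly formatting. A minimal sketch, assuming segment dictionaries with "start", "end" and "text" keys, which is the shape Whisper's Python API returns in result["segments"]; WhisperX produces the same shape with much finer, word-level timings:

```python
# Turn Whisper-style segments into WebVTT cues (a sketch; the demo
# segments below are made up for illustration).

def vtt_timestamp(seconds: float) -> str:
    """Format seconds as an HH:MM:SS.mmm WebVTT timestamp."""
    ms = round(seconds * 1000)
    h, rest = divmod(ms, 3_600_000)
    m, rest = divmod(rest, 60_000)
    s, ms = divmod(rest, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def to_vtt(segments) -> str:
    lines = ["WEBVTT", ""]
    for seg in segments:
        lines.append(f"{vtt_timestamp(seg['start'])} --> {vtt_timestamp(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")
    return "\n".join(lines)

demo = [{"start": 0.0, "end": 2.5, "text": " Goedemorgen."},
        {"start": 2.5, "end": 30.0, "text": " Ik ben Arjan van Hessen."}]
print(to_vtt(demo))
```

With WhisperX output, each "segment" can be a single word, which is exactly what gives you the per-word subtitle timing mentioned above.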
00:45:38.640 --> 00:45:42.573 For the Macintosh, of course, you can use it with Python 00:45:43.201 --> 00:45:45.309 on your computer, but there's also MacWhisper. 00:45:46.520 --> 00:45:48.750 I believe it's a Dutch guy who developed it, 00:45:48.900 --> 00:45:52.451 but it is based on the C version of the Whisper engine. 00:45:53.581 --> 00:45:59.313 And the free version can use the small and the tiny models. 00:46:00.520 --> 00:46:02.349 But if you want medium or large, you 00:46:02.460 --> 00:46:04.047 have to pay 15 euros, one time. 00:46:04.980 --> 00:46:06.387 So it's not that much. 00:46:07.401 --> 00:46:09.310 And well, here's the interview. 00:46:09.440 --> 00:46:10.547 You have just a screen. 00:46:10.880 --> 00:46:13.892 You drop in the video or the audio 00:46:13.980 --> 00:46:15.066 you want to have recognized. 00:46:15.581 --> 00:46:18.711 And it starts to produce the recognized text. 00:46:19.481 --> 00:46:20.547 And here you see the results. 00:46:20.620 --> 00:46:24.913 And there are some small things, like changing some words; 00:46:25.220 --> 00:46:27.731 for example, Janssen, that can be written 00:46:27.820 --> 00:46:28.945 in very different ways. 00:46:30.301 --> 00:46:31.568 And you say, no, this is Janssen. 00:46:31.680 --> 00:46:32.285 It's with two S's. 00:46:32.400 --> 00:46:34.751 And then, well, you can do a search 00:46:34.780 --> 00:46:36.449 and replace on the results. 00:46:36.660 --> 00:46:41.213 But you will see, if you use it, it's easy to see. 00:46:42.304 --> 00:46:42.525 Okay. 00:46:43.120 --> 00:46:45.230 I mean, that was the conclusion. 00:46:45.340 --> 00:46:48.312 So for Whisper: Python, PyTorch and FFmpeg are 00:46:48.400 --> 00:46:51.111 the basic packages you need to install. 00:46:51.280 --> 00:46:53.811 And once you have done that, pip install, 00:46:54.982 --> 00:46:58.493 git+https, well, you can see it there. 00:46:59.020 --> 00:47:03.173 And you have the best engine on your computer.
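The search-and-replace step just described is easy to script once you collect the recurring misspellings. A minimal sketch; the correction table is hypothetical, and you would fill in the variants you actually encounter (the two entries echo the examples from the talk, the Janssen spelling and the out-of-vocabulary company name):

```python
# Post-hoc corrections on a recognition result: fix recurring
# misspellings of names and out-of-vocabulary terms in one pass.
import re

CORRECTIONS = {  # hypothetical table: recognized variant -> correct form
    "Jansen": "Janssen",        # the name with two S's
    "Telen & Kets": "Telecats", # company name missing from the vocabulary
}

def apply_corrections(text: str) -> str:
    for wrong, right in CORRECTIONS.items():
        # \b keeps us from touching words that merely contain the variant
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text)
    return text

print(apply_corrections("Meneer Jansen werkt bij Telen & Kets."))
# → Meneer Janssen werkt bij Telecats.
```

The word boundaries matter: without them, a replacement could corrupt longer words that happen to contain the variant.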
00:47:03.680 --> 00:47:07.171 And again, on the computer I'm giving this presentation on, 00:47:07.900 --> 00:47:10.109 I do have a graphical card, an NVIDIA. 00:47:10.820 --> 00:47:12.770 So that means I do have a GPU, 00:47:13.080 --> 00:47:17.413 so I can use the GPU version of Whisper. 00:47:18.703 --> 00:47:22.574 If you don't have it, it will take much more time, 00:47:22.660 --> 00:47:26.953 because the relative speed is based on a fast GPU. 00:47:27.480 --> 00:47:29.791 If you don't have a GPU, yeah, it takes you more time, 00:47:29.860 --> 00:47:31.027 but you can do it in the evening 00:47:31.100 --> 00:47:32.487 and the next morning it will be ready. 00:47:33.300 --> 00:47:35.717 So, yeah. 00:47:35.727 --> 00:47:36.530 Okay, conclusion. 00:47:38.184 --> 00:47:41.654 Automatic speech recognition is working well at this moment. 00:47:41.740 --> 00:47:45.292 I mean, it's absolutely astonishing how good it is. 00:47:47.386 --> 00:47:49.092 We do need, and I mean 00:47:49.380 --> 00:47:52.011 the community of speech technologists, 00:47:52.100 --> 00:47:56.954 we do need to give a little bit more attention 00:47:57.060 --> 00:47:59.130 to people who are less well represented: 00:47:59.180 --> 00:48:02.652 people with a heavy dialect, or people coming from outside 00:48:02.780 --> 00:48:07.072 and trying to speak Dutch, people with diseases. 00:48:08.162 --> 00:48:10.350 So there is of course something to do, 00:48:11.040 --> 00:48:12.929 but I do believe that in the coming years 00:48:13.140 --> 00:48:17.533 we will probably solve this. 00:48:19.546 --> 00:48:21.052 And that's the next step, 00:48:21.220 --> 00:48:22.768 but that is more the research topic. 00:48:22.940 --> 00:48:24.910 I mean, we need to go from the recognition 00:48:25.060 --> 00:48:25.846 to the understanding.
00:48:25.920 --> 00:48:30.113 We do need to know what someone's 00:48:30.340 --> 00:48:35.672 intention was when we did the recognition, and not only a verbatim transcription. 00:48:37.540 --> 00:48:43.591 This is the end, and I have the question: what questions do you have for me? 00:48:47.000 --> 00:48:53.752 Before you start with this, there will be, if you like it, a follow-up. 00:48:56.641 --> 00:49:04.956 Utrecht University will organize that. Reinike will send you mails. But I can imagine that 00:49:05.060 --> 00:49:13.596 initially the first steps are always more difficult than you want them to be. So if there are five or more 00:49:13.680 --> 00:49:18.615 people who want to continue trying to do the recognition with their results and on their own 00:49:18.660 --> 00:49:27.476 computer, we will organize some classes, I mean meetings, where I will be there, eventually with other 00:49:28.704 --> 00:49:32.655 technically involved people, trying to help you install Python and the other stuff, 00:49:33.381 --> 00:49:40.976 and then also help you with the recognition, and show you which model to use when, things like that. 00:49:41.520 --> 00:49:47.715 But it depends heavily on you. I mean, if you are technically quite well versed, you can do it yourself. 00:49:48.261 --> 00:49:51.533 If you want us to help you, let us know and we will organize that. 00:49:52.564 --> 00:49:54.933 And now it's open for questions. 00:50:02.327 --> 00:50:04.714 And if there are no questions, that's fine for me as well. 00:50:04.780 --> 00:50:07.271 But yeah, Daphne. 00:50:08.225 --> 00:50:08.566 Hello. 00:50:12.426 --> 00:50:15.014 Wait, let me turn my camera on so you can see me. 00:50:15.802 --> 00:50:20.394 So, I'm not a researcher, but I'm actually a privacy officer at the university. 00:50:21.403 --> 00:50:25.234 So, my interest in this is, well, privacy-related.
00:50:25.981 --> 00:50:34.036 And I was wondering how you feel, or how you want to deal with the privacy issues that come up with this. 00:50:34.120 --> 00:50:41.736 Because obviously, if you transcribe interviews, then it's more easily accessible to lots of people. 00:50:41.880 --> 00:50:45.733 Absolutely so, but first we have to go one step back. 00:50:46.100 --> 00:50:50.914 I mean, the current version, well, a more or less outdated 00:50:51.000 --> 00:50:54.352 but still working version, is that we developed a server 00:50:54.460 --> 00:50:55.626 at the Nijmegen University. 00:50:56.640 --> 00:50:59.151 You go there, you make your login and your password, 00:50:59.560 --> 00:51:02.452 and you need to show that you are not a commercial company 00:51:02.540 --> 00:51:05.029 but working at a university or that kind of organization. 00:51:06.000 --> 00:51:08.729 And you upload a file, you select, 00:51:10.121 --> 00:51:12.270 you push the button, start. 00:51:12.860 --> 00:51:15.571 And then after a couple of minutes, 00:51:15.760 --> 00:51:17.890 depending on the length of your recording, 00:51:17.940 --> 00:51:18.704 you get the results. 00:51:19.740 --> 00:51:24.915 However, if it is quite a sensitive recording, 00:51:24.820 --> 00:51:28.933 it means that it's going to another organization, 00:51:29.120 --> 00:51:30.508 and that's not doing any harm. 00:51:30.660 --> 00:51:34.773 But still, given the GDPR, it is a little bit tricky 00:51:35.200 --> 00:51:36.848 whether that is allowed, yes or no. 00:51:37.740 --> 00:51:40.011 If you have very sensitive material, 00:51:40.120 --> 00:51:43.853 you can make a contract, and then it is not done 00:51:43.920 --> 00:51:46.370 via the internet, but you send a USB stick to them 00:51:46.900 --> 00:51:49.089 and they will handle it with all the privacy- 00:51:49.900 --> 00:51:51.648 related items involved.
00:51:52.440 --> 00:51:55.211 However, with the current version of Whisper, 00:51:56.444 --> 00:51:58.512 I mean, you can do it on your own computer. 00:51:58.680 --> 00:52:00.449 So that means that you can listen to it 00:52:00.560 --> 00:52:03.752 and make a handwritten transcription, 00:52:04.200 --> 00:52:06.450 or you can give it to Whisper on your own computer. 00:52:06.840 --> 00:52:09.209 So there's no, I mean, you can do it without any internet 00:52:09.940 --> 00:52:11.167 and you get the results. 00:52:11.640 --> 00:52:13.650 And okay, so there is no difference 00:52:13.760 --> 00:52:15.629 between a handmade transcription 00:52:16.080 --> 00:52:19.852 or a transcription done on your own computer with Whisper. 00:52:20.760 --> 00:52:23.150 What you do after that, and then of course, 00:52:23.500 --> 00:52:24.988 it's a sensible point that you mentioned. 00:52:25.420 --> 00:52:26.848 I mean, once you have the transcription 00:52:26.980 --> 00:52:28.569 and you place that somewhere on a website 00:52:28.660 --> 00:52:29.746 or things like that, yeah. 00:52:30.320 --> 00:52:33.933 But there's no difference between the use of Whisper 00:52:34.160 --> 00:52:36.370 on your own computer or doing 00:52:36.721 --> 00:52:38.609 the transcription by hand yourself. 00:52:39.280 --> 00:52:43.232 So yeah, the privacy is still very important, 00:52:43.880 --> 00:52:48.233 but Whisper is not going to change that, so to say. 00:52:49.564 --> 00:52:51.570 Is that more or less what you wanted to hear? 00:52:52.620 --> 00:52:53.867 Yeah, very much. Yeah, thank you. 00:52:54.420 --> 00:52:54.621 Okay. 00:52:58.103 --> 00:53:06.656 Hi, I was wondering, you mentioned the difference between specific languages, like Dutch, and 00:53:07.680 --> 00:53:09.408 that more attention needs to go towards that.
00:53:09.940 --> 00:53:14.674 But what would you say are the biggest differences in the models currently used, for example, 00:53:14.840 --> 00:53:16.506 at Sound and Vision, which has ASR? 00:53:18.300 --> 00:53:23.955 Do you see patterns in the mistakes it makes, or in what kind of ways does that differ from 00:53:24.121 --> 00:53:25.027 Whisper, for example? 00:53:25.361 --> 00:53:31.135 Ah, well, the error rate at Sound and Vision, which is using the Kaldi 00:53:31.220 --> 00:53:37.295 recognizer, is, I mean, more dependent on the quality of the speech and things like 00:53:38.003 --> 00:53:38.144 that. 00:53:38.480 --> 00:53:41.187 But for Dutch, it is between 12 and 15 percent. 00:53:44.440 --> 00:53:50.573 So it means that 85% of the words are recognized well, and for the other 15%, 00:53:51.540 --> 00:53:52.726 you absolutely need to check them. 00:53:54.782 --> 00:54:01.815 Whisper is at the human level, so that is between three and five percent. So only one-third 00:54:02.220 --> 00:54:09.856 of the errors made with the Kaldi recognizer are made by Whisper. Moreover, if you 00:54:09.900 --> 00:54:15.113 say, well, I'm Arjan and I'm living in Utrecht, Arjan and also Utrecht are written with a capital, 00:54:16.020 --> 00:54:21.455 and with the Kaldi recognizer all the words are lowercase, and there are no question marks, 00:54:21.480 --> 00:54:27.074 there are no commas, there are no dots, and all that reading stuff is not available. 00:54:27.320 --> 00:54:28.544 With Whisper, it is available. 00:54:31.140 --> 00:54:33.828 And I will show you. 00:54:36.060 --> 00:54:39.966 But that's, yeah, I'm still in. 00:54:47.780 --> 00:54:50.530 It's my web. Can you see what I'm doing? 00:54:51.761 --> 00:54:52.405 Yeah, OK. 00:55:00.707 --> 00:55:02.954 And here is an example. It is in Dutch, but well.
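The percentages quoted here are word error rates (WER). A minimal sketch of how such a rate is computed, as word-level edit distance over a reference transcription; real evaluations also normalize punctuation and casing first, which this sketch skips:

```python
# Word error rate: (substitutions + insertions + deletions) / reference length,
# via the classic dynamic-programming edit distance over words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("ik ben arjan en ik woon in utrecht",
          "ik ben arjen en ik woon in utrecht"))  # 1 error in 8 words → 0.125
```

On this scale, the difference between roughly 15% (Kaldi) and roughly 5% (Whisper) is the factor of three mentioned in the answer.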
00:55:08.608 --> 00:55:09.792 And here you do see it: 00:55:10.000 --> 00:55:13.671 the sentence starts with a capital I, and then there's a comma, 00:55:14.781 --> 00:55:17.211 and another comma, and a dot at the end. 00:55:17.400 --> 00:55:22.915 The Taalunie is written with capitals T and U, et cetera, et cetera. 00:55:23.020 --> 00:55:25.852 So this is a very nice conversation. 00:55:25.940 --> 00:55:28.151 Well, and even Kaldi was doing it quite well. 00:55:28.802 --> 00:55:32.954 But the recognition in Kaldi is only in lowercase. 00:55:33.521 --> 00:55:38.274 And this one, it knows how to write the Taalunie, it knows how to write 00:55:38.360 --> 00:55:39.989 the Dutch, and things like that. 00:55:40.501 --> 00:55:45.334 So, I mean, this is more or less it, and you don't need to repair this. 00:55:45.921 --> 00:55:53.436 So, coming back to your question on Sound and Vision: even Sound and Vision will move to Whisper as soon as possible. 00:55:54.722 --> 00:56:04.557 I mean, Roeland and I, we are both working at the University of Twente, and we know this, and they will start as quickly as possible. 00:56:05.162 --> 00:56:09.374 I mean, we need a couple of weeks, months to make it; 00:56:09.962 --> 00:56:14.895 we need to build it. And yeah, it is Taalunie, that's right, it's written 00:56:14.940 --> 00:56:22.796 differently, but anyway. And, I mean, it's now March, so I believe 00:56:22.860 --> 00:56:29.876 that before the summer Whisper will be the default recognizer for Dutch that we 00:56:30.000 --> 00:56:35.875 offer. So that means that you can do it at Sound and Vision; you can redo all the 00:56:35.940 --> 00:56:39.313 material that you did in the past, or you can say, well, only the new material will be 00:56:39.461 --> 00:56:42.913 done with Whisper, but that's something Sound and Vision has to deal with. 00:56:44.785 --> 00:56:48.735 Is this more or less answering your question? Yeah, it does. Thank you. Okay.
00:56:52.229 --> 00:56:52.852 Other people? 00:57:04.134 --> 00:57:05.596 No questions at all? 00:57:07.260 --> 00:57:07.361 OK. 00:57:09.183 --> 00:57:13.534 Well, either it was completely not understandable, or it was 100% understandable. 00:57:14.843 --> 00:57:16.709 Reinike, may I give the word back to you? 00:57:19.344 --> 00:57:19.946 Yes, you may. 00:57:22.681 --> 00:57:29.055 As Arjan already indicated, we will schedule two follow-up workshops with sufficient interest, 00:57:29.240 --> 00:57:31.469 so from four people on. 00:57:32.480 --> 00:57:40.316 If you're interested, you can send an email to CDH at uu.nl; I also already put it in 00:57:41.124 --> 00:57:41.485 the chat. 00:57:42.680 --> 00:57:50.396 And we will try our best to schedule two workshops that work for everyone, on location in the 00:57:50.520 --> 00:57:51.383 Utrecht city center.