Integrating digital practices and tools in the scholarly workflow
Summary
As increasingly sophisticated new technologies come on stream, there is one type of data that begs to be explored with the wide array of available digital humanities tools: interview data (de Jong et al., 2014; Corti and Fielding, 2016). A community of experts from the Netherlands, Great Britain, Italy and Germany, who engage with interview data from different perspectives, organized a series of four workshops between 2016 and 2018, funded by CLARIN (Common Language Resources and Technology Infrastructure) and held in Oxford, Utrecht, Arezzo and München. This paper presents the preliminary results of this series of multidisciplinary workshops, sketching the goals, the selection of participants, data and tools, and the way the invited scholars coped with unfamiliar scholarly approaches and digital tools. The premise was that the multimodal character (text, sound and facial expression) and multidisciplinary potential of interview data (history, oral and written language, audio-visual data) is rarely fully exploited, as most scholars focus on the analysis of the textual representation of the interviews. This might change by getting acquainted with scholarly approaches and conventions from other disciplines.
Developing a Transcription Chain
As the first goal was to develop a portal for automatic transcription and alignment of interview data in different languages, the participating countries were selected on the basis of the availability of mature open-source speech-to-text software. The scholars who participated represented the following communities:
- Historians and social science users who undertake research with recorded interview data sources;
- Linguists who use spoken language sources;
- Software tool specialists.
Consequently, during the first three workshops, user requirements were collected and the performance of various speech-to-text software packages on interview data was tested with data provided by researchers and data curators.
Cross-disciplinary overtures in München
In the three-day workshop in München the focus shifted to the phase of annotation and analysis of the data. Anticipating that the diversity of participants and tools would make the organization of the workshop complex, it was key to design the workshop in a way that ensured ‘satisfying experiences’. To this end the following principles were applied:
- Information on the participants’ level of digital savviness was gathered;
- Data familiar to the participants, in both a common language (English) and in their native language, was prepared;
- Homework was assigned in order to become familiar with the tools;
- A participant with advanced digital skills was present in each of the language groups; and
- Feedback was elicited and recorded directly after the session exercises.
In the first session the first version of the T-Chain was tested with German, Dutch, English and Italian data. In the subsequent sessions participants worked with proprietary and open-source annotation tools common among social scientists, with text mining tools used by computational linguists, and with emotion recognition tools used by computer scientists.
A parade of research trajectories
Before starting the hands-on sessions it was deemed necessary to present the research profiles of the disciplines represented, in order to counter the ‘simplification’ that occurs when referring to other disciplines. Within every discipline distinct sub-disciplines exist: speaking of ‘linguistics’ is an over-simplification, just as the term ‘oral history’ covers a broad variety of approaches. An oral historian will typically approach a recorded interview as an intersubjective account of a past experience, whereas a colleague might consider the same source as a factual testimony. A social scientist is likely to try to discover common themes, similarities and differences across a whole set of interviews, while a computational linguist will do the same, but on the basis of counting frequencies. To illustrate this variety of landscapes, participants were invited to provide ‘research trajectories’ that reflected their own approach(es) to working with interview data. This enabled us to come up with a high-level simplified trajectory and to identify how and where the digital tools might fit into the researchers’ workflow.
Figure: High-level simplified journey of working with interview data
Tools for transcription, annotation, linguistic analysis and emotion recognition
The researchers were invited to work in four ‘language groups’ of 5 to 6 people (Dutch, English, Italian and German) in hands-on sessions, using step-by-step worksheets and pre-prepared interview extracts. The T-Chain, developed with CLARIN support, with its speech-to-text and alignment software, was able to partly replace the cumbersome manual transcription of interviews, a practice familiar to anyone working with interviews. When trying out the annotation tools, which offer a structured system for attributing meaning to passages, the overlap in needs across disciplines tended to decrease. The reason is that the choice of a particular annotation tool becomes an engrained research practice that cannot easily be traded for an alternative. For this purpose the open-source tool ELAN was used and compared with the proprietary qualitative data analysis software NVivo. In the following two sessions the experimental component increased, as the same interview data was explored with computational linguistic tools of increasing complexity. First, the web-based tools Voyant and NLPCore were used, which allow transcripts to be processed ‘on the fly’, so that a whole set of specific language features can be identified directly. The second tool, TXM, had to be installed locally and allowed for a more granular analysis of language features; it required the integration of a language-specific model, the splitting of the transcript by speaker, the conversion of the data into machine-readable XML, and the ‘lemmatization’ of the data (a sketch of this kind of preprocessing is given below). The last session was the most daunting one, illustrating what the same data yielded when processed with the acoustic analysis tool PRAAT and the emotion recognition toolkit openSMILE.
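To give a concrete impression of the preprocessing TXM asks for, the minimal Python sketch below splits a transcript into speaker turns and converts them into simple XML. It does not reproduce the actual TXM import format: the speaker labels (‘INT:’/‘RES:’), the invented example turns and the element names are assumptions made purely for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: a plain-text transcript in which every turn starts
# with a speaker label such as "INT:" (interviewer) or "RES:" (respondent).
TRANSCRIPT = """\
INT: Can you tell me about your arrival in the Netherlands?
RES: We arrived in the winter of 1975, it was very cold.
INT: And how did that feel?
RES: Strange, everything was strange at first.
"""

TURN_PATTERN = re.compile(r"^(?P<speaker>[A-Z]+):\s*(?P<utterance>.+)$")

def transcript_to_xml(text: str) -> ET.Element:
    """Split a transcript into speaker turns and wrap them in simple XML."""
    root = ET.Element("interview")
    for line in text.splitlines():
        match = TURN_PATTERN.match(line.strip())
        if not match:
            continue  # skip empty lines or lines without a speaker label
        turn = ET.SubElement(root, "u", who=match.group("speaker"))
        turn.text = match.group("utterance")
    return root

if __name__ == "__main__":
    root = transcript_to_xml(TRANSCRIPT)
    ET.indent(root)  # pretty-printing; requires Python 3.9+
    print(ET.tostring(root, encoding="unicode"))
```

Real transcripts would of course need a richer schema (time codes, metadata, overlapping speech), but the step itself, turning a flat text into explicitly structured turns, is what the lemmatizer and the corpus queries later rely on.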
User experience and evaluation
A first analysis of the user experience suggests that scholars are very much open to cross-fertilization. Knowing what a digital tool actually does ‘behind the scenes’ increased participants’ sense of the tool’s usability. But scholars are only willing to integrate a digital tool into their existing research practice and methodological mindset if it can easily be used or adapted to their needs. The limited functionality of the free, easy-to-use tools, and the observed methodological and technological complexity and jargon-laden nature of the dedicated downloadable tools, were both seen as significant barriers, despite the availability of clear documentation. Particularly complex was the experience with the linguistic tools that require the data to be preprocessed; these have a steep learning curve. While their output triggers fascination and curiosity, it appears difficult to translate insights into the structure of language, obtained with the entire corpus as the level of analysis, into one’s non-computational practice of interpreting individual personal accounts. The same applies to the emotion recognition tools. The messy data that preprocessing the interviews about migration yielded led to the sifting out of possible hypotheses, but not to a deeper understanding of the experience of migration. The real challenge lies in being able to translate insights with regard to scale, frequency and paralinguistic features into the classic interpretation of the interview data. Often this means looking at other things, for instance the amount and character of silences within an entire corpus. The most salient conclusion of the exploration could be that the traditional practice of interpreting interviews can be enriched by considering digital tools as purely heuristic instruments, to be considered at the very beginning of the research process, when one is still deciding which collections to reuse or which approach to take.
The workshop participants also pointed to some more mundane concerns. Explanations of linguistic approaches would be better appreciated in more lay terms, following a step-by-step pathway from the meta to the concrete level. Another concern related to the ethical issues of uploading an interview to a web service without knowing what happens to the data; this prompted thinking about how these tools could add explicit GDPR-compliant data processing agreements to allay such worries. Finally, users pointed to a great need for contextual information about, and metadata for, the data collection and processing when interview data sources are used. Bringing together language resources and scholars from different languages and disciplines certainly enriched the meeting experience.
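To make the example of silences concrete: a rough estimate of the number and length of pauses in a recording can already be obtained by thresholding frame-level energy, as in the Python sketch below. The file name, frame size and silence threshold are arbitrary assumptions, and a dedicated tool such as PRAAT handles this far more robustly; the sketch only illustrates what ‘counting silences across a corpus’ involves in practice.

```python
import wave
import numpy as np

def silence_spans(path: str, frame_ms: int = 30, threshold: float = 0.01):
    """Return (start, end) times in seconds of low-energy stretches in a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        n_channels = wav.getnchannels()
        samples = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
    if n_channels > 1:
        samples = samples.reshape(-1, n_channels).mean(axis=1)  # mix down to mono
    samples = samples.astype(np.float64) / 32768.0  # normalise to [-1, 1]

    frame_len = int(rate * frame_ms / 1000)
    spans, start = [], None
    for i in range(0, len(samples) - frame_len, frame_len):
        frame = samples[i:i + frame_len]
        quiet = np.sqrt(np.mean(frame ** 2)) < threshold  # RMS energy test
        t = i / rate
        if quiet and start is None:
            start = t
        elif not quiet and start is not None:
            spans.append((start, t))
            start = None
    if start is not None:
        spans.append((start, len(samples) / rate))
    return spans

if __name__ == "__main__":
    # Hypothetical file name; any 16-bit PCM WAV interview recording would do.
    pauses = [s for s in silence_spans("interview_01.wav") if s[1] - s[0] >= 1.0]
    print(f"{len(pauses)} pauses of at least one second")
```

Run over a whole collection, such counts become the kind of corpus-level observation that then still has to be interpreted against the individual accounts, which is exactly the translation step the participants found hardest.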
References
- Corti, L. and Fielding, N. (2016). Opportunities From the Digital Revolution: Implications for Researching, Publishing, and Consuming Qualitative Research. SAGE Open. https://doi.org/10.1177/2158244016678912.
- Lanman, B. A. and Wendling, L. M. (2006). Preparing the Next Generation of Oral Historians: An Anthology of Oral History Education. Altamira Press.
- de Jong, F. M. G., van Hessen, A., Petrovic, T. and Scagliola, S. (2014). ‘Croatian Memories: speech, meaning and emotions in a collection of interviews on experiences of war and trauma’. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14).
- Truong, K. P., Westerhof, G. J., Lamers, S. M. A., de Jong, F. M. G. and Sools, A. (2013). ‘Emotional expression in oral history narratives: comparing results of automated verbal and nonverbal analyses’. Proceedings of the Workshop on Computational Models of Narrative (CMN 2013), Hamburg, Germany.
- Van den Heuvel, H., van Hessen, A., Scagliola, S. and Draxler, C. (2017). Transcribing Oral History Audio Recordings – the Transcription Chain Workflow. Poster at the CLARIN Annual Conference, Budapest, 18–19 September 2017.