Useful References


General Speech Datasets

Emotional Speech Datasets

Speech Conversation Datasets

CABank English CallHome Corpus (Link)

The CallHome English corpus of telephone speech was collected and transcribed by the Linguistic Data Consortium primarily in support of the project on Large Vocabulary Conversational Speech Recognition (LVCSR), sponsored by the U.S. Department of Defense.

This release of the CallHome English corpus consists of 120 unscripted telephone conversations between native speakers of English. The CD-ROM distribution contains the speech data only, along with essential documentation files and software for handling the compressed speech data. The transcripts and other text data and documentation are distributed separately (typically via electronic transmission from the LDC's ftp/web server), and will be subject to periodic updates. 

The transcripts cover a contiguous 5 or 10 minute segment taken from a recorded conversation lasting up to 30 minutes. All speakers were aware that they were being recorded. They were given no guidelines concerning what they should talk about. Once a caller was recruited to participate, he/she was given a free choice of whom to call. Most participants called family members or close friends overseas. All calls originated in North America; 90 of the 120 calls were placed to various locations overseas, while the remaining 30 were placed within North America. The distribution of call destinations can be found in the file "spkrinfo.tbl". The transcripts are timestamped by speaker turn for alignment with the speech signal, and are provided in standard orthography.

AMI Corpus (Link)

The AMI Meeting Corpus is a multi-modal data set consisting of 100 hours of meeting recordings. For a gentle introduction to the corpus, see the corpus overview. To access the data, follow the directions given there. Around two-thirds of the data has been elicited using a scenario in which the participants play different roles in a design team, taking a design project from kick-off to completion over the course of a day. The rest consists of naturally occurring meetings in a range of domains. Detailed information can be found in the documentation section. 

HCRC Map Task Corpus (Link)

The HCRC Map Task Corpus is a set of 128 dialogues that has been recorded, transcribed, and annotated for a wide range of behaviours, and has been released for research purposes. It was originally designed to elicit behaviours that answer specific research questions in linguistics. You can read more about the design here. Since the original material was released in 1992, the corpus design has been used not just for linguistics research, but also in teaching and by computational linguists for training machine classifiers.

Since HCRC continues to use the Corpus in our own research, we welcome contact with colleagues engaged in similar projects. For this reason, we ask users to notify us, as a matter of courtesy, of the topic of their intended work with these materials.

Because the Map Task is available in a number of forms, we provide a brief history explaining what these are and what they contain. Most people just want the most up-to-date version, which is in the format for the NITE XML Toolkit (see NXT-format XML Annotations (v2.1)). The simplest way to acquire the corpus in that format is from the main download page. To make things easier, the audio and maps are available from the same page.

Privacy & Security

Privacy reserve

WiFi sensing

WiFi sensing surveys

User identification/Human presence checking

Lip movement detection