The Groningen Corpus

The Groningen Corpus was collected by A.M. Sulter, MD and Prof. H.K. Schutte as part of a research project funded by NWO (Netherlands Organization for Scientific Research). It is a corpus of read speech in Dutch, recorded on PCM tape under fairly good conditions. The 4 CD-ROMs contain over 20 hours of speech from 238 speakers, who read:

  • 2 short texts (the famous North Wind text and a longer text, de Koning by Godfried Bomans, with many quoted sentences to elicit 'emotional' speech)
  • 23 short sentences (containing all possible vowels and all possible consonants and consonant clusters in Dutch)
  • 20 numbers (the numbers 0-9 and the tens from 10-100)
  • 16 monosyllabic words (containing all possible vowels in Dutch)
  • 3 long vowels (a:, E, i)
Ninety-four of the 238 speakers also read an extended word list. Orthographic transcriptions of the material are included.

The speakers all speak the standard variety of Dutch. Some of the speakers are trained, others are untrained, and others are voice patients. Speaker information such as age, height, and weight, as well as smoking and drinking habits, is provided. In addition, the voice quality of the speakers was evaluated by the speakers themselves and by a panel of untrained listeners. Additional information on the vocal behaviour of the speakers will become available in the near future as a printed supplement to the CDs. Alongside the speech signal, an electroglottograph signal is included on the CD-ROMs. The data were sampled at 16 kHz and compressed with the programme 'Shorten' (the UNIX version of this programme, by Tony Robinson of Cambridge University Engineering Department, Cambridge, UK, is included on the CD-ROMs). The files have NIST SPHERE headers.
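
As an illustration of the file format mentioned above, the following Python sketch reads the plain-text header of a NIST SPHERE file. It assumes the common SPHERE layout (a "NIST_1A" line, a header-size line, "name -type value" fields, and a closing "end_head"). The function names and the demo header are hypothetical, and only the 16 kHz sample rate comes from the description above; the actual fields in the corpus files may differ. It reads the header only: the audio that follows is Shorten-compressed and would still need to be decompressed with that programme.

```python
import io


def parse_sphere_header(stream):
    """Parse a NIST SPHERE header from a binary stream into a dict.

    Assumes the usual layout: 'NIST_1A\\n', then the header size in bytes
    on its own 8-byte line, then 'name -type value' lines up to 'end_head'.
    """
    preamble = stream.read(16)
    if len(preamble) < 16 or not preamble.startswith(b"NIST_1A\n"):
        raise ValueError("not a NIST SPHERE file")
    header_size = int(preamble[8:].decode("ascii").strip())

    # The rest of the header block; anything after 'end_head' is padding.
    body = stream.read(header_size - 16).decode("ascii", errors="replace")

    fields = {}
    for line in body.splitlines():
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):  # skip blanks and comments
            continue
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        name, ftype, value = parts
        if ftype == "-i":        # integer field
            fields[name] = int(value)
        elif ftype == "-r":      # real-valued field
            fields[name] = float(value)
        else:                    # string field, e.g. -s5
            fields[name] = value
    return fields


def make_demo_header():
    """Build a synthetic 1024-byte header for demonstration (values made up,
    apart from the 16 kHz sample rate stated in the corpus description)."""
    text = (
        "NIST_1A\n   1024\n"
        "sample_rate -i 16000\n"
        "sample_n_bytes -i 2\n"
        "speaker_id -s4 demo\n"
        "end_head\n"
    )
    return text.ljust(1024).encode("ascii")


if __name__ == "__main__":
    header = parse_sphere_header(io.BytesIO(make_demo_header()))
    for key, value in header.items():
        print(key, value)
```

To inspect a real file, the same function could be called on a file opened in binary mode, for example `parse_sphere_header(open(path, "rb"))`, where `path` is a placeholder for a file from the CD-ROMs.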

The Groningen Corpus is the result of the joint efforts of the collectors of the data, A.M. Sulter, MD and Prof. H.K. Schutte, and of the Speech Processing Expertise Centre (SPEX), which reprocessed the data for production on CD-ROM. The production on CD-ROM was partially supported by ELSNET, and the pre-mastering was done at LIMSI-CNRS.

For further information, including ordering information, please contact ELRA - European Language Resources Association.


Page generated 13-02-2008 by Steven Krauwer