The Groningen Corpus
The Groningen Corpus was collected by A.M. Sulter, MD and
Prof. H.K. Schutte as part of a research project funded by NWO
(Netherlands Organization for Scientific Research). The 4 CD-ROMs
contain over 20 hours of speech. It is a corpus of read speech
material in Dutch, recorded on PCM tape under fairly good conditions.
These 4 CD-ROMs contain speech from 238 speakers who read:
- 2 short texts (the famous North wind text, and a longer
text, de Koning by Godfried Bomans, with many quoted sentences to
elicit `emotional' speech)
- 23 short sentences (containing all possible vowels and all
possible consonants and consonant clusters in Dutch)
- 20 numbers (the numbers 0--9 and the tens from 10--100)
- 16 monosyllabic words (containing all possible vowels in Dutch)
- 3 long vowels (a:, E, i)
Ninety-four of the 238 speakers also read an extended word
list. Orthographic transcriptions of the material are included.
The speakers are all speakers of the standard variant of Dutch. Some
of the speakers are trained, others are untrained, and others are voice
patients. Speaker information such as the age, height, and weight of
the speakers, as well as their smoking and drinking habits, is provided. In
addition, the voice quality of the speakers was evaluated by the
speakers themselves and by a panel of untrained listeners. Additional
information on vocal behaviour of the speakers will become available
in the near future as a printed supplement to the CDs. In addition to
the speech signal, an electro-glottograph signal has been included on
the CD-ROMs. The data have been sampled at 16 kHz and compressed with
the programme `Shorten' (the UNIX version of this programme by Tony
Robinson of Cambridge University Engineering Department, Cambridge,
UK, is included on the CD-ROMs). The files have NIST Sphere headers.
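
As an illustration of what a NIST Sphere header looks like to a program, the following is a minimal Python sketch that reads the header fields from the start of a file. It assumes the usual Sphere layout (a `NIST_1A` line, a line giving the header size in bytes, then `name -type value` lines ending with `end_head`). The function names and the field values in the demo header are hypothetical and are not taken from the CD-ROMs; only the 16 kHz sample rate comes from the description above. The sketch reads the header only and does not decompress Shorten-compressed samples.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the header fields from the leading bytes of a NIST Sphere file."""
    # Line 1 is the magic string; line 2 is the total header size in bytes.
    if not data.startswith(b"NIST_1A\n"):
        raise ValueError("not a NIST Sphere file")
    header_size = int(data[8:16].decode("ascii").strip())
    text = data[16:header_size].decode("ascii", errors="replace")

    fields = {}
    for line in text.split("\n"):
        line = line.strip()
        if line == "end_head":          # marks the end of the field list
            break
        if not line or line.startswith(";"):
            continue                    # skip padding and comment lines
        parts = line.split(None, 2)     # name, type tag, value
        if len(parts) != 3:
            continue
        name, type_tag, value = parts
        if type_tag == "-i":            # integer field
            fields[name] = int(value)
        elif type_tag == "-r":          # real-valued field
            fields[name] = float(value)
        else:                           # "-sN": string of N characters
            fields[name] = value
    return fields


def demo_header() -> bytes:
    """Build a hypothetical 1024-byte header for demonstration only."""
    body = (
        b"NIST_1A\n   1024\n"
        b"sample_rate -i 16000\n"       # 16 kHz, as stated in the description
        b"sample_n_bytes -i 2\n"        # illustrative value, an assumption
        b"sample_coding -s3 pcm\n"      # illustrative value, an assumption
        b"end_head\n"
    )
    return body.ljust(1024, b" ")


header = parse_sphere_header(demo_header())
print(header["sample_rate"])
```

For a real file, the same function could be applied to the first bytes read in binary mode, for example `parse_sphere_header(open(path, "rb").read(4096))`, where `path` is whatever file name the user supplies.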
The Groningen Corpus is the result of the joint efforts of the
collectors of the data, A.M. Sulter, MD and Prof. H.K. Schutte, and of the
Speech Processing Expertise Centre (SPEX), which reprocessed the data
for production on CD-ROM. The production on CD-ROM was partially
supported by ELSNET, and the pre-mastering was done at LIMSI-CNRS.
For further information, including ordering information, please
contact ELRA (European Language Resources Association).