The NEMLAR Arabic Language Resources
The NEMLAR Arabic LRs comprise a set of three resources (namely, the NEMLAR Arabic Written Corpus, NEMLAR Arabic Broadcast
News Speech Corpus and NEMLAR Arabic Speech synthesis Corpus). These resources are owned and copyrighted by the NEMLAR
Consortium and are available through ELRA.
The NEMLAR Arabic Written Corpus consists of 500K words of Standard Arabic text compiled from 13 different domains
(political news, political debate, Islamic text, common-word phrases, text from Broadcast News, business, Arabic literature,
general news, interviews, scientific press, sports press, dictionary-entry explanations and legal domain text), aiming to
achieve a well-balanced corpus that offers a representation of the variety in syntactic, semantic and pragmatic features
of the modern Arabic language. The data span the period from the late 1990s to 2005. The corpus is provided in 4
different versions: a) raw text, b) fully vowelized text, c) text with Arabic lexical analysis, and d) Arabic POS-tagged
text.
The NEMLAR Arabic Broadcast News Speech Corpus consists of 40 hours of transcribed Standard Arabic data (from 209 male and
50 female speakers) recorded from four different radio stations. Each daily-broadcast recording contains between 25 and 30
minutes of news and interviews. Transcriptions follow Transcriber conventions with the additional patch for Arabic. Thus,
transcriptions were done in Arabic characters and their transliterations were automatically generated. The character set
used for the transliterations follows the ISO-8859 standard. The annotation levels cover the orthographic
transcription of speech, including named entities; speakers and speaker turns; segment markers; topic/story boundaries;
background noises; changes of background; music/noise; and word boundaries.
The NEMLAR Arabic Speech synthesis Corpus was produced to help build concatenative and parametric Arabic
TTS systems. This corpus consists of 10 hours of annotated recorded speech from native Arabic speakers (5 hours from a male
speaker and 5 hours from a female speaker). All speech data was recorded at 96 kHz, 24 bits, 2 channels (one from a highly
sensitive large-membrane microphone, and the other carrying the electroglottograph (EGG) signal). The same prompt sheets
were used for both the male and female recordings. They contained 33,200 words, distributed as follows: a) 6,600 were
extracted from different domains of the NEMLAR Arabic Broadcast News Speech corpus; b) 16,500 were selected from different
domains of the NEMLAR Arabic Written corpus; c) 3,500 represented frequent Arabic phrases, and d) the remaining 6,600 aimed
to cover missing and rare diphones. The full corpus comprises the following components: orthographic transcription, prosodic
transcription, phonetic transcription, phonetic segmentation and pitch marks.
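As a quick sanity check on the figures quoted for the speech synthesis corpus, the short Python sketch below adds up the four prompt-sheet sources and estimates how much storage the recordings would need. It rests on two assumptions that are not stated in the description: that the audio is stored as uncompressed linear PCM, and that the 10 hours refer to the total recorded duration. The function and dictionary names are illustrative only.

```python
# Word counts per prompt-sheet source, as listed in the corpus description.
PROMPT_WORDS = {
    "broadcast_news": 6_600,    # a) NEMLAR Arabic Broadcast News Speech corpus
    "written_corpus": 16_500,   # b) NEMLAR Arabic Written corpus
    "frequent_phrases": 3_500,  # c) frequent Arabic phrases
    "rare_diphones": 6_600,     # d) missing and rare diphones
}


def prompt_shares(counts):
    """Return the total word count and each source's share of it."""
    total = sum(counts.values())
    return total, {name: n / total for name, n in counts.items()}


def raw_pcm_bytes(hours, sample_rate_hz=96_000, bits_per_sample=24, channels=2):
    """Estimate storage for uncompressed PCM audio (assumption: no compression)."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return int(hours * 3600) * bytes_per_second


total, shares = prompt_shares(PROMPT_WORDS)
print(f"total prompt words: {total}")
for name, share in shares.items():
    print(f"  {name}: {share:.1%}")

size_bytes = raw_pcm_bytes(10)
print(f"estimated raw audio size: {size_bytes / 1e9:.1f} GB")
```

The four sources sum to the stated 33,200 words, and under the assumptions above the two-channel recordings come to roughly 20.7 GB of raw audio.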
The project is supported by the European Commission's INCO-MED programme and is running from February 1st 2003 until
July 31st 2005.