The whole CSJ contains about 650 hours of spontaneous speech, corresponding to about 7 million words. All of the speech material was recorded with head-worn close-talking microphones on DAT (digital audio tape) and down-sampled to 16 kHz with 16-bit quantization. The speech material is transcribed using a two-way transcription scheme designed especially for CSJ. In addition, POS (part-of-speech) analysis based on two different kinds of 'word' is applied to the whole corpus.
3.2 The Core
CSJ has a proper subset, called the Core, which contains about 500k words, or 45 hours of speech. The Core is the part of CSJ on which we concentrate the cost of annotation. In addition to the two-way transcription and two-way POS analysis, segment labels, intonation labels, and other miscellaneous annotations are provided for the Core.
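
To give a sense of how small the Core is relative to the whole corpus, the short Python sketch below computes its share from the rounded figures quoted above. The constant and function names are illustrative only, and the results inherit the imprecision of the "about" values in the text.

```python
# Approximate figures quoted in the text (all are rounded "about" values).
CSJ_HOURS = 650
CSJ_WORDS = 7_000_000
CORE_HOURS = 45
CORE_WORDS = 500_000


def share(part: float, whole: float) -> float:
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


# Share of the whole corpus covered by the Core, by words and by hours.
core_word_share = share(CORE_WORDS, CSJ_WORDS)
core_hour_share = share(CORE_HOURS, CSJ_HOURS)

# Average speech density, as a rough consistency check between the two sets.
csj_words_per_hour = CSJ_WORDS / CSJ_HOURS
core_words_per_hour = CORE_WORDS / CORE_HOURS

print(f"Core share of words: {core_word_share:.1f}%")
print(f"Core share of hours: {core_hour_share:.1f}%")
print(f"Words per hour (whole CSJ): {csj_words_per_hour:.0f}")
print(f"Words per hour (Core):      {core_words_per_hour:.0f}")
```

On these figures the Core amounts to roughly 7% of CSJ by either measure, and the words-per-hour rates of the Core and the whole corpus are of the same order (about 11k words per hour).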