Segmental Labeling of CSJ
Kikuchi Hideaki (National Inst. For Japanese Language)

a, i, u, e, o (voiced), A, I, U, E, O (devoiced), aH, iH, uH, eH, oH (long vowels)
Plain consonants
k, g, G[ɣ], @[ŋ], s, z, t, c[ts], d, n, h, F, b, p, m, r, w
Phonetically palatalized (before /i/) consonants
kj, gj, Gj, @j, sj[ɕ], zj[ʑ], cj[tɕ], nj[ɲ], hj[ç]
Phonologically palatalized consonants
ky, gy, Gy, @y, sy, zy, cy, ny, hy, by, py, my, ry
Special morae
N (moraic nasal), Q (geminate), H (long vowel)
Special labels
#, <cl>, <pz>, <uv>, <sv>, <fv>, <?>, <N>, <b>

This table shows the inventory of segmental labels used for labeling the CSJ-Core. The labels are basically phonemic, but some are phonetic: devoiced vowels, phonetic palatalization, and the variants of the /g/ phoneme are examples. Special labels refer to sub-phonemic events such as the closure of stops and affricates (<cl>), utterance-internal pauses (<pz>), continuation of voicing after the end of an utterance (<uv>), and so forth.

This inventory was designed to cover the segmental phonological variation observed in Tokyo Japanese, but some of the labels are not used in the labeling of the CSJ-Core. Currently, we do not distinguish between '@' and 'g', because there was considerable inter-labeler disagreement.

The process of segmental labeling was semi-automated using an HMM-based speech recognition technique. This figure shows the flowchart of segmental labeling. A phoneme label sequence is generated from the ("phonetic") transcription. The generated phoneme labels are aligned to the speech signal using phoneme HMMs. Some of the phoneme labels are then transformed into phonetic ones by rule. Finally, human labelers inspect the aligned labels and correct them if necessary.
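The rule-based step that turns phoneme labels into phonetic ones can be illustrated with the palatalization labels from the inventory. The following is only a minimal sketch, assuming that a consonant with a "j" variant is relabeled whenever it directly precedes an /i/ label. The names `PALATALIZABLE`, `I_VOWELS`, and `palatalize` are hypothetical, and the actual CSJ rules are not given here and may differ.

```python
# Consonants that have a phonetically palatalized ("j") label in the inventory.
PALATALIZABLE = {"k", "g", "G", "@", "s", "z", "c", "n", "h"}

# Labels treated as /i/ in this sketch: voiced, devoiced, and long (assumption).
I_VOWELS = {"i", "I", "iH"}


def palatalize(labels):
    """Return a copy of labels with C -> Cj wherever C directly precedes an /i/ label."""
    out = list(labels)
    for idx in range(len(out) - 1):
        if out[idx] in PALATALIZABLE and out[idx + 1] in I_VOWELS:
            out[idx] = out[idx] + "j"
    return out


# Phoneme labels for "watashi wa"; the /s/ before /i/ becomes "sj".
print(palatalize(["w", "a", "t", "a", "s", "i", "w", "a"]))
```

The function leaves the input list untouched, so the phonemic and phonetic label sequences can be kept side by side for inspection.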

Automatic label generation and alignment reduced the labeling time to about 50%.

Here is an example of segmental labels. The sample material is "watashi wa ryokoH ga daisuki de" (I like traveling very much).


To evaluate the accuracy of segmental labeling, inter-labeler reliability was assessed. The difference in label boundary position was used as the index of inter-labeler agreement.
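As an illustration of the agreement index, the mean absolute boundary difference can be computed as follows. This is a sketch under the assumption that the two labelers' boundaries have already been paired one-to-one. The function name and the boundary times are hypothetical and are not taken from the CSJ data.

```python
def mean_abs_boundary_diff_ms(times_a, times_b):
    """Mean absolute difference (in ms) between paired boundary times given in seconds."""
    if len(times_a) != len(times_b):
        raise ValueError("boundary lists must be paired one-to-one")
    if not times_a:
        raise ValueError("at least one boundary pair is required")
    total = sum(abs(a - b) for a, b in zip(times_a, times_b))
    return 1000.0 * total / len(times_a)


# Hypothetical boundary times (seconds) from two labelers.
labeler_a = [0.100, 0.250, 0.400]
labeler_b = [0.105, 0.240, 0.400]
print(round(mean_abs_boundary_diff_ms(labeler_a, labeler_b), 2))
```

How boundaries are paired when the two labelers produce a different number of labels is a separate design decision that this sketch does not address.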

Sample #   Time [s]   Labeler A          Labeler B          Abs. diff. [ms]
                      total    merged    total    merged
1          22.4       298      29        282      36        7.78
2          24.2       348      30        299      36        7.76
3          22.9       312      31        291      41        7.70
4          33.6       408      25        378      45        8.34
5          22.2       275      19        261      25        6.76
6          32.0       396      33        350      56        11.46
7          19.2       253      17        237      26        8.37
8          28.3       307      24        306      26        9.73
9          23.0       301      22        275      35        7.03
whole      227.8      2898     230       2679     326       8.37
rate [%]              -        7.9       -        12.2

This table compares the labeling results of two expert labelers who labeled 9 short samples excerpted from the CSJ. The mean absolute difference is about 8 ms.
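The "rate [%]" row of the table appears to be the number of merged labels divided by the total number of labels for each labeler. The short check below reproduces it from the "whole" row. The reading of the row as merged/total is an assumption, and the function name `merged_rate` is hypothetical.

```python
# "whole" row of the table: (total labels, merged labels) per labeler.
whole = {"A": (2898, 230), "B": (2679, 326)}


def merged_rate(total, merged):
    """Merged labels as a percentage of all labels."""
    return 100.0 * merged / total


for labeler, (total, merged) in whole.items():
    print(labeler, round(merged_rate(total, merged), 1))
# prints "A 7.9" and "B 12.2", matching the rate row
```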

Compared with the mean difference reported for the ATR speech database, which consists of read word lists, this is not a bad result for the labeling of spontaneous speech.

Lastly, the accuracy of automatic labeling (without human correction) was evaluated. For this purpose, two HMM phoneme models were applied to three types of speech: read speech, spontaneous monologue, and spontaneous dialogue.

The first HMM model was trained on 40 hours of read speech taken from the JNAS corpus, which contains readings of "Mainichi" newspaper articles. The second was trained on 59 hours of the CSJ. The conditions of acoustic analysis are shown at the bottom of the slide.

Acoustic model: HMM phoneme model
Training data:
  Read speech          JNAS   40 [h]
  Spontaneous speech   CSJ    59 [h]
Conditions of acoustic analysis:
  Sampling rate        16 [kHz]
  Frame length         25 [ms]
  Frame shift          10 [ms]
  Feature parameters   12 MFCC + 12 ΔMFCC + ΔPower
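To make the analysis conditions concrete, the sketch below converts them into sample counts and a feature vector size. The constant and function names are hypothetical. The frame count assumes no padding at the end of the signal, which is an assumption and not something stated on the slide.

```python
SAMPLE_RATE_HZ = 16000   # 16 kHz sampling rate
FRAME_LENGTH_MS = 25     # analysis window length
FRAME_SHIFT_MS = 10      # hop between successive frames

frame_length = SAMPLE_RATE_HZ * FRAME_LENGTH_MS // 1000  # 400 samples per frame
frame_shift = SAMPLE_RATE_HZ * FRAME_SHIFT_MS // 1000    # 160 samples per shift

# 12 MFCC + 12 delta-MFCC + 1 delta-power coefficient per frame.
feature_dim = 12 + 12 + 1


def num_frames(num_samples):
    """Number of complete frames in a signal, assuming no padding (assumption)."""
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // frame_shift


print(frame_length, frame_shift, feature_dim, num_frames(SAMPLE_RATE_HZ))
# prints: 400 160 25 98
```

With a 10 ms shift, the time resolution of an aligned boundary is bounded by the frame grid, which is worth keeping in mind when reading boundary differences of around 8 to 20 ms.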

Target data                 Std. dev. of diff. [ms]   Acoustic model
ATR-DB                      18.63                     JNAS-mono
APS of CSJ                  20.50                     JNAS-mono
SPS of CSJ                  21.60                     JNAS-mono
All of CSJ                  21.72                     CSJ-mono
Dialogue ([Osuga 2001])     29.4                      JNAS-mono
Dialogue ([Mera 2001])      28.19                     JNAS-mono
Dialogue ([Wightman 95])    22.7                      monophone

Difficulty of segmental labeling: Read < CSJ < Dialogue
Training data: Read > CSJ

This table shows that the performance of automatic labeling on CSJ spontaneous speech is intermediate between that on read speech (the ATR database) and that on spontaneous dialogue (as reported by Osuga, Mera, and Wightman). Among the CSJ samples, SPS was more difficult than APS.
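The measure in the table, the standard deviation of the difference between automatic and reference boundaries, can be sketched as follows. The function name and the boundary values are hypothetical. The use of signed differences and of the population standard deviation are both assumptions, since the exact definition is not stated here.

```python
import statistics


def boundary_diff_std_ms(auto_ms, manual_ms):
    """Population standard deviation of signed boundary differences, in ms."""
    if len(auto_ms) != len(manual_ms):
        raise ValueError("boundary lists must be paired one-to-one")
    diffs = [a - m for a, m in zip(auto_ms, manual_ms)]
    return statistics.pstdev(diffs)


# Hypothetical boundary positions (ms): automatic alignment vs. manual labels.
auto = [100, 260, 395, 520]
manual = [110, 250, 400, 500]
print(boundary_diff_std_ms(auto, manual))
```

A lower value means that automatic boundaries scatter less around the manual ones, which is how the ordering Read < CSJ < Dialogue can be read from the table.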

It is also interesting that the acoustic model trained on CSJ data did not perform better than the model trained on JNAS data.

Presumably, this is because acoustic boundaries in read speech are clearer than those in spontaneous speech, so the read-speech model could detect boundary features more accurately than the spontaneous-speech model.
