The structure of the CSJ-RDB
The CSJ-RDB is composed of the following three types of databases.
Basic database (csj.db)
The basic database is composed of the following five types of tables:
Figure 1. Basic database data representation scheme
Segment tables
Each unit in Figure 1 is a segment table describing a particular element of the discourse. The following information is common to all segment tables:
Column name | Description | Example |
TalkID | ID of a talk | A01F0055 |
ClauseID, BunsetsuID, SUWID etc. | ID of individual unit | 00720909L |
StartTime | Unit start time | 720.909 |
EndTime | Unit end time | 721.369 |
Channel | Speaker label | L |
Together, these columns uniquely identify where a unit occurs. Some elements, such as tone information (prosodic labels for accents and phrase-final tones), occur at a single moment, so their start and end times coincide. Such elements are stored in point tables, which contain the following information:
Column name | Description | Example |
TalkID | ID of a talk | A01F0055 |
ToneID | ID of individual unit | 00720909L |
Time | Time the element occurs | 720.909 |
Channel | Speaker label | L |
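The common columns above can be used to locate a point element inside a segment. The following is a minimal sketch, assuming the .db files are SQLite databases (the text does not say so). The table names segDemo and pointDemo, the generic UnitID column, and every row other than the example values shown above are hypothetical toy data.

```python
import sqlite3

# Toy in-memory database. segDemo and pointDemo are hypothetical names that
# only mirror the common columns of segment tables and point tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE segDemo (
    TalkID TEXT, UnitID TEXT, StartTime REAL, EndTime REAL, Channel TEXT
);
CREATE TABLE pointDemo (
    TalkID TEXT, ToneID TEXT, Time REAL, Channel TEXT
);
-- First row uses the example values from the tables above; the rest are invented.
INSERT INTO segDemo VALUES ('A01F0055', '00720909L', 720.909, 721.369, 'L');
INSERT INTO segDemo VALUES ('A01F0055', '00721369L', 721.369, 721.800, 'L');
INSERT INTO pointDemo VALUES ('A01F0055', '00721100L', 721.100, 'L');
""")


def segments_containing_points(conn):
    """Return (ToneID, UnitID) pairs where the point falls inside the segment."""
    return conn.execute("""
        SELECT p.ToneID, s.UnitID
        FROM pointDemo AS p
        JOIN segDemo AS s
          ON s.TalkID = p.TalkID
         AND s.Channel = p.Channel
         AND p.Time >= s.StartTime
         AND p.Time < s.EndTime
        ORDER BY p.Time
    """).fetchall()


print(segments_containing_points(conn))
```

The join matches on TalkID and Channel first, because a time value is only meaningful within one talk and one speaker channel.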
In addition to the information common to all segment tables described above, there is also information specific to the different elements. For more information on the specific data contained in each table, please see the following:
See: Individual segment table information list
Unaligned-segment Tables
In spontaneous speech, multiple words often fuse into a single element that cannot be split. For example, "Boku wa" (the pronoun "Boku", meaning "I", followed by the topicalizing particle "Wa") may be fused and pronounced "Boka". The morphological information (long and short unit words) still describes "Boku" and "Wa" separately. However, because the two were fused in speech, the start and end times of the individual short unit words cannot be identified. The segment table therefore records the start and end times of the fused unit but lacks detailed information such as part of speech. The unaligned-segment table, in which "Boku" and "Wa" are kept separate, holds that part-of-speech information but has no start or end times.
Figure 2. Relationship between segments and unaligned segments in cases where a word has been uttered by fusing short units.
All unaligned-segment tables contain the following four attributes:
Column name | Description | Example |
TalkID | ID of a talk | A01F0055 |
SUWMorphID, LUWMorphID | ID of each unaligned (fused) element | 00720909L |
In addition to this common information, there is also information specific to each table. Please see the following for further information:
See: Individual unaligned-segment table information list
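The "Boka" case can be illustrated with a small sketch, again assuming SQLite. Only the column names TalkID, SUWID, SUWMorphID, SUWLemma and SUWPOS come from this document. The table names segSUWDemo, usegSUWMorphDemo and relDemo, the way the two tables are linked, and all row values other than the example ID and times are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical segment table: the fused unit has times but no part of speech.
CREATE TABLE segSUWDemo (
    TalkID TEXT, SUWID TEXT, StartTime REAL, EndTime REAL, Channel TEXT
);
-- Modeled on usegSUWMorph: part of speech per word, but no times.
CREATE TABLE usegSUWMorphDemo (
    TalkID TEXT, SUWMorphID TEXT, SUWLemma TEXT, SUWPOS TEXT
);
-- Hypothetical relation table; the real linking mechanism is an assumption.
CREATE TABLE relDemo (
    TalkID TEXT, SUWID TEXT, SUWMorphID TEXT, nth INTEGER
);
INSERT INTO segSUWDemo VALUES ('A01F0055', '00720909L', 720.909, 721.369, 'L');
INSERT INTO usegSUWMorphDemo VALUES ('A01F0055', 'M1', '僕', '代名詞');  -- "Boku", pronoun
INSERT INTO usegSUWMorphDemo VALUES ('A01F0055', 'M2', 'は', '助詞');    -- "Wa", particle
INSERT INTO relDemo VALUES ('A01F0055', '00720909L', 'M1', 1);
INSERT INTO relDemo VALUES ('A01F0055', '00720909L', 'M2', 2);
""")


def describe_fused_unit(conn, talk_id, suw_id):
    """Return (StartTime, EndTime, SUWLemma, SUWPOS) for each word in a fused unit."""
    return conn.execute("""
        SELECT s.StartTime, s.EndTime, m.SUWLemma, m.SUWPOS
        FROM segSUWDemo AS s
        JOIN relDemo AS r
          ON r.TalkID = s.TalkID AND r.SUWID = s.SUWID
        JOIN usegSUWMorphDemo AS m
          ON m.TalkID = r.TalkID AND m.SUWMorphID = r.SUWMorphID
        WHERE s.TalkID = ? AND s.SUWID = ?
        ORDER BY r.nth
    """, (talk_id, suw_id)).fetchall()


for row in describe_fused_unit(conn, "A01F0055", "00720909L"):
    print(row)
```

Both result rows carry the same start and end time, which reflects the point made above: the timing belongs to the fused unit as a whole, while the part of speech belongs to each unaligned word.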
Parent-child relationship table
Parent-child relationship tables represent the hierarchical relationships shown in Figure 1 as pairs of IDs. For example, as seen in Figure 3, the "clause unit table" has a corresponding "Bunsetsu table". Because these two unit types are in a parent-child relationship, a table is included that records the relationship between them.
The following information is common to all parent-child relationship tables, as seen in Figure 3.
Column name | Description | Example |
TalkID | ID of a talk | A01F0055 |
ClauseID, BunsetsuID, SUWID etc. | ID of the parent (ancestor) segment | 00262895L (ID of the parent clause unit in Fig. 3) |
ClauseID, BunsetsuID, SUWID etc. | ID of the child (descendant) segment | 00263769L (ID of a child clause in Fig. 3) |
len | The total number of child segments in the parent segment | 4 (in the case where a parent unit consists of four child units) |
nth | Position of the child segment in the parent segment | 3 (the third segment from the beginning) |
Since clause units and short unit words also share an ancestor-descendant relationship, a number of similar tables may be created. On the other hand, since clause units and accentual phrases are not in a parent-child relationship, no such table is created for them (see Figure 1).
Parent-child relationship tables facilitate analyses that involve more than one unit type. You could, for example, run searches that retrieve the length of the final clause unit, or that return the clause unit found 10 units before it. Because other types of relationships are also described, you could also find, for example, the prefix of the short unit word in the final clause of the final clause unit.
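The len and nth columns make positional lookups straightforward: the last child of a parent is the row where nth equals len. The sketch below assumes SQLite and uses a hypothetical table name, relClause2BunsetsuDemo. The parent ID, the third child ID, len and nth follow the examples in the table above; the other child IDs are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical relation table with the common parent-child columns.
conn.execute("""
    CREATE TABLE relClause2BunsetsuDemo (
        TalkID TEXT, ClauseID TEXT, BunsetsuID TEXT, len INTEGER, nth INTEGER
    )
""")
rows = [
    ("A01F0055", "00262895L", "00262895L", 4, 1),  # invented child ID
    ("A01F0055", "00262895L", "00263301L", 4, 2),  # invented child ID
    ("A01F0055", "00262895L", "00263769L", 4, 3),  # example from the table above
    ("A01F0055", "00262895L", "00264120L", 4, 4),  # invented child ID
]
conn.executemany(
    "INSERT INTO relClause2BunsetsuDemo VALUES (?, ?, ?, ?, ?)", rows
)


def nth_child(conn, talk_id, parent_id, position):
    """Return the child ID at a 1-based position, or None if there is none."""
    row = conn.execute(
        "SELECT BunsetsuID FROM relClause2BunsetsuDemo "
        "WHERE TalkID = ? AND ClauseID = ? AND nth = ?",
        (talk_id, parent_id, position),
    ).fetchone()
    return row[0] if row else None


def last_child(conn, talk_id, parent_id):
    """Return the final child of a parent: the row where nth equals len."""
    row = conn.execute(
        "SELECT BunsetsuID FROM relClause2BunsetsuDemo "
        "WHERE TalkID = ? AND ClauseID = ? AND nth = len",
        (talk_id, parent_id),
    ).fetchone()
    return row[0] if row else None


print(nth_child(conn, "A01F0055", "00262895L", 3))
print(last_child(conn, "A01F0055", "00262895L"))
```

Storing len on every row is redundant, but it lets "final child" and "k-th from the end" queries run without a subquery or aggregate.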
Link tables
Relationships between units other than the parent-child relationship are also represented. For example, bunsetsu dependency describes the relationship between two bunsetsu.
Link tables are used in the basic database to describe relationships other than parent-child. There are two types of link table in the basic database: "Bunsetsu dependency tables" and "Tone inheritance tables". The latter describe which accentual phrase each tone label (such as accent labels and other prosodic labels) belongs to.
The following information is contained in all link tables:
Column name | Description | Example |
TalkID | ID of a talk | A01F0055 |
Source ID (for example BunsetsuID) | ID of the source segment | 00358705L |
Target ID (for example ModifieeBunsetsuID) | ID of the linked segment | 00359291L |
The specific names of the link/source IDs are different in each table type. For more details, see the following:
See: Individual link table information list
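As an illustration of how a link table can be traversed, the sketch below follows bunsetsu dependencies from one bunsetsu to the end of its chain. It assumes SQLite and builds a toy copy of linkDepBunsetsu with the columns TalkID, BunsetsuID, ModifieeBunsetsuID and Dep_Label listed in the table details. Only the first row uses the documented example values; the others are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy copy of the bunsetsu dependency link table (subset of its columns).
conn.execute("""
    CREATE TABLE linkDepBunsetsu (
        TalkID TEXT, BunsetsuID TEXT, ModifieeBunsetsuID TEXT, Dep_Label TEXT
    )
""")
conn.executemany(
    "INSERT INTO linkDepBunsetsu VALUES (?, ?, ?, ?)",
    [
        ("S01F0001", "00000676L", "00001131L", "D"),  # documented example row
        ("S01F0001", "00001131L", "00002050L", "D"),  # invented
        ("S01F0001", "00001500L", "00002050L", "D"),  # invented
    ],
)


def dependency_chain(conn, talk_id, bunsetsu_id):
    """Follow source -> modifiee links until no further link exists."""
    chain = [bunsetsu_id]
    seen = {bunsetsu_id}
    while True:
        row = conn.execute(
            "SELECT ModifieeBunsetsuID FROM linkDepBunsetsu "
            "WHERE TalkID = ? AND BunsetsuID = ?",
            (talk_id, chain[-1]),
        ).fetchone()
        # Stop at the end of the chain, or if the data would loop back.
        if row is None or row[0] is None or row[0] in seen:
            break
        chain.append(row[0])
        seen.add(row[0])
    return chain


def modifiers_of(conn, talk_id, modifiee_id):
    """Return the IDs of all bunsetsu that link to the given bunsetsu."""
    rows = conn.execute(
        "SELECT BunsetsuID FROM linkDepBunsetsu "
        "WHERE TalkID = ? AND ModifieeBunsetsuID = ? ORDER BY BunsetsuID",
        (talk_id, modifiee_id),
    ).fetchall()
    return [r[0] for r in rows]


print(dependency_chain(conn, "S01F0001", "00000676L"))
print(modifiers_of(conn, "S01F0001", "00002050L"))
```

The `seen` set is a defensive guard: it keeps the loop from running forever if the data ever contained a circular link.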
Meta-data tables
There are a number of different meta-data tables included in the basic database. "Basic discourse information" contains basic information about the recording. "Basic speaker information" contains data on the speaker(s) in the recording. "Dialogue information" contains information such as the ID of the recording that was the topic of the dialogue (or interview), and information on the interviewer. "Re-reading information" contains information such as the ID of the original subject of the re-reading. Finally, "Individual impressionistic rating information" and "Grouped impressionistic rating information" give information about subjective impressions of certain discourses.
See: Individual meta-data table information list
Table information details
■ Unaligned-segment Tables
* Short unit words (usegSUWMorph)
Column name | Description | Example |
OrthographicTranscription | Surface form (SUW) | (M 行き) |
PlainOrthographicTranscription | Untagged surface form (SUW) | 行き |
SUWDictionaryForm | Dictionary form (SUW) | イク |
SUWLemma | Representative lemma (SUW) | 行く |
PhoneticTranscription | Phonetic transcription (SUW) | イキ |
SUWPOS | Part of speech (SUW) | 動詞 (Verb) |
SUWConjugateType2 | Inflection type (SUW) | カ行五段2 (-k verb type) |
SUWConjugateForm2 | Conjugated form (SUW) | 連用形2 (Conjunctive form) |
SUWMiscPOSInfo1 | Other information 1 (SUW) | 副助詞 (Adverbial particle) |
SUWMiscPOSInfo2 | Other information 2 (SUW) | 語幹 (Stem) |
SUWMiscPOSInfo3 | Other information 3 (SUW) | 言いよどみ (Hesitation) |
ClauseBoundaryLabel | Label for clause boundaries | <テ節> |
CU_preBracket | Brackets before a clause unit | << |
CU_postBracket | Brackets after a clause unit | >> |
CU_OperationSign | Operator symbol for the clause unit | - |
CU_ObligateComment | Necessary comments on the clause unit | 体言止め (Sentence-final NP) |
References for the clause-related columns (all in 『日本語話し言葉コーパスの構築法』 ("Construction of the Corpus of Spontaneous Japanese"), Chapter 5 "Clause unit information"):
ClauseBoundaryLabel: Section 5.2.3 "Types of clause boundaries detected by CBAP-csj", pp. 267-269; Figure 5.5 "The 49 clause boundary labels detected by CBAP-csj", p. 267.
CU_preBracket: Section 5.4.1 "Overview of manual correction work", pp. 292-293; Table 5.4 "List of manual correction operation symbols", opening brackets among the "range symbols", p. 293.
CU_postBracket: Section 5.4.1 "Overview of manual correction work", pp. 292-293; Table 5.4 "List of manual correction operation symbols", closing brackets among the "range symbols", p. 293.
CU_OperationSign: Section 5.4.1 "Overview of manual correction work", pp. 292-293; Table 5.4 "List of manual correction operation symbols", "cut symbols" and "join symbols", p. 293.
CU_ObligateComment: Section 5.4.2 "Classification of items handled in manual correction", pp. 293-294; Figure 5.11 "List of items subject to manual correction and their frequency in the 177 Core talks", p. 294.
* Long unit words (usegLUWMorph)
Column name | Description | Example |
LUWDictionaryForm | Dictionary form (LUW) | イク |
LUWLemma | Representative lemma (LUW) | 行く |
LUWPOS | Part of speech (LUW) | 動詞 (Verb) |
LUWConjugateType | Inflection type (LUW) | カ行五段 (-k verb type) |
LUWConjugateForm | Conjugated form (LUW) | 連用形 (Conjunctive form) |
LUWMiscPOSInfo1 | Other information 1 (LUW) | 格助詞 (Case-marking particle) |
LUWMiscPOSInfo2 | Other information 2 (LUW) | 促音便 (Geminate sound change) |
LUWMiscPOSInfo3 | Other information 3 (LUW) | 連語 (Compound word) |
■ Link Tables
* Bunsetsu dependencies (linkDepBunsetsu)
Column name | Description | Example |
TalkID | The discourse ID | S01F0001 |
BunsetsuID | ID of the linking (modifier) bunsetsu | 00000676L |
ModifieeBunsetsuID | ID of the modified bunsetsu | 00001131L |
Dep_Label | Dependency label | D |
Dep_ObligateComment | Obligatory comment on the dependency | F |
* Tone inheritance (linkTone2AP)
Column name | Description | Example |
TalkID | The discourse ID | S01F0001 |
APID | ID of the linked accentual phrase | 00005551L |
Syntactic information subset database (csj_syn.db)
The syntactic information subset database is based on the basic database, but contains only the syntactic information shown in Figure 4. The details of the tables are the same as in the basic database.
Figure 4. Syntactic information subset database data representation scheme
Acoustic information database (csj_ac.db)
The acoustic information database contains the following two tables. When combined with csj.db and csj_syn.db, it is possible to extract the F0 value and power level at a specified location.
* F0 value table (pointF0)
Column name | Description | Example |
TalkID | The discourse ID | S01F0001 |
Channel | Speaker label | L |
F0ID | ID of the F0 extraction point | 34 |
F0Val | F0 value (extracted with ESPS; used at the time of prosodic labelling) | 294.523 |
* Power information table (pointPwr)
Column name | Description | Example |
TalkID | The discourse ID | S01F0001 |
Channel | Speaker label | L |
PwrID | ID of the power extraction point | 15 |
PwrVal | Power value (extracted with WaveSurfer) | 37.703727722168 |
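One way to combine the acoustic database with the basic database, assuming both are SQLite files, is to attach one to the other on a single connection. The sketch below uses in-memory stand-ins so that it runs without the corpus. With the real files the connection would presumably be opened on csj.db and csj_ac.db attached by path. The second F0 row is invented, and the sketch does not attempt to map F0ID or PwrID to a time, since that mapping is not described here.

```python
import sqlite3

# In-memory stand-ins: the main connection plays the role of csj.db and the
# attached schema "ac" plays the role of csj_ac.db (both are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS ac")
conn.executescript("""
CREATE TABLE ac.pointF0 (TalkID TEXT, Channel TEXT, F0ID INTEGER, F0Val REAL);
CREATE TABLE ac.pointPwr (TalkID TEXT, Channel TEXT, PwrID INTEGER, PwrVal REAL);
INSERT INTO ac.pointF0 VALUES ('S01F0001', 'L', 34, 294.523);   -- documented example
INSERT INTO ac.pointF0 VALUES ('S01F0001', 'L', 35, 290.477);   -- invented
INSERT INTO ac.pointPwr VALUES ('S01F0001', 'L', 15, 37.703727722168);
""")


def mean_f0(conn, talk_id, channel):
    """Return the mean F0 value for one talk and channel, or None if no rows."""
    row = conn.execute(
        "SELECT AVG(F0Val) FROM ac.pointF0 WHERE TalkID = ? AND Channel = ?",
        (talk_id, channel),
    ).fetchone()
    return row[0]


def power_at(conn, talk_id, channel, pwr_id):
    """Return the power value stored for one extraction point, or None."""
    row = conn.execute(
        "SELECT PwrVal FROM ac.pointPwr "
        "WHERE TalkID = ? AND Channel = ? AND PwrID = ?",
        (talk_id, channel, pwr_id),
    ).fetchone()
    return row[0] if row else None


print(mean_f0(conn, "S01F0001", "L"))
print(power_at(conn, "S01F0001", "L", 15))
```

Attaching keeps the acoustic tables in their own schema (`ac.`), so they can be joined with tables from the basic database in one query without copying data.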