TSV format data can be found under the LUW and SUW directories on Disc 2, divided into further directories by sub-corpus. Each directory contains a compressed file, which when extracted contains tabular data regarding each sub-corpus. However, for the sub-corpora with extremely large amounts of data (LB and PB), the data will be segmented into multiple files.
The TSV data contains the previously discussed morphological information in delimited tabular format, and is based on the BCCWJ online service "Chunagon". There are tables containing information on both short and long unit words, which are further divided by sub-corpus. The text data is in UTF-8 format (without BOM).
The SUW and LUW tables are provided separately, so some of the information is duplicated between them.
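As a rough illustration of reading the tabular data described above, the following Python sketch iterates over the records of one extracted file. The function name `read_tsv_records` is hypothetical, and the sketch assumes the file has already been decompressed and that quotation marks in the corpus text should be treated as ordinary characters; neither point is specified by this document.

```python
import csv
from pathlib import Path
from typing import Iterator, List


def read_tsv_records(path: Path) -> Iterator[List[str]]:
    """Yield each record (line) of an extracted TSV file as a list of field strings."""
    # The data is UTF-8 without BOM, so plain "utf-8" is used.
    # newline="" lets the csv module handle line endings itself.
    with open(path, encoding="utf-8", newline="") as handle:
        # Assumption: quote characters are literal text, not field quoting.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row
```

Because the function is a generator, large sub-corpora such as LB and PB can be processed one record at a time without loading a whole file into memory.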
The contents of the TSV data fields regarding Short Unit Words are listed (from left to right) in the table "Short Unit Word TSV Fields". One short unit word corresponds to a single record (line).
Field Name | Notes |
---|---|
Subcorpus Name | |
Sample ID | |
Character Starting Position | The offset from the head of the original sample where the SUW is found (in increments of 10). |
Character Ending Position | |
Sequence Number | The position of the word within a long unit word (increments of 10). |
Surface Form Start Position | Offset of the start/end positions of the surface and infinitive forms from the sample's head (increments of 10) |
Surface Form End Position | |
Fixed Length Flag | 0: Not fixed length, 1: Fixed length |
Variable Length Flag | 0: Not variable length, 1: Variable length |
Sentence Head Label | B: Is a sentence head, I: Not a sentence head |
Word List ID | An ID identifying the word at the level of surface or infinitive form (as the number of digits is very large the bigint format is used) |
Lexeme ID | An ID corresponding to UNIDIC lexemes. |
Lexeme | Short unit word information. |
Lexeme Reading | |
Detailed Lexeme Classification | |
Word Type | |
Part of Speech | |
Inflectional Pattern | |
Inflectional Form | |
Word Form | |
Usage | |
Infinitive Form | |
Infinitive Form and Surface Form | |
Original Character String | |
Surface Pronunciation | |
The "Sentence Head Label" field is set to "B" for the word at the starting point of a "sentence" tag in C-XML.
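The sentence head label can be used to regroup short unit word records into sentences. The Python sketch below is one possible way to do this, not a method prescribed by the corpus. It assumes that the label is the tenth column (index 9), following the left-to-right order of the field table, and the function name `group_into_sentences` is hypothetical.

```python
from typing import Iterable, List, Sequence

# Assumption: "Sentence Head Label" is the 10th field of a SUW record.
SENTENCE_HEAD_INDEX = 9


def group_into_sentences(
    records: Iterable[Sequence[str]],
    label_index: int = SENTENCE_HEAD_INDEX,
) -> List[List[Sequence[str]]]:
    """Split a stream of records into sentences, starting a new one at each "B"."""
    sentences: List[List[Sequence[str]]] = []
    for record in records:
        # Start a new sentence on "B"; also on the very first record,
        # in case the input begins in the middle of a sentence.
        if record[label_index] == "B" or not sentences:
            sentences.append([])
        sentences[-1].append(record)
    return sentences
```

The sketch does not reset at sample boundaries; if records from several samples are mixed, grouping by Sample ID first would be a reasonable extra step.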
The difference between the "Character starting position" and the "Surface form starting position" fields corresponds with the aforementioned "Original character string" and "Infinitive form and Surface form" fields. The "Original character string" included in the short unit word information is the character string before the conversion of any numerical characters. If the conversion of numerical characters results in a segmented character string, the fields are assigned as shown in the table "Examples of interactions based on the start point of the conversion of a string of numerical characters".
Character Starting Position | Character Ending Position | Sequence Number | Surface Form Start Position | Surface Form End Position | Infinitive Form and Surface Form | Original Character String |
---|---|---|---|---|---|---|
10 | 50 | 10 | 10 | 30 | 二千 | 2011 |
10 | 50 | 20 | 30 | 40 | 十 | 2011 |
10 | 50 | 30 | 40 | 50 | 一 | 2011 |
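
The relationship between the two kinds of offsets in the example above can be checked with a short Python sketch. It reads "increments of 10" as 10 units per character, which is an inference from the example rows rather than something this document states outright. The helper `span_length` and the tuple layout are hypothetical.

```python
# Rows copied from the example table:
# (char start, char end, sequence number,
#  surface start, surface end, surface form, original string)
rows = [
    (10, 50, 10, 10, 30, "二千", "2011"),
    (10, 50, 20, 30, 40, "十", "2011"),
    (10, 50, 30, 40, 50, "一", "2011"),
]

UNIT = 10  # assumption: each character advances an offset by 10


def span_length(start: int, end: int, unit: int = UNIT) -> int:
    """Number of characters covered by an offset range."""
    return (end - start) // unit


for char_start, char_end, _seq, surf_start, surf_end, surface, original in rows:
    # The character positions span the unconverted string "2011" ...
    assert span_length(char_start, char_end) == len(original)
    # ... while the surface positions span the converted segment.
    assert span_length(surf_start, surf_end) == len(surface)

# Joining the segments in sequence-number order gives the converted string.
converted = "".join(row[5] for row in sorted(rows, key=lambda row: row[2]))
print(converted)  # 二千十一
```

All three records share the same character positions because they come from one original string, while the surface positions advance segment by segment.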
The contents of the TSV data fields regarding Long Unit Words are listed (from left to right) in the table "Long Unit Word TSV Fields". One long unit word corresponds to a single record (line).
Field Name | Notes |
---|---|
Subcorpus Name | |
Sample ID | |
Surface Form Start Point | The offset from the head of the infinitive and surface forms in the original sample where the LUW is found (in increments of 10). |
Surface Form End Point | |
Clause | B: Is a clause, Empty space: Not a clause |
Short/Long Differentiation Flag | Indicates if the scope of a SUW and LUW match. 0: Short and long match, 1: Short and long differ. |
Fixed Length Flag | 0: Not fixed length, 1: Fixed length |
Variable Length Flag | 0: Not variable length, 1: Variable length |
Lexeme | Long Unit Word information. |
Lexeme Reading | |
Word Type | |
Part of Speech | |
Inflectional Pattern | |
Inflected Form | |
Word Form | |
Infinitive Form | |
Infinitive and Surface Form | |
Original Character String | |
Surface Pronunciation | |
Sequence Number | The ordering of the long unit word within a sample (increments of 10). |
Character Start Point | Offset of the character from the head of the original character string (increments of 10). |
Character End Point | |
Sentence Head Label | B: Sentence head, I: Not a sentence head. |
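
To make the left-to-right field order of a long unit word record easier to work with, the Python sketch below maps one line onto named fields. The snake_case keys are hypothetical names derived from the table, and the sketch assumes that every record has exactly these 23 tab-separated fields and that the line is not a header row.

```python
from typing import Dict

# Hypothetical keys, in the left-to-right order of the field table.
LUW_FIELDS = (
    "subcorpus_name", "sample_id", "surface_form_start", "surface_form_end",
    "clause", "short_long_differ", "fixed_length", "variable_length",
    "lexeme", "lexeme_reading", "word_type", "part_of_speech",
    "inflectional_pattern", "inflected_form", "word_form", "infinitive_form",
    "infinitive_and_surface_form", "original_character_string",
    "surface_pronunciation", "sequence_number", "character_start",
    "character_end", "sentence_head_label",
)


def parse_luw_record(line: str) -> Dict[str, str]:
    """Map one tab-separated LUW line onto the field names above."""
    # Strip only the line ending so that empty trailing fields survive.
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(LUW_FIELDS):
        raise ValueError(
            f"expected {len(LUW_FIELDS)} fields, got {len(values)}"
        )
    return dict(zip(LUW_FIELDS, values))
```

All values are kept as strings; offsets and flags can be converted with `int()` where needed, for example `int(record["character_start"])`.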