TSV Data Details

TSV-format data is located under the LUW and SUW directories on Disc 2, which are subdivided into directories by sub-corpus. Each sub-corpus directory contains a compressed file which, when extracted, yields the tabular data for that sub-corpus. For the sub-corpora with extremely large amounts of data (LB and PB), the data is split across multiple files.

The TSV data contains the previously discussed morphological information in delimited tabular format, and is based on the BCCWJ online service "Chunagon". Separate tables are provided for short unit words and long unit words, each further divided by sub-corpus. The text data is in UTF-8 format (without BOM).

The TSV data for SUWs and LUWs is partly redundant, and each is available individually.

Short Unit Word TSV Fields

The contents of the TSV data fields regarding Short Unit Words are listed (from left to right) in the table "Short Unit Word TSV Fields". One short unit word corresponds to a single record (line).

Short Unit Word TSV Fields

Field Name	Notes
Subcorpus Name
Sample ID
Character Starting Position	The offset from the head of the original sample where the SUW is found (in increments of 10).
Character Ending Position
Sequence Number	The position of the word within a long unit word (increments of 10).
Surface Form Start Position	The start and end offsets of the infinitive and surface form from the head of the sample (in increments of 10).
Surface Form End Position
Fixed Length Flag	0: Not fixed length, 1: Fixed length
Variable Length Flag	0: Not variable length, 1: Variable length
Sentence Head Label	B: Is a sentence head, I: Not a sentence head
Word List ID	An ID identifying the word at the level of surface or infinitive form. Because the number of digits is very large, the bigint format is used.
Lexeme ID	An ID corresponding to UNIDIC lexemes.
Lexeme	Short unit word information.
Lexeme Reading
Detailed Lexeme Classification
Word Type
Part of Speech
Inflectional Pattern
Inflectional Form
Word Form
Usage
Infinitive Form
Infinitive Form and Surface Form
Original Character String
Surface Pronunciation
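
As a rough illustration of the field order listed above, the following Python sketch maps each SUW record to a dictionary. The snake_case keys and the names `SUW_FIELDS` and `read_suw` are hypothetical labels chosen for this example, not identifiers from the corpus, and the sketch assumes the extracted files have no header row and exactly 25 tab-separated fields per line.

```python
import csv

# Hypothetical snake_case keys for the 25 SUW fields, in left-to-right order.
SUW_FIELDS = [
    "subcorpus", "sample_id", "char_start", "char_end", "sequence_number",
    "surface_start", "surface_end", "fixed_length_flag",
    "variable_length_flag", "sentence_head", "word_list_id", "lexeme_id",
    "lexeme", "lexeme_reading", "lexeme_subclass", "word_type", "pos",
    "inflection_pattern", "inflection_form", "word_form", "usage",
    "infinitive_form", "infinitive_and_surface_form", "original_string",
    "surface_pronunciation",
]

# Offsets and sequence numbers are documented as increments of 10,
# so they are converted to integers; everything else stays a string.
SUW_INT_FIELDS = ("char_start", "char_end", "sequence_number",
                  "surface_start", "surface_end")


def read_suw(lines):
    """Yield one dict per SUW record from an iterable of TSV lines."""
    # QUOTE_NONE: assume quote characters in the text are literal data.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(SUW_FIELDS):
            raise ValueError(
                f"line {reader.line_num}: expected {len(SUW_FIELDS)} "
                f"fields, got {len(row)}"
            )
        record = dict(zip(SUW_FIELDS, row))
        for name in SUW_INT_FIELDS:
            record[name] = int(record[name])
        yield record
```

Since the data is UTF-8 without BOM, a file object opened with `open(path, encoding="utf-8", newline="")` can be passed directly as `lines`, where `path` is whatever extracted file is being read.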

The "Sentence Head Label" field is "B" at the starting point of a "sentence" tag in C-XML, and "I" otherwise.
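
The B/I labelling can be used to recover sentence boundaries. The sketch below is one possible approach, assuming the records are dictionaries from a single sample in file order and that the label is stored under a hypothetical key `sentence_head`.

```python
def group_sentences(records):
    """Group SUW records into sentences using the Sentence Head Label.

    A record labelled "B" opens a new sentence; records labelled "I"
    are appended to the sentence currently being built.
    """
    sentences = []
    for record in records:
        # Also open a sentence if the input happens to start with "I".
        if record["sentence_head"] == "B" or not sentences:
            sentences.append([])
        sentences[-1].append(record)
    return sentences
```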
The difference between the "Character Starting Position" and "Surface Form Start Position" fields corresponds to the difference between the aforementioned "Original Character String" and "Infinitive Form and Surface Form" fields. The "Original Character String" included in the short unit word information is the character string before the conversion of any numerical characters. If the conversion of numerical characters results in a character string that is segmented into several short unit words, the fields are assigned as shown in the table "Examples of interactions based on the start point of a string of converted numerical characters".

Examples of interactions based on the start point of a string of converted numerical characters

Character Starting Position	Character Ending Position	Sequence Number	Surface Form Start Position	Surface Form End Position	Infinitive Form and Surface Form	Original Character String
10	50	10	10	30	二千	2011
10	50	20	30	40	十	2011
10	50	30	40	50	一	2011
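
The three rows above can be merged back into a single unit by grouping on the shared character span. The following Python sketch uses only the values from the table; the function names are hypothetical. In this example each span length divided by 10 equals the number of characters in the corresponding string, which suggests one character per increment of 10, but that is an inference from the table rather than a stated rule.

```python
from itertools import groupby

# Rows copied from the table above:
# (char_start, char_end, sequence_number,
#  surface_start, surface_end, surface_form, original_string)
ROWS = [
    (10, 50, 10, 10, 30, "二千", "2011"),
    (10, 50, 20, 30, 40, "十", "2011"),
    (10, 50, 30, 40, 50, "一", "2011"),
]


def char_count(start, end):
    """Span length in characters, ASSUMING one character per step of 10."""
    return (end - start) // 10


def merge_converted(rows):
    """Merge consecutive rows that share a character span and original string."""
    merged = []
    key = lambda r: (r[0], r[1], r[6])
    for (start, end, original), group in groupby(rows, key=key):
        parts = sorted(group, key=lambda r: r[3])  # order by surface start
        converted = "".join(r[5] for r in parts)
        merged.append((start, end, original, converted))
    return merged


print(merge_converted(ROWS))  # [(10, 50, '2011', '二千十一')]
```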

Long Unit Word TSV Fields

The contents of the TSV data fields regarding Long Unit Words are listed (from left to right) in the table "Long Unit Word TSV Fields". One long unit word corresponds to a single record (line).

Long Unit Word TSV Fields

Field Name	Notes
Subcorpus Name
Sample ID
Surface Form Start Point	The offset of the LUW's infinitive and surface form from the head of the original sample (in increments of 10).
Surface Form End Point
Clause	B: Is a clause, Empty space: Not a clause
Short/Long Differentiation Flag	Indicates whether the scopes of the SUW and the LUW match. 0: Short and long match, 1: Short and long differ.
Fixed Length Flag	0: Not fixed length, 1: Fixed length
Variable Length Flag	0: Not variable length, 1: Variable length
Lexeme	Long Unit Word information.
Lexeme Reading
Word Type
Part of Speech
Inflectional Pattern
Inflected Form
Word Form
Infinitive Form
Infinitive and Surface Form
Original Character String
Surface Pronunciation
Sequence Number	The ordering of the long unit word within a sample (increments of 10).
Character Start Point	The offset of the characters from the head of the original character string (in increments of 10).
Character End Point
Sentence Head Label	B: Sentence head, I: Not a sentence head.
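
Note that the LUW column order differs from the SUW order: the sequence number, character positions and sentence head label come last. The Python sketch below reflects that order and picks out the long unit words whose scope differs from the short unit segmentation. The snake_case keys and function names are hypothetical, and the sketch assumes no header row and exactly 23 tab-separated fields per line.

```python
import csv

# Hypothetical snake_case keys for the 23 LUW fields, in left-to-right order.
LUW_FIELDS = [
    "subcorpus", "sample_id", "surface_start", "surface_end", "clause",
    "short_long_flag", "fixed_length_flag", "variable_length_flag",
    "lexeme", "lexeme_reading", "word_type", "pos", "inflection_pattern",
    "inflected_form", "word_form", "infinitive_form",
    "infinitive_and_surface_form", "original_string",
    "surface_pronunciation", "sequence_number", "char_start", "char_end",
    "sentence_head",
]

LUW_INT_FIELDS = ("surface_start", "surface_end", "sequence_number",
                  "char_start", "char_end")


def read_luw(lines):
    """Yield one dict per LUW record from an iterable of TSV lines."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(LUW_FIELDS):
            raise ValueError(
                f"line {reader.line_num}: expected {len(LUW_FIELDS)} "
                f"fields, got {len(row)}"
            )
        record = dict(zip(LUW_FIELDS, row))
        for name in LUW_INT_FIELDS:
            record[name] = int(record[name])
        yield record


def differing_luws(records):
    """Keep LUWs whose Short/Long Differentiation Flag is 1 (scopes differ)."""
    return [r for r in records if r["short_long_flag"] == "1"]
```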