The C-XML data can be found in the C-XML directory on Disc 1, divided into subdirectories for each sub-corpus. Each directory holds a compressed file, which when extracted contains a number of XML files, each corresponding to a single sample. XML files for registers with a particularly large number of samples (the LB, PB, OC, and OY registers) will be extracted into individual subdirectories.
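Once the archives are extracted, the layout described above can be walked with a recursive file search. The following is a minimal sketch, not part of the BCCWJ distribution: the directory names `OW` and `LBa`, the file names, and the helper `iter_sample_files` are hypothetical, and the archives are assumed to have been extracted already.

```python
import tempfile
from pathlib import Path

def iter_sample_files(root):
    """Yield every .xml file below root, in sorted path order."""
    # rglob descends into nested subdirectories, so registers whose samples
    # are extracted into individual subdirectories are covered as well.
    yield from sorted(Path(root).rglob("*.xml"))

# Demo on a throwaway directory tree (all names here are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "C-XML"
    (root / "OW").mkdir(parents=True)
    (root / "LB" / "LBa").mkdir(parents=True)
    (root / "OW" / "sample1.xml").write_text("<sample/>", encoding="utf-8")
    (root / "LB" / "LBa" / "sample2.xml").write_text("<sample/>", encoding="utf-8")
    found = [p.relative_to(root).as_posix() for p in iter_sample_files(root)]

print(found)
```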
The BCCWJ is made up of a number of sub-corpora. The document structure tag sets reflect the particular properties of each sub-corpus, and are assigned as shown in the table "Relationship between tag sets and sub-corpora". Individual tag sets are defined as XML documents. In addition, the application of otherwise identical tags may differ depending on whether the source text was printed or electronic media, so the properties and usage of certain tags can vary with the source.
The tag sets (TS) are broadly divided into the following 3 types. "Variable length TS (partially amended)" in the table refers to a tag set that has been amended in parts relative to the regular variable length TS.
Sub-Corpus | Tag Set | Media-Type of Source |
---|---|---|
Publication Sub-corpus | Variable length TS, Fixed length TS | Printed |
Library Sub-corpus | Variable length TS, Fixed length TS | Printed |
White Papers | Variable length TS, Fixed length TS | Printed |
Textbooks | Variable length TS (partially amended) | Printed |
PR Documents | Variable length TS | Electronic |
Bestsellers | Variable length TS | Printed |
Yahoo! Answers | Yahoo! Answers TS | Electronic |
Yahoo! Blogs | Variable length TS (partially amended) | Electronic |
Poetry | Variable length TS (partially amended) | Printed |
Legal Documents | Variable length TS | Electronic |
Minutes of the National Diet | Variable length TS | Electronic |
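When processing several sub-corpora at once, it can help to hold this table as data. The sketch below transcribes the table into a Python dictionary; the short labels (`variable`, `fixed`, `variable_amended`, `yahoo_answers`) and the helper names are assumptions made for illustration, not identifiers used by the BCCWJ.

```python
# Tag sets and source media per sub-corpus, transcribed from the table above.
# The short tag set labels are hypothetical names chosen for this sketch.
TAG_SETS = {
    "Publication Sub-corpus": (("variable", "fixed"), "Printed"),
    "Library Sub-corpus": (("variable", "fixed"), "Printed"),
    "White Papers": (("variable", "fixed"), "Printed"),
    "Textbooks": (("variable_amended",), "Printed"),
    "PR Documents": (("variable",), "Electronic"),
    "Bestsellers": (("variable",), "Printed"),
    "Yahoo! Answers": (("yahoo_answers",), "Electronic"),
    "Yahoo! Blogs": (("variable_amended",), "Electronic"),
    "Poetry": (("variable_amended",), "Printed"),
    "Legal Documents": (("variable",), "Electronic"),
    "Minutes of the National Diet": (("variable",), "Electronic"),
}

def sub_corpora_using(tag_set):
    """Return the sub-corpora (in table order) that use the given tag set."""
    return [name for name, (tag_sets, _media) in TAG_SETS.items()
            if tag_set in tag_sets]

def media_type(sub_corpus):
    """Return the media type of the source texts for a sub-corpus."""
    return TAG_SETS[sub_corpus][1]

print(sub_corpora_using("fixed"))
print(media_type("PR Documents"))
```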
The principal differences between the C-XML and M-XML tags are as follows
The variable length tag set was made to describe variable length samples (samples that consist of one full "article"). In total, there are 46 tag types.
The information provided by this tag set is broken down into the following 3 groups.
"sample" and "sampling" are the tags related to samples. The "sample" tag marks the scope of a given sample, while the "sampling" tag contains information regarding various points of interest about the sample.
The roles of this type of tag are: (1) To increase the convenience of data retrieval and processing, and (2) To allow the accurate description of the original texts in electronic format. An example of the former is the "correction" tag (denoting places where misprinted text was corrected).
Examples of the latter are the "ruby" (denoting kana superscripts) and "missingCharacter" (denoting non-standard characters) tags.
These types of tags are used to assign logical roles to different parts of the document, and as seen in the variable length tag set tag list below they are divided into the following categories: (a) Hierarchical structure, (b) Figures, (c) Citations, (d) Annotations, and (e) Other.
The following is an explanation of tags related to hierarchical structure. These tags present information about the high-level structure of the article, such as "cluster", "paragraph", and "sentence". The following example shows these elements as they appear in an extracted portion of an XML file. Indentation in the example indicates structural subordination. For example, in the example below you can see that there are "titleBlock", "cluster", and "paragraph" elements subordinate to the "article" element.
article
  titleBlock 第2節 内外均衡の背景
  paragraph 53年度中にみられた...
  cluster
    titleBlock 1.財政金融政策の効果
    paragraph 石油危機後,...
    cluster
      titleBlock (公共投資の拡大)
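An outline like the one above can be produced from the XML itself by walking the element tree. The sketch below uses Python's standard `xml.etree.ElementTree`. The `SAMPLE` string is a simplified, hypothetical fragment: it reuses the tag names and text from the example, but its nesting is inferred from the description and real C-XML files carry attributes and further child elements. The helper `outline` is likewise an illustration only.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment: real C-XML files carry attributes and
# further child elements that are omitted here.
SAMPLE = """\
<article>
  <titleBlock>第2節 内外均衡の背景</titleBlock>
  <paragraph>53年度中にみられた...</paragraph>
  <cluster>
    <titleBlock>1.財政金融政策の効果</titleBlock>
    <paragraph>石油危機後,...</paragraph>
    <cluster>
      <titleBlock>(公共投資の拡大)</titleBlock>
    </cluster>
  </cluster>
</article>
"""

def outline(elem, depth=0):
    """Return one indented line per element: tag name plus its leading text."""
    text = (elem.text or "").strip()
    lines = ["  " * depth + elem.tag + (" " + text if text else "")]
    for child in elem:
        # Children are indented one level deeper than their parent.
        lines.extend(outline(child, depth + 1))
    return lines

lines = outline(ET.fromstring(SAMPLE))
print("\n".join(lines))
```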
Category | Tag Name | Contents |
---|---|---|
Data Sampling | sample | Marks the scope of one sample. |
Data Sampling | sampling | Gives meta-data regarding sampling. |
Hierarchical Structure (Document Structure) | article | A single element of coherent text by one author. |
Hierarchical Structure (Document Structure) | blockEnd | Marks semantic boundaries. |
Hierarchical Structure (Document Structure) | cluster | Denotes the scope covered by a title tag. |
Hierarchical Structure (Document Structure) | titleBlock | Denotes the title and directly related elements. |
Hierarchical Structure (Document Structure) | title | The title of a definite portion of the sample. |
Hierarchical Structure (Document Structure) | orphanedTitle | The title of an indefinite portion of the sample. |
Hierarchical Structure (Document Structure) | list | Itemized or listed elements. |
Hierarchical Structure (Document Structure) | paragraph | A paragraph element. |
Hierarchical Structure (Document Structure) | sentence | A sentence element. |
Figures (Document Structure) | figureBlock | Denotes the figure and any accompanying elements. |
Figures (Document Structure) | figure | The figure element itself. |
Figures (Document Structure) | caption | Any caption describing a figure. |
Figures (Document Structure) | table | A table element. |
Citations (Document Structure) | quotation | A quoted element from outside the text of the current article (e.g. from lists, photos, illustrations, etc). |
Citations (Document Structure) | citation | A cited element from another document. |
Citations (Document Structure) | source | Information regarding a cited document (document title, author name, document data, etc.) |
Citations (Document Structure) | speech | Denotes transcriptions of speech or internal monologues. |
Citations (Document Structure) | speaker | Marks character strings explicitly indicating the speaker. |
Citations (Document Structure) | quote | Denotes quotations, utterances, internal monologues, and descriptions directly referenced from a different article. |
Annotations (Document Structure) | noteBody | Indicates a note and its scope. |
Annotations (Document Structure) | noteBodyInline | Indicates an inline note. |
Other (Document Structure) | abstract | Indicates an abstract of an article or cluster element. |
Other (Document Structure) | authorsData | Indicates information about the author of the article. |
Other (Document Structure) | contents | A table of contents. |
Other (Document Structure) | profile | Profiles of the author or characters. |
Other (Document Structure) | rejectedBlock | Indicates the existence of a block deleted from the sample. |
Other (Document Structure) | verse | Indicates a poem, tanka, haiku, or song lyrics. |
Other (Document Structure) | verseLine | Marks a single line in a verse. |
Characters and Annotation | ruby | Indicates kana superscript showing kanji readings. |
Characters and Annotation | correction | Marks corrections of the original text. |
Characters and Annotation | missingCharacter | Characters from outside the JIS X 0213:2004 character set. |
Characters and Annotation | enclosedCharacter | Marks characters enclosed by shapes which serve as bibliographical labels, such as ©. |
Characters and Annotation | cursive | Indicates characters written in cursive. |
Characters and Annotation | image | Symbols from outside the JIS X 0213:2004 character set. |
Characters and Annotation | superScript | Marks a superscript element. |
Characters and Annotation | subScript | Marks a subscript element. |
Characters and Annotation | fraction | Marks the proper fraction portion of a compound number. |
Characters and Annotation | delete | Marks text with a strike-through line. |
Characters and Annotation | br | A line break. |
Characters and Annotation | info | Info concerning characters in the original text, such as erasure lines. |
Characters and Annotation | rejectedSpam | Indicates the existence of deleted inline characters. |
Characters and Annotation | substitution | Indicates a character has been substituted for. |
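To get a quick picture of which kinds of tags a sample uses, the tag list can be turned into a lookup from tag name to category. The following is a minimal sketch under stated assumptions: the category labels are shortened from the table, and the `SAMPLE` fragment and the helper `tally_by_category` are hypothetical (real files carry attributes and may nest elements differently).

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Tag names grouped by the categories of the tag list above.
GROUPS = {
    "Data Sampling": "sample sampling",
    "Hierarchical Structure": "article blockEnd cluster titleBlock title "
                              "orphanedTitle list paragraph sentence",
    "Figures": "figureBlock figure caption table",
    "Citations": "quotation citation source speech speaker quote",
    "Annotations": "noteBody noteBodyInline",
    "Other": "abstract authorsData contents profile rejectedBlock verse verseLine",
    "Characters and Annotation": "ruby correction missingCharacter "
                                 "enclosedCharacter cursive image superScript "
                                 "subScript fraction delete br info "
                                 "rejectedSpam substitution",
}
CATEGORY_OF = {tag: group for group, tags in GROUPS.items() for tag in tags.split()}

# Hypothetical fragment for illustration; attributes are omitted.
SAMPLE = (
    "<sample><article>"
    "<titleBlock><title><sentence>見出し</sentence></title></titleBlock>"
    "<paragraph><sentence>本文<ruby>漢字</ruby>です。</sentence><br/></paragraph>"
    "</article></sample>"
)

def tally_by_category(root):
    """Count the elements of a sample per tag category."""
    # Tags that are not in the list are grouped under "Unknown".
    return Counter(CATEGORY_OF.get(e.tag, "Unknown") for e in root.iter())

tally = tally_by_category(ET.fromstring(SAMPLE))
print(tally)
```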
The fixed length tag set was made to describe fixed length samples (samples consisting of exactly 1,000 characters). Its specification is largely the same as that of the variable length TS, with the following differences.
Fixed length block elements may not match the definitions of the corresponding variable length elements. For example, an "article" element may contain a number of articles, chapters, or sections, and in a fixed length sample it is possible that even the first "titleBlock" element does not cover the entirety of the following text.
The "isWholeArticle" attribute of an "article" element is taken as implied (optional).
The "cluster" element.
Samples from the "Yahoo! Answers" sub-corpus are arranged logically in pairs of questions and answers. However, because it was not possible to fully describe this structure using the existing variable and fixed length tag sets, these structures were treated as independent document types. There are 9 types of tags.
Tag Name | Contents |
---|---|
sample | Indicates the scope of a question/answer pair. |
OCQuestion | Marks the original question. |
OCAnswer | Marks the original answer. |
br | A line break. |
webLine | Marks automatic logical operators RE: web data. |
sentence | Marks a unified sentence. |
rejectedBlock | Marks an erased element. |
ncr | Marks the erasure of mathematical operands or the substitution of an "=". |
Info | Information deemed auxiliary. |
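The question/answer structure can be read back with a few lines of standard-library Python. This is a sketch only: the `SAMPLE` fragment is hypothetical and omits attributes, it assumes that "sentence" elements sit inside "OCQuestion" and "OCAnswer", and the helper names `sentences` and `split_pair` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment built from the tag names in the table above.
SAMPLE = """\
<sample>
  <OCQuestion><sentence>質問文です。</sentence><br/><sentence>よろしくお願いします。</sentence></OCQuestion>
  <OCAnswer><sentence>回答文です。</sentence></OCAnswer>
</sample>
"""

def sentences(block):
    """Return the text of every sentence element inside block."""
    if block is None:
        return []
    # itertext() also collects text nested in inline child elements.
    return ["".join(s.itertext()).strip() for s in block.iter("sentence")]

def split_pair(sample):
    """Return (question sentences, answer sentences) for one sample."""
    return sentences(sample.find("OCQuestion")), sentences(sample.find("OCAnswer"))

question, answer = split_pair(ET.fromstring(SAMPLE))
print(question)
print(answer)
```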
As shown in the "Relationship between tag sets and sub-corpora" table above, certain sub-corpora make use of an amended variable length tag set. This section explains how the amended versions differ from the regular variable length tag set.
"ASCIIArt" has been added as a possible value of the "type" attribute of the "rejectedBlock" tag. This indicates so-called ASCII art which was omitted during sample creation.
A single "sample" element may contain multiple subordinate "article" elements. This is due to the sampling methods used for the "Poetry" sub-corpus, wherein multiple works (i.e. "article" elements) are included in each sample. It should be noted that in the regular variable length tag set a "sample" element will only have a single "article" sub-element.
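Code that reads these sub-corpora therefore should not assume exactly one "article" per "sample", and may want to detect omitted ASCII art. The sketch below illustrates both checks with the standard library. The two sample strings are hypothetical, the "type" value `other` is a made-up placeholder, and the helper names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples; attributes other than "type" are omitted.
POETRY_SAMPLE = """\
<sample>
  <article><verse><verseLine>一行目</verseLine><verseLine>二行目</verseLine></verse></article>
  <article><verse><verseLine>別の作品</verseLine></verse></article>
</sample>
"""
AMENDED_SAMPLE = """\
<sample>
  <article>
    <paragraph><sentence>本文です。</sentence></paragraph>
    <rejectedBlock type="ASCIIArt"/>
    <rejectedBlock type="other"/>
  </article>
</sample>
"""

def article_count(sample):
    """Count the article elements anywhere below a sample element."""
    return sum(1 for _ in sample.iter("article"))

def ascii_art_count(sample):
    """Count rejectedBlock elements whose type attribute is "ASCIIArt"."""
    return sum(1 for block in sample.iter("rejectedBlock")
               if block.get("type") == "ASCIIArt")

poetry = ET.fromstring(POETRY_SAMPLE)
amended = ET.fromstring(AMENDED_SAMPLE)
print(article_count(poetry), article_count(amended))
print(ascii_art_count(amended))
```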
22 tag types have been omitted from and 8 types added to the variable length tag set for use with the "Textbook" sub-corpus.
Category | Tag Name | Contents |
---|---|---|
Elements describing hierarchical linguistic structure | book | [Added] Indicates a full textbook. |
Elements describing hierarchical linguistic structure | cluster | [Modified] Indicates the scope of the text covered by a chapter title, or section title as shown in the TOC. |
Elements describing specific language structures | copyright | [Modified] Indicates elements where special copyright treatment is necessary beyond what is found in "citation" elements. |
Elements describing specific language structures | supplement | [Modified] Indicates an element that is in a different format from the main text (the main scholastic content), or that is supplemental to the main body of the textbook. |
Elements describing specific language structures | skippedBlock | [Added] A text element that was not covered in the lexical survey when creating the textbook corpus word list. |
Elements relating to characters and annotation | surrogatePair | [Added] Denotes a surrogate pair character in the JIS X 0213:2004 format as indicated by a "=". |
Elements relating to characters and annotation | subRuby | [Added] Indicates kana transcriptions of kanji readings below the text in cases of horizontal writing, and to the left of the text in cases of vertical writing. |
Elements relating to characters and annotation | root | [Added] Indicates areas where there is a danger of misinterpretation of the scope of the text indicated by a √ mark. |
Elements relating to characters and annotation | skippedSpan | [Added] Indicates a character string that was omitted from the lexical survey when creating the textbook corpus word list. |
* Reference: Tanaka et al. (2011). "II Textbook corpus character input and tag usage."