Character-based XML (C-XML)

The C-XML data can be found in the C-XML directory on Disc 1, divided into subdirectories for each sub-corpus. Each directory holds a compressed file, which when extracted contains a number of XML files, each corresponding to a single sample. XML files for registers with a particularly large number of samples (the LB, PB, OC, and OY registers) will be extracted into individual subdirectories.

Document Structure Tag Sets and Their Relationship with the Sub-Corpora

The BCCWJ is made up of a number of sub-corpora. The document structure tag sets correspond to the special properties of each sub-corpus, and are provided as shown in the table "Relationships Between Tag Sets and Sub-Corpora". Individual tag sets are defined as XML documents. In addition, the application of otherwise identical tags may change depending on whether the source text was printed or electronic media, so the properties and usage of certain tags can differ between the two.

The tag sets (TS) are broadly divided into the following three types. "Variable length TS (partially amended)" in the table refers to a tag set that has been amended in parts compared with the regular variable length TS.

Variable Length TS:
A tag set made to describe variable length samples (where each sample is made up of a single "Article").
Fixed Length TS:
A tag set made to describe fixed length samples (where each sample consists of 1000 characters).
Yahoo! Answers TS:
A tag set made to describe samples from Yahoo! Answers.

Relationships Between Tag Sets and Sub-Corpora

Sub-Corpus	Tag Set	Media-Type of Source
Publication Sub-corpus	Variable length TS, Fixed length TS	Printed
Library Sub-corpus	Variable length TS, Fixed length TS	Printed
White Papers	Variable length TS, Fixed length TS	Printed
Textbooks	Variable length TS (partially amended)	Printed
PR Documents	Variable length TS	Electronic
Bestsellers	Variable length TS	Printed
Yahoo! Answers	Yahoo! Answers TS	Electronic
Yahoo! Blogs	Variable length TS (partially amended)	Electronic
Poetry	Variable length TS (partially amended)	Printed
Legal Documents	Variable length TS	Electronic
Minutes of the National Diet	Variable length TS	Electronic

Variable Length Tag Set

The principal differences between the C-XML and M-XML tags are as follows.

The variable length tag set is made to describe variable length samples (samples consisting of one full "Article"). In total, there are 46 tag types.
The information provided by this tag set is broken down into the following 3 groups.

Tags Related to Sampling

"sample" and "sampling" are the tags related to samples. The "sample" tag marks the scope of a given sample, while the "sampling" tag contains information regarding various points of interest about the sample.

Tags Related to Characters and Transcription

The roles of this type of tag are: (1) To increase the convenience of data retrieval and processing, and (2) To allow the accurate description of the original texts in electronic format. An example of the former is the "correction" tag (denoting places where misprinted text was corrected).

生活基<correction type="erratum" originalText="盟">盤</correction>に
伸びを示し<correction type="omission">て</correction>いる
整備を<correction type="excess" originalText="を" />図るべく

Examples of the latter are the "ruby" (denoting kana superscripts) and "missingCharacter" (denoting non-standard characters) tags.
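The three "correction" examples above can be processed with a short script. The following is a minimal sketch, not part of the BCCWJ tooling: the function name `corrected_and_printed` is hypothetical, and it assumes that the element content holds the corrected text while the "originalText" attribute holds what was printed in the source (empty when the attribute is absent). It only handles "correction" elements at the top level of the fragment.

```python
import xml.etree.ElementTree as ET


def corrected_and_printed(fragment):
    """Return (corrected text, text as printed) for a C-XML fragment.

    Assumption: the element content of "correction" is the corrected text,
    and "originalText" is the string found in the source (absent = nothing).
    """
    # Wrap the fragment so that it parses as a single XML document.
    root = ET.fromstring("<wrapper>" + fragment + "</wrapper>")
    corrected = [root.text or ""]
    printed = [root.text or ""]
    for child in root:
        inner = "".join(child.itertext())
        if child.tag == "correction":
            corrected.append(inner)
            printed.append(child.get("originalText", ""))
        else:
            # Other inline tags are kept as plain text in both versions.
            corrected.append(inner)
            printed.append(inner)
        tail = child.tail or ""
        corrected.append(tail)
        printed.append(tail)
    return "".join(corrected), "".join(printed)


examples = [
    '生活基<correction type="erratum" originalText="盟">盤</correction>に',
    '伸びを示し<correction type="omission">て</correction>いる',
    '整備を<correction type="excess" originalText="を" />図るべく',
]
for fragment in examples:
    print(corrected_and_printed(fragment))
```

Keeping both strings side by side makes it possible to search the corrected text while still reporting what the source actually printed.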

Tags Related to Document Structure

These types of tags are used to assign logical roles to different parts of the document, and as seen in the variable length tag set tag list below they are divided into the following categories: (a) Hierarchical structure, (b) Figures, (c) Citations, (d) Annotations, and (e) Other.

The following is an explanation of tags related to hierarchical structure. These tags present information related to the high-level structure of the article, such as "cluster", "sentence", and "paragraph". The following example shows these elements as they appear in an extracted portion of an XML file. Indentation indicates subordinate levels of the structural hierarchy. For instance, the listing below shows "titleBlock", "cluster", and "paragraph" elements subordinate to the "article" element.

　　article
　　　titleBlock 第２節　内外均衡の背景
　　　paragraph 　53年度中にみられた...
　　　cluster
　　　　titleBlock １．財政金融政策の効果
　　　　paragraph 　石油危機後，...
　　　　cluster
　　　　　titleBlock （公共投資の拡大）

	Tag Name	Contents
Data Sampling	sample	Marks the scope of one sample.
Data Sampling	sampling	Gives meta-data regarding sampling.
Hierarchical Structure (Document Structure)	article	A single element of coherent text by one author.
	blockEnd	Marks semantic boundaries.
	cluster	Denotes the scope covered by a title tag.
	titleBlock	Denotes the title and directly related elements.
	title	The title of a definite portion of the sample.
	orphanedTitle	The title of an indefinite portion of the sample.
	list	Itemized or listed elements.
	paragraph	A paragraph element.
	sentence	A sentence element.
Figures (Document Structure)	figureBlock	Denotes the figure and any accompanying elements.
	figure	The figure element itself.
	caption	Any caption describing a figure.
	table	A table element.
Citations (Document Structure)	quotation	A quoted element from outside the text of the current article (e.g. from lists, photos, illustrations, etc).
	citation	A cited element from another document.
	source	Information regarding a cited document (document title, author name, document data, etc.)
	speech	Denotes transcriptions of speech or internal monologues.
	speaker	Marks character strings explicitly indicating the speaker.
	quote	Denotes quotations, utterances, internal monologues, and descriptions directly referenced from a different article.
Annotations (Document Structure)	noteBody	Indicates a note and its scope.
Annotations (Document Structure)	noteBodyInline	Indicates an inline note.
Other (Document Structure)	abstract	Indicates an abstract of an article or cluster element.
Other (Document Structure)	authorsData	Indicates information about the author of the article.
Other (Document Structure)	contents	A table of contents.
	profile	Profiles of the author or characters.
	rejectedBlock	Indicates the existence of a block deleted from the sample.
	verse	Indicates a poem, tanka, haiku, or song lyrics.
	verseLine	Marks a single line in a verse.
Characters and Annotation	ruby	Indicates kana superscript showing kanji readings.
	correction	Marks corrections of the original text.
	missingCharacter	Characters from outside the JIS X 0213:2004 character set.
	enclosedCharacter	Marks characters enclosed by shapes which serve as bibliographical labels, such as ©.
	cursive	Indicates characters written in cursive.
	image	Symbols from outside the JIS X 0213:2004 character set.
	superScript	Marks a superscript element.
	subScript	Marks a subscript element.
	fraction	Marks the proper fraction portion of a compound number.
	delete	Marks text with a strike-through line.
	br	A line break.
	info	Info concerning characters in the original text, such as erasure lines.
	rejectedSpan	Indicates the existence of deleted inline characters.
	substitution	Indicates a character has been substituted for.

Fixed Length Tag Set

The fixed length tag set was made to describe fixed length samples (those samples consisting of exactly 1,000 characters). The specifications of the tag set are largely the same as those of the variable length TS, but with the following differences.

Fixed Length Samples are Limited by Character Count

Block elements in fixed length samples may not match the definitions used for variable length samples. For example, an "article" element may contain a number of articles, chapters, or sections. In a fixed length sample it is also possible that even the first "titleBlock" element will not cover the entirety of the following text.

The "isWholeArticle" attribute of an "article" element is taken as implied (optional).

The Following Elements are not Used

The "cluster" element.

Yahoo! Answers Sub-Corpus Tag Set

Samples from the "Yahoo! Answers" sub-corpus are arranged logically in pairs of questions and answers. However, because it was not possible to fully describe this structure using the existing variable and fixed length tag sets, these structures were treated as independent document types. There are 9 types of tags.

Tag Name	Contents
sample	Indicates the scope of a question/answer pair.
OCQuestion	Marks the original question.
OCAnswer	Marks the original answer.
br	A line break.
webLine	Marks automatic logical operators RE: web data.
sentence	Marks a unified sentence.
rejectedBlock	Marks an erased element.
ncr	Marks the erasure of mathematical operands or the substitution of an "=".
Info	Information deemed auxiliary.
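A question/answer pair can be read out of a sample roughly as follows. This is only a sketch: the XML in `SAMPLE` is invented for illustration, the exact nesting of "OCQuestion", "OCAnswer", "sentence", and "br" in real files is an assumption, and the helper names `element_text` and `question_answer_pair` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample; the real nesting and attributes may differ (assumption).
SAMPLE = (
    "<sample>"
    "<OCQuestion><sentence>質問の本文です。</sentence><br/>"
    "<sentence>よろしくお願いします。</sentence></OCQuestion>"
    "<OCAnswer><sentence>回答の本文です。</sentence></OCAnswer>"
    "</sample>"
)


def element_text(elem):
    """Concatenate the text of elem, turning each "br" into a newline."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append("\n" if child.tag == "br" else element_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


def question_answer_pair(sample_xml):
    """Return (question, answer) text; None for a missing element."""
    sample = ET.fromstring(sample_xml)
    question = sample.find(".//OCQuestion")
    answer = sample.find(".//OCAnswer")
    return (
        element_text(question).strip() if question is not None else None,
        element_text(answer).strip() if answer is not None else None,
    )


question, answer = question_answer_pair(SAMPLE)
print(question)
print(answer)
```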

The "Other" Tag Set

As shown in the "Relationships between tag sets and sub-corpora" table above, certain sub-corpora make use of an amended variable length tag set. This section will explain the differences between amended versions of the variable length tag set.

Yahoo! Blog

"ASCIIArt" has been added as a possible value of the "type" attribute of the "rejectedBlock" tag. This indicates so-called ASCII art which was omitted during sample creation.

Poetry

A single "sample" element may contain multiple subordinate "article" elements. This is due to the sampling methods used for the "Poetry" sub-corpus, wherein multiple works (i.e. "article" elements) are included in each sample. It should be noted that in the regular variable length tag set a "sample" element will only have a single "article" sub-element.
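Code that reads Poetry samples therefore should not assume one "article" per "sample". A minimal sketch of iterating over every work in a sample follows; the XML in `POETRY_SAMPLE` is invented for illustration, and the placement of "verse" and "verseLine" directly inside "article" is an assumption, as is the function name `verse_lines_per_article`.

```python
import xml.etree.ElementTree as ET

# Invented Poetry-style sample with two works (structure is an assumption).
POETRY_SAMPLE = (
    "<sample>"
    "<article><verse><verseLine>一行目</verseLine>"
    "<verseLine>二行目</verseLine></verse></article>"
    "<article><verse><verseLine>別の作品</verseLine></verse></article>"
    "</sample>"
)


def verse_lines_per_article(sample_xml):
    """Return one list of verse lines for each "article" in the sample."""
    sample = ET.fromstring(sample_xml)
    works = []
    # iter() finds every article, however many the sample contains.
    for article in sample.iter("article"):
        lines = ["".join(v.itertext()) for v in article.iter("verseLine")]
        works.append(lines)
    return works


works = verse_lines_per_article(POETRY_SAMPLE)
print(len(works), works)
```

The same loop also works for regular variable length samples, where it simply yields a single entry.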

Textbooks

22 tag types have been omitted from and 8 types added to the variable length tag set for use with the "Textbook" sub-corpus.

Omitted Tags

　abstract, authorsData, blockEnd, contents, cursive, delete, info, insert, list, listItem,
　orphanedTitle, paragraph, profile, quotation, quote, source, speaker, speech, table,
　titleBlock, verse, verseLine

Added Tags

　book, copyright, supplement, skippedBlock, surrogatePair, subRuby, root, skippedSpan
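The two lists above can be turned into a simple consistency check for Textbook files. The following is a sketch only: the constant and function names are hypothetical, the sample XML strings are invented, and the nesting of "book" inside "sample" is an assumption.

```python
import xml.etree.ElementTree as ET

# Tags omitted from the Textbook tag set, as listed above (22 entries).
OMITTED_TAGS = frozenset({
    "abstract", "authorsData", "blockEnd", "contents", "cursive", "delete",
    "info", "insert", "list", "listItem", "orphanedTitle", "paragraph",
    "profile", "quotation", "quote", "source", "speaker", "speech", "table",
    "titleBlock", "verse", "verseLine",
})

# Tags added for the Textbook tag set, as listed above (8 entries).
ADDED_TAGS = frozenset({
    "book", "copyright", "supplement", "skippedBlock", "surrogatePair",
    "subRuby", "root", "skippedSpan",
})


def omitted_tags_used(xml_text):
    """Return the sorted omitted tag names that appear in the document."""
    root = ET.fromstring(xml_text)
    used = {elem.tag for elem in root.iter()}
    return sorted(used & OMITTED_TAGS)


# Invented examples: the first uses only permitted tags, the second does not.
ok = "<sample><book><cluster><title>見出し</title><sentence>本文。</sentence></cluster></book></sample>"
bad = "<sample><book><paragraph><sentence>本文。</sentence></paragraph></book></sample>"
print(omitted_tags_used(ok))
print(omitted_tags_used(bad))
```

A non-empty result suggests that a file was tagged with the regular variable length tag set rather than the Textbook variant.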

Tags Added or Modified For Use With the Textbook Sub-Corpus

	Tag Name	Contents
Elements describing hierarchical linguistic structure.	book	[Added] Indicates a full text book.
Elements describing hierarchical linguistic structure.	cluster	[Modified] Indicates the scope of the text covered by a chapter title, or section title as shown in the TOC.
Elements describing specific language structures.	copyright	[Added] Indicates elements where special copyright treatment is necessary beyond what is found in "citation" elements.
	supplement	[Added] Indicates an element that is in a different format from the main text (the main scholastic content), or that is supplemental to the main body of the textbook.
	skippedBlock	[Added] A text element that was not covered in the lexical survey when creating the textbook corpus word list.
Elements relating to characters and annotation.	surrogatePair	[Added] Denotes a surrogate pair character in the JIS X 0213:2004 format as indicated by a "=".
	subRuby	[Added] Indicates kana transcriptions of kanji readings below the text in cases of horizontal writing, and to the left of the text in cases of vertical writing.
	root	[Added] Indicates areas where there is a danger of misinterpretation of the scope of the text indicated by a √ mark.
	skippedSpan	[Added] Indicates a character string that was omitted from the lexical survey when creating the textbook corpus word list.

* Reference: Tanaka et al. (2011). "II Textbook corpus character input and tag usage."
