The M-XML data can be found in the M-XML directory on Disc 1, divided into subdirectories for each sub-corpus. Each directory holds a compressed file which, when extracted, yields a number of XML files, each corresponding to a single sample. XML files for registers with a particularly large number of samples (the LB, PB, OC, and OY registers) are extracted into individual subdirectories.
M-XML (integrated XML) is a format based on the C-XML (character-based XML) format. It aims to provide an integrated, standardized description of the linguistic structure of both variable length and fixed length samples. To simplify the handling of information related to linguistic structure, it embeds information on the morphological analysis of short and long unit words together with their hierarchical structure. The XML files are encoded in UTF-8 (without BOM).
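As a minimal sketch of the encoding note above, the following Python checks whether a file starts with a UTF-8 byte order mark and decodes it as strict UTF-8. The helper names and the throwaway demo files are assumptions made for illustration; they are not part of the corpus distribution.

```python
import codecs
import tempfile
from pathlib import Path


def starts_with_utf8_bom(path):
    """Return True if the file begins with the UTF-8 byte order mark."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


def read_mxml_text(path):
    """Decode a file as strict UTF-8 (raises UnicodeDecodeError on bad bytes)."""
    return Path(path).read_text(encoding="utf-8")


# Demonstration on throwaway files (hypothetical content, not corpus data).
with tempfile.TemporaryDirectory() as tmp:
    clean = Path(tmp) / "clean.xml"
    clean.write_bytes('<mergedSample sampleID="DEMO"/>'.encode("utf-8"))

    with_bom = Path(tmp) / "with_bom.xml"
    with_bom.write_bytes(codecs.BOM_UTF8 + b"<mergedSample/>")

    clean_has_bom = starts_with_utf8_bom(clean)     # expected for M-XML files
    other_has_bom = starts_with_utf8_bom(with_bom)  # would contradict the note
    text = read_mxml_text(clean)
```

A check of this kind is mainly useful when files have passed through editors or tools that may add a BOM silently.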
C-XML is structured to use separate XML documents for variable length and fixed length samples. However, as both types of samples are gathered from a single text, a large number of sample portions are duplicated. In order to remove the necessity of embedding morphological analyses twice, an integrated treatment of such data is desirable. However, as the tags would conflict, it is not a simple matter to integrate two XML formats containing different structures. Accordingly, it was decided to integrate variable length and fixed length samples into a new integrated format using the following methods.
To begin with, variable length samples were made with document structure in mind, whereas for fixed length samples the objective was to obtain samples of uniform length, so block element tags describing document structure hold little meaning for them. For that reason, it was decided that M-XML should maintain only the document structure of variable length samples, while assigning morphological descriptions (for both long and short unit words) to the contents of fixed length samples. Cases where the fixed length section is taken from a variable length sample are marked by a simple container (<div type="fixedLength">), maintaining only inline elements.
M-XML identifies elements as described above using the following "mergedSample" element as a root.
<mergedSample sampleID="[sample ID]" type="BCCWJ-MorphXML" version="1.0">
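A minimal sketch of reading this root element with Python's standard `xml.etree.ElementTree` is given below. The document string and the sample ID `DEMO_0001` are made up for illustration, and `describe_root` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal document; "DEMO_0001" is a made-up sample ID.
doc = (
    '<mergedSample sampleID="DEMO_0001" type="BCCWJ-MorphXML" version="1.0">'
    "</mergedSample>"
)


def describe_root(root):
    """Return the identifying attributes of a mergedSample root element."""
    if root.tag != "mergedSample":
        raise ValueError(f"unexpected root element: {root.tag}")
    return {
        "sampleID": root.get("sampleID"),
        "type": root.get("type"),
        "version": root.get("version"),
    }


info = describe_root(ET.fromstring(doc))
print(info)
```

For a file on disk, `ET.parse(path).getroot()` would take the place of `ET.fromstring(doc)`.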
In C-XML, different document type definitions are used depending on the sub-corpus. While the Yahoo! Answers (OC), Blogs (OY), Textbook (OT), and Poetry (OV) registers all have mostly similar structures, each differs from general variable length samples according to its particular document type definition. Because of this, problems can arise when processing the data in a unified form.
Therefore, M-XML modifies portions of the tag sets in order to allow all sub-corpora to be processed with a unified document type definition. Although the constraints are necessarily somewhat looser than in C-XML, all the XML files can be validated against a single XML schema. For this integration, tags particular to certain sub-corpora have been partially modified, as in the following examples.
OC : <OCQuestion> → <article articleID="[sample ID]-Question">
OC : <OCAnswer> → <article articleID="[sample ID]-Answer">
OC, OY: <br type='physicalLine_original' /> → <webBr/>
OT : <root> → <squareRoot>
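The renaming listed above can be sketched in Python as follows. This is an illustration only, not the actual conversion tool: the input fragment, the `sample` wrapper element, the sample ID, and the helper name `unify_tags` are all assumptions. For brevity the sketch applies every rule regardless of sub-corpus and drops the attributes of the converted `br` element.

```python
import xml.etree.ElementTree as ET


def unify_tags(root, sample_id):
    """Rename sub-corpus-specific tags in place, following the list above."""
    for elem in root.iter():
        if elem.tag == "OCQuestion":
            elem.tag = "article"
            elem.set("articleID", f"{sample_id}-Question")
        elif elem.tag == "OCAnswer":
            elem.tag = "article"
            elem.set("articleID", f"{sample_id}-Answer")
        elif elem.tag == "br" and elem.get("type") == "physicalLine_original":
            elem.tag = "webBr"
            elem.attrib.clear()  # assumption: webBr carries no attributes
        elif elem.tag == "root":
            elem.tag = "squareRoot"
    return root


# Hypothetical C-XML-like fragment (not taken from the corpus).
source = (
    "<sample>"
    '<OCQuestion>Q<br type="physicalLine_original"/></OCQuestion>'
    "<OCAnswer>A</OCAnswer>"
    "</sample>"
)
unified = unify_tags(ET.fromstring(source), "DEMO_0001")
tags = [e.tag for e in unified.iter()]
article_ids = [e.get("articleID") for e in unified.iter("article")]
print(ET.tostring(unified, encoding="unicode"))
```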
The BCCWJ's layered structure is based on the items defined as short unit words, long unit words, and clauses. As consecutive clauses make up sentences, and short unit words are composed of characters, the morphological information contained in the BCCWJ uses the following hierarchical structure of lexical items.
Article / Sentence / Phrase / Long Unit Word / Short Unit Word / Character
In order to make practical use of the tags relating to document structure, and of the stratified morphological information, it is desirable for this hierarchical structure and its implied relationships to be reflected in the XML format. Based on this line of thinking, the morphological information is presented using the following structure.
The example below is a single sentence element extracted from a larger sample (for ease of reading, some properties have been omitted).
The following properties can be assigned to the embedded Short Unit Word (SUW) tags. * For symbols whose properties do not need to be output, neither the property name nor its value is output.
Property Name | Notes |
---|---|
start | Start offset from the head of the character string in the original sample (in increments of 10). |
end | End offset from the head of the character string in the original sample (in increments of 10). |
orderID | Serial # (compatible with TSV serial #s) |
lemma | A lexeme |
lForm | A lexeme's reading |
subLemma | Lexical sub-type * Only outputted when there is a distinction. |
wType | Word origin (e.g. native, borrowed, sino-Japanese) |
pos | Part of speech |
cType | Inflectional pattern * Only outputted for inflected words. |
cForm | Inflected form * Only outputted for inflected words. |
formBase | Word form |
usage | Rules of use * Only outputted when there is a distinction. |
orthBase | Infinitive form * Only outputted for inflected words. |
originalText | Character string appearing in the original text. * Only outputted in instances where there are differences from the element in the original text. |
kanaToken | Kana form and surface form * Outputted only if there is a difference from the word form. |
pronToken | Surface pronunciation. |
Additionally, the underlying and surface forms in the TSV files correspond to the text contained by the SUW tags. The kana forms and surface forms correspond to the kana readings in the text (or to the kana character strings, in the case of IME input).
* Source: TSV data details.
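As a rough illustration of how the SUW properties in the table above might be read, the following Python sketch extracts a few of them from a sentence fragment. The fragment and all of its attribute values are invented for this example and do not come from the corpus. The sketch also assumes that, because offsets advance in increments of 10, `(end - start) / 10` gives the number of characters; that reading is an assumption rather than something stated in the table.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; every attribute value here is invented.
fragment = """
<sentence>
  <LUW B="S" SL="v" l_lemma="国立大学" l_pos="名詞-普通名詞-一般">
    <SUW start="0" end="20" orderID="10" lemma="国立" lForm="コクリツ"
         pos="名詞-普通名詞-一般">国立</SUW>
    <SUW start="20" end="40" orderID="20" lemma="大学" lForm="ダイガク"
         pos="名詞-普通名詞-一般">大学</SUW>
  </LUW>
</sentence>
"""


def suw_records(sentence):
    """Collect selected properties of every SUW element under a sentence."""
    records = []
    for suw in sentence.iter("SUW"):
        start = int(suw.get("start"))
        end = int(suw.get("end"))
        records.append({
            "surface": "".join(suw.itertext()),
            "orderID": int(suw.get("orderID")),
            "lemma": suw.get("lemma"),
            "pos": suw.get("pos"),
            # Optional properties are simply absent, so get() returns None.
            "cType": suw.get("cType"),
            # Assumption: one character corresponds to an offset step of 10.
            "chars": (end - start) // 10,
        })
    return records


records = suw_records(ET.fromstring(fragment))
for record in records:
    print(record)
```

Using `get()` rather than indexing keeps the sketch tolerant of properties that are only output in certain cases, such as `cType` and `cForm`.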
The following properties can be applied to embedded Long Unit Word (LUW) tags. * For symbols whose properties do not need to be output, neither the property name nor its value is output.
Property Name | Notes |
---|---|
B | Sentence or clause boundaries. Clause boundaries=B, Sentence boundaries=S. |
SL | Sample length. Fixed length=f, Variable length=v |
l_lemma | Lexeme |
l_lForm | Lexeme reading |
l_wType | Word type (e.g. native, borrowed, sino-Japanese) |
l_pos | Part of speech. |
l_cType | Inflectional pattern * Only outputted for inflected words. |
l_cForm | Inflected form * Only outputted for inflected words. |
l_formBase | Word form. |
l_orthBase | Infinitive form * Only outputted for inflected words. |
The LUW properties do not include information that can be obtained easily from the TSV "Long/Short agreement" data, the XML structure, or the properties of the subordinate SUWs.
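Because each LUW element contains its SUWs, some LUW-level information can be derived from the structure rather than stored as a property. The Python sketch below reconstructs the surface string of each LUW from its subordinate SUWs and reads the `B` and `SL` properties. The fragment and its values are invented for illustration, and `luw_summary` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; attribute values are invented.
fragment = """
<sentence>
  <LUW B="S" SL="v" l_lemma="国立大学" l_pos="名詞-普通名詞-一般">
    <SUW lemma="国立">国立</SUW><SUW lemma="大学">大学</SUW>
  </LUW>
  <LUW SL="v" l_lemma="に" l_pos="助詞-格助詞">
    <SUW lemma="に">に</SUW>
  </LUW>
</sentence>
"""


def luw_summary(sentence):
    """Return (surface, boundary, sample length, SUW count) for each LUW."""
    rows = []
    for luw in sentence.iter("LUW"):
        suws = list(luw.iter("SUW"))
        # Join only the SUW text, skipping whitespace between elements.
        surface = "".join("".join(suw.itertext()) for suw in suws)
        rows.append((surface, luw.get("B"), luw.get("SL"), len(suws)))
    return rows


rows = luw_summary(ET.fromstring(fragment))
for row in rows:
    print(row)
```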
When the hierarchy of morphological information described above is applied to the various C-XML elements, the result can be thought of as in the "Hierarchical structure of included morphological information" table below (the colored elements are all indispensable parts of the original text). However, some C-XML elements are inconsistent with this hierarchy. M-XML deals with this by making the following changes to the C-XML tags.
C-XML allows "sentence" tags to be nested, so that as many sentences as necessary can be contained within a larger sentence. In the following example, where there is a quotation within a larger sentence, both the full sentence and the quotation are contained in "sentence" tags, with the quotation's "sentence" tag containing some markup information.
Although the markup describing complex nested sentences is meaningful, it also causes problems, such as: (1) the top-level sentence may become excessively long, (2) there are no clear rules governing the "sentence" element when embedding morphological analysis, and (3) data cannot be sorted by sentence number.
Therefore, in order to maintain the sentence hierarchy, M-XML no longer allows nested sentences as in the example above, and instead assigns top-level sentences a new type of document structure tag, "superSentence". Subordinate sentences remain as they were, while a partial sentence directly following a "superSentence" tag is marked with the new property type="fragment".
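One possible reading of this rule is sketched below in Python: a sentence that contains nested sentences becomes a "superSentence", the nested sentences are kept, and each stretch of loose text between them is wrapped in a sentence with type="fragment". The helper name, the input fragment, and the assumption that the outer sentence contains only text and nested "sentence" elements are all simplifications made for illustration, not a description of the actual conversion.

```python
import copy
import xml.etree.ElementTree as ET


def to_super_sentence(outer):
    """Convert a nested C-XML-style sentence into a flat superSentence."""
    sup = ET.Element("superSentence")

    def add_fragment(text):
        # Wrap loose text in <sentence type="fragment">, skipping whitespace.
        if text and text.strip():
            frag = ET.SubElement(sup, "sentence", {"type": "fragment"})
            frag.text = text

    add_fragment(outer.text)
    for child in outer:
        inner = copy.deepcopy(child)  # leave the input tree untouched
        tail, inner.tail = inner.tail, None
        sup.append(inner)
        add_fragment(tail)
    return sup


# Hypothetical nested sentence: a quotation inside a larger sentence.
nested = ET.fromstring(
    "<sentence>彼は<sentence>行く。</sentence>と言った。</sentence>"
)
flat = to_super_sentence(nested)
types = [s.get("type") for s in flat]
texts = [s.text for s in flat]
print(ET.tostring(flat, encoding="unicode"))
```

After this step every "sentence" element sits at the same depth, which is what makes numbering and sorting by sentence straightforward.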
As tags inserted inside a sentence may conflict with the morphological information, they are dealt with in the following ways.
"ruby" (transcriptions)
While in the BCCWJ furigana transcriptions of kanji readings are typically assigned to all kanji, in cases where there are special or exceptional kanji readings, multi-character strings are marked with the "ruby" tag, as in the following examples.
This example would be marked in C-XML as follows:
In M-XML, examples such as 1a) and 2a), where "ruby" tags are contained within "SUW" tags, are left as-is. On the other hand, in examples such as 3a), 4a), and 5a), where the scope of the "ruby" tag covers the head of an "SUW", it was decided, in order to indicate the true scope of the "ruby" tags, to contain the "ruby" tags within "SUW" tags while maintaining the original text of the "ruby" tag as a tag property. Thus, it becomes possible to revert to the original state, allowing for simple extraction of a special "ruby" tag across multiple units.
Quotations ("quote" tag)
There are cases where an SUW is broken up by a quotation tag, as in the following example.
In addition to the SUW and LUW tags, the following supplemental tags have been created for M-XML.
There are portions of the text wherein numerical transcriptions have been modified after undergoing the NumTrans process. The original text is retained in the "originalText" property of this tag.
The "fraction" tag has been added to mark fractions (outside of compound numbers); numerators and denominators are marked in the following fashion. The "NumTrans" process also occurs together with fractions.
Any reference information regarding revised pagination is left in the "info" tag element. Although all efforts have been made to maintain compatibility as shown above, due to the large number of changes there is no perfect way to maintain compatibility between the appended M-XML tags and C-XML tags.