Integration of Morphological Information in XML (M-XML)

M-XML is an implementation based off the C-XML format which gives information on the linguistic structure of both Fixed and Variable Length Samples - it allows for the embedding of morphological and syntactic information concerning both Long and Short Unit Words, and allows easy access to such linguistic data. M-XML contains the following layers of information concerning lexical items:

Document Structure (Block) Tags / Sentence / Long Unit Word / Short Unit Word / Characters

Layered Structure of Morphological Information in XML Format

Differences from C-XML

The primary differences between C-XML and M-XML are outlined in the following sections.

1. Integration of Sentence Structure Information for Variable Length Samples

M-XML retains only sentence structure information for Variable Length Samples, and fixed length portions selected from Variable Length Samples are surrounded by simple containers. Only inline elements are retained.

2. Unification of Sentence Definitions

Due to some variable sentence definitions used in C-XML for samples of sources such as online knowledge bases, blogs, textbooks, and poetry being merged with sentence definitions in M-XML, certain tags particular to different sub-corpora were changed.

3. Revisions to Sentence Tags

C-XML allows for the recursive nesting of sentence tags, but this was revised in M-XML. In M-XML, a top-level sentence tag changes to the SuperSentence tag, while any a subordinate sentence remains the same. Text following a superSentence tag will be marked with the new notation of sentence type="Fragment".

4. Revisions to Nested Tags

In cases where the ruby or quote tages conflict with or obscure tags giving morphological information, they will be amended.

5. Supplement to the NumTrans Tag

In portions where calculations would be transformed into representative numbers, the NumTrans and fraction tags have been appended.

Example of the Structuring of Morphological Information

Below is an example of M-XML markup on a single extracted sample of a sentence element. For ease of viewing some properties have been omitted.

How to Apply

BCCWJ Design

Paid Edition Contents

Research Results