"The Balanced Corpus of Contemporary Written Japanese" (BCCWJ) is a corpus built to capture the breadth of contemporary written Japanese. It contains extensive samples of modern Japanese texts, selected to make the corpus as balanced as possible. The data comprises 104.3 million words and covers genres such as general books and magazines, newspapers, business reports, blogs, internet forums, textbooks, and legal documents, among others. Samples were drawn at random from each genre.
All samples have been morphologically analyzed. In addition, tags describing sentence structure and precise bibliographical information are provided. Copyright negotiations have also been completed, so the corpus can be used without copyright concerns.
"The Balanced Corpus of Contemporary Written Japanese" is available to the public in three ways: two free online versions (Shonagon and Chunagon) and a paid offline version. Requests to use the corpus for commercial purposes are considered on an individual basis; for such use, please contact us at the address below.
"Shonagon" requires no registration and can be used freely. Use of "Chunagon" or the Paid edition requires a usage contract. The contract period is 1 year for the online version and 2 years for the offline version, after which the contract can be automatically renewed.
Please note that the Paid edition contains only the corpus data itself and does not include any reference aids (such as dictionary tools).
Explanation of the Properties of the BCCWJ
XML Document Structure / Morphological Information / Integration of Morphological Information in XML Documents
For inquiries, please contact: kotonoha[at]ninjal.ac.jp (please replace [at] with @)