Center for Language Resource Development (NINJAL)
 

Introduction to the BCCWJ

"The Balanced Corpus of Contemporary Written Japanese" (BCCWJ) is a corpus built to capture the breadth of contemporary written Japanese. It contains extensive samples of modern Japanese texts, chosen to make the corpus as balanced as possible. The data comprises 104.3 million words and covers genres such as general books and magazines, newspapers, business reports, blogs, internet forums, textbooks, and legal documents, among others. Samples were drawn at random within each genre.

All samples have been morphologically analyzed. Tags for sentence structure and precise bibliographical information are also provided. Copyright clearance has been completed, so the corpus can be used without concern about rights.

"The Balanced Corpus of Contemporary Written Japanese" is available to the public in three forms: two free online versions (Shonagon and Chunagon) and a paid offline version. Requests to use the corpus for commercial purposes are considered on an individual basis; for commercial use, please contact us at the address below.

Online version (free)

Shonagon

Chunagon

Offline version (paid)

DVD data

※For academic and general use.

 

No registration is needed for "Shonagon", and it can be used freely. Use of "Chunagon" or the paid edition requires a usage contract. The contract period is 1 year for the online version and 2 years for the offline version, after which the contract can be automatically renewed.

Please note that the paid edition contains only the corpus data itself and does not include any reference aids (such as dictionary tools).


Explanation of the Properties of the BCCWJ

The main focus is on compiling published examples of contemporary written Japanese.

  • The corpus is focused on a broad selection of published works: in addition to a range of newspapers and magazines, it also covers business reports and textbooks.
  • There is also a section of text from the web, such as message boards.
  • Private writing, such as personal journals and messages, is not a focus.
  • The compiled materials date from a roughly 30-year period (1976-2006). The main body of the corpus texts is from 1986-2006. This is so that the corpus draws on a more varied temporal sampling of ISBNs (International Standard Book Numbers) in the compiled publications.

To fulfil the above-mentioned objectives, samples were taken entirely at random.

Sample selection
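As a rough illustration of drawing random samples within each genre, here is a minimal Python sketch. It is not the procedure actually used to build the BCCWJ: the genre names, population sizes, sample sizes, and the function `sample_by_genre` are all hypothetical and chosen only for the example.

```python
import random


def sample_by_genre(population, sizes, seed=0):
    """Draw a simple random sample (without replacement) from each genre.

    `population` maps a genre name to a list of candidate text IDs;
    `sizes` maps a genre name to the number of texts to draw.
    Both structures are hypothetical and exist only for this sketch.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    samples = {}
    for genre, items in population.items():
        k = min(sizes.get(genre, 0), len(items))  # never ask for more than exist
        samples[genre] = rng.sample(items, k)
    return samples


# Hypothetical populations; the real corpus is far larger.
population = {
    "books": [f"book-{i}" for i in range(1000)],
    "magazines": [f"magazine-{i}" for i in range(300)],
    "newspapers": [f"newspaper-{i}" for i in range(200)],
}
sizes = {"books": 50, "magazines": 15, "newspapers": 10}

samples = sample_by_genre(population, sizes)
```

Sampling each genre separately keeps every genre represented at its intended size, which a single random draw over the pooled texts would not guarantee.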

The scope of the corpus is approximately 100 million words (lexical items), excluding spaces and symbols.

Morphological information about the samples (each word in the texts is classified by part of speech) is included, along with other information, in XML files.

XML Document Structure / Morphological Information / Integration of Morphological Information in XML Documents
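To give a feel for working with morphological information stored in XML, here is a minimal Python sketch using the standard library. The element and attribute names (`sample`, `sentence`, `w`, `pos`, `lemma`) are invented for illustration and are not the BCCWJ schema; the actual format is described in the documents listed above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical XML fragment; the tag and attribute names are assumptions
# made for this sketch, not the real BCCWJ markup.
SAMPLE_XML = """
<sample id="demo-0001" genre="books">
  <sentence>
    <w pos="noun" lemma="言語">言語</w>
    <w pos="particle" lemma="の">の</w>
    <w pos="noun" lemma="研究">研究</w>
  </sentence>
</sample>
"""


def read_words(xml_text):
    """Return (surface form, part of speech, lemma) for every <w> element."""
    root = ET.fromstring(xml_text.strip())
    return [(w.text, w.get("pos"), w.get("lemma")) for w in root.iter("w")]


words = read_words(SAMPLE_XML)

# Count how many words carry each part-of-speech label.
pos_counts = Counter(pos for _, pos, _ in words)
```

With real data, the same pattern would apply after replacing the hypothetical names with those defined in the corpus documentation.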

Copyright clearance has been fully completed so that the corpus can be open to and used by anyone.

Basic Design Policy

For inquiries, please contact: kotonoha[at]ninjal.ac.jp (replace [at] with @)

 
 

Links