When considering Japanese corpora from after the year 2000, one point that stands out is the lack of a well-balanced corpus covering all aspects of the language.
For example, many newspaper companies make databases of their old articles publicly available, and those archives contain several hundred million words. However, even if your focus is solely on written Japanese, such a corpus will not give a good grasp of written Japanese as a whole; the Japanese found in magazine articles is clearly different from that of newspapers, and that of literary works even more so. For literary works such as novels, the "Aozora Bunko" digital library has made a great effort to make numerous works available for public use. However, because all such works must be in the public domain, it can only give examples of Japanese from before the 1950s.
To summarize the above, when looking for a corpus for linguistic study of the Japanese language, existing corpora have the following problems:
To address these problems, the corpus was built according to the design principles below.
Creating a general-purpose corpus raises a few potential problems.
The sampling method necessarily depends on the objectives of the corpus. In NINJAL's previous surveys of lexical items, it was appropriate to draw small samples of a few dozen words each. For the corpus to be used for discourse analysis and semantic factoring, however, it is important to be able to understand context, and so much larger samples, of thousands of words each, are desirable.
Deciding what exactly is representative of the "Reality" of modern Japanese is also a potential problem. This is because, as illustrated in the figure below, there are three possible phases in the existence of a written text which can be focused upon: "Production", "Circulation" and "Reception". Each of these phases holds meaning when considering the reality of written language.
For the "Production" and "Circulation" phases, existing data can be used to create samples. However, there is little pertinent data on the "Reception" phase, and because of the prohibitively large amount of labor involved in conducting a large-scale survey to acquire it, no such data on the influence of "Reception" will be available.
To go some way toward resolving the issues mentioned above, three sub-corpora with differing properties were constructed. These are the Publication, Library, and Special-Purpose sub-corpora. Each sub-corpus is designed to reflect a different aspect of written Japanese: the Publication sub-corpus is focused on the production aspect, and the Library sub-corpus on the circulation aspect. The Special-Purpose sub-corpus contains texts such as web documents, and other texts which do not fall under the purview of the other two sub-corpora, with the objective of being of use for specialized research.
As mentioned above, due to the untenable amount of time required, no sub-corpus was constructed which is reflective of the "Reception" aspect of written language. The figure below outlines the details of the three sub-corpora.
The green, top-left section of the figure above gives details of the Publication sub-corpus. The parent data set was created based on a catalogue of publications, and a random sample was compiled from it. The sub-corpus aims to give a representative cross-section of published written Japanese.
Regarding sampling, the total number of words published in a set period (for example, over one year) was estimated, and samples were allocated according to each type of work's share of the total words published over the period. Different methods were used to make these estimates. For newspapers, pages were arbitrarily selected over a period of a week and the characters they contained were counted, and the total amount published over a year was extrapolated from that number. For books, the average number of words per page was calculated based on examples from the National Diet Library catalogue, and then extrapolated to a total number of words published over the period.
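The estimation and allocation steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used for the corpus: the function names, the largest-remainder rounding rule, and all numbers in the usage example are hypothetical.

```python
def estimate_total(sampled_page_counts, pages_in_period):
    """Extrapolate a period total from counts taken on a few sampled pages.

    sampled_page_counts: characters (or words) counted on each sampled page.
    pages_in_period: assumed number of pages published over the whole period.
    """
    mean_per_page = sum(sampled_page_counts) / len(sampled_page_counts)
    return mean_per_page * pages_in_period


def allocate_samples(population_sizes, total_samples):
    """Split total_samples across text types in proportion to their size.

    Uses largest-remainder rounding (an assumption made for this sketch)
    so that the allocations always sum to total_samples.
    """
    total = sum(population_sizes.values())
    quotas = {k: total_samples * v / total for k, v in population_sizes.items()}
    allocation = {k: int(q) for k, q in quotas.items()}
    leftover = total_samples - sum(allocation.values())
    # Hand the remaining samples to the types with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - allocation[k], reverse=True)
    for k in by_remainder[:leftover]:
        allocation[k] += 1
    return allocation


# Hypothetical figures, for illustration only.
newspaper_total = estimate_total([1000, 1200, 800], pages_in_period=10)
shares = allocate_samples({"books": 70, "magazines": 20, "newspapers": 10}, 100)
```

With these made-up inputs, a text type that accounts for 70% of the estimated population receives 70 of the 100 samples.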
The purple section in the top-right of the figure above gives information on the Library sub-corpus. The sub-corpus is not focused solely on published works, but rather on texts which have been confirmed to be in wide circulation. With the cooperation of the Tokyo Public Libraries, a parent data set has been created from their collections, and a random sample of that set was taken. This sub-corpus is therefore focused on the 'circulation' aspect of written language.
The parent data set of the Library sub-corpus complements the book portion of the parent data set of the Publication sub-corpus, allowing for expanded sampling at roughly similar ratios. The book portion of the Publication sub-corpus contains a total of roughly 48.54 billion characters, and the total number of characters contained in the circulated volumes of the libraries of Tokyo's 13 municipalities is quite close, at 47.88 billion.
The grey section at the bottom of the figure above relates to the Special-Purpose sub-corpus, which contains other parts of the parent data set. This sub-corpus contains data which, while very important to Japanese, is difficult to fit into the two previous sub-corpora, as well as data which is of importance for addressing linguistic issues researched at NINJAL.
More specifically, the sub-corpus contains the following types of data: bestsellers, business reports, authorized textbooks, PR documents, web postings (from Yahoo! Answers), blogs (from Yahoo! Blogs), poetry, legal documents, and minutes of the National Diet.
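How close the two parent data sets are can be checked with a short calculation. The sketch below simply takes the two character totals quoted above at face value; the variable names are illustrative.

```python
# Character totals as quoted in the text (in billions of characters).
publication_books = 48.54
library_books = 47.88

# Size of the Library parent set relative to the Publication book portion.
ratio = library_books / publication_books
relative_gap = 1 - ratio

print(f"ratio: {ratio:.4f}, gap: {relative_gap:.2%}")
```

The Library parent set comes to a little under 99% of the size of the book portion of the Publication parent set, which is the sense in which the two are "quite close".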
The BCCWJ seeks to represent the original texts as accurately as possible, but due to personal information protection laws, any portions of text which might reveal personal information are hidden. There are also instances where entire samples had to be excluded.
If we receive requests from copyright holders to withhold portions of texts, the request will be granted regardless of whether or not the section would violate personal information protection laws. For example, on request from a newspaper, we would remove the actual names of "Suspects" or "Victims" from a newspaper article.