UniDic is a generic term for the following three things:
(1) the linguistic design of a uniform word unit (the Short Unit Word) and of an electronic dictionary based on a hierarchical headword structure, as defined by the National Institute for Japanese Language and Linguistics (NINJAL);
(2) the UniDic database, a relational database that implements this design; and
(3) UniDic for morphological analysis, an analysis dictionary for the morphological analyser MeCab whose entries (headwords) are the short units exported from the database.
This website publishes and distributes (3), UniDic for morphological analysis.
Because the entries of this MeCab dictionary are Short Unit Words, morphological analysis with UniDic is also referred to as "(automatic) Short Unit Word analysis".
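To make the idea of Short Unit Word entries concrete, the following Python sketch parses text in MeCab's default output layout: one line per unit, with the surface form and a comma-separated feature string separated by a tab, and a line reading EOS closing each sentence. The sample string is hand-written for illustration and is not real analyser output. Actual UniDic feature strings have more columns than shown, and their layout depends on the dictionary version. The function name parse_mecab_output is an assumption of this sketch.

```python
import csv


def parse_mecab_output(text):
    """Split MeCab-style output into sentences of (surface, features) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        if line == "EOS":  # MeCab closes every sentence with an EOS line
            sentences.append(current)
            current = []
            continue
        if not line:
            continue
        surface, _, feature_str = line.partition("\t")
        # Use the csv module because a feature value may itself be quoted
        # and contain a comma.
        features = next(csv.reader([feature_str]))
        current.append((surface, features))
    if current:  # tolerate output without a final EOS
        sentences.append(current)
    return sentences


# Hand-written illustrative sample (assumption): only four feature columns.
SAMPLE = (
    "国語\t名詞,普通名詞,一般,*\n"
    "を\t助詞,格助詞,*,*\n"
    "調べる\t動詞,一般,*,*\n"
    "EOS\n"
)

for surface, features in parse_mecab_output(SAMPLE)[0]:
    print(surface, features[0])
```

Each parsed unit keeps the full feature list, so any column a given UniDic version provides can be read by position once its layout is known.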
The primary objective of UniDic is to support the annotation of the corpora built at NINJAL.
The UniDic database at NINJAL is linked to the corpus databases of the same institute, and every Short Unit Word in a completed corpus database is:
- registered in the UniDic database, and
- linked to (that is, references) a unique entry in the UniDic database.
Managing the corpora and the dictionary together in this integrated way has the following advantages.
- Annotating a corpus with Short Unit Word information becomes a matter of "just selecting which entry in the UniDic database corresponds to each short unit that appears in the corpus". This prevents errors in which different information (e.g. conjugation information) is assigned to the same short unit at different locations in the corpus, and so reduces the likelihood of inconsistency within the corpus.
- Even when a new kind of information or attribute (item) that does not yet exist is added to the UniDic database, it can be reflected in the corpus (as a new item) immediately through the links between the databases.
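The two advantages above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the actual schema of the UniDic or corpus databases: the table names dict_entry and corpus_token, the column names, and the sample rows are all assumptions. Corpus tokens store only a reference to a dictionary entry, so a token cannot point at a non-existent entry, and an attribute added later to the dictionary is visible for every token through a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce the reference
conn.executescript("""
CREATE TABLE dict_entry (              -- hypothetical dictionary table
    entry_id INTEGER PRIMARY KEY,
    lemma    TEXT NOT NULL,
    pos      TEXT NOT NULL
);
CREATE TABLE corpus_token (            -- hypothetical corpus table
    token_id INTEGER PRIMARY KEY,
    position INTEGER NOT NULL,
    surface  TEXT NOT NULL,
    entry_id INTEGER NOT NULL REFERENCES dict_entry(entry_id)
);
""")

conn.executemany("INSERT INTO dict_entry VALUES (?, ?, ?)",
                 [(1, "書く", "動詞"), (2, "本", "名詞")])
# Annotation only selects an entry; no lexical information is copied.
conn.executemany("INSERT INTO corpus_token VALUES (?, ?, ?, ?)",
                 [(1, 0, "本", 2), (2, 1, "書い", 1), (3, 7, "書く", 1)])

# Later, a new attribute is added on the dictionary side only.
conn.execute("ALTER TABLE dict_entry ADD COLUMN word_origin TEXT")
conn.executemany("UPDATE dict_entry SET word_origin = ? WHERE entry_id = ?",
                 [("漢", 2), ("和", 1)])

# Every linked token sees the new attribute through the join.
rows = conn.execute("""
    SELECT t.surface, e.lemma, e.word_origin
    FROM corpus_token AS t
    JOIN dict_entry AS e ON e.entry_id = t.entry_id
    ORDER BY t.position
""").fetchall()
for row in rows:
    print(row)
```

Because both occurrences of the verb reference the same entry, they necessarily receive the same attribute values, which is the consistency property described in the first bullet.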
The biggest benefit of these referential links to the corpus database is an example-indexing capability: a huge number of examples can be extracted from the corpus at once, starting from a single entry in the UniDic database. With 'UniDicExplorer', the operation tool for the UniDic database shown in the figure below, one specifies a Short Unit Word entry in the database and simply presses the example enumeration button to obtain a list of the corresponding examples from the corpus database, at the level of the lexeme, the form, or the orthographic form.
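The kind of lookup described above can be sketched as follows, again with Python's sqlite3 module. This is not how UniDicExplorer is implemented; the tables (lexeme, form, orth, corpus_token), the function list_examples, and the sample rows are assumptions made for illustration. The sketch only shows how a three-level headword hierarchy lets one query collect examples at whichever level is chosen.

```python
import sqlite3

SCHEMA = """
CREATE TABLE lexeme (lexeme_id INTEGER PRIMARY KEY, lemma TEXT);
CREATE TABLE form (form_id INTEGER PRIMARY KEY,
                   lexeme_id INTEGER REFERENCES lexeme(lexeme_id),
                   reading TEXT);
CREATE TABLE orth (orth_id INTEGER PRIMARY KEY,
                   form_id INTEGER REFERENCES form(form_id),
                   spelling TEXT);
CREATE TABLE corpus_token (token_id INTEGER PRIMARY KEY,
                           orth_id INTEGER REFERENCES orth(orth_id),
                           context TEXT);
"""

# Fixed whitelist of filter columns, so interpolating them below is safe.
LEVEL_COLUMNS = {"lexeme": "l.lexeme_id", "form": "f.form_id", "orth": "o.orth_id"}


def build_demo_db():
    """Create an in-memory database with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO lexeme VALUES (1, '矢張り')")
    conn.executemany("INSERT INTO form VALUES (?, ?, ?)",
                     [(1, 1, "ヤハリ"), (2, 1, "ヤッパリ")])
    conn.executemany("INSERT INTO orth VALUES (?, ?, ?)",
                     [(1, 1, "やはり"), (2, 1, "矢張り"), (3, 2, "やっぱり")])
    conn.executemany("INSERT INTO corpus_token VALUES (?, ?, ?)",
                     [(1, 1, "やはり雨だ"), (2, 2, "矢張り来た"),
                      (3, 3, "やっぱり無理"), (4, 1, "やはりそうか")])
    return conn


def list_examples(conn, level, entry_id):
    """Return corpus contexts linked to one entry at the given level."""
    column = LEVEL_COLUMNS[level]
    query = f"""
        SELECT t.context
        FROM corpus_token AS t
        JOIN orth   AS o ON o.orth_id = t.orth_id
        JOIN form   AS f ON f.form_id = o.form_id
        JOIN lexeme AS l ON l.lexeme_id = f.lexeme_id
        WHERE {column} = ?
        ORDER BY t.token_id
    """
    return [row[0] for row in conn.execute(query, (entry_id,))]


conn = build_demo_db()
print(list_examples(conn, "lexeme", 1))  # all spellings and readings
print(list_examples(conn, "form", 1))    # only the reading ヤハリ
print(list_examples(conn, "orth", 1))    # only the spelling やはり
```

Searching at the lexeme level gathers every spelling and reading variant in one step, while the form and orth levels narrow the result to a particular reading or spelling.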
Unfortunately, there is currently no service that gives users outside the institute direct access to the in-house corpus database through UniDicExplorer. However, for corpora that have already been published, the "Chunagon" corpus search system allows simpler and more flexible example searches, for instance by specifying co-occurring or adjacent units.
In addition, although it does not search the UniDic database itself, CradleExpress is a service for searching the lexicon file (lex.csv) of UniDic.
As noted above, the primary goal of UniDic is to support the annotation of the corpora built at NINJAL. UniDic for morphological analysis was originally intended (i) to create the automatically annotated Short Unit Word data (non-core data) in the Corpus of Spontaneous Japanese (CSJ). Since the construction of the Balanced Corpus of Contemporary Written Japanese (BCCWJ), a working policy of "manually correcting the results of automatic short-unit analysis produced with the analysis dictionary" has been adopted, and the dictionary is now also used (ii) as a means of reducing the cost of manual annotation.
UniDic for morphological analysis as published on this website is likewise intended for these two uses (i, ii). The analysis accuracy reported in the articles listed under "References" (参考文献) below is presented both as the accuracy of corpora created by automatic annotation and as a reference value for users who try to build a similar corpus with UniDic for morphological analysis (i.e. the degree to which something similar can be reproduced).
Short Unit Words are designed with an emphasis on example searches that miss few instances (e.g. through the choice of unit length, and through de-contextualisation based on a possibility-based part-of-speech system and an etymology-based principle). For this reason they are not well suited to syntactic and semantic analysis in natural language processing.
For syntactic analysis, we recommend Long Unit Words, which focus on syntactic function and are identified top-down from bunsetsu.
On the other hand, because the Short Unit Word is a uniform unit intended for example searches, it has been reported to give consistent automatic analysis regardless of whether context is present or how contexts differ, and to be effective in information retrieval systems such as search engines [高橋+, 16].
- 伝 康晴, 浅原 正幸: 「リレーショナル・データベースによる統合的言語資源管理環境」, 第1回『話し言葉の科学と工学』ワークショップ講演予稿集, pp.77-84 (2001).
- 伝 康晴, 小木曽 智信, 小椋 秀樹, 山田 篤, 峯松 信明, 内元 清貴, 小磯 花絵: 「コーパス日本語学のための言語資源：形態素解析用電子化辞書の開発とその応用」, 日本語科学, Vol.22, pp.101-123 (2007).
- 小木曽 智信, 中村 壮範: 「『現代日本語書き言葉均衡コーパス』形態論情報アノテーション支援システムの設計・実装・運用」, 自然言語処理, Vol.21, No.2, pp.301-332 (2014).
- 小木曽 智信, 中村 壮範: 「『現代日本語書き言葉均衡コーパス』形態論情報データベースの設計と実装 改訂版」
- 岡 照晃: 「言語研究のための電子化辞書」, コーパスと辞書, 講座 日本語コーパス 7, pp.1-28, 朝倉書店 (2019).
- 伝 康晴, 中村 純平, 小木曽 智信, 小椋 秀樹: 「語種情報を用いた同表記異音語の解消」, 言語処理学会第14回年次大会, pp.69-72 (2008).
- Yasuharu Den, Junpei Nakamura, Toshinobu Ogiso, Hideki Ogura. A Proper Approach to Japanese Morphological Analysis: Dictionary, Model, and Evaluation, In Proceedings of the sixth international conference on Language Resources and Evaluation (LREC 2008), pp.1019-1024 (2008).
- 小木曽 智信, 小町 守, 松本 裕治: 「歴史的日本語資料を対象とした形態素解析」, 自然言語処理, Vol.20, No.5, pp.727-748 (2013).