UniDic: Electronic Dictionary with Uniformity and Identity

What is UniDic?

UniDic is a generic term for:
(1) the linguistic design of a uniform word unit (the Short Unit Word) and of an electronic dictionary with a hierarchical entry structure, as defined by the National Institute for Japanese Language and Linguistics (NINJAL);
(2) the UniDic database, a relational database that implements this design;
and (3) UniDic for morphological analysis, a dictionary for the morphological analyser MeCab whose entries (headwords) are the short units exported from the database.

This website publishes and distributes (3), UniDic for morphological analysis.

Morphological analysis using UniDic is also referred to as "Short Unit Word (automated) analysis", because UniDic uses Short Unit Words as the entries of the MeCab analysis dictionary.
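To give a rough idea of how a program might consume short-unit analysis results, the sketch below parses text in MeCab's default output layout: one token per line as surface form, a tab, then comma-separated features, terminated by `EOS`. The sample string and its two feature columns are made up for illustration and are not real UniDic output; the actual feature columns depend on the UniDic version and its output-format settings.

```python
import csv
from typing import NamedTuple


class Token(NamedTuple):
    surface: str
    features: list[str]


def parse_mecab_output(text: str) -> list[Token]:
    """Parse 'surface<TAB>feat1,feat2,...' lines up to the EOS marker."""
    tokens = []
    for line in text.splitlines():
        if line == "EOS":
            break
        if not line:
            continue
        surface, _, feature_field = line.partition("\t")
        # The feature field is CSV, so quoted values containing commas survive.
        features = next(csv.reader([feature_field]))
        tokens.append(Token(surface, features))
    return tokens


# Hypothetical sample for illustration, NOT real UniDic output.
SAMPLE = "国語\t名詞,普通名詞\nを\t助詞,格助詞\n調べる\t動詞,一般\nEOS\n"

tokens = parse_mecab_output(SAMPLE)
print([t.surface for t in tokens])
```

Each `Token` here corresponds to one short unit; in practice the feature list would be read according to the column layout documented for the dictionary version in use.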

Objectives of UniDic

Objectives of the UniDic database

The primary objective of UniDic is to support the annotation of corpora built at NINJAL.

The UniDic database at NINJAL is referenced by the corpus databases of the same institute, and every Short Unit Word in a completed corpus database is:

• registered in the UniDic database, and
• linked to (i.e. references) a unique entry in the UniDic database.

Managing the corpora and the dictionary as one integrated system has the following advantages.

• When annotating a corpus with Short Unit Word information, the task is "just select which entry in the UniDic database corresponds to each short unit that appears in the corpus". This prevents errors in which the same short unit is given different information (e.g. a different conjugation type) at different locations in the corpus, and so reduces the likelihood of inconsistency in the corpus.
• Even when information or attributes (items) that do not yet exist in the UniDic database are newly added to it, they can be reflected in the corpus (as new items) immediately through the links between the databases.
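The two advantages above follow from ordinary relational linking, which can be sketched with an in-memory SQLite database. The table names, column names, and sample rows below are hypothetical and do not reflect the actual UniDic or corpus database schemas; the `word_origin` attribute is likewise only a stand-in for "a newly added item".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the token-to-entry link
conn.executescript("""
    CREATE TABLE dict_entry (          -- hypothetical dictionary table
        entry_id INTEGER PRIMARY KEY,
        lemma    TEXT NOT NULL,
        pos      TEXT NOT NULL
    );
    CREATE TABLE corpus_token (        -- hypothetical corpus table
        token_id INTEGER PRIMARY KEY,
        position INTEGER NOT NULL,
        surface  TEXT NOT NULL,
        entry_id INTEGER NOT NULL REFERENCES dict_entry(entry_id)
    );
""")

conn.execute("INSERT INTO dict_entry VALUES (1, '書く', '動詞')")
# Annotation amounts to choosing an entry_id for each token; the attributes
# themselves are stored once, in the dictionary.
conn.executemany(
    "INSERT INTO corpus_token VALUES (?, ?, ?, ?)",
    [(1, 10, "書い", 1), (2, 57, "書く", 1)],
)

# A new attribute added to the dictionary is visible for every linked token
# as soon as it is set, without touching the corpus table.
conn.execute("ALTER TABLE dict_entry ADD COLUMN word_origin TEXT")
conn.execute("UPDATE dict_entry SET word_origin = 'native' WHERE entry_id = 1")

rows = conn.execute("""
    SELECT t.position, t.surface, e.lemma, e.pos, e.word_origin
    FROM corpus_token AS t JOIN dict_entry AS e USING (entry_id)
    ORDER BY t.position
""").fetchall()
for row in rows:
    print(row)
```

Because both tokens point at the same entry, they cannot disagree on part of speech or any other attribute, and the foreign-key constraint rejects a token that points at an entry that does not exist.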

The greatest benefit of the referential relationship with the corpus database is example indexing: a huge number of examples can be extracted from the corpus at once, starting from a single entry in the UniDic database. With ‘UniDicExplorer’, the operation tool for the UniDic database shown in the figure below, one specifies a Short Unit Word entry in the database and simply presses the example-enumeration button to obtain a list of the corresponding examples from the corpus database at the level of lexeme, form, or orthographic form.
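Enumerating examples by lexeme, form, or orthographic form can be pictured as grouping corpus hits at successively finer levels of the entry hierarchy. The following is a minimal sketch with hypothetical field names and illustrative sample data; it does not reflect UniDicExplorer's actual implementation or the real database schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One corpus occurrence linked to its dictionary entry (hypothetical fields)."""
    lexeme: str   # top level of the hierarchy
    form: str     # word form under the lexeme
    orth: str     # orthographic form under the word form
    context: str  # surrounding text in the corpus


# Illustrative sample data only.
HITS = [
    Hit("矢張り", "ヤハリ", "やはり", "それは やはり 難しい"),
    Hit("矢張り", "ヤハリ", "矢張り", "矢張り そうだった"),
    Hit("矢張り", "ヤッパリ", "やっぱり", "やっぱり 行く"),
]

LEVELS = ("lexeme", "form", "orth")


def enumerate_examples(hits, lexeme, level):
    """Group the contexts of one lexeme's hits by the key path down to `level`."""
    depth = LEVELS.index(level) + 1
    grouped = defaultdict(list)
    for hit in hits:
        if hit.lexeme != lexeme:
            continue
        key = tuple(getattr(hit, name) for name in LEVELS[:depth])
        grouped[key].append(hit.context)
    return dict(grouped)


for level in LEVELS:
    groups = enumerate_examples(HITS, "矢張り", level)
    print(level, len(groups), "group(s)")
```

At the lexeme level all three occurrences fall into one group, while the form and orthographic-form levels split them progressively further.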

Unfortunately, no service currently gives users outside the institute direct access to the in-house corpus database through UniDicExplorer. However, for corpora that have already been published, the “Chunagon” corpus search system allows more flexible and simpler example searches, for instance by specifying co-occurrence and concatenation conditions.

In addition, although it does not search the UniDic database itself, CradleExpress is a service for searching the lexical file (lex.csv) of UniDic.

Objectives of UniDic for morphological analysis

As noted above, the primary goal of UniDic is to support the annotation of corpora built at NINJAL. UniDic for morphological analysis was originally intended (i) to create automatically annotated Short Unit Word data (the non-core data) for the Corpus of Spontaneous Japanese (CSJ). Since the construction of the Balanced Corpus of Contemporary Written Japanese (BCCWJ), a working policy of “manually correcting the results of automatic short-unit analysis produced with the analysis dictionary” has been adopted, so the dictionary now also serves (ii) as a tool for reducing the cost of manual annotation.

UniDic for morphological analysis as published on this website is likewise intended for the two uses above (i, ii). The analysis performance reported in the articles listed in the “References” section below is therefore presented both as the accuracy of corpora created by automatic annotation and as a reference value for users who attempt to create a similar corpus with this dictionary (i.e. the degree to which something similar can be reproduced).

Short Unit Words are not suitable for syntactic and semantic analyses in the field of natural language processing, because they are designed with an emphasis on example searches that miss few instances (e.g. short unit lengths, a part-of-speech system based on possibility, and de-contextualisation based on an etymological principle).

For syntactic analysis, we recommend using Long Unit Words, which focus on syntactic function and are identified top-down from the bunsetsu.

On the other hand, because the Short Unit Word is a uniform unit designed for example searches, it has been reported to yield consistent automatic analysis regardless of whether context is present and of differences between contexts, and to be effective in information retrieval systems such as search engines [高橋+, 16].

参考文献 (References)

UniDicの設計と実装全体に関係する文献 (Design and Implementation)

UniDicデータベースに関する文献 (UniDic Databases)

伝康晴, 浅原正幸: 「リレーショナル・データベースによる統合的言語資源管理環境」, 第1回『話し言葉の科学と工学』ワークショップ講演予稿集, pp.77-84 (2001).
伝康晴, 小木曽智信, 小椋秀樹, 山田篤, 峯松信明, 内元清貴, 小磯花絵: 「コーパス日本語学のための言語資源：形態素解析用電子化辞書の開発とその応用」, 日本語科学, Vol.22, pp.101-123 (2007).
小木曽智信, 中村壮範: 「『現代日本語書き言葉均衡コーパス』形態論情報アノテーション支援システムの設計・実装・運用」, 自然言語処理, Vol.21, No.2, pp.301-332 (2014).
小木曽智信, 中村壮範: 「『現代日本語書き言葉均衡コーパス』形態論情報データベースの設計と実装改訂版」

UniDicデータベースからのエクスポートに関係する文献 (Export from the UniDic Database)

鴻野知暁, 小木曽智信: 「見出し語の時代情報を付与した電子化辞書の構築」, 言語処理学会第20回年次大会発表論文集, pp.209-212 (2014).

解析用UniDicに関係する文献 (UniDic for Morphological Analysis)

UniDicを使った日本語研究のケーススタディ (UniDic for Linguistic Researches)

情報検索への応用例 (Application for Information Retrieval)

高橋文彦, 颯々野学: 「情報検索のための単語分割一貫性の定量的評価」, 言語処理学会第22回年次大会(NLP2016), pp.949-952 (2016).