Creating Domain Dictionaries for Serbian Language
Abstract: Automatically created thesauri are used to improve methods for clustering, mining, and sentiment analysis of a specific data corpus. There are different methods for automatically discovering similar words. Some are based on text corpora and mathematical similarity measures, while others use graphs and monolingual dictionaries. The Serbian language is richer than English in both vocabulary and grammar, and known methods for automatic thesaurus generation may neglect some of these language-specific features. This paper presents a method for automatically generating a thesaurus from repositories of documents in the Serbian language, based on mathematical methods such as the chi-square test, cosine similarity, and the Jaccard similarity coefficient. The proposed method can be applied to either normalized or non-normalized documents.
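The three measures named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the representation of a word as a bag of co-occurring context words, and the toy counts for the Serbian words "kuća" and "dom" are all assumptions made for illustration only.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (dict-like)."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_coefficient(a, b):
    """Size of the intersection over size of the union of two key sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def chi_square_2x2(n11, n10, n01, n00):
    """Pearson chi-square statistic for a 2x2 co-occurrence table.

    n11: documents containing both words, n10: only the first word,
    n01: only the second word, n00: neither word.
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


# Hypothetical context counts (made-up data, not from the paper).
kuca = Counter({"velika": 2, "stara": 1, "porodica": 1})
dom = Counter({"stara": 1, "porodica": 2, "topli": 1})

cos_sim = cosine_similarity(kuca, dom)
jac_sim = jaccard_coefficient(kuca, dom)
chi2 = chi_square_2x2(10, 0, 0, 10)
```

Cosine similarity is sensitive to how often each context word occurs, while the Jaccard coefficient only looks at which context words are shared; the chi-square statistic instead tests whether two words co-occur in documents more often than independence would predict. Whether the counts come from normalized (for example, lemmatized) or non-normalized tokens is a preprocessing choice that this sketch leaves open.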
English
2016
© All rights reserved