Diacritic restoration of Turkish tweets with word2vec
Date
2018
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Elsevier - Division Reed Elsevier India Pvt Ltd
Access Rights
info:eu-repo/semantics/openAccess
Abstract
Social media platforms such as Twitter have grown at a tremendous pace in recent years and have become an important source of data for countless fields. This has attracted the interest of researchers, and many machine learning and natural language processing studies have been conducted on social media data. However, the language used on social media contains far more noise than formal written language. In this article, we present a study on diacritic restoration, one of the major difficulties of social media text normalization, in order to reduce the noise problem. Diacritics are marks that change the sound values of letters and are used in many languages besides Turkish. We propose a 3-step model to address the diacritic restoration problem. In the first step, a candidate word generator produces possible word forms; in the second step, a language validator selects the valid word forms; and in the final step, Word2vec is used to create vector representations of the words and to choose the most appropriate word by using cosine similarities. The proposed method was tested on two ad hoc datasets and on a real dataset. Experiments on the small ad hoc dataset and the real dataset yielded a relative error reduction of 37.8% with an average performance of 94.5%. In addition, tests on more than 6 M words from the large ad hoc dataset yielded strong performance, with an error rate of 3.9%. Furthermore, the proposed method was tested on a binary classification problem consisting of highway traffic data in order to evaluate its effect on classification performance, and a 3.1% increase in classification performance was achieved. (C) 2018 Karabuk University. Publishing services by Elsevier B.V.
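The three steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the character map, the tiny `LEXICON`, the two-dimensional `VECTORS`, and the function names are all hypothetical stand-ins chosen for illustration. In practice the validator would presumably be a morphological analyzer or a large dictionary, and the vectors would come from a trained Word2vec model. The sketch also assumes lowercase input and context words that already carry their diacritics.

```python
from itertools import product
from math import sqrt

# Assumption: each ASCII letter may stand for itself or its Turkish counterpart.
DIACRITIC_MAP = {"c": "cç", "g": "gğ", "i": "iı", "o": "oö", "s": "sş", "u": "uü"}

# Hypothetical toy resources; a real system would use a full lexicon or
# morphological analyzer and vectors from a trained Word2vec model.
LEXICON = {"acı", "açı", "biber", "üçgen"}
VECTORS = {
    "acı": [1.0, 0.0],    # "bitter / pain"
    "açı": [0.0, 1.0],    # "angle"
    "biber": [0.9, 0.1],  # "pepper"
    "üçgen": [0.1, 0.9],  # "triangle"
}

def generate_candidates(word):
    """Step 1: produce every diacritic variant of a lowercase ASCII word."""
    options = [DIACRITIC_MAP.get(ch, ch) for ch in word]
    return ["".join(chars) for chars in product(*options)]

def validate(candidates, lexicon):
    """Step 2: keep only the candidates accepted by the language validator."""
    return [w for w in candidates if w in lexicon]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def restore(word, context, lexicon, vectors):
    """Step 3: pick the valid candidate most similar to the context words."""
    valid = validate(generate_candidates(word), lexicon)
    if not valid:
        return word          # nothing validated: leave the word unchanged
    if len(valid) == 1:
        return valid[0]      # unambiguous: no similarity check needed
    ctx = [vectors[w] for w in context if w in vectors]

    def score(candidate):
        # Mean cosine similarity to the context; 0.0 if no vectors are available.
        if candidate not in vectors or not ctx:
            return 0.0
        return sum(cosine(vectors[candidate], c) for c in ctx) / len(ctx)

    return max(valid, key=score)

print(restore("aci", ["biber"], LEXICON, VECTORS))
print(restore("aci", ["üçgen"], LEXICON, VECTORS))
```

With these toy vectors, the ambiguous input "aci" resolves to "acı" next to "biber" and to "açı" next to "üçgen". Averaging cosine similarity over the context words is one simple scoring choice; the abstract states only that cosine similarities are used, not how they are aggregated.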
Description
Keywords
Text mining, Diacritics restoration, Twitter, Tweet normalization
Source
Engineering Science and Technology-An International Journal-Jestech
WoS Q Value
N/A
Scopus Q Value
Q1
Volume
21
Issue
6