Diacritic restoration of Turkish tweets with word2vec

dc.authorid: Ozer, Ilyas/0000-0003-2112-5497
dc.authorid: OZER, ZEYNEP/0000-0001-8654-0902
dc.authorid: FINDIK, OGUZ/0000-0001-5069-6470
dc.contributor.author: Ozer, Zeynep
dc.contributor.author: Ozer, Ilyas
dc.contributor.author: Findik, Oguz
dc.date.accessioned: 2024-09-29T15:57:33Z
dc.date.available: 2024-09-29T15:57:33Z
dc.date.issued: 2018
dc.department: Karabük Üniversitesi [en_US]
dc.description.abstract: Social media platforms such as Twitter have grown at a tremendous pace in recent years and have become an important source of data for countless fields. This has attracted the interest of researchers, and many machine learning and natural language processing studies have been conducted on social media data. However, the language used on social media contains far more noise than formal written language. In this article, we present a study on diacritic restoration, one of the major difficulties of social media text normalization, in order to reduce the noise problem. Diacritics are marks used to change the sound values of letters, and they are used in many languages besides Turkish. We propose a three-step model to overcome the diacritic restoration problem. In the first step, a candidate word generator produces possible word forms; in the second step, a language validator chooses the correct word forms; in the final step, Word2vec is used to create vector representations of the words and to select the most appropriate word using cosine similarities. The proposed method was tested on two ad-hoc datasets and on a real dataset. Experiments on the small ad-hoc dataset and the real dataset yielded a relative error reduction of 37.8% with an average performance of 94.5%. In addition, tests on more than 6 M words from the large ad-hoc dataset yielded an error rate of 3.9%. Furthermore, the proposed method was tested on a binary classification problem consisting of highway traffic data in order to evaluate its effect on classification performance, and a 3.1% increase in classification performance was achieved. (C) 2018 Karabuk University. Publishing services by Elsevier B.V. [en_US]
dc.identifier.doi: 10.1016/j.jestch.2018.09.002
dc.identifier.endpage: 1127 [en_US]
dc.identifier.issn: 2215-0986
dc.identifier.issue: 6 [en_US]
dc.identifier.scopus: 2-s2.0-85053603934 [en_US]
dc.identifier.scopusquality: Q1 [en_US]
dc.identifier.startpage: 1120 [en_US]
dc.identifier.uri: https://doi.org/10.1016/j.jestch.2018.09.002
dc.identifier.uri: https://hdl.handle.net/20.500.14619/4880
dc.identifier.volume: 21 [en_US]
dc.identifier.wos: WOS:000449023100002 [en_US]
dc.identifier.wosquality: N/A [en_US]
dc.indekslendigikaynak: Web of Science [en_US]
dc.indekslendigikaynak: Scopus [en_US]
dc.language.iso: en [en_US]
dc.publisher: Elsevier - Division Reed Elsevier India Pvt Ltd [en_US]
dc.relation.ispartof: Engineering Science and Technology-An International Journal-Jestech [en_US]
dc.relation.publicationcategory: Article - International Refereed Journal - Institutional Faculty Member [en_US]
dc.rights: info:eu-repo/semantics/openAccess [en_US]
dc.subject: Text mining [en_US]
dc.subject: Diacritics restoration [en_US]
dc.subject: Twitter [en_US]
dc.subject: Tweet normalization [en_US]
dc.title: Diacritic restoration of Turkish tweets with word2vec [en_US]
dc.type: Article [en_US]
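
The three-step model summarized in the abstract (candidate generation, validation, and selection by cosine similarity) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the character map `VARIANTS`, the toy `LEXICON` standing in for the language validator, the hand-made two-dimensional `VECTORS` standing in for trained Word2vec embeddings, and all function names are assumptions made for illustration only.

```python
from itertools import product
from math import sqrt

# Assumption: ASCII letters mapped to the Turkish letters they may stand for.
VARIANTS = {"c": "cç", "g": "gğ", "i": "iı", "o": "oö", "s": "sş", "u": "uü"}

# Assumption: a tiny word list used in place of a real language validator.
LEXICON = {"oldu", "öldü", "hasta", "toplantı"}

# Assumption: hand-made 2-d vectors used in place of trained Word2vec embeddings.
VECTORS = {
    "öldü": [0.9, 0.1],      # "died"
    "oldu": [0.1, 0.9],      # "happened"
    "hasta": [0.8, 0.2],     # "patient"
    "toplantı": [0.2, 0.8],  # "meeting"
}


def generate_candidates(word):
    """Step 1: every way of restoring diacritics on a lowercase ASCII-typed word."""
    options = [VARIANTS.get(ch, ch) for ch in word]
    return ["".join(chars) for chars in product(*options)]


def validate(candidates, lexicon):
    """Step 2: keep only the candidate forms that the lexicon accepts."""
    return [c for c in candidates if c in lexicon]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def choose(candidates, context, vectors):
    """Step 3: pick the candidate with the highest mean similarity to the context."""
    def score(candidate):
        sims = [cosine(vectors[candidate], vectors[w]) for w in context if w in vectors]
        return sum(sims) / len(sims) if sims else 0.0
    return max(candidates, key=score)


def restore(word, context, lexicon, vectors):
    """Run the three steps; return the word unchanged if no candidate survives."""
    valid = [c for c in validate(generate_candidates(word), lexicon) if c in vectors]
    return choose(valid, context, vectors) if valid else word
```

With these toy vectors, `restore("oldu", ["hasta"], LEXICON, VECTORS)` prefers "öldü", while the context `["toplantı"]` keeps "oldu". In practice the vectors would come from a Word2vec model trained on a large corpus, and the candidate count grows exponentially with the number of ambiguous letters, which is one reason a validation step before ranking is useful.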
