Journal publication: A method for K-Means seeds generation applied to text mining

Published in Statistical Methods and Applications

This paper proposes a methodology for producing a set of seeds to be used as the starting point of K-Means-type unsupervised classification algorithms for text mining. The proposal uses the eigenvectors obtained from principal component analysis to extract initial seeds, after appropriate treatment, in search of lightly overlapping clusters that are also clearly identified by keywords. The work is motivated by the authors' interest in identifying previously unknown topics and themes in short texts. To validate the method, it was applied to a sample of labeled e-mails (NG20), which represents a gold standard in the field of text mining. Specifically, several corpora referenced in the literature were used, each configured according to a mix of topics contained in the sample. The proposed method improves on the results of the other state-of-the-art methods with which it is compared.
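The abstract does not spell out the "appropriate treatment" applied to the eigenvectors, so the following Python sketch is only an illustration of the general idea, not the authors' procedure. It assumes a small dense document-term matrix `X`, treats each principal component as two candidate directions (positive and negative), assigns every document to the direction on which it scores highest, and averages the documents of the `k` most populated directions to obtain seeds. The function names `pca_seeds` and `kmeans`, the toy data, and the keyword rule (terms with the largest loadings) are all hypothetical choices made for this example.

```python
import numpy as np


def pca_seeds(X, k, n_keywords=2):
    """Illustrative PCA-based seeding (an assumption, not the paper's exact method)."""
    Xc = X - X.mean(axis=0)  # center each term column
    # Rows of Vt are the eigenvectors of the term covariance matrix.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    comps = Vt[:k]
    # Each component gives two candidate directions: +component and -component.
    directions = np.vstack([comps, -comps])      # shape (2k, n_terms)
    scores = Xc @ directions.T                   # shape (n_docs, 2k)
    owner = scores.argmax(axis=1)                # strongest direction per document
    counts = np.bincount(owner, minlength=2 * k)
    chosen = np.argsort(-counts, kind="stable")[:k]  # k most populated directions
    if counts[chosen].min() == 0:
        raise ValueError("fewer than k non-empty candidate directions")
    # Seed = mean of the documents attached to each chosen direction.
    seeds = np.array([X[owner == c].mean(axis=0) for c in chosen])
    # Keywords = indices of the terms with the largest loadings on that direction.
    keywords = [np.argsort(-directions[c])[:n_keywords].tolist() for c in chosen]
    return seeds, keywords


def kmeans(X, seeds, n_iter=20):
    """Plain Lloyd iterations started from the given seeds."""
    centers = seeds.copy()
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers


# Toy corpus (made up): rows are documents, columns are term counts.
# The first three documents use terms 0-1, the last three use terms 2-3.
X = np.array(
    [[3, 2, 0, 0], [2, 3, 0, 0], [3, 3, 0, 0],
     [0, 0, 2, 3], [0, 0, 3, 2], [0, 0, 3, 3]],
    dtype=float,
)

seeds, keywords = pca_seeds(X, k=2)
labels, centers = kmeans(X, seeds)
print(labels, keywords)
```

On this toy corpus the two seeds fall on the two groups of documents, and each seed comes with the pair of term indices that characterizes it. A real text-mining pipeline would start from a weighted, sparse term matrix and would need a more careful treatment of the components than this sketch applies.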

[Full text]
