ISSN 0236-235X (P)
ISSN 2311-2735 (E)

Next issue

1
Publication date:
16 March 2021
Journal articles №4 2020

21. Using statistical indexes to distinguish between scientific and popular science texts on the example of the works of A. E. Fersman [No. 4, 2020]
Authors: L.G. Gorbich (glg@cbibl.uran.ru) - Central Scientific Library of the Ural Branch of the Russian Academy of Sciences (Research Associate); A.A. Zhivoderov (csl@cbibl.uran.ru) - Central Scientific Library of the Ural Branch of the Russian Academy of Sciences (Senior Researcher), Ph.D.
Abstract: As information technology and information systems develop, methods for the machine attribution of texts are becoming more relevant. Such methods can be used to search automatically for texts of a required genre and style and to establish authorship by computer. Our methodology rests on the hypothesis that a text has structural features that allow it to be attributed to a certain genre or author without regard to its semantic content, using only the quantitative values of certain parameters and indexes. The authors of this paper, along with other researchers, have been developing such indexes and assembling an optimal set of them for a number of years. In particular, they formed a set of indexes that correctly classifies texts by different authors by genre with a probability of up to 86 %. To solve the problem of automatically classifying scientific and popular science texts, the authors applied and improved a set of statistical indexes that they had previously developed for attributing other styles. The research material was the works of academician A.E. Fersman. A distinctive feature of this author is his duality of style: he wrote a large number of both scientific and popular science texts, which gave a unique opportunity to attempt automatic classification of text styles within the work of a single author. The study showed that the sample means of the statistical indexes differ significantly between texts of the two styles. Using discriminant analysis, logistic regression, and ROC curves, the authors demonstrated that texts of the two styles can be classified automatically and, by optimizing the set of indexes used, achieved a significant improvement in classification quality.
A new statistical index is also proposed that minimizes computational costs and classifies scientific and popular science texts successfully (with up to 100 % accuracy), even when it is used as the only factor. The results of the study were checked on texts by other authors.
Keywords: text style, automatic text classification, statistical index, discriminant analysis, logistic regression, ROC curve
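
The abstract mentions classifying texts with a single statistical index and judging the result with ROC curves. The following is only a minimal sketch of that general idea, not the authors' method: the index values are made-up toy numbers (not data from the paper), the labels 1 = scientific and 0 = popular science are an assumed encoding, and the function names roc_auc and best_threshold are hypothetical. It computes the area under the ROC curve for one index and picks the cut-off that gives the highest accuracy on the toy sample.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve for a single index.

    Computed as the share of (positive, negative) pairs in which the
    positive text has the higher index value; ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def best_threshold(scores, labels):
    """Return (threshold, accuracy) for the rule 'index >= threshold -> class 1'.

    Every observed index value is tried as a cut-off; the first one that
    reaches the highest accuracy is kept.
    """
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        correct = sum((s >= t) == (y == 1) for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Toy, made-up index values (NOT taken from the paper).
# Assumed encoding: 1 = scientific text, 0 = popular science text.
toy_index = [0.62, 0.58, 0.71, 0.66, 0.31, 0.40, 0.35, 0.60]
toy_label = [1, 1, 1, 1, 0, 0, 0, 0]

auc = roc_auc(toy_index, toy_label)
threshold, accuracy = best_threshold(toy_index, toy_label)
print(f"AUC = {auc}, threshold = {threshold}, accuracy = {accuracy}")
```

A threshold chosen and scored on the same sample overstates accuracy; checking it on held-out texts, as the abstract reports doing with texts by other authors, is the more reliable test.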