Using statistical indexes to distinguish between scientific and popular science texts on the example of the works of A. E. Fersman
Date of submission: 29.09.2020
UDC: 81-139+81-112
The article was published in issue no. 4, 2020 [pp. 720-725]
Abstract: As information technology and information systems have developed, methods for machine attribution of texts have become increasingly relevant. Such methods can be used to search automatically for texts of a required genre and style and to establish authorship by computational means.
The methodology rests on the hypothesis that a text has structural features that allow it to be attributed to a particular genre or author without regard to its semantic content, based solely on the quantitative values of certain parameters and indexes. The authors, along with other researchers, have spent a number of years developing such indexes and assembling an optimal set of them, with some success. In particular, they formed a set of indexes that correctly classifies texts of different authors by genre with a probability of up to 86 %.
To classify scientific and popular science texts automatically, the authors applied and refined a set of statistical indexes that they had previously developed for attributing other styles. The research material was the works of Academician A.E. Fersman. A distinctive feature of this author is stylistic duality: he wrote a large number of both scientific and popular science texts, which offered a unique opportunity to attempt automatic classification of text styles within the work of a single author. The study showed that the sample means of the statistical indexes differ significantly between texts of the two styles. Using discriminant analysis, logistic regression, and ROC curves, the authors demonstrated that texts of the two styles can be classified automatically and, by optimizing the set of indexes used, substantially improved the quality of classification. They also propose a new statistical index that minimizes computational cost and successfully classifies scientific and popular science texts (with up to 100 % accuracy), even when it is used as the only factor. The results were checked on texts by other authors.
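The single-index classification described in the abstract can be illustrated with a minimal sketch. The abstract does not define the authors' indexes, so the index used here (mean sentence length in words) is only a hypothetical stand-in, and the sample strings are invented toy texts, not excerpts from Fersman's works. The sketch computes the index for each text, estimates the area under the ROC curve from pairwise comparisons, and picks an accuracy-maximizing cut-off.

```python
import re


def mean_sentence_length(text: str) -> float:
    """Hypothetical index: average number of words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = sum(len(s.split()) for s in sentences)
    return words / len(sentences)


def roc_auc(pos: list, neg: list) -> float:
    """Area under the ROC curve as P(pos score > neg score); ties count 0.5."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def best_threshold(pos: list, neg: list) -> tuple:
    """Cut-off t maximizing accuracy of the rule 'score >= t means positive'."""
    total = len(pos) + len(neg)
    best_t, best_acc = None, -1.0
    for t in sorted(set(pos) | set(neg)):
        correct = sum(p >= t for p in pos) + sum(n < t for n in neg)
        acc = correct / total
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Invented toy texts (assumption: scientific prose uses longer sentences).
scientific = [
    "The composition of the vein was determined by chemical analysis of "
    "samples collected at several depths along the northern slope.",
    "The crystals occur in association with quartz and feldspar, which "
    "indicates a late stage of the crystallization process.",
]
popular = [
    "Look at this stone. It shines in the sun. Why is it green?",
    "Stones have stories. Some are very old. Let us read them.",
]

sci_scores = [mean_sentence_length(t) for t in scientific]
pop_scores = [mean_sentence_length(t) for t in popular]

print("AUC:", roc_auc(sci_scores, pop_scores))
print("threshold, accuracy:", best_threshold(sci_scores, pop_scores))
```

With a real corpus, the same per-text index values could instead be passed to a library implementation of logistic regression and ROC analysis; the pairwise AUC above is used only to keep the sketch free of dependencies.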
Authors: L.G. Gorbich (glg@cbibl.uran.ru), Central Scientific Library of the Ural Branch of the Russian Academy of Sciences (Research Associate), Ekaterinburg, Russia; A.A. Zhivoderov (csl@cbibl.uran.ru), Central Scientific Library of the Ural Branch of the Russian Academy of Sciences (Senior Researcher), Ekaterinburg, Russia, Ph.D.
Keywords: text style, automatic text classification, statistical index, discriminant analysis, logistic regression, ROC curve
DOI: 10.15827/0236-235X.132.720-725
Permanent link: http://swsys.ru/index.php?page=article&id=4770&lang=&lang=&like=1&lang=en
