Sexism detection in the Russian language.
This blog entry describes the progress of a project carried out as part of Google Summer of Code (GSoC) 2019.
The project repository can be found here, and a description of its contents is given further down.
N.B.: We decided against the four-page arXiv publication template so that the whole process could be documented in full. References are given at the end of each subtopic.
While working on the problem of “hate speech”, we put together two compilations, both of which may be useful for further research in this area. The first is an attempt to systematize research on hate speech. This file will be updated.
The second is a compilation of known open-source hate speech corpora; it lists more than ten of them. There are no plans to update this file.
Part one. Creating a sexism corpus and annotating it.
1.1. The definition of “hate speech” and the difficulties of the task. (How to define “hate speech”; common problems with annotating hate speech)
1.2. The process of collecting the corpora. (Sources of our data, comments on our data collection code, description of the resulting corpora)
1.3. The problems with the annotation of our corpora. (Linguistic problems, general limitations)
Part two. Creating a sexism detector.
2.1. Preprocessing of the corpora. Attempted models and results. (tf-idf + NB, tf-idf + Logistic Regression, fine-tuned ELMo, fine-tuned + pretrained ELMo)