
Sexism detection in messages from Russian forums and social networks.

A report on the GSoC'19 project

1.3. The process of collecting corpora.

1.3.1. Sarcasm.

One of the main problems was distinguishing between sarcasm and seriously meant sexist speech. Example:

“Aren, такую хрень несешь. Еще скажи, что мужик полигамен и должен осеменить больше самок, потому что так заложено природой.”

Translation:

“Aren, you’re saying some bullshit. What next, are you going to say that all men are polygamous and should inseminate more females, because it’s in their nature?”

The comment is not sexist by itself, but it contains sexist vocabulary. Here the problem could be addressed with word co-occurrence: the word “bullshit” could serve as a signal not to take the message literally.
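A minimal sketch of such a co-occurrence check is shown below. The marker lexicons, the tokenization, and the function name are assumptions made purely for the illustration, not the vocabulary or code actually used in the project.

```python
# A minimal sketch of the co-occurrence idea: flag messages where sexist
# vocabulary co-occurs with a dismissive marker like "хрень" ("bullshit"),
# so they are treated as candidates for sarcasm rather than literal sexist claims.
# Both word lists are hypothetical and only serve the example.

SEXIST_MARKERS = {"полигамен", "осеменить", "самок"}  # hypothetical lexicon
DISMISSIVE_MARKERS = {"хрень", "бред", "чушь"}        # hypothetical lexicon

def looks_sarcastic(message: str) -> bool:
    """Return True when sexist vocabulary co-occurs with a dismissive word."""
    tokens = {t.strip('.,!?"«»').lower() for t in message.split()}
    return bool(tokens & SEXIST_MARKERS) and bool(tokens & DISMISSIVE_MARKERS)

msg = ("Aren, такую хрень несешь. Еще скажи, что мужик полигамен и должен "
       "осеменить больше самок, потому что так заложено природой.")
print(looks_sarcastic(msg))  # True: sexist wording plus the dismissive "хрень"
```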

But another example proves even more complicated. This message is a reaction to the news that fewer female football fans will be shown on TV during matches, to avoid objectification:

“Извините, но это какой пиздц. Может вообще женщинам запретим матчи посещать? Ну чтоб прям наверняка? Можно и в паранджу всех закутать. Это вообще 100% вариант.”

Translation:

“I’m sorry, but that’s fucked up. Maybe we should ban women from attending matches at all then? Just to be sure? You can also dress them all in burqas. That should work 100%.”

Here the whole message is sarcastic, yet it contains no dismissive marker like “bullshit”: taken literally, it reads as a genuinely sexist proposal, and only the context of the news item reveals the irony, so a simple co-occurrence signal is not enough.

1.3.2. Special slang.

The forum that was used as the main source of sexist speech has very specific slang, not typical of the Russian language overall. This could be problematic, because the system should not be limited to a specific small group of people. To compensate for this, non-sexist topics from the same forum (discussions about phones and attempts to agree on a date for a real-life meeting) were also collected, to ensure that there would be enough non-sexist material written in the same manner of speech.
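A sketch of how such a corpus could be assembled is shown below; the file names, the CSV layout, and the column names are assumptions made for the illustration, not the project's actual pipeline.

```python
# A sketch of assembling the corpus so that the non-sexist class comes from
# the same forum and therefore shares its slang.
# File names and the CSV columns are assumptions made for the illustration.
import pandas as pd

sexist = pd.read_csv("forum_sexist_threads.csv")      # sexist-speech threads
sexist["label"] = 1

# Non-sexist threads from the same forum (phone discussions, meet-up planning),
# so the negative class is written in the same manner of speech.
neutral = pd.read_csv("forum_neutral_threads.csv")
neutral["label"] = 0

corpus = pd.concat([sexist, neutral], ignore_index=True)
print(corpus["label"].value_counts())  # inspect the class balance before training
```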

1.3.3. Single annotator.

Of course, considering that we had a single annotator, the labeling process was far from perfect and quite subjective. Even though the definition and guidelines were discussed in detail with my mentor, some cases still posed a challenge, and possibly imperfect solutions were picked in the end (more about this in the guidelines part). That is also the reason why I cannot recommend the data for immediate use right now. It still needs to be reviewed by several other annotators, preferably ones familiar with sexism at the expert level (see 1. The definition of “hate speech” and problems and difficulties of the task).
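Once a second annotator has labeled the same messages, agreement could be checked with a standard measure such as Cohen's kappa. The sketch below uses scikit-learn and made-up label lists purely for illustration; no such second annotation exists yet.

```python
# A sketch of how agreement with a future second annotator could be measured
# with Cohen's kappa; the label lists here are made up for illustration.
from sklearn.metrics import cohen_kappa_score

annotator_1 = [1, 0, 1, 1, 0, 0, 1]  # existing labels (1 = sexist, 0 = not sexist)
annotator_2 = [1, 0, 0, 1, 0, 1, 1]  # hypothetical second annotator

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")  # values close to 1.0 indicate strong agreement
```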