Twitter is bad at identifying offensive tweets.

All the way down, at the bottom of the list of replies to a tweet, you might see an ominous warning like this one.

If you're familiar with the quality of the typical Twitter reply, you will wonder what higher class of horror lurks beyond the "Show" button. Well, rest easy. The potentially offensive tweets are more or less indistinguishable from normal tweets. I can prove it, but even better, you can prove it.

Offensive Quiz!

In the quiz below you'll see a "parent" tweet and two replies to that tweet. One of these replies has been flagged by Twitter as potentially offensive. The other has not. Your task is to select which of the two tweets you find most offensive. Will you agree with Twitter? The pictures in the tweets show up as links by default. Follow these links at your own peril. No cheating!




Now, I know, it's easy (and fun) to criticize from afar. How do I know that identifying offensive tweets isn't just, like, super hard? Could I do better? Well, let's find out.

This next quiz is the same, but different. Your task is the same: try to identify the offensive tweet. What I've done is different, though. Rather than let Twitter tell me which tweets are offensive, I applied my own function to flag them. Will you agree with my algorithm more than you agree with Twitter's?



Who        Twitter-Count   Twitter-Accuracy   inteoryx-Count   inteoryx-Accuracy
inteoryx   100             48%                100              95%

When I started this investigation I assumed I'd have to think about measures like sensitivity, specificity, and the general prevalence of offensive tweets. My quiz results show that more sophisticated analysis isn't really warranted. As far as I can tell, Twitter seems to be identifying tweets as offensive at random. They may not be literally using a random number generator, but the product of their efforts, whatever those efforts are, isn't much better than if they were using a random number generator.
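To make the "random number generator" baseline concrete, here is a quick simulation (my own illustration, not part of the quiz data): a labeler that flags one reply of each pair at random will agree with any quiz-taker about 50% of the time, no matter how obvious the truly offensive reply is.

```python
import random

random.seed(0)

def random_labeler(pair):
    # Flags one of the two replies at random, ignoring content entirely.
    return random.choice([0, 1])

def quiz_agreement(labeler, truly_offensive_index, trials=100_000):
    # The quiz-taker always picks the genuinely offensive reply;
    # agreement is how often the labeler flagged that same reply.
    hits = sum(labeler(("reply_a", "reply_b")) == truly_offensive_index
               for _ in range(trials))
    return hits / trials

rate = quiz_agreement(random_labeler, truly_offensive_index=0)
print(f"agreement with a random labeler: {rate:.1%}")  # hovers around 50%
```

That 50% is indistinguishable, statistically, from the 48% agreement with Twitter's labels in the table above.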

By contrast, my effort at identifying offensive tweets was much more successful. Now, I know what you're thinking: since I am the author of my own algorithm, it is not surprising that I recognize the output of my own work. That's a fair point! If you look at the table at the top of this section, the "Internet" row shows the aggregated results of everyone who has taken the quiz. I expect the wisdom of the crowd to anoint my algorithm as superior. That is, the people of the Internet will agree with my algorithm about which tweet is offensive more often than they agree with Twitter's.

The most mind boggling quiz question I saw was this one:

Yes, one of these two options really is marked as potentially offensive by Twitter. It's the top one, "Shukriya sista". These are two almost identical tweets, by the same person, and yet Twitter considers one offensive and the other not. By the way, "shukriya" means "thank you".

Twitter really is bad at identifying offensive tweets. Here are three reasons why this matters.

  1. Benign comments are unfairly hidden. Messages that deserve to be seen are not.
  2. Genuinely offensive comments are not hidden.
  3. Twitter is launching a feature to ask tweeters to review potentially offensive tweets before they send them. My research indicates that Twitter will do a poor job with this feature because they don't seem that good at offensive tweet identification.


Offensive Tweet Quiz

To gather tweets I sampled Twitter's "decahose" API. The decahose provides the caller a stream of "real time" tweets - i.e. tweets that have just been tweeted. I searched the decahose for tweets that were replies to other tweets and tracked these "other tweets" as "tweets with replies".
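The reply-tracking step can be sketched roughly as follows. The `in_reply_to_status_id_str` field name follows Twitter's v1.1 tweet object; the tiny sample list below stands in for the real decahose stream, and the function name is my own invention:

```python
def collect_parents_with_replies(tweet_stream):
    """Walk a stream of decoded tweet dicts and record, for each parent
    tweet ID, the IDs of replies seen.  Any tweet carrying a non-null
    in_reply_to_status_id_str is a reply to that parent."""
    replies_by_parent = {}
    for tweet in tweet_stream:
        parent_id = tweet.get("in_reply_to_status_id_str")
        if parent_id:  # this tweet is a reply
            replies_by_parent.setdefault(parent_id, []).append(tweet["id_str"])
    return replies_by_parent

# Illustrative stand-in for the real decahose stream:
sample = [
    {"id_str": "1", "in_reply_to_status_id_str": None},
    {"id_str": "2", "in_reply_to_status_id_str": "1"},
    {"id_str": "3", "in_reply_to_status_id_str": "1"},
]
print(collect_parents_with_replies(sample))  # {'1': ['2', '3']}
```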

I then went through the list of tweets with replies and found which of them had both offensive and inoffensive replies. Twitter's API does not designate potentially offensive tweets, as far as I can tell, so I used Selenium to visit each tweet in the browser, as a user might, and check whether it showed an "offensive" warning of the kind shown above. I then created what I call a Twitter Replies Collection, which includes the parent tweet (i.e. the tweet with replies), a randomly chosen reply that was not marked offensive, and a randomly chosen reply that was hidden behind the offensive warning banner. The Twitter Replies Collections form the quiz questions that you vote on.
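The collection step above can be sketched like this (a minimal illustration with hypothetical names; the Selenium banner check is abstracted into a boolean `flagged` already attached to each reply):

```python
import random

def make_replies_collection(parent_id, replies):
    """replies: list of (tweet_id, flagged) pairs, where `flagged` records
    whether the reply sat behind the 'offensive' warning when visited.
    Returns a quiz question only if both kinds of reply exist."""
    flagged = [tid for tid, f in replies if f]
    unflagged = [tid for tid, f in replies if not f]
    if not flagged or not unflagged:
        return None  # this parent can't form a quiz question
    return {
        "parent": parent_id,
        "offensive": random.choice(flagged),
        "inoffensive": random.choice(unflagged),
    }

question = make_replies_collection("parent_1", [("a", True), ("b", False)])
print(question)  # {'parent': 'parent_1', 'offensive': 'a', 'inoffensive': 'b'}
```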

My Offensive Indicator

Machine learning? GPT-3? NLP? No. I found an offensive-words list from the fine people at this project and considered a tweet offensive if it contained any of the highest-offense words from the English list. (Note: I didn't read the license on that GitHub project; if it doesn't permit my usage as described, then this paragraph is a joke and I really did something else.)
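The whole indicator fits in a few lines. This is a sketch with hypothetical function names and harmless placeholder words standing in for the real high-offense list:

```python
import re

def load_offense_words(path):
    # One word per line; the real list comes from the project linked above.
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def is_offensive(tweet_text, offense_words):
    # A tweet is "offensive" iff any token matches a high-offense word.
    tokens = re.findall(r"[a-z']+", tweet_text.lower())
    return any(tok in offense_words for tok in tokens)

# Placeholder words stand in for the real list:
words = {"badword", "awfulword"}
print(is_offensive("what a lovely day", words))     # False
print(is_offensive("you absolute badword", words))  # True
```

A set lookup per token keeps this linear in tweet length, so it runs comfortably over a decahose sample.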

I then ran my own offensive indicator over the same set of tweets I built the first quiz from. The two quizzes aren't guaranteed to use identical tweets: I only create a quiz question for a parent tweet if it has both offensive and inoffensive replies under a given method, so some parents yield a question for Twitter's method but not for mine, and vice versa.


I looked at both of the tweets on Twitter and neither showed as offensive for me!

It seems Twitter marks different tweets as potentially offensive for different people. My collection and reporting are for the "unauthenticated user". If you open the parent tweet in an incognito browser, or while you aren't signed in to Twitter, you should see one of the two tweets in the main list of replies and the other hidden behind the potentially offensive warning.

Another possibility: some tweets seem to be "vanished" by Twitter. I'm not sure exactly what's going on here, but over time some tweets stop appearing while browsing and are only visible through a direct link. You may experience this as not seeing any offensive tweets, or not seeing the specific offensive tweet you are looking for. In that case, assume the tweet that is no longer visible was the one previously marked as offensive. You'll just have to take my word for it. This seems to happen to 10% to 20% of tweets that were marked offensive. Vanishing also takes some time (minutes or more), because my script could only have found the tweets by browsing in the first place.

One of the tweets doesn't seem to exist when I look for it on Twitter.

The replies can be direct replies to the parent or indirect ones, meaning a reply to a reply to the parent. My reply-gathering script simply scrolls down the list of tweets like a user might, noting as it goes which tweets are flagged offensive and which are not. However, you may notice that sometimes replies-to-replies show up by default and other times they are hidden. Which ones are shown changes over time, possibly related to how much attention the tweets get. Long story short: you may have to dig to find one of the tweets.

Some of these questions are in a foreign language.

So? You can't read every language? Well, do you have access to Google Translate? You might think I'm just being lazy by refusing to filter out non-English tweets, but maybe I'm really helping you grasp the scale of the offensiveness-identification problem. Plus, I think they're fun to see.

Maybe Twitter has other data to look at - e.g. people reporting the tweet, tweeter behavior, etc.

Yeah, probably. I've seen tweets that seem identical, or that are just an emoji, or just a gif, that are sometimes marked as offensive and sometimes not. It seems like there must be some other data going into Twitter's decision or they really are just acting at random. Regardless, I don't think you can escape the conclusion that their results aren't very good. If you can't tell the difference between what they call offensive and what they don't, then what are they really doing?

If you have any additional questions about this, you can let me know by email: inteoryx at protonmail dot com.