Working Around Twitter API Restrictions To Identify Bots

Twitter is by far the easiest social media platform to work with programmatically. The Twitter API provides developers with a clean and simple interface to query Twitter’s objects (Tweets, users, timelines, etc.) and bindings to this API exist for many languages. As an example, I’ve been using Tweepy to write Python scripts that work with Twitter data.

While the API seems powerful at first, developers will inevitably bump into one of the many restrictions imposed on its usage, likely not long after they start using it. And if they’re like me, they’ll probably put a great deal of time and effort into figuring out whether they can circumvent those restrictions. Here are a few that I’ve bumped into:

  • The Twitter API imposes rate limits on every action you can perform. These rate limits vary depending on what it is you’re trying to do. Here’s a table that lists them. Rate limits almost completely destroy one’s ability to create forensic tools that iterate follower/following lists looking for associations between accounts (which can sometimes be useful for mapping out bot networks).
  • The number of results returned by queries is capped. A while back I tried to retrieve lists of followers for Twitter’s top 100 most followed accounts (all of which have millions of followers). The API only let me retrieve the most recent 5000 followers. Likewise, if you want to iterate Tweets published by a specific user, the API will only return about 3,200 items, even if the user’s Tweet history contains more.
  • Searches will only retrieve data from the last 7 days. This prevented me from creating a tool to retrieve the first Tweet that contains a specific string, URL, or hashtag (which would be useful for forensic purposes). In order to see back past 7 days, you’d have to save all Tweets, all the time. And that would be expensive, if it were even possible, but…
  • You can’t listen to a stream of all Tweets that are happening. That stream is referred to as “the firehose”, and Twitter only grants a few customers access to it. You can, however, listen to a stream of 1% of all Tweets. That’s referred to as “the garden hose”. Alternatively, and this is the method I use, you can listen to a stream based on a set of search terms. This is a more targeted approach, and ends up being more useful than listening to a ton of noise in most cases.
  • As I mentioned in a previous post, some objects aren’t returned as you might expect. In the case of a “Quote Tweet”, you can access a static representation of the quoted Tweet, but not the full Tweet object. This prevents a script from iterating through nested quote Tweets without performing multiple queries and, yeah, you guessed it, hitting the rate limit wall.
  • Some information is missing. At the time of querying, I’d like to see how many replies a Tweet has accrued. Not possible. You can retrieve an integer corresponding to the number of times a Tweet has been liked or Retweeted, but you’d have to perform additional queries to get a list of users or Tweets associated with that action.

Regardless of the numerous restrictions, working with the Twitter API is fun. And figuring out how to retrieve the data you need while working under these restrictions is often a nice challenge. That said, I can’t help but feel like I’ll never operate at a level above “hobbyist”. The data that Twitter themselves have access to puts them in a much better position to find patterns associated with bots. For instance, they’ll likely have direct access to data about when accounts followed/unfollowed other accounts, name changes, Tweet deletions, and perhaps even the IP addresses of clients connecting to Twitter. A recent Brian Krebs article alluded to the fact that Twitter do have automation in place to detect bot-like behavior. Twitter’s back end logic appears to, in some cases, take automatic action against bots. It makes sense that they can’t reveal the logic behind their bot-detection algorithms, but you can definitely see it in play. In a recent example, Joseph Cox’s Twitter account was automatically restricted after bots targeted some of his Tweets.

Here’s another example. While writing this article, I pointed a script at the garden hose (1% of all Tweets) and collected some metadata about each Tweet I encountered. That metadata included a count of all hashtags seen in Tweets. Here are the top 10 hashtags my script encountered during the run.

Top 10 hashtags seen from Twitter’s garden hose between 11:00 and 12:00 EEST on 31st August 2017.

Right at the top of that list is the hashtag #izmirescort, a tag used in predominantly Turkish language Tweets to advertise escort services. However, that hashtag doesn’t show up in global trends. During the last several months, every time I’ve run a script against the garden hose, #izmirescort was the top hashtag. So it seems obvious that Twitter has some behind-the-scenes filtering going on to prevent certain hashtags from showing up in trends.
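The hashtag tally described above could be produced along these lines. This is only a sketch, not the script I actually ran: it assumes each Tweet has already been decoded into a dict shaped like the v1.1 Tweet JSON (hashtags under `entities.hashtags[].text`), and the function name `count_hashtags` and the sample data are made up for illustration.

```python
from collections import Counter

def count_hashtags(tweets):
    """Tally hashtags across Tweet dicts (v1.1 JSON shape is an assumption)."""
    counts = Counter()
    for tweet in tweets:
        for tag in tweet.get("entities", {}).get("hashtags", []):
            # Fold case so #Example and #example are counted together.
            counts[tag["text"].lower()] += 1
    return counts

# Made-up sample standing in for Tweets read from a stream.
sample = [
    {"entities": {"hashtags": [{"text": "Example"}, {"text": "bots"}]}},
    {"entities": {"hashtags": [{"text": "example"}]}},
    {"entities": {"hashtags": []}},
]

top = count_hashtags(sample).most_common(10)
for tag, seen in top:
    print(f"#{tag}: {seen}")
```

Folding case is a design choice; drop the `.lower()` call if you want to treat differently cased tags as distinct.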

The Twitter streaming API supplies a Tweet object for every Tweet retrieved from the stream. That object doesn’t just contain information about the Tweet, it also contains information about the user who published the Tweet. Hence, by listening to a stream, a script can harvest information about both Tweets and users. This is one of the best ways of getting around rate limiting. For the cost of one API transaction, you can listen to a stream as long as the connection holds, and gather interesting data. While attached to the garden hose stream, I configured my script to fetch a few pieces of metadata associated with the user who posted each Tweet. By obtaining the account creation date and number of Tweets that account has published, I can calculate an average value for Tweets per day over the lifetime of that account. There are some accounts out there that post a phenomenal number of Tweets per day. Here’s an example.
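As a rough illustration of the Tweets-per-day calculation, here is a minimal sketch. It assumes the user object is a dict carrying the v1.1 fields `created_at` and `statuses_count`; the function name, the one-day floor on account age, and the sample account are my own illustrative choices.

```python
from datetime import datetime, timezone

# Timestamp layout used by v1.1 objects, e.g. "Sat Aug 31 09:00:00 +0000 2013".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def tweets_per_day(user, now=None):
    """Average Tweets per day over the lifetime of an account."""
    created = datetime.strptime(user["created_at"], TWITTER_TIME_FORMAT)
    now = now or datetime.now(timezone.utc)
    # Floor the age at one day so brand-new accounts don't divide by ~zero.
    age_days = max((now - created).total_seconds() / 86400.0, 1.0)
    return user["statuses_count"] / age_days

# Made-up account: created four years (1461 days) before the reference time.
user = {"created_at": "Sat Aug 31 09:00:00 +0000 2013", "statuses_count": 146100}
rate = tweets_per_day(user, now=datetime(2017, 8, 31, 9, 0, tzinfo=timezone.utc))
print(f"{rate:.1f} Tweets per day")
```

Passing `now` explicitly makes the result reproducible; in a live script you would normally leave it out.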

A snapshot of high activity Twitter users obtained from a few minutes of listening to the garden hose stream.

If you listen to a stream for long enough, you’ll observe some accounts Tweeting multiple times. Recording the time interval between Tweets allows you to build up an “interarrival” map. You can also build an interarrival map on an individual user by obtaining previous Tweets from that user and examining the timestamp of each Tweet. Here’s an interarrival map of the last 3200 Tweets from the top listed account above (Love_McD).

0 | 1396
1 | 1249
2 | 341
3 | 99
4 | 28
5 | 6
6 | 1
7 | 1

The above data shows that 1396 Tweets were published less than one second after the previous Tweet, 1249 Tweets one second apart, 341 Tweets 2 seconds apart, and so on. This account literally tweets every few seconds, non-stop.
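An interarrival map like the one above can be built from Tweet timestamps roughly as follows. This is a sketch under assumptions: timestamps are `created_at` strings in the v1.1 layout, gaps are truncated to whole seconds, and `interarrival_map` and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime

TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def interarrival_map(created_at_strings):
    """Count how often each whole-second gap occurs between consecutive Tweets."""
    times = sorted(
        datetime.strptime(stamp, TWITTER_TIME_FORMAT) for stamp in created_at_strings
    )
    gaps = Counter()
    for earlier, later in zip(times, times[1:]):
        gaps[int((later - earlier).total_seconds())] += 1
    return gaps

# Made-up timestamps; real input would come from a user's timeline.
stamps = [
    "Thu Aug 31 08:00:00 +0000 2017",
    "Thu Aug 31 08:00:00 +0000 2017",
    "Thu Aug 31 08:00:01 +0000 2017",
    "Thu Aug 31 08:00:03 +0000 2017",
    "Thu Aug 31 08:00:04 +0000 2017",
]

gaps = interarrival_map(stamps)
for seconds in sorted(gaps):
    print(f"{seconds} | {gaps[seconds]}")
```

The timestamps are sorted first because timelines are usually returned newest-first, and the gap calculation needs them in chronological order.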

By taking the (population) standard deviation of the counts in the second column, you can obtain a floating point number that represents the “machine-like” behavior of that account. Normal accounts tend to have a standard deviation value very close to zero. This account’s value was 549.79.
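The figure quoted above can be reproduced from the eight counts in the interarrival map. The sketch below uses `statistics.pstdev` from the Python standard library, i.e. the population standard deviation; that choice is inferred from the number matching, and the variable names are illustrative.

```python
from statistics import pstdev

# Second column of the interarrival map shown earlier.
interarrival_counts = [1396, 1249, 341, 99, 28, 6, 1, 1]

# Population standard deviation of the counts.
score = pstdev(interarrival_counts)
print(round(score, 2))  # 549.79
```

Note that the score depends on how many Tweets were sampled, so it is only comparable between accounts when the sample sizes are similar.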

Of course, by visiting the above account’s Twitter page, you’ll notice that it’s a verified account belonging to McDonalds Japan. Simple numerical analysis on account activity isn’t enough to determine whether it’s a “bad” bot. And there are plenty of legitimate bots on Twitter.

Some bots attempt to hide their activity, while pushing an agenda, by replying to other users. Anyone with a high-profile enough Twitter account has probably had a random Tweet of theirs replied to by a p0rn bot. Thus, using a script to track the percentage of Tweets from an account that were replies to other Tweets is a useful way of determining suspiciousness.
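Tracking the reply percentage is straightforward once you have a batch of Tweets from an account. A minimal sketch, assuming Tweet dicts where `in_reply_to_status_id` is `None` for non-replies (as in the v1.1 Tweet JSON); `reply_percentage` and the sample timeline are made up for illustration.

```python
def reply_percentage(tweets):
    """Percentage of Tweets in the batch that are replies to other Tweets."""
    tweets = list(tweets)
    if not tweets:
        return 0.0
    replies = sum(
        1 for tweet in tweets if tweet.get("in_reply_to_status_id") is not None
    )
    return 100.0 * replies / len(tweets)

# Made-up timeline: three replies and one original Tweet.
timeline = [
    {"in_reply_to_status_id": 903000000000000001},
    {"in_reply_to_status_id": None},
    {"in_reply_to_status_id": 903000000000000002},
    {"in_reply_to_status_id": 903000000000000003},
]

pct = reply_percentage(timeline)
print(f"{pct:.0f}% of Tweets are replies")
```

What counts as a suspicious percentage is a judgment call; plenty of real people mostly reply to others.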

As Ben Nimmo has pointed out, some Twitter botnets utilize multiple accounts to Retweet and Like specific Tweets in an attempt to modify SEO on those posts. This is the tactic one botnet owner used to attack Joseph Cox’s account. Again, a script can be used to examine Retweet and Like behavior of an individual account, and by listening to a stream, one can build a list of suspicious accounts on-the-fly, as data arrives.

Another rather easy way of finding bots is to look at the “source” field in a Tweet. This field is set by the application that posted the Tweet. If you publish a Tweet from the Twitter app on your iPhone, source will be set to “Twitter for iPhone”, for example. If you are using the Twitter API to publish Tweets, you’ll create a name for the source field when you set up your API keys. Not all bots use the Twitter API to post Tweets, though, since it’s an obvious giveaway. The bots that recently harassed Ben Nimmo and others all report “legitimate” values in their source fields, indicating that the bot master has automated Tweeting from iPhones and web clients. However, since examining the source field of a Tweet is trivial, it’s still a nice way to determine suspiciousness. Here are some interesting looking source fields I picked up, in conjunction with high-volume Tweeters.

Non-standard source fields (rightmost column) from high-volume Tweeters.

Note that IFTTT (If This Then That) is a legitimate service used by many companies and individuals to automate social media activities. Hence it’s also a great place for not so legitimate operations to hide.
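The source-field check described above might look something like this. It is a sketch under assumptions: the raw v1.1 `source` value is an HTML anchor wrapped around the client name, so the tags are stripped first, and the `KNOWN_CLIENTS` allowlist is purely illustrative and far from complete.

```python
import re

# Illustrative allowlist only; a real one would need many more entries.
KNOWN_CLIENTS = {"Twitter for iPhone", "Twitter for Android", "Twitter Web Client"}

def source_name(source_field):
    """Strip the HTML anchor tags wrapped around the client name."""
    return re.sub(r"<[^>]+>", "", source_field).strip()

def has_unusual_source(tweet):
    """True when the posting client is not in the allowlist."""
    return source_name(tweet.get("source", "")) not in KNOWN_CLIENTS

# Made-up Tweets with raw source values.
phone_tweet = {
    "source": '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
}
automated_tweet = {"source": '<a href="https://ifttt.com" rel="nofollow">IFTTT</a>'}

print(source_name(phone_tweet["source"]), has_unusual_source(phone_tweet))
print(source_name(automated_tweet["source"]), has_unusual_source(automated_tweet))
```

An unusual source is only a hint, not a verdict, since legitimate automation services will show up here too.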

As an aside, here’s a breakdown of the languages seen while my script was running (which can provide information about the popularity of Twitter in different regions at that hour of the day).

Language breakdown from the Twitter garden hose stream between 11:00 and 12:00 EEST August 31st 2017. The large brown segment is “ja”, but the legend got cut off (I’m still tweaking my visualization implementation).

The botnet recently examined by Ben Nimmo was evidently being used to promote content in multiple languages. Hence, an examination of the languages used in Tweets published by a single account may help determine suspiciousness. However, that particular form of analysis is somewhat at the whim of Twitter’s own language-determination algorithms. I did a language distribution analysis of my own Tweets and found this.

1: cs
1: de
807: en
3: es
3: fi
2: fr
1: ht
1: in
1: no
2: pt
1: tl
1: tr
36: und

Und means “undetermined”: Twitter couldn’t detect a language for those Tweets. As you can see, Twitter may have incorrectly categorized the language of some of my Tweets. However, en is overwhelmingly represented. Accounts that have been used to push content in multiple languages may have double-digit percentage values for multiple languages, indicating that the account holder is either multilingual, or that the account is automated. Of course, Retweets should be factored into this calculation.
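To make the “double-digit percentage” idea concrete, here is a small sketch that turns the counts from my own Tweets into percentage shares and lists the languages at or above a threshold. The function names and the 10% default threshold are illustrative assumptions, not a tested heuristic.

```python
def language_shares(counts):
    """Convert per-language Tweet counts into percentage shares."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: 100.0 * seen / total for lang, seen in counts.items()}

def languages_over(counts, threshold=10.0):
    """Languages whose share of Tweets is at or above the threshold (percent)."""
    shares = language_shares(counts)
    return sorted(lang for lang, share in shares.items() if share >= threshold)

# Counts from the language distribution listed above.
my_counts = {
    "cs": 1, "de": 1, "en": 807, "es": 3, "fi": 3, "fr": 2, "ht": 1,
    "in": 1, "no": 1, "pt": 2, "tl": 1, "tr": 1, "und": 36,
}

shares = language_shares(my_counts)
print(f"en: {shares['en']:.1f}%")
print(languages_over(my_counts))
```

With my counts only en clears the threshold; an account pushing content in several languages would return more than one entry.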

The use of scripts to analyze Twitter data opens up many ways to search for suspicious activity. This post has touched upon some of the simplest techniques one can use to build Twitter bot analysis scripts. I’ll cover some more complex techniques in future posts.


