During the lead-up to the January 2018 Finnish presidential election, I collected a dataset of raw Tweets matching search terms related to the election. I then ran a series of natural language processing experiments on this raw data. The methodology, including all the code used, can be found in an accompanying blog post. This article details the results of my experiments and shows some of the visualizations generated.
I pre-processed the raw dataset, used it to train a word2vec model, and then analyzed that model using word2vec.wv.most_similar(), t-SNE, and TensorBoard.
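For readers unfamiliar with most_similar(): it ranks vocabulary entries by the cosine similarity between their vectors and the query word's vector. The following is a minimal pure-Python sketch of that idea, not the code used for this article. The three-dimensional vectors and their values are invented for illustration; a real word2vec model typically uses a hundred or more dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(vectors, query, topn=10):
    """Rank every token except `query` by cosine similarity to it."""
    q = vectors[query]
    scored = [(tok, cosine(q, vec)) for tok, vec in vectors.items() if tok != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors, invented for illustration only (not taken from the real model).
toy_vectors = {
    "#turpo": [0.9, 0.1, 0.0],
    "nato": [0.8, 0.2, 0.1],
    "#ulpo": [0.7, 0.3, 0.0],
    "lumi": [0.0, 0.1, 0.9],
}

neighbours = most_similar(toy_vectors, "#turpo", topn=2)
for token, score in neighbours:
    print(f"{token}\t{score:.3f}")
```

In this toy example the two security-related tokens rank above lumi (snow), because their vectors point in nearly the same direction as the query.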
My first experiment involved creating scatterplots of words found to be similar to frequently encountered tokens within the Twitter data. For each of the 50 most frequent tokens, I retrieved the most similar words and used t-SNE to reduce the resulting set of vectors to two dimensions. Results were plotted using matplotlib. Here are a few examples of the output generated.
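As a rough illustration of the dimensionality reduction step, here is a sketch using scikit-learn's TSNE class. It is not the code from the accompanying post: the random vectors stand in for real word2vec vectors, and the vector size, sample count, and perplexity are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in data: 20 random 100-dimensional vectors in place of the real
# word2vec vectors for one token's nearest neighbours (assumed shapes).
rng = np.random.default_rng(0)
labels = [f"token_{i}" for i in range(20)]
vectors = rng.normal(size=(len(labels), 100))

# perplexity must be smaller than the number of samples being embedded.
tsne = TSNE(n_components=2, perplexity=5, init="pca", random_state=0)
coords = tsne.fit_transform(vectors)

# Each row of coords is an (x, y) position for the matching label.
for label, (x, y) in zip(labels[:3], coords[:3]):
    print(f"{label}: ({x:.2f}, {y:.2f})")
```

The resulting coordinates can be passed to a matplotlib scatter plot, with each point annotated by its label. Bear in mind that t-SNE preserves local neighbourhoods rather than global distances, so the spacing between far-apart groups should not be over-interpreted.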
Here you can see that word2vec easily identified other hashtags related to the #laura2018 campaign, including #suomitakaisin, #suomitakas, #siksilaura and #siksips. Laura Huhtasaari was candidate number 5 on the voting slip, and that was also identified, along with other hashtags associated with her name.
Here’s an analysis of the hashtag #turpo (short for turvallisuuspolitiikka – security policy). Here you can see that word2vec identified many references to NATO (one issue that was touched upon during election campaigning), jäsenyys (membership), #ulpo – ulkopolitiikka (foreign policy), and references to regions and countries (venäjä – Russia, ruotsi – Sweden, itämeri – the Baltic Sea).
On a similar note, here’s a scatterplot of words similar to venäjä (Russia). As expected, word2vec identified NATO as closely related. Names of countries are expected to register as similar in word2vec, and we see Ruotsi (Sweden), Ukraine, USA, Turkki (Turkey), Syria, and Kiina (China). Word2vec also finds the word Putin to be similar, and interestingly, Neuvostoliitto (USSR) was mentioned in the Twitter data.
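Above is a scatterplot based on the word “presidentti” (president). Note how word2vec identified Halonen, Urho, Kekkonen, Donald, and Trump, tokens from the names of former Finnish presidents Tarja Halonen and Urho Kekkonen and of US president Donald Trump.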
Moving on, I took the names of the eight presidential candidates in Sunday’s election and plotted each of them, along with the 40 words word2vec ranked as most similar, on scatterplots of the entire vocabulary. Here are the results.
As you can see above, all of the candidates occupied separate spaces on the graph, and there was very little overlap amongst words similar to each candidate’s name.
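The observation above is a visual one. One hypothetical way to put a number on it, which was not part of the original experiments, is to compute the Jaccard overlap between the neighbour lists of each pair of candidates. The candidate names and word lists below are placeholders, not real output.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Placeholder neighbour lists, invented for illustration only.
neighbours = {
    "candidate_a": ["w1", "w2", "w3", "w4"],
    "candidate_b": ["w4", "w5", "w6", "w7"],
    "candidate_c": ["w8", "w9", "w10", "w11"],
}

overlaps = {
    (x, y): jaccard(neighbours[x], neighbours[y])
    for x, y in combinations(sorted(neighbours), 2)
}

for pair, score in overlaps.items():
    print(pair, round(score, 3))
```

A score of 0 means two candidates share no similar words, and a score of 1 means their lists are identical.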
I created word embeddings using TensorFlow, and opened the resulting log files in TensorBoard in order to produce some visualizations with that tool. Here are some of the outputs.
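The TensorFlow logging code itself is in the accompanying post. As a dependency-free alternative, and not the approach used here, the standalone Embedding Projector can load two tab-separated files: one with a vector per line and one with the matching label per line. The sketch below writes such files; the function name, file names, and toy vectors are assumptions made for illustration.

```python
import os
import tempfile


def write_projector_files(vectors, out_dir):
    """Write vectors.tsv and metadata.tsv with rows in matching order."""
    vec_path = os.path.join(out_dir, "vectors.tsv")
    meta_path = os.path.join(out_dir, "metadata.tsv")
    with open(vec_path, "w", encoding="utf-8") as vec_file, \
            open(meta_path, "w", encoding="utf-8") as meta_file:
        for token, vec in vectors.items():
            # One vector per line, components separated by tabs.
            vec_file.write("\t".join(f"{v:.6f}" for v in vec) + "\n")
            # One label per line; a single-column file has no header row.
            meta_file.write(token + "\n")
    return vec_path, meta_path


# Toy vectors, invented for illustration only.
toy = {
    "#haavisto2018": [0.10, 0.20, 0.30],
    "tavastia": [0.10, 0.25, 0.28],
    "liput": [0.00, 0.30, 0.20],
}

out_dir = tempfile.mkdtemp()
vec_path, meta_path = write_projector_files(toy, out_dir)
print(vec_path)
print(meta_path)
```

The only requirement is that line N of the metadata file labels line N of the vectors file, which is why both files are written in the same loop.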
The above shows word vectors in close proximity to #haavisto2018, based on the embeddings I created (from the word2vec model). Here you can find references to Tavastia, a club in Helsinki where Pekka Haavisto’s campaign hosted an event on 20th January 2018. Words clearly associated with this event include liput (tickets), ilta (evening), livenä (live), and biisejä (songs). The event was called “Siksipekka”. Here’s a view of that hashtag.
Again, we see similar words, including konsertti (concert). Another nearby word vector identified was #vihreät (the Green party).
In my last experiment, I compiled a list of similar words for each of the 50 most frequent words found in the Twitter data, and recorded the associations between the lists generated. I imported this data into Gephi and generated some graphs with it.
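Gephi can import a CSV edge list with Source and Target columns. The sketch below shows one way the similar-word lists could be turned into such a file. Exactly how the associations were recorded for this article is covered in the accompanying post; treating each (word, similar word) pair as an undirected edge is an assumption, and the input data here is invented.

```python
import csv
import io


def build_edges(similar):
    """Turn {word: [similar words]} into sorted, de-duplicated undirected edges."""
    edges = set()
    for source, targets in similar.items():
        for target in targets:
            if source != target:
                # Sort each pair so (a, b) and (b, a) count as the same edge.
                edges.add(tuple(sorted((source, target))))
    return sorted(edges)


def edges_to_csv(edges):
    """Serialize edges as CSV text with a Source,Target header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
    return buf.getvalue()


# Toy similar-word lists, invented for illustration only.
toy_similar = {
    "#turpo": ["nato", "#ulpo"],
    "nato": ["#turpo", "jäsenyys"],
}

edges = build_edges(toy_similar)
csv_text = edges_to_csv(edges)
print(csv_text)
```

Writing csv_text to a UTF-8 encoded file gives something that can be loaded as an edges table, with Gephi creating the nodes from the names it finds.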
I got interested in Gephi after recently collaborating with Erin Gallagher (@3r1nG) to visualize the data I collected on some bots found to be following Twitter accounts recommended to Finnish users. I highly recommend that you check out some of her other blog posts, where you’ll see some amazing visualizations. Gephi is a powerful tool, but it takes quite some time to master. As you’ll see, my attempts at using it pale in comparison to what Erin can do.
The above is a graph of all the words found. Larger circles indicate that a word has more other words associated with it.
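In graph terms, the number of other words connected to a word is its degree. As a small sketch of what that sizing is based on, the degree of each node can be counted directly from a list of edges. The edges below are invented for illustration.

```python
from collections import Counter


def degree_counts(edge_list):
    """Count how many edges touch each node in an undirected edge list."""
    degrees = Counter()
    for a, b in edge_list:
        degrees[a] += 1
        degrees[b] += 1
    return degrees


# Toy edges, invented for illustration only.
toy_edges = [
    ("#turpo", "nato"),
    ("#turpo", "#ulpo"),
    ("nato", "venäjä"),
    ("nato", "jäsenyys"),
]

degrees = degree_counts(toy_edges)
for word, degree in degrees.most_common(2):
    print(word, degree)
```

Here nato touches three edges and #turpo two, so those two nodes would be drawn as the largest circles.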
Here’s a zoom-in on some of the candidates. Note that I treated hashtags as unique words, which turned out to be useful for this analysis. For reference, here are a few translations: äänestää = to vote, vaalit = elections, puhuu = speaks, presidenttiehdokas = presidential candidate.
Here is a zoomed-in view of the words associated with foreign policy and national security.
Finally, here are some words associated with #suomi (Finland). Note the many references to nature (luonto), winter (talvi), and snow (lumi).
As you might have gathered, word2vec finds interesting and fairly accurate associations between words, even in messy data such as Tweets. I plan on delving further into this area in hopes of finding some techniques that might improve the Twitter research I’ve been doing. The dataset collected during the Finnish elections was fairly small (under 150,000 Tweets). Many of the other datasets I work with are orders of magnitude larger. Hence I’m particularly interested in figuring out if there’s a way to accurately cluster Twitter data using these techniques.