Data Mining The State Of The Internet

Every year, Aalto University, here in Helsinki, runs a software project course for second- and third-year bachelor students. The idea of the course is to give undergrads a taste of real software project work with real goals, real customers, and real deadlines. The course kicks off in early autumn, when students assemble into teams and choose from a list of projects submitted by local companies. Project work starts in late October and runs through to mid-April of the following year. At the end of the course, the three best projects are chosen, and those three teams battle it out in a final demo showdown, leaving one standing as the victor. This year, the team sponsored by F-Secure took home the winner’s trophy, which was awarded on Friday 19th April.

The winning team

Our team from Aalto celebrating their win.

This time around, 16 teams were formed, and 42 project proposals were submitted. Our own proposal, entitled “Data Mining The State Of The Internet”, was submitted by Ville Lindfors, head of F-Secure’s Security Cloud team. Despite the fierce competition, one of the student teams selected our project.

The goal of the project was to develop a data mining framework using AWS services. This framework was to be used by F-Secure to analyze connections, correlations, and dependencies between Internet domains based on Whois data, using large-scale data-mining technologies. Examples of what we wanted to accomplish with this project included:

  • Finding domains that share name servers with known malicious domains
  • Detection of unknown harmful Internet sites
  • Ability to automatically predict the reputation of new domain registrations
  • Discovery of potentially unknown phishing sites
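
To make the first of those goals a bit more concrete, here is a minimal sketch of the idea in plain Python. It is not the team's implementation (which ran on Spark and AWS), and everything in it is an assumption for illustration: the function name, the record layout (a dict mapping each domain to its name servers, as might be parsed from Whois data), and the made-up `.example` domains.

```python
from collections import defaultdict


def domains_sharing_name_servers(whois_records, known_malicious):
    """Map each not-yet-flagged domain to the name servers it shares
    with at least one known malicious domain.

    whois_records: dict of domain -> list of name servers (hypothetical layout)
    known_malicious: set of domains already known to be bad
    """
    # Build an index from name server to every domain that uses it.
    by_name_server = defaultdict(set)
    for domain, name_servers in whois_records.items():
        for ns in name_servers:
            by_name_server[ns.lower()].add(domain)

    # Walk the name servers of each known-bad domain and collect neighbours.
    suspects = defaultdict(set)
    for bad_domain in known_malicious:
        for ns in whois_records.get(bad_domain, ()):
            for neighbour in by_name_server[ns.lower()]:
                if neighbour not in known_malicious:
                    suspects[neighbour].add(ns.lower())
    return dict(suspects)


# Made-up sample data, for illustration only.
records = {
    "bad.example": ["ns1.shady.example", "ns2.shady.example"],
    "shop.example": ["NS1.shady.example"],
    "blog.example": ["ns1.hosting.example"],
}
result = domains_sharing_name_servers(records, {"bad.example"})
print(result)
```

Sharing a name server is only a hint, of course. Large hosting providers serve huge numbers of unrelated domains, so a real system would presumably weigh this signal against others rather than treat it as a verdict.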
Data Science

This is what we ended up building.

The team working on our project included seven participants, one of whom works here in Labs. We also had fellows facilitating and guiding the team throughout the project. Here is a list of the fellows who helped out in one way or another:

  • Jouni Kuusisto was a member of the team as part of his MSc studies at Aalto. He also took the role of scrum master.
  • Perttu Ranta-Aho assisted the team as a technical expert and product owner.
  • Jukka Haapala functioned as a product owner and facilitator.
  • Christine Bejerasco provided use cases that were used to build the framework.
  • Jaakko Harjuhahto coached the team for a second year in a row.
Project demo meeting - man pointing at thing.

Students and F-Secure fellows in one of the initial project meetings.

The project ended up using dozens of technologies. Apache Spark was selected to perform data-mining tasks such as extracting and processing data from external sources and running queries on the extracted data. Amazon Web Services (AWS) was used extensively for data processing and storage, with Amazon EMR handling subsequent data processing. Amazon DynamoDB stored the configuration and service state of the systems in use. Amazon Kinesis was integrated to handle streaming data sources. Finally, AWS CloudFormation was used to manage the collection of AWS resources in use.

Tech used in DMSI

Just some of the tech used in the DMSI project.

The F-Secure-sponsored team was praised for its commitment, use of a very challenging technology stack, thorough documentation, committed involvement with customers (analyst workshops, demos held for F-Secure Labs), high integration quality, exemplary development process, and the quality of its final presentation. What’s more, we’ll be bringing three of the team members on for summer jobs, starting in May.

By the way, we’ve managed to participate in this course for several years already. We’ve frequently made it into the top six, and a few times into the top three. This was our first time finishing in first place. Win or not, this is a great way for our local undergrads to get involved in something industry-relevant, to see how software engineering works in the real world, and to meet engineers from software houses in the region. If you’re interested in reading more about what our team accomplished, here are a couple of links:
