This article is the first in a series of four articles on the work we’ve been doing for the European Union’s Horizon 2020 project codenamed SHERPA. Each of the articles in this series contain excerpts from a publication entitled “Security Issues, Dangers And Implications Of Smart Systems”. For more information about the project, the publication itself, and this series of articles, check out the intro post here.
This article details the types of flaws that can arise when developing machine learning models. It also includes some advice on how to avoid introducing flaws while developing your own models.
Machine learning is the process of training an algorithm (model) to learn from data without the need for rules-based programming. In traditional software development processes, a developer hand-writes code such that a known set of inputs are transformed into desired outputs. With machine learning, an algorithm is iteratively configured to transform a set of known inputs into a set of outputs optimizing desired characteristics. Many different machine learning architectures exist, ranging from simple logistic regression to complex neural network architectures (sometimes referred to as “deep learning”). Common uses of machine learning include:
Methods employed to train machine learning models depend on the problem space and available data. Supervised learning techniques are used to train a model on fully labelled data. Semi-supervised learning techniques are used to train a model with partially labelled data. Unsupervised learning techniques are used to process completely unlabelled data. Reinforcement learning techniques are used to train agents to interact with environments (such as playing video games or driving a car).
Recent innovations in the machine learning domain have enabled significant improvements in a variety of computer-aided tasks, including:
Machine learning-based systems are already deployed in many domains, including finance, commerce, science, military, healthcare, law enforcement, and education. In the future, more and more important decisions will be made with the aid of machine learning. Some of those decisions may even lead to changes in policies and regulations. Hence it will be important for us to understand how machine learning models make decisions, predict ways in which flaws and biases may arise, and determine whether flaws or biases are present in finished models. A growing interest in understanding how to develop attacks against machine learning systems will also accompany this evolution, and, as machine learning techniques evolve they will inevitably be adopted by ‘bad actors’, and used for malicious purposes.
This series of articles explore how flaws and biases might be introduced into machine learning models, how machine learning techniques might, in the future, be used for offensive or malicious purposes, how machine learning models can be attacked, and how those attacks can presently be mitigated. Machine learning systems present us with new challenges, new risks, and new avenues for cyber attackers. Further articles in this series will explore the implications of attacks against these systems and how they differ from attacks against traditional systems.
If a machine learning model is designed or trained poorly, or used incorrectly, flaws may arise. Designing and training machine learning models is often a complex process, and there are numerous ways in which flaws can be introduced.
A flawed model, if not identified as such, can pose risks to people, organizations, or even society. In recent years, machine-learning-as-a-service (such as Amazon SageMaker, Azure Machine Learning Service, and Google Cloud Machine Learning Engine) offerings have enabled individuals to train machine learning models on their own data, without the need for deep technical domain knowledge. While these services have lowered the barrier to adoption of machine learning techniques, they may have also inadvertently introduced the potential for widespread misuse of those techniques.
This article enumerates the most common errors made while designing, training, and deploying machine learning models. Common flaws can be broken into three categories – incorrect design decisions, deficiencies in training data, and incorrect utilization choices.
Machine learning models are essentially functions that accept a set of inputs, and return a set of outputs. It is up to a machine learning model’s designer to select the features that are used as inputs to a model, such that it can be trained to generate accurate outputs. This process is often called feature engineering. If a designer of a model chooses features that are signal-poor (have little effect on the decision that is made by the model), irrelevant (have no effect on the decision), or introduce bias (inclusion or omission of inputs and / or features that favour/disfavour certain results), the model’s outputs will be inaccurate.
If features do not contain information relevant to solving the task at hand, they are essentially useless. For instance, it would be impossible to build a model that can predict the optimal colour for targeted advertisements with data collected from customer’s calls for technical support. Unfortunately, the misconception that throwing more data at a problem will suddenly make it solvable is all too real, and such examples do occur in real life.
A good example of poor feature engineering can be observed in some online services that are designed to determine whether Twitter users are fake or bots. Some of these services are based on machine learning models whose features are derived only from the data available from a Twitter account’s “user” object (the length of the account’s name, the date the account was created, how many Tweets the account has published, how many followers and friends the account has, and whether the account has set a profile picture or description). These input features are relatively signal-poor for determining whether an account is a bot, which often manifests in incorrect classification.
This is the information (circled) that some “bot or not” services use to determine whether an account is a bot. Or not.
Another common design flaw is inappropriately or suboptimally chosen model architecture and parameters. A potentially huge number of combinations of architectures and parameters are available when designing a machine learning model, and it is almost impossible to try every possible combination. A common approach to solving this problem is to find an architecture that works best, and then use an iterative process, such as grid search or random search to narrow down the best parameters. This is a rather time-consuming process – in order to test each set of parameters, a new model must be trained – a process that can take hours, days, or even weeks. A designer who is not well-practiced in this field may simply copy a model architecture and parameters from elsewhere, train it, and deploy it, without performing proper optimization steps.
An illustration of some of the design decisions available when building a machine learning model.
Incorrect choices in a model’s architecture and parameters can often lead to the problem of overfitting, when a model learns to partition the samples it has been shown accurately, but fails to generalize on real-world data. Overfitting can also arise from training a model on data that contains only a limited set of representations of all possible inputs, which can happen even when a training set is large if there’s a lack of diversity in that dataset. Problems related to training data will be discussed in greater detail in the next subsection.
Overfitting can be minimized by architectural choices in the model – such as dropout in the case of neural networks. It can also be minimized by data augmentation – the process of creating additional training data by modifying existing training samples. For instance, in order to augment the data used to train an image classification model, you might create additional training samples by flipping each image, performing a set of crops on each image, and brightening/darkening each image.
It is common practice to evaluate a model on a separate set of samples after training (often called a test set). However, if the test set contains equally limited sample data, the trained model will appear to be accurate (until put into production). Gathering a broad enough set of training examples is often extremely difficult. However, model designers can iteratively test a model on real-world data, update training and test sets with samples that were incorrectly classified, and repeat this process until satisfactory real-world results are achieved. This process can be time-consuming, and thus may not always be followed in practice.
Supervised learning methods require a training set that consists of accurately labelled samples. Labelled data is, in many cases, difficult or costly to acquire – the process of creating a labelled set can include manual work by human beings. If a designer wishes to create a model using supervised learning, but doesn’t have access to an appropriate labelled set of data, one must be created. Here, shortcuts may be taken in order to minimize the cost of creating such a set. In some cases, this might mean “working around” the process of manually labelling samples (i.e. blanket collection of data based on the assumption that it falls under a specific label). Without manually checking data collected in this way, it is possible that the model will be trained with mislabelled samples.
If a machine learning model is trained with data that contains imbalances or assumptions, the output of that model will reflect those imbalances or assumptions. Imbalances can be inherent in the training data, or can be “engineered” into the model via feature selection and other designers’ choices. For example, evidence from the US suggests that models utilized in the criminal justice system are more likely to incorrectly judge black defendants as having a higher risk of reoffending than white defendants. This flaw is introduced into their models both by the fact that the defendant’s race is used as an input feature, and the fact that the historical data might excessively influence decision-making.
In another recent example, Amazon attempted to create a machine learning model to classify job applicants. Since the model was trained on the company’s previous hiring decisions, it led them to building a recruitment tool that reinforced their company’s historical hiring policies. The model penalized CVs that included the word “women’s”, downgraded graduates from women’s colleges, and highly rated aggressive language. It also highly rated applicants with the name “Jared” who had previously played lacrosse.
A further example of biases deeply embedded in historical data can be witnessed in natural language processing (NLP) tasks. The creation of word vectors is a common precursor step to other NLP tasks. Word vectors are usually more accurate when trained against a very large text corpus, such as a large set of scraped web pages and news articles (for example, the “Google News data” set). However, when running simple NLP tasks, such as sentiment analysis, using word vectors created in this manner, bias in English-language news reporting becomes apparent. Simple experiments reveal that word vectors trained against the Google News text corpus exhibit gender stereotypes to a disturbing extent (such as associating the phrase “computer programmer” to man and the word “homemaker” to woman).
Word vector examples. Source: https://towardsdatascience.com/word-embedding-with-word2vec-and-fasttext-a209c1d3e12c
The idea that bias can exist in training data, that it can be introduced into models, and that biased models may be used to make important decisions in the future is the subject of much attention. Anti-bias initiatives already exist (such as AlgorithmWatch (Berlin), and Algorithmic Justice League, (US), and several technical solutions to identify and fix bias in machine learning models are now available (such as IBM’s Fairness 360 kit, Facebook’s Fairness Flow, an as-yet-unnamed tool from Microsoft, and Google’s “what if” tool). Annual events are also arranged to discuss such topics, such as FAT-ML (Fairness, Accountability, and Transparency in Machine Learning). Groups from Google and IBM have proposed a standardized means of communicating important information about their work, such as a model’s use cases, a dataset’s potential biases, or an algorithm’s security considerations. Here are links to a few papers on the subject.
AI is reportedly transforming many industries, including lending and loans, criminal justice, and recruitment. However, participants in a recent Twitter thread started by Professor Gina Neff discussed the fact that imbalances in datasets is incredibly difficult to find and fix, given that it arises for social and organizational reasons, in addition to technical reasons. This was illustrated by the analogy that despite being technically rooted, both space shuttle accidents were ultimately caused by societal and organizational failures. The thread concluded that bias in datasets (and thus the machine learning models trained on those datasets) is a problem that no single engineer, company or even country can conceivably fix.
Machine learning models are very specific to the data they were trained on and, more generally, the machine learning paradigm has serious limitations. This is often difficult for humans to grasp – their overly high expectations come from naively equating machine intelligence with human intelligence. For example, humans are able to recognize people they know regardless of different weather and lighting conditions. The fact that someone you know is a different colour under nightclub lighting, or is wet because they have been standing in the rain does not make it any more difficult for you to recognize them. However, this is not necessarily the case for machine learning models. It is also important to observe that modelling always involves certain assumptions, so applying a machine-learning-based model in situations when the respective assumptions do not hold will likely lead to poor results.
Going beyond the above examples, people sometimes attempt to solve problems that simply cannot (or should not) be solved with machine learning, perhaps due to a lack of understanding of what can and cannot be done with current methodologies.
One good example of this is automated grading of essays, a task where machine learning with its current limitations should not be used at all. School districts in certain parts of the world have however created machine learning models using historically collected data – essays, and the grades that were assigned to them. The trained model takes a new essay as input and outputs a grade. The problem with this approach is that the model is unable to understand the content of the essay (a task that is far beyond the reach of current machine learning capabilities), and simply grades it based on patterns found in the text – sentence structure, usage of fancy words, paragraph lengths, and usage of punctuation and grammar. In some cases, researchers have written tools to generate nonsensical text designed to always score highly in specific essay grading systems.
The process of developing and deploying a machine learning model differs from standard application development in a number of ways. The designer of a machine learning model starts by collecting data or building a scenario that will be used to train the model, and writes the code that implements the model itself. The developer then runs a training phase, where the model is exposed to the previously prepared training data or scenario and, through an iterative process, configures its internal parameters in order to fit the model. Once training has ended, the resulting model is tested for the key task-specific characteristics, such as accuracy, recall, efficiency, etc. The output of training a machine learning model is the code that implements the model, and a serialized data structure that describes the learned parameters of that model. If the resulting model fails to pass tests, the model’s developer adjusts its parameters and/or architecture and perhaps even modifies the training data or scenario and repeats the training process until a satisfactory outcome is achieved. When a suitable model has been trained, it is ready to be deployed into production. The model’s code and parameters are plugged into a system that accepts data from an external source, processes it into inputs that the model can accept, feeds the inputs into the model, and then routes the model’s outputs to intended recipients.
Depending on the type and complexity of a chosen model’s architecture, it may or may not be possible for the developer to understand or modify the model’s logic. As an example, decision trees are often highly readable and fully editable. At the other end of the spectrum, complex neural network architectures can contain millions of internal parameters, rendering them almost incomprehensible. Models that are readable are also explainable, and it becomes much easier to detect flaws and bias in such models. However, these models are often relatively simple, and may be unable to handle more complex tasks. Tools exist to partially inspect the workings of complex neural networks, but finding bias and flaws in such models can be an arduous task that may often involve guesswork. Hence, rigorous testing is required to ensure the absence of potential flaws and biases. Testing a machine learning model against all possible inputs is impossible. In contrast, where an interface exists in traditionally built applications, defined processes and tools are available that enable developers to identify inputs that can catch all potential errors and corner cases.
An example of a decision tree. Source: https://lethalbrains.com/learn-ml-algorithms-by-coding-decision-trees-439ac503c9a4
Machine learning models receive inputs that have been pre-processed and then vectorized into fixed-size structures. Vectorization is the process of converting an input (such as an image, piece of text, audio signal, or game state) into a set of numerical values, often in the form of an array, matrix (two-dimensional array), or tensor (multi-dimensional array). Bugs may be introduced into the code that interfaces the model with external sources or performs vectorization; these may find their way in via code invoking machine learning methods implemented in popular libraries, or may be introduced in decision-making logic. Detecting such bugs is non-trivial.
Based on SHERPA partners’ experiences and knowledge gained while working in the field, we recommend following these guidelines while planning, building and utilizing machine learning models, so that they function correctly and do not exhibit bias:
The above guidelines do not include measures that designers might want to take to safeguard machine learning models from adversarial attacks. Adversarial attack techniques and mitigations against them are discussed in later articles in this series.
It is worth noting that design decisions made at an early stage of a model’s development will affect the robustness of the systems powered by that model. For instance, if a model is being developed to power a facial recognition system (which is used in turn to determine access to confidential data), the model should be robust enough to differentiate between a person’s face and a photograph. In this example, the trade-off between model complexity and efficiency must be considered at this early stage.
Some application areas may also need to consider the trade-off between privacy and feature set. An example of such a trade-off can be illustrated by considering the design of machine learning applications in the cyber security domain. In some cases, it is only possible to provide advanced protection capabilities to users or systems when fine-grained details about the behaviour of those users or systems are available to the model. If the transmission of such details to a back-end server (used to train the model) is considered to be an infringement of privacy, the model must be trained locally on the system that needs to be protected. This may or may not be possible, based on the resources available on that system.
There are many variables that can affect the outcome of a machine learning model. Designing and training a successful model often takes a great deal of time, effort, and expertise. There’s a lot of competition in the machine learning space, especially with regards to implementing novel services based on existing machine learning techniques. As such, shortcuts may be taken in the design and development of the machine learning models powering services that are rushed to market. These services may contain flaws, biases, or may even attempt to do things they shouldn’t. Due to the inherent complexity of the field, it is unlikely that lay people will understand how or why these services are flawed, and simply use them. This could lead to potentially unwelcome outcomes.
This concludes the first article in this series. The next article explores how machine-learning techniques and services that utilize machine learning techniques might be used for malicious purposes.