Explainable Artificial Intelligence (now fancily abbreviated as XAI) has been a growing trend in recent years. It is a logical consequence of the development of predictive algorithms of ever-growing complexity, namely deep learning systems such as CNNs. Such systems are rightly described as black boxes, because they don’t provide a straightforward interpretation of their decisions. You cannot easily ask a neural network why it classified a picture of you as a man or a woman, in the sense of “what are the human-like traits the network is learning to make the decision”. These questions inevitably take us to the terrain of causality, which I introduced in my previous review of Judea Pearl’s latest book. But staying on the terrain of traditional (correlation-based) machine learning, what can we do about explanation?
Methods for Explainable Machine Learning
While this article is not intended to be an exhaustive review of interpretable machine learning techniques, let’s make clear that they can be divided into two blocks:
- Intrinsically Interpretable Models. Models with a straightforward interpretation available right after training, like the Lasso we’re introducing in this post.
- Post Hoc / Model-agnostic / Extrinsic Methods. These are general algorithms that can be applied to any learning model.
For more information on this topic I can only recommend Christoph Molnar’s book on Interpretable Machine Learning, available online; it is my source for the lines above and for learning about this field.
Lasso Logistic Regression: the model
A classic of statistics and machine learning, and probably well known to most potential readers of this blog, this model is basically a regression with some tweaks. Given some data in a vector space, calculating a regression line (or hyperplane in higher dimensions) consists in solving an Ordinary Least Squares problem: the resulting line is the one that minimizes the sum of squared errors between its predictions and the variable being predicted. If the regression function is linear, we have linear regression. The coefficients of the model are the coefficients of the regression line, and learning a regression consists in calculating these coefficients.
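In symbols (a standard formulation, nothing specific to this post): with target $y$ and features $x_{i1}, \dots, x_{ip}$ for each data point $i$, learning the regression means finding the coefficients $\beta$ that minimize the squared errors:

$$\hat{\beta} = \underset{\beta}{\arg\min} \; \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip} \right)^2$$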
If we move to a classification setting, where the data points are labeled in two categories, we need to move from a regression line to a decision boundary that separates the space into the two categories. We can achieve this simply by applying the logistic (or sigmoid) function to the output of the linear regression. The logistic function acts as a transformation that squashes all values into the interval [0,1], so they can be interpreted as probabilities; set the decision boundary at, e.g., 0.5, and we can classify the data points.
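Written out, the logistic function and the resulting class probability are

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad P(y = 1 \mid x) = \sigma(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)$$

and we predict the positive class whenever this probability exceeds the chosen threshold, e.g. 0.5.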
At this point we could stop and start interpreting the results of a logistic regression model. Yet we’re adding one more twist: let’s talk about overfitting. In a very short summary, machine learning models tend to learn representations that are too closely fitted to the training sample. The problem is that samples are almost always noisy with respect to the population, so the model’s performance can degrade when predicting on a test sample. Regularization techniques are the standard remedy for overfitting, and Lasso is one of them. Also known as L1-norm regularization, it sets some coefficients to 0, effectively making the decision boundary simpler. The learning process will find a configuration that sets to 0 the coefficients that are unnecessary or redundant for prediction.
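To make this concrete, here is a minimal sketch with scikit-learn; the breast cancer dataset and the value of C are arbitrary illustration choices, not something prescribed by this post:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Any labeled tabular dataset works; breast cancer is just a convenient built-in
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scale features so coefficients are comparable

# penalty="l1" is the Lasso; C is the inverse of the regularization strength,
# so a smaller C pushes more coefficients to exactly 0
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X, y)

coefs = model.coef_.ravel()
print(f"{np.sum(coefs == 0)} of {coefs.size} coefficients were set to 0")
```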
Explaining the model
Think of the regression line: its coefficients are the numbers that define the slope of the line. The larger the coefficient, the steeper the line grows along that dimension, i.e. the stronger the direct proportionality. If the coefficient is negative, the relationship is inversely proportional. If it is 0, that dimension does not affect the outcome. If you translate this interpretation to the decision boundary obtained through the logistic function, the same rules apply but with probabilities: the higher the coefficient, the more an increase in that dimension increases the probability of a certain class. With a simple transformation, we can move from coefficients to odds ratios. Odds are not quite probabilities, but the intuition carries over: if the odds ratio of a dimension is 2, then an increase of 1 unit in that dimension doubles the odds of a certain class; if it is 0.25, the same increase of 1 unit divides the odds by 4. Instead of explaining odds here, I refer you to this link from Molnar’s book, where they are delightfully explained with simple maths.
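Continuing the sketch above, the “simple transformation” is just exponentiation: exp(β) is the odds ratio, the factor by which the odds of the positive class are multiplied for a one-unit increase in that dimension. A hypothetical continuation:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

feature_names = load_breast_cancer().feature_names
odds_ratios = np.exp(coefs)  # coefs comes from the previous sketch

# Report only the dimensions that Lasso did not zero out
for name, beta, odds in zip(feature_names, coefs, odds_ratios):
    if beta != 0:
        print(f"{name}: coefficient={beta:+.2f}, odds ratio={odds:.2f}")
```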
This is what we were looking for! An explanation of the decisions taken by the machine learning algorithm: in this case, the variables that positively or negatively affect the probability that a data point is classified under a category. Besides, because we applied Lasso regularization, the non-zero coefficients are a subset of the total, i.e. we have fewer dimensions whose interpretation we need to care about.
Practical advice on how to interpret Lasso coefficients
- Remember there is no cause-and-effect involved in a regression, just correlations that appear in the data sample. Hence, be careful about making causal statements such as “an increase in Age of 1 year causes a 5% increase in the probability of disease”. Instead, it is more rigorous to say “an increase in Age of 1 year is associated with a 5% increase in the probability of disease”.
- Be careful talking in absolute terms. Taking the dimensions with positive and negative coefficients, respectively, and stating “these words are used more often by men and those words by women” is wrong: maybe all words are used more often by men simply because men are overrepresented in your data, while some words are still more associated with women because of their relative frequency.
- Be careful talking in terms of relative frequencies too! They are not the same as regression coefficients. If they were, we would’ve just calculated the frequencies, which are much cheaper. But then we wouldn’t have a probabilistic interpretation of which dimensions make each class more likely. Logistic regression is a probabilistic classifier by nature.
- I consider “associated with”, “correlated with” and “linked to” equivalent and valid when interpreting regression coefficients. Sometimes it helps if you specify that these are phenomena intrinsic to your data sample: “in our data, the appearance of this word doubles the odds that the speaker is male” is more accurate than saying “this word doubles the odds that the speaker is male”.
