3 Things to Consider When Reading Reports About an Artificial Intelligence Application

Frank Ye
7 min read · Jul 23, 2020
Feature image: Web Network Programming Artificial Intelligence, by geralt from pixabay.com

Artificial Intelligence applications have skyrocketed in recent years. We see reports on new applications of this amazing technology almost every day. For example, scientists and engineers are applying artificial intelligence and machine learning techniques to fraud detection, speech and image recognition, self-driving cars, and much more.

Having built and tested dozens of predictive models for identifying people who may be at risk of gambling-related problems (models that are used commercially worldwide, by the way, so these are not toy projects), I feel, more than ever, the urge to share some insights from this area. My goal is to help the general public better understand how these projects are done and how to critically read their claims and announcements.

Machine Learning (ML) is a big topic. There are hundreds of textbooks on how to do this cool stuff, and this article will not go into those details; readers are encouraged to explore them to learn more. Instead, this article asks a few general, high-level questions that help stir your thinking on the topic.

Sampling

Machine learning is about finding patterns in data, either with or without human assistance. When it comes to a machine learning application, the first thing you need to pay attention to is the sampling: that is, what data were fed to the ML process for it to identify patterns in.

It cannot be emphasized enough how important sampling is. When it is done right, the end result can be amazing. When it is done poorly, it often results in bad models. For example, a recent article on facial recognition bias reported that people of colour are 10 to 100 times more likely to be falsely identified. As a practitioner in this area, I believe the root cause of this bias is the sample used to train these systems and algorithms: when people of colour are not properly represented in the sample, white faces will “dominate” the patterns found by the ML process, leading to higher false identification rates among people of colour.

There are a few concepts that are related to sampling.

  • Population. The population is the complete set of objects that matter to the ML model. For example, when creating a model for identifying gambling risk, the population could be all customers of the gambling product.
  • Sample. A sample is a subset of the population. The population is usually either infinite or too large to be studied as a whole, so data scientists select a subset of it to build their ML models.
  • Sampling Bias. Unless it is done fully at random, sampling always involves some level of bias. For example, many studies show that women are more likely to agree to take part in a survey. So if the data scientist uses a survey-response sample to build models, women are usually over-represented while men, especially younger men, are under-represented. (The sketch after this list shows one simple way to check for this.)
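
To make that check a little more concrete, here is a minimal sketch of one way it could be done: compare each group’s share in the sample against its known share in the population. The data, the gender field, and all the numbers are made up for illustration; they are not from any real project of mine.

```python
from collections import Counter

def group_shares(records, key):
    """Return each group's share of the records, e.g. {'female': 0.62, 'male': 0.38}."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

# Hypothetical survey sample vs. known population shares (illustrative numbers only).
survey_sample = [{"gender": "female"}] * 62 + [{"gender": "male"}] * 38
population_shares = {"female": 0.51, "male": 0.49}

sample_shares = group_shares(survey_sample, "gender")
for group, pop_share in population_shares.items():
    gap = sample_shares.get(group, 0.0) - pop_share
    print(f"{group}: sample {sample_shares.get(group, 0.0):.0%}, "
          f"population {pop_share:.0%}, gap {gap:+.0%}")
```

A real project would check many more dimensions than one, but the idea is the same: if a group’s share in the sample is far from its share in the population, the model will learn a distorted picture.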

So, when reading a report about an ML application, the first questions to ask are:

  • Are the samples used to build the ML models representative of the population that the ML models will be applied to?
  • If the sample used is not a fully random sample (and a fully random sample is very difficult to get), how did the data scientist check and control for sampling bias?

Without satisfying answers to these questions, any claim the ML application makes should be viewed with a big question mark over it.

Training vs Validation

Many ML models have the end goal of making decisions or predictions, such as whether or not a given picture of a face matches another picture, or whether or not it will rain tomorrow. In these cases, the critical question is how good these decisions or predictions are (i.e. the “performance” of the ML model).

Readers of an ML report need to be very careful about which sample the performance is reported on. There are usually two samples: a training sample and a validation sample.

  • Training Sample is the sample fed into the ML process for it to extract patterns from.
  • Validation Sample is a separate sample, which was NOT used during the ML process in any way.

ML algorithms are powerful. They will tirelessly go through the training sample and, in theory, eventually identify any pattern contained in it, even when the pattern is just a quirk of the training sample that appears nowhere else. When this happens, it is referred to as “over-fitting” in ML lingo.

This implies that ML models become intimately familiar with the training sample, so they always perform well on it. The true test is applying the ML model to the validation sample. Because the validation sample was never seen by the model, its performance there is close to the true performance you can expect when applying the model to the population.
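
To illustrate the training/validation gap, here is a minimal sketch using scikit-learn on synthetic data. The data, the model choice, and the split size are purely illustrative assumptions, not the setup from any of my actual projects; the point is simply that the score on the held-out sample is the one that matters.

```python
# A minimal sketch of the training/validation distinction, using scikit-learn
# with synthetic data. The model choice and numbers are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, somewhat noisy classification data (a stand-in for a real sample).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)

# Hold out a validation sample that the ML process never sees during training.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0)

# An unconstrained decision tree will happily memorise the training sample.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print(f"Training accuracy:   {model.score(X_train, y_train):.2f}")  # typically ~1.00
print(f"Validation accuracy: {model.score(X_valid, y_valid):.2f}")  # noticeably lower
```

The training score looks spectacular because the tree has memorised the noise; only the validation score tells you how the model will behave on data it has never seen.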

In practice, I have often seen models with 90%+ accuracy on the training sample whose performance on the validation sample ranged from only 20–40%. That is how significant the difference can be.

Again, when reading a report about an ML application, ask the following questions.

  • How were the models tested, and on what sample?
  • Did the data scientist make a clear distinction between the training and validation samples (i.e. make sure the validation sample was never seen by the ML models before the final performance testing)?

Quite a few reports I have come across did not give clear information on how their good-looking performance numbers were derived. I would be really impressed if they came from a true validation sample; unfortunately, I had to suspect the numbers were based on training samples.

Performance Metrics

We mentioned Accuracy in the previous section; it is just one of the performance metrics used to measure ML models. A few others are commonly used as well (the short sketch after this list puts numbers on all three).

  • Accuracy. Among all the decisions / predictions made by the ML model, how many of them are correct? Let’s say our model made 100 face matches and 90 of them are correct; the accuracy is then 90%.
  • Reach / Coverage. Among all the target cases in the sample / population, how many were included in the cases identified by our ML model? If there will be 100 storms worldwide this year and our model made 50 predictions, of which 20 did occur, then the Reach is 20% (20/100) while the Accuracy is 40% (20/50).
  • Lift. How much did our ML model improve our ability to make correct decisions and predictions? For example, if randomly selecting gamblers shows that 1% of them are problem gamblers (the “base rate”), and our model has an accuracy of 20%, that is a lift of 20 (meaning our model makes us 20 times better at finding them). This metric is particularly interesting when the base rate is low. More on this later.
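
Here is the promised sketch, plugging the example numbers above into the three definitions. There is nothing model-specific here; it is just the arithmetic behind the metrics.

```python
# Computing the three metrics exactly as defined above. The counts reuse the
# storm example; the base rate reuses the problem-gambling example.

def accuracy(correct_predictions, total_predictions):
    """Share of the model's predictions that turned out to be correct."""
    return correct_predictions / total_predictions

def reach(correct_predictions, total_target_cases):
    """Share of all real target cases that the model's predictions covered."""
    return correct_predictions / total_target_cases

def lift(model_accuracy, base_rate):
    """How many times better the model is than picking cases at random."""
    return model_accuracy / base_rate

print(f"Accuracy: {accuracy(20, 50):.0%}")   # 20 correct out of 50 predictions -> 40%
print(f"Reach:    {reach(20, 100):.0%}")     # 20 of the 100 storms that occurred -> 20%
print(f"Lift:     {lift(0.20, 0.01):.0f}x")  # 20% accuracy vs a 1% base rate -> 20x
```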

It is worth noting that Accuracy and Reach usually work against each other. When you want to be very accurate, you “narrow the net” so only the subjects you are very certain about are included in your predictions. This gives you high accuracy, but your reach will normally drop. On the other hand, when you “broaden the net” you will find more subjects of interest, but you will also include more false positives, so your accuracy will drop.

Not many reports include both numbers. Most of the news releases and marketing materials I have seen focus on just one number (presumably the good-looking one).

There are two other tricks for reporting good-looking numbers, which are especially misleading when the base rate is low. The first is using the wrong target. Using the same problem-gambling example from above, let’s say the base rate of problem gambling is 1% in the population, and suppose I had a model that blindly says everyone is not a problem gambler. Guess what? This model has an accuracy of 99%, and it is obviously useless.
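
To see how easily this trap produces an impressive-sounding number, here is a tiny sketch with made-up counts: a “model” that labels everyone as a non-problem-gambler still scores 99% accuracy at a 1% base rate.

```python
# The first trap in numbers: with a 1% base rate, a "model" that simply predicts
# "not a problem gambler" for everyone still scores 99% accuracy.
population_size = 10_000
problem_gamblers = 100                         # 1% base rate (illustrative numbers)

correct = population_size - problem_gamblers   # every non-gambler is labelled correctly
accuracy = correct / population_size
print(f"Accuracy of the useless model: {accuracy:.0%}")  # 99%, yet it finds nobody
```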

Now, some extra comments on Lift. The second trick for reporting good numbers is saying “our models are 50 times better”. I once saw a report about modelling for targets with a base rate of roughly 0.01% (my estimate). That report didn’t include the base rate but claimed to have improved prediction capability 50+ times. If you do the math, this means the accuracy of the model would be a pathetically low 0.5%. Of course, the marketing department would never show you this number; they would focus on the lift of 50.
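
And here is the second trick in numbers. The 0.01% base rate is my rough estimate, as noted above; the arithmetic is simply accuracy = base rate × lift.

```python
# The second trap: a big lift over a tiny base rate can still mean a very low accuracy.
base_rate = 0.0001      # 0.01% of cases are actual targets (my estimate)
claimed_lift = 50       # "our model is 50 times better than random"

implied_accuracy = base_rate * claimed_lift
print(f"Implied accuracy: {implied_accuracy:.1%}")  # 0.5% -- impressive lift, weak model
```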

So, ask these questions when you look at performance numbers.

  • Is all the information required to properly understand the performance provided? This includes the base rate, reach, accuracy, and lift. If some of it is missing, ask why and read with more caution.

Summary

Well, I hope you’ve all read through to reach this point and haven't been scared away by the intimidating terminologies.

Machine Learning and Artificial Intelligence are exciting, and they have great potential. However, I feel they are starting to become eye-grabbing buzzwords rather than a rigorous scientific practice.

I hope this article provides you with a better understanding of how to look through the mist and get to the true value proposition of an ML application.

