The challenges in building a reliable AI marketing solution

AI marketing Introduction

Recently, machine learning and Artificial Intelligence (AI) have been gaining significant attention in marketing. According to Forbes, 84% of marketing organisations implemented or expanded AI in 2018, and 75% of enterprises which used AI and machine learning enhanced customer satisfaction by more than 10%. 

Although the application of machine learning and AI to problem-solving in marketing is promising, there are risks that we should all be aware of before celebrating too much about this new age of AI success. 

Previously, we presented examples of how computer vision and AI can be applied to the real world and how we use AI to provide insight for our customers. In this blog post we would like to discuss how we evaluate the performance of our AI applications and why this is important.

What are metrics? And why do we need them?

AI has already produced some great results in marketing. For example, telecoms are using AI to predict customer churn, and Harley-Davidson used AI to increase sales leads by 2,930%. However, machine learning wouldn’t succeed without well-defined metrics.

In the field of AI, there are no quick answers. To achieve success in solving a specific marketing problem, it is normal to have to experiment on a variety of models using different hypotheses. This is analogous to inventing a new drug, which will need the approval of regulators, and therefore has to be tested in many ways to prove it is effective. Naturally, to achieve satisfactory results it’s critical to have indicators or measurements. Without them we can’t possibly solve the problem. 

Solving complex automotive problems in particular requires a specific balance of metrics during development. The level of effort made to understand and apply the correct balance is what makes the difference between developing AI solutions that are genuinely useful for brands vs those that exist merely as a marketing ploy. Let’s look at an example.

Imagine we want to build an application that classifies a given image as either a car or a non-car. This type of problem is called classification. The classifier assigns an image (e.g. a car or non-car image) to one of a set of categories (car/non-car). If the objective of the car classifier is to detect a car, for the sake of convenience we assign a label 1 to a car image to indicate a positive sample, and a label 0 to a non-car image to indicate a negative sample. If the objective is to detect a non-car, then the labels would be reversed.

Before going any further, we need to quickly discuss some essential metric terms that we employ to classify results in machine learning:

1. True Positive (TP): A true-positive is a result that indicates a given condition exists, when it truly does.

For example, the car classifier detects a car image as a car.

2. True Negative (TN): A true-negative is a result that indicates a given condition doesn’t exist, when it truly does not.

For example, the car classifier detects a non-car image as a non-car.

3. False Positive (FP): A false-positive is a result that indicates a given condition exists, when it does not.

For example, the car classifier detects a non-car image as a car.

4. False Negative (FN): A false-negative is a result that indicates a given condition doesn’t exist, when it does.

For example, the car classifier detects a car image as a non-car.
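To make the four outcomes concrete, here is a minimal Python sketch that counts them for the car classifier, using 1 for a car and 0 for a non-car as described above. The function name `confusion_counts` and the six example labels are made up for illustration.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN for binary labels (1 = car, 0 = non-car)."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1   # car detected as a car
        elif a == 0 and p == 0:
            counts["TN"] += 1   # non-car detected as a non-car
        elif a == 0 and p == 1:
            counts["FP"] += 1   # non-car detected as a car
        else:
            counts["FN"] += 1   # car detected as a non-car
    return {key: counts[key] for key in ("TP", "TN", "FP", "FN")}

# Hypothetical ground truth and classifier output for six images
actual    = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
print(confusion_counts(actual, predicted))
# {'TP': 2, 'TN': 2, 'FP': 1, 'FN': 1}
```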

In the case of our example car classifier, the easiest way to evaluate how well it performs is to work out its accuracy on a set of images containing cars and non-cars. The accuracy of the classifier is simply the fraction of all examples it classified correctly: (TP + TN) / (TP + TN + FP + FN).

Unfortunately, while using the accuracy metric as a performance measurement is intuitive, it has its drawbacks. For example, imagine we want to build an image classifier to automatically find cars from a brand called GMA and a specific model called T.44. That classifier simply has to decide whether or not an image contains a T.44.

If we fed 100 images into our classifier and the result showed the model to be 92% accurate, we’d be forgiven for thinking it must be very good (because 92 out of 100 is a great result in basic terms). However, the classifier may have been terrible at identifying T.44 images, so much so that it missed all of them: it ‘saw’ every one of the 100 images as a non-T.44, and because there were 92 non-T.44s, it was correct 92 times. All 8 T.44 images are False Negatives here, but the accuracy metric doesn’t reflect this. This might not be a serious problem if the risk associated with a False Negative is low. However, if we were building a classifier to detect a rare disease, for example, accuracy would not be a metric we would want to use. This simple example is a classic illustration of a misleading metric, one that an AI developer could use to present attractive-looking stats on their effectiveness while the reality under the surface is quite different.
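The arithmetic behind this example can be checked with a few lines of Python. This is only a sketch of the scenario described above: 8 T.44 images, 92 non-T.44 images, and a classifier that labels everything as non-T.44. The variable and function names are illustrative.

```python
def accuracy(actual, predicted):
    """Fraction of images whose predicted label matches the true label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

# 1 = T.44, 0 = non-T.44
actual = [1] * 8 + [0] * 92      # 8 T.44 images among 100
predicted = [0] * 100            # classifier 'sees' every image as non-T.44

# T.44 images that were labelled as non-T.44 (False Negatives)
missed = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

print(accuracy(actual, predicted))  # 0.92
print(missed)                       # 8
```

The accuracy looks healthy even though every single T.44 image was missed.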

More than Accuracy

Accuracy is a good measurement if the classes of data are evenly balanced. Suppose we have a balanced dataset containing 50 T.44 car images and 50 non-T.44 images. For the classifier above, which labels every image as a non-T.44, the accuracy would fall to 50%, since the outcomes contain 50 True Negatives and 0 True Positives. As we can see, the accuracy metric becomes biased if one class dominates the dataset, as it does in our original example where just 8 images in 100 contain the car we want to identify.
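As a quick illustration, the accuracy of a classifier that labels every image as negative depends only on how many negatives are in the dataset. The helper below is a hypothetical sketch of that relationship, not part of any library.

```python
def all_negative_accuracy(n_positive, n_negative):
    """Accuracy of a classifier that labels every image as negative.

    Every negative is a True Negative and every positive is a False Negative,
    so accuracy is just the share of negatives in the dataset.
    """
    return n_negative / (n_positive + n_negative)

print(all_negative_accuracy(8, 92))   # imbalanced dataset: 0.92
print(all_negative_accuracy(50, 50))  # balanced dataset:   0.5
```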

So what other metrics could we use in this case? In machine learning/AI, classification is all about identifying which of a set of categories a new sample (e.g. an image) belongs to. To evaluate classifiers we commonly use two metrics called precision and recall as our more insightful performance measurements. Essentially, each metric focuses on one kind of misclassification: precision falls as False Positives increase, and recall falls as False Negatives increase.

Precision in machine learning marketing

As the name suggests, Precision tells us how precise the model is. Precision is the share of the images the model labels as positive that really are positive: TP / (TP + FP). In the T.44 example above, the precision would be 8% if the classifier had identified all 100 images as being T.44 (100 images labelled positive, 8 of which were actually correct). That’s quite a difference from our initial 92% accuracy metric. For illustration, if the classifier detected all 100 images as being non-T.44 (i.e. all detected as negative outcomes), there would be no positive predictions at all; strictly speaking precision is then undefined, and by convention it is reported as 0%.

On the other hand, if it detected every T.44 car image correctly and there were no False Positives, the precision would be 100% – and a great classifier!
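The precision figures above can be reproduced with a short sketch. Precision is computed here as TP / (TP + FP); returning 0.0 when the model makes no positive predictions is a convention assumed for this example, since the ratio is otherwise undefined.

```python
def precision(tp, fp):
    """Share of positive predictions that are correct: TP / (TP + FP)."""
    predicted_positive = tp + fp
    # Assumed convention: report 0.0 when nothing was predicted positive.
    return tp / predicted_positive if predicted_positive else 0.0

print(precision(tp=8, fp=92))  # every image labelled T.44:      0.08
print(precision(tp=0, fp=0))   # every image labelled non-T.44:  0.0
print(precision(tp=8, fp=0))   # all 8 found, no False Positives: 1.0
```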


Recall

The recall metric is the percentage of total relevant samples (i.e. total ‘positive’ samples) that the model correctly detected: TP / (TP + FN). In other words, it falls as more images are classified as False Negatives (FN), so it tells us how many of the samples the model should have detected were actually detected. In the T.44 example above, the recall would be 100% if the classifier had identified all 100 images as being T.44 (8 relevant samples, none of which were missed). That number would be 0% if the classifier detected all 100 images as being non-T.44 (i.e. all detected as negative outcomes, so every relevant sample was missed).
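For reference, recall is conventionally computed as TP / (TP + FN), the share of actual positive samples that the model found. A minimal sketch that reproduces the two T.44 figures above, with an illustrative function name:

```python
def recall(tp, fn):
    """Share of actual positives that the model detected: TP / (TP + FN)."""
    actual_positive = tp + fn
    # Assumed convention: report 0.0 if the dataset contains no positives.
    return tp / actual_positive if actual_positive else 0.0

print(recall(tp=8, fn=0))  # every image labelled T.44, nothing missed: 1.0
print(recall(tp=0, fn=8))  # every image labelled non-T.44, all missed:  0.0
```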

Choosing the right metrics

In the T.44 example above we mentioned that a model which classified every image as non-T.44 would have an accuracy of 92%. However, both precision and recall would be 0%, since the model made no correct positive predictions and missed every T.44 image. In contrast, if the model classified every image as T.44, recall would be 100% as the model caught every image it was looking for. On the other hand, precision would be just 8%. As you can see, there is often a tradeoff between precision and recall, where it is possible to increase one at the cost of the other. It is the tweaking of this finely tuned balance, and an understanding of the resulting impact, that defines high-quality AI development.
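Putting the three metrics side by side for the two extreme classifiers makes the contrast easier to see. The sketch below uses the standard formulas and treats an undefined ratio as 0.0, which is an assumption made for this example.

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision and recall from the four outcome counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# 100 images, 8 of which are T.44s
all_negative = metrics(tp=0, tn=92, fp=0, fn=8)   # everything labelled non-T.44
all_positive = metrics(tp=8, tn=0, fp=92, fn=0)   # everything labelled T.44

print(all_negative)  # {'accuracy': 0.92, 'precision': 0.0, 'recall': 0.0}
print(all_positive)  # {'accuracy': 0.08, 'precision': 0.08, 'recall': 1.0}
```

Neither extreme is useful, which is why all three numbers are worth reading together.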

In practice, we compare all three metrics, accuracy, precision and recall, for all the models we test at StoryStream. The metric we choose to be the most important depends on the problem itself, such as what type of car imagery we want to surface or how finely tuned a customer needs search results to be when looking for images within our platform.

Our team typically aims for higher recall if the danger associated with non-detection is high, like missing an incredible piece of content among thousands of images when we know there are only a few in there and each one could make an entire campaign succeed.

In this way we can ensure our application doesn’t miss the target too often. On the other hand, we aim for higher precision if the danger associated with non-detection is relatively low. In this case, the model will be less sensitive but more precise.
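One common way to shift this balance, assumed here purely for illustration, is to change the confidence threshold above which an image counts as a positive. The scores and labels below are invented, and the sketch is not a description of how StoryStream tunes its models.

```python
# Hypothetical confidence scores (0 to 1) and true labels for ten images
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    1,    0,    0,    0,    0]

def precision_recall_at(threshold, scores, labels):
    """Precision and recall when scores >= threshold count as positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for t in (0.85, 0.5, 0.25):
    p, r = precision_recall_at(t, scores, labels)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
# threshold=0.85 precision=1.00 recall=0.50
# threshold=0.50 precision=0.60 recall=0.75
# threshold=0.25 precision=0.50 recall=1.00
```

A high threshold gives a less sensitive but more precise model, while a low threshold catches every positive at the cost of more False Positives.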

These decisions are ones that our AI team tackles every day at StoryStream. Our proprietary image tagging and recognition solutions are built to solve specific challenges for automotive marketers, like identifying specific car makes, models and even sub-variants in specific surroundings or scenes so that they can be surfaced for use in a campaign. That typically means they can’t be solved with off-the-shelf software from major tech providers, so each customer problem we face requires a unique balance of metrics to help guide our solution development.

To hear more about our AI solutions that are already making content marketing easier and more effective, get in touch with our team today!