Have you ever wondered why data scientists talk about a lack of data in AI marketing when they seemingly have endless assets to work with? Here’s an explanation to help solve the mystery.

Why data is still the biggest obstacle to AI marketing, and how StoryStream is tackling it

AI marketing – an introduction

Ten years ago, Google’s chief economist Hal Varian said in a New York Times article: “I keep saying that the sexy job in the next 10 years will be statisticians. And I’m not kidding.” The remark reflected the fact that the capacity for storing data was increasing rapidly, and every company in the world, from major tech players to SMBs, was putting strategies in place to gather as much data as possible. Ultimately, of course, someone would need to analyse it.

In the last few years, the amount of data generated has grown exponentially, reaching 2.5 quintillion bytes of data created each day according to internet service companies like Blazon. This growth has dramatically changed the type of algorithms used for analysis. Initially, data modelling algorithms had to rely on many assumptions to compensate for low data volumes, which often made their results inaccurate.

Nowadays, the approach is to feed masses of data to AI models in the hope that a model learning from lots of high quality data (images, for example) will produce high quality or, to be more precise, accurate results.

It’s here that concepts such as neural networks (NN) have become so attractive to many tech companies, alongside the idea of deep learning, an ever-trending topic today. In simple terms, a neural network is a computing system that learns to perform a task from the examples provided to it, for example learning to identify a specific model of car after seeing thousands of images of that car. It is a simplified imitation of how a biological brain functions.

In the case of tagging and organising imagery, deep learning is a very attractive approach because it allows the system to decompose pictures in a hierarchical and reliable way. First the network learns to identify the edges of objects such as a car bonnet, then circular elements like headlights, then complex visual patterns such as air intakes. Eventually the object of interest, e.g. a BMW 3 Series, can be identified as a whole within images.
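To make the "edges first" idea concrete, here is a minimal sketch of the kind of filter an early network layer tends to end up with. It is only an illustration: the Sobel kernel below is hand-written, whereas a trained network learns its own filter values, and the tiny image and the `convolve_valid` helper are assumptions made up for this example.

```python
# Hand-written stand-in for a learned first-layer filter (assumption:
# a real network learns its own kernel values during training).
SOBEL_X = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]


def convolve_valid(image, kernel):
    """Slide a square kernel over a 2D list of pixel values, without padding."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    output = []
    for r in range(rows):
        row = []
        for c in range(cols):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Toy grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
]

edges = convolve_valid(image, SOBEL_X)
for row in edges:
    print(row)
```

The filter responds strongly only where dark meets bright, which is the vertical edge in the middle of the toy image. Deeper layers combine many such responses into shapes like headlights and vents.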

Once trained, a network can even create entirely new images seamlessly, and there are many incredible examples of this already working today.

So, back to our main point: the volume of data is increasing rapidly, and deep learning models are efficient at learning from high volumes of data, but a big issue remains in getting high volumes of good quality data, that is, the right data.

Choosing the correct data

The issue with giving a neural network the flexibility to learn purely from the data it’s provided is that the resulting model will also adapt to noise in that data. In other words, it can find patterns that are not correct.

This is particularly important for automotive marketing. Imagine that the dataset you’re going to feed to a neural network is one where all Porsches happen to be red and all Fords are blue. This might make the network believe that red cars are always Porsches and blue ones are always Fords.

So, it’s critical to make sure the collected dataset properly represents a realistic spectrum and mix of imagery that will force the network to learn to differentiate images in the right way. Continuing with the previous example, to differentiate between Ford and Porsche, the model should have samples of cars with different colors, different backgrounds and of different models from each brand.
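A quick sanity check for this kind of bias can be sketched in a few lines. The example below is illustrative only: the sample records, the field names `brand` and `color`, and the `find_confounded_values` helper are all hypothetical, and it assumes each image already carries such metadata.

```python
from collections import defaultdict


def find_confounded_values(samples, attribute, label_key="brand", min_count=2):
    """Return attribute values that only ever appear with a single label."""
    labels_by_value = defaultdict(set)
    counts = defaultdict(int)
    for sample in samples:
        value = sample[attribute]
        labels_by_value[value].add(sample[label_key])
        counts[value] += 1
    return {
        value: next(iter(labels))
        for value, labels in labels_by_value.items()
        if len(labels) == 1 and counts[value] >= min_count
    }


# Hypothetical metadata for a small training set.
samples = [
    {"brand": "Porsche", "color": "red"},
    {"brand": "Porsche", "color": "red"},
    {"brand": "Ford", "color": "blue"},
    {"brand": "Ford", "color": "blue"},
    {"brand": "Ford", "color": "white"},
    {"brand": "Porsche", "color": "white"},
]

confounded = find_confounded_values(samples, "color")
print(confounded)
```

Here red and blue are flagged because each appears with only one brand, while white is not because it appears with both. Flagged values are a prompt to collect more varied images, not proof that the model has learned the wrong pattern.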

In the case of real world images used for AI marketing, there is an extra difficulty in the fact that objects are 3D. In some cases, such as images of roughly symmetrical objects like cans, fruit or trees, this isn’t a big problem, since those objects look alike from different angles.

For more varied objects, however, and automotive images in particular, this is not the case, which makes the data gathering process far more challenging.

To add further complexity, it is critical to make sure that the region of the car that distinguishes it is visible in the image being analysed, which is difficult when dealing with sub-models and variants.

For example, when analysing the new Porsche 911 Carrera and Turbo models, the visual difference is mainly in the side vents. So, an AI model cannot be expected to learn to differentiate these two car models from front views (figures (c) and (d) in the second image below); it will need side or diagonal views instead (figures (a) and (b)).

ai marketing blog post

ai marketing solutions blog post

This means that the data gathering process for automotive AI marketing solutions still requires a cleaning step to eliminate unsuitable images that would confuse the model, like the images shown in (c) and (d).
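As a rough sketch of that cleaning step, assuming each image has already been tagged with the angle it was shot from, the filter itself is simple. The `view` field, the image ids and the `REQUIRED_VIEWS` lookup are hypothetical names for this illustration, not part of any real system.

```python
# Hypothetical lookup: views in which the distinguishing region is visible.
REQUIRED_VIEWS = {
    ("911 Carrera", "911 Turbo"): {"side", "diagonal"},
}


def usable_for(images, model_pair):
    """Keep only images whose view shows the distinguishing region."""
    allowed = REQUIRED_VIEWS[model_pair]
    return [image for image in images if image["view"] in allowed]


# Hypothetical image metadata; the view tags are assumed to exist already.
images = [
    {"id": "img-001", "view": "side"},
    {"id": "img-002", "view": "diagonal"},
    {"id": "img-003", "view": "front"},
    {"id": "img-004", "view": "front"},
]

usable = usable_for(images, ("911 Carrera", "911 Turbo"))
print([image["id"] for image in usable])
```

The hard part in practice is producing the view tags in the first place, which is why some manual checking remains.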

At StoryStream, we’ve solved this problem by building a semi-automated pipeline for data management that allows our data engineers to quickly find the right kind of high quality automotive datasets. We have many feeds from our customers that are constantly bringing in new images of varying quality and usefulness. The next step in the pipeline is to eliminate those images that do not meet the minimum quality standard. In addition, images are organised and displayed so that our engineers spend the minimum amount of time manually checking them and verifying the correct labels.
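The two steps described above, dropping low quality images and organising the rest for review, could look something like the sketch below. This is not StoryStream's actual pipeline: the resolution thresholds, the field names and the idea of batching by proposed label are all assumptions chosen to keep the example small.

```python
from collections import defaultdict

# Assumed minimum resolution; a real pipeline would likely check more
# than this (blur, cropping, watermarks and so on).
MIN_WIDTH, MIN_HEIGHT = 640, 480


def passes_quality(image):
    """Return True if the image meets the assumed minimum resolution."""
    return image["width"] >= MIN_WIDTH and image["height"] >= MIN_HEIGHT


def build_review_batches(images):
    """Group acceptable images by proposed label for batch verification."""
    batches = defaultdict(list)
    for image in images:
        if passes_quality(image):
            batches[image["proposed_label"]].append(image["id"])
    return dict(batches)


# Hypothetical incoming feed.
feed = [
    {"id": "a1", "width": 1920, "height": 1080, "proposed_label": "911 Turbo"},
    {"id": "a2", "width": 320, "height": 240, "proposed_label": "911 Turbo"},
    {"id": "a3", "width": 1280, "height": 720, "proposed_label": "911 Carrera"},
    {"id": "a4", "width": 800, "height": 600, "proposed_label": "911 Turbo"},
]

batches = build_review_batches(feed)
print(batches)
```

Grouping by proposed label means a reviewer can confirm or reject a whole screen of similar images at once, rather than switching context on every picture.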

However, even with those proprietary systems, the ability to train AI models on limited datasets is still essential in automotive marketing, given the many possible angles, colors, model variants and new releases, and how little of the available data is readily usable.

That’s where our AI research team are constantly finding new ways to fill our data gaps with a blend of real life imagery, processing systems and synthetic data, making our AI models as robust as we need them to be. By using generative adversarial networks (GANs), we can quickly create new variations of training content, such as recoloring a car featured in an image or editing background elements as needed. At times we can even use computer aided design (CAD) programs to generate completely new images to exact specifications, filling a gap in our real world content. The challenge with these activities, of course, is that high quality CAD programs can be costly and complex to use, so frequently it’s our ability to think creatively around a particular problem that determines success or failure in data gathering.
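To give a feel for what a recolored training variation means, here is a deliberately simple stand-in. It is not a GAN: it just rotates the hue of each pixel using Python's standard `colorsys` module. The `shift_hue` helper and the sample pixels are made up for illustration, and the sketch assumes the car's pixels have already been separated from the background.

```python
import colorsys


def shift_hue(pixels, degrees):
    """Rotate the hue of each (r, g, b) pixel, with channel values in 0..255."""
    shifted = []
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        h = (h + degrees / 360) % 1.0
        nr, ng, nb = colorsys.hsv_to_rgb(h, s, v)
        shifted.append((round(nr * 255), round(ng * 255), round(nb * 255)))
    return shifted


# Hypothetical pixels: red bodywork and a neutral gray trim element.
car_pixels = [(255, 0, 0), (128, 128, 128)]

recolored = shift_hue(car_pixels, 240)
print(recolored)
```

Rotating the hue by 240 degrees turns the red bodywork blue while leaving the gray pixel untouched, since gray has no saturation to shift. A generative model can go much further, handling reflections, shadows and paint finishes that a plain hue shift ignores.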


Although the quantity and quality of data available to researchers have improved in recent years, data is still a significant bottleneck for many use cases in AI marketing, especially in the automotive industry. To build more effective, real world AI applications that actually help marketers in their daily work, sample variety and a proper pipeline for gathering data are ultimately what will separate leading solutions from mere afterthought features.