
Education, Intelligence, Media

Navigating the Brave New World

April 4, 2017 — by MediaMath


As a part of a module I teach at the University of West London, Emerging Technology, Other Realities, I have students read a work of science fiction in addition to practical training about programmatic advertising and augmented/virtual reality campaigns. How do you prepare students to work in a world that often resembles classics in science fiction? How do experienced industry professionals manage to stay ahead of the game in which the “present” is often already too late? There’s one rather old-fashioned answer: training.

The digital skills gap is an equal opportunity employer

According to the Government's "Digital Skills for the UK Economy" report, approximately 1 in 5 job vacancies relate to the digital skills gap. This is even more pronounced in creative industries like advertising. Experienced industry professionals often fall hard into the digital skills gap; yesterday's media planner is today's statistician. Not that there's anything wrong with statisticians, of course. However, as companies embrace the technological transformation of advertising, they often fail to take experienced staff along for the journey.

This comes at a business cost in addition to the obvious human one. Many companies making such technology shifts lose out on the deep knowledge that comes with experience. For example, I have worked with many media buyers who came up through radio and TV who would have been perfectly suited to more tech-centric media trading. In fact, they might have been better than the average entry-level employee, as they had years of experience negotiating and evaluating deals—something that’s hard to teach other than on the job. Those colleagues didn’t get the chance, however. No one trained them. The businesses in question lost out on all that knowledge, and the people in question were left wondering why they ever cared about their careers in the first place.

One might think that overlooking experience would lead employers to be overly satisfied with recent graduate hires, given that everyone always talks about tech-savvy Millennials. However, that is not the case. Both as an employer and as an advisor to employers, I have found communications graduates woefully unprepared for the automated world we’re creating. While Millennials and Gen Z are adept at using Snapchat filters on their phones, very few can tell you how Snapchat/Facebook/Instagram/display advertising might work to target them or be tailored to them. As Accenture’s Mohini Rao wrote in The Digital Skills Gap in the UK, “This generation are consumers of digital technology, not creators.”

Creating opportunities to learn about machines vs. machine learning

Of course, we, as a society (with our industry strongly implicated in this), have created these consumers who lack context. But because we created that environment, we can also create a new one with opportunities to code, to use industry tools, and to understand how the machines in our lives work. (Note: This should probably start in pre-school, but since I teach university, I’ll focus on Uni students.)

This is where courses like our new Advertising & Public Relations course come in. At UWL, we are committed to ensuring that our students understand the basics of automation so that they can succeed in their careers. Even creatives need to know that programmatic ads are often certain standard sizes, much like we used to teach the dimensions of standard print. Admittedly, we have an uphill battle here. When you start diving into these topics with university students, responses like “WTF is programmatic advertising?” or “You can see all that data about me? Is that legal?” or “But making skyscraper banners is boring!” are common.

To help overcome such obstacles and to create a more positive environment for the next generation of professionals, our course is working with visionary organisations like the New Marketing Institute (NMI). A few weeks ago, the excellent instructor from NMI came in and managed to transform boredom and cynicism into excitement. Several students afterward came to me to express interest in knowing more about things like programmatic and in exploring ad tech as a career path. That’s a testament both to the quality of NMI’s instruction and to the fact that, once this aspect of advertising is explained clearly, it doesn’t have to be a boring diet of acronym soup.

Would that more organisations like NMI visited universities. Would that more students were exposed to the advanced aspects of our practice. Would that my former colleagues had the chance to change with our industry, rather than outside of it. So let’s change all that, shall we?

Intelligence

Why Optimization in Marketing Matters More Than Ever

March 14, 2017 — by MediaMath


Read any marketing trade publication and “optimization” is a term that regularly pops up in the context of digital ad campaigns. But what does it mean to really “optimize” a campaign—and why should you consider it as part of your long-term marketing strategy?

Why you should use optimization

Any time you launch a campaign, you have a goal in mind. If you can measure this goal, you can optimize to it.  You just need to understand which dials to turn—which users to target, which sites to target them on, what time of day is most effective and more.

As more and more data becomes available, it is quickly becoming impossible for one person to manage this process by hand. There are multiple formats, individual images and messages to consider. What team—let alone individual—could look at every impression multiple times a day and score it? You need machine learning—the science and practice of designing computer algorithms that find patterns in large volumes of data—to manage the scale of the opportunity. A machine learning algorithm not only improves how well opportunities are assessed and continually improves upon itself over time; it also saves time and improves efficiency for employees and companies, freeing up traders to do less manual and more strategic work.

How to overcome the challenges of optimization

As the first demand-side platform, we’ve seen hundreds of advertisers, thousands of campaigns and trillions of impressions over the last decade. During this time, we’ve often encountered certain scenarios that pose a challenge to traditional optimization:

  • Short-lived campaigns: You have a campaign that won’t be active long enough to generate learnings. For example, a retailer may want to run a short promotional campaign to highlight an upcoming sales event.
    • Solution: Seed the new campaign with data from previous campaigns the advertiser has run. There’s no need to start from scratch every time!
  • Rare events: You care about a certain event but it doesn’t happen frequently enough. For example, a cruise operator wants to maximize new bookings through their website, but cruises are expensive and so there are relatively few merit events from which to learn.
    • Solution: Move up the sales funnel and choose a proxy for the event you care about. Users may not book a new cruise very often, but a repeat visit to the itinerary page may be a reliable signal of intent. You can also optimize towards audience members that look like the users you want to reach. Look-alike models like this are even more powerful when they utilize second-party data, which augments an advertiser’s first-party data with behavioral data from non-competitive advertisers to build a richer profile of each user.
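To make the look-alike idea concrete, here is a minimal sketch in Python. The feature columns, the synthetic data and the choice of logistic regression are illustrative assumptions, not a description of any particular platform's implementation; the sketch simply trains a classifier on users who completed the proxy event and scores a fresh audience pool by how closely they resemble those users.

```python
# Hypothetical look-alike scoring sketch: train on users who hit a proxy event
# (e.g. a repeat visit to the itinerary page), then score new users by how
# closely they resemble those converters. Features and data are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic training data: rows are users, columns are behavioral features
# (e.g. pages viewed, past travel purchases, days since last visit).
X_train = rng.normal(size=(1000, 3))
# Proxy label: 1 if the user revisited the itinerary page, else 0.
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1]
           + rng.normal(scale=0.5, size=1000) > 1).astype(int)

model = LogisticRegression().fit(X_train, y_train)

# Score a fresh pool of audience members; the highest scores are the
# "look-alikes" worth bidding on more aggressively.
X_new = rng.normal(size=(5, 3))
scores = model.predict_proba(X_new)[:, 1]
print(scores)
```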

How to be smart about optimization

Every advertiser faces a fundamental choice when setting up a new campaign: Is it more important to spend the budget in full or to stay within a certain performance threshold?  Most advertisers probably aspire to the second goal, but it can be difficult to have high conviction in performance KPIs that rely on last-touch conversion credit.  In other words, what does it really mean if a user who clicked an ad or saw an ad went on to buy something?  Would that purchase have happened anyway?

Answering questions like these and understanding the relationship between last-touch conversion credit and real, causal marketing outcomes requires marketers to think about incrementality—something we will cover in our next blog post.

Data, Intelligence, Media, Trends

MediaMath Recognized in the Latest Gartner Magic Quadrant for Digital Marketing Hubs

February 16, 2017 — by MediaMath


We’ve done it again.

MediaMath is being recognized in the latest Gartner Magic Quadrant for Digital Marketing Hubs for our scalable omnichannel platform, which enables marketers to seamlessly manage and activate their audience data in media across channels, formats and devices, and to optimize campaigns in real time.

The Gartner Magic Quadrant provides both quantitative and qualitative analysis on technology and service markets. It highlights the trajectory of key companies, which are assessed on their ability to execute and completeness of vision in a given category. You can access the full report here.

We are the leading DSP in the “Challengers” quadrant, and believe our strengths include:

  • Programmatic marketing at scale: We provide all of the tools a marketer or agency needs to execute programmatic marketing at scale. These include audience management powered by our fully-featured, next-generation DMP, integrated with our omnichannel DSP; a depth and breadth of privileged media access – across supply sources and markets – that’s unparalleled in the industry; and machine learning to help marketers optimize their campaigns to reach their audiences most efficiently and effectively, and to understand how each marketing touchpoint leads to incremental lift in revenue.
  • Integrated data management: Our integrated DMP connects data management directly with media execution and decision solutions to enable holistic audience management and omnichannel audience buying at scale. This DSP + DMP approach enables marketers and agencies to manage their programmatic marketing more seamlessly through activation of data right within our buying platform. Removing the need for an external DMP means there’s no data loss or latency, and audience profile data can be linked with media behavior for smarter, better informed planning, buying and optimization.
  • Real-time decisioning and optimization: Our platform leverages the centralized intelligence of our Brain algorithm to allocate budget intelligently across channels. Powered by advanced machine learning and oriented towards true incremental business performance, The Brain brings to bear increasingly intelligent optimizations and, therefore, business outcomes that are driven and measured by math, across all addressable channels, transparently.

Our clients are sophisticated marketers who are as obsessed as we are with data and technology as the fuel for next generation marketing and business performance. This year, we will continue to build and innovate on technology products and solutions that help them execute relevant 1:1, real-time marketing at scale that delivers against their most important business objectives.

Disclaimer

Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

Intelligence, Trends

Machine Learning: The Factors to Consider for Marketing Success

October 4, 2016 — by Ari Buchalter


This byline originally appeared in MarketingTech.

Adoption of machine learning is moving up a gear. From diagnosing diseases, to driving cars, to stopping crime, the ability of machines to learn from data and apply those learnings to solve real problems is being leveraged all around us at an accelerating pace.

As data volumes continue to grow, along with advances in computational science, machine learning is poised to become the next great technological revolution.

Digital marketing represents one of the most exciting arenas where machine learning is being applied. Across websites, mobile apps, and other digital media, there are hundreds of billions of opportunities for advertisers to deliver ads to consumers every single day.

Typically, those opportunities each come with a wealth of data attached and are made available in a real-time auction where potential buyers (i.e., advertisers or the agencies working on their behalf) have mere milliseconds to analyse the opportunity and respond with a bid to serve their ad to a particular consumer on a particular device at the perfect time.

The speeds and scales involved dwarf any other real-time transactional medium from financial exchanges to credit cards.

A lot of brainpower has therefore gone into defining what machine learning algorithms do when it comes to digital marketing. In plain English, it consists of things like identifying the most attractive consumers for a given brand or product, determining how much to bid on opportunities to show ads to those consumers in different contexts and on different devices, personalising the ads delivered to ensure relevance, and accurately measuring the impact of those ads on bottom-line sales.

As the pace of innovation accelerates, it’s important for brands to define guiding principles that ensure this amazing technology yields maximum impact.

Understanding how to best leverage the vast amounts of data about digital audiences and the media they consume can be the difference between success and failure for the world’s largest brands.

That is why smart marketers are investing heavily in Data Science and machine learning to drive competitive advantage and are increasingly seeking out partners with this expertise.

With so much hanging in the balance, it’s instructive to consider how marketers should approach machine learning.

There are several important factors marketers should bear in mind when implementing machine learning technology to help ensure success:

Start with business goals

When adopting machine learning technology, marketers should begin at the end. Define the specific, measurable business outcomes you want to achieve and gear the machine learning around that.

Avoid ‘shallow’ objectives like page visits or clicks and use deep metrics like incremental sales or ROI. The deeper the metric, the better the results.

Don’t be lured back into the traditional approach of buying media based on indexed demographics or other coarse proxies. The beauty of digital marketing is that it can be optimised against, and measured by, the same metrics that matter in the boardroom, achieving the best possible results.

Select a robust platform

When it comes to machine learning, there’s a huge difference between theory and practice. Solutions that work in a small-scale testing environment may fail spectacularly in large-scale production.

It’s critical the machine learning runs on a platform proven to handle the required scales and speeds with the performance, reliability, and security demanded by enterprise-class marketers.

Furthermore, look for platforms that are flexible enough to enable easy customisation, both of the data and the models, to meet the unique needs of your business.

But a word to the wise: platforms cannot serve two masters. Some marketing platforms aimed at advertisers are actually operated by companies who make their money selling media. That conflict of interest should make you think twice.

Read the rest of the article here.

Intelligence, Trends

Machine Learning Demystified, Part 3: Models

September 15, 2016 — by MediaMath


In Part 2 of our series, the ML novice realized that generalization is a key ingredient of a true ML algorithm. In today’s post we continue the conversation (the novice is in bold italics) to elaborate on generalization and take a first peek at how one might design an algorithm that generalizes well.


Last time we started talking about how one would design an ML algorithm to predict the click-through-rate (CTR) on an ad impression, and you designed two extreme types of algorithms. The first one was the constant algorithm, which simply records the overall historical click-rate a, and predicts a for every new ad opportunity.

This was fine for overall accuracy but would do poorly on individual ad impressions, because its predictions are not differentiated based on the attributes of an impression, and CTRs could be very different for different types of impressions.

Exactly, and the other extreme was the memorization algorithm: it records the historical CTR for every combination of feature-values, and for a new ad impression with feature-value-combination x (e.g., browser = “Safari”, gender = “male”, age = 32, city = “NYC”, ISP = “Verizon”, dayOfWeek = “Sunday”), it looks up the historical CTR a(x) (if any) for x, and outputs a(x) as its prediction.

And we saw that this algorithm would be highly impractical and it would not generalize well to new ad impressions.

Yes, it would not generalize well for two reasons: (a) the specific feature-value-combination x may not have been seen before, and (b) even if x occurred in the training data, it may have occurred too infrequently to reliably estimate CTR for cases resembling x.

So how would we design an algorithm that generalizes well?

Let’s do a thought experiment: what do you do when you generalize in daily situations?

I start from specific facts or observations and formulate a general concept.

How, exactly, do you do that?

I abstract common aspects of the specific observations, so that I can make a broader, more universal statement. I build a mental model that’s consistent with my observations.

In terms of detail, would you say that your mental model is relatively simple or complex?

Simple, definitely. I abstract away the non-essential details.

Great, you’ve just identified some of the characteristics of a good generalization: a relatively simple, abstract, less detailed model that is consistent with (or fits, or explains) the observations, and is more broadly applicable beyond the cases you have seen.

That’s how humans think, but how does this help us design a computer algorithm that generalizes well?

This helps us in a couple of ways. Firstly, it gives us a framework for designing an ML algorithm. Just like humans build mental models based on their observations, an ML algorithm should ingest training data and output a model that fits, or explains the training data well.

What does a “model” mean specifically?

A model is the mathematical analog to the human idea of a “concept” or “mental model”; it’s the mathematical formalization of what we have been informally calling “rules” until now. A model is essentially a function that takes as input the characteristics (i.e. features) of a case (i.e. example), and outputs a classification (e.g., “cat” or “not cat”) or score (e.g. click-probability).

Wow, so a model is essentially a program, and you’re saying that an ML program is producing another program?

Sure, that’s a good way to look at it, if you think of a function as a program that takes an input and produces an output.

We seem to be back in voodoo-land: a program that ingests training data, and spits out a program…

Well the ML algorithm does not magically output code for a model; instead the ML algorithm designer usually restricts the models to a certain class of models M that have a common structure or form, and the models in the class only differ by certain parameters p. Think of the class of models M as a generic template, and a specific model from the class is selected by specifying the values of the parameters p. Once restricted to a certain class of models, the ML algorithm only needs to find the values of the parameters p such that the specific model with these parameter-values fits the training data well.

For example suppose we’re trying to predict the sale price y of a home in a certain locality based only on the square footage x. If we believe home prices are roughly proportional to their square footage, a simple class of models for this might be the class of linear models, i.e., the home price y is modeled as a*x. The reason these models are called linear is that, for a fixed value of a, a plot of x versus y would be a straight line.

Notice that a*x is a generic template for a linear model, and a value for a needs to be specified to get a specific linear model, which would then be used to predict the sale price for a given square-footage x. The ML algorithm would be trained on examples of pairs (x,y) of square-footage and home-price values, and it needs to find the value of the parameter a such that the model a*x “best fits” the training data. I hope this does not sound so mysterious any more?
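As a small worked example (the numbers below are invented), the best-fitting a for the model class a*x can be computed in closed form: minimizing the squared error over the (x, y) pairs gives a = Σ(x·y) / Σ(x²). A sketch in Python:

```python
# Least-squares fit of the single-parameter linear model y ≈ a * x.
# Training pairs (square footage in thousands, price in millions) are invented.
pairs = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

# Closed-form solution: a = sum(x*y) / sum(x*x) minimizes the squared error.
a = sum(x * y for x, y in pairs) / sum(x * x for x, y in pairs)

def predict_price(x):
    return a * x

print(round(a, 3))                    # the fitted parameter
print(round(predict_price(2.5), 3))   # prediction for a new home
```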

It certainly helps! Instead of magically spitting out code for some model, an ML algorithm is actually “merely” finding parameters p for a specific class of models.

Right…

But how do we know a model is “correct” or is the “true” one?

Remember that a model expresses a relationship between features and response (e.g. classification or score). All non-trivial real-world phenomena are governed by some “true” underlying model (think of this as the signal, or the essence, of the relationship between the variables involved) combined with random noise. The noise could result from errors in measurement or inherent random variation in the relationship. However an ML algorithm does not need to discover this “true” underlying model in order to be able to generalize well; it suffices to find a model that fits the training data adequately. As the statistician George Box famously said,

All models are wrong, but some are useful,

meaning that in any non-trivial domain, every model is at best an approximation when examined closely enough, but some models can be more useful than others.

What kinds of models would be more useful?

In the ML context, a useful model is one that generalizes well beyond the training data. This brings us to the second way in which thinking about how humans generalize helps, which is that it suggests a guiding principle in designing a true learning algorithm that has good generalization abilities. This principle is called Occam’s Razor:

Find the simplest model that explains the training data well. Such a model would be more likely to generalize well beyond the training examples.

And what is a “simple” model?

In a given context or application-domain, a model is more complex if its structure or form is more complex and/or it has more parameters (the two are often related). For example, when you’re predicting CTR and have thousands of features you could be looking at, but you’re able to explain the training data “adequately” using just 3 of them, the Occam’s Razor principle says you should use just those 3 to formulate your CTR-prediction model. Or, when predicting house prices as a function of square-footage, if a linear model fits the training data adequately, you should use this model (rather than a more complex model such as quadratic, etc.) as it would generalize better to examples outside the training set. I’ve used italics when talking about fitting/explaining the training data because that’s something I haven’t explained yet.
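One simple (and simplified) way to apply Occam’s Razor in practice is to compare models of increasing complexity on data they were not trained on, and keep the simplest one whose held-out error is adequate. The sketch below uses invented synthetic data and polynomial fits purely for illustration:

```python
# Compare polynomial models of increasing degree on held-out data and keep
# the simplest one that fits adequately. The data are synthetic: a linear
# signal plus random noise.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=40)
y = 2.0 * x + rng.normal(scale=0.5, size=40)   # true signal is linear

x_train, y_train = x[:30], y[:30]
x_valid, y_valid = x[30:], y[30:]

for degree in (1, 3, 5):
    coeffs = np.polyfit(x_train, y_train, degree)
    valid_error = np.mean((np.polyval(coeffs, x_valid) - y_valid) ** 2)
    print(degree, round(float(valid_error), 4))

# The higher-degree models fit the training points more closely, but their
# held-out error is usually no better (and often worse) than the simple
# degree-1 model: Occam's Razor at work.
```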

I see, so the constant algorithm I designed to predict CTR produces an extremely simple model that does not even fit the training data well, so there’s no hope it would generalize beyond the training data-set?

Exactly, and on the other hand the memorization algorithm produces an extremely complex model: it records the historical CTR for each feature-value combination in the training data. When tested on any example (i.e. feature-value combination) that already occurred in the training data, this model would produce exactly the historical CTR observed for that example. In this sense this complex model fits the training data perfectly. When an overly complex model fits training data exceptionally well, we say the model is overfitting the training data. Such models would not generalize well beyond the training data-set.

Interesting, so we want to find a relatively simple model that fits the training data “adequately”, but we don’t want to overdo it, i.e. we shouldn’t overly complicate our model in order to improve the fit to the training data. Is it obvious why complex models in general wouldn’t generalize well?

It’s actually not obvious, but here’s an intuitive explanation. Remember we said that any real-world phenomenon is governed by some true model that represents the signal or essence of the phenomenon, combined with random noise. A complex model-class has so many degrees of freedom (either in its structure or number of parameters) that when a best-fitting model is found from this class, it overfits the training data: it fits the signal as well as the random noise. The random noise gets baked into the model and this makes the model perform poorly when applied to cases not in the training set.

Can you give me an example where a simpler model generalizes better than a complex one?

Let’s look at the problem of learning a model to predict home sale prices in a certain locality based on the square-footage. You’re given a training data-set of pairs (x,y) where x is the square-footage in thousands, and y is the sale price in millions of dollars:

(2, 4.2), (3, 5.8), (4, 8.2), (5, 9.8), (6, 12.2), (15, 29.8),

and your task is to find a model that “best” fits the data (you decide how to interpret “best”). Any ideas?

I notice from the training data that when the square-footage x is an even number, the home price is (2*x + 0.2), and when it’s an odd number the home price is (2*x - 0.2).

Great, your model fits the training data perfectly, but in order to do so you’ve made your model complex because you have two different expressions based on whether the square footage is even or odd. Can you think of a simpler model?

I see what you mean — I notice that the home prices (in millions of dollars) in the training data-set are fairly close to twice the number of thousands of square feet, so my simpler model would be 2*x.

This would be a simpler model, even though it does not exactly fit the training data: the home prices are not exactly 2*x but close to it. We can think of the home price as being 2*x plus some random noise (which could be positive or negative). It just so happens that in the specific training examples you’ve seen, the noise part in the even-square-footage examples is 0.2, and -0.2 for the others. However, it could well be that for a new example with x=7 (an odd number) the price is exactly 2*x or slightly above 2*x, but if we were to use your previous complex model, we would be predicting 2*x - 0.2, which would not be a good prediction.

The point is that in an effort to fit the training data very well (in your case perfectly), you made your model overly complex and latched on to a spurious pattern that is unlikely to hold beyond the training data-set. By definition the noise part is random and unpredictable, so any attempt to model it would only introduce prediction errors. Instead, the simpler model captures only the “signal” and is oblivious to the random noise.
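Here is that example written out as a short sketch (the new home at x = 7 and its “true” price are hypothetical, as in the dialog): the complex even/odd model reproduces the training prices exactly, while the simple 2*x model misses each training point by 0.2 yet is the safer bet on unseen examples.

```python
# The two home-price models from the dialog: x is square footage in
# thousands, prices are in millions of dollars.
training = [(2, 4.2), (3, 5.8), (4, 8.2), (5, 9.8), (6, 12.2), (15, 29.8)]

def complex_model(x):
    # Memorizes the noise: a different formula for even and odd footage.
    return 2 * x + 0.2 if x % 2 == 0 else 2 * x - 0.2

def simple_model(x):
    # Captures only the signal: price is roughly twice the footage.
    return 2 * x

for x, y in training:
    print(x, y, complex_model(x), simple_model(x))   # complex matches y exactly

# A hypothetical new home with x = 7 whose true price turns out to be 14.1:
# the simple model predicts 14.0, the complex model 13.8, so the "perfect"
# fit to the training data has bought us a worse prediction.
print(complex_model(7), simple_model(7))
```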

Here, then, is the recipe for designing a true ML algorithm that generalizes well:

  • Pick a relatively simple class of models M, parametrized by a set of parameters p
  • Find values of the parameters p such that the model from M with these parameters fits the training data optimally. Denote this model as M[p]. M[p] is a function that maps a feature-value combination x to a response (e.g. classification or score).
  • Output M[p] as the predictive model for the task, i.e. for an example x, respond with M[p](x) as the predicted classification or score.

What does it mean for a model to fit or explain the training data, and how do I find the parameters that “optimally” fit the data?

Informally, a model fits the training data well when it is consistent with the training data. This is analogous to how you form a mental model based on your observations: your mental model is consistent with your observations (at least if you’re observant enough and think rationally). Of course to implement a computer algorithm we need to make precise the notion of an “optimal fit”, and specify how the computer should go about finding model parameters that “best” fit the training data. These are great topics for a future post!

Analytics, Data, Digital Marketing, Intelligence, Media, Technology, Uncategorized

Machine Learning Without Tears, Part Two: Generalization

August 22, 2016 — by MediaMath


In the first post of our non-technical ML intro series we discussed some general characteristics of ML tasks. In this post we take a first baby step towards understanding how learning algorithms work. We’ll continue the dialog between an ML expert and an ML-curious person.

Ok I see that an ML program can improve its performance at some task after being trained on a sufficiently large amount of data, without explicit instructions given by a human. This sounds like magic! How does it work?

Let’s start with an extremely simple example. Say you’re running an ad campaign for a certain type of running shoe on the NYTimes web-site. Every time a user visits the web-site, an ad-serving opportunity arises, and given the features of the ad-opportunity (such as time, user-demographics, location, browser-type, etc) you want to be able to predict the chance of the user clicking on the ad.  You have access to training examples: the last 3 weeks of historical logs of features of ads served, and whether or not there was a click. Can you think of a way to write code to predict the click-rate using this training data?

Let me see, I would write a program that looks at the trailing 3 weeks of historical logs, and if N is the total of ad exposures, and k is the number of those that resulted in clicks, then for any ad-opportunity it would predict a click probability of k/N.

Great, and this would be an ML program! The program ingests historical data, and given any ad-serving opportunity, it outputs a click probability. If the historical (over the trailing 3-weeks) fraction of clicked ads changes over time, your program would change its prediction as well, so it’s adapting to changes in the data.
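A minimal sketch of that first program, with an invented log format (a list of (features, clicked) pairs), might look like this:

```python
# Rudimentary "learning": predict the aggregate click-rate from the trailing
# three weeks of logs. Each log row is (features_dict, clicked_flag); the
# format is invented for illustration.
def predict_click_probability(history):
    total_exposures = len(history)          # N
    clicks = sum(clicked for _, clicked in history)   # k
    return clicks / total_exposures         # k / N

history = [({"city": "SF"}, 1), ({"city": "SF"}, 0),
           ({"city": "MSP"}, 0), ({"city": "MSP"}, 0)]
print(predict_click_probability(history))   # 0.25, regardless of the features
```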

Wow, that’s it? What’s all the fuss about Machine Learning then?

Well this would be a very rudimentary learning algorithm at best: it would be accurate in aggregate over the whole population of ad-exposures. What if you want to improve the accuracy of your predictions for individual ad-opportunities?

Why would I want to do that?

Well if your goal is to show ads that are likely to elicit clicks, and you want to figure out how much you want to pay for showing an ad, the most important thing to predict is the click probability (or CTR, the click-through-rate) for each specific ad opportunity: you’ll want to pay more for higher CTR opportunities, and less for lower CTR opps.

Say you’re running your ad campaign in two cities: San Francisco and Minneapolis, with an equal number of exposures in each city. Suppose you found that overall, 3% of your ads result in clicks, and this is what you predict as the click-probability for  any ad opportunity. However when you look more closely at the historical data, you realize that all ad-opportunities are not the same: You notice an interesting pattern, i.e. 5% of the ads shown to users in San Francisco are clicked, compared to only 1% of ads shown to users logging in from Minneapolis. Since there are an equal number of ads shown in the two cities, you’re observing an average click-rate of 3% overall, and …

Oh ok, I know how to fix my program! I will put in a simple rule: if the ad opportunity is from San Francisco, predict 5%, and if it’s from Minneapolis, predict 1%. Sorry to interrupt you, I got excited…

That’s ok… in fact you walked right into a trap I set up for you: you gave a perfect example of an ad-hoc static rule: You’re hard-coding an instruction in your program that leverages a specific pattern you found by manually slicing your data, so this would not be an ML program at all!

So… what’s so bad about such a program?

Several things: (a) this is just one pattern among many possible patterns that could exist in the data, and you just happened to find this one; (b) you discovered this pattern by manually slicing the data, which requires a lot of time, effort and cost; (c) the patterns can change over time, so a hard-coded rule may cease to be accurate at some point. On the other hand, a learning algorithm can find many relevant patterns, automatically, and can adapt over time.

I thought I understood how a learning algorithm works, now I’m back to square one!

You’re pretty close though. Instead of hard-coding a rule based on a specific pattern that you find manually, you write code to slice historical data by all features. Suppose there were just 2 features: city (the name of the city) and IsWeekend (1 if the opportunity is on a weekend, 0 otherwise). Do you see a way to improve your program so that it’s more general and avoids hard-coding a specific rule?

Yes! I can write code to go through all combinations of values of these features in the historical data, and build a lookup table showing for each (city, IsWeekend) pair, what the historical click-through-rate was. Then when the program encounters a new ad-opportunity, it will know which city it’s from, and whether or not it’s a weekend, and so it can lookup the corresponding historical rate in the table, and output that as its prediction.
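A sketch of that lookup-table approach, again with an invented log format:

```python
# Build a lookup table of historical click-rates keyed by (city, is_weekend),
# then predict by looking up the matching key.
from collections import defaultdict

def build_table(history):
    counts = defaultdict(lambda: [0, 0])        # key -> [exposures, clicks]
    for features, clicked in history:
        key = (features["city"], features["is_weekend"])
        counts[key][0] += 1
        counts[key][1] += clicked
    return {key: clicks / shown for key, (shown, clicks) in counts.items()}

def predict(table, features):
    return table.get((features["city"], features["is_weekend"]))  # None if unseen

history = [({"city": "SF", "is_weekend": 1}, 1),
           ({"city": "SF", "is_weekend": 1}, 0),
           ({"city": "MSP", "is_weekend": 0}, 0)]
table = build_table(history)
print(predict(table, {"city": "SF", "is_weekend": 1}))    # 0.5
print(predict(table, {"city": "NYC", "is_weekend": 0}))   # None: never seen
```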

Great, yes you could do that, but there are a few problems with this solution. What if there were 30 different features? Even if each feature has only 2 possible values, that is already 2^30 possible combinations of values, or more than a billion (and of course, the number of possible values of many of the features, such as cities, web-sites, etc. could be a lot more than just two). It would be very time-consuming to group the historical data by these billions of combinations, our look-up table would be huge, and it would be very slow even to make a prediction. The other problem is this: what happens when an ad opportunity arises from a new city that the campaign had no prior data for? Even if we set aside these two issues, your algorithm’s click-rate predictions would in fact most likely not be very accurate at all.

Why would it not work well?

Your algorithm has essentially memorized the click-rates for all possible feature-combinations in the training data, so it would perform excellently if its performance is evaluated on the training data: the predicted click-rates would exactly match the historical rates. But predicting on new ad opportunities is a different matter; since there are 30 features, each with a multitude of possible values, it is highly likely that these new opportunities will have feature-combinations that were never seen before.

A more subtle point is that even if a feature-combination has occurred before, simply predicting the historical click-rate for that combination might be completely wrong: for example suppose there were just 3 ad-opportunities in the training data which had this feature-combination: (Browser = “safari”, IsWeekend = 1, Gender = “Male”, Age = 32, City = “San Francisco”, ISP = “Verizon”), and the ad was not clicked in all 3 cases. Now if your algorithm encounters a new opportunity with this exact feature-combination, it would predict a 0% click-rate. This would be accurate with respect to the historical data your algorithm was trained on, but if we were to test it on a realistic distribution of ad opportunities, the prediction would almost certainly not be accurate.

What went wrong here? Suppose the true click-rate for ads with the above feature-combination is 1%. Then in a historical sample where just 3 such ad-opportunities are seen, it’s statistically very likely that we would see no clicks (with a 1% click-rate, the chance of zero clicks in 3 impressions is 0.99^3, or roughly 97%).

But what could the learning algorithm do to avoid this problem? Surely it cannot do any better given the data it has seen?

Actually it can. By examining the training data, it should be able to realize, for example, that the ISP and Browser features are not relevant to predicting clicks (for this specific campaign), and perhaps it finds that there are 1,000 training examples (i.e. ad-opportunity feature-combinations) that match the above example when ISP and Browser are ignored, and 12 of them had clicks, so it would predict a 1.2% click-rate.
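Sketched in code (with invented data and the simplifying assumption that we already know which features are relevant), the idea is to aggregate over only those features, so a rare exact combination can borrow strength from the many examples that agree with it on the features that matter:

```python
# Aggregate click-rates over only the features deemed relevant, ignoring the
# rest (here, ISP and Browser). Data and feature names are invented.
RELEVANT = ("IsWeekend", "Gender", "City")

def click_rate(history, example):
    key = tuple(example[f] for f in RELEVANT)
    shown = clicks = 0
    for features, clicked in history:
        if tuple(features[f] for f in RELEVANT) == key:
            shown += 1
            clicks += clicked
    return clicks / shown if shown else None

example = {"Browser": "safari", "IsWeekend": 1, "Gender": "Male",
           "City": "San Francisco", "ISP": "Verizon"}

# 3 exact matches with no clicks, plus 997 other impressions that match on
# the relevant features (12 of which were clicked).
history = ([(dict(example), 0)] * 3 +
           [({"Browser": "chrome", "IsWeekend": 1, "Gender": "Male",
              "City": "San Francisco", "ISP": "Comcast"}, 1)] * 12 +
           [({"Browser": "chrome", "IsWeekend": 1, "Gender": "Male",
              "City": "San Francisco", "ISP": "Comcast"}, 0)] * 985)

print(click_rate(history, example))   # 0.012 instead of 0.0 from exact matching
```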

So your algorithm, by memorizing the click-rates from the training data at a very fine level of granularity, was “getting lost in the weeds” and was failing to generalize to new data. The ability to generalize is crucial to any useful ML algorithm, and indeed is a hallmark of intelligence, human or otherwise. For example, think about how you learn to recognize cats: you don’t memorize how each cat looks and try to determine whether a new animal you encounter is a cat or not by matching it with your memory of a previously-seen cat. Instead, you learn the concept of a “cat”, and are able to generalize your ability to recognize cats beyond those that exactly match the ones you’ve seen.

In the next post we will delve into some ways to design true learning algorithms that generalize well.

Ok, looking forward to that. Today I learned that generalization is fundamental to machine-learning. And I will memorize that!

Analytics, Data, Intelligence, Technology, Uncategorized

Machine Learning: A Guide for the Perplexed, Part One

July 21, 2016 — by MediaMath


With the increasingly vast volumes of data generated by enterprises, relying on static rule-based decision systems is no longer competitive; instead, there is an unprecedented opportunity to optimize decisions, and adapt to changing conditions, by leveraging patterns in real-time and historical data.

The very size of the data, however, makes it impossible for humans to find these patterns, and this has led to an explosion of industry interest in the field of Machine Learning, which is the science and practice of designing computer algorithms that, broadly speaking, find patterns in large volumes of data. ML is particularly important in digital marketing: understanding how to leverage vast amounts of data about digital audiences and the media they consume can be the difference between success and failure for the world’s largest brands. MediaMath’s vision is for every addressable interaction between a marketer and a consumer to be driven by ML optimization against all available, relevant data at that moment, to maximize long-term marketer business outcomes.

In this series of blog posts we will present a very basic, non-technical introduction to Machine Learning.  In today’s post we start with a  definition of ML in the form of a dialog between you and an ML expert. When we say “you”, we have in mind someone who is not an ML expert or practitioner, but someone who has heard about Machine Learning and is curious to know more.

Can we start at the beginning? What is Machine Learning?

Machine learning is the process by which a computer program improves its performance at a certain task with experience, without being given explicit instructions or rules on what to do.

I see, so you’re saying the program is “learning” to improve its performance.

Yes, and this is why ML is a branch of Artificial Intelligence, since learning is one of the fundamental aspects of intelligence.

When you say “with experience,” what do you mean?

As the program gains “practice” with the task, it gets better over time, much like how we humans learn to get better at tasks with experience. For example an ML program can learn to recognize pictures of cats when shown a sufficiently large number of examples of pictures of “cat” and “not cat”.  Or an autonomous driving system learns to navigate roads after being trained by a human on a variety of types of roads. Or a Real-Time-Bidding system can learn to predict users’ propensity to convert (i.e. make a purchase) when exposed to an ad, after observing a large number of historical examples of situations (i.e. combinations of user, contextual, geo, time, site attributes) where users converted or not.

You said  “without being given explicit instructions.” Can you expand on that a bit?

Yes, that is a very important distinction between an ML program and a program with human-coded rules. As you can see from the above examples, an ML system in general needs to respond to a huge variety of possible situations: e.g., respond “cat” when shown a picture of a cat, or turn the steering wheel in the right direction in response to the visual input of the road, or compute a probability of conversion when given a combination of features of an ad impression. The sheer number of possible input pictures, road conditions, or impression-features is enormous. If we did not have an ML algorithm for these tasks we would need to anticipate all possible inputs and program explicit rules that we hope will be appropriate responses to those inputs.

I still don’t understand why it’s hard to write explicit rules for these tasks. Humans are very good at recognizing cats, so why can’t humans write the rules to recognize a cat?

That’s a great question. It’s true that humans excel at learning certain tasks, for example recognizing cats, or recognizing handwriting, or driving a car. But here’s the paradoxical thing — while we are great at these tasks, the process by which we accomplish these tasks cannot be boiled down to a set of rules, even if we’re allowed to write a huge number of rules. So these are examples of tasks where explicit rules are impossible to write.

On the other hand, there are tasks that humans are not even good at: for example, trying to predict which types of users in what contexts will convert when exposed to ads. Marketing folks might have intuition about what conditions lead to more conversions, such as “users visiting my site on Sundays when it’s raining are 10% likely to buy my product”. The problem, though, is that these intuition-guided rules can be wrong, and incomplete (i.e. they do not cover all possible scenarios). The only way to come up with the right rules is to pore through millions of examples of users converting or not, and extract patterns from these, which is precisely what an ML system can do. Such pattern extraction is beyond the capabilities of humans, even though they are great at certain other types of pattern extraction (such as visual or auditory).

I see, so ML is useful in tasks where (a) a response is needed on a huge number of possible inputs, and (b) it’s impossible or impractical to hard-code rules that would perform reasonably well on most inputs. Are there examples where the number of possible inputs is huge, but it’s easy to write hard-coded rules?

Sure: I’ll give you a number; can you tell if it’s even or odd? Now you wouldn’t need an ML program for that!
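Right, for a task like that the complete rule fits in one line of code, no training data required:

```python
def is_even(n: int) -> bool:
    return n % 2 == 0   # an explicit rule that covers every possible input

print(is_even(42), is_even(7))   # True False
```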


In a future post we will discuss at a conceptual level how ML algorithms actually work.

Analytics, Events, Intelligence, Trends, Uncategorized

Make This Your Best Back-to-School Season

July 20, 2016 — by MediaMath


The back-to-school season is one of the biggest retail events in the US—in fact, the $68 billion industry comes second only to the winter holidays in terms of spend. Back-to-school shoppers in 2015 planned to spend an average of $630, with most of the spend going toward apparel and electronics, according to data from the National Retail Federation’s annual back-to-school survey. To help marketers capitalize on this popular shopping period, MediaMath analyzed 110 previous back-to-school campaigns to see what trends and performance results stand out. Some highlights of our short guide include:

  • 30% of conversions happen in two weeks in August
  • Consumer goods and clothing and accessories make up 65% of all campaigns
  • Back-to-school and mom segments vastly outperform college segments

To download the full ebook, click here.

Analytics, Intelligence, Technology, Uncategorized

The Other Half of the Battle Against Fraud

June 11, 2015 — by Ari Buchalter


It’s been well-established that fraud, and in particular non-human traffic, is a problem in the digital advertising industry, but I’d like to spend a few moments exploring why it is such a problem. No, I’m not asking why there are unscrupulous people out there looking to hack the system to make a dishonest buck (that part I recognize from every other commercial endeavor ever undertaken). And no, I’m not asking about the industry norms and perverse incentives that can motivate publishers, intermediaries, and yes, even agencies and advertisers, to turn a blind eye to the problem. I’m asking why our marketing programs are so easily fooled by bots in the first place.

There’s no doubt the fraudsters are getting more sophisticated. While long-standing tactics like click fraud are still sadly alive and thriving, they have been joined by numerous other insidious new breeds of fraud. From visiting advertiser sites to attract retargeting dollars, to intentionally adhering to MRC-defined viewability criteria, the bots are getting better at blending in and looking like everyone else. No channel is unaffected and no publisher, no matter how premium or niche, is immune.

This state of affairs has led to a reactive mentality in our industry where the goal is to “avoid fraud,” which is completely rational and understandable. When you are under attack, you defend yourself. It’s why many major publishers, ad exchanges, and SSPs have implemented rigorous quality measures to filter fraud and other forms of undesirable traffic at the source. It’s why the leading DSPs have developed sophisticated algorithms to identify anomalous patterns at the user, site, IP address, and other levels, and quarantine fraud away from live buying environments before marketing budgets are exposed to it. And it’s why a wave of old and new verification & measurement vendors are offering an array of new fraud-related products. All with the determination to stay a step ahead of the increasing scale and growing variety of digital advertising fraud.

But is that it? As an industry, are we just to spiral forward in a never-ending arms race, trying to build new techniques to keep up with the ever-evolving new strains of fraud, playing a high-stakes game of whack-a-mole with hundreds of billions of dollars on the line?

Thankfully, the story doesn’t have to end there.

The reason it’s so easy for bots to mimic people is that the marketing definition of people is often so simplistic. Despite all the amazing advances in ad tech over the past decade, many digital campaigns are still just going after weakly-defined audiences characterized by generic demographic and/or broad-based behavioral targeting, overlaid with easily-mimicked behaviors like views and clicks. These approaches are, in effect, propagating the broadcast mentality of the old offline world, where targeting “18-49 year-old males who make over $100K/yr and are interested in electronics” might have been considered pretty decent (and pretty hard to fake, from an offline standpoint). But in the online world that’s about as easy to fake as the age on your dating profile.

Starting with a generic picture of a very broad audience is what I refer to as “guess-based marketing.” Those audiences, whether defined by characteristics like age, gender, and income (the estimation of which is often of dubious quality to begin with) or by simple behaviors like visiting sites or clicking on ads, are really just proxies to the advertiser’s desired business outcomes. The problem is those characteristics are easy for bots to fake and those behaviors are easy for bots to demonstrate, so a guess-based approach is playing right into the fraudsters’ sweet spot. If you’re doing that, you are broadcasting a signal that bots are tuned into, and you should work with a buy-side partner well-equipped to fight the fraud arms race with you, who combines proven proprietary pre-bid fraud detection and filtration with best-in-class third-party technologies.

But simply playing defense is not truly taking advantage of what programmatic is all about. The real power of programmatic is that it enables what I call “goal-based marketing.” Goal-based marketing is about applying the principles of marketing science across the entire funnel, with the realization that all marketers are performance marketers. What I mean by that is no matter whether you are a brand marketer, a direct-response marketer, a loyalty marketer, etc. there is some quantifiable business goal you are looking to drive (whether brand awareness, purchase intent, social engagement, customer loyalty, lifetime value, you name it), and against which you are judging success. And therein is the key to goal-based marketing: if it can be measured, it can be made better by math. Made better by exposing all available data – about audiences, about media, about creatives – to a smart system that can determine the optimal combination of those elements to drive your business goals at scale, automating the right decision at every consumer touchpoint, in real time.

If you are using programmatic technology to drive goal-based marketing, the fraud picture becomes very different. It shifts from a purely defensive and reactive mentality of “avoid fraud” to a proactive posture of “generate business outcomes.” The fact that bots are getting better at blending in and looking like everyone else is suddenly not their strength but rather their weakness, because your customers are not just like everyone else’s and the goals you are trying to drive are not the same as everyone else’s. Browsing and clicking are easy markers to fake, but the combined online and offline data you use to define truly actionable audiences, the category-, brand- and product-specific behaviors that become the triggers for your marketing actions, and the specific and measurable outcomes that matter to you as a marketer – these are things not known to the fraudsters and therefore much harder for bots to fake (not to mention economically infeasible, in the case of actual purchases). Moreover (and somewhat ironically), those true business outcomes are often more accurately and reliably measurable than the easily-spoofed guess-based audiences that were supposed to be the proxies for those outcomes in the first place.

A guess-based approach might simply be looking to buy those 18-49 year-old males who make over $100K/yr and are interested in electronics – an easy target for bots to fake (it’s also worth noting that even in a bot-free world, the accuracy of that kind of data is often extremely poor, based on coarse extrapolations from very limited data). By contrast, a goal-based approach might look to raise awareness for a particular brand by X%. Or do so specifically among consumers who have actually purchased a competitive brand, online or offline, in the past year. Or increase purchase intent among lapsed customers by X%. Or drive conversion of consumers who have expressed interest in a particular category or product, at an average $X cost per conversion. Or drive an overall return on ad spend of greater than X:1 from combined online & offline sales. Or convert X% of current customers to a loyalty program every month. And so on. Bots won’t easily show up in those audience definitions and won’t easily contribute to those outcomes. The avoidance of fraud simply becomes a natural consequence of goal-based marketing. And achieving your business goals at scale is what programmatic is all about.

Moreover, non-human traffic isn’t the only kind of fraud addressed by the use of goal-based marketing. Many of the various types of ad laundering and publisher misrepresentation tactics that can be perpetrated by malware or other forms of browser manipulation, even on the browsers of actual people, are also minimized. Common examples include “invisible ads” (either stacked atop each other or rendered as an invisible 1×1 pixel), or the impersonation of legitimate publishers via “URL masking”. But since ads never actually rendered to a user don’t drive true business outcomes, and impostor sites don’t actually drive business outcomes like the legitimate publishers they are spoofing, goal-based techniques naturally optimize away from such traffic and towards the quality environments that do generate those outcomes.

The evidence is in the data. Goal-based campaigns see no inhumanly high click-through rates, no droves of site visitation with little to no engagement, no lack of bona fide purchase events – all things commonly associated with fraudulent activity. Moreover, when fraudulent publishers are outed in the press, these campaigns see little to no delivery against such publishers. When conversion events have some post-conversion measure of quality, these campaigns strongly outperform their guess-based counterparts.

That’s not to say it’s an either/or proposition. The best results, by far, come when you combine goal-based marketing with powerful pre-bid anti-fraud technology. A guess-based approach invites an onslaught of fraud to begin with, relying solely on anti-fraud measures to take things from bad to good. By contrast, a goal-based approach aligned with your true business objectives intrinsically blunts the onslaught of fraud so you’re starting at good, targeting audiences bots can’t easily resemble and outcomes bots can’t easily reproduce. The overlay of industry-leading anti-fraud technology atop a goal-based approach then imposes additional filters to take good to great. We who build anti-fraud solutions believe the good guys will win the arms race – through technology, through the definition of standards & policies, through education, and through industry-wide data-sharing, transparency, and collaboration. In the meantime, fraud will continue to be a problem in the digital advertising industry, just a much smaller problem for those using programmatic technology built to drive goal-based marketing.

Analytics, Intelligence, Uncategorized

Move to the Head of the Class with OPEN Certification

January 23, 2014 — by MediaMath


By definition, the word confusion means “disorder, jumble, bewilderment, perplexity, lack of clarity, indistinctness, abashment”–a state all too common in the ad tech industry. There is an abundance of information, systems, platforms, and technologies that are bandied about as buzzwords, tech talk, and sales-speak. The result is misinformation, posturing, and fear among digital marketers.

There is no question in any digital executive’s mind that knowledge gaps and deficiencies result in setbacks and redundancy rather than best-of-breed thinking and the boundless innovation the world expects from our industry. The prospect of creating industry experts with a knack for innovation is not a simple proposition, but MediaMath’s new OPEN Buyer Certification program is designed for any grade of buyer and organization that aspires to a higher level of technological competency and wants to promote that competency to grow their business.

MediaMath has been a longstanding advocate of education within the market. In 2012, to remedy some of the head pounding and teeth grinding the average digital marketer experiences on any given day, MediaMath launched the New Marketing Institute (NMI) with the intention of carrying over its core business tenets: driving marketer ROI, ensuring transparency, and creating automation across the board. The result was an educational certification program that cuts through the clutter and creates a new class of smarter, savvier marketers.

NMI began with a grassroots concept: educating entry-level buyers about technology and marketing decisions and how those collide with ROI. The next evolution was to further educate the industry with the OPEN program and portal, which educates and connects MediaMath’s partners and buyers to breed innovation.

OPEN Buyer Certification, a three-tiered program, was built to acknowledge the experts among us and give them the opportunity to showcase their capabilities and take advantage of lead referral opportunities from MediaMath. We’re creating a community of expert users of digital marketing software who can go out and represent their brands and be role models in the space.

There is no better way than to just jump in and start doing it.

OPEN aligns MediaMath with collaborative partnerships that strengthen its platform view of marketing initiatives and creates an entire channel of resellers for its technology. By disrupting the underlying framework of how marketing is thought about and how agencies function, we are changing and informing the ecosystem.

The Buyer Certification program is stacked in three levels: Silver, Gold, and Platinum, with each level evaluated on three distinct categories with qualifying criteria. The program levels focus on different marketplace components as expertise and skill sets progress. The first category focuses on establishing a marketplace presence: provisioning to lead the market, publishing case studies, keynote speaking, and cultivating thought leadership. The second category focuses on platform usage: are buyers and organizations doing everything they are talking about in the marketplace? Are they embracing programmatic, data integration, and other key industry initiatives? Finally, the third category is based on strategic initiatives: how buyers are innovating and bringing those strategies to market, and how they’re advancing the industry toward a programmatic and platform viewpoint.

Highly motivated and accomplished buyers and their organizations can power through the certification’s curriculum in a short period of time, or it can be parsed out over several weeks or months to accommodate different schedules.

To date, six companies have participated in the certification program, including the first platinum-certified partner, Epsilon, as well as Adroit Digital, Mediasmith, The Big Lens, 3Q Digital, and Huddled Masses. Beyond those, more than 35 are at various stages of the process.

Early adopters are already incorporating certification benchmarks into their 2014 strategic plans, powerful assurance that, a few years out, program participants will move the needle within the industry and rise above the competition.