How Programmatic Will Change the Brand-Agency Relationship

May 15, 2017 — by Parker Noren


This byline by Parker Noren, Senior Director, Programmatic Strategy & Optimization, originally appeared in CommPRO.Biz.

Ten years after its birth, programmatic advertising is in its awkward teenage years. The existing technology has allowed those who have fully embraced its capabilities to build relationships with their consumers in ways previously inaccessible. But, we need to look forward to see when programmatic will fully deliver its value for all brands. Programmatic allows us to achieve greater brand growth as the result of algorithm-driven optimization and a better understanding of the consumer—all accomplished with less manpower due to automation.

Technology improvements will play a critical role in making this value more accessible. Algorithms will get smarter and more independent from human guidance. More touchpoints will become addressable as we’ve seen with channels like out-of-home. And, we will get to full integration of what today are considered separate but connected silos (e.g. DMP and DSP). Yet, what’s potentially more interesting is how the improvements in technology will influence the roles of brand and agency in advertising execution. Brand marketers will take direct control of the execution of their advertising dollars, while media agencies will need to find new avenues to prove value.

Some brands are already there, especially those with a heavy e-commerce presence where the benefits of direct control were most obvious from the start. But, for many brands, the marketer is still far removed from actual execution. The disconnect was necessary (and still is in linear channels) because of the heavy lifting and knowledge required to execute an ad buy. As we look forward, tactical skills specialization will no longer be needed to execute media to the degree necessary in the past or even today. Scalable technology will fulfill this role, and the job of a campaign planner/manager will shift towards campaign definition and strategy.

The move in this direction—enabled through technology—will reshape roles, activation strategy and success criteria:

  • Media agency as a strategy agency.

Agencies will largely pivot from value provided in execution manpower and knowledge to value as a strategic adviser. They’ll consult their clients on how to form the proper tech stack for their business needs and push them to update their approach to campaign definition so those capabilities can better be leveraged. This includes helping the brand reform channel- and product-based budgeting systems to focus purely on the relationship with the consumer.

  • Greater connective tissue between marketing strategy and advertising execution.

With marketers directly involved in execution, brands will gain ownership of a holistic view of their data and begin to develop a feedback loop between marketing strategy and advertising execution. This will produce more purposeful approaches to campaign definition, supply curation and consumer targeting better aligned with the marketer’s ingoing strategic intent.

  • Heightened pressure on marketers to demonstrate real performance.

Those easy, unsophisticated (“spend the budget”) dollars will all move towards accountability in true business outcomes. The move will be spurred by greater ease of accessibility to sophisticated measurement approaches, marketers having direct transparency and control of how budgets are spent, and the overall trend towards marketing as an accountable revenue center.

For more insights on programmatic’s past, present and future, check out our #10Then10Ahead Content Hub!


AdWeek: 3 Ways Programmatic Can Graduate From Adolescence

May 8, 2017 — by Michael Lamb


This article by MediaMath President Mike Lamb originally appeared on AdWeek.

As programmatic advertising hits its 10th anniversary, it represents more than 50 percent of all non-search digital ad spending. What began as a quirky set of technologies and protocols for monetizing long-tail display inventory is poised to become the dominant framework for all of digital advertising.

The initial disruption is behind us, we’ve clearly reached the end of the beginning and the future is full of promise. But it’s equally clear that, in order to deliver on that promise, we need to make some fundamental changes, as the status quo is satisfying neither the marketer nor the consumer.

The rapid adoption of programmatic has demonstrated that marketers aren’t willing to settle for mere automation — and why should they? Consumers have been very clear that they expect control and relevance. I believe that, if accorded this respect, they will respect the business model that supports the internet.

This is the crucial year in our adolescence, the one in which we begin taking responsibility for our actions. In 2017, there are three urgent tasks the ad tech industry must address if we are to reach full maturity without alienating our clients and consumers: three unfulfilled promises that, left ignored, will stunt our continued growth.

Provide true addressability

Marketers were promised full addressability across all channels, and consumers were promised a personalized, customer-centric experience online. Marketers want to communicate with people, not “users,” and “users” want to be treated like people. Strong, reliable identity solutions need to go hand-in-hand with strong privacy, data governance and a consumer “bill of rights.”

Pull the curtain back

Marketers are demanding an experience that is reliably fraud-free and brand safe, with transparent and rational economics. Consumers deserve absolute protection from fraud and malware. Quantification will set us free here. There’s no retreat from granular, buy-side measurement of advertising effectiveness, and publishers deserve an equally rigorous toolset for quantifying the contribution of advertising to consumer experience.

Stop our own infighting

True interoperability and transparency will require a strong partnership between marketers and publishers with regard to business outcomes and consumer experience. Antagonistic, arms-length models won’t get us there, nor will closed, monolithic systems.

We’ve talked this talk for a long time, but if we continue to fail to deliver on this, then we are going to lose the faith of consumers and marketers, and we will have lost the opportunity to shape the internet. Luckily, I already see a burst of activity in meeting all of these challenges. Holding companies are holding Google accountable for lack of control and transparency over where their ads appear. DSPs are beginning to draw the line against fraud by refusing to pass the cost on to marketers.

And on Thursday, several of the largest independent companies that have traditionally competed with one another banded together to launch a “standard identity framework that enables buyers and sellers of programmatic digital advertising to create more relevant campaigns and improve consumer experience.”

It might take us another decade to get to the ultimate vision of digital — one where consumers are opting into an experience that empowers brands and publishers alike. It’s in our power to build the programmatic experience that marketers and consumers deserve, and it’s time to get started.

To read the full article via AdWeek, click here.


MediaMath Weighs in on the Evolution of Machine Learning and Artificial Intelligence

April 27, 2017 — by Lauren Fritsky


Machine learning and artificial intelligence are being used across industries to help mine vast quantities of data. In a recent article, Heather Blank, MediaMath SVP of Audiences, talks about how marketers are utilizing these technologies.

Marketers are also harnessing machine learning to better predict how certain customers react to various marketing efforts and how likely they are to make a purchase, known as a conversion.

This helps brands and agencies run more holistic marketing campaigns, targeting the right audience through the most optimal channels and at the best times, says Heather Blank, senior vice-president at MediaMath.

However, Blank says machine learning is not necessarily always predictive in nature and can be descriptive instead.

“Studying machine learning models can help explain which features or individual characteristics are important in predicting an event – usually a conversion – and which features may be meaningless or even predictive of the event not occurring at all,” she says.

“This can help marketers understand their consumer patterns more clearly, by filtering out the noise. It can also help challenge status quo notions of what is important to the purchase cycle or even what an ideal consumer looks like.”


How is AI Really Changing Advertising? We Asked Two Experts

April 24, 2017 — by Todd Wasserman


AI. It’s one of the hottest topics in advertising these days. It seems like every ad tech firm is promoting their AI acumen.

As with “big data” a year or so ago though, it can be hard to separate the truth from bluster. Some argue that there isn’t enough data to make AI effective for advertising and AI is merely the industry’s latest buzzword.

To get a different perspective on this issue, we talked to two professors who specialize in AI, Richard Baraniuk of Rice University and Tim Kraska of Brown University. We interviewed both separately. Excerpts of those conversations appear below:

MM: What is the difference between deep learning and machine learning?

Richard Baraniuk: Deep learning is just a special case or an example of a certain kind of machine learning. Machine learning is a broad classification of a set of techniques for solving certain kinds of problems that involve using data to construct systems that can aid in solving problems like making decisions. Deep learning is just a particular example of one machine learning technique.

Tim Kraska: Deep learning is a sub-field of machine learning. It’s one very particular type. Machine learning is a bunch of techniques, including regression models. They are different types of algorithms you can use. The most traditional ones would be linear regressions, polynomial regressions. They are ways of fitting models to data to predict future events, while deep learning is a very different type of model building, which is inspired by what the brain does.

MM: How would deep learning come up with different solutions than machine learning?

RB: Let me think of a good analogy…Machine learning is like talking about automobiles and deep learning would be talking about Ford, which is a particular kind of automobile. So a particularly powerful set of algorithms in machine learning are these deep learning methods. It’s basically a rebranding of neural networks or artificial neural networks which folks have been working on for 60 years or more and are very loosely based on how the brain works or might work. The reason they’re called deep learning algorithms is you take some basic machine learning algorithm and you connect it to another one and connect it to another one and so forth so you make this cascade or chain of many algorithms in a row and that creates one of these so-called deep learning approaches. And they are probably the hype-iest of machine learning approaches today.

MM: Facebook’s News Feed is often cited as an example of machine learning. Is that correct?

RB: Exactly. Any mathematical algorithm that’s able to use data and learn from that data in order to do an even better job next time, that’s a machine learning technique.

TK: Machine learning nowadays is in almost every type of product you can think of. If you go to Netflix and get a video recommendation, that’s machine learning. If you go to a store on your phone and it makes a recommendation of what to buy next, that’s machine learning.

MM: Is there a threshold, like X-amount of terabytes in which it becomes more effective?

RB: Absolutely. There are thresholds, but there’s not a general rule of thumb you can apply. It really depends on the kind of data that you have access to and the kind of prediction that you want to make. Whether you’re dealing with text data or image data or medical records data or click data, those all contain very different amounts of information, and it becomes a difference between making some kind of judgment about whether a person has found some resource or ad useful or intriguing versus trying to find some fine-grained demographic data about a person like what age were they, what gender were they, etc. As you ask more and more deep questions, you need more data.

MM: Machine learning has been around for a while. Why is it getting so much hype now?

RB: The main reason is machine learning systems have gotten way better over the last few years. Going from yesterday’s really terrible telephone number recognition – speaking into a telephone and having a machine barely understand you – all the way to basically being able to drive a driverless car around. Really the reason these methods have gotten better is three things: Folks have come up with new kinds of algorithms, but, even more importantly, computers have become so much more powerful that we can crunch on the third element, which is that we have so much data available now. There’s so much data now and so much computing power that algorithms that would have been impossible to apply five years ago are now commonplace.

TK: On the one hand, the amount of data available is very different from even 10 or 15 years ago. The other thing is that processing power got much more advanced, so you can do more things than were previously possible. Plus, there’s a huge advancement in the techniques used. But the biggest influence is the tools available – tools people can use in an open source way. The whole tool ecosystem exploded. More people are using them.

MM: Do you think at some point AI will be able to make creative decisions in advertising?

TK: We are still far from that. Maybe eventually it might happen, but it’s hard to tell.

RB: Well, machine learning is already being applied to advertising, big time. Any of the ads that are placed in a Google search page, any of that is done via machine learning. But [regarding creativity], this is where we start to move from machine learning to artificial intelligence, which is a label that’s used for any problem that’s too hard to solve right now. I would say, currently, machine learning systems are extraordinarily effective in taking some training data – some data that they used to apply rules from – and then applying that data to new data. But really, they’re just learning some rules and applying them to a piece of data. Figuring out the creative process and how to mimic that in a machine is still an unsolved, very difficult problem.


Navigating the Brave New World

April 4, 2017 — by Kristin Brewe


As a part of a module I teach at the University of West London, Emerging Technology, Other Realities, I have students read a work of science fiction in addition to practical training about programmatic advertising and augmented/virtual reality campaigns. How do you prepare students to work in a world that often resembles classics in science fiction? How do experienced industry professionals manage to stay ahead of the game in which the “present” is often already too late? There’s one rather old-fashioned answer: training.

The digital skills gap is an equal opportunity employer

According to the Government’s “Digital Skills for the UK Economy” report, approximately 1 in 5 job vacancies relate to the digital skills gap. This is even more pronounced in creative industries, like advertising. Experienced industry professionals often fall hard into the digital skills gap; yesterday’s media planner is today’s statistician. Not that there’s anything wrong with statisticians, of course. However, as companies embrace the technological transformation of advertising, they fail to create a world that takes experienced staff along for the journey.

This comes at a business cost in addition to the obvious human one. Many companies making such technology shifts lose out on the deep knowledge that comes with experience. For example, I have worked with many media buyers who came up through radio and TV who would have been perfectly suited to more tech-centric media trading. In fact, they might have been better than the average entry-level employee, as they had years of experience negotiating and evaluating deals—something that’s hard to teach other than on the job. Those colleagues didn’t get the chance, however. No one trained them. The businesses in question lost out on all that knowledge, and the people in question were left wondering why they ever cared about their careers in the first place.

One might think that overlooking experience would lead employers to be overly satisfied with recent graduate hires, given that everyone always talks about tech-savvy Millennials. However, that is not the case. Both as an employer and as an advisor to employers, I have found communications graduates woefully unprepared for the automated world we’re creating. While Millennials and Gen Z are adept at using their phones to use Snapchat filters, very few can tell you how Snapchat/Facebook/Instagram/display advertising might work to target them or be tailored to them. As Accenture’s Mohini Rao wrote in The Digital Skills Gap in the UK, “This generation are consumers of digital technology, not creators.”

Creating opportunities to learn about machines vs. machine learning

Of course, we, as a society (with our industry strongly implicated in this), have created these consumers who lack context. But because we created that environment, we can also create a new one with opportunities to code, to use industry tools, and to understand how the machines in our lives work. (Note: This should probably start in pre-school, but since I teach university, I’ll focus on Uni students.)

This is where courses like our new Advertising & Public Relations course come in. At UWL, we are committed to ensuring that our students understand the basics of automation so that they can succeed in their careers. Even creatives need to know that programmatic ads are often certain standard sizes, much like we used to teach the dimensions of standard print. Admittedly, we have an uphill battle here. When you start diving into these topics with university students, responses like “WTF is programmatic advertising?” or “You can see all that data about me? Is that legal?” or “But making skyscraper banners is boring!” are common.

To help overcome such obstacles and to create a more positive environment for the next generation of professionals, our course is working with visionary organisations like the New Marketing Institute (NMI). A few weeks ago, the excellent instructor from NMI came in and managed to transform boredom and cynicism into excitement. Several students afterward came to me to express interest in knowing more about things like programmatic and in exploring ad tech as a career path. That’s both a testimonial to the quality of NMI’s instruction and also to the fact that, once this aspect of advertising is explained clearly, it doesn’t have to be a boring diet of acronym soup.

Would that more organisations like NMI visited universities. Would that more students were exposed to the advanced aspects of our practice. Would that my former colleagues had the chance to change with our industry, rather than outside of it. So let’s change all that, shall we?


Why Optimization in Marketing Matters More Than Ever

March 14, 2017 — by Michael Neiss


Read any marketing trade publication and “optimization” is a term that regularly pops up in the context of digital ad campaigns. But what does it mean to really “optimize” a campaign—and why should you consider it as part of your long-term marketing strategy?

Why you should use optimization

Any time you launch a campaign, you have a goal in mind. If you can measure this goal, you can optimize to it. You just need to understand which dials to turn—which users to target, which sites to target them on, what time of day is most effective and more.

As more and more data becomes available, it is quickly becoming impossible for one person to manage this process by hand. There are multiple formats, individual images and messages to consider. What team—let alone individual—could look at every impression multiple times a day and score it? You need machine learning—the science and practice of designing computer algorithms that find patterns in large volumes of data—to manage the scale of the opportunity. A machine learning algorithm not only assesses opportunities better and continually improves upon itself over time; it also saves time and improves efficiency for employees and companies, freeing up traders to do less manual and more strategic work.
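To make the idea concrete, here is a minimal sketch of the kind of per-impression scoring a machine learning algorithm performs. It is illustrative only: the feature names, weights and bid threshold are invented for the example, and production platforms use far richer models.

```python
import math

def score_impression(features, weights, bias=0.0):
    """Predict the probability of a desired outcome (e.g., a conversion)
    for one impression, using a simple logistic model over named features."""
    z = bias + sum(w * features.get(name, 0.0) for name, w in weights.items())
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative weights a learning algorithm might have fit from past data.
weights = {"site_quality": 1.2, "hour_of_day_match": 0.8, "ad_frequency": -0.5}

impression = {"site_quality": 0.9, "hour_of_day_match": 1.0, "ad_frequency": 2.0}
p = score_impression(impression, weights, bias=-2.0)

# A bidder would compare this score against a performance threshold
# for every opportunity, many thousands of times per second.
should_bid = p > 0.05
```

A human can tune weights like these for a handful of features; the point of machine learning is that an algorithm can fit and refresh millions of them continuously.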

How to overcome the challenges of optimization

As the first demand-side platform, we’ve seen hundreds of advertisers, thousands of campaigns and trillions of impressions over the last decade. During this time, we’ve often encountered certain scenarios that pose a challenge to traditional optimization:

  • Short-lived campaigns: You have a campaign that won’t be active long enough to generate learnings. For example, a retailer may want to run a short promotional campaign to highlight an upcoming sales event.
    • Solution: Seed the new campaign with data from previous campaigns the advertiser has run. There’s no need to start from scratch every time!
  • Rare events: You care about a certain event but it doesn’t happen frequently enough. For example, a cruise operator wants to maximize new bookings through their website, but cruises are expensive and so there are relatively few merit events from which to learn.
    • Solution: Move up the sales funnel and choose a proxy for the event you care about. Users may not book a new cruise very often, but a repeat visit to the itinerary page may be a reliable signal of intent. You can also optimize towards audience members that look like the users you want to reach. Look-alike models like this are even more powerful when they utilize second-party data, which augments an advertiser’s first-party data with behavioral data from non-competitive advertisers to build a richer profile of each user.
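The look-alike idea in the last bullet can be sketched with a toy similarity measure. This is purely illustrative: the trait names and the simple set-overlap (Jaccard) scoring are assumptions for the example, not how any production look-alike model works.

```python
def jaccard(a, b):
    """Overlap between two sets of behavioral traits (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Seed profiles: traits of users who fired the proxy event
# (e.g., a repeat visit to the cruise itinerary page).
seed_profiles = [
    {"travel_content", "cruise_search", "deal_alerts"},
    {"travel_content", "cruise_search", "luxury_content"},
]

def lookalike_score(user_traits):
    """Score a prospect by their closest match to any seed profile."""
    return max(jaccard(user_traits, seed) for seed in seed_profiles)

prospect = {"travel_content", "cruise_search", "sports_scores"}
score = lookalike_score(prospect)  # high overlap -> worth targeting
```

Second-party data helps precisely because it enlarges each user's trait set, making matches against the seed audience more reliable.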

How to be smart about optimization

Every advertiser faces a fundamental choice when setting up a new campaign: Is it more important to spend the budget in full or to stay within a certain performance threshold? Most advertisers probably aspire to the second goal, but it can be difficult to have high conviction in performance KPIs that rely on last-touch conversion credit. In other words, what does it really mean if a user who clicked an ad or saw an ad went on to buy something? Would that purchase have happened anyway?

Answering questions like these and understanding the relationship between last-touch conversion credit and real, causal marketing outcomes requires marketers to think about incrementality—something we will cover in our next blog post.


MediaMath Recognized in the Latest Gartner Magic Quadrant for Digital Marketing Hubs

February 16, 2017 — by Joanna O'Connell


We’ve done it again.

MediaMath is being recognized in the latest Gartner Magic Quadrant for Digital Marketing Hubs for our scalable omnichannel platform that enables marketers to seamlessly manage and activate their audience data in media across channels, formats and devices and optimize campaigns in real-time.

The Gartner Magic Quadrant provides both quantitative and qualitative analysis on technology and service markets. It highlights the trajectory of key companies, which are assessed on their ability to execute and their completeness of vision in a given category. You can access the full report here.

We are the leading DSP in the “Challengers” quadrant, and believe our strengths include:

  • Programmatic marketing at scale: We provide all of the tools a marketer or agency needs to execute programmatic marketing at scale. These include audience management powered by our fully-featured, next-generation DMP, integrated with our omnichannel DSP; a depth and breadth of privileged media access – across supply sources and markets – that’s unparalleled in the industry; and machine learning to help marketers optimize their campaigns to reach their audiences most efficiently and effectively, and to understand how each marketing touchpoint leads to incremental lift in revenue.
  • Integrated data management: Our integrated DMP connects data management directly with media execution and decision solutions to enable holistic audience management and omnichannel audience buying at scale. This DSP + DMP approach enables marketers and agencies to manage their programmatic marketing more seamlessly through activation of data right within our buying platform. Removing the need for an external DMP means there’s no data loss or latency, and audience profile data can be linked with media behavior for smarter, better informed planning, buying and optimization.
  • Real-time decisioning and optimization: Our platform leverages the centralized intelligence of our Brain algorithm to allocate budget intelligently across channels. Powered by advanced machine learning and oriented towards true incremental business performance, The Brain brings to bear increasingly intelligent optimizations and, therefore, business outcomes that are driven and measured by math, across all addressable channels, transparently.

Our clients are sophisticated marketers who are as obsessed as we are with data and technology as the fuel for next generation marketing and business performance. This year, we will continue to build and innovate on technology products and solutions that help them execute relevant 1:1, real-time marketing at scale that delivers against their most important business objectives.


Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.


Machine Learning: The Factors to Consider for Marketing Success

October 4, 2016 — by Ari Buchalter


This byline originally appeared in MarketingTech.

Adoption of machine learning is moving up a gear. From diagnosing diseases, to driving cars, to stopping crime, the ability of machines to learn from data and apply those learnings to solve real problems is being leveraged all around us at an accelerating pace.

As data volumes continue to grow, along with advances in computational science, machine learning is poised to become the next great technological revolution.

Digital marketing represents one of the most exciting arenas where machine learning is being applied. Across websites, mobile apps, and other digital media, there are hundreds of billions of opportunities for advertisers to deliver ads to consumers every single day.

Typically, those opportunities each come with a wealth of data attached and are made available in a real-time auction where potential buyers (i.e., advertisers or the agencies working on their behalf) have mere milliseconds to analyse the opportunity and respond with a bid to serve their ad to a particular consumer on a particular device at the perfect time.

The speeds and scales involved dwarf any other real-time transactional medium from financial exchanges to credit cards.

A lot of brainpower has therefore gone into defining what machine learning algorithms do when it comes to digital marketing. In plain English, it consists of things like identifying the most attractive consumers for a given brand or product, determining how much to bid on opportunities to show ads to those consumers in different contexts and on different devices, personalising the ads delivered to ensure relevance, and accurately measuring the impact of those ads on bottom-line sales.

As the pace of innovation accelerates, it’s important for brands to define guiding principles that ensure this amazing technology yields maximum impact.

Understanding how to best leverage the vast amounts of data about digital audiences and the media they consume can be the difference between success and failure for the world’s largest brands.

That is why smart marketers are investing heavily in Data Science and machine learning to drive competitive advantage and are increasingly seeking out partners with this expertise.

With so much hanging in the balance, it’s instructive to consider how marketers should approach machine learning.

There are several important factors marketers should bear in mind when implementing machine learning technology to help ensure success:

Start with business goals

When adopting machine learning technology, marketers should begin at the end. Define the specific, measurable business outcomes you want to achieve and gear the machine learning around that.

Avoid ‘shallow’ objectives like page visits or clicks and use deep metrics like incremental sales or ROI. The deeper the metric, the better the results.

Don’t be fooled into the traditional approach of buying media based on indexed demographics or other coarse proxies. The beauty of digital marketing is that it can be optimised to measure success using the same metrics that matter in the board room, achieving the best possible results.

Select a robust platform

When it comes to machine learning, there’s a huge difference between theory and practice. Solutions that work in a small-scale testing environment may fail spectacularly in large-scale production.

It’s critical the machine learning runs on a platform proven to handle the required scales and speeds with the performance, reliability, and security demanded by enterprise-class marketers.

Furthermore, look for platforms that are flexible enough to enable easy customisation, both of the data and the models, to meet the unique needs of your business.

But a word to the wise: platforms cannot serve two masters. Some marketing platforms aimed at advertisers are actually operated by companies who make their money selling media. That conflict of interest should make you think twice.

Read the rest of the article here.


Machine Learning Demystified, Part 3: Models

September 15, 2016 — by Prasad Chalasani


In Part 2 of our series, the ML novice realized that generalization is a key ingredient of a true ML algorithm. In today’s post we continue the conversation (the novice is in bold italics) to elaborate on generalization and take a first peek at how one might design an algorithm that generalizes well.

Last time we started talking about how one would design an ML algorithm to predict the click-through-rate (CTR) on an ad impression, and you designed two extreme types of algorithms. The first one was the constant algorithm, which simply records the overall historical click-rate a, and predicts a for every new ad opportunity.

This was fine for overall accuracy but would do poorly on individual ad impressions, because its predictions are not differentiated based on the attributes of an impression, and CTRs could be very different for different types of impressions.

Exactly, and the other extreme was the memorization algorithm: it records the historical CTR for every combination of feature-values, and for a new ad impression with feature-value-combination x (e.g., browser = “Safari”, gender = “male”, age = 32, city = “NYC”, ISP = “Verizon”, dayOfWeek = “Sunday”), it looks up the historical CTR a(x) (if any) for x, and outputs a(x) as its prediction.

And we saw that this algorithm would be highly impractical and it would not generalize well to new ad impressions.

Yes, it would not generalize well for two reasons: (a) the specific feature-value-combination x may not have been seen before, and (b) even if x occurred in the training data, it may have occurred too infrequently to reliably estimate CTR for cases resembling x.

So how would we design an algorithm that generalizes well?

Let’s do a thought experiment: what do you do when you generalize in daily situations?

I start from specific facts or observations and formulate a general concept.

How, exactly, do you do that?

I abstract common aspects of the specific observations, so that I can make a broader, more universal statement. I build a mental model that’s consistent with my observations.

In terms of detail, would you say that your mental model is relatively simple or complex?

Simple, definitely. I abstract away the non-essential details.

Great, you’ve just identified some of the characteristics of a good generalization: a relatively simple, abstract, less detailed model that is consistent with (or fits, or explains) the observations, and is more broadly applicable beyond the cases you have seen.

That’s how humans think, but how does this help us design a computer algorithm that generalizes well?

This helps us in a couple of ways. Firstly, it gives us a framework for designing an ML algorithm. Just like humans build mental models based on their observations, an ML algorithm should ingest training data and output a model that fits, or explains the training data well.

What does a “model” mean specifically?

A model is the mathematical analog to the human idea of a “concept” or “mental model”; it’s the mathematical formalization of what we have been informally calling “rules” until now. A model is essentially a function that takes as input the characteristics (i.e. features) of a case (i.e. example), and outputs a classification (e.g., “cat” or “not cat”) or score (e.g. click-probability).

Wow, so a model is essentially a program, and you’re saying that an ML program is producing another program?

Sure, that’s a good way to look at it, if you think of a function as a program that takes an input and produces an output.

We seem to be back in voodoo-land: a program that ingests training data, and spits out a program…

Well the ML algorithm does not magically output code for a model; instead the ML algorithm designer usually restricts the models to a certain class of models M that have a common structure or form, and the models in the class only differ by certain parameters p. Think of the class of models M as a generic template, and a specific model from the class is selected by specifying the values of the parameters p. Once restricted to a certain class of models, the ML algorithm only needs to find the values of the parameters p such that the specific model with these parameter-values fits the training data well.

For example suppose we’re trying to predict the sale price y of a home in a certain locality based only on the square footage x. If we believe home prices are roughly proportional to their square footage, a simple class of models for this might be the class of linear models, i.e., the home price y is modeled as a*x. The reason these models are called linear is that, for a fixed value of a, a plot of x versus y would be a straight line.

Notice that a*x is a generic template for a linear model, and a value for a needs to be specified to get a specific linear model, which would then be used to predict the sale price for a given square-footage x. The ML algorithm would be trained on examples of pairs (x,y) of square-footage and home-price values, and it needs to find the value of the parameter a such that the model a*x “best fits” the training data. I hope this does not sound so mysterious any more?
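
To make “best fits” concrete, suppose we measure fit by squared error (a common choice the dialog hasn’t formalized yet). For the class a*x the best parameter then has a closed form, sketched here with made-up data:

```python
def fit_linear_through_origin(pairs):
    """Least-squares fit for the model class y = a*x: minimizing
    sum((a*x - y)^2) over a gives a = sum(x*y) / sum(x*x)."""
    sxy = sum(x * y for x, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    return sxy / sxx

# Hypothetical (square footage in thousands, price in $ millions) pairs
training = [(1, 2.1), (2, 3.9), (3, 6.0)]
a = fit_linear_through_origin(training)   # ~1.99
model = lambda x: a * x                   # the specific model selected from the class
```

The ML algorithm’s entire output is the single number a; the generic template a*x plus that value is the predictive model.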

It certainly helps! Instead of magically spitting out code for some model, an ML algorithm is actually “merely” finding parameters p for a specific class of models.


But how do we know a model is “correct” or is the “true” one?

Remember that a model expresses a relationship between features and response (e.g. classification or score). All non-trivial real-world phenomena are governed by some “true” underlying model (think of this as the signal, or the essence, of the relationship between the variables involved) combined with random noise. The noise could result from errors in measurement or inherent random variation in the relationship. However an ML algorithm does not need to discover this “true” underlying model in order to be able to generalize well; it suffices to find a model that fits the training data adequately. As the statistician George Box famously said,

All models are wrong, but some are useful,

meaning that in any non-trivial domain, every model is at best an approximation when examined closely enough, but some models can be more useful than others.

What kinds of models would be more useful?

In the ML context, a useful model is one that generalizes well beyond the training data. This brings us to the second way in which thinking about how humans generalize helps, which is that it suggests a guiding principle in designing a true learning algorithm that has good generalization abilities. This principle is called Occam’s Razor:

Find the simplest model that explains the training data well. Such a model would be more likely to generalize well beyond the training examples.

And what is a “simple” model?

In a given context or application-domain, a model is more complex if its structure or form is more complex and/or it has more parameters (the two are often related). For example, when you’re predicting CTR and have thousands of features you could be looking at, but you’re able to explain the training data “adequately” using just 3 of them, the Occam’s Razor principle says you should use just those 3 to formulate your CTR-prediction model. Or, when predicting house prices as a function of square-footage, if a linear model fits the training data adequately, you should use it (rather than a more complex model, such as a quadratic) as it would generalize better to examples outside the training set. I’ve used italics when talking about fitting/explaining the training data because that’s something I haven’t explained yet.

I see, so the constant algorithm I designed to predict CTR produces an extremely simple model that does not even fit the training data well, so there’s no hope it would generalize beyond the training data-set?

Exactly, and on the other hand the memorization algorithm produces an extremely complex model: it records the historical CTR for each feature-value combination in the training data. When tested on any example (i.e. feature-value combination) that already occurred in the training data, this model would produce exactly the historical CTR observed for that example. In this sense this complex model fits the training data perfectly. When an overly complex model fits training data exceptionally well, we say the model is overfitting the training data. Such models would not generalize well beyond the training data-set.

Interesting, so we want to find a relatively simple model that fits the training data “adequately”, but we don’t want to overdo it, i.e. we shouldn’t overly complicate our model in order to improve the fit to the training data. Is it obvious why complex models in general wouldn’t generalize well?

It’s actually not obvious, but here’s an intuitive explanation. Remember we said that any real-world phenomenon is governed by some true model that represents the signal or essence of the phenomenon, combined with random noise. A complex model-class has so many degrees of freedom (either in its structure or number of parameters) that when a best-fitting model is found from this class, it overfits the training data: it fits the signal as well as the random noise. The random noise gets baked into the model and this makes the model perform poorly when applied to cases not in the training set.

Can you give me an example where a simpler model generalizes better than a complex one?

Let’s look at the problem of learning a model to predict home sale prices in a certain locality based on the square-footage. You’re given a training data-set of pairs (x,y) where x is the square-footage in thousands, and y is the sale price in millions of dollars:

(2, 4.2), (3, 5.8), (4, 8.2), (5, 9.8), (6, 12.2), (15, 29.8),

and your task is to find a model that “best” fits the data (you decide how to interpret “best”). Any ideas?

I notice from the training data that when the square-footage x is an even number, the home price is (2*x + 0.2), and when it’s an odd number the home price is (2*x - 0.2).

Great, your model fits the training data perfectly, but in order to do so you’ve made your model complex because you have two different expressions based on whether the square footage is even or odd. Can you think of a simpler model?

I see what you mean — I notice that the home prices (in millions of dollars) in the training data-set are fairly close to twice the number of thousands of square feet, so my simpler model would be 2*x.

This would be a simpler model, even though it does not exactly fit the training data: the home prices are not exactly 2*x but close to it. We can think of the home price as being 2*x plus some random noise (which could be positive or negative). It just so happens that in the specific training examples you’ve seen, the noise part in the even-square-footage examples is 0.2, and -0.2 for the others. However it could well be that for a new example with x=7 (an odd number) the price is exactly 2*x or slightly above 2*x, but if we were to use your previous complex model, we would be predicting 2*x - 0.2, which would not be a good prediction.

The point is that in an effort to fit the training data very well (in your case perfectly), you made your model overly complex and latched on to a spurious pattern that is unlikely to hold beyond the training data-set. By definition the noise part is random and un-predictable, so any attempt to model it and predict would result in errors. Instead, the simpler model captures only the “signal” and is oblivious to the random noise.
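
The comparison can be run directly in code (a sketch; the x=15 price is taken as 29.8, consistent with the parity pattern above, and the new home at x=7 with price 14.05 is hypothetical):

```python
training = [(2, 4.2), (3, 5.8), (4, 8.2), (5, 9.8), (6, 12.2), (15, 29.8)]

def complex_model(x):
    # Branches on parity, so it fits all six training points exactly
    return 2 * x + 0.2 if x % 2 == 0 else 2 * x - 0.2

def simple_model(x):
    # Captures only the signal: price is about twice the square footage
    return 2 * x

def sum_squared_error(model, data):
    return sum((model(x) - y) ** 2 for x, y in data)

# The complex model fits the training data perfectly, while the simple
# model is off by 0.2 on every point...
fit_complex = sum_squared_error(complex_model, training)   # ~0.0
fit_simple = sum_squared_error(simple_model, training)     # ~0.24

# ...yet on a hypothetical new home (x=7, true price 14.05), the simple
# model's prediction (14.0) is closer than the complex one's (13.8).
```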

Here, then, is the recipe for designing a true ML algorithm that generalizes well:

  • Pick a relatively simple class of models M, parametrized by a set of parameters p
  • Find values of the parameters p such that the model from M with these parameters fits the training data optimally. Denote this model as M[p]. M[p] is a function that maps a feature-value combination x to a response (e.g. classification or score).
  • Output M[p] as the predictive model for the task, i.e. for an example x, respond with M[p](x) as the predicted classification or score.
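
The recipe above maps directly onto code. Here is a generic sketch, instantiated with a linear model class fit by least squares (one possible notion of "optimal fit"; the data is made up):

```python
def learn(model_template, fit_params, training_data):
    """Steps 1-3: pick a model class (the template), fit its
    parameters to the data, and return the predictive model M[p]."""
    p = fit_params(training_data)           # step 2: best-fitting parameters
    return lambda x: model_template(p, x)   # step 3: M[p](x) for any example x

# Instantiation: the linear class y = a*x (step 1), fit by least squares
linear = lambda a, x: a * x
least_squares_a = lambda data: (sum(x * y for x, y in data)
                                / sum(x * x for x, _ in data))

predict = learn(linear, least_squares_a, [(1, 2.0), (2, 4.1), (3, 5.9)])
```

Swapping in a different model class and fitting procedure changes the learner; the outer recipe stays the same.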

What does it mean for a model to fit or explain the training data, and how do I find the parameters that “optimally” fit the data?

Informally, a model fits the training data well when it is consistent with the training data. This is analogous to how you form a mental model based on your observations: your mental model is consistent with your observations (at least if you’re observant enough and think rationally). Of course to implement a computer algorithm we need to make precise the notion of an “optimal fit”, and specify how the computer should go about finding model parameters that “best” fit the training data. These are great topics for a future post!


Machine Learning Without Tears, Part Two: Generalization

August 22, 2016 — by Prasad Chalasani


In the first post of our non-technical ML intro series we discussed some general characteristics of ML tasks. In this post we take a first baby step towards understanding how learning algorithms work. We’ll continue the dialog between an ML expert and an ML-curious person.

Ok I see that an ML program can improve its performance at some task after being trained on a sufficiently large amount of data, without explicit instructions given by a human. This sounds like magic! How does it work?

Let’s start with an extremely simple example. Say you’re running an ad campaign for a certain type of running shoe on the NYTimes web-site. Every time a user visits the web-site, an ad-serving opportunity arises, and given the features of the ad-opportunity (such as time, user-demographics, location, browser-type, etc) you want to be able to predict the chance of the user clicking on the ad.  You have access to training examples: the last 3 weeks of historical logs of features of ads served, and whether or not there was a click. Can you think of a way to write code to predict the click-rate using this training data?

Let me see, I would write a program that looks at the trailing 3 weeks of historical logs, and if N is the total of ad exposures, and k is the number of those that resulted in clicks, then for any ad-opportunity it would predict a click probability of k/N.
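
A sketch of that constant algorithm in Python (reducing the historical logs to a list of 0/1 click flags is an assumption for illustration):

```python
def constant_ctr_predictor(click_log):
    """Predict the overall historical click rate k/N for every
    ad opportunity, ignoring its features entirely."""
    n = len(click_log)            # N: total ad exposures in the trailing window
    k = sum(click_log)            # k: how many of them were clicked (0/1 flags)
    rate = k / n
    return lambda features: rate  # the same answer for any opportunity

predictor = constant_ctr_predictor([0, 0, 1, 0])   # 1 click in 4 exposures
```

Every opportunity gets the same answer no matter its features, which is exactly the weakness discussed next.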

Great, and this would be an ML program! The program ingests historical data, and given any ad-serving opportunity, it outputs a click probability. If the historical (over the trailing 3-weeks) fraction of clicked ads changes over time, your program would change its prediction as well, so it’s adapting to changes in the data.

Wow, that’s it? What’s all the fuss about Machine Learning then?

Well this would be a very rudimentary learning algorithm at best: it would be accurate in aggregate over the whole population of ad-exposures. What if you want to improve the accuracy of your predictions for individual ad-opportunities?

Why would I want to do that?

Well if your goal is to show ads that are likely to elicit clicks, and you want to figure out how much you want to pay for showing an ad, the most important thing to predict is the click probability (or CTR, the click-through-rate) for each specific ad opportunity: you’ll want to pay more for higher CTR opportunities, and less for lower CTR opps.

Say you’re running your ad campaign in two cities: San Francisco and Minneapolis, with an equal number of exposures in each city. Suppose you found that overall, 3% of your ads result in clicks, and this is what you predict as the click-probability for  any ad opportunity. However when you look more closely at the historical data, you realize that all ad-opportunities are not the same: You notice an interesting pattern, i.e. 5% of the ads shown to users in San Francisco are clicked, compared to only 1% of ads shown to users logging in from Minneapolis. Since there are an equal number of ads shown in the two cities, you’re observing an average click-rate of 3% overall, and …

Oh ok, I know how to fix my program! I will put in a simple rule: if the ad opportunity is from San Francisco, predict 5%, and if it’s from Minneapolis, predict 1%. Sorry to interrupt you, I got excited…

That’s ok… in fact you walked right into a trap I set up for you: you gave a perfect example of an ad-hoc static rule. You’re hard-coding an instruction in your program that leverages a specific pattern you found by manually slicing your data, so this would not be an ML program at all!

So… what’s so bad about such a program?

Several things: (a) this is just one pattern among many possible patterns that could exist in the data, and you just happened to find this one; (b) you discovered this pattern by manually slicing the data, which requires a lot of time, effort, and cost; (c) the patterns can change over time, so a hard-coded rule may cease to be accurate at some point. A learning algorithm, on the other hand, can find many relevant patterns automatically, and can adapt over time.

I thought I understood how a learning algorithm works, now I’m back to square one!

You’re pretty close though. Instead of hard-coding a rule based on a specific pattern that you find manually, you write code to slice historical data by all features. Suppose there were just 2 features: city (the name of the city) and IsWeekend (1 if the opportunity is on a weekend, 0 otherwise). Do you see a way to improve your program so that it’s more general and avoids hard-coding a specific rule?

Yes! I can write code to go through all combinations of values of these features in the historical data, and build a lookup table showing for each (city, IsWeekend) pair, what the historical click-through-rate was. Then when the program encounters a new ad-opportunity, it will know which city it’s from, and whether or not it’s a weekend, and so it can lookup the corresponding historical rate in the table, and output that as its prediction.
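
A sketch of that lookup-table program (the record layout is hypothetical):

```python
from collections import defaultdict

def build_ctr_table(examples):
    """Group historical ads by (city, is_weekend) and record each group's
    click-through rate. `examples` is a list of (city, is_weekend, clicked)
    records, with clicked being 0 or 1."""
    stats = defaultdict(lambda: [0, 0])   # (city, is_weekend) -> [n, clicks]
    for city, is_weekend, clicked in examples:
        stats[(city, is_weekend)][0] += 1
        stats[(city, is_weekend)][1] += clicked
    return {key: clicks / n for key, (n, clicks) in stats.items()}

def predict(table, city, is_weekend):
    """Look up the historical rate; None if this pair was never seen."""
    return table.get((city, is_weekend))
```

With only two features this is manageable; the problems start, as discussed next, when the number of feature combinations explodes.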

Great, yes you could do that, but there are a few problems with this solution. What if there were 30 different features? Even if each feature has only 2 possible values, that is already 2^30 possible combinations of values, or more than a billion (and of course, many of the features, such as cities or web-sites, have far more than two possible values). It would be very time-consuming to group the historical data by these billions of combinations, our look-up table would be huge, and so it would be very slow to even make a prediction. The other problem is this: what happens when an ad opportunity arises from a new city that the campaign had no prior data for? Even if we set aside these two issues, your algorithm’s click-rate predictions would in fact most likely not be very accurate at all.

Why would it not work well?

Your algorithm has essentially memorized the click-rates for all possible feature-combinations in the training data, so it would perform excellently if its performance is evaluated on the training data: the predicted click-rates would exactly match the historical rates. But predicting on new ad opportunities is a different matter; since there are 30 features, each with a multitude of possible values, it is highly likely that these new opportunities will have feature-combinations that were never seen before.

A more subtle point is that even if a feature-combination has occurred before, simply predicting the historical click-rate for that combination might be completely wrong: for example, suppose there were just 3 ad-opportunities in the training data which had this feature-combination: (Browser = “safari”, IsWeekend = 1, Gender = “Male”, Age = 32, City = “San Francisco”, ISP = “Verizon”), and the ad was not clicked in all 3 cases. Now if your algorithm encounters a new opportunity with this exact feature-combination, it would predict a 0% click-rate. This would be accurate with respect to the historical data your algorithm was trained on, but if we were to test it on a realistic distribution of ad opportunities, the prediction would almost certainly not be accurate.

What went wrong here? Suppose the true click-rate for ads with the above feature-combination is 1%, then in a historical sample where just 3 such ad-opportunities are seen, it’s statistically very likely that we would see no clicks.
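
The “statistically very likely” claim is easy to quantify: assuming independent impressions with a true 1% click rate, the chance of observing zero clicks in three impressions is (1 - 0.01)^3:

```python
# Probability of zero clicks in 3 independent impressions
# when the true click rate is 1%:
p_no_clicks = (1 - 0.01) ** 3
print(round(p_no_clicks, 4))   # 0.9703
```

So roughly 97% of the time, a sample of three such impressions shows no clicks at all, and the memorizer dutifully records a 0% rate.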

But what could the learning algorithm do to avoid this problem? Surely it cannot do any better given the data it has seen?

Actually, it can. By examining the training data, it should be able to realize, for example, that the ISP and Browser features are not relevant to predicting clicks (for this specific campaign). Perhaps it then finds that there are 1,000 training examples (i.e. ad-opportunity feature-combinations) that match the above example when ISP and Browser are ignored, and that 12 of them had clicks, so it would predict a 1.2% click-rate.

So your algorithm, by memorizing the click-rates from the training data at a very low level of granularity, was “getting lost in the weeds” and was failing to generalize to new data. The ability to generalize is crucial to any useful ML algorithm, and indeed is a hallmark of intelligence, human or otherwise. For example think about how you learn to recognize cats: you don’t memorize how each cat looks and try to determine whether a new animal you encounter is a cat or not by matching it with your memory of a previously-seen cat. Instead, you learn the concept of a “cat”, and are able to generalize your ability to recognize cats beyond those that exactly match the ones you’ve seen.

In the next post we will delve into some ways to design true learning algorithms that generalize well.

Ok, looking forward to that. Today I learned that generalization is fundamental to machine-learning. And I will memorize that!