Friday, September 16, 2016

Renewology: The Four Big Steps

As promised in the basic intro, this is a more detailed look at the process of Renewology: converting adults 18-49 ratings into renewal percentages. I'm not putting out many of the specific numbers that go into the model, but I hope this is a pretty comprehensive and honest look at how the process works, including the things that worry me.

This will probably be too long and dry for most, but I think it's important to have it out there as a reference while it's in action. I welcome your feedback, if you can manage to get through all of this. While I like to think I have developed some expertise on the TV ratings world, I'm much less of an expert on predictive modeling. I've tried to keep the portions that involve that pretty simple, but I'd love to hear if something seems particularly unsound mathematically. I can't promise to get a whole lot more specific than what is included here, but I'll do my best.

Step One: Adults 18-49 Ratings -> "True" (timeslot-adjusted) Ratings

The "True" part of this process has been around for years, and I hope this year's version of the formula will be the best yet. True takes adults 18-49 ratings and adjusts them for the main factors that define the difficulty of a timeslot: overall TV viewing, timeslot competition, and the surrounding programs. (Yes, for 2016-17 that means both lead-in and lead-out!) This year's updates to the True formula get their own post next week.

My very first draft of this model actually used raw numbers instead of True. But when I tried doing the same thing with True ratings, it was immediately clear that the percentages with True numbers absolutely blew the raw ones out of the water. Making the model work well with raw numbers might be possible; other people seem to get along OK with them. But when the later steps of the process are this objective, you really have to have good input numbers; when a show has a special out-of-timeslot episode, you'd have to adjust the future ratings projections for the difficulty of a future timeslot. You'd have to throw in an expected decline after DST. You may have to account for an expected change in lead-in or lead-out. Why go through all those pains when I already have something that (ideally) weeds out the effects of all those things already?

But there are a couple downsides to doing this with True numbers that I think you should know about:

1. Smaller sample size. Using True numbers limits me to just the 2014-15 and 2015-16 seasons for the model inputs, which makes for a limited sample size in some cases and a perhaps dangerously small one in others. In some of the more worrisome cases, I've tried the raw-numbers version of the model both with two years and more than two years to see if there are big deviations, and have been reasonably OK with the results. But I try to highlight some of the times that worry me below (mostly with projecting network averages).

Why just those seasons? Updating True every year allows me to keep a lot of things constant within the one year for which it is intended. In a historical True measurement, the changing nature of television would require many more parts of the formula to become more fluid. I don't think it would be responsible to extend the current formula as is much beyond the last two years, and even doing that would take a lot of extra work. But this sample size problem may finally be what inspires me to work on a more dynamic True formula in the future!

2. Communicating results. Ideally, I'd like all of these projections of future ratings to be easily translated into raw numbers, so you can see things like where we project a show's final ratings average and A18-49+ average will end up. Instead, all of these projections have to be communicated in True ratings that you can't easily match up with what you see when the day's ratings come out. I may work on some way of converting True numbers back into raw, with all the headaches of future timeslots involved above. But doing it this way makes it a problem of delivery, rather than a problem that affects the actual predictions themselves. Lesser of two evils.

Step Two: True Ratings -> Future True Ratings

When testing models in this process, it seemed better to go with late-season ratings rather than full season averages, even though early data points still correlate pretty well. ABC's The Muppets and NBC's Heroes Reborn both had major post-premiere collapses, and got cancelled even though their complete season averages were not really that bad.

As I said in the intro, Renewology's biggest "innovation" from a prediction standpoint is in this step, where we convert the most recent ratings into a ratings level where we expect them to settle down the line. It's not that big an adjustment once a show has been on the air for a month or so, but it can be very interesting in the earliest weeks of a show's season (especially a new show).

This is something I've been working on for the last couple years, and you can see the results in both the SpotVaults and the Plus Power Rankings from those seasons. Past years have gone with two separate equations, one for new shows and one for returning shows. The returnee equation was based entirely on each individual show's year-to-year trends, with the expectation that a show would continue its recent year-to-year trend going forward. And the newbie model was based on a broader approach, looking at the average week-to-week behavior for new series in the past.

I took a look at these projections and was interested to notice that the newbie model, where there's no previous data for that show, was actually better than the returnee one which had an entire previous season to go on. The returnee one was particularly bad with second season shows, because the early weeks of the previous season are often heavily inflated.

So these ratings projections have gone with something resembling the new show model for all shows. Instead of focusing on just the one show's past history, it aggregates a ton of previous seasons from many different shows, and looks at how ratings have developed from this point forward historically.* But I should note that there are still separate trajectories for new shows and returning shows, because new shows tend to drop a lot more in the early weeks.

*- Why is there no unscripted Renewology? Part of it is general lack of interest, which is why unscripted gets the shaft at other sites. But another key factor is that many unscripted shows need a different model. With scripted shows, it's usually fine to assume they will be steady in True throughout the season unless their actual strength is changing. There are a few shows that tend to spike more for premieres/finales, but they are not nearly as egregious as the unscripted exceptions. Shows like The Voice and The Bachelor have clear show-specific trajectories. We have seen audition-based shows like The Voice and American Idol get much weaker in the later part of the season, then come back very strong at the beginning of the next season, while the Bachelor franchise is just the opposite. The late-season results are clearly not a weakening/strengthening of the franchise as a whole, but something inherent to the content in those different parts of the season. Scripted shows don't have that issue. So projections with these unscripted shows work better by just looking at their own individual histories, and judging their strength should be done with the full season rather than just the later episodes. Those things can be incorporated, and might be in the future, but I just couldn't get to it in this earliest edition.

In this first run, I don't do any subsetting beyond the new/returning separation; in other words, all new shows with the same episode order are expected to drop the same amount post-premiere. This can probably be refined, but the sample sizes are not that big within the subsets, and the relationships tested didn't bear a lot of compelling fruit. I was a bit surprised to see that comedies and dramas have almost the exact same trajectory. Though I would tend to think bigger premieres drop harder, the evidence there was not that compelling either. (Might be because the biggest premiere (Empire) also had the best post-premiere trend!) There did seem to be some connection with network; ABC and NBC shows seem to drop more than CBS and Fox ones, but I'm holding off on that for now. I'm also interested in looking at whether things like timeslot difficulty and critical acclaim have any effect on the trajectory. Maybe next year.
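A minimal sketch of the aggregation idea, not the actual formula: it assumes the "drop" is measured as the ratio of a show's end-of-season level to its level at the current point, and that new and returning shows are kept in separate pools. The function name `drop_stats` and every number in it are placeholders for illustration, not real ratings.

```python
from statistics import mean, stdev

def drop_stats(history):
    """Mean and sample standard deviation of end-of-season / current ratios.

    `history` is a list of (current_level, end_of_season_level) pairs for
    past shows, all observed at the same point in their seasons.
    """
    ratios = [end / current for current, end in history]
    return mean(ratios), stdev(ratios)

# Placeholder numbers for illustration only (hypothetical, not real ratings).
past_new_shows = [(2.0, 1.5), (1.6, 1.2), (1.0, 0.9)]
past_returning_shows = [(2.0, 1.9), (1.5, 1.35), (1.0, 1.0)]

# Separate trajectories: new shows tend to drop more in the early weeks.
new_mean, new_sd = drop_stats(past_new_shows)
ret_mean, ret_sd = drop_stats(past_returning_shows)
```

Keeping only two pools (new vs. returning) mirrors the choice described above; any finer subsetting would shrink each pool further.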

This Point Forward

I want to clarify what I mean by projecting the ratings from "this point forward." For "this point," I am not usually talking about literally this one data point. The True "rolling average" is calculated the same way it has been done for years in the True Power Rankings: an average of the most recent one-third of episodes this season, rounded up. That means in weeks one, two, and three of the season, it is based on just one data point, but beyond that it begins aggregating multiple points. I think this does a lot more good than harm; it immediately discards very early season episodes, where there is a ton of movement. And it keeps the model from overreacting to noisy week-to-week fluctuations once we have a lot of data.
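The rolling average described above can be sketched like this. The function name and the sample ratings are hypothetical; only the "most recent one-third, rounded up" rule comes from the text.

```python
def rolling_true_average(true_ratings):
    """Average of the most recent one-third of episodes, rounded up.

    `true_ratings` holds this season's True ratings in airing order.
    """
    if not true_ratings:
        raise ValueError("need at least one episode")
    # Integer form of ceil(n / 3): 1-3 episodes -> 1, 4-6 -> 2, and so on.
    window = (len(true_ratings) + 2) // 3
    recent = true_ratings[-window:]
    return sum(recent) / window

# Hypothetical ratings: after three episodes only the latest one counts.
after_three = rolling_true_average([2.0, 1.5, 1.4])
# After four episodes the window grows to the latest two.
after_four = rolling_true_average([2.0, 1.5, 1.4, 1.2])
```

With this rule the inflated premiere drops out of the window as soon as a second episode has aired.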

As for "forward": what we are projecting is not the full season average, but the "rolling average" for the end of the season. In other words, it's the True average for the last one-third of the full-season episode order. I have to stay somewhat on top of episode orders, but most new shows with a back-nine option are just given an order of 22 to begin with, because they'll likely need to get an extension to get renewed, and the episodes 9-13 projection is not that different from the episodes 15-22 projection anyway. (The size of the order would make a much bigger difference if we were actually projecting the full season average, since it would matter a lot how much to weight the inflated early episodes.)
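As a quick check of which episodes make up that end-of-season window, here is a small sketch. It assumes the same "one-third, rounded up" rule used for the rolling average; `final_window` is a hypothetical helper name.

```python
def final_window(episode_order):
    """First and last episode (1-indexed) of the end-of-season window."""
    window = (episode_order + 2) // 3  # one-third of the order, rounded up
    return episode_order - window + 1, episode_order

short_order = final_window(13)  # initial 13-episode order
full_order = final_window(22)   # order with the back nine included
```

Under that assumption a 13-episode order targets episodes 9-13 and a 22-episode order targets episodes 15-22, which lines up with the ranges mentioned above.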

Simulating the Future Drop

It is also important to note that the formula does not simply find the single projected average and run that one number into the model in Step Four. Instead, it uses the mean and standard deviation of these past cases to generate 20,000 random normal simulations of the drop from "this point forward." It runs all of those simulations through Steps Three and Four individually, and the final projection is the average of all 20,000 resulting probabilities.

Why bother with the simulations? Why wait to average those points until Step Four? Basically it's because all ratings points are not created equal. If you look at the illustration in the Step Four section, you'll see the logistic regression model looks like a steep cliff. If the opening projection is on the top edge of the cliff, that'd be a very healthy place to be... if correct. But if you end up a couple tenths below that projection, your probability starts dropping fast. If you overachieve it, there's not really that much to be gained because you're already at the top of the cliff. So the simulated average ends up being a lot less certain, because it accounts for the much lower probability in the bad scenarios. While we are predicting future ratings, we're trying to account for the uncertainty of those predictions as well.

As more results come in, the uncertainty will shrink, and it'll become safer to reside on that top edge of the cliff. Even if the projected averages are exactly the same, the model will be more decisive about the outcome later in the order.
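Here is a rough sketch of why averaging 20,000 simulated outcomes differs from plugging in the single projected average. Everything numeric is an assumption for illustration: the logistic steepness is made up (the model's real coefficients are not published here), the bubble is placed at 75% of the network average as described in the Network Averages on Display section, the drop is treated as a ratio, and the network adjustment is treated as a simple division.

```python
import math
import random

def renewal_probability(relative_rating, midpoint=0.75, steepness=12.0):
    """Logistic curve; midpoint and steepness are placeholder values."""
    return 1.0 / (1.0 + math.exp(-steepness * (relative_rating - midpoint)))

def simulated_r_percent(current, drop_mean, drop_sd, network_avg,
                        n_sims=20_000, seed=0):
    """Average the renewal probability over simulated future drops."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        ratio = rng.gauss(drop_mean, drop_sd)  # simulated end/current ratio
        future = current * ratio               # simulated future True rating
        total += renewal_probability(future / network_avg)
    return total / n_sims

# Hypothetical show projected to land right at the network average.
point = renewal_probability(1.25 * 0.8 / 1.0)      # projection taken as exact
early = simulated_r_percent(1.25, 0.8, 0.20, 1.0)  # wide early-season spread
late = simulated_r_percent(1.25, 0.8, 0.05, 1.0)   # narrow late-season spread
```

With the same projected average in all three cases, the early-season figure comes out lowest and the no-uncertainty figure highest, because the bad simulations fall down the cliff while the good ones have little room to climb.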

Step Three: Future True Ratings -> Network-Adjusted Future True Ratings

Now we get into the steps that are not particularly original in the renew/cancel industry, but I hope I am bringing a slightly more sophisticated take to the table. The first one is to adjust the ratings for the network's ratings. When vying for a spot on next season's schedule, you're fighting with other shows on your network. So the ratings of those shows are a major factor.

The other guys' baseline for this is the average of the network's entire scripted department. That is a fine approach, but I tried a few different things and found two that I thought added a bit more value. So I use a blend of those two.

The first one is the average of the network's entire original department, including unscripted series. Why would this be better than the scripted-only average? It's because it gives at least a slightly better sense about how many shows the network will renew. When you adjust for the scripted-only average, you're essentially saying that each network's scripted department is equal, and each network will pretty much renew at the same levels.

This can be a problematic assumption when a network has very strong unscripted programming and weak scripted, like NBC for most of the time since the rise of The Voice. In the last couple years (and probably before that), NBC has renewed at least a couple fewer shows than a scripted-only approach says they "should," because a large chunk of the schedule is already dedicated to reliable unscripted shows. Weighing in the network's considerable unscripted strength can help account for some of that.

The old Cancellation Bear analogy says that it's not about outrunning the bear, it's about outrunning the other guys. I would add that it's about outrunning the other guys until the bear has gotten full. I think this adds at least some acknowledgement of how far up the totem pole the bear will dine.

If the first average is more broad than the traditional approach, the second average is more specific: the network average of only its category, with "category" meaning comedy vs. drama on the big four. When a network has a significant imbalance between comedies and dramas, it is typically kinder to the lesser department and harsher to the stronger department. I have always separated the True Power Rankings into separate sections by network and category, so I like to think this is not a new development even at this site. But it's definitely a more formal version.

This helps to reduce the odds with shows like Angel from Hell and The McCarthys on CBS, which actually had pretty good numbers compared with the entire network, but looked a lot less favorable compared with the rest of the network's stout comedy department. And it also boosts the odds somewhat on something like NBC's The Carmichael Show, which looked like a big reach compared to the entire network but merely a small reach when looking at the other dire comedies.

The category average is different on the CW; rather than comedy vs. drama, it is distribution studio (Warner Bros. vs. CBS). This is probably the closest thing to a "non-ratings" factor in the formula, but it's still derived by looking at the ratings average from each studio. The network has had a blatant renewal bias toward CBS Studios shows in recent years, so those shows get a big bonus in the formula from the weakness of the studio average. This is one of the formula's more significant adjustments, and we could end up looking foolish if the network suddenly starts treating CBS shows on a level playing field. But I think there's a solid pile of recent evidence to justify its inclusion, from Beauty and the Beast to Reign to Crazy Ex-Girlfriend. We may add a studio component for the big four networks in the future, but it seems to be a lot less blatant on those networks.

Finally, each average gets a minor adjustment that I call the "network generosity" factor. This is a slight tweaking of the average specific to each network based on its historic renewal tendencies. It's basically designed to get the projected total number of renewals to approximately in line with what each individual network has done historically. I took kind of an unscientific approach to this for now, but basically it correlates with how many short-order/midseason/summer renewals a network tends to make. Without this, the formula would vastly underestimate the renewal behavior of the CW, which put three returning dramas on the regular season bench and sent another one to summer. And it would overestimate the CBS renewals, since they have a modest bench and such a high volume of shows that are intended to air for the full season. The sample size is worrisome here, and I could get pretty screwed by this if, say, the CW suddenly scales way back on renewals, but I take some solace that the numbers were quite consistent from 2014-15 to 2015-16 within each network. Still, due to those concerns these are pretty small adjustments; if the networks play exactly to form again in 2016-17, they won't go quite far enough.

The final network average is a weighted blend of the two averages described above: 60% entire original department, 40% category average, plus the "network generosity" adjustment.
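The blend can be sketched as below. The 60/40 weights come from the text; the rest is assumption. In particular, the sketch applies the "network generosity" tweak once, as an additive term after blending, and treats the network adjustment as a simple ratio. Function names and sample values are hypothetical.

```python
def network_baseline(original_avg, category_avg, generosity=0.0):
    """60/40 blend of the two averages plus a generosity tweak.

    `generosity` is modeled as a small additive term (an assumption);
    positive values raise the bar, negative values lower it.
    """
    return 0.6 * original_avg + 0.4 * category_avg + generosity

def network_adjusted(future_true, baseline):
    """Future True rating relative to the blended baseline (assumed ratio)."""
    return future_true / baseline

# Hypothetical network: strong overall originals, weak category.
baseline = network_baseline(1.5, 1.0)
generous = network_baseline(1.5, 1.0, generosity=-0.05)
relative = network_adjusted(1.04, baseline)
```

A show in a weak category on a strong network gets measured against a bar between the two averages, tilted toward the full originals average.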

Future Network Averages

The model was initially built with full seasons, where we knew what the final target averages would be at the end. In practice, as the season is ongoing, we won't have that luxury. So like with individual shows, we're also projecting where each network's final True averages will be at the season's end. The broad network averages are projected kinda similarly to the individual show averages: applying historical data about how the average develops over the course of the season. This can help account for a network's future weak points in the season (like a hiatus for The Voice). Admittedly, it's another place where the sample size is very small (just two years of averages for each network). But the general smoothing effect of the True formula means that these adjustments are usually not terribly far from where the season-to-date averages are.

And the category projections, at least in the early weeks, use data from the end of past seasons about how category averages relate to full network averages. Generally, these categories are quite volatile early in the season, particularly with networks that have uneven rollouts (like CBS due to Thursday football). So for at least the first half of the season we put much more weight on the end-of-season ratios from past seasons, rather than actual data. It shifts exclusively to the actual data from the current season after week 20, which is about when these ratios became reliably pretty close to the end-of-season numbers.
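One way to picture the shift from historical ratios to current-season data is sketched below. Only the week-20 cutoff comes from the text; the squared ramp that keeps most of the weight on history through the first half is an invented schedule, and all names and values are hypothetical.

```python
def category_average(week, network_avg, historical_ratio, actual_category_avg,
                     switch_week=20):
    """Blend a history-based category estimate with this season's data.

    `historical_ratio` is the past end-of-season ratio of the category
    average to the full network average.
    """
    historical_estimate = network_avg * historical_ratio
    if week > switch_week:
        return actual_category_avg  # actual data only after the cutoff
    # Hypothetical schedule: weight on actual data grows slowly at first.
    weight_actual = (week / switch_week) ** 2
    return (1.0 - weight_actual) * historical_estimate \
        + weight_actual * actual_category_avg

preseason = category_average(0, 1.2, 1.1, 1.0)
midseason = category_average(10, 1.2, 1.1, 1.0)
late = category_average(25, 1.2, 1.1, 1.0)
```

In this sketch, week 10 still leans 75/25 toward the historical estimate, which is one reading of "much more weight" on past seasons in the first half.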

These projected network averages are calculated just once per week, at the end (after Sunday numbers are in), and then used for all points over the next week (the next Monday to Sunday). The idea here is that each point within a week has a consistent baseline, that should be solidified with Saturday and Sunday finals before it's ever actually put to use. There may be some weeks when some preliminary numbers go into these averages, especially when there are Nielsen delays due to holidays and stuff, and they get tweaked to a tiny degree in the period before Sunday finals come in. But hopefully it won't be too noticeable.

By the way, I had to come up with some very rough preseason baselines to use during week one (and prior to week one). So they assume each network will drop 5% from its final True average the previous season. This is just a bit less than the league average True decline from 2014-15 to 2015-16. In order to account for the lack of American Idol on Fox, and the detrimental effect that may have, I excluded it from the 2015-16 average used in Fox's projections. These may end up being really bad but fortunately they'll only be used as placeholders for one week.

Network Averages on Display

I want to note something about the "target" averages, either in the SpotVault or through other Renewology articles. Please keep in mind that these are not the actual network averages. Instead, they're what this formula perceives to be the "bubble": the point where renewal is exactly 50%. If you refer to the illustration below, you can see that a 0.5 renewal probability equates to almost exactly 75% of the network average. So you can multiply any of these "targets" by 1.33333333 and get a very close approximation of the actual network average being used. I also plan on introducing a Climate Center page that lists the components of the network averages used in Renewology.
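The conversion between a displayed "target" and the underlying network average is just a division by 0.75. A tiny sketch, with hypothetical function names and sample values, and with 0.75 taken as the approximate figure quoted above rather than an exact model constant:

```python
BUBBLE_SHARE = 0.75  # approximate share of the network average at a 50% R%

def target_from_average(network_avg):
    """Displayed bubble target for a given network average."""
    return network_avg * BUBBLE_SHARE

def average_from_target(target):
    """Approximate network average behind a displayed target."""
    return target / BUBBLE_SHARE  # same as multiplying by about 1.3333

implied_average = average_from_target(0.9)
round_trip = target_from_average(implied_average)
```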

Step Four: Network-Adjusted Future True Ratings -> Renewal Percentages

The big idea behind this step isn't very original either: take those relative numbers and convert them to percent chance that the show will be renewed. At other sites, these are usually communicated in broad categories, ranging from "certain to be cancelled" (one smiley face) to "certain to be renewed" (five smiley faces). This model will produce actual percentage numbers, which is cool, but I still wouldn't put a ton of weight into small percentage fluctuations. I still think it's helpful to break them into those same kinds of broader categories.
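Grouping percentages into broader categories could look something like this. The two end labels echo the ones quoted above; the middle labels and all four cutoffs are invented for illustration and are not the site's actual thresholds.

```python
import bisect

# Hypothetical cutoffs and labels, five buckets in total.
CUTOFFS = [0.10, 0.35, 0.65, 0.90]
LABELS = [
    "certain to be cancelled",
    "likely to be cancelled",
    "toss-up",
    "likely to be renewed",
    "certain to be renewed",
]

def renewal_category(r_percent):
    """Map a renewal probability in [0, 1] to a broad category label."""
    if not 0.0 <= r_percent <= 1.0:
        raise ValueError("r_percent must be between 0 and 1")
    # bisect_right counts how many cutoffs the probability has reached.
    return LABELS[bisect.bisect_right(CUTOFFS, r_percent)]

opening = renewal_category(0.84)
collapse = renewal_category(0.30)
```

Bucketing like this keeps a one- or two-point wiggle in the percentage from reading as a change in outlook.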

With the other guys, this conversion process is pretty subjective, and can be colored by "rules" based on scheduling tendencies from outside of the ratings realm. In Renewology, for better or worse, it is very objective. These percentages come straight from a logistic regression model, a useful way of approaching predictions with only two possible outcomes (like renew vs. cancel). Trying many of these is how I landed on the criteria described above: late-season, network-adjusted True ratings. This creates an equation in which you can plug in any network-adjusted True rating and come up with an actual percentage number. The curve looks like this:

A very important thing to note is that this logistic regression model was created using only new shows, though we will be using it to put out forecasts for all scripted series. Why only new shows? Because new shows tend to be more of a true ratings meritocracy. There are a handful of exceptions, but in general, new series haven't been on the air long enough to become bogged down by exorbitant costs, and they also usually don't have huge back-end windfalls secured. For the most part, new shows have to earn their way to season two with ratings, which is the kind of thing we're trying to model here.

In practice, the choice to use only new shows helps create a slightly more decisive model, because it removes most of the shows for which ratings are basically irrelevant. The all-shows model introduces much more uncertainty: both with very low-rated series, due to the renewals of shows like The Good Wife, Madam Secretary and Elementary, and with high-rated series. Most of the big misses on that side are with pre-announced final seasons that ended the season with slam-dunk renewal ratings like Two and a Half Men and Parenthood. Basically, I decided I didn't want those decisions bleeding into other forecasts, because they're separate from the ratings realm and don't really have an impact on the probability of some random new show getting renewed. It does mean that the actual forecasts for the Madam Secretary's of the world will be way, way too pessimistic. And the fact that we give any chance to announced final seasons is obviously incorrect. But those problems would've been the case even if they were included in the logistic regression, just not quite as bad. I'd rather miss bigger on those particular shows than have them bleed into the other shows. I may end up having to devise some way of indicating where I think you should ignore the projections, though that can be a slippery slope.

That said, while the curve is kinda steep, there's definitely still some uncertainty in this thing. When you see a show you think is completely dead given a 10-20% chance, I would not totally discount that. Those percentages come from actual reach renewals like the first seasons of Galavant, American Crime and Scream Queens. We had Supergirl at 90%ish to return to CBS. They would've been big "misses" in this formula, and I really don't have a problem with that because I was surprised by them when they happened. That's what the percentages are for. Maybe everything plays to form this year, and we end up looking really under-confident. But I'm not holding my breath. I'm a lot more worried about there being more reach decisions.

What Is Not In Renewology?

So that's basically the process in a nutshell. I guess you could probably infer this from reading all of that, but here are some things that this hard numbers equation doesn't have a way of addressing (but may in the future):

1. Costs/syndication/streaming windfalls. It has long been conventional wisdom that networks will completely ignore first-run ratings in certain situations, in the pursuit of getting their shows into syndication. The wording behind this "rule" has changed a lot over the years. I think there's probably a version of it that would have a lot of value, but for now I am admitting I'm not smart enough and/or didn't have enough time to find it.

2. Media buzz. Renewology doesn't adjust for anything that anybody else on the Internet says about a show's fate. In a lot of cases with connected insiders, it'd help to weigh that, but not always. Even if the buzz seems to be pointing one way, seeing the state of a show's ratings might help show when the buzz could be wrong.

3. Actual outcomes. We don't have any adjustment based on a show getting an episode extension or an order-trimming. I'm not really worried about this in most cases; if a show actually gets its order trimmed, hopefully we'll have been rather bearish about it already. And while an episode extension can't hurt, plenty of poor-rated shows have gotten extensions but not second seasons.

Another aspect that might be a bit more worrisome is that the model does not react to decisions on other shows. In other words, if a network early-renews 15 shows, that probably should reduce the odds of the ones that got left behind. But it won't do that here, at least not for now.

4. DVRs and Sub-Demos. I have really ramped up my collection of multi-day DVR and additional demographic data in the last year, but these numbers are not a part of Renewology at this point. Looking at recent ad rates surveys, there's some evidence that it's better to be a heavily-DVRed show rather than a lightly-DVRed show, Live+SD ratings being equal. And it's also quite possible that, much like how we reward comedies on a comedy-weak network, certain shows should benefit from reaching demographic audiences that the network often struggles to reach.

#4 is omitted not because of ability, but because of availability. They often don't come out till days or even weeks later, and I'm really interested in the Renewology numbers coming out in real-time and not continually shifting from what is initially reported. I also think the True formula helps account for some of what these would add... shows in bad timeslots are often the ones that get big DVR boosts, and True rewards them for that. And some of the demographic thing might be captured in part by looking at skew compatibility with a show's lead-in and lead-out. But these are still factors I anticipate talking about some in the more subjective "roundup" columns.

Why Are These Factors Not Included?

I'm leaving them out because I don't (yet) have a good way of quickly quantifying them and, frankly, I'm not sold that anyone else does either. Over the years, I've been a skeptic of a lot of the non-ratings factors, and I think that skepticism has proven fairly warranted. The conventional wisdom says that first-run ratings are becoming less a part of the picture, and there's truth in that. But most of the biggest "mistakes" from the other guys in recent years have actually come from not trusting the ratings enough. Shows like The Millers and The Mindy Project had marginal or worse ratings but were considered "certain renewals" due to perceived syndication factors. Shows like Code Black and iZombie, with marginal or better ratings, were completely dismissed because of some historical scheduling tendency surrounding partial episode extensions.

I'm not saying there isn't value in some of these things. I would quibble about how big of a reach some of the renewals used to create the syndication rule really were. But enough marginal shows have survived that there does seem to be something there. Still, it was a big mistake to assume these rules were 0% or 100% propositions just because they had been that way in the past, and it has led to a bevy of misses on "certain" predictions when the ratings were telling a very different story. When you're missing multiple "certain" calls per year and amending the rules each time, claiming they have that level of authority starts to look kinda irresponsible. For now, I'm just admitting that I don't know.

So for better or worse, Renewology is a measurement of ratings merit. Instead of hiding the actual ratings picture behind subjective rules, we're just showing you what the ratings picture is. If you're looking for a technical definition of Renewology, or if I'm looking to cover my rear end, it might be better to say: "What percentage of shows at this ratings level get renewed?" rather than: "What are this specific show's chances of being renewed?"

What's Ahead

Much like the True formula*, the Renewology model isn't going to be changed as the season is ongoing, with the exception of blatantly obvious bugs. It was mostly built using full seasons, but there are a lot of additional complications involved in making it work with incomplete information. So I may have to do some emergency fix if things are clearly getting thrown egregiously off. However, the actual assumptions are locked in as of the first tables posting on Tuesday and do not change, period.

*- There is one exception with the True formula: I've usually given myself a week or two at the start of the season to see some actual results, just to verify that True is on pace to finish with the same league average as raw numbers. In past years, that has meant withholding the formula entirely in the early weeks. This year, that ain't happening, but I will still do that check after a week or two are in the books. So there may be a minor tweak to the True numbers you see at the beginning of the season. However, it will be a global adjustment, and every show will be increased or reduced equally, so it will not have a significant effect on the R%.

It will stay as is mostly for integrity reasons; it loses its credibility if I'm re-training it and wiping away past predictions while it's in action. But I am also really looking forward to getting a break from working on the formula itself (as well as these methodological write-ups)! If it becomes a major hassle having to explain away the exceptions of the world, I'm not ruling out some kind of enhancement to the formula that somehow addresses the shows with huge non-ratings factors. But that will be displayed alongside the initial numbers, not as a replacement.

Though the model doesn't change, there may be some changes in how I illustrate what it spits out. The biggie on display is R%. That's not going away. But the objective nature of this allows for some pretty cool alternative stuff... the "Collapse" and "Resilient" scenarios on the SpotVault page are only scratching the surface. Here are a few others:

R% Without Ratings Uncertainty. The R% is a blend of the future ratings uncertainty (Step Two) and the network decision-making uncertainty (Step Four). Obviously, I think that's the best way to do it. But it's also possible to look at these numbers as if there were no ratings uncertainty. In other words, how likely is the network to renew if the current projection of future ratings happens to be exactly right? It can be an interesting snapshot.

Other Scenarios Without Ratings Uncertainty. Perhaps even more interestingly, I could translate those "Collapse" and "Resilient" numbers into percentages using the same method. A way of communicating this in practice (using the actual opening week of Quantico), "We open Quantico at 84% to be renewed. If the ratings projection turns out to be exactly right, it's 97%. But if it has a major collapse, it's only 30%."

Looking at Just This Week. The R% aggregates multiple recent points, and the formula's definition of "current ratings" is the average of those points. But it can also display the R% if "current ratings" is defined by just the most recent week. As I said above, I usually prefer multiple points once we get beyond the very first few weeks of the season, because a lot of fluctuations are just noise. But it can be fun (or terrifying, I guess) to look at what a show's one-week situation would be if it has a big stinker or spike. This aspect might actually join the formula down the line; many shows have become completely different animals when they come back from lengthy hiatuses, so those may be cases where the super-recent results should be weighted more heavily.

At the moment, I don't have an actual plan for any of these last few things. A lot of it might get sussed out through the Renewology "roundup" columns. I'm just kinda teasing you with some of the possibilities, just in case what I've described in the last 10,000 words of intros is not enough!!!

© 2009-2016. All Rights Reserved.