Predictors of Success
I recently ran an idea by a professional sports mathematician, and ended up with some interesting ideas. Here is the email chain, slightly edited. It's not a doctoral thesis, just a ‘back-of-a-napkin’ discussion on stats.
At Home on the Court wrote in a recent post about the statistical significance of sample sizes in volleyball stats, and made the point that the sample sizes in a single match are not big enough to really mean anything from a predictive point of view.
I suspect there are both simple and complicated ways to determine if the sample size is appropriate to be predictive or not (as opposed to simply a mathematical representation of what happened).
So - what's the quick answer to this?
The short version is “it depends”. There’s not really any such thing as a “big enough” sample size, only sample sizes that give enough “power”.
The more accurate you want to be, the more samples you need. I would say that 20-30 games is about what you’d need in most instances though. Also, if you want to say X = 70 then you need a lot more data than if you just want to say X > Y. For things like single stats, saying Team A is better than Team B, you can get a big enough sample from just a couple of games. For Kill Percentage, you’re looking at potentially 100s of spikes from within a single game, so that’s not really just one sample. The only issue is that you kind of have to assume that everyone’s opposition was close to equal.
Longer, more rambling, less comprehensible version below:
If you’re looking at predictability of a percentage (e.g. win this stat and you win X% of games), then the accuracy of your prediction is about the square root of P x (1 – P) / N, where P = observed probability and N = sample size. So if seven of 10 teams won after having a better kill efficiency, then the estimate of winning probability for that stat is 7/10 = 70%. The error on that estimate is sqrt(70% x 30% / 10) = sqrt(0.021) = 14%. Through bad application of statistics we can, at best (with 95% accuracy), say that the true winning probability is within two standard deviations of the observed number, which in this case gives us the range of 42% to 98%. Increase this to 70 of 100 teams and the estimate doesn’t change, but the error drops from 14% to 5% (square root of 70% x 30% / 100), which means we can say that the true number is between 61% and 79%.
You’d need about 8000 games to be able to accurately say that the true number is between 69% and 71%, but that’s not really what we’re interested in. In most cases you just want to know that you’re a better chance than random luck if you win this stat. To do that, you just need the lower bound of the estimated range to be above 50%, and that happens with 21 or more games at a 70% win rate. If it’s only a 60% win rate you’d need about 100 games, but if it’s an 80% win rate you’d only need about seven games.
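For anyone who wants to play with these numbers, here is a rough Python sketch of the same back-of-a-napkin arithmetic. The function names are made up for illustration, and it uses the same simple two-standard-deviation shortcut as the email rather than a proper confidence interval method.

```python
import math

def standard_error(p, n):
    """Approximate error on an observed proportion p from n games: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

def two_sigma_range(p, n):
    """Rough 95% range: observed proportion plus or minus two standard errors."""
    se = standard_error(p, n)
    return p - 2 * se, p + 2 * se

def games_needed(p, half_width):
    """Number of games at which two standard errors shrink to half_width.

    Solves 2 * sqrt(p(1-p)/n) = half_width for n.
    """
    return 4 * p * (1 - p) / half_width ** 2

# The two examples from the email: 7 of 10 and 70 of 100.
for wins, games in [(7, 10), (70, 100)]:
    p = wins / games
    low, high = two_sigma_range(p, games)
    print(f"{wins}/{games}: error {standard_error(p, games):.1%}, "
          f"range {low:.0%} to {high:.0%}")

# Games needed to pin a 70% rate down to within one point either way.
print(round(games_needed(0.70, 0.01)))

# Games needed before the lower bound clears 50% at each win rate.
for p in (0.60, 0.70, 0.80):
    print(f"{p:.0%} win rate: about {games_needed(p, p - 0.5):.0f} games")
```

Because the code does not round the error to 14% before doubling it, the 7-of-10 range comes out a point wider on each side than the figures quoted above, and the exact break-even points land a touch either side of the rounded numbers in the email.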
Predictability in itself is an interesting concept. I’ve never seen a single computer prediction model (even those using “big data”) that can predict league-wide results at better than around 72% over the long term. I’ve had a few models go above 70% for a couple of years, only to drop to below 65% the following year with no changes to the algorithm. I’m actually planning on writing something soon. I haven’t done much work, except for some very early stuff looking at Player Rating points, and how many games back through history you need to look at to optimise the prediction of a player’s next game. It maxes out at about 20 games based on absolute error in the prediction, error weighted on standard deviations, and correlation, so that looks to be a good number.
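The email doesn't say how the next game was actually predicted, so the following is only a guess at the general shape of that look-back test. It assumes the prediction is a plain average of the previous few games and scores each window length on absolute error only. The function names and the `ratings` list (one player's rating points in game order) are hypothetical.

```python
def trailing_mean_mae(ratings, window, first_game):
    """Mean absolute error when each game from index first_game onward is
    predicted by the plain average of the `window` games before it.

    Assumption: a simple trailing average stands in for whatever
    predictor was really used.
    """
    errors = [
        abs(ratings[i] - sum(ratings[i - window:i]) / window)
        for i in range(first_game, len(ratings))
    ]
    return sum(errors) / len(errors)

def best_window(ratings, max_window):
    """Try every look-back length from 1 to max_window and return the one
    with the lowest mean absolute error, plus the score for each length.

    Every window is scored on the same games (index max_window onward),
    so longer windows are not judged on fewer predictions.
    """
    if len(ratings) <= max_window:
        raise ValueError("need more games than max_window")
    scores = {
        w: trailing_mean_mae(ratings, w, max_window)
        for w in range(1, max_window + 1)
    }
    return min(scores, key=scores.get), scores
```

Calling `best_window(ratings, 40)` on a real run of games would show whether the error bottoms out somewhere near the 20-game mark mentioned above. The standard-deviation weighting and correlation checks from the email are not reproduced here.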
I think it means I'm a geek that I thought it was both fascinating and kinda followed it too. So what you're saying is that you need a lot of games to determine whether a certain kill percentage is a predictor of success (i.e. reaching that level means you'll win), but even when you get there, it will only mean you have a higher chance of success than if you didn't reach it. Which isn't a whole lot different from saying that the better you hit, the more chance there is of winning (ok - it's a fair bit different, but still in the ballpark).
I'm reminded of a stat I heard about a while ago: There is a very very strong correlation between scoring first and not losing in the EPL. Which is not that surprising but interesting nonetheless. The interesting part is that the only teams who really broke this mould were the teams who won the league.
Yep, pretty much on the money. Your final point about EPL is right too. Great teams win because they’re really good at lots of things, so they’re less reliant on any one thing. You can see it in the AFL too. 70% of teams who win contested possessions win the game. Hawthorn has won 67% of games since 2012 when LOSING contested possessions. Melbourne has won 39% of games since 2012 when WINNING contested possessions.
Thanks for taking the time KJ