Probability and Prediction Markets – What matters and what doesn’t

August 14th, 2007

Probability is a dangerous topic in discussing prediction markets.  Many people don’t really understand probability and thus see some prediction market “misses” as “failures” instead.  Some of these, such as the InTrade market for the Senate in 2006, have been well-publicised and widely criticised.  But there are two issues that we need to disentangle here:

  1. The only way to evaluate accuracy of predictions is with a sufficient group or series of predictions.
  2. Most prediction market “failures” are one-off events.

Chris Masse at MidasOracle is trying to say that the Karl Rove resignation market at NewsFutures wasn’t “predictive” because it was trading at ~20% when the announcement was made.  This is horse-(manure).  Based on this argument, at what point does a market become “predictive”?  Only when it trades above 50%?  Or perhaps when it trades above 80%, or some other figure?

There is no way to evaluate the accuracy of a single binary prediction because it either happens or it doesn’t.  Evaluation requires lots and lots of predictions; only then can you start determining accuracy.  If all the events that were judged to have a 20% chance of occurring actually occur 20% of the time, then the market is calibrated and accurate.  (If five people were all judged to have a 20% chance of resigning and one of them actually did, the judgements would be calibrated.)
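As a rough illustration of the calibration check described above, here is a minimal Python sketch.  The function name `calibration_table`, the choice of five bins and the sample data are all made up for illustration; none of it comes from a real market.

```python
from collections import defaultdict


def calibration_table(forecasts, outcomes, n_bins=5):
    """Group forecasts into equal-width probability bins and compare the
    average forecast in each bin with how often the event actually occurred.

    Returns a list of (mean_forecast, observed_frequency, count) tuples,
    one per non-empty bin, ordered from low to high probability.
    """
    bins = defaultdict(list)
    for p, happened in zip(forecasts, outcomes):
        # A forecast of exactly 1.0 is placed in the top bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, happened))

    table = []
    for index in sorted(bins):
        pairs = bins[index]
        mean_forecast = sum(p for p, _ in pairs) / len(pairs)
        observed = sum(1 for _, happened in pairs if happened) / len(pairs)
        table.append((mean_forecast, observed, len(pairs)))
    return table


# Hypothetical data: five people each judged 20% likely to resign,
# and exactly one of them does.
forecasts = [0.2, 0.2, 0.2, 0.2, 0.2]
outcomes = [True, False, False, False, False]

for mean_forecast, observed, count in calibration_table(forecasts, outcomes):
    print(f"forecast {mean_forecast:.2f}  observed {observed:.2f}  n={count}")
```

With only five forecasts the match between forecast and observed frequency says very little; the comparison only becomes meaningful once each bin holds a large number of predictions.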

This runs into the second issue above: most “failures,” including the 2006 Senate market, the Pope prediction market, the Olympics choice market, and more, are one-off events.  They are generally “misses” or “failures” because a) they occur so infrequently that determining accuracy as described above isn’t possible, and b) each time they’re run, traders have to learn all over again which signals and information are important.  I believe that all of these markets could be accurate if they were run frequently enough that traders could learn from their mistakes, which is a key feedback mechanism in a normal market.  When a Pope is elected only every 10-30 years, it’s very difficult to re-learn papal politics well enough to trade effectively, and even harder to determine whether the trading was effective, because the event won’t occur again for another 10-30 years.  Since these events won’t be run any more frequently, prediction markets are still a good way of aggregating opinions about any of them happening.

Even election markets fall prey to this phenomenon, though the attention paid to elections means they are traditionally fairly accurate.  (I don’t remember seeing any paper on the 2006 elections specifically.)  The issue here sometimes becomes one of timescale.  In politics there is a big risk of a candidate doing or saying something stupid pretty much up to the last minute, so traders factor that into their prices, leaving a favourite at 80% when perhaps it should be at 95%.  That gap gets made up only once the candidate really can’t say anything stupid anymore, such as on the day of the election itself.

In summary, recognise that a single prediction is just that: the traders’ aggregated opinion of the likelihood of that event occurring.  Once enough of these judgements are put together, then the accuracy can be determined, and only then.  Remember that an event that has only a 1% chance of happening will still actually take place one time in a hundred!

——————-

I’d like to also point out David Pennock’s blog post here for more detail.  Calibration, the test described above, is a good test, but not the only test of accuracy.  There are more statistical tests that should be run (again, only with a sufficient number of predictions), but even these are only useful when comparing two prediction methods against each other.

  • oyvind
    Yeah, but you can use a scoring rule, like Good's scoring rule: ln(r / 0.5) if the WTA event turns out to happen, ln((1 - r) / 0.5) if not. A prediction of, say, 0.20 is bad according to the scoring rule.

    Of course, maybe the PM will get a good score in the long run. But if we are speaking single events, then Chris Masse has a point. If another forecasting tool turns out to get a better score, and we only have access to one prediction, then the other forecasting tool is the best.
  • Hello, Oyvind, and thanks for commenting.

    Scoring rules still are really only appropriate for repeated decisions, and using them on a one-off forecast is still problematic to me.

    I believe that there is simply no way of assessing the quality of a single (binary) forecast.
  • oyvind
    Say that you are going to evaluate 100 forecasting tools, and you only have access to one event (for example, the probability of a nuclear attack or of finding space aliens). The 100 different scores will correlate with the average scores the tools would get if predictions were repeated. A tool with a larger score on predicting a single event will also have a higher probability of having a large score in the long run.

    So in one way, it makes sense to evaluate single events. Evaluating a tool based on this is flawed, but it is not impossible.
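
The scoring rule discussed in the comments above can be made concrete with a short Python sketch.  It reads the rule as ln(r / 0.5) when the event happens and ln((1 - r) / 0.5) when it does not, which is an assumption about what the commenter intended.  The function names and the "true" probability of 0.2 are hypothetical, chosen only to contrast a single-event score with the long-run average.

```python
import math


def log_score(r, happened):
    """Logarithmic score of a forecast r relative to a 50/50 guess.

    r must lie strictly between 0 and 1.  A forecast of 0.5 always scores 0;
    positive scores beat the coin flip, negative scores do worse.
    """
    p = r if happened else 1.0 - r
    return math.log(p / 0.5)


def expected_log_score(r, true_p):
    """Long-run average score of always forecasting r when the event
    really occurs with probability true_p (an assumed, unknowable value)."""
    return true_p * log_score(r, True) + (1.0 - true_p) * log_score(r, False)


# Single event that did happen: the 20% forecast scores worse than a 50% one.
print(log_score(0.2, True), log_score(0.5, True))

# Long run, assuming the event truly has a 20% chance: the 20% forecast wins.
print(expected_log_score(0.2, 0.2), expected_log_score(0.5, 0.2))
```

Under these assumptions both sides of the exchange show up: on the one occasion the event occurs, the 20% forecast is penalised, yet over many repetitions it outscores the rival forecast.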