How about this:
Two people give the odds for the result of a single flip of an unweighted coin.
Person A: Heads = 50%, Tails = 50%
Person B: Heads = 75%, Tails = 25%
The result of the coin flip ends up being Heads. Which person had the more accurate model? Did Person A get something wrong?
Person B’s predicted outcome was closer to the truth.
Perhaps person A’s prediction would improve if multiple trials were allowed. Perhaps their underlying assumptions are wrong (i.e., the coin is actually weighted).
Perhaps person A’s prediction would improve
But in this hypothetical scenario of explicitly unweighted coins, Person A was entirely correct in the odds they gave. There’s nothing to improve.
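To make the disagreement concrete, here is a rough sketch that scores both forecasts. Nobody above named a scoring rule, so the Brier score (squared error between the forecast probability and the outcome) is my own choice, and the function names are made up for illustration. It shows both points at once: on the single Heads result B scores better, but averaged over what a fair coin actually does, A scores better.

```python
def brier(p_heads, outcome_heads):
    """Squared error of a heads forecast against one outcome. Lower is better."""
    actual = 1.0 if outcome_heads else 0.0
    return (p_heads - actual) ** 2


def expected_brier(p_heads, true_p_heads=0.5):
    """Average Brier score over many flips, assuming the coin's true heads rate."""
    return (true_p_heads * brier(p_heads, True)
            + (1 - true_p_heads) * brier(p_heads, False))


forecasts = {"A": 0.50, "B": 0.75}
for name, p in forecasts.items():
    print(name,
          "single heads:", brier(p, True),
          "long run on a fair coin:", expected_brier(p))
```

On the one observed flip, B's 0.0625 beats A's 0.25. Over the long run on a fair coin, A's 0.25 beats B's 0.3125. Which number matters depends on whether you trust the "unweighted" assumption.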
We are talking about testing a model in the real world. When you evaluate a model, you also evaluate the assumptions made by the model.
Let’s consider a similar example. You are at a carnival. You hand a coin to a carny. He offers to pay you $100 if he flips heads. If he flips tails then you owe him $1.
You: The coin I gave him was unweighted so the odds are 50-50. This bet will pay off.
Your spouse: He’s a carny. You’re going to lose every time.
The coin is flipped, and it’s tails. Who had the better prediction?
You maintain you had the better prediction because you know you gave him an unweighted coin. So you hand him a dollar to repeat the trial. You end up losing $50 without winning once.
You finally reconsider your assumptions. Perhaps the carny switched the coin. Perhaps the carny knows how to control the coin in the air. If it turns out that your assumptions were violated, then your spouse’s original prediction was better than yours: you’re going to lose every time.
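The arithmetic behind that change of heart can be sketched as follows. The 1% prior that the carny is cheating is a number I made up for illustration, and I model your spouse's prediction as "tails every time"; neither comes from the story itself. The helper names are hypothetical too.

```python
from fractions import Fraction


def expected_value(p_heads, win=100, loss=1):
    """Expected payoff per flip: win $100 on heads, lose $1 on tails."""
    return p_heads * win - (1 - p_heads) * loss


def posterior_rigged(prior_rigged, n_tails):
    """Bayes' rule: probability the game is rigged after n_tails straight tails."""
    like_fair = Fraction(1, 2) ** n_tails  # fair coin: each tails has probability 1/2
    like_rigged = Fraction(1)              # assumed rigged model: tails every time
    numerator = prior_rigged * like_rigged
    return numerator / (numerator + (1 - prior_rigged) * like_fair)


print("EV per flip if fair:", float(expected_value(Fraction(1, 2))))
print("EV per flip if rigged:", float(expected_value(Fraction(0))))
print("P(50 tails in a row | fair):", float(Fraction(1, 2) ** 50))

prior = Fraction(1, 100)  # hypothetical prior belief that the carny cheats
for n in (1, 5, 10, 50):
    print(n, "tails in a row -> P(rigged) =", float(posterior_rigged(prior, n)))
```

Under the fair-coin assumption the bet is worth $49.50 per flip, which is why it looked so good. But 50 straight tails has a probability below one in a quadrillion under that assumption, so even a small initial suspicion of the carny ends up dominating after a handful of losses.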
Likewise, to evaluate Silver’s model we need to consider the possibility that its many assumptions contain flaws, especially if its prediction, like yours in this example, differs sharply from real-world outcomes. If the assumptions are flawed, then the prediction could well be flawed too.