Kit Woolsey
Variance Reduction
I read with great interest David Montgomery's article on variance reduction in the February 2000 issue. I am one of those who sometimes uses "1995 methods" because I do not fully trust variance reduction. Despite David's article (and a private conversation we had last year, during which he gave me a private tutorial) I still have questions. Perhaps David has the answers?
A: Every version of JellyFish and Snowie that I have used has given only the variance reduced results.
B: I think it would be better to make both results available. Having multiple results for a single rollout makes both the implementation and presentation a bit more complex, but I believe it's nevertheless worthwhile.
C: The bots don't "realize" anything, but they do report the standard deviation or confidence interval of the rollout, which allows the user to infer what is going on.
D: The report of the equivalent games is, in effect, a report on the quality of luck estimates. When each game is worth many, then good evaluations were used.
E: It is unlikely that you will find something this severe, but you can get results where the equivalent games are fewer than the actual games. With JellyFish version 2.0, a 72-game interactive (that is, manual but variance reduced) rollout of this position
[Position diagram not reproduced. Recoverable labels: 30, 270; White; money game; Blue.]
had a higher standard deviation than you would get using the actual game results. JellyFish 2.0 truly had no clue about this kind of position. I no longer have the exact statistics, but the increase in the standard deviation was modest. I have never seen this kind of result in a "normal" position. If a program plays a position well enough that you actually care about its rollout results, you won't see this kind of problem.
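A rough way to see how equivalent games can fall below actual games is sketched here. It rests on an assumption not stated above: that "equivalent games" means the number of plain games needed to match the variance-reduced standard error, so it scales with the ratio of the per-game variances. The function name and the sample numbers are illustrative only; they are not taken from JellyFish or Snowie.

```python
def equivalent_games(actual_games, sd_plain, sd_reduced):
    """Plain games needed to match the precision of a variance-reduced rollout.

    Assumption: both inputs are per-game standard deviations in the same
    units, and the standard error of a rollout shrinks as 1/sqrt(games).
    """
    if sd_reduced <= 0:
        raise ValueError("sd_reduced must be positive")
    return actual_games * (sd_plain / sd_reduced) ** 2


# Hypothetical numbers: good luck estimates make each game "worth many".
print(equivalent_games(72, 1.0, 0.25))  # 1152.0

# Poor luck estimates: the reduced figure is worse than the plain one,
# so the rollout is worth fewer games than were actually played.
print(equivalent_games(72, 1.0, 1.1) < 72)  # True
```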
F: Assuming the programmers are honest and competent, the equivalent games indicated will reflect the actual statistical reliability of a rollout. This is a judgment that applies to everything about a bot. My personal belief is that the developers of Snowie and JellyFish are all honest and highly competent individuals, but others have reached different conclusions, and bugs have been found in every version of JellyFish and Snowie.
G
H: I suspect you'll split (or slot) with confidence regardless of the rollouts and what I say here, but let's try to untangle the many issues involved in your question.
First, there is the general issue of what we should make of rollout results which are close. The closer the decision, the less reliant we should be on a computer rollout for deciding what to do. Many factors bear on which play will be correct in a particular situation, almost none of which are reflected in a computer rollout. Fortunately, when plays are very close, it doesn't matter a lot which play we choose.
Elaborating this idea, it's useful to keep in mind the question that computer rollouts answer. They do not tell us the correct decision. What they tell us is which decision is best assuming both players from then on play exactly as the bot does. So JellyFish level 6 rollouts tell us how to play when we are playing cubeless money backgammon against JellyFish level 6.
This is not just an academic matter. Different plays do better against different opponents. Against Expert Backgammon, the correct 63 opening was 24/21 13/7. In Nackgammon, the correct 41 opening against JellyFish level 6 is 24/20 23/22. But this isn't the correct 41 opening against Snowie nor probably against JellyFish level 7. So the results of close, cubeless money JellyFish level 6 rollouts shouldn't have too much effect on how you play an opening 21.
Often the effect of a match score is more important than considering who is playing (assuming two strong players). The question you posed was how to play an opening 21 at DMP, but answering that question based on rollouts played as though gammons and backgammons counted is very suspect. The standard deviations and equivalent games are also calculated assuming cubeless money play, so they too are not directly applicable to your question.
Now let's turn to the data itself. First, you have mixed different experiments, so it's not surprising that the results might differ. Chuck Bower's results were not based on rollouts of an opening 21, but on rollouts of all the responses to an opening 21. If level 6 ever plays a second roll differently from the play that rolled out best, the two experiments are different. I don't have Chuck's results, but from my own I see that for a 43 response after 21 slotting, 24/20* 24/21 did best at DMP, although level 6 plays 24/20* 13/10. I found several similar differences after 21 splitting.
From your letter it appears that JellyFish has a bug in displaying the number of "equivalent games" for small standard deviations. A 12960 game level 6 rollout of an opening position is certainly equivalent to far more than 32000 games. I recommend ignoring the equivalent games in both JellyFish and Snowie and concentrating on the standard deviation or confidence interval, which relates much more directly to what you need to know.
For large samples you can combine rollouts by simply weighting each according to the number of trials. Combining your data for slotting you have (864 x 49.5 + 1800 x 49.9 + 12960 x 50.0)/(864 + 1800 + 12960) = 49.96%. For splitting you get 49.85%. Given that the rollout data is only displayed to one decimal place, you clearly can't have much confidence in this distinction.
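The weighting described here can be sketched in a few lines. The function name is my own; the slotting figures are the ones quoted in the text.

```python
def combine_rollouts(results):
    """Pool (games, value) pairs into one average weighted by game count."""
    total_games = sum(games for games, _ in results)
    if total_games == 0:
        raise ValueError("no games to combine")
    return sum(games * value for games, value in results) / total_games


# Slotting figures from the text: (games rolled out, win percentage).
slotting = [(864, 49.5), (1800, 49.9), (12960, 50.0)]
print(round(combine_rollouts(slotting), 2))  # 49.96
```

Because the 12960-game rollout supplies most of the trials, it dominates the pooled figure.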
To get the combined standard deviation is trickier and I won't go into it here. The important principle is that if you increase your sample size by a factor of F, you only get a sqrt(F) reduction in your standard deviation. To cut the standard deviation in half, you have to quadruple your sample size.
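The square-root principle can be turned around to estimate how many games a target standard deviation would cost. This is only a sketch: the function name and the sample figures are hypothetical, and it assumes the reported standard deviation shrinks exactly as 1/sqrt(games).

```python
def games_needed(current_games, current_sd, target_sd):
    """Games required to reach target_sd, given the sd seen after current_games.

    Assumes sd is proportional to 1/sqrt(games), so the required sample
    grows with the square of the desired reduction.
    """
    if target_sd <= 0:
        raise ValueError("target_sd must be positive")
    return current_games * (current_sd / target_sd) ** 2


# Halving the standard deviation takes four times the games.
print(games_needed(1296, 0.010, 0.005))  # 5184.0
```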
You wrote:
Slotting showed an ev of -.014, with an std of .011, while splitting had an ev of +.007, and an std of .010. Is that not exactly 2 std?
The standard deviation expresses uncertainty about an equity by itself, but this isn't the right value for comparisons between plays. Here, slotting should be thought of as about -.014 ± .022, two standard deviations either way (that is, it's very likely somewhere between -.036 and +.008), and splitting as +.007 ± .020 (somewhere between -.013 and +.027). The point is that there is uncertainty in both equities.
When standard deviations for two plays are similar, as they are here and as they usually are when you do rollouts with the same number of games, you can think of the standard deviation of the difference between the two plays as about 1.4 times the average of the two standard deviations. Here the difference between the two plays is +.007 - (-.014) = .021. The approximate standard deviation of this difference is 1.4 x (.011 + .010)/2 = .015. So the difference here is about 1.4 standard deviations, not two.
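The 1.4 factor is roughly sqrt(2). As a check, here is a short sketch comparing the rule of thumb with the root-sum-of-squares formula for the difference of two estimates. It assumes the two rollouts are statistically independent, which the text does not state explicitly; the function names are mine.

```python
import math


def sd_of_difference(sd_a, sd_b):
    """Standard deviation of (a - b) for two independent estimates."""
    return math.hypot(sd_a, sd_b)  # sqrt(sd_a**2 + sd_b**2)


def sd_of_difference_approx(sd_a, sd_b):
    """Rule of thumb from the text: 1.4 times the average of the two."""
    return 1.4 * (sd_a + sd_b) / 2


diff = 0.007 - (-0.014)
exact = sd_of_difference(0.011, 0.010)
approx = sd_of_difference_approx(0.011, 0.010)

print(round(diff, 3), round(exact, 4), round(approx, 4))  # 0.021 0.0149 0.0147
print(round(diff / exact, 1))  # 1.4
```

Both versions put the gap between slotting and splitting at about 1.4 standard deviations.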
Let me summarize the points here:
I have, obviously, opinions about these questions, but I do not know the answers, and look forward to Monty's followup.
- Jake Jacobs
I: This can't be answered simply either. Let's try to look at the issues one by one.
Sampling (e.g., 20%).
Sampling is unlikely to make a big difference. The evaluations that you get with 20% and 100% tend to be very close both in absolute equity values and in play selections. 20% plays a little bit worse than 100%, but not much. When 20% picks a worse play, it is almost always a play that 100% thinks is a decent choice.
There is an interaction effect between sampling and cubeful rollouts, because the cube turns and cubeful equities rely on the absolute (as opposed to relative) values of the evaluations. But I believe that sampled evaluations are almost always at roughly the same levels as the 100% evaluations, so there are no significant problems specific to cubeful rollouts. This is in marked contrast to comparing 1-ply with 3-ply cubeful rollouts. Between 1 and 3-ply you often have evaluations differing by .2 or more, and whether you do your cube evaluations 1-ply or 3-ply will make a very big difference.
By the way, JellyFish level 7 also uses sampling; you just don't have any option to adjust it or turn it off.
Search space (e.g., tiny).
For certain kinds of positions the search space will make a big difference. For most of them, it does not, and in general you can use the tiny search space with confidence. But you shouldn't think of the smaller search spaces as restricting you to the 1-ply choice. The 1-ply choices are screened according to your selected criteria, evaluated at 2-ply, screened again, evaluated at 3-ply, and then the best play according to the 3-ply evaluation is made. With a perfect search space you screen out all the stupid plays but none of the best plays. In practice, you get pretty close to this with the tiny or small search spaces for most positions.
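The screen-then-deepen procedure can be sketched as below. This is an illustration only, not Snowie's code: the function name is made up, the evaluators are stand-ins, and the screen is modelled as "keep the top N plays", whereas the real selection criteria may work differently.

```python
def select_play(candidates, evaluators, keep):
    """Choose a play by screening candidates through successively deeper plies.

    candidates: list of legal plays.
    evaluators: [eval_1ply, eval_2ply, eval_3ply], each mapping play -> equity
                (higher is better for the player on roll).
    keep:       how many plays survive each screen, e.g. [3, 2]; the final
                evaluator only ranks the survivors.
    """
    if not candidates:
        raise ValueError("no legal plays")
    survivors = list(candidates)
    for depth, evaluate in enumerate(evaluators):
        ranked = sorted(survivors, key=evaluate, reverse=True)
        # Screen after every ply except the last one.
        survivors = ranked[:keep[depth]] if depth < len(keep) else ranked
    return survivors[0]


# Toy equities for four hypothetical plays at each ply.
one_ply = {"A": 0.10, "B": 0.08, "C": 0.07, "D": -0.30}
two_ply = {"A": 0.05, "B": 0.09, "C": 0.08, "D": 0.40}
three_ply = {"A": 0.02, "B": 0.06, "C": 0.10, "D": 0.50}

best = select_play(list(one_ply), [one_ply.get, two_ply.get, three_ply.get], keep=[3, 2])
print(best)  # C
```

In this toy data the 1-ply favourite A does not survive to the end, so the result is not simply the 1-ply choice. Play D, which the deeper evaluations like best, is screened out at 1-ply; that is the kind of miss a narrow search space risks in unusual positions.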
JellyFish, too, has a search space. Any bot that plays with 3-ply in real-time must. In some positions you can have thousands of legal moves, and it simply isn't worthwhile to evaluate them all at 3-ply.
An important point is that it's not so important that a rollout pick the best play each turn. What is important is that it never pick really bad plays. Using the smaller search spaces Snowie will occasionally miss the play it would have thought best with a huge space, but only rarely will the play it selects be a bad play. Some equity is probably given up on that turn, but equity is given up lots of times even when Snowie plays at 100% huge.
There is very little interaction between the search space and cubeful rollouts, because search space affects play selection. If you change the search space the bot will sometimes play differently, whether you are doing a cubeful or cubeless rollout. Disregarding the changes in moves made, the changes in cube actions will be very rare and insignificant.
I tend to use 20% sampling with tiny and small search spaces in my rollouts, and 100%-huge when doing an analysis.
Comparing 1-ply Cubeful to 3-ply Cubeful
So if you do a 20% tiny cubeful rollout, is it the same as a 1-ply cubeful rollout? No, not at all.
The basis for cubeful rollouts is still the cubeless evaluation. If you do a cubeless evaluation on 1-ply, then 2, and then 3, regardless of search space and sampling parameters, you will often see big changes. A position can go from not good enough to a drop, or conversely from too good to no double. If you use 3-ply to do your cubeful rollout, you get the benefit of the better evaluations in making these cube decisions.
Does this mean that the 3-ply result will be more "accurate"? Let's go back to what a rollout tells us: it tells us the equity assuming both players play exactly as the bot does with the settings we've specified. Every rollout is perfectly accurate for this question. As for perfect play, or in the Finals tomorrow against a strong player: well, generally 3-ply plays better than 1-ply, so most of the time its rollouts should be closer to the theoretical truth. But there are no guarantees.
In my article I assumed a correct implementation of variance reduction. And everything else, for that matter. Once you assume something might be wrong, anything can happen.
There is cause for caution with Snowie's rollouts. Snowie 3 is a complicated program with lots of changes relative to version 2. I've seen many bugs in the released version. Chuck Bower posted a nice position on the GammOnLine bulletin board showing what certainly looks like a bug in the variance reduction algorithm. A position rolled out 1-ply by Chris Yep without truncation or variance reduction reports an equivalent games greater than the actual games rolled out, which makes no sense.
For the most part Oasya hasn't responded to bug reports in public forums. I'm sympathetic to them because as far as I know it's just Olivier and André and they have an awful lot of work preparing the next version for us. But with so much complicated stuff going on, and with so little of it adequately documented, it makes sense to scrutinize Snowie rollouts carefully.
Thanks for your questions, Jake. I hope my answers are of some use.
-David Montgomery
Two other players contacted me with questions.
From Rob Maier I learned that it may seem that I assumed that the evaluation of a position before the roll is equal to the average of the evaluations after the roll.
But this isn't assumed; it's a direct result of the way the before-roll evaluation is calculated. The before-roll evaluation is the average of the continuations. It may be useful to think of the before-roll evaluation as a 2-ply or level 6 evaluation, while the after-roll evaluations are 1-ply or level 5.
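To make the point concrete, here is a minimal sketch of a before-roll evaluation built as the average of the after-roll evaluations. The evaluator is a toy stand-in and all names are hypothetical. It also shows the consequence that matters for variance reduction: measured against that average, the luck of the 36 rolls sums to zero.

```python
from itertools import product

ROLLS = list(product(range(1, 7), repeat=2))  # 36 equally likely ordered rolls


def before_roll_equity(position, after_roll_equity):
    """Average the after-roll equities over all 36 dice rolls."""
    return sum(after_roll_equity(position, roll) for roll in ROLLS) / len(ROLLS)


def luck_of_roll(position, roll, after_roll_equity):
    """Luck = equity after this roll minus the before-roll average."""
    return after_roll_equity(position, roll) - before_roll_equity(position, after_roll_equity)


def toy_evaluator(position, roll):
    """Stand-in evaluator: doubles are good, everything else slightly bad."""
    return 0.5 if roll[0] == roll[1] else -0.1


before = before_roll_equity("toy position", toy_evaluator)
total_luck = sum(luck_of_roll("toy position", roll, toy_evaluator) for roll in ROLLS)
print(abs(before) < 1e-9, abs(total_luck) < 1e-9)  # True True
```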
From Jeremy Bagai I learned that it may seem that I was implying you get exactly the same equity whether you use variance reduction or not. You get the same equity on average; or equivalently, the same equity if you roll a position out both ways forever. For equivalent sample sizes, the distribution of rollout results is the same. But because a rollout is a random process, two rollouts are quite unlikely to give the exact same equity. This is true whether or not variance reduction is used.
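A toy simulation may help here. It is not a backgammon rollout: the "true equity", the uniform luck term, and the assumption that the luck estimate captures 90% of the real luck are all invented for illustration. Both methods centre on the same value; with the same number of actual games, the variance-reduced results are simply less spread out, and different seeds still give different answers either way.

```python
import random
import statistics

TRUE_EQUITY = 0.2  # hypothetical "right answer" for the toy position


def toy_rollout(games, seed, variance_reduction):
    """Return (mean, per-game standard deviation) of a toy rollout."""
    rng = random.Random(seed)
    results = []
    for _ in range(games):
        luck = rng.uniform(-1.0, 1.0)  # zero-mean dice luck
        outcome = TRUE_EQUITY + luck
        if variance_reduction:
            # Subtract an imperfect, zero-mean luck estimate (90% of the luck).
            outcome -= 0.9 * luck
        results.append(outcome)
    return statistics.mean(results), statistics.stdev(results)


for seed in (1, 2):
    print(seed, toy_rollout(2000, seed, False), toy_rollout(2000, seed, True))
```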
Thanks to Rob and Jeremy for pointing out these issues for clarification.