Quantifying Backgammon Skill

by Chuck Bower

I. RECENT HISTORY: TEN YEARS OF NEURAL NET BACKGAMMON.

It's interesting to follow the evolution of respect (or lack thereof) that computer players have received over the past 10 years. In 1990, ExpertBackgammon for the PC (EXBG) was the only commercially available game with any reasonable backgammon skill at all, and it was widely branded "intermediate". Then in 1991 Tesauro started a revolution with TD-Gammon, the first neural net player. Its performance was met with skepticism, but it was immediately seen that TD-G was a vast improvement, and likely an indication of future strengths of computer players (since tagged 'robots' or 'bots', for short).

Initially the opportunity to play against TD-Gammon was the privilege of a chosen few. Unfortunately TD-Gammon didn't become available to the public until later (1995), as part of a carrot by IBM for enticing PC users to install the OS/2 Warp operating system. This public release was a play-only version, and although it would have been a very exciting opponent in 1991, by the time TD-G was publicly released, the first commercially available robot -- Jellyfish (JF) -- had already hit the scene in a big way (in 1994). JF's name was chosen because its brain is said to be of similar strength to that of the sea creature. (And I didn't realize that jellyfish could even play backgammon. :) With JF's wide distribution, the critics were quite numerous, if mixed of opinion. For example, Bill Robertie, who had been one of TD-G's biggest fans, took a mildly contrarian view of Jellyfish1.0 in the periodical INSIDE BACKGAMMON vol. 5, #1 (Jan-Feb 1995): "Although we've been pointing out JellyFish errors... let me reiterate... that it does far more things right than it does wrong, and there's no doubt in my mind that it's the strongest commercially-available program right now. (Its closest competition was EXBG. --CB) But it's not [yet] as strong as TD-Gammon, and it's not a world-class opponent."

However, regardless of the negative criticisms levelled at the robots, many experienced players saw the value of using them for rollouts, even those performed by the weak bot EXBG. Meanwhile, several homegrown neural net backgammon players were appearing on the internet servers. Some of these were quite formidable opponents, LONER being among the best and probably at least on par with JF and TD-G. Among the general populace (particularly, online players), the robots' strengths were receiving wide acclaim. Among top-level, seasoned tournament players, however, there was still considerable skepticism. See, for example, Nack Ballard's 1997 comments in an annotation of a Backgammon game between Ron Karr and Richard McIntosh.

Jellyfish's strength and reputation continued to grow with each release. Then in 1998, a second commercially available neural net player -- Snowie (name shortened from 'SnoWhite' to avoid potential legal battles with Disney) -- was marketed. Currently in its 3rd release, Snowie has become the accepted authority on the game. Despite its documented shortcomings, there has been a tendency for its evaluations and especially its rollouts to be taken as gospel. If Snowie says you are wrong, then you are wrong. Period. That view may seem extreme, but it is common among Snowie users. To the more objective comes the question "but how do we know...?"

II. SKILL MEASUREMENT METHODS.

A. Head-to-head Money Play.

One logical reason for skepticism of a player's strength (robot or human) is the difficulty in objectively measuring that strength. Unlike tennis or chess, backgammon is permeated with randomness and it's very difficult to filter out the luck to find the skill that remains. Prior to the 1960's (when tournament play was popularized by Obolensky), the only measure of skill was head-to-head money play. Even here, many long sessions are likely to be required if statistically significant results are desired. The standard deviation of a money session is approximately 3*sqrt(N) where N is the number of games played.

B. Tournament Performance.

With the popularization of tournament backgammon in the 60's and especially during the 70's, a new method of measurement appeared. Initially there was little quantitative data recorded, and a player's reputation was based as much upon opinion as fact. Paul Magriel was widely regarded as the best player of the 70's, based partly upon his excellent trophy collection, but likely also on his worldwide media exposure.

In the 80's, Kent Goulding (KG) and associates began logging tournament performance, not just wins and losses but also the strength of the opponent. Mimicking chess ratings, they instituted a (US) national tournament performance rating and ranking system. Unfortunately this doesn't help with robot skill measurement because non-humans have never been allowed to enter tournaments against humans.

C. Online Performance.

With the advent of internet online play in 1989, yet another yardstick was available -- online performance ratings. This method is identical to the KG system used for tournaments and mentioned above. However, finally robots could be rated on an equal footing with human players. There was a minor systematic error which cropped up online that hadn't been a problem for KG -- dropping. This larcenous activity, fortunately the practice of a minority, consists of not finishing matches (and thus not having them affect one's ratings) which are virtually guaranteed to end in losses for the dropper.

Not only is the dropper's rating thus higher than his/her true skill level, but the droppee's rating is equivalently lower than his/her/its skill level. To add to the discrepancy, it has been speculated (by David Montgomery) that robots are more likely to be dropped upon as compared to humans, and thus will be likely to suffer larger ratings deficits compared to their human competitors. Though unproven (to my knowledge), this theory is reasonable, analogous to the fact that many people (e.g. shoplifters and tax cheats) are known to be more likely to commit crimes against nameless, faceless organizations than against individual humans. Also, a bot is probably less likely to complain to the online administrator about being dropped upon, or at least that is a potential rationale.

D. Third Party Judgement.

If a competitor's play (for example, a match) is recorded, then an intelligent, objective third party could review that record and determine skill level. A big advantage here is that most of the luck is irrelevant, since the performance was only measured based upon how the given roll was played, regardless of whether that roll was actually beneficial to the player or not.

Better than simply "intelligent (and) objective" analysis is quantitative analysis. In the July-August 1995 issue of the Hoosier Backgammon Club newsletter, I made a comparison of Jellyfish evaluations, Jellyfish rollouts, TD-Gammon evaluations, TD-Gammon rollouts, and EXBG rollouts. (Expert BackGammon did not give quantitative evaluations. It merely reported its decision: play or cube.) This study was based upon 10 positions which had been written up in INSIDE BACKGAMMON (May-June 1994), and that's where I got the TD-G results. My conclusion (which admittedly was not at a statistically significant level) was that, assuming TD-Gammon rollouts as the benchmark, Jellyfish1.0 level-6 (2-ply) evaluation was better than TD-Gammon evaluation for those 10 positions.

The most recent versions of Snowie have incorporated the best ever skill measurement tool -- full match analysis. Snowie will record a match that it has played, or import a match played on a server between any two opponents, a Jellyfish match, or even a hand recorded match (when transcribed into the proper form). It can then analyze the entire match, one play at a time, with either evaluations, rollouts, or some combination of these. A quantitative error figure is reported for each play, and those errors are tallied at the end to give a rating. Snowie will also keep a cumulative record of performance in any number of matches.

III. QUANTIFYING ROBOT SKILL.

A. Head-to-head money play: JF vs. World-class Humans.

In 1997 Malcolm Davis initiated a contest by inviting two of the world's best human players, Nack Ballard and Mike Senkiewicz, to Texas to play against Jellyfish3.0. Human players put up their own money and Harvey Huie backed Jellyfish. Ballard and Senkiewicz were not teamed up, so actually there were two independent tests. Dice were human rolled to remove any concern that JF's generated dice were less than random. Each contest consisted of 300 independent money games. Coincidentally, Jellyfish finished dead even, beating Senkiewicz by 58 points and losing an identical amount to Ballard. JF's creator, statistician Fredrik Dahl, was quick to point out that a 58 point win in a 300 game sample is insufficient to conclude superiority. Ballard's win and Senkiewicz's loss were only significant at around one standard deviation each -- not particularly meaningful. Taken together, clearly neither the human race nor the droids could even hint at having an edge.
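
As a rough check on the "around one standard deviation" figure, the rule of thumb from Section II.A (a standard deviation of about 3*sqrt(N) points for N money games) can be applied to a 58 point result over 300 games. The sketch below is only illustrative: the function names are my own, and the factor of 3 is the approximation quoted earlier, not a measured value.

```python
import math

def session_sd(n_games, sd_per_game=3.0):
    """Approximate standard deviation (in points) of a money session.

    Uses the rule of thumb 3*sqrt(N); sd_per_game=3.0 is that
    approximation, not a measured quantity.
    """
    return sd_per_game * math.sqrt(n_games)

def z_score(net_points, n_games, sd_per_game=3.0):
    """Net result expressed in units of the session standard deviation."""
    return net_points / session_sd(n_games, sd_per_game)

sd_300 = session_sd(300)   # spread expected from the dice alone
z = z_score(58, 300)       # size of a 58 point win or loss in SD units

print(f"SD over 300 games: {sd_300:.1f} points")
print(f"58 point result: {z:.2f} standard deviations")
```

Under these assumptions a 58 point swing sits only slightly above one standard deviation, which is well within what the dice alone would be expected to produce.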

B. Head-to-head matchplay: JF vs. SW.

Shortly after the release of Snowie, Larry Strommen approached both Olivier Egger (Snowie's creator) and Fredrik Dahl to propose a common interface so that the two best robots could go at each other without human intervention. Hundreds of matches could be contested in a reasonable (about a week) time period. Neither showed interest. I've since heard of matches with human intervention, but these have been understandably time consuming and the results are not statistically significant. (For example, if player A has a match win expectation of 55% against player B, it would take on the order of 400 matches to establish this at the 95% confidence level.)
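
The "order of 400 matches" estimate can be reproduced with the usual normal approximation to the binomial. The sketch below assumes that "establishing" a 55% win expectation means the two-sided 95% confidence interval (z = 1.96) just excludes 50%; that reading, and the function name, are my assumptions rather than anything stated above.

```python
import math

def matches_needed(p_win, z=1.96, p_even=0.5):
    """Matches needed before the confidence interval around p_win
    no longer includes p_even (normal approximation to the binomial).

    z=1.96 corresponds to a two-sided 95% interval (an assumption here).
    """
    margin = p_win - p_even                       # edge to be resolved
    sd_one_match = math.sqrt(p_win * (1.0 - p_win))
    # Require z * sd_one_match / sqrt(n) <= margin and solve for n.
    return math.ceil((z * sd_one_match / margin) ** 2)

n_55 = matches_needed(0.55)
print(f"Matches needed for a 55% favorite: {n_55}")
```

A smaller edge drives the requirement up quickly, since the number of matches scales with the inverse square of the margin over 50%.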

C. Online Performance: JF and SW.

Both Snowie and Jellyfish have played thousands of matches on FIBS. One might think that their performance there could give a relative measure of their strengths. This is true to a point, but unfortunately with large enough (this time both statistical and systematic) uncertainty that one cannot conclude which is the better player. The problems are twofold. First, there is a temporal problem: Jellyfish had stopped playing on FIBS before Snowie started. It is well documented that server ratings vary with time, primarily due to the preferential retirement of weaker players. The second (statistical) uncertainty has to do with the fluctuations in ratings, due primarily to the dice. Although I'm not aware of the true numbers, I believe it has been shown that swings of +-100 points for even a highly experienced online player are not uncommon. Relative to all players on FIBS, both JF and SW were consistently in the top 5. All that we can say from the FIBS experience is that the two bots are close to equal.

D. Third Party Judgement: comparing the skills of JF and SW.

Up to now, this article has been merely a review of past history. Recently I played a 19-point match against Snowie3.2, 3-ply (huge, 100%) and then let Snowie grind away for over 8 days doing a 2-ply rollout of EVERY checker play as well as all cube decisions, but only for Snowie's side. (I didn't have 16 days to commit my PC to get the rollout analysis for both sides!) The 'Huge' search space was used for both checker play and cube decision rollouts. 144 trials truncated at 10 were performed on several candidates for each of Snowie's plays. 360 UNtruncated trials measured the cube actions as well as failure-to-double errors. The entire match can be found HERE.

The end result was that Snowie rollouts judged that Snowie player committed errors at the average rate of 1.753 millipoints per move (mppm). (For comparison, if a player made only two errors in 100 moves, with one of magnitude 0.100 cube-adjusted equity and one of magnitude 0.075 cube-adjusted equity, the net result would be 1000*(0.100 + 0.075)/100 = 1.75 mppm.)
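
The mppm figure is simply the total equity given up, scaled by 1000 and divided by the number of moves. A minimal sketch of that arithmetic follows; the function name is my own, and moves played without error are assumed to contribute zero.

```python
def error_rate_mppm(equity_errors, n_moves):
    """Average error in millipoints per move (mppm).

    equity_errors: cube-adjusted equity lost on each erroneous move.
    n_moves: total number of moves, including the error-free ones.
    """
    return 1000.0 * sum(equity_errors) / n_moves

# The comparison case from the text: two errors in 100 moves.
example_rate = error_rate_mppm([0.100, 0.075], 100)
print(f"{example_rate:.2f} mppm")
```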

I then stepped through the entire match with Jellyfish as a kibitzer and asked JF3.0, level-7 (3-ply), time factor = 1000, to indicate how it would have made each play and handled the cube. The same Snowie rollouts measured JF's error rate for this match at 2.533 mppm.

In an attempt to be fair, I then took three 7-point matches I had played against JF3.0, level-7 last year whose total number of moves (702) was close to the Snowie 19-point match (681 moves for both sides). I had Snowie roll these matches out play-by-play at the same 2-ply settings used in the Snowie 19-point match analysis. Snowie rollouts said that Jellyfish's error rate was 1.164 mppm. I also had Snowie do a simple evaluation (huge, 100%) of these three matches and tallied up the errors that Snowie would effectively have made in the same situations. Here the Snowie rollouts reported that Snowie's (player) error rate was 0.672 mppm.

E. Statistical Analysis.

As with any measurement, there are both statistical and systematic uncertainties associated with these quantities. To get an estimate, I looked at the standard deviations in JF3.0 level-7's error rate in a subset of 7-point matches played against me over the past year. I also computed a standard deviation of my own error rate. Note that these distributions aren't Gaussian. In particular, they aren't even symmetric, since there is a hard lower bound (zero) but no upper bound. I eliminated 11 outliers (matches with large error rates) from a sample of 81 matches to determine the standard deviations. I then multiplied by the square root of the mean number of plays in those remaining 70 matches (N=224.5) to come up with the standard deviation of the error rate per move. JF's standard deviation was 9.38 mppm and mine was 23.40 mppm.

I can think of two reasons why the standard deviation of my errors is different from (in fact, larger than) JF's. Firstly, my mean error rate for the 81 matches was considerably higher (5.042 mppm) than JF's (1.502 mppm). Often the magnitude of fluctuations tracks the magnitude of the quantity they measure. In addition, humans are vulnerable to many external factors such as concentration breaking distractions, stamina issues, and confidence variations that contribute to more erratic play than for the robots.

In summary, the statistical error on a SW rating of a series of plays (for example, a match) is just s.d.m./sqrt(N), where s.d.m. is the per move standard deviation on the error rate (9.38 for the robots, as measured above for JF and applied here to SW as well, and 23.40 for a 'typical' human) and N is the number of plays for BOTH sides.

We can now assign statistical uncertainties to the earlier error rates for Snowie (player) and Jellyfish for the 19-point match: SW == 1.753 +- 0.705 (95% confidence) and JF == 2.533 +- 0.705. For the three 7-pointers the numbers are SW == 0.672 +- 0.694 and JF == 1.164 +- 0.694. For all 81 7-point matches (17,473 plays) between JF and me: JF == 1.502 +- 0.139. CRB == 5.042 +- 0.347. (All uncertainties are at 95% confidence level.)
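
The quoted uncertainties appear to be consistent with 1.96 * s.d.m./sqrt(N), that is, the statistical error from the summary formula scaled to a two-sided 95% interval. The factor 1.96 is my inference from the numbers rather than something stated explicitly, and the names in this sketch are illustrative only.

```python
import math

Z_95 = 1.96        # assumed two-sided 95% multiplier (inferred, not stated)
SD_ROBOT = 9.38    # per move standard deviation in mppm (measured for JF)
SD_HUMAN = 23.40   # per move standard deviation in mppm (the author's play)

def half_width_95(sd_per_move, n_plays):
    """95% half-width (mppm) of an error rate measured over n_plays,
    where n_plays counts the plays of both sides."""
    return Z_95 * sd_per_move / math.sqrt(n_plays)

intervals = {
    "19-point match, robot (681 plays)": half_width_95(SD_ROBOT, 681),
    "three 7-pointers, robot (702 plays)": half_width_95(SD_ROBOT, 702),
    "81 matches, JF (17,473 plays)": half_width_95(SD_ROBOT, 17473),
    "81 matches, CRB (17,473 plays)": half_width_95(SD_HUMAN, 17473),
}

for label, width in intervals.items():
    print(f"{label}: +- {width:.3f} mppm")
```

Because the half-width shrinks only as 1/sqrt(N), going from one long match to 81 matches narrows the robot interval by roughly a factor of five, not eighty.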

Systematic uncertainties are always difficult to quantify. Qualitatively, two sources of systematic error are the 'robot bias' (particularly the Snowie bias) and the method of analysis. Snowie bias comes about because Snowie rollout is not a perfect judge of skill. The rollout results contain systematic uncertainties based upon Snowie's less than perfect play. In addition, since the robots tend to play similarly, Snowie will likely give another robot (JF) a higher rating than it deserves. I don't really have much of a feel for the magnitude of this effect, but would crudely guess it's around 0.5 mppm for a typical 7-point match.

The systematic error due to the analysis method could be quantified but I haven't had the time to devote to that effort. Basically there are three ways for Snowie to analyze a match: evaluation only, rollout only, and a combination. Most people either do evaluation only or have Snowie evaluate and then roll out plays where the evaluation error is larger than some threshold. For example, in my 81 matches with Snowie, all plays and cube decisions where evaluation indicated an equity error of 0.030 or higher were rolled out. For positions that are rolled out, the rollout result takes precedence over evaluation.

For the above described combination method (roll out positions with error greater than some specified threshold), only two candidates are rolled out for each error position: the actual play made in the match and the play Snowie evaluation considers best. If some third candidate would do better in a rollout, this will not be discovered because that candidate will not be rolled out. When all positions of a match are rolled out, however, then typically several candidates are rolled out for each position. In this method there is less likelihood that the best play (that a rollout could find) falls through the cracks. A complete match rollout is therefore a stricter judge of quality of play than a combination method evaluation. My interpretation of this is that the systematic error of a match which is completely rolled out is less than the systematic error for a match which is only partially rolled out.

F. Third Party Judgement: Human Ratings.

Harald Johanni has been analyzing human vs. human matches for a few years now and has built up a database of matches which have been analyzed by Snowie. He does a 3-ply (huge, 100%) analysis and then rolls out (2-ply) all errors with magnitude larger than 0.1 cube-adjusted equity. He publishes a ranking/rating list (http://www.backgammonmagazin.de/Start.htm) based upon those analyzed matches. Dirk Schiemann is the top rated player with an error rate of 2.927 mppm based upon having had 17 of his matches analyzed. He is the only player in Johanni's list with an error rate under 3 mppm. 13 players have an error rate under 4 mppm and only 44 have an error rate under 5 mppm.

Although Johanni's system is far from comprehensive in rating tournament players, consider this: his top 30 players include 17 with five or more analyzed matches. They are Schiemann, Grandell, Paul Weaver, Heitmuller, Levermann, Muysers, Mads Andersen, Sax, Ballard, Johanni, Magriel, Winslow, Goulding, Karsten Nielsen, Robertie, Granstedt, and Meyburg. Their average rating is 3.93 +-0.15 (conservative 95% confidence) for 327 matches.

I can think of three reasons why a human's rating can be expected to be poorer (that is, a higher error rate) than that of an equally skillful robot. Two have been mentioned previously: human vulnerabilities (for example: fatigue, concentration, confidence) and Snowie bias. The third is a psychological factor. Robots only make the plays they consider to be best, technically (that is, the plays they would make against an equal opponent). Sometimes a human player will intentionally make a technically inferior play, but one s/he expects the opponent to misreact to, the overall effect being a higher equity gain compared to making the best technical play. This can be especially valuable in doubling actions. For example, a double may be technically too early, but still correct if there is a reasonable chance the opponent might pass.

IV. CONCLUSIONS.

Although the skill measurement methods detailed here give the analyst many more options than were available even as recently as 3 years ago, there is still considerable systematic uncertainty in their results. We may have to wait until the game of backgammon has been solved to have a really accurate rating/ranking system. In the meantime, although Jellyfish and Snowie are among the best players in the world, one should keep in mind that they are not perfect, and even the most powerful rollout result isn't exact. Always keep one skeptical eye open.
