
QUALIFYING SYSTEMS: A REPORT OF THE NSA RATINGS COMMITTEE

Release version, October 1st, 1997

Steven Alexander
John Chew
Jerry Lerman
Robert Parker
Dan Stock, chair
Jeff Widergren

SECTION I. Introduction

The Ratings Committee was formed to investigate questions of Rating Systems and how they should be applied. Rating Systems (RSs) assign players some kind of rating, often a number, that indicates approximate relative playing strength based on past performance. Qualifying Systems (QSs) choose a subset of players to play in certain elite tournaments. RSs and QSs are inextricably linked, in that one would expect both to be able to choose the "best" players.

The first objective of the Ratings Committee was to come up with recommendations regarding a QS appropriate for NSA use.

In doing so, we first considered a number of criteria by which QSs may be judged. Those criteria are discussed in Section II of this document. We refer to the criteria by the names of Feasibility, Accuracy, Appearance of Fairness, Encouragement, Cheatproofing, Explainability, and Calculability.

There are a surprising number of difficulties in finding an appropriate QS. To answer questions about why we took so long to make this report, what we have been doing, and why we have ended up with a system with admitted drawbacks, an explanation of some of the difficulties of QSs and RSs is in Section III.

A QS consists of two parts: a set of logistical rules and a qualification statistic algorithm. The logistics include such items as the minimum number of games a player must play to qualify, the length of the qualification period, and so on. The qualification statistic algorithm (QSA) is a way to figure out a statistic that approximates ability shown during the qualification period. One example of a QSA is to take the peak rating during some part of the qualification period.

We agreed that the logistics of qualifying are quite important. We have a number of recommendations for logistics, which are discussed in Section IV. Each of these logistical recommendations directly affects at least one of the criteria in Section II, thereby reducing the demands on the QSA.

QSAs are discussed in Section V. We discuss several different QSAs, including ones used in the past, ones favored by minorities of the committee, and the one we recommend.

SECTION II. Criteria for Selecting a QS

Early in our discussions, the committee decided to base its work on seven criteria for judging QSs.

The first criterion is Feasibility. This criterion says that a QS should be implementable, and should not require an excess of computing resources.

The second criterion is Accuracy. This criterion says that a QS should choose the players who, in some statistically reasonable sense, have performed the best over an appropriate qualifying period. The Accuracy criterion includes the subcriterion of "significance": i.e., assuring that enough data is considered that significant statistical inferences can be made. Significance can be tuned by setting the minimum number of games, the length of the qualifying period, and so on.

The third criterion is Appearance of Fairness. All QSs have biases. In general, more accurate systems will appear to be more fair. However, some systems make their biases so clear that players who lose out will feel that they have been unnecessarily cheated. This criterion says that a QS should avoid clear biases in favor of players who perform well in certain tournaments or at specific times (over those who perform well in other tournaments or at other times in the qualifying period). Additionally, a QS should try to minimize the number of players who are not selected, yet who appear to have a "legitimate" claim to having been, during the performance period, substantially superior to some of those who are selected. One committee member characterized this criterion as "One's qualification chances should be invariant under meaningless translations of games won."

The fourth criterion is Encouragement. This criterion says that a QS should not discourage players from playing for fear that they could lower their probability of qualifying -- i.e., players should not be tempted to "sit on their ratings". Similarly, a QS should not encourage players to otherwise affect their playing style (e.g., to compete in some tournaments but not others). Another part of the Encouragement criterion is to assure currency: that is, to assure that players are not "rusty" by the end of the qualification period. It is important to note that the Encouragement criterion does NOT try to give a player a better chance of qualifying just by having played more games -- that would violate both the Accuracy and Appearance of Fairness criteria.

The fifth criterion is Cheatproofing. This criterion says that a QS should discourage players from "throwing" games to allow others to qualify. Fortunately, cheating of this kind has not been a significant problem in the past, and the rules of the game already deal harshly with such a situation. However, some committee members considered it important for the QS to discourage such behavior by having any loss directly affect one's chance of qualifying (as may not happen, for example, with a "Peak-Rating" type of QSA).

The sixth criterion is Explainability. This criterion says that a QS should be easy to explain to a player who is interested.

The seventh criterion is Calculability. This criterion says that a QS should allow players to calculate with relative ease where they stand. Calculability and Explainability are the two different facets of the idea that a QS should be "simple".

Committee members had substantial disagreements as to how heavily the various criteria should be weighted -- or indeed, whether some are significant at all. We decided to have members judge possible QSs in two ways: both by their own weighting of the criteria, and by a compromise weighting scheme. We consider it important that the final QS not only be good when judged by the compromise scheme, but also that a majority of the members find it good when judged by their own favorite scheme. In our compromise, Feasibility is considered vital. The other criteria are weighted with the following percentages:

It is important to note that many of these criteria can be judged only subjectively. Combining subjectively determined criterion scores according to a precise schedule can give the misleading impression of more accuracy than is actually present. Individual scores, and any reasons given for them, should be given full consideration. If anywhere in this report we fail to acknowledge the subjectivity that pervades these measurements, please excuse the shortcoming.

SECTION III. Difficulties of QSs and RSs

About November of 1996, after deciding on criteria, a majority of the committee decided that efforts should be made to improve the rating system. It was believed that with an improved RS, the QS would be simple. After all, both the RS and the QS have similar goals: to choose the best players. Skeptics doubted that the RS could be improved in the limited time available.

One committee member volunteered to do programming. He spent over 250 hours in early 1997 evaluating and understanding the current ratings software, and writing new statistical software to evaluate rating systems based on their behavior with data derived from four years of tournament results.

By March, he had proposed a new system that seemed very promising indeed. Even the skeptics were impressed by the initial data. Unfortunately, due to complications with real life, further analysis of the data took several months. And when it did come, the results were bad: the new system would have favored high-rated players who tended to play low-rated players (rather the opposite of the current system).

Further analysis of rating systems will be done. We are still hoping to find a rating system that will be so statistically impressive that it can be used directly as a QSA. In the meantime, however, we needed to come up with a QS without modifications to the RS.

Why is it so hard to fix the rating system?

Many outside the committee have suggested simple fixes to the existing rating system, such as "just increase the standard deviation" and "just include a luck factor". Two different committee members, in separate studies, have analyzed such simple fixes. The conclusion is simple: they do not work. In particular, neither "simple" change can alter the current tendency of the rating system to "flatten" -- that is, to underpredict how often a lower-rated player will beat a higher-rated player. Indeed, one of the committee members proposed a theoretical model that could help to explain why such "flattening" is endemic to systems like the current one -- however, the applicability of the model was a matter of some disagreement on the committee.

Factors to consider for rating systems include not only the seven criteria for QSs, but also ratings deflation or inflation; long-term system stability; volatility; and other factors. Extensive simulation of a number of current and proposed rating systems shows one thing: the problem is not simple. And a "simple" fix to the ratings system, if not properly analyzed in advance, is likely to cause more aggravation than no fix at all.

Working on the rating system is a significant software engineering project being done on volunteer time. We would all like it to be finished, perfect, and ready to go immediately, but that is not going to happen. Since this software project does not put dollars in our pockets, many members of the committee have been unable to give it all the time it needs -- real life does intervene. By sheer bad luck, it has been a difficult year for many of the committee members. In the meantime, the preexisting work of such individuals as Alan Frank, Jim Homan, Dan Pratt, Brian Sheppard, and Charlie Southwell should be recognized as quite respectable. While the system does have its problems, it is generally true that players with substantially higher ratings are stronger players, that players within a couple hundred points of each other will play competitively, and that players with a larger separation will play less competitively.

Given that a significantly improved rating system is not available, this document gives several possible QSAs. If and when we do propose a new rating system, a new qualifying system may be needed as well. In particular, if the ratings system is quite accurate, it becomes easier to recommend a qualification system that simply takes the highest-rated players at a specific time.

It should be pointed out that no qualification system is perfect. When there are fifty-odd arguably-qualified players and you have to choose a dozen or so, some qualified players will lose out. And many of those who lose out will be able to complain that they would have qualified using many other qualification systems. So please do not expect a perfect qualification system: it does not exist. Some committee members argue that in an analogy to voting paradoxes, attempts to put too many "reasonable" rules on expectations of a QS are also likely to lead to paradoxes. (For an explanation of voting paradoxes, we suggest starting at the Internet site

           http://www.sciam.com/askexpert/math/math2.html
from which you can get to Don Saari's home page and to at least one of his published papers on the subject.)

Here is an example of the kind of problems that arise. There is a tradeoff between the criteria Encouragement and Accuracy. To have the highest possible Accuracy, one would want to make sure that all games played during the qualification period count toward whether one qualifies. If all games can count, however, then some players will feel that they've had a lucky streak and risk a drop in their evaluation if they continue playing. A QSA that offers good Encouragement should therefore be deemed one in which the expected number of such players is small.

While our efforts on rating systems depended extensively on simulated data, the QS efforts are not quite as amenable to simulation. One reason for this is that we cannot simulate such human choices as "sitting on one's rating" in the systems where that would apply. However, some simulations have been useful in showing relative accuracy of some of the methods being proposed.

We end this discussion with two recommendations that may help us in the future. Some of the committee think that having game scores available for all tournament games may help us, either in coming up with a new rating system, or just in providing additional interesting statistics. Witness baseball, for instance, where varied statistics can seem almost as important as the game itself! (Who won the 1954 World Series? Who hit 61 home runs in one regular season? We bet a lot more readers can answer the second question offhand than the first!)

RECOMMENDATION ("Keeping Score Data"):
The NSA should start collecting game score data and information on which player went first, in addition to win/loss results.

We are aware that the above recommendation may be impractical to implement, as it would require altering the current procedures for reporting tourney results.

RECOMMENDATION ("Transmitting Ratings Information"):
The NSA should regularly transmit new ratings information to the Ratings Committee.

This recommendation would facilitate future development of a new RS. A side benefit might be the opportunity to make this information more readily available on the Internet, ultimately reducing NSA postal expenses while offering a richer variety of statistics.

SECTION IV. QS Logistics

The committee has a number of recommendations regarding the logistics of qualifying.

RECOMMENDATION ("Qualification period and minimum games"):
A QS should have a one-year qualification period with at least 60-70 rated games required.

Appearance of Fairness and Accuracy both come into play when choosing the length of a qualification period. The period must be long enough to allow players to play a statistically significant number of games. Two years may be too long to give an Appearance of Fairness to up-and-coming players. Many tournaments are annual, and players (and tournament directors) will be (and have been) upset if their favorite tournament falls just outside the qualification period. Fortunately, one year -- the usual interval between iterations of the same tournament -- is just long enough to allow a statistically significant number of games to be played reasonably easily.

The 60-70 figure is a consensus figure for the number of games that would typically give an accurate result, while being a reasonable number of games to expect a player to play in a year. (Note that with the QSA we ended up recommending, we prefer 70 games; see Section V.)

One member of the committee dissented on the one-year qualification period and wanted to make his feelings known. He is opposed to nonconsecutive qualification periods. For a biennial event, such periods effectively ignore alternate years of performance data. This ignoring, in his opinion, is a problem with respect to Appearance of Fairness, as well as possibly Accuracy, and even Encouragement and Cheatproofing.

RECOMMENDATION ("Start of qualification period"):
The qualification period should commence no earlier than 16 months before the date of the invitational tournament.

The sixteen months leave enough time for a one-year qualification period, followed by a Qualifying Tourney (see below), with enough leeway for travel to and from both the Qualifying Tourney and the event being qualified for. The schedule is intentionally rather tight, so that the qualification can be reasonably up-to-date.

RECOMMENDATION ("Currency"):
A QS should include some kind of currency requirement, in which for some G and M, qualifying players must play at least G games in the last M months of the qualifying period.

This recommendation addresses the Encouragement criterion. It should bring about higher levels of player activity in months immediately preceding close-off of the qualification period -- reducing the risk of players "sitting on their ratings", and also helping to make sure that no qualifier is "rusty" from not playing in a long time.

This recommendation would not necessarily apply for certain QSAs, notably systems that use the "peak" value of some qualification statistic. However, such QSAs have enough other weaknesses (as will be discussed later) that they were rated quite low by the committee. And even in peak QSAs, it may be useful to apply this recommendation to assure that players do not get "rusty."

One problem with this requirement (pointed out by Joe Edley) is the potential lack of "good" tournaments where a top-rated player has a "reasonable" chance to up her/his rating in the last few months. Another problem is that this recommendation could require some geographically isolated players to travel far for tournaments within the final M months. However, given the alternative solutions to the problem of Encouragement, this seems like a minor price to pay. Moreover, the NSA can directly affect this problem by scheduling major tournaments at appropriate times.

The precise values of G and M depend on the QSA. For some QSAs, more complex versions of this requirement are suggested; see Section V.

RECOMMENDATION ("Requalification Rights"):
For elite tournaments run on a regular basis, at least the defending champion, and perhaps additional players (up to perhaps the top 10% of finishers) in one iteration, should be invited back for the next iteration if they are still active players. The actual number of players that will automatically requalify, and the method by which players will be ranked for this purpose, should be announced well in advance of the tournament. If any player does not exercise a Requalification Right, that berth should be reassigned using the regular QS.

For international tournaments, this recommendation should be applied separately to the U.S. and Canadian representation.

This is an item of Appearance of Fairness. Regardless of other qualifying systems, doing sufficiently well in a tournament one time should guarantee another chance. Such an approach is widely used in other sports.

RECOMMENDATION ("Qualification Tournament"):
Excluding the berths for those who qualified by Requalification Rights, at least 80% of the qualifiers for an event should be determined by the QSA. The rest (up to 20%) should be determined by a Qualification Tournament, to be held at least 45 days before the invitational tournament and at least 30 days after the end of the qualifying period. The official dictionary and rules at this tournament should match the dictionary and rules of the invitational event (e.g., SOWPODS and single-challenge for qualification to the WSC; OSPD2+ plus/minus future changes and double-challenge for Superstars-like North American competitions). As usual, finishing positions should be based upon Win-Loss record, with point spread as a tiebreaker. The top P positions in order of finish (where P is up to 20% of all invitation slots) would be winners of the slots.

Entry to the Qualification Tournament should be limited to the 5*P players who are highest ranked by the QSA and who did not qualify by means of the QSA. Thus, if all eligible players play in the qualifying tournament, the top one fifth of the players in the tournament would qualify.
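The berth arithmetic above can be sketched as follows (Python; the function name is a hypothetical illustration, and taking P as exactly one fifth of the open berths, rounded down, is an assumption within the "up to 20%" rule):

```python
def qualifying_tourney_slots(total_berths, requalifiers):
    """Split invitation berths per the recommendation: after removing
    Requalification Rights berths, at least 80% go to the QSA and up to
    20% (here, exactly one fifth, rounded down) to the Qualification
    Tournament, whose entry field is limited to 5*P players."""
    open_berths = total_berths - requalifiers
    tourney_slots = open_berths // 5          # P: up to 20% of open berths
    qsa_slots = open_berths - tourney_slots   # the remaining 80% or more
    field_size = 5 * tourney_slots            # 5*P entrants allowed
    return qsa_slots, tourney_slots, field_size
```

For example, an event with 15 invitation slots and 2 requalifiers would allocate 11 berths by QSA and 2 by tournament, with a tournament field of 10 -- so, as stated above, the top one fifth of a full field qualifies.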

The length of the Qualifying Tournament should be at least 15 games. The tournament should be NSA-rated unless it is played with rules other than those applied in NSA tournaments (i.e., a SOWPODS tournament should not be NSA-rated). The tournament site should be determined by the NSA; preference should be given to sites in regions which have much Scrabble® activity but which have hosted few of the NSA's Milton Bradley-sponsored tournaments.

One alternate for the invitational event should be selected. The NSA or AB should specify whether the alternate is to be 1) the highest rated player in the qualifying tournament who did not qualify, or 2) the player who came closest to qualifying by the QSA but did not qualify either by the QSA or by the tournament. (We did not find either of these alternatives to be superior to the other.)

As was mentioned in Section III, no QS is perfect. Much of the imperfection in the system can be reduced by having a two-tier qualification system -- i.e., one including a qualifying tournament. With such a two-tier system, a player who just misses qualifying by the QSA can at least be consoled by having qualified for the Qualifying Tournament.

RECOMMENDATION ("Documentation"):
The NSA should publish the summary of the QS widely, and make it available for free to any member who asks. If a more complex QS is chosen, the detailed explanation of the QS should be provided, at cost, to any member who requests it.

This recommendation helps the Appearance of Fairness by making sure all players know what they must do to qualify. It also helps the Explainability if explanations are widely available.

RECOMMENDATION ("Posting"):
The NSA should endeavor to calculate current "qualification standings" of people who have played a sufficient number of games, and to provide them in some way on the Internet, as quickly as possible after results come in. The posting should include information both on the highest-rated qualifiers and on which tournaments have been included in the calculation.

This recommendation helps with the Calculability criterion. If players know that they can find results quickly, it is less important for them to be able to calculate their own qualification statistics.

RECOMMENDATION ("Warning"):
If possible, the NSA should announce the full QS and the precise time of the qualification period at least three months in advance of the start of the qualification period.

This recommendation allows people to know what is coming up, and gives them a few months to study or otherwise prepare for the start of the qualification period.

SECTION V. Some Qualification Statistic Algorithms

In this section we present five different QSAs. The first two, presented for comparison, are methods that have been used in the past. After our year of considering various possible approaches, members of the committee were asked to present their favorite QSAs for this report; three different proposals were given, all of which are mentioned here.

One of these three emerged as the clear favorite of the committee, and we are recommending its adoption.

For each QSA and each criterion, we also present our impressions in a subjective numerical form. Each member of the committee rated each QSA on a scale of 0 to 10. Averages of these numbers are presented, as well as overall impressions using the committee's weighted ratings and using individual members' opinions.

These numbers, being subjective, should not be considered precise, nor definitive. They can be useful for looking at overall trends.

V.1. Peak Rating

The QSA is the highest rating during the qualification period after the minimum number of games (say 60) have been played. The Currency requirement might not apply for this QSA.
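This QSA is simple enough to state in a few lines of code. The sketch below (Python) assumes a hypothetical simplified history format: a chronological list of (cumulative games played, rating after event) pairs:

```python
def peak_rating(history, min_games=60):
    """Peak Rating QSA: the highest rating achieved during the
    qualification period once the minimum number of games has been
    played. `history` is a chronological list of
    (cumulative_games_played, rating_after_event) pairs.
    Returns None if the player never reaches min_games."""
    eligible = [rating for games, rating in history if games >= min_games]
    return max(eligible) if eligible else None
```

Note that everything after the peak is ignored, which is the source of the Accuracy and Cheatproofing concerns discussed below.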

Peak systems like this were extensively analyzed by the committee. In general, they suffered from problems of accuracy, as various kinds of simulations showed. Reasons for the problem include that large amounts of data (anything after the peak) get ignored, and that a lucky streak at any time during the qualification period can result in a weaker player qualifying.

    Accuracy: Low to Medium Low. One hot streak can lead to qualification.
        Ratings: 0,3,3,3,4,5 -- average 3
    Appearance of Fairness: Low, for the same reason.
        Ratings: 0,2,3,3,3,4 -- average 2.5
    Encouragement: Very High. Playing more can't hurt.
        Ratings: 8,10,10,10,10,10 -- average 9.7
    Cheatproofing: Very Low to Medium. After establishing a high peak,
                   only ethics prevent one from throwing games.
        Ratings: 0,1,1,5,5,5 -- average 2.8
    Explainability: Very high.
        Ratings: 8,9,9,9,10,10 -- average 9.2
    Calculability: High to Very High.
        Ratings: 6,7,9,9,10,10 -- average 8.7
    Rating by consensus weighting of criteria: Medium Low.
        0.95, 3.56, 3.59, 3.72, 4.45, 4.9 -- average 3.53
    Individual overall ratings: Medium Low to Medium.
        1,3,4,5,5,6 -- average 4

The "Peak Rating" QSA was not popular with the committee, primarily due to relatively low Accuracy, Appearance of Fairness, and Cheatproofing. One member rated it as high as second of the five QSAs.

V.2. Existing System

We will not attempt to explain the QSA that was used for the 1997 Worlds, as it has been described elsewhere. However, we present our opinions for comparison.

There has been much criticism of the complexity of the system, in addition to the evident biases of the system toward those who did well in tournaments right after the minimum fifty games were played. However, the system did appear to select a competent team for the US Worlds.

    Accuracy: Low to Medium Low. Biased toward certain tournaments (those
              near and just after the fifty-game minimum was reached).
        Ratings: 3,3,3,3,4,6 -- average 3.7
    Appearance of Fairness: Low, for the same reason; the system engendered
                            many complaints on this score.
        Ratings: 2,2,3,3,3,4 -- average 2.8
    Encouragement: Opinions varied widely. Many on the committee felt that
                   Encouragement was Very High, since playing more can't hurt.
                   Some members downgraded it (in one case, by a lot) due to
                   past experience that some good players who were effectively
                   eliminated after fifty-odd games had little incentive to
                   continue to try to qualify.
        Ratings: 4,7,8,10,10,10 -- average 8.2
    Cheatproofing: Low to Medium. After establishing a high peak, only ethics
                   prevent one from throwing games.
        Ratings: 0,1,3,5,5,6 -- average 3.3
    Explainability: Low to Medium Low. Some players were confused, though
                    worse systems can be imagined.
        Ratings: 1,2,3,3,4,7 -- average 3.3
    Calculability: Opinions varied widely. Some committee members found
                   the QSA to be impractical for most individuals to keep
                   track of; others rated it higher, as a determined
                   individual could do so. One committee member thought that
                   all 5 QSAs merited Calculability in the 8-10 range, and
                   so ended up with unusually high Calculability values for
                   several of the QSA's.
        Ratings: 2,3,5,6,7,10 -- average 5.5
    Rating by consensus weighting of criteria: Medium Low.
        2.91, 3.31, 3.33, 3.44, 4.03, 5.26 -- average 3.71
    Individual overall ratings: Medium Low.
        2,3,4,4,4,6 -- average 3.83

The QSA used for the 1997 Worlds was rated lowest in average individual overall ratings, and second lowest by consensus rating of criteria. It failed to distinguish itself in any of the criteria except Encouragement. No committee member ranked it in the top two QSAs.

V.3. Final Rating

This QSA was rated tied for first (with IOPR, see V.4) by one member of the committee.

The qualification statistic is the rating at the end of all the rated tournaments which began on or before the official end of the qualification period. So, if a multi-day tournament begins on the day which marks the end of the qualification period, and then extends past that date, the tournament in its entirety will count toward the qualification period.

All players seeking to qualify must play at least 60 rated games during the qualification period. At least 15 rated games must be played within the final three months of the period, OR at least 25 rated games must have been played within the final five months of the period.
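These logistical checks are mechanical. The following sketch (Python, with a hypothetical date-list representation and months approximated as 30-day blocks) shows one way they might be implemented:

```python
from datetime import date, timedelta

def meets_requirements(game_dates, period_end, total_min=60):
    """Check the Final Rating QSA's game-count and currency rules:
    at least `total_min` rated games in the qualification period, plus
    either 15 games in the final three months or 25 in the final five
    (months approximated here as 30-day blocks).
    `game_dates` lists the dates of the player's rated games."""
    if len(game_dates) < total_min:
        return False
    def games_since(days):
        cutoff = period_end - timedelta(days=days)
        return sum(1 for d in game_dates if d >= cutoff)
    return games_since(90) >= 15 or games_since(150) >= 25
```

A player with 50 older games and 16 games in the final month would qualify under this check; 100 games all played more than five months before the cutoff would not.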

Advantages of this QSA include that it is highly familiar to the NSA public. Backers of this QSA assert that the rating system is a good ranker of players' recent skills -- provided that they play quite a bit. This assertion seems warranted by Ratings Committee research, which has shown that NSA ratings are fair to good predictors of final finishes in tournaments.

Disadvantages are primarily with regard to Accuracy as compared to other QSAs. Opponents of this QSA state that the probabilistic nature of Scrabble, and the resultant streakiness of outcomes, make this QSA too volatile to depend on under any RS. They also complain that players may tend to build up a high rating, sit on it until near the end of the year, play just enough games to meet the Currency requirement ... and then keep playing only until their ratings get high again. There is a lot of pressure in the last few tournaments of the qualifying period with this QSA.

    Accuracy: Opinions varied widely. See discussion below.
        Ratings: 2,3,3,6,6,7 -- average 4.5
    Appearance of Fairness: Opinions varied widely. See discussion below.
        Ratings: 2,3,3,5,6,7 -- average 4.3
    Encouragement: Medium, due to the Currency requirement.
        Ratings: 4,4,5,6,6,6 -- average 5.2
    Cheatproofing: Opinions varied widely. See discussion below.
        Ratings: 5,5,6,8,9,10 -- average 7.2
    Explainability: Very High.
        Ratings: 8,10,10,10,10,10 -- average 9.7
    Calculability: Very High.
        Ratings: 10,10,10,10,10,10 -- average 10
    Rating by consensus weighting of criteria: Opinions varied widely.
        2.65, 3.75, 3.75, 5.95, 6.3, 6.97 -- average 4.9
    Individual overall ratings: Opinions varied widely.
        2,3,3,6,7,7 -- average 4.7

The committee was not able to come to agreement on this QSA. We all agreed that what you see is what you get: all six members gave similar grades for Accuracy and Appearance of Fairness. However, three members gave a Low rating (2 or 3) on these two criteria, and the other three gave a Medium to High rating (5 to 7). Those who gave a Low rating disapproved of the QSA's bias toward those who are fortunate enough to hit an unusually high rating after playing enough games in the final months. Those who gave a higher rating thought that the reasonable accuracy of the ratings system was sufficient.

The committee was also split on the criterion of Cheatproofing (in a different way than in the previous split). Three believed that Cheatproofing in this QSA is Very High, since losing a game in an effort to favor another lowers one's own rating. Three others believed that it is only Medium, since under the system individuals who are not in the running for a qualifying berth could conceivably "throw games" to a contender, and thus help boost the contender's qualifying rating. The latter three generally gave lower Cheatproofing scores to all of the QSA's for this reason.

Overall, the Final Rating QSA was thought of well by the three committee members who found it Accurate: it received one first-place tie vote, one second-place vote, and one second-place tie vote. However, the other three committee members rated it at or near the bottom of the five QSAs.

V.4. Iterated Overall Performance Rating

This QSA was rated highest by four members of the committee, and tied for first by another member. We commonly refer to this QSA by the abbreviation IOPR.

This QSA tries to find a mathematically "ideal" measure of the performance of each active player during the qualification period, using an algorithm similar to the performance rating method used by the NSA to calculate initial ratings for unrated players: essentially, each player's qualification statistic is the rating that they would have earned if they had started the qualification period unrated and if all the games that they played during the qualification period were counted as one big tournament. ("Essentially" is used here only to cover the fact that the NSA system would not allow a tournament division to consist entirely of unrated players -- IOPR will start each player at an arbitrarily estimated rating, and then iteratively calculate NSA performance ratings until the values stabilize.)

Skip this paragraph if you're not mathematically inclined. An alternate way of phrasing the calculations is to say that they amount to the iterative numerical solution of the N simultaneous equations

Expected_Wins[i] = Actual_Wins[i]
with i indexing players active during the qualification period, Actual_Wins giving a player's wins during the qualification period, and Expected_Wins giving their NSA win expectation (using the curve from the existing ratings system) during the qualification period. Note that a constant can be added to each qualification statistic in one solution to give another solution; the solution used will be one whose overall average rating approximately matches the current average rating.
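As a rough illustration, the iteration described above can be sketched in a few lines of code. This is not the NSA's actual implementation: the logistic win-expectation curve, the 1500 starting rating, and the damped step of 25 points per excess win are all assumptions made for the sketch, and the real NSA curve and iteration details may differ.

```python
def expected_win(r_player, r_opp, scale=400.0):
    # Illustrative logistic win-expectation curve (the real NSA curve
    # may differ).  Returns the probability that r_player beats r_opp.
    return 1.0 / (1.0 + 10.0 ** ((r_opp - r_player) / scale))

def iopr(games, start=1500.0, step=25.0, tol=0.5, max_iter=1000):
    """Sketch of the Iterated Overall Performance Rating.

    games: list of (winner, loser) pairs from the qualification period.
    Every player starts at an arbitrary rating; ratings are then
    adjusted until Expected_Wins[i] ~= Actual_Wins[i] for every i.
    """
    players = {p for game in games for p in game}
    rating = {p: start for p in players}
    wins = {p: 0 for p in players}
    opponents = {p: [] for p in players}
    for winner, loser in games:
        wins[winner] += 1
        opponents[winner].append(loser)
        opponents[loser].append(winner)
    for _ in range(max_iter):
        new = {}
        for p in players:
            exp = sum(expected_win(rating[p], rating[o]) for o in opponents[p])
            # Damped update: move the rating toward the value at which
            # expected wins equal actual wins.
            new[p] = rating[p] + step * (wins[p] - exp)
        # Solutions are only determined up to an additive constant, so
        # shift every rating to keep the overall average at `start`.
        shift = start - sum(new.values()) / len(new)
        new = {p: r + shift for p, r in new.items()}
        if max(abs(new[p] - rating[p]) for p in players) < tol:
            return new
        rating = new
    return rating
```

With a handful of games among three players of clearly different strengths, the iteration settles into ratings whose ordering matches the observed results, centered on the starting average.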

To qualify, a player must play at least 70 games during the qualification period. At least 15 rated games must be played within the final three months of the period, OR at least 25 rated games must have been played within the final five months of the period. Note that due to the nature of the calculation and the fact that prior ratings are not considered at all, this QSA requires slightly more games than some of the others. Indeed, one or two members of the committee recommend qualification periods of two years for this method, with a minimum of 120 games in the two years including at least 60 in the second year; however, the majority of the committee preferred a qualification period of exactly one year.
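The minimum-activity rules above amount to a simple check. A sketch, assuming the one-year qualification period preferred by the majority (the function and parameter names are our own, not part of the report):

```python
def iopr_eligible(total_games, rated_last_3_months, rated_last_5_months):
    # Hypothetical helper encoding the IOPR logistics: 70 games overall,
    # plus a currency requirement of 15 rated games in the final three
    # months OR 25 rated games in the final five months.
    if total_games < 70:
        return False
    return rated_last_3_months >= 15 or rated_last_5_months >= 25
```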

For computational efficiency, an approximation to this algorithm will probably be used. Players whose ratings fall below a threshold value, and whose IOPRs can therefore be shown to lie outside the qualifying range, will not be included in the iteration process. Rather, their ratings for calculation purposes will be whatever their current ratings are at each tournament. In terms of the existing NSA ratings system, this is the equivalent of making all potential qualifiers (say, those rated 1800+) unrated at the beginning of the QP, rating all games played during the QP as one tournament, and using as a QS the initial rating that each potential qualifier would have earned in the regular system. (Note that the cutoff for purposes of this algorithm -- perhaps 1800, as mentioned above -- would be set low enough to guarantee that no potential qualifier is eliminated.)

The strongest point of this QSA is its Accuracy. In some kinds of statistical tests (ones with idealized players of unchanging abilities all playing a reasonable number of games) it comes out as a clear winner: in such cases, it has a much better chance than our other recommended QSAs of choosing the actual "best" players.

This QSA also rates well in the area of "Cheatproofing", as any loss directly affects one's chances of qualifying.

Some committee members feel that this QSA is particularly favorable to rising players, since it does not depend on the ratings of contenders before the start of the qualifying period. Accordingly, an up-and-coming player is not burdened with the effort of attempting to raise an improperly low rating during the qualification period.

Weaknesses of the QSA include several problems in the area of Appearance of Fairness; these also affect Accuracy to some degree. A minor problem is that the algorithm does not always converge well due to integer rounding of ratings; this problem would usually only affect a player's rating to within one point, however, and one point cannot be considered statistically significant with any of the proposed QSAs.

Another problem is instability: the values of the qualification statistic can fluctuate wildly after one tourney, since the stabilizing force of long-term ratings is not present. The problem of instability is especially difficult early in the qualifying period. A related problem is that the qualification statistic will frequently change for a player even though the player does not play: this can happen because one's qualification statistic is dependent on that of the opponents one has had during the year.

It is even possible, though rare when a significant number of games have been played, that one player could surpass another in this QSA even though neither player has played a game (if the first player's historical opponents outplay the second player's). While such volatility may be surprising, it turns out to permit greater Accuracy. (Of course, with any of the proposed QSA's, one's ranking can change even when one is inactive.)

This QSA is also quite sensitive with regard to players who have played few games. In particular, if a strong player plays only a few games and does poorly, that player would be treated as a weaker player for the purpose of calculating opponents' qualification statistics.

This QSA does adequately in the area of Explainability - most players have some idea of what a performance rating is and how initial ratings are calculated; but it does terribly in terms of Calculability. If this QSA is chosen, it would be almost a necessity that current QS standings be posted after each tournament, as it would be almost impossible for most players to do the calculations themselves. Indeed, depending on the number of active players, the NSA itself might have trouble performing the required calculations and have to contract out the computation. John Chew, who has run such a system for NSA Club #3 for a dataset of about a hundred players and about a hundred games, has found that recalculating IOPRs can take several minutes on his departmental compute server. He is willing to offer the use of his software and access to computing resources to help implement it.

    Accuracy: High to Very High.
        Ratings: 8,8,8,9,9,9 -- average 8.5
    Appearance of Fairness: Medium High to Very High. Some biases for
            or against players whose opponents play few games and do
            not play as statistically expected, or whose opponents
            gain or lose strength over the qualifying period; see the
            above discussions.
        Ratings: 5,6,6,8,9,9 -- average 7.2
    Encouragement: Medium High to High, due to the Currency requirement,
                   due to the perception that players will not want to
                   risk sitting on their ratings when their ratings can
                   change without their playing, and due to the high
                   Accuracy which should encourage players near the
                   cutoff to continue efforts to qualify.
        Ratings: 5,5,7,7,8,8 -- average 6.7
    Cheatproofing: High to Very High, as any loss affects one's own
                   chances of qualifying. Some votes were lower, from
                   members who gave low Cheatproofing marks to all the
                   QSA's due to the possibility of clear non-qualifiers
                   throwing games to potential qualifiers.
        Ratings: 6,6,9,9,9,10 -- average 8.2
    Explainability: Opinions varied widely; some think it easy to explain
                    by analogy to the fairly well-known existing performance
                    rating statistic, while others think the details may
                    be confusing to many players.
        Ratings: 2,4,5,7,9,9 -- average 6
    Calculability: Very Low to Medium Low (although one member gave a
                   Calculability of at least 8 to all five QSA's).
        Ratings: 0,1,2,3,4,8 -- average 3
    Rating by consensus weighting of criteria: High to Very High.
        7.23, 7.39, 7.44, 8.57, 8.59, 8.6 -- average 7.97
    Individual overall ratings: High to Very High. One or two
                    committee members downgraded their individual assessments
                    due to the probable need to have calculations done outside
                    the NSA.
        5,7,7,8,9,9 -- average 7.5

This QSA got 4.5 first-place votes among the five QSAs; no committee member ranked it worse than second place. When judged by the consensus weighting of criteria, all six committee members ranked it first; indeed, no committee member gave a consensus-weighted rating above 7 to any other QSA, and all six gave a consensus-weighted rating above 7 to this QSA.

V.5. Average of Three Ratings

This QSA was rated highest by one member of the committee.

This QSA is an attempt to find a compromise between the simplicity (and resultant inaccuracy) of the Final Rating QSA and the accuracy (and attendant complexity) of the Iterated Overall Performance Rating QSA.

In this QSA, the qualification period is broken into three four-month subperiods. To qualify, a player must play at least 20 games in each subperiod (a total of at least 60 games). All games in a tournament would be considered to be played in the subperiod in which the tournament ends. The qualification statistic is the average of the three ratings at the end of each subperiod.
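The calculation itself is straightforward; a minimal sketch follows, with the function and parameter names our own rather than the report's:

```python
def avg_three_qualification(end_ratings, games_per_subperiod, min_games=20):
    # Hypothetical helper: end_ratings holds the rating at the close of
    # each four-month subperiod; games_per_subperiod holds the number of
    # games played in each.
    if len(end_ratings) != 3 or len(games_per_subperiod) != 3:
        raise ValueError("expected exactly three subperiods")
    if any(g < min_games for g in games_per_subperiod):
        return None  # fewer than 20 games in some subperiod: ineligible
    # Qualification statistic: the average of the three ratings.
    return sum(end_ratings) / 3
```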

The advantage of this system is that, while maintaining the simplicity of the Final Rating QSA, volatility is somewhat reduced. For a player to "get lucky" and qualify, the player would have to peak at all three of the right times, rather than just peaking at the right time.

Some complain that average rating systems like this tend to favor those who perform strongly early in the qualification period, since the high rating they earn early in the period is the rating on which their later ratings are based. As one person put it, "averaging ratings over time is not a good idea - because the rating system is already something of a weighted average itself." Among the various possible QSAs that average ratings, this complaint is least valid for this particular one because at least 20 games are played between the samplings: enough time that an undeserving player is likely to see a reduction from an unusually high rating.

Disadvantages of this system include the fact that it is possible to some extent to "sit on one's rating" within each subperiod. It may also be difficult for some players to find enough games to play in each subperiod, or to find enough "high-quality" games to play (i.e., ones against reasonably high-rated opposition).

    Accuracy: Medium to Medium High. Biased toward those who can get
              lucky at the right time -- three different times. Somewhat
              biased toward the front end of the qualifying period.
        Ratings: 3,4,5,6,6,6 -- average 5
    Appearance of Fairness: Medium Low to Medium.
        Ratings: 3,4,4,4,6,6 -- average 4.5
    Encouragement: Medium to Medium High, due to the need to play in all
                   three subperiods.
        Ratings: 3,4,5,6,7,8 -- average 5.5
    Cheatproofing: High to Very High. Losing a game to favor another lowers
                   one's own rating.
        Ratings: 5,5,8,9,10,10 -- average 7.8
    Explainability: High, though one committee member thought that the details
                    could cause difficulties.
        Ratings: 3,7,8,8,8,10 -- average 7.3
    Calculability: Opinions varied widely. Three members thought that
                   averaging three ratings would not be difficult, and rated
                   it Very High. Three others rated it only Medium, because
                   of the amount of information to keep track of, especially
                   when monitoring how competitors are doing.
        Ratings: 5,6,6,10,10,10 -- average 7.8
    Rating by consensus weighting of criteria: Medium to Medium High.
        3.48, 4.6, 4.97, 5.77, 6.04, 6.27 -- average 5.19
    Individual overall ratings: Medium.
        3,5,5,5,6,7 -- average 5.2

V.6. Recommendations on QSAs

One qualifying system was clearly rated higher than all of the others. We therefore make the following recommendation.

RECOMMENDATION ("QSA"):
Iterated Overall Performance Rating should be adopted as the QSA for now. However, the Ratings Committee would like an additional month (until November 1, 1997) for further testing of this choice and to determine certain details of the algorithm.

While the majority of the committee backed the idea of IOPR, we felt that another month beyond our October 1 deadline could both make us more secure in this decision and allow us to decide on certain details with more information at hand. (Those details include exactly how to work the "approximate" algorithm mentioned previously, and some potential changes to the way that infrequent players are handled and reported.)

One committee member strongly disagrees with the recommendation of IOPR. He acknowledges the impressive accuracy of IOPR, but thinks there are three problems: first, in his opinion, the committee significantly undervalued simplicity (expressed as Explainability and Calculability, together only 5% in the weighted rating); second, in his opinion, the outside computing power that IOPR would probably require makes it impractical; and third, in his opinion, the committee significantly underrated the accuracy of multiple sampling (as done by the Average of Three Ratings QSA) in comparison to the other QSAs, especially the Final Rating QSA. Despite these three alleged problems, however, he still considers IOPR to be superior to the Peak Rating, Final Rating, and 1997 Worlds QSAs.

Due to the particular ways that IOPR works, and its relative inaccuracy when few games have been played, we also recommend the following.

RECOMMENDATION ("Report when Significant"):
A player's qualification statistic should not be published unless the player has played at least 30 games in the qualifying period.

The Ratings Committee is hoping to come up with improvements to the rating system in the near future. After that, it may be appropriate to reexamine the question of QSA's. If the new rating system has better predictive value than the current system, it is quite possible that a QSA based on that system (quite possibly one of the QSAs mentioned here, but with the new rating system underlying it) would be superior to IOPR.