A lockless transposition table implementation for parallel search

Abstract

Parallel search programs use traditional alpha/beta tree searching, which includes the concept of transposition/refutation tables. Storing these entries in parallel can result in corrupt data that causes significant problems to the search. To prevent this, the common solution uses an atomic lock/unlock to implement a critical section that avoids the problem, but this incurs a significant performance penalty on many architectures. This paper describes a new approach that completely eliminates the need for any synchronization through locks, while still avoiding the problem of retrieving and using corrupted data.

Introduction

Assume that an entry in the transposition table (entry t) contains three pieces of data; the hash signature (the value used to confirm that this hash table entry represents the right position), the best move and the search result. In Crafty, a hash entry contains more information, but those three pieces are the important ones for this discussion.

Now further assume that the hash signature requires 64 bits to store, and that the move and score can be combined into one 64 bit word. This gives us two 64-bit words for a single transposition table entry. Name these two words W1(k) and W2(k) for chess position k.

For the remainder of this discussion, assume that 64 bit memory read/write operations are atomic, that is the entire 64 bit value is read/written in one cycle. This leads to the conclusion that a 128 bit read/write is broken up into two 64 bit reads/writes, and therefore this is not atomic since it becomes two distinct cpu-to-memory transactions.

When the program determines that it is time to store an entry into the transposition table, it computes the transposition table index by taking the low-order N bits of the hash signature, and using this as the table address where this 128-bit entry is to be stored. The problem arises where two different hash signatures match in the right-most N bits, so that both address the same transposition table entry (that is, W1(j)&MASK matches W1(k)&MASK exactly, where positions j and k are different). For positions j and k, we then have W1(j) and W2(j) as one entry, and then W1(k) and W2(k) as the second entry. What happens if two threads (using two or more processors) try to store these two table entries at approximately the same time, or if one thread tries to store at this table entry while another simultaneously tries to read that table entry? Both will address the same 128 bits in the transposition table. However, recall that W1(j) != W1(k) and W2(j) != W2(k) (j and k are two totally different chess positions that just happen to map to the same table entry because the hash signatures for both have the same low-order bits in W1). The main concern here is that there is no "atomic" way to store both W1(k) and W2(k) in one memory operation, nor is there a way to completely read W1(k) and W2(k) without having a small chance that another thread will change one of the two parts but not both before the read is completed. When two threads try to store two different 128-bit entries at the same table address, or one thread reads the 128-bit entry while another thread is writing to the same entry, the resulting data will look like one of the following:

        W1(j) W2(j)        [ok]
        W1(k) W2(k)        [ok]
        W1(j) W2(k)        [bad]
        W1(k) W2(j)        [bad]

The above happens because when any thread tries to store an entry, it has to do two (or more) memory writes. And if two threads try to store in the same entry concurrently, then any of the above can happen since the stores can be done in any of 4 different orders. (It should be noted that this same problem occurs when one thread is trying to read the 128 bits from memory to access the information, while another thread is simultaneously writing different information to that hash entry for another position because memory writes and reads can be interleaved in several different orders.)

The problem becomes an issue when the data is used. Based on the four possibilities above, there is a 50% probability that when one thread matches its hash signature with W1(p) (where p is either j or k), that it will get a match, but the W2(p) does not go with the W1(p) signature. For any position p in the table, it is essential that W1(p) is paired with W2(p), because the information in W2(p) is trusted.

This problem was first seen in Cray Blitz right after it was parallelized in 1983. The initial thought was "who cares which entry gets stored; there is no real reason to prefer one over the other since it cannot be determined (now) which one will be more important in the future?" But testing and debugging exposed the problem rather quickly. For safety, Cray Blitz tested the best move part of W2(p) to make sure it was a legal move, as making an illegal move (like castling) would create an extra rook or king on the board. During testing, the program would occasionally report an error message from the move validation routine that said "Illegal move from hash table." It was then necessary to discover why this was happening, because it never happened when testing with a serial (non-parallel) search.

We discovered that one thread would attempt to store W1(j) and W2(j) at entry X in the table. Simultaneously, another thread would attempt to store W1(k) and W2(k) at entry X, or it would attempt to read W1(k) and W2(k) from entry X. Since the processors queue up memory reads and writes and they proceed independently, there was no way to be sure that the first process completed both writes before the second would start its own reads/writes. Now the entry could have either W1(k) and W2(j) or W1(j) and W2(k), either of which is a serious problem.

The obvious solution is to use an atomic lock around the memory reads and writes, so that once a single thread writes W1(j) at table entry X, it is impossible for any other thread to read/write anything in the table until the first thread completes the entry by writing W2(j) as well. This neatly solved the problem, and the errors no longer occurred.

The problem with the above solution is that for machines with more than 2 processors, the locks begin to affect performance. For machines with slow atomic locking facilities (most today excepting the Cray are terribly slow at this operation) the penalty is even more pronounced. We wanted a better performance option than the atomic lock solution provided.

Another solution, used by at least one parallel program, is to simply ignore the problem. If the program carefully validates the best move before playing it on the board, then the board corruption problem can be eliminated. Of course, the program ends up using a score from a wrong position every now and then, which can still lead to errors, although the errors are likely to be non-catastrophic since the board does not get corrupted. We tried this approach for a while, but with small transposition tables and fast hardware, the probability of errors seemed too high to ignore.

The lockless strategy

One more observation is important to the solution of the problem. With no lock, it is possible to find (a) no move or score if the current hash signature does not match the table entry signature, (b) a correct move/score for this position if the signatures match, or (c) a wrong move/score if the signatures match but the wrong W2 value is present. Of the three, only the last one has to be avoided.

Our approach is to slightly modify the way we store entries, as follows:

        T = W1(p) & MASK;
        W1(T) = W1(p) ^ W2(p)
        W2(T) = W2(p)

In the above, T is the table address, <W1(p),W2(p)> is the value we want to store in the table position T for the current position p. Now the question is, how does that work and how does it cure our problem?

First, when we later do a table probe, we want to find W1(p) in the table (here p is the position we are trying to find), where we know that W1(p) matches the value W1(p) we stored previously. We take W1(T) from the table and exclusive-or it with W2(T) which should give us the original W1(p) value. If we store W1(j) and W2(j) in the table at entry T, then we really store W1(j)^W2(j) as the first word, and W2(j) as the second word. If we come back later we will match properly, since W1(j)^W2(j)^W2(j) is simply W1(j) (if you exclusive-or a value with the same value twice, you get the original value you started out with before the first exclusive-or operation.)

Let's try an error case, where we store W1(j), then W1(k) and W2(k) followed by W2(j) (which caused the error previously). At the table entry T, we store W1(j)^W2(j) in the first word, and then we are interrupted. The second thread comes along and stores W1(k)^W2(k) in the first word, and W2(k) in the second word. The first thread continues and stores W2(j) in the table. We now have W1(k)^W2(k) for the first word in the table at entry T, and W2(j) as the second.

Now let's suppose we reach position j again and we go to the table to see if we can match the current hash signature with W1. We take the table entry T, and compute W1(T)^W2(T). Which is W1(k)^W2(k)^W2(j). Note that we originally exclusive-or'ed W1(k) with W2(k), but then we lost W2(k) and had it replaced by W2(j). If you compute W1(k)^W2(k)^W2(j) you do not get W1(k) again. You get something else entirely. The current thread then decides "no match here" and continues searching without using the corrupted entry.

The same thing happens if the other thread tries to find position K. It too will fail since W1 needs its paired W2 to decrypt the original W1 value. What we get, then, is a simple methodology to eliminate the error-condition, because we will never get a match if either W1 or W2 is changed independently. Since W1 depends on W2 to decrypt it, the two must remain paired in the table or neither will be used until they are overwritten.

Conclusions

The first thing that should be obvious is that when two threads try to store at the same table entry concurrently, most of the time you get either W1(j) and W2(j) or you get W1(k) and W2(k). It is much more rare to see the non-paired error cases where either W1(j) is paired with W2(k) or vice-versa. This implies that most of the time, this is not an issue at all.

However, on occasion, two threads try a simultaneous update and the effect is a "lost" table entry because it can not be matched by W1(j) or W1(k) since the matching W2 part was lost. The disadvantage of this is that we get an entry that is totally unusable, but by doing so we will _never_ get an entry that is corrupted and provides a bad score or move. Of course, this entry will be overwritten at some point with useful data so the penalty is a short-lived one.

With today's large memories, it is not uncommon to have millions of transposition table entries. When you analyze the probability of two threads storing at the same entry concurrently, the odds are very small. When you add in the odd timing that tickles this bug, the odds are reduced even further. In fact, it is likely that this happens just a few times per game with 4 processors or less. The problem is that when it happens, the search would normally get a bad score which could change the tree. Or even more importantly, it could get an illegal move that would corrupt data structures and likely lose a game. The current scheme eliminates the locking overhead completely, and guarantees that there will be no corrupted entries that actually get used by the search.

This methodology has been used in Crafty for over two years with good results. On machines where the previous atomic locking methodology caused some performance problems, these problems were completely eliminated when the locks were removed. Yet we can still trust the best move and score to be 100% accurate with no danger of incorrect values.

This methodology works just as well on transposition tables with different characteristics than the ones used for the actual implementation. For example, there is no real requirement that an entry have a 64 bit signature and a 64 bit data value. This approach would work well with a 32 bit signature that is stored, plus 64 bits of information to go with that signature. Only the XOR operation would need to be changed to make sure that that W2 is needed to decrypt the W1 signature.

References

Hyatt, R., "The Dynamic Tree-Splitting Parallel Search Algorithm," The Journal of the International Computer Chess Association, Vol 20, No. 1 (1998), 3-19.

Nelson, H, "Hash Tables in Cray Blitz", JICCA Vol. 8, No. 1, 1985 (3-13).

Slate, D and Atkin, L, "Chess 4.5, the Northwestern University chess program" in "Chess Skills in Man and Machine", edited by Peter Frey, Springer-Verlag, 1977.

Hyatt, R, Nelson, H and Gower, A, "Cray Blitz" in "Computers, Chess and Cognition" edited by Marsland and Schaeffer, Springer-Verlag, 1990.