This code defines a checkers board and the ability to officiate a game of checkers between two AIs. The AIs play by the rules I played with as a child, which are the same as traditional American Checkers (English Draughts) except that a win can only be achieved by capturing all of the opponent's pieces. Two AIs are defined to compete in checkers: one based on reinforcement learning and one based on alpha-beta pruning.
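As a rough sketch of how a game might be officiated under that modified win condition, consider the loop below; the Board/AI interface used here (legal_moves, apply_move, piece_count, choose_move) is an assumption for illustration, not the project's actual API.

```python
# Illustrative sketch of officiating a game between two AIs under the
# house rule that a win comes only from capturing every opposing piece.
# The Board/AI interface is assumed, not taken from the project.

def officiate(board, red_ai, black_ai, max_turns=500):
    players = [("red", red_ai), ("black", black_ai)]
    for turn in range(max_turns):
        color, ai = players[turn % 2]
        opponent = players[(turn + 1) % 2][0]
        moves = board.legal_moves(color)
        if moves:
            board.apply_move(ai.choose_move(board, moves))
        # A blocked player does not lose under this rule set; how that case
        # is handled is left open in this sketch.
        if board.piece_count(opponent) == 0:
            return color  # winner: captured all of the opponent's pieces
    return None  # no winner within the turn limit
```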
The alpha-beta AI uses a traditional alpha-beta pruning algorithm (with a specified depth), where the value of a potential future board configuration is infinity for a winning game, negative infinity for a losing game, and given by the following equation for any other board configuration:
In the above equation, U and K represent the number of uncrowned pieces and the number of kings a player has, and the subscripts s and o indicate whether the player is the AI itself or its opponent, respectively. The delta refers to the change in the value following it between the current board configuration and the potential board configuration the AI is considering.
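As an illustrative reconstruction of that heuristic (the relative piece weights shown here, with a king counted as twice an uncrowned piece, are an assumption rather than the original coefficients), the value could take the form:

$$\text{value} = \left(\Delta U_s + 2\,\Delta K_s\right) - \left(\Delta U_o + 2\,\Delta K_o\right)$$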
The reinforcement learning AI is based on the ideas of Q-learning. Since creating and training a Q-table for every possible board configuration would take immense computational power, a set of characteristics defining a set of states was created such that each board configuration exists in one and only one of these states.
The characteristics of the states are as follows:
- Number of own uncrowned pieces
- Number of opponent uncrowned pieces
- Number of own kings
- Number of opponent kings
- Number of own pieces (crowned and uncrowned) on left and right edges of board
- Integer value of own vertical center of mass
- Integer value of opponent vertical center of mass
For characteristics 6 and 7, the center of mass is calculated with an uncrowned piece having the same mass as a king.
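A minimal sketch of computing such a state descriptor is shown below, assuming a board represented as a dict mapping (row, col) squares to (owner, is_king) pairs; this representation and the function name are illustrative, not the project's actual data structures.

```python
# Illustrative sketch: derive the abstract Q-learning state from a board.
# The board representation (dict of (row, col) -> (owner, is_king)) is an
# assumption made for this example.

def board_to_state(board, me, opponent, board_size=8):
    own_men = own_kings = opp_men = opp_kings = own_edge = 0
    own_rows, opp_rows = [], []

    for (row, col), (owner, is_king) in board.items():
        if owner == me:
            own_kings += is_king
            own_men += not is_king
            own_edge += col in (0, board_size - 1)  # left/right edge squares
            own_rows.append(row)
        elif owner == opponent:
            opp_kings += is_king
            opp_men += not is_king
            opp_rows.append(row)

    # Characteristics 6 and 7: integer vertical centers of mass, with kings
    # and uncrowned pieces weighted equally.
    own_com = sum(own_rows) // len(own_rows) if own_rows else 0
    opp_com = sum(opp_rows) // len(opp_rows) if opp_rows else 0

    return (own_men, opp_men, own_kings, opp_kings, own_edge, own_com, opp_com)
```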
Since knowing the state of a board is not enough information to determine its possible transitions to other states, initializing a traditional Q-table is not possible, and even if it were, it would require a very large amount of memory. With this in mind, I instead dynamically create/discover a set of possible transitions between states during training, which I store as keys in a dictionary.
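A minimal sketch of how that transition dictionary might be maintained follows; the name T and the default initial value of 0.0 are assumptions.

```python
# Transitions are discovered lazily and stored as (state_before, state_after)
# keys in a plain dictionary. The initial value of 0.0 is an assumption.

T = {}  # maps (state1, state2) -> learned scalar value

def ensure_transition(state1, state2):
    """Register a newly discovered state transition if it is not yet known."""
    T.setdefault((state1, state2), 0.0)
```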
There are two formulas used to update the value of a transition between two states. In these formulas, T(state1, state2) is the learned scalar value for transitioning from state1 to state2, R is the scalar reward function between two states (the same as described for the alpha-beta pruning AI), alpha is the learning rate, and lambda is the discount rate.
Directly preceding the AI's turn, if it has already moved at least once this game and the game board is not terminal, then the following value update is used:
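A Q-learning-style form consistent with this description would be the following, where s1 and s2 are the states immediately before and after the AI's previous move; the bootstrapping term (a max over known transitions out of s2) is an assumed form, not quoted from the original:

$$T(s_1, s_2) \leftarrow T(s_1, s_2) + \alpha\left[R(s_1, s_2) + \lambda \max_{s'} T(s_2, s') - T(s_1, s_2)\right]$$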
If the game board is terminal, then the AI will be notified and will apply the following value update:
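A corresponding terminal update, with no discounted future term since the game is over (again an assumed form consistent with the description), would be:

$$T(s_1, s_2) \leftarrow T(s_1, s_2) + \alpha\left[R(s_1, s_2) - T(s_1, s_2)\right]$$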
When it is the reinforcement learning AI's turn, it looks at a list of possible moves it could make, and if their respective state transitions are not yet known, it adds them to the dictionary of transitions. Then it will choose to do one of two things, based on a specified probability. The first is to choose a move at random; this is used during training so that transitions other than the current highest-valued transition will still be trained. If the learning AI does not move randomly, it will choose the move whose state transition has the highest value associated with it.
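A sketch of that move-selection step (epsilon-greedy over transition values) is shown below; the function name, the explore_prob parameter, and the shape of the transitions argument are assumptions made for illustration.

```python
import random

def choose_move(transitions, T, current_state, explore_prob=0.1):
    """transitions: list of (move, resulting_state) pairs for the legal moves.
    T: dict mapping (state1, state2) -> learned transition value."""
    # Register any transitions that have not been seen before.
    for _, next_state in transitions:
        T.setdefault((current_state, next_state), 0.0)

    if random.random() < explore_prob:
        return random.choice(transitions)[0]  # explore during training

    # Exploit: choose the move whose state transition has the highest value.
    return max(transitions, key=lambda t: T[(current_state, t[1])])[0]
```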