Expert Q-learning: Deep Reinforcement Learning with Coarse State Values from Offline Expert Examples

In this article, we propose a novel algorithm for deep reinforcement learning named Expert Q-learning. Expert Q-learning is inspired by Dueling Q-learning and aims at incorporating semi-supervised learning into reinforcement learning by splitting Q-values into state values and action advantages. We require that an offline expert assess the value of a state in a coarse manner using three discrete values. An expert network is designed in addition to the Q-network and is updated after each regular offline minibatch update whenever the expert example buffer is not empty. Using the board game Othello, we compare our algorithm with the baseline Q-learning algorithm, which is a combination of Double Q-learning and Dueling Q-learning. Our results show that Expert Q-learning is indeed useful and more resistant to overestimation bias. The baseline Q-learning algorithm exhibits unstable and suboptimal behavior in non-deterministic settings, whereas Expert Q-learning demonstrates more robust performance with higher scores, illustrating that our algorithm is indeed suitable for integrating state values from expert examples into Q-learning.


Introduction
Reinforcement learning (RL) is one of the machine learning (ML) research areas that studies approaches to endowing agents with intelligence using trial and error. RL, unlike supervised learning, resorts to a reward function and has no need of labeled data. An agent only requires (partial) information about the current state observation of the environment to choose actions so as to maximize the cumulative reward. Nevertheless, finding an optimal solution is not always possible, and RL algorithms seek a balance between exploitation and exploration [8]. Namely, an agent chooses whether to maximize its return based on past experience, or to explore non-visited states which can have potentially higher returns. Excessive exploitation can lead to sub-optimal solutions, yet excessive exploration leads to slow convergence of the trajectories.
Deep reinforcement learning (DRL) combines RL with deep learning (DL) techniques. Neural networks (NNs), especially deep convolutional neural networks (CNNs), can be trained together with the search routines of RL [9]. The first DRL model to achieve success by combining CNNs with RL was [11]. Raw pixels in Atari games were fed into a CNN as input, and the output of the CNN was a prediction of the future rewards.
Improving DRL with semi-supervised learning and self-supervised learning (SSL) is in the spotlight of the RL research field. For example, Deep Q-learning from Demonstrations (DQfD) uses small sets of demonstrations and accelerates the training through prioritized replay [6]. Self-Predictive Representations (SPR) is a recently proposed data-efficient method that relies on an encoder to compute future latent state representations and to make predictions based on the learned transition model [16].
Board games, including Othello, have been heavily experimented on with DRL methods using game-specific settings [10]. Our algorithm is strongly motivated by designing a computationally less costly RL method via using limited expert examples in a data-efficient manner.
Semi-supervised learning uses a small amount of labeled data together with other unlabeled data, whereas SSL is a subset of unsupervised learning techniques. Semi-supervised learning focuses either on clustering or on generating new labels from existing data. It comprises traditional methods like K-means [20] and more recent generative models like Generative Adversarial Networks (GANs) [14]. K-means generates prototypes which are labeled based on the Euclidean metric. GANs generate adversarial examples by utilizing both a generator and a discriminator. The generator attempts to generate fake examples that resemble the real ones, while the discriminator learns to distinguish the fake images from the real ones. As this adversarial process repeats, the generator eventually learns to generate realistic images.
The idea of performing control tasks by learning from expert examples was previously explored in Behavior Cloning, where an agent learns a policy from expert behaviors in the form of state-action pairs [13]. An autonomous car was able to drive in a variety of road conditions at a speed of 20 miles per hour by behavior cloning from the samples of human drivers. As a drawback inherited from supervised learning, behavior cloning typically requires a large amount of training data.
Moreover, semi-supervised learning can be an intuitive way to improve the performance of RL methods by the efficient usage of examples. Inspired by the idea of combining semi-supervised learning with RL, several methods have been proposed. Inverse Reinforcement Learning (IRL) [15] makes agents adapt to uncertain environments and extracts the reward signal (function) from direct observations of optimal expert behaviors [12]. The agent is then able to choose actions based on the learned cost function. IRL has shown strength in many real-world applications. For example, IRL can successfully perform tasks like personalized route recommendation, traffic warning without any exact destinations, and battery usage optimization for hybrid electric vehicles [25]. IRL is a dual of an occupancy measure matching problem, and the learned cost function equals the dual optimum [7].
However, IRL runs RL in the inner loop and hence has extremely high computational costs, and it lacks the ability to fit actions not present in the examples.
The output of IRL is a cost function but does not make the agent choose an action in the environment. If the goal is to let an agent choose actions from the action space, a separate algorithm to let the agent take actions is needed. Generative Adversarial Imitation Learning (GAIL) [7] is an imitation learning algorithm inspired by the similarities between GANs and IRL. It takes the behaviors of the expert as input and gives the actions as output, skipping the part where the cost function is induced.
Q-learning is an asynchronous dynamic programming method that chooses actions based on the updated Q-values [22]. Q-learning uses an off-policy update that is based on the Bellman Optimality Equation (1):

Q*(s_t, a_t) = E[r_{t+1} + γ max_a Q*(s_{t+1}, a)]   (1)

Here, γ is the discounting factor, Q*(s, a) is the Q-value of state s with action a, and r_{t+1} is the result value at s_{t+1}.
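The update in (1) can be sketched in tabular form as follows. This is a minimal illustration, not the paper's implementation; the states, actions, learning rate, and reward are toy placeholders.

```python
# A minimal tabular Q-learning update based on the Bellman optimality
# equation (1). States and actions are hypothetical placeholders.
GAMMA = 0.99   # discounting factor
ALPHA = 0.1    # learning rate (illustrative choice)

Q = {}  # maps (state, action) -> Q-value, defaulting to 0

def q(state, action):
    return Q.get((state, action), 0.0)

def update(state, action, reward, next_state, next_actions):
    # Target: r + gamma * max_a' Q(s', a'), per the Bellman optimality equation.
    best_next = max((q(next_state, a) for a in next_actions), default=0.0)
    target = reward + GAMMA * best_next
    Q[(state, action)] = q(state, action) + ALPHA * (target - q(state, action))

update("s0", "a0", 1.0, "s1", ["a0", "a1"])
```

Because only the maximal next-state value enters the target, repeated updates propagate the largest (possibly noisy) estimates, which is the source of the overestimation bias discussed later.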
Deep Q-learning (DQN) is the NN implementation of Q-learning, described by [11]. For each episode of the game, the sampled data are not fed directly into the network but kept in a replay buffer. A minibatch is then sampled from the replay buffer to train the NN. This improves both sample efficiency and sample quality by reusing examples while decreasing the correlations between them.
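The replay mechanism described above can be sketched as follows; capacity and batch size are illustrative choices, not the paper's settings.

```python
import random
from collections import deque

# A minimal replay buffer sketch, as used in DQN: transitions are stored
# first, and minibatches are sampled uniformly later, which reuses examples
# and breaks correlations between consecutive samples.
class ReplayBuffer:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement within one minibatch.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
for t in range(100):
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(32)
```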

Methods
Our method is developed based on DQN [11] and experimented with using the game of Othello. Othello is an unsolved board game in which two players play against each other. An Othello board consists of a grid of 8 × 8 squares, with four pieces placed on the four central squares at the beginning of each game.
Whenever a piece is placed, all the opponent's pieces in between this piece and any of the player's pieces are flipped into the current player's color. A new piece can only be placed in a square that results in flips. If there is no valid move for one player, the turn is skipped. If there is no valid move for both players, the game ends. The final game result is determined by counting the pieces on the board. The player with the most pieces wins and a draw is also possible.
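The flipping rule above can be made concrete with a short sketch. This is an illustrative implementation of the rule, not the paper's environment code; the board encoding (0 = empty, 1 and -1 for the two players) is an assumption.

```python
# A compact sketch of the Othello flipping rule: a move is only legal if it
# flips at least one opponent piece along some direction.
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]

def flips_for_move(board, row, col, player):
    """Return all opponent pieces flipped by placing `player` at (row, col)."""
    if board[row][col] != 0:
        return []
    flipped = []
    for dr, dc in DIRECTIONS:
        run = []  # consecutive opponent pieces in this direction
        r, c = row + dr, col + dc
        while 0 <= r < 8 and 0 <= c < 8 and board[r][c] == -player:
            run.append((r, c))
            r, c = r + dr, c + dc
        # The run is flipped only if it is bounded by one of the player's pieces.
        if run and 0 <= r < 8 and 0 <= c < 8 and board[r][c] == player:
            flipped.extend(run)
    return flipped

# Standard initial position: four pieces on the central squares.
board = [[0] * 8 for _ in range(8)]
board[3][3], board[4][4] = 1, 1
board[3][4], board[4][3] = -1, -1
```

An empty result from `flips_for_move` means the move is illegal, which is exactly the legality condition stated above.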
We use a variation of Double Q-learning (Double Q) [5] and Dueling Q-learning (Dueling Q) [21] as our baseline model. We then design a novel algorithm which is suitable for incorporating expert examples. Afterwards, we train an agent either with or without expert examples and compare both against the baseline.
For our algorithm, we tackle a plausible situation in which we have an offline expert, but the expert cannot tell the detailed Q-values (state-action values). Instead, it tells whether a state is a good state or not. This setting is especially realistic for many real-world control tasks, where the reward function is hard to define and even the best expert cannot give a detailed evaluation of the state-action values. Moreover, it is often computationally less costly if an expert only provides a coarse evaluation of the state values instead of accurate state-action values.

Double and Dueling Q-learning
Q-values in Q-learning grow larger and larger because each update only uses the maximal state-action value of the next state, a phenomenon known as 'overestimation bias'. Overestimation bias often results in sub-optimal and unstable performance of Q-learning [17]. The original Double Q exploits two different neural networks, Q^A and Q^B. In each update, only one network is chosen by some update scheme, e.g., random update. Either (2) or (3) is used to update the chosen network with learning rate α(s, a), where a* is argmax_a Q^A_θ(s', a) and b* is argmax_a Q^B_θ(s', a) at the next state s':

Q^A_θ(s, a) ← Q^A_θ(s, a) + α(s, a)(r + γ Q^B_θ(s', a*) − Q^A_θ(s, a))   (2)

Q^B_θ(s, a) ← Q^B_θ(s, a) + α(s, a)(r + γ Q^A_θ(s', b*) − Q^B_θ(s, a))   (3)
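A tabular sketch of the original Double Q scheme with random update selection may help. The tables, states, and learning rate are illustrative placeholders; the key point is that the action is selected with one table and evaluated with the other.

```python
import random

# A sketch of the original Double Q-learning update (2)/(3): one of the two
# tables is chosen at random; the action is argmax'ed on one table and its
# value is taken from the other, which counteracts overestimation bias.
GAMMA = 0.99
ALPHA = 0.1  # illustrative constant learning rate

def double_q_update(QA, QB, state, action, reward, next_state, actions):
    key = (state, action)
    if random.random() < 0.5:
        # Update A: a* chosen by QA, evaluated by QB, as in (2).
        a_star = max(actions, key=lambda a: QA.get((next_state, a), 0.0))
        target = reward + GAMMA * QB.get((next_state, a_star), 0.0)
        QA[key] = QA.get(key, 0.0) + ALPHA * (target - QA.get(key, 0.0))
    else:
        # Update B symmetrically, as in (3).
        b_star = max(actions, key=lambda a: QB.get((next_state, a), 0.0))
        target = reward + GAMMA * QA.get((next_state, b_star), 0.0)
        QB[key] = QB.get(key, 0.0) + ALPHA * (target - QB.get(key, 0.0))

QA, QB = {}, {}
double_q_update(QA, QB, "s0", "a0", 1.0, "s1", ["a0", "a1"])
```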
Moreover, it is common practice for the target network to be just a copy of the Q-network with delayed synchronization, for the purpose of minimal computation overhead [19]. Double Q introduces a new problem, as it tends to underestimate Q-values. We use a variation of Double Q in our experiments to avoid underestimation, where we only update Q^A_θ and a* is argmax_a Q^B_θ(s', a). Dueling Q was proposed based on DQN. The idea is that the state value V(s) and the advantage A*(s, a) are implicitly contained in each Q-value Q*(s, a) by (4):

Q*(s, a) = V(s) + A*(s, a)   (4)

Hence, it is possible to represent them separately in the NN architecture in order to yield improved performance.
The notion of maintaining two separate functions in Dueling Q traces back to [1], [4], and [3]. The shared Bellman residual update equation can be decomposed into one function with the state value and one function with the action advantage. Moreover, Dueling Q forces the maximal advantage to be 0 and replaces the max operator with an average when calculating the advantage, for the purpose of improving the stability of the algorithm. Without such a constraint, any change in a Q-value could be compensated arbitrarily between the state value and the optimal action's advantage, leading to potentially unstable training. Figure 1 illustrates our implementation of the Dueling Q architecture.
Consider the i-th NN output o_i in Dueling Q. Let ω_{i,a} be the weight of the action layer for the i-th output, ω_s the weight of the state layer, K the total number of output neurons, and h the output of the body layer ahead of the branching, with weight ω_b. Then:

o_i = ω_s h + ω_{i,a} h − (1/K) Σ_{k=1}^{K} ω_{k,a} h   (5)

Suppose y_i is the target value of the i-th neuron; with an MSE loss, we have the gradient:

∂L/∂ω_b = Σ_{i=1}^{K} 2(o_i − y_i)(ω_s + ω_{i,a} − (1/K) Σ_{k=1}^{K} ω_{k,a}) ∂h/∂ω_b   (6)

In spite of the branching, the loss still backpropagates to the body layer as a whole during gradient descent. This explains Dueling Q's inability to regulate the overestimation bias of Q-learning. To improve on that, we decouple the gradients by using two different networks. The Q-values can be reduced when the gradients of the two networks are independent, because the overestimation bias is then represented by the state network instead.
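The mean-subtracted aggregation that Dueling Q uses to combine the state value and the advantages can be sketched numerically; shapes and values here are illustrative.

```python
import numpy as np

# A numpy sketch of the dueling aggregation: each Q-value is the state
# value plus the advantage, with the mean advantage subtracted so that the
# decomposition into V(s) and A(s, a) is identifiable.
def dueling_q(value, advantages):
    # value: scalar V(s); advantages: array A(s, a) over the K actions.
    return value + advantages - advantages.mean()

adv = np.array([1.0, 2.0, 3.0])
q_values = dueling_q(0.5, adv)
```

Note that adding a constant to all advantages leaves `q_values` unchanged, which is exactly why the state value and advantages share one gradient path through the body layer.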

Expert Q-learning
We propose our algorithm, Expert Q-learning (Expert Q), to incorporate ideas from semi-supervised learning into Q-learning. The state values given by expert examples are from {-1, 0, 1}, simply indicating whether a state is bad, neutral or good. During training, we train an expert (state) network E_θ to predict the state values.
We improve the Dueling Q algorithm to utilize the state value in a more explicit way by composing the Q-values directly, so that the state value is no longer internally represented in the Q-network but appears in our update scheme. Our Q-value is calculated by (7):

Q*(s, a) = E_θ(s) + Q_θ(s, a) − (1/K) Σ_{k=1}^{K} Q_θ(s, a_k)   (7)

Here, K is the size of the action space; K = 65, since there is one action for each of the 64 squares on the board plus an additional action of not being able to place a piece. Q*(s, a) is the updated Q-value, obtained by adding the state value E_θ(s) given by the expert to the network prediction Q_θ(s, a), and subtracting the mean of the predictions at state s.
Since we have Q_θ(s, a_k) = A(s, a_k) + V(s) by (4), the predictions of the Q-network can indeed be treated the same as action advantages in Dueling Q during the update, by (8):

Q*(s, a) = E_θ(s) + A(s, a) − (1/K) Σ_{k=1}^{K} A(s, a_k)   (8)
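The composition in (7) can be sketched as follows; the inputs are illustrative arrays standing in for the network outputs.

```python
import numpy as np

# A sketch of the Expert Q composition in (7): the expert network supplies
# the state value, while the Q-network outputs act as advantages after
# their mean over the K = 65 actions is subtracted.
K = 65

def expert_q_values(expert_value, q_network_out):
    # expert_value: scalar E_theta(s); q_network_out: array of K predictions.
    return expert_value + q_network_out - q_network_out.mean()

q_out = np.full(K, 2.0)              # Q-network predictions with a shared offset
q_star = expert_q_values(1.0, q_out)  # the shared offset cancels out
```

Because the mean is subtracted, any value offset shared by all Q-network outputs cancels, leaving the expert's state value as the only absolute level in Q*(s, a).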
Our full method is described in Algorithm 1. The first part of the algorithm, i.e., line 2 to line 11, is similar to the steps of Double Q, while our Expert Q steps follow from line 12 to line 21. In the algorithm, r is the γ-discounted result value, γ is the discounting factor, ε is the exploration ratio, s' is the next state, a* is argmax_a Q^B_θ(s', a), maxIter is the maximal number of iterations, and L() is the loss function, i.e., Mean Squared Error (MSE) in this case.
We have four networks Q^A_θ, Q^B_θ, E^A_θ and E^B_θ, where Q^A_θ and E^A_θ are the Q-network and expert (state) network, and Q^B_θ and E^B_θ are their copies. E^A_θ is trained with state s_e and state value v_e when the expert example buffer E is not empty.
As mentioned above, the overestimation bias is represented by the expert network instead. The update in line 20 of Algorithm 1 has the effect of reducing overestimation bias because discrete values from {-1, 0, 1} are used as the target values of the expert network; these are considerably smaller than the actual outputs when overestimation happens.
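A toy calculation illustrates this effect: regressing an overestimated state value toward a target capped in {-1, 0, 1} pulls it back into the target range. The learning rate and starting value are illustrative, not taken from the paper.

```python
# MSE regression of a scalar state value toward a coarse expert target.
LR = 0.5  # illustrative learning rate

def mse_step(prediction, target):
    # One gradient step on L = (prediction - target)^2 w.r.t. the prediction.
    return prediction - LR * 2.0 * (prediction - target)

v = 10.0                   # an overestimated state value
for _ in range(20):
    v = mse_step(v, 1.0)   # expert says the state is good: target +1
```

After the updates, the state value settles at the expert target, so the inflated estimate cannot persist in the expert network.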

Experimental Setup
In our experiments, board positions are represented by binary values in two channels, with each channel denoting the pieces of one player, which amounts to an input size of 8 × 8 × 2. The outputs are values corresponding to actions, which amounts to an output size of 8 × 8 + 1.
Players always play from the initial board position during training, but they play from all 236 position initializations during testing. There are 236 unique board positions after 4 turns from the start of Othello [18]. Testing all 236 initializations yields results that reflect the capabilities of the models more properly. Each player plays the white and black sides evenly in one round during testing, for a total of 472 games per round. Results are measured by a score calculated from the number of wins and draws out of the total number of games, as shown in (9):

score = (wins + 0.5 × draws) / games   (9)
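The score in (9) translates directly into code; the win/draw counts below are illustrative, not results from the paper.

```python
# The score metric from (9): wins count fully, draws count half.
def score(wins, draws, games):
    return (wins + 0.5 * draws) / games

# One hypothetical test round of 472 games, as described above.
s = score(300, 40, 472)
```

A score of 0.5 corresponds to break-even play; anything above indicates the model wins more value than its opponent over the round.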

Parameters and Architecture
Most parameters are kept as similar as possible across experiments. The discounting factor γ is set to 0.99. Meanwhile, the reward at a single state is the γ-discounted value of the final outcome of that game. The learning rate is 1e−4 for both the state network and the Q-network. The total number of iterations is 100 × 10^3, and the copies of the networks are synchronized with delay. A replay buffer of 100000 samples is used to maintain the intention of using a replay buffer, i.e., decreasing correlations between examples.
Our network consists of four 2D convolutional layers. The state network has an additional output layer of size 1, and the Q-network has an additional output layer of size 65. Double Dueling Q uses an action branch with output size 65 and a state branch with output size 1; its final output layer also has size 65. The architecture of Expert Q is shown in Figure 2: the top is the expert network and the bottom is the Q-network. Most of the layers are kept as similar as possible.

Results
The results are obtained from playing against RANDOM, GREEDY and STOCHASTIC. In each setting, the plot is an average of 10 different runs, shown in Figure 3. Means (µ) and standard deviations (σ) after 50 × 10^3 and 100 × 10^3 iterations are shown in Table 2.
The results illustrate that Expert Q performs best of the three models after 100 × 10^3 iterations of training when it is trained and tested against the same opponent in all three experiment settings. Expert Q without examples performs slightly worse than Expert Q overall, but still shows better performance than Double Dueling Q except when playing against GREEDY.
Double Dueling Q has a sudden drop of scores after approximately 40 × 10 3 iterations in Figure 3 (c) whereas its score stays more reasonable in Figure 3 (b). In the meantime, the initial Q-values of Double Dueling Q in Figure 3 (d), (e) and (f) rise drastically after 40 × 10 3 iterations. The scores of Double Dueling Q also rise faster than the other two models in the beginning of all three opponent settings. This can be attributed to its quickly rising Q-values, which is an indicator of both faster convergence and higher overestimation bias. Figure 3 (d), (e) and (f) all show that the Q-values in Expert Q are much smaller than in Double Dueling Q. Expert Q has slightly larger initial Q-values than Expert Q without examples because its state network is trained on expert examples, which introduces slightly more bias into the Q-network.
The three different settings of the opponent (RANDOM, GREEDY, STOCHASTIC) correspond to a random environment, a deterministic environment and a stochastic environment, respectively. The Double Dueling Q model makes more mistakes in the stochastic environment than in the deterministic environment when its choices are based on overoptimistic Q-values. This result corresponds to the previous conclusion that the issue of overestimation bias was found to be most salient in the stochastic environment [2].
It can be informative to check the results of trained models playing against each other. Table 3 shows the results of players trained against STOCHASTIC playing against each other over 10 rounds, a total of 4720 games. We find that the baseline player (Double Dueling Q) actually performs almost equally to Expert Q without examples when the opponent is Expert Q, despite the fact that Double Dueling Q performs poorly when playing against Expert Q without examples. In turn, the Expert Q player demonstrates the best performance, as expected.
The result can be seen as an illustration of the bias-variance trade-off. High bias in our baseline model leads to poor performance when trained and tested against STOCHASTIC, but also to relatively more stable performance when playing against other players. This corresponds to the observation that the improvement in performance is more significant when expert examples are used to train the state network.

Discussion
Our results indicate that Expert Q holds significant advantages over Double Dueling Q, especially when playing against a stochastic player (a stochastic environment). Expert Q without examples demonstrates performance superior to Double Dueling Q when it is trained and tested against a non-deterministic player.
When our trained models play against each other, Expert Q still maintains the highest performance of the three models, whereas Expert Q without examples beats Double Dueling Q but does not show more strength when they play against Expert Q. This can be explained by the bias-variance trade-off and shows that Expert Q is an algorithm designed for joint use with expert examples.
In Q-learning, high bias is unavoidable and often undesired. Our experiments, however, demonstrate that higher bias during training or even testing results in more stable performance if the bias comes from expert examples. The model with slightly higher bias than Expert Q without examples, namely Expert Q, provides the best performance of the three models. We argue that low bias is not always desirable, especially when the environment is susceptible to change, and that somewhat higher bias can make the model more resistant to those changes.
It is worth mentioning that Double Q can underestimate instead of overestimate when Q^A_θ is used to choose a*. The performance also depends on the choice of parameters (e.g., learning rate, synchronization scheme).
To further incorporate SSL into our method, it might be interesting to adapt the Teacher-Student paradigm to our expert network, where a teacher is trained on existing expert examples and then used to label new, unlabeled examples [24]. Afterwards, a student is trained on the data labeled by the teacher. Meanwhile, the student can be trained with increased noise [23] to make the learning robust to the real situation.

Conclusion
Our research contributes a novel algorithm that exploits state values from expert examples and uses relatively fewer computational resources than MCTS in decision-making games. Our trained models are put directly into contests on Othello, which could not be achieved if the experiments were performed in a typical single-player game environment.
Expert Q shows stronger performance compared to the baseline algorithm and bests both Double Dueling Q and Expert Q without examples in direct competitions. The use of expert examples improves the performance by a large margin compared to not using expert examples.
The combined use of semi-supervised learning and DRL is expected to be applicable in many real-world applications.