Controlling Blood Glucose For Patients With Type 1 Diabetes Using Deep Reinforcement Learning – The Influence Of Changing The Reward Function

Reinforcement learning (RL) is a promising direction in adaptive and personalized type 1 diabetes (T1D) treatment. However, the reward function, one of the most critical components in RL, is in most cases hand-designed and often overlooked. In this paper we show that different reward functions can dramatically influence the final result when using RL to treat in-silico T1D patients.


Introduction
Reinforcement learning (RL) is a subfield of machine learning whose aim is to understand and automate goal-directed learning and decision-making [13]. In combination with recent advances in deep learning, deep reinforcement learning has emerged as a very powerful tool for difficult control tasks [11,6].
The artificial pancreas (AP) is a system combining an insulin pump, a continuous glucose monitor and a control algorithm that releases insulin in response to changing blood glucose (BG) levels, mimicking a healthy human pancreas. Several works have shown promising results using RL for the AP [2,7,8,12], but the main focus of these algorithms has been on fitting the RL framework to the case of type 1 diabetes (T1D). In this work we focus on the reward function, an often overlooked component of empirical reinforcement learning. It is well known that the success of an RL application strongly depends on how well the reward signal frames the goal of the application's designer and how well the signal assesses progress towards that goal [18]. In the diabetes case it is particularly the contrasting problems of hyper- and hypoglycemia, i.e. too high or too low BG levels, that are problematic for RL applications. In fact, hypoglycemia is a commonly reported problem and one of the most acute complications of all types of diabetes. We propose several new reward functions suited for T1D, and perform in-silico experiments testing different reward functions with the trust-region policy optimization (TRPO) algorithm [9] using the Hovorka model [4].
Our experiments demonstrate that focusing on reward functions that contain more domain knowledge, such as stronger penalties for reaching low BG levels, is crucial.

Deep reinforcement learning: Policy optimization and TRPO
Policy gradient algorithms consider parametric policies $\pi_\theta(a|s)$ which are optimized using gradient ascent on a given performance measure. The most common choice for the performance measure is the expected return from the start state $s_0$,

$J(\theta) = v_{\pi_\theta}(s_0) = \mathbb{E}_{\pi_\theta}\big[\textstyle\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\big|\, s_0\big].$

Using policy gradient algorithms yields several benefits: the policy gradient theorem, the application of RL to continuous action spaces, and a natural extension to deep learning using neural networks to parameterize the policies.
Furthermore, a key point of using policy gradient algorithms is the policy gradient theorem [13],

$\nabla_\theta J(\theta) \propto \sum_s \mu(s) \sum_a q_{\pi_\theta}(s,a)\, \nabla_\theta \pi_\theta(a|s),$

where $\mu$ is the on-policy state distribution and $q_{\pi_\theta}$ the action-value function. This states that the gradient of the performance measure is proportional to the gradient of the policy itself, which allows the use of any differentiable policy parameterization. Furthermore, the policy gradient theorem is constructive, so it directly yields a simple sample-based algorithm, REINFORCE [16], omitted here for brevity. This algorithm has been well studied and a number of improvements and suggestions have been proposed, see e.g. [9,10,5]. The current state of the art in model-free policy gradient algorithms is Trust Region Policy Optimization (TRPO) by Schulman et al. [9] and a simplified version, Proximal Policy Optimization [10]. In this work we restrict our attention to TRPO.
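As a reference point for the sample-based algorithm mentioned above, here is a minimal sketch of REINFORCE; the episode format and the `grad_log_pi` callback are illustrative assumptions, not the paper's implementation:

```python
def reinforce_update(theta, episodes, grad_log_pi, alpha=0.01, gamma=0.99):
    """One pass of REINFORCE: theta <- theta + alpha * G_t * grad log pi(a_t|s_t).

    episodes: list of episodes, each a list of (state, action, reward) tuples.
    grad_log_pi(theta, s, a): gradient of log pi_theta(a|s) (assumed callback).
    """
    for episode in episodes:
        rewards = [r for (_, _, r) in episode]
        returns = []
        g = 0.0
        for r in reversed(rewards):       # discounted return-to-go G_t
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        for (s, a, _), g_t in zip(episode, returns):
            theta = theta + alpha * g_t * grad_log_pi(theta, s, a)
    return theta
```

In practice the return-to-go is usually replaced by an advantage estimate to reduce variance, which is one of the improvements leading towards TRPO.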
Trust region policy optimization is a policy gradient algorithm where each update of the policy is guaranteed to improve the performance. This guarantee is achieved by enforcing the Kullback-Leibler (KL) divergence between the old and the updated policy to be small:

$\max_\theta \; \mathbb{E}\Big[\tfrac{\pi_\theta(a|s)}{\pi_{\theta_{\mathrm{old}}}(a|s)} A_{\theta_{\mathrm{old}}}(s,a)\Big] \quad \text{subject to} \quad \mathbb{E}\big[D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}(\cdot|s)\,\|\,\pi_\theta(\cdot|s)\big)\big] \le \delta.$

We refer the reader to Schulman et al. [9] for further details. The policy $\pi_\theta(a|s)$ is a Gaussian policy,

$\pi_\theta(a|s) = \mathcal{N}\big(\mu(s,\theta), \sigma(s,\theta)^2\big),$

where $\mu(s,\theta)$ and $\sigma(s,\theta)$ are feature extractors; we use neural network feature extractors in this work.
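For concreteness, a toy version of such a Gaussian policy with a small neural-network mean; the layer sizes and the state-independent standard deviation are assumptions for illustration, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

class GaussianPolicy:
    """pi_theta(a|s) = N(mu(s, theta), sigma(s, theta)^2) with a tiny MLP mean.

    Sizes and the state-independent log-std are illustrative choices.
    """
    def __init__(self, state_dim, hidden=16):
        self.W1 = rng.normal(0.0, 0.1, (hidden, state_dim))
        self.w_mu = rng.normal(0.0, 0.1, hidden)
        self.log_sigma = 0.0              # state-independent log std for simplicity

    def forward(self, s):
        h = np.tanh(self.W1 @ s)          # shared neural feature extractor
        mu = self.w_mu @ h                # mean action (e.g. basal insulin rate)
        sigma = np.exp(self.log_sigma)
        return mu, sigma

    def sample(self, s):
        mu, sigma = self.forward(s)
        return rng.normal(mu, sigma)      # stochastic action for exploration
```

TRPO then adjusts the parameters (here `W1`, `w_mu`, `log_sigma`) subject to the KL constraint between the old and updated Gaussian.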

Reward functions
We consider two categories of reward functions. The first category is symmetric reward functions, which penalize hypo- and hyperglycemic deviations from the target equally; among these we test a binary reward, positive within the normoglycemic range and zero outside it, and a tight binary reward using a narrower range. The second category is asymmetric reward functions: hand-designed reward functions that incorporate domain knowledge about diabetes in order to penalize hypoglycemic events more heavily.

T1D reward: a linear function giving positive reward in the normoglycemic range, an exponentially growing negative reward for hypoglycemia, zero reward for hyperglycemia (bg > bg_hyper), and a large negative reward for severe hypoglycemia.
Tight T1D reward: identical to the T1D reward, except that hypoglycemia is defined as values below bg_hypo = 90 mg/dL in order to be even more aggressive against hypoglycemic events.
Hovorka reward: based on the nonlinear model predictive control from [4], given as $-(bg - y(t))^2$, where $y(t)$ is the desired glucose profile: when BG is above the target, $y(t)$ decreases linearly towards it, while for BG values below the target, $y(t)$ increases exponentially towards it [4].

Performance measures and testing
We test the algorithm on a fixed scenario consisting of 100 random meal-days generated with a fixed random seed. To measure the performance of our simulations, we use time-in-range and time-in-hypoglycemia as the performance measures, where we want to maximize the former and minimize the latter. We also include the low blood glucose risk index (LBGI), the high blood glucose risk index (HBGI), the risk index (RI) and the coefficient of variation (CoV), all described in Clarke and Kovatchev [1].
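To make the asymmetric family concrete, here is a sketch of a T1D-style reward with the shape described above; the thresholds and scale constants are illustrative assumptions, since the paper's exact coefficients are not reproduced in this excerpt:

```python
import math

# Illustrative thresholds in mg/dL, not the paper's exact constants.
BG_SEVERE, BG_HYPO, BG_HYPER = 54.0, 70.0, 180.0
BG_TARGET = 108.0

def t1d_reward(bg):
    """Asymmetric T1D-style reward: linear and positive in range,
    exponentially negative below range, zero above range, and a
    large negative reward for severe hypoglycemia."""
    if bg < BG_SEVERE:
        return -100.0                              # severe hypoglycemia
    if bg < BG_HYPO:
        return -math.exp((BG_HYPO - bg) / 10.0)    # exponential hypo penalty
    if bg <= BG_HYPER:
        # Linear, maximal at target, still positive at the range edges.
        return 1.0 - abs(bg - BG_TARGET) / (BG_HYPER - BG_TARGET)
    return 0.0                                     # hyperglycemia
```

Note the asymmetry: a mildly hypoglycemic value (e.g. 60 mg/dL) is penalized far more than any hyperglycemic one, which is exactly the domain knowledge the asymmetric family encodes.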

Results and discussion
In this work we test and compare different reward functions using TRPO on the original Hovorka in-silico patient [4], in order to show the importance of reward function design.
In the experiments we consider two cases with different insulin-to-carbohydrate ratios (ICR) used to calculate pre-meal bolus insulin doses. This ratio specifies the number of grams of carbohydrate covered by each unit of insulin, see e.g. [14].
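The bolus calculation itself is the standard carbohydrate-to-insulin rule. The sketch below also illustrates why a tighter ICR amplifies carbohydrate-counting errors; the 60 g meal and 15 g error are made-up numbers for illustration:

```python
def premeal_bolus(carbs_g, icr_g_per_unit):
    """Standard pre-meal bolus: insulin units = grams of carbohydrate / ICR."""
    return carbs_g / icr_g_per_unit

# A tighter ICR delivers more insulin for the same meal, so a fixed
# carbohydrate-counting error translates into a larger insulin overdose.
meal, error = 60.0, 15.0            # true carbs and counting error (grams)
for icr in (30.0, 25.0):
    over = premeal_bolus(meal + error, icr) - premeal_bolus(meal, icr)
    print(icr, over)                # excess insulin in units
```

This is why the two ICR cases below differ not only in mean insulin dose but also in how severely counting errors affect the outcome.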
Since we consider a single-hormone AP in this work, the only action available to the algorithm when the BG is too low, or approaching low levels, is to turn off the insulin pump. The actual ICR used during meals will therefore have a strong influence on the overall result. In particular, the severity of carbohydrate-counting errors, which we include in our simulations, will be affected by different ICRs.

Case 1: 30g/U ICR
We start with a 30 g/U ICR, meaning that the in-silico Hovorka patient takes 1 unit of insulin for each 30 grams of carbohydrate intake. We run the TRPO algorithm for 100 iterations using all the reward functions described in Section 3. Figure 1 shows mean BG levels for the different reward functions used within TRPO and for the basal-bolus regimen. All reward functions and the basal-bolus regimen perform well at 30 g/U ICR, spending most of the time within range. However, most of the symmetric rewards yield lower mean BG values than the asymmetric rewards, resulting in a higher hypoglycemia risk. Only the tight binary reward function shows results comparable to the asymmetric reward functions, keeping mean BG values closer to the target. Results from these experiments are summarized in Table 1.
TRPO outperforms the basal-bolus regimen in terms of time-in-range for all the reward functions tested. However, that is not the case in terms of hypoglycemic events, where the symmetric rewards struggle to avoid hypoglycemia. Only the tight binary reward function presents competitive results, avoiding hypoglycemic excursions similarly to the asymmetric rewards. The risk reward function actually increases the time spent in hypoglycemia, showing worse results than the rest of the asymmetric rewards. The opposite happens with hyperglycemic excursions, where the symmetric reward functions perform better at avoiding hyperglycemia. This is because the symmetric reward functions treat hypo- and hyperglycemic events equally, while the asymmetric reward functions are designed to take external knowledge of the diabetes problem into account. In this work, this external information consists of a higher penalty for hypoglycemia than for hyperglycemia, which translates into safer behaviour, reducing the time spent in hypoglycemic events. This is also reflected in the risk factors, where the asymmetric reward functions are more robust against the risk of hypoglycemia than the symmetric reward functions, while both kinds of functions show similar performance in terms of hyperglycemic risk. Therefore, the overall risk factor is lower for the asymmetric rewards. Finally, the asymmetric reward functions that penalize hypoglycemia more than hyperglycemia also present a lower CoV, and only the asymmetric risk reward function shows results similar to the symmetric functions.
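For reference, the risk indices and CoV reported in the tables can be computed from a BG trace as sketched below, using the symmetrizing transform of Clarke and Kovatchev [1]; the 70-180 mg/dL interval used for time-in-range is an assumed (though common) definition:

```python
import numpy as np

def glucose_metrics(bg):
    """Time-in-range, LBGI, HBGI, RI and CoV from a BG trace in mg/dL.

    Uses the standard Kovatchev risk transform; the 70-180 mg/dL
    time-in-range definition is an assumption of this sketch.
    """
    bg = np.asarray(bg, dtype=float)
    tir = np.mean((bg >= 70) & (bg <= 180)) * 100        # % of samples in range
    f = 1.509 * (np.log(bg) ** 1.084 - 5.381)            # symmetrizing transform
    risk = 10.0 * f ** 2
    lbgi = np.mean(np.where(f < 0, risk, 0.0))           # low BG risk index
    hbgi = np.mean(np.where(f > 0, risk, 0.0))           # high BG risk index
    cov = bg.std() / bg.mean() * 100                     # coefficient of variation, %
    return {"TIR": tir, "LBGI": lbgi, "HBGI": hbgi, "RI": lbgi + hbgi, "CoV": cov}
```

Because the transform maps hypoglycemic values to large negative deviations, a few low readings inflate LBGI much faster than high readings inflate HBGI, which is why LBGI is a sensitive safety measure here.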
Figure 1: Mean blood glucose levels using TRPO with different reward functions, averaged over 100 episodes. Each test episode runs for one and a half days, a total of 36 hours, to include the effects of the algorithm after the last meal. The insulin-to-carbohydrate ratio is fixed at 30 g/U.

Case 2: 25g/U ICR
We select a 25 g/U ICR for the second set of experiments, meaning that the in-silico Hovorka patient uses 1 unit of insulin for each 25 grams of carbohydrate intake. In this set of experiments the patient therefore uses more insulin to cover the same amount of carbohydrates. The mean BG levels for the basal-bolus regimen and for the different reward functions used within TRPO are shown in Figure 2. TRPO shows good performance, with mean BG values within range most of the time. However, symmetric reward functions lead to lower BG values and thus a higher risk of hypoglycemia, while asymmetric reward functions stay at safer glucose levels.
The results summarized in Table 2 show TRPO clearly improving the time spent in the target range while reducing hypoglycemic events compared with the basal-bolus regimen, which in this case is not able to maintain safe BG values.
Furthermore, the asymmetric reward functions, which take into account the importance of avoiding hypoglycemia, perform better than the symmetric reward functions, reducing hypoglycemic events. This is also reflected in the reduced overall risk index. The symmetric reward functions deal better with high BG values, reducing the time spent in hyperglycemia. However, in spite of this reduction, the risk of hyperglycemia is similar for symmetric and asymmetric reward functions, with the asymmetric T1D reward function showing the lowest risk. Therefore, the asymmetric reward functions result in a lower total risk factor. Regarding the coefficient of variation, the asymmetric T1D reward function performs best at decreasing variance, while the symmetric binary reward function presents a CoV value closest to the basal-bolus strategy. The rest of the reward functions present similar results, reducing the CoV with respect to the basal-bolus regimen.

Conclusions
In this work we have shown that changing the reward function has an impact on the overall performance of RL agents in the AP framework. Furthermore, we tested the influence of including domain knowledge in the reward function, and observed that this reduces both hypoglycemic events and risk indices in general, ultimately improving the safety of the in-silico T1D patients.

Table 2: Summary of results for the 25 g/U insulin-to-carbohydrate ratio. Mean values ± standard deviation of 100 runs, with each episode running for one and a half days, a total of 36 hours.