Extracting Rules from Neural Networks with Partial Interpretations

We investigate the problem of extracting rules, expressed in Horn logic, from neural network models. Our work is based on the exact learning model, in which a learner interacts with a teacher (the neural network model) via queries in order to learn an abstract target concept, which in our case is a set of Horn rules. We consider partial interpretations to formulate the queries. These can be understood as a representation of the world where part of the knowledge regarding the truth of propositions is unknown. We employ Angluin's algorithm for learning Horn rules via queries and evaluate our strategy empirically.


Introduction
Neural networks have been used to achieve important milestones in artificial intelligence [4,10,12,8], but it is difficult to understand how the predictions of these models are made, and this limits their usability. In this work, we propose an approach for extracting rules from black-box machine learning models, such as neural networks. It is often the case that not all values in a dataset are known or trustworthy. For this reason, our approach assumes settings in which the dataset used to train the neural network contains missing values.
We first binarize a given dataset and train a neural network with it. Then, we run the LRN algorithm [9]. This algorithm poses queries to the neural network, seen as a teacher, in order to extract rules encoded in it. Rules are represented using Horn logic; for example, they can be of the form ((horse ∧ wings) → pegasus). With Horn rules, we can carry out automated reasoning in polynomial time, and it is feasible to check the quality of the model.
We perform an empirical study using the hepatocellular carcinoma (HCC) dataset [16], which describes survivability of patients diagnosed with hepatocellular carcinoma according to clinical information. HCC contains many missing values of attributes of patients. We compare the hypothesis built with our approach with the hypothesis built by a state-of-the-art implementation of the incremental decision tree algorithm [7]. Our rule extraction procedure correctly extracts meaningful rules and it is two times faster than the decision tree algorithm.
Related Work. A similar work [20] extracts probabilistic automata from neural networks by asking queries, and a recent work [15] focuses on how to better simulate queries asked to black-box models. There are also methods that verify binarized neural networks by extracting a binary decision diagram [17] through queries. The interpretability field is large and there are many approaches to interpreting neural network models [21]. Our technique belongs to the global and active approach that explains the already trained model as a whole, as opposed to changing the network architecture for interpretability (passive), or explaining through feature studies or correlation (local).

Logic and Neural Networks
Let V be a finite set of boolean variables. A literal over V is either a variable v ∈ V or its negation ¬v. A literal is positive, if it is a variable, and negative otherwise. A clause over V is a disjunction (∨) of literals over V. It is Horn if at most one literal is positive. A (propositional) formula over V is a conjunction of clauses over V (in conjunctive normal form). It is Horn if its clauses are Horn.
An interpretation is a function that maps each variable in V to either 0 (false) or 1 (true). It also maps the constant symbol ⊤ to 1 and ⊥ to 0. We write v ∈ I if I(v) = 1. We may omit 'over V' in formulas, clauses, literals, and interpretations. A variable v ∈ V is satisfied by I if v ∈ I, otherwise it is falsified. A negative literal ¬v is satisfied by I iff v is falsified by I. A clause c is satisfied by I iff at least one literal in c is satisfied by I, and a formula t is satisfied by I iff every clause in t is satisfied by I. A partial interpretation extends the notion of an interpretation by allowing some values to be "missing" or "unknown", denoted '?'. In detail, it is a function which maps V to {0, 1, ?}. A partial interpretation I satisfies a formula t if there is a way to replace each ? in its image by either 0 or 1 such that the resulting function satisfies t.
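The definition of satisfaction by a partial interpretation can be made concrete with a small Python sketch that tries every completion of the unknown values. The data representation (a dict from variable names to 0, 1 or '?', and clauses as lists of (variable, is_positive) pairs) and the function names are assumptions made for this illustration only. The enumeration is exponential in the number of unknowns, so it is meant to illustrate the semantics rather than to be an efficient procedure.

```python
from itertools import product

UNKNOWN = "?"


def satisfies_total(interp, clauses):
    """True iff every clause has at least one literal satisfied by interp.

    interp maps each variable to 0 or 1; a clause is a list of
    (variable, is_positive) literals.
    """
    return all(
        any(interp[var] == (1 if is_positive else 0) for var, is_positive in clause)
        for clause in clauses
    )


def satisfies_partial(partial, clauses):
    """True iff some replacement of every '?' by 0 or 1 satisfies all clauses."""
    unknown = [var for var, value in partial.items() if value == UNKNOWN]
    for bits in product((0, 1), repeat=len(unknown)):
        completed = dict(partial)
        completed.update(zip(unknown, bits))
        if satisfies_total(completed, clauses):
            return True
    return False


# (horse ∧ wings) → pegasus, written as the clause ¬horse ∨ ¬wings ∨ pegasus
formula = [[("horse", False), ("wings", False), ("pegasus", True)]]
print(satisfies_partial({"horse": 1, "wings": 1, "pegasus": UNKNOWN}, formula))
```

In this example the unknown value of pegasus can be completed with 1, so the partial interpretation satisfies the formula, whereas fixing pegasus to 0 would falsify it.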
Horn clauses c can be written as rules of the form ant(c) → con(c), where ant(c) (antecedent) is the set of variables that occur negated in c, or the constant symbol ⊤ if none is negated; and con(c) (consequent) is the positive literal in c, or ⊥ if none is positive.
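The translation from a Horn clause to a rule can be sketched in a few lines of Python. The clause representation (a list of (variable, is_positive) pairs) and the name clause_to_rule are assumptions of this example; an empty antecedent set plays the role of ⊤.

```python
BOTTOM = "⊥"


def clause_to_rule(clause):
    """Turn a Horn clause into a pair (antecedent, consequent).

    The antecedent is the set of variables that occur negated in the clause
    (the empty set stands for ⊤); the consequent is the single positive
    literal, or ⊥ if the clause has none.
    """
    positives = [var for var, is_positive in clause if is_positive]
    if len(positives) > 1:
        raise ValueError("not a Horn clause: more than one positive literal")
    antecedent = frozenset(var for var, is_positive in clause if not is_positive)
    consequent = positives[0] if positives else BOTTOM
    return antecedent, consequent


# ¬horse ∨ ¬wings ∨ pegasus  becomes  (horse ∧ wings) → pegasus
print(clause_to_rule([("horse", False), ("wings", False), ("pegasus", True)]))
```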
If an interpretation I satisfies a literal, a clause or theory x, we write I |= x, otherwise, I ⊭ x. Let t be a theory and let c be a clause. If, for every I, we have that I |= t implies I |= c, then we write t |= c and we say that t entails c. If t entails every clause in a theory t′, then we also write t |= t′. If t′ |= t also holds, then t and t′ are logically equivalent and we write t ≡ t′. We say that a formula φ is satisfiable if there is an interpretation I such that I |= φ, and falsifiable if its negation ¬φ is satisfiable.
Neural network models in this work can be understood as an alternative way of representing a formula in propositional logic. A neural network model N is a function that receives a vector in the |V|-dimensional space, with values in {0, 1, ?} (with '?' standing for an 'unknown value'), and outputs a classification of this input. The mapping from interpretations to vectors is defined as follows. Given an interpretation I over V, we assume a total order on the elements of V and denote by vector(I) the vector in the |V|-dimensional space where the element at position i is 1 if v_i ∈ I and 0 otherwise. In this work, a dataset is a set of elements of the form (vector(I), l), where l is either 0 or 1 and I is a partial interpretation. For every neural network N trained on a given dataset, there is a propositional formula t_N such that N(vector(I)) = 1 iff I |= t_N. In this sense, N can be seen as an alternative representation of t_N.

Learning via Queries
To formally define the problem setting, we use the notion of a learning framework F as a pair (E, H), where H is the set of all formulas in propositional logic and E is the set of all partial interpretations (over variables in V). We say that F is Horn if H is restricted to the set of all Horn formulas. For any h ∈ H, I ∈ E is a positive example for h if I |= h, and a negative example for h if I ⊭ h. For any h, t ∈ H, a counterexample for t and h is an example I ∈ E such that either I |= t and I ⊭ h (a positive counterexample), or I |= h and I ⊭ t (a negative counterexample).
We study the problem of identifying an unknown target t ∈ H by posing queries to two kinds of oracles [9] (implementation in Section 3.1). A membership oracle MQ_{F,t} is a function that takes as input I ∈ E and outputs 'yes' if I |= t, and 'no' otherwise. An equivalence oracle EQ_{F,t} takes as input a hypothesis h ∈ H and outputs 'yes' if h ≡ t; otherwise, it outputs a counterexample for t and h. A membership query is a call to MQ_{F,t} and an equivalence query is a call to EQ_{F,t}.
Definition 1 (Exact Learning). A learning framework F = (E, H) is exactly learnable if there is a deterministic algorithm A that takes as input the set of variables V used to formulate the target t ∈ H, asks membership and equivalence queries, and outputs a hypothesis h ∈ H equivalent to t. We say that F is exactly learnable in polynomial time if the number of steps used by A is bounded by a polynomial in |t| and the size of the largest counterexample seen so far. Each query counts as one step of computation.

Extracting Horn Rules with Partial Interpretations
The goal of our work is to find rules hidden in a black-box machine learning model such as a trained neural network model. We present an adaptation of the LRN algorithm [9] that learns from partial interpretations instead of entailments, as originally proposed by the authors of the mentioned paper. This algorithm is able to exactly identify any unknown target Horn theory by posing queries to oracles that can answer membership and equivalence queries. The algorithm is guaranteed to terminate in polynomial time with respect to the number of variables under consideration.

The LRN * Algorithm
We adapted LRN so that it is able to learn rules from partial interpretations. Membership queries take as input partial interpretations, and counterexamples to equivalence queries are also partial interpretations. Algorithm 1 shows the main steps of the modified algorithm.
Algorithm 1 LRN*
Input: It is assumed that the learner knows F (that is, it knows that the hypothesis should be a Horn theory) but not the target t.
Output: h such that h ≡ t.
Let S be the empty sequence.
Denote with I_i the i-th element of S.
Let h be the empty hypothesis.
while EQ_{F,t}(h) returns a counterexample I do
    if there is I_i ∈ S such that I_i ∩ I ⊂ I_i and MQ_{F,t}(I_i ∩ I) = 'no' then
        replace the first such I_i with I_i ∩ I in S
LRN* poses equivalence queries until it receives 'yes' as an answer. It keeps track of important partial interpretations that falsify the target. Each such partial interpretation corresponds to a rule entailed by the target [6]. Upon receiving a negative counterexample, the algorithm asks membership queries to find more specific antecedents of rules entailed by the target. After that, it adds to the hypothesis rules entailed by the target by asking membership queries, and the process repeats. Correctness and termination of Algorithm 1 can be proven similarly as for the LRN algorithm [9]. This is possible because we can simulate membership and equivalence queries from the learning from entailments setting in the learning from partial interpretations setting (and vice versa) [6, Theorem 16].
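To give an impression of how such a query loop behaves, the following Python sketch implements the classic Horn learning algorithm of Angluin, Frazier and Pitt over total interpretations. It is not LRN* itself: it swaps partial interpretations for total ones, it answers equivalence queries exactly by brute force (feasible only for a handful of variables), and all names (satisfies, make_oracles, learn_horn) are assumptions of this example. Rules are pairs (antecedent, consequent), with None standing for ⊥.

```python
from itertools import product


def satisfies(interp, rules):
    """interp: frozenset of true variables; rules: set of (antecedent, consequent)."""
    return all(
        not (ant <= interp and (con is None or con not in interp))
        for ant, con in rules
    )


def make_oracles(variables, target):
    """Exact oracles for a known target; only usable for very few variables."""
    def mq(interp):
        return satisfies(interp, target)

    def eq(hypothesis):
        for bits in product((0, 1), repeat=len(variables)):
            interp = frozenset(v for v, b in zip(variables, bits) if b)
            if satisfies(interp, hypothesis) != satisfies(interp, target):
                return interp  # a counterexample
        return None  # hypothesis is equivalent to the target

    return mq, eq


def rules_of(s, variables):
    """All candidate rules whose antecedent is exactly the true variables of s."""
    return {(s, z) for z in variables if z not in s} | {(s, None)}


def learn_horn(variables, mq, eq):
    S = []     # sequence of negative examples
    h = set()  # empty hypothesis, satisfied by every interpretation
    while (x := eq(h)) is not None:
        if not satisfies(x, h):
            # positive counterexample: drop the rules of h that x violates
            h = {rule for rule in h if satisfies(x, {rule})}
        else:
            # negative counterexample: refine the first suitable element of S
            for i, s in enumerate(S):
                meet = s & x
                if meet < s and not mq(meet):
                    S[i] = meet
                    break
            else:
                S.append(x)
            h = set().union(*(rules_of(s, variables) for s in S))
    return h


variables = ["a", "b", "c", "d"]
target = {
    (frozenset({"a", "b"}), "c"),
    (frozenset({"c"}), "d"),
    (frozenset({"a", "d"}), None),
}
mq, eq = make_oracles(variables, target)
learned = learn_horn(variables, mq, eq)
print(len(learned), "rules learned")
```

The loop ends only when the equivalence oracle finds no counterexample, so the returned hypothesis agrees with the target on every interpretation over the four variables.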
To simulate the membership oracle MQ_{F,t_N}, we directly use the classifier N. Whenever the algorithm calls MQ_{F,t_N} with a partial interpretation I as input, we check if N(vector(I)) = 1, which means that I |= t_N. If so, we return the answer 'yes' to the algorithm, and 'no' otherwise. Simulating an equivalence oracle EQ_{F,t_N} is less straightforward, since it requires checking whether the hypothesis constructed is equivalent to t_N.
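A membership oracle of this kind can be sketched as a thin wrapper around any classifier. In the Python sketch below, the classifier is assumed to be a callable that maps a numeric vector to a score in [0, 1]; the numeric code used for '?' (here -1.0), the decision threshold, and the function names are assumptions of this example and are not taken from the implementation described here.

```python
UNKNOWN = "?"


def vector(partial, variables, unknown_code=-1.0):
    """Encode a partial interpretation as a list of floats, in a fixed variable order.

    unknown_code is a hypothetical numeric stand-in for '?'.
    """
    return [
        unknown_code if partial[v] == UNKNOWN else float(partial[v])
        for v in variables
    ]


def make_membership_oracle(classifier, variables, threshold=0.5):
    """Wrap a classifier (vector -> score in [0, 1]) as a membership oracle."""
    def membership_query(partial):
        score = classifier(vector(partial, variables))
        return "yes" if score >= threshold else "no"

    return membership_query


# A stand-in classifier that accepts exactly the inputs whose first entry is 1.
def toy_classifier(vec):
    return 1.0 if vec[0] == 1.0 else 0.0


mq = make_membership_oracle(toy_classifier, ["sunny", "happy"])
print(mq({"sunny": 1, "happy": UNKNOWN}))
```

With a trained network, toy_classifier would be replaced by a call to the model's prediction function on a single input vector.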
We simulate EQ_{F,t_N} by generating a set of examples randomly and classifying the examples using membership queries. Then, we can search for examples in this set that the hypothesis constructed by LRN* misclassifies. Depending on the size of the set of randomly generated examples [2, Section 2.4], if the hypothesis does not misclassify any example then one can ensure that, with high probability, the total number of interpretations misclassified (considering the entire space of partial interpretations) is low. More precisely, if the size of the set of randomly generated examples is at least (1/ε) · log_2(|H|/δ) [19], then one can ensure that the hypothesis constructed is probably approximately correct [18]. The parameter ε ∈ (0, 1) indicates the probability that the hypothesis misclassifies an interpretation w.r.t. the target, and δ ∈ (0, 1) is the probability that the learned hypothesis errs more than ε.
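The sample-size bound and the sampling-based equivalence query can be sketched as follows. The bound is computed from log_2|H| rather than |H| itself, since |H| is typically astronomically large. Drawing each variable uniformly from {0, 1, ?} is an assumption of this example, as are the function names; the distribution actually used for sampling is not specified here.

```python
import math
import random

UNKNOWN = "?"


def pac_sample_size(log2_h, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon) * log2(|H| / delta).

    Uses log2(|H| / delta) = log2|H| - log2(delta).
    """
    return math.ceil((log2_h - math.log2(delta)) / epsilon)


def simulate_equivalence_query(hypothesis_label, membership_label, variables,
                               sample_size, rng):
    """Return a sampled example on which the two labellings differ, or None.

    hypothesis_label and membership_label map a partial interpretation
    (dict from variable to 0, 1 or '?') to a boolean label.
    """
    for _ in range(sample_size):
        example = {v: rng.choice((0, 1, UNKNOWN)) for v in variables}
        if hypothesis_label(example) != membership_label(example):
            return example
    return None


print(pac_sample_size(log2_h=8, epsilon=0.5, delta=0.25))
```

Returning None plays the role of the answer 'yes': no counterexample was found in the sample, which does not by itself prove equivalence.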
If H corresponds to the class of formulas only expressible with Horn logic and variables V, then the number of logically different hypotheses in H is close to [1,3]: This number follows from the fact that Horn logic is closed under intersection: if I and I′ satisfy a Horn theory then I ∩ I′ also does [11].

Representing constraints
We explain how we can express the constraints that are going to be extracted in the experimental section. Horn rules r are of the form ant(r) → con(r), for example (sunny ∧ happy) → jogging, where none of the variables in the antecedent or in the consequent is negated. This means that with Horn logic we cannot express rules of the form (¬sunny ∧ happy) → boardgame night.
To express a 'weak' form of negation, we duplicate all the variables in V and treat every new variable as the negation of a variable in V. For example, let v̂_i be the duplicated variable of any v_i ∈ V. We can then express the rule (ŝunny ∧ happy) → boardgame night.
Usually, when duplicating variables in this way, we would like to avoid that both paired variables are true in a partial interpretation (since they represent each other's negation). For this reason, we assume that Horn rules of the form (v ∧ v̂) → ⊥ (Formula 2) always hold, for every v ∈ V.
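The duplication of variables and the accompanying disjointness rules can be sketched in Python. The naming scheme for duplicated variables (a "not_" prefix in place of the hat) and the way a partial interpretation is extended (the paired variable receives the complementary value, and stays unknown when the original is unknown) are assumptions made for this illustration.

```python
UNKNOWN = "?"
BOTTOM = "⊥"


def hat(variable):
    """Name of the duplicated variable that stands for the negation of `variable`."""
    return "not_" + variable


def disjointness_rules(variables):
    """One Horn rule (v ∧ v̂) → ⊥ per variable, as (antecedent, consequent) pairs."""
    return [(frozenset({v, hat(v)}), BOTTOM) for v in variables]


def extend(partial):
    """Extend a partial interpretation over V to the duplicated variables."""
    extended = {}
    for v, value in partial.items():
        extended[v] = value
        extended[hat(v)] = UNKNOWN if value == UNKNOWN else 1 - value
    return extended


print(extend({"sunny": 0, "happy": 1, "windy": UNKNOWN}))
print(disjointness_rules(["sunny", "happy"]))
```

Under this encoding, the rule (ŝunny ∧ happy) → boardgame night becomes an ordinary Horn rule with antecedent {not_sunny, happy}.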

Experiments
In this section we show experimental results using the approach presented in the previous section, where a trained neural network is treated as an oracle for the LRN algorithm. We implemented the algorithm in a Python 3.9 script and we used the SymPy library [13] to express rules and check the satisfiability of formulas. For the neural networks, we used the Keras library [5]. Our LRN implementation can start with an empty hypothesis or with a set of Horn formulas as background knowledge (assumed to be true properties of the domain at hand). The background knowledge can also be used to check if the neural network model respects some desirable properties. We conduct the experiments on an Ubuntu 18.04.5 LTS machine with an i9-7900X CPU at 3.30GHz with 32 logical cores and 32GB RAM. We test our approach of extracting Horn theories from partial interpretations on a dataset in the medical domain [16]. This dataset contains missing values for attributes. We can consider each instance as a partial interpretation that sets some variables (attributes of that instance) to true, some to false, and other variables to "unknown".

HCC Dataset
Hepatocellular carcinoma (HCC) is the most common type of primary liver cancer, and it is a serious concern for global health. The HCC dataset [16] consists of 165 instances of many risk factors and features of real patients diagnosed with this illness.
There are 49 features selected according to the EASL-EORTC (European Association for the Study of the Liver - European Organisation for Research and Treatment of Cancer). Of these features, 26 are quantitative variables and 23 are qualitative variables. Missing values represent 10.22% of the whole dataset and only 8 patients have complete information in all fields (4.85%). The target class of each patient is binary. Each patient is classified positively if they survive 1 year after having been diagnosed with HCC, and negatively otherwise. 63 cases are labelled negatively (the patient dies) and 102 positively (the patient survives). Quantitative variables describe, for example, the oxygen saturation in the human body, the concentration of iron in the blood, or the number of cigarette packs consumed per year. The range of values that each variable can assume varies, but it is specified. Qualitative variables can only have two different values in this dataset (either 0 or 1). Usually they describe categorical information, such as whether the patient comes from an endemic country or whether they are obese.
The LRN* algorithm expects to receive counterexamples in the form of a partial interpretation that specifies the truth values of boolean variables. For this reason, we encode quantitative variables in a binary representation format. The interval of values of each quantitative variable is partitioned into three sub-intervals. These intervals divide the values of the quantitative variable into "low", "middle", and "high" values. For example, the interval of values of the variable that describes the number of cigarette packs consumed by the patient per year is [0, 510], which can be partitioned into [0, 50], (50, 200], and (200, 510]. The binarised dataset has in total 26 · 3 + 23 + 1 = 102 variables and it can be considered a set of partial interpretations. A missing value in the new dataset is denoted with '?', as in the original one; otherwise the value is 1 (0) if the variable is set to true (false). Each partial interpretation I matches a rule (not necessarily Horn) of the form where each l_i is a positive literal if the variable i is set to true in I and a negative literal otherwise. The literal l_k is not present in the rule if l_k has a missing value. As explained in the previous section, by duplicating the number of variables and pairing them such that one represents the negation of another variable, we can express the previous rule with a Horn formula. For this reason, we further modify the dataset by duplicating variables. Each new variable semantically represents the negated concept of its paired variable. So, we form a dataset D of partial interpretations with 204 variables.
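The three-interval encoding of a quantitative variable can be sketched as follows, using the cigarette example with cut points 50 and 200. The variable naming scheme (low_, medium_, high_ prefixes), the use of None for a missing measurement, and the choice to mark all three derived variables as '?' when the measurement is missing are assumptions of this example.

```python
UNKNOWN = "?"


def binarise(value, cuts, name):
    """Encode one quantitative value as three boolean variables.

    cuts = (low_cut, high_cut) splits the range into
    [min, low_cut], (low_cut, high_cut], (high_cut, max].
    A missing value (None) makes all three derived variables unknown.
    """
    low_cut, high_cut = cuts
    labels = (f"low_{name}", f"medium_{name}", f"high_{name}")
    if value is None:
        return {label: UNKNOWN for label in labels}
    index = 0 if value <= low_cut else 1 if value <= high_cut else 2
    return {label: int(i == index) for i, label in enumerate(labels)}


print(binarise(120, (50, 200), "packs"))
print(binarise(None, (50, 200), "packs"))

# Variable count of the binarised dataset: 26 quantitative variables with
# three intervals each, 23 qualitative variables, and one further variable.
print(26 * 3 + 23 + 1)
```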
We can express each example in D with Horn rules like in Formula 3. We denote by T the set of such rules that can be formed by looking at all partial interpretations in the extended dataset. To express disjointness constraints between paired variables, we assume T to also have the additional Horn rules of the form (v_i ∧ v̂_i) → ⊥ (Formula 2).
Finally, the dataset used for training the neural network is formed by randomly generating partial interpretations (with 204 variables) whose classification label is 0 if they do not satisfy a rule in T , 1 otherwise.

Model selection
By only randomly generating partial interpretations (with 204 variables), we can create a very unbalanced dataset with most partial interpretations classified as positive by the target Horn theory T (note: T is defined in the previous subsection).
Table 1: Architecture and learning rate of the top four neural networks in ascending order with respect to accuracy. The model in the first row was the selected one.
Table 2:
#Equiv.    t h     t nn    h nn    t tree
100        9.2%    6.0%    5.8%    8.4%
80% of the (balanced) binarised dataset was used for training and validation. We used 3-fold cross-validation. As T is a Horn theory, there is no noisy data generated in this process.
We built a sequential neural network model, where the number of nodes in the input layer is 204, which is the number of variables in a partial interpretation. We used the library "Keras version 2.4.3" [5] and we empirically searched for the sequential architecture with the best performance by varying the number of hidden layers, the number of nodes in the hidden layers, and the learning rate. We searched over the following hyperparameters: 2, 3, 4, or 5 hidden layers; 4, 8, 16, or 32 nodes per layer; and 0.001, 0.01, or 0.1 as the learning rate. The model with the best performance had 5 hidden layers with 32, 16, 8, 16, and 32 nodes, and a learning rate of 0.1. In total we tested (no. of learning rates × no. of node-layer combinations) = 3 · (4^2 + 4^3 + 4^4 + 4^5) = 4080 configurations. This means that we carried out in total 3 · 4080 = 12240 training and evaluation runs. The best performing architectures are shown in Table 1.
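The size of this search space can be checked by enumerating the grid directly. The sketch below only counts configurations and does not train any model; the variable names are chosen for this example.

```python
from itertools import product

learning_rates = (0.001, 0.01, 0.1)
node_options = (4, 8, 16, 32)
depths = (2, 3, 4, 5)
folds = 3

# One configuration = a learning rate plus a tuple of nodes per hidden layer.
configurations = [
    (rate, nodes)
    for rate in learning_rates
    for depth in depths
    for nodes in product(node_options, repeat=depth)
]

total_runs = folds * len(configurations)
print(len(configurations), total_runs)
```

The enumeration reproduces the two totals given in the text, and the selected architecture (32, 16, 8, 16, 32) with learning rate 0.1 is one element of the grid.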

Test Setting
In our experiments, we run the LRN* algorithm with a limit of 100 equivalence queries that the algorithm can ask before terminating with the hypothesis built so far as its output. To simulate an equivalence query, we randomly generate a sample of partial interpretations and we classify each interpretation using the neural network. Afterwards, we search the sample for a counterexample for the hypothesis h, which is returned as the answer to the query.
We compare the quality of the LRN* hypothesis with the hypothesis formed by an incremental decision tree [7], an established white-box machine learning model. We use the "Hoeffding Decision Tree" implementation present in the "skmultiflow" framework [14]. It is possible to generate a set of propositional rules by visiting every branch of the tree from the root to the leaves labelled negatively. The sampling idea for finding negative counterexamples for LRN* is also used for extracting a decision tree from the neural network.
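The idea of reading rules off the branches that end in negatively labelled leaves can be sketched on a toy tree. The tree representation used here (an integer label for a leaf, and a triple (variable, subtree_if_0, subtree_if_1) for an internal node) is an assumption of this example and is not the internal structure of the skmultiflow implementation.

```python
def negative_branches(tree, path=()):
    """Collect every root-to-leaf path that ends in a leaf labelled 0.

    Each path is a list of (variable, value) tests; their conjunction
    describes inputs that the tree classifies negatively.
    """
    if isinstance(tree, int):
        return [list(path)] if tree == 0 else []
    variable, if_false, if_true = tree
    return (
        negative_branches(if_false, path + ((variable, 0),))
        + negative_branches(if_true, path + ((variable, 1),))
    )


# Toy tree: negative only when both `obese` and `smoker` are true.
toy_tree = ("obese", 1, ("smoker", 1, 0))
print(negative_branches(toy_tree))
```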
We generate partial interpretations randomly and they are classified by the neural network. We check if at least one of those classified partial interpretations is misclassified by the decision tree algorithm. If this is the case, we incrementally train the tree with the entire sample. This process is repeated until all classified interpretations in the sample are correctly classified by the tree.
Since the considered number of variables is 204, it is not feasible to have the size of the sample for simulating equivalence queries as dictated by Formula 1 (this number is of the order 2^204). Moreover, a sample that is too small often fails to contain a counterexample. When this is the case, the LRN* algorithm will terminate and output a hypothesis with few (if any) rules, and the tree will consist of only one node with label 1. This problem is especially noticeable in our current scenario as there are many variables but relatively few interpretations that are negatively labelled by the neural network. The selected size of the sample is therefore , both for training the decision tree and for answering queries asked by the LRN* algorithm.
When the LRN* hypothesis and the tree have been extracted, we compute a partial truth table of 204 variables of size 2s. We classify these interpretations according to the target T, the neural network, the LRN* hypothesis and the decision tree. We then compare the truth tables and count the number of times an interpretation is classified differently between the different models. Table 2 shows the outcome of our experiment. The columns t h, t nn, h nn, t tree are, respectively, the percentage of interpretations that are labelled differently between the target and the hypothesis, the target and the neural network, the hypothesis and the neural network, and the tree and the target. The running time of the LRN* algorithm with at most 100 equivalence queries was around 60 hours. Extracting an incremental decision tree took twice as long, around 120 hours.
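The pairwise comparison of truth tables amounts to counting disagreements over a shared set of interpretations. A minimal Python sketch, in which each model is assumed to be a callable that maps an interpretation to a label and the function name is chosen for this example:

```python
def disagreement_percentage(model_a, model_b, interpretations):
    """Percentage of interpretations that the two models label differently."""
    differing = sum(1 for i in interpretations if model_a(i) != model_b(i))
    return 100.0 * differing / len(interpretations)


# Toy models over interpretations given as dicts.
def target(i):
    return i["a"] == 1


def hypothesis(i):
    return i["a"] == 1 or i["b"] == 1


sample = [
    {"a": 0, "b": 0},
    {"a": 0, "b": 1},
    {"a": 1, "b": 0},
    {"a": 1, "b": 1},
]
print(disagreement_percentage(target, hypothesis, sample))
```

The same function can be applied to each pair of models to fill one row of a table with the columns t h, t nn, h nn and t tree.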

Results
The rules that the LRN* algorithm extracted are of the form {medium hemoglobin ∧ · · · ∧ ôbese → survives}, with around 40 different variables in the antecedent. With 100 equivalence queries, the hypothesis extracted has 20 rules of this type that are also present in the target T. Other rules that are entailed by T can be found in the hypothesis. Examples labelled negatively with many missing values contain more information about the dependency between variables that must be respected. Indeed, we noticed an increase in the accuracy of the neural network when it was trained on examples with more missing values while ensuring balanced classes. As a consequence, the quality of the extracted rules also improves.

Conclusion
In this work we presented an approach for extracting Horn rules from neural network models using partial interpretations. It is often the case that not all values in a dataset are known or trustworthy. Our method based on partial interpretations covers such scenarios and generalizes the case with (full) interpretations. We tested our approach empirically using a real-world dataset in the medical domain.