RIDDLE: Rule Induction with Deep Learning

Numerous applications rely on the efficiency of Deep Learning models to address complex classification tasks for critical decision-making. However, we may not know how each feature in these models contributes towards the prediction. In contrast, Rule Induction algorithms provide an interpretable way to extract patterns from data, but traditional approaches suffer in terms of scalability. In this work, we bridge Deep Learning and Rule Induction and define the RIDDLE (Rule Induction with Deep Learning) architecture. We show that RIDDLE has state-of-the-art performance in Rule Induction via an empirical evaluation.


Introduction
The adoption of Rule Induction algorithms supports decisions in medicine [30,27], fault and fraud detection [33,7], and brings benefits to chemical, oil [4], and energy industries [32], among others. These algorithms express patterns found in data in the form of associative ('if-then') rules, effectively aiding users in decision-making [5,6,31,22].
Rule Induction approaches have to solve hard combinatorial problems, as the symbols available for constructing the rules form a discrete space [5,6,20]. Hence, their scalability suffers when compared to classification methods that can rely on techniques tailored for the optimisation of differentiable functions, such as Deep Learning with gradient-based optimisation [15,22]. Indeed, Deep Learning approaches have excelled in many tasks, including image segmentation [24] and text generation [16]. The success of Deep Learning approaches has two well-known reasons: the flexibility to handle many different forms of data and scalability powered by technological advances. Yet, the lack of interpretability of Deep Learning models limits their application in systems that directly or indirectly influence critical decisions [6,29].
Our main contribution is an original Deep Learning architecture dubbed RIDDLE (Rule Induction with Deep Learning), which learns rules using a differentiable error function. This means that our approach can employ efficient optimisation methods while providing interpretable rules. We show with formal arguments the reasons that led to the development of the RIDDLE architecture. Since it is based on the formal framework of Possibility Theory, RIDDLE builds rules equipped with certainty degrees, which express the reliability of each rule. This has two main advantages over most traditional Rule Induction methods: (1) the rules avoid having 'sharp' decision boundaries and (2) the order of the rules is irrelevant, that is, RIDDLE yields rule sets instead of lists [20]. Furthermore, we show that our method has state-of-the-art performance (accuracy) on well-known datasets, especially those with uncertain or missing information.

Related work.
The most prominent Rule Induction algorithms belong to the decision tree family, such as RIPPER [5], which outputs binary (or crisp) rules. FURIA is an extension that generates fuzzy rules [20]. To solve the issue of scalability of Rule Induction systems in Big Data settings, Elkano et al. [15] proposed CFM-BD, a distributed system for fuzzy Rule Induction using a MapReduce paradigm. While CFM-BD has shown promising results, it still solves a search problem on a large discrete space.
We can also find works on neuro-symbolic approaches for Rule Induction. DR-Net [28] employs a simple 2-layer neural network architecture to learn rule sets. DR-Net also controls the complexity of the rules learned via a sparsity term. In contrast, RIDDLE automatically prioritises simpler rules over complex ones as a consequence of weight decay. Kusters et al. [22] define a 3-layer neural network architecture for Rule Induction, named R2N, that can also identify potential new terms. R2N integrates neural networks and Rule Induction with a differentiable function, but it can only learn positive DNF, a restricted class of rules where variables cannot be negated [1]. RIDDLE instead can learn any propositional formula, while also providing a measure of the 'reliability' of each rule.
Glanois et al. [18] propose HRI, a hierarchical approach to Rule Induction designed for Inductive Logic Programming (ILP) [25]. The language of rules differs considerably from propositional rules. Their method also relies on pre-defined rule templates that determine the types of rules that can be learned, while we do not impose such restrictions with our method.
In Section 2, we present the theoretical foundations of our contribution. Then, we introduce RIDDLE in Section 3. In Section 4, we provide an empirical comparison between RIDDLE and FURIA, a prominent fuzzy Rule Induction algorithm. We conclude and mention future steps in Section 5.
Propositional Logic. Let V be a finite set of boolean variables. A literal over V, denoted with the symbol l, is either a variable v ∈ V or its negation, in symbols, ¬v. The former is also called a positive literal, the latter a negative literal. A clause is a disjunction (∨) of literals. A formula is a conjunction (∧) or set of clauses. A clause c can also be expressed as a rule r of the form ant(r) → con(r), where ant(r) (the antecedent of r) is the conjunction of the negations of all but one of the literals in c and con(r) (the consequent of r) is the remaining literal. Example 1 illustrates these concepts.
A (partial) interpretation I over V is a function I : V → {⊤, ⊥, ?} that states which variables are regarded as 'true', 'false', and 'unknown'. I falsifies a variable v ∈ V if I(v) = ⊥, otherwise it satisfies it. I satisfies a negative literal ¬v iff I(v) is equal to '?' or '⊥'. I satisfies a clause c if it satisfies at least one literal in c. Intuitively, I satisfies a clause if there is a way of replacing 'unknown' values (?) with 'true' (⊤) or 'false' (⊥) such that the clause is satisfied. Also, I satisfies a formula φ, in symbols I ⊨ φ, if every clause in φ is satisfied by I. We write I ⊭ φ instead, if I does not satisfy φ. We clarify these notions with the following example.
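The satisfaction relation for partial interpretations can be sketched in a few lines of Python. This is only an illustration of the definitions above, not part of RIDDLE's implementation: the encoding of literals as `(variable, is_positive)` pairs, the use of `None` for 'unknown', and all function names are assumptions made for this sketch.

```python
from typing import Dict, List, Optional, Tuple

Literal = Tuple[str, bool]                  # (variable, is_positive), hypothetical encoding
Interpretation = Dict[str, Optional[bool]]  # True / False / None for true / false / unknown


def satisfies_literal(interp: Interpretation, lit: Literal) -> bool:
    var, positive = lit
    value = interp[var]
    if positive:
        return value is not False  # v is satisfied unless I(v) is false
    return value is not True       # not-v is satisfied iff I(v) is unknown or false


def satisfies_clause(interp: Interpretation, clause: List[Literal]) -> bool:
    # A clause needs at least one satisfied literal (the empty clause is never satisfied).
    return any(satisfies_literal(interp, lit) for lit in clause)


def satisfies_formula(interp: Interpretation, formula: List[List[Literal]]) -> bool:
    # A formula needs every clause satisfied (the empty formula is always satisfied).
    return all(satisfies_clause(interp, clause) for clause in formula)


interp = {"a": True, "b": None, "c": False}
clause_1 = [("a", False), ("b", True)]  # not-a or b: satisfied because b is unknown
clause_2 = [("c", True)]                # c: falsified because I(c) is false
```

Note that an unknown variable satisfies both the positive and the negative literal over it, which matches the intuition that 'unknown' can still be completed either way.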
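For conciseness, we abuse the notation and write ⊥ and ⊤ when referring to the empty disjunction and the empty conjunction, respectively.
Possibility Theory. A possibilistic clause over V is a pair (φ, α), where φ is a clause over V and α is a real number with finite precision in the interval (0, 1], called the valuation of φ. A possibilistic formula K is a conjunction of possibilistic clauses. Given K, and a set Ω of interpretations over V, we define a possibility distribution $\pi_K : \Omega \to [0, 1]$ as
\[ \pi_K(I) := \begin{cases} 1 & \text{if } I \models \varphi \text{ for every } (\varphi, \alpha) \in K, \\ 1 - \max\{\alpha \mid (\varphi, \alpha) \in K,\ I \not\models \varphi\} & \text{otherwise.} \end{cases} \]
The possibility degree $\Pi_K(\varphi)$ of φ indicates how much φ is coherent with $\pi_K$, and $N_K(\varphi)$ expresses the necessity degree of φ being implied by $\pi_K$. They are defined as follows:
\[ \Pi_K(\varphi) := \max\{\pi_K(I) \mid I \in \Omega,\ I \models \varphi\}, \qquad N_K(\varphi) := 1 - \Pi_K(\neg\varphi). \]
A possibility distribution π satisfies a possibilistic clause (φ, α) if the necessity degree of φ under π is at least α, and it satisfies a possibilistic formula K if it satisfies each (φ, α) ∈ K. We have that (φ, α) is entailed by K, written K ⊨ (φ, α), if all possibility distributions that satisfy K also satisfy (φ, α). We recall key properties of this theory [14].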
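To make the possibility and necessity degrees concrete, the sketch below enumerates all total interpretations of a tiny knowledge base. It assumes the standard semantics of possibilistic logic (an interpretation that violates clauses of K receives possibility 1 minus the largest violated valuation, and necessity is 1 minus the possibility of the negation). The data layout and function names are hypothetical, and brute-force enumeration is used only for illustration.

```python
from itertools import product


def holds(world, clause):
    # clause: list of (variable, is_positive); world: dict variable -> bool
    return any(world[var] == positive for var, positive in clause)


def pi_K(world, K):
    # K: list of (clause, valuation) pairs
    violated = [alpha for clause, alpha in K if not holds(world, clause)]
    return 1.0 if not violated else 1.0 - max(violated)


def possibility(clause, K, variables):
    worlds = (dict(zip(variables, bits))
              for bits in product([False, True], repeat=len(variables)))
    return max((pi_K(w, K) for w in worlds if holds(w, clause)), default=0.0)


def necessity_of_literal(var, positive, K, variables):
    # N(l) = 1 - Pi(not l)
    return 1.0 - possibility([(var, not positive)], K, variables)


variables = ["a", "t"]
rule = ([("a", False), ("t", True)], 0.8)  # the rule a -> t with valuation 0.8
fact = ([("a", True)], 1.0)                # a is certainly true

n_rule_only = necessity_of_literal("t", True, [rule], variables)
n_with_fact = necessity_of_literal("t", True, [rule, fact], variables)
```

With the rule alone, nothing is known about `a`, so `t` has necessity 0; once `a` is asserted with full certainty, the necessity of `t` rises to the valuation of the rule.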

Neural Networks.
We assume an arbitrary but fixed ordering of the variables in $V$: $(v_1, \ldots, v_n, t_1, \ldots, t_m)$. We will predict the certainty degree of the variables $t_j$ with the information provided by the variables $v_i$. In this work, a neural network model is a function $g : [0, 1]^{2n} \to [0, 1]^m$. It takes as input a vector denoting the possibilities of each literal and it outputs the possibilities of each $t_j$. The function $g$ contains parameters to be optimised by iterative updates of the backpropagation algorithm. To take advantage of the optimisation steps, we employ the function log-sum-exp, $\mathrm{LSE}_\alpha(x_1, \ldots, x_n) := \frac{1}{\alpha} \log \sum_{i=1}^{n} e^{\alpha x_i}$, as a smooth approximation of the max (min) function when $\alpha \to \infty$ ($\alpha \to -\infty$). We write $\mathrm{LSE}_{\max}$ for $\mathrm{LSE}_{30}$ and $\mathrm{LSE}_{\min}$ for $\mathrm{LSE}_{-30}$.
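As a quick numerical check of the smooth max/min approximation, the following sketch evaluates the scaled log-sum-exp, assumed here to be (1/α) · log Σ exp(α · x_i), for α = 30 and α = −30. The function name `lse` and the max-shift used for numerical stability are choices made for this illustration.

```python
import math


def lse(values, alpha):
    """Scaled log-sum-exp: (1 / alpha) * log(sum(exp(alpha * x)))."""
    scaled = [alpha * v for v in values]
    shift = max(scaled)  # subtracting the largest exponent avoids overflow
    return (shift + math.log(sum(math.exp(s - shift) for s in scaled))) / alpha


x = [0.2, 0.9, 0.5]
lse_max = lse(x, 30.0)   # close to max(x) = 0.9, slightly above it
lse_min = lse(x, -30.0)  # close to min(x) = 0.2, slightly below it
```

The approximation error is at most log(n)/|α| for n inputs, and it is largest when several inputs are tied: for two equal inputs and α = 30 the result overshoots the true maximum by log(2)/30 ≈ 0.023.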

Introducing RIDDLE
In this section, we describe the theoretical motivations that led to the development of RIDDLE. Our goal is to predict $\Pi(\neg t)$ (or $\Pi(t)$), of a target variable $t$, from the possibility degrees of the literals $V := \{v_1, \neg v_1, \ldots, v_n, \neg v_n\}$. If needed, we can compute how necessary the target variable is according to our input with $N(t) = 1 - \Pi(\neg t)$ (or $N(\neg t)$).
We assume the general setting in which we have a dataset of (partial) interpretations $\mathcal{I} := \{I_1, \ldots, I_d\}$ where some rules of the form $\varphi \to t$ generally hold. We can convert the statements in $I \in \mathcal{I}$ to possibilistic degrees via the method proposed by Joslyn [21] to estimate possibilities from imprecise data. We first define the set The number $j$ in $x^i_j$ corresponds to the value at the $j$-th position. We denote the formula associated to $x^i$ w.r.t. $I_i \in \mathcal{I}$ by $x^1 = (1, 0, 1, 0, 1, 0)$, and
Theoretical Motivation. We assume that the unknown formula holds in the dataset of possibility degrees $D$, and that $N_K(t) = 0$.
Proof. By Lemma 1 and definition of $K$, we have n , it holds that $(1 - \max(x)) = \min(1 - x)$, hence by the relation between possibility and necessity:
For convenience, we denote by $\psi_i := l_{i,1} \vee \ldots \vee l_{i,s_i}$ the clause $\neg\varphi_i$ for any rule $(\varphi_i \to t, \alpha_i) \in K$. By Lemma 2 and P1, for any formula $X$ we can compute $\Pi_{K \cup X}(\neg t)$ with
By definition of $K$, $\Pi_{K \cup X}(\psi_i) = \Pi_X(\psi_i)$, so we can propagate the known uncertainty of the input $x \in D$ to obtain the certainty degrees of the unknown target variable $t$ with min-max operations. In practice, we do not know which rules $\varphi_i \to t$ hold in $K$, nor their necessity degrees $\alpha_i$. However, such rules constrain every possibility degree in $D$, which we can use to induce $\varphi_i$ and $\alpha_i$, with $1 \le i \le k$. We now describe RIDDLE, a novel neural network architecture for Rule Induction that leverages the uncertainty propagation properties of Possibility Theory.
Architecture. We can alternatively compute $\Pi_{K \cup X}(\psi_i)$ as a parametrised combination of product and maximum operators: where for odd (even) $1 \le t \le n$, $w^{\psi_i}_t \in [0, 1]$ selects to what degree $v_t$ ($\neg v_t$) appears in $\psi_i$. In matrix notation with input $x \in D$, this operation becomes where denotes the matrix dot product with the sum replaced by the $\mathrm{LSE}_{\max}$ operator. Lemma 3 states that for any clause $\psi$ and input $x$, we can find
Lemma 3. For any clause $\psi$ and formula $X$ over $V$, there is
By P1, we have $\Pi_X(\psi) = \max(\Pi_X(l_1), \ldots, \Pi_X(l_s))$. We can assign for odd $1 \le t \le n$, the value $w^{\psi_i}_t = 1$ ($w^{\psi_i}_{2t} = 1$) if $v_t$ ($\neg v_t$) appears as a top-level literal in the disjunct $\psi_i$, otherwise the value 0. By definition, we get $\max(w^{\psi}_1 \Pi_X(l_1), \ldots,$ As a consequence, we can approximate the computation of $\Pi_{K \cup X}(\neg t)$ in Eq. (1) with The vector $\beta \in [0, 1]^k$ is the parameter that approximates $1 - \alpha_i$. Therefore, the rule induction problem of finding rules in $K$ is reduced to selecting the right value of each parameter $w^{\psi_i}$ and $\beta_i$.
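The weighted max-product computation described above can be illustrated with a small forward pass. The sketch rests on several assumptions that are our reading of the text rather than the authors' code: the input is laid out as (Π(v_1), Π(¬v_1), Π(v_2), Π(¬v_2), ...), each clause ψ_i is scored as the maximum of weight times input, and the rules are combined as a minimum over i of max(β_i, f_{ψ_i}(x)). Hard max and min are used in place of the smooth log-sum-exp operators to keep the arithmetic exact, and the function names are hypothetical.

```python
def clause_possibility(x, w):
    """Weighted max-product: max over positions of w_t * x_t."""
    return max(wt * xt for wt, xt in zip(w, x))


def riddle_like_output(x, W, beta):
    """Approximate possibility of not-t: min over rules of max(beta_i, f_i(x))."""
    return min(max(b, clause_possibility(x, w)) for w, b in zip(W, beta))


# One rule, v1 and v2 -> t, with necessity 0.9.
# Its clause psi = not-v1 or not-v2 selects positions 2 and 4 of the input.
W = [[0.0, 1.0, 0.0, 1.0]]
beta = [0.1]  # plays the role of 1 - alpha

x_both_true = [1.0, 0.0, 1.0, 0.0]   # v1 true, v2 true
x_v2_false = [1.0, 0.0, 0.0, 1.0]    # v1 true, v2 false
x_v2_unknown = [1.0, 0.0, 1.0, 1.0]  # v1 true, v2 unknown

necessity_t = 1.0 - riddle_like_output(x_both_true, W, beta)
```

When the antecedent is certainly true, the necessity of the target equals the valuation of the rule; when `v2` is false or unknown the rule cannot fire and the necessity of `t` drops to 0.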
We can improve this method by exploiting the associative property of the $\mathrm{LSE}_{\max}$ operator and compute $f_\psi(x)$ as $\mathrm{LSE}_{\max}(f_{\Psi_1}(x), \ldots, f_{\Psi_l}(x))$, where each $\Psi_j$ is a subformula of $\psi$ (Example 4). Moreover, some rule antecedents $\varphi_i$, $\varphi_j$ (with $i \neq j$) in $K$ may share subformulas, so we can decrease the number of parameters by stratifying each $f_{\psi_i}(x)$.

Example 4. Given $\psi$ 8 , we can compute $f_{v_1 \vee \neg v_2}(x) = x\, w^{v_1 \vee \neg v_2}$, $f_{v_3}(x) = x\, w^{v_3}$, and $f_{v_4}(x) = x\, w^{v_4}$ at first and then compute
For $l \ge 1$, let $\mathrm{HL}^l_i : [0, 1]^{l_h} \to [0, 1]$ be the layer that takes as input $l_h$ arguments and for $x \in D$ computes $f_{\psi_i}(x)$ as follows where each $1 \le s \le l$, $w^t_{i,j} \in [0, 1]$. In other words, for $1 \le s \le l$ and $1 \le j \le l_h$, the function $\mathrm{HL}^s_j(x)$ computes $\Pi_X(\Psi)$ for a subformula $\Psi$ of $\psi_i$, in the same way $f_{\psi_i}(x)$ computes $\Pi_X(\psi_i)$. Each layer , is a function with $u, v \ge 1$ freely chosen (hyperparameters) that obey the constraint posed by the standard matrix dot product. Finally, we can define RIDDLE($x$) as
Proof. For all literals $l$, by definition $N_K(l) = 0$. Hence, for any antecedent $\psi$, $\Pi_{K \cup X}(\psi_i) = \Pi_X(\psi)$. By Lemma 3 and associativity of max, we can set the values of the parameters in $\mathrm{HL}^l_i(x)$ so that it computes $\Pi_X(\varphi_i)$. By Eqs. (2) and (3), we get that RIDDLE($x$) computes $\Pi_{K \cup X}(\neg t)$ as in Eq. (1).
Multi-class tasks are modelled with many output nodes or with multiple RIDDLE instances in parallel. Theorem 1 shows the generality of our approach, but it relies on Lemma 3, which requires parameters in [0, 1]. Thus, after updating the parameters with SGD [3], we replace negative values with 0 and values greater than 1 with 1.
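The update-then-clip step can be sketched as a projected gradient step. The snippet is a plain-Python illustration with made-up parameter and gradient values; the function name and the learning rate are assumptions, and it is not taken from the RIDDLE code base.

```python
def projected_sgd_step(params, grads, lr):
    """One SGD update followed by clipping every parameter back into [0, 1]."""
    updated = [p - lr * g for p, g in zip(params, grads)]
    return [min(1.0, max(0.0, p)) for p in updated]


params = [0.05, 0.50, 0.98]
grads = [1.0, -0.2, -1.0]  # illustrative gradients
new_params = projected_sgd_step(params, grads, lr=0.1)
```

Here the first parameter would become negative and is set to 0, the third would exceed 1 and is set to 1, and the second is left at its updated value. In PyTorch the same projection is commonly written as an in-place clamp on the parameter tensors after the optimiser step.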
Rule Extraction and Injection. When the optimisation procedure terminates, we can track the literals that are used to discriminate the target possibility value by inspecting the value of each parameter. Indeed, every parameter lies in the interval [0, 1] and, by the semantics given to their values, we can apply the argument in Lemma 3 to extract the literals included in the clause of each layer $\mathrm{HL}^l_i$. We observed that the parameters always collapse to either 0 or 1 after a sufficient number of updates (Section 4). Also, the introduction of hidden layers can be considered a way of having predicate invention as in ILP settings [25].
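Reading a rule off the trained parameters can be sketched as follows for a single clause layer. The sketch assumes that the weights have already collapsed to values near 0 or 1, that the literals are named in input order, and that the extracted clause ψ is the negation of the antecedent φ, so φ is the conjunction of the negated literals of ψ. The 0.5 threshold, the literal naming scheme, and the function names are hypothetical.

```python
def negate(literal):
    """Flip a literal written as 'v' or '¬v'."""
    return literal[1:] if literal.startswith("¬") else "¬" + literal


def extract_rule(weights, beta, literal_names, target, threshold=0.5):
    """Return (antecedent literals, target, necessity) for one clause layer."""
    clause = [name for w, name in zip(weights, literal_names) if w >= threshold]
    antecedent = [negate(lit) for lit in clause]  # phi is the negation of psi
    return antecedent, target, 1.0 - beta


names = ["v1", "¬v1", "v2", "¬v2"]
antecedent, target, necessity = extract_rule([0.0, 1.0, 0.0, 1.0], 0.1, names, "t")
```

For the weights above, the selected clause is ¬v1 ∨ ¬v2, which corresponds to the rule v1 ∧ v2 → t with necessity 0.9.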
We can also manually inject rules of the form $\varphi \to t$ into a RIDDLE instance before or after training. Due to Lemma 3, we just need to append to the operation $\mathrm{LSE}_{\min}$ in Eq. (3) a layer $\mathrm{HL} : [0, 1]^{2n} \to [0, 1]$, corresponding to the function $f_{\neg\varphi}$.

Experimental Results
We implemented the RIDDLE model in Python 3.9, fully integrated in the PyTorch [26] ecosystem.
The implementation is freely available at the following link: https://git.app.uib.no/Cosimo.Persia/riddle.
The gradients of the model parameters are computed with PyTorch's automatic differentiation package and, after each update, the parameters are 'clamped' to the range [0, 1] to preserve the correctness of the model. We conduct the experiments on an Ubuntu 18.04.5 LTS server with an i9-7900X CPU at 3.30GHz, 32 physical cores, 8 NVIDIA A100 GPUs with 80GB, and 32GB RAM.
Test settings. Often, the features in the considered datasets include a mixture of nominal, continuous, and integer fields. Using feature discretization, we divide continuous or integer values into 8 bins such that all bins for each feature have the same number of points. Each bin is associated with a new variable that is set to 'true' if the value of the original feature belongs to the respective bin. Missing values assign the value 'unknown' to all related new binarised variables. In this way, we can generate a set of interpretations $\mathcal{D} := \{I_1, \ldots, I_d\}$. From $\mathcal{D}$, we can derive the dataset $D := \{x^i \mid I_i \in \mathcal{D}\}$ as explained in Section 3. $D$ expresses the possibility values of variables and their negation for each interpretation in $\mathcal{D}$. The first column in Table 1 shows the datasets that we considered for the benchmark. These are freely available at the UCI machine learning repository [8]. Briefly, with 'breast cancer', 'hepatitis', 'horse', 'hypothyroid', 'lymphography', and 'primary tumor', the model should predict the type of disease or the patient's survival. With 'auto', 'credit', 'chess', 'glass', 'mushroom', and 'wine', the model should classify the specific type of object under scrutiny; for example, whether a mushroom is edible. More details about these can be found at the UCI website (http://archive.ics.uci.edu/ml). Most datasets have a substantial amount of missing values.
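The preprocessing described above can be sketched for a single numeric feature as follows. This is a minimal illustration under stated assumptions: the quantile-style choice of cut points, the handling of missing values as `None`, the encoding of 'unknown' as possibility 1 for both a bin variable and its negation, and all function names are ours, not taken from the RIDDLE code base. Ties and features with fewer distinct values than bins are not handled.

```python
import bisect


def equal_frequency_edges(values, n_bins=8):
    """Cut points that split the known values into n_bins equally sized groups."""
    known = sorted(v for v in values if v is not None)
    return [known[(len(known) * k) // n_bins] for k in range(1, n_bins)]


def encode(value, edges):
    """Possibility degrees (bin_1, not bin_1, ..., bin_k, not bin_k) for one value."""
    n_bins = len(edges) + 1
    if value is None:
        # Missing value: every bin variable is unknown, so both literals are possible.
        return [1.0, 1.0] * n_bins
    hit = bisect.bisect_right(edges, value)
    degrees = []
    for b in range(n_bins):
        degrees += [1.0, 0.0] if b == hit else [0.0, 1.0]
    return degrees


values = list(range(1, 17)) + [None]  # 16 known points and one missing value
edges = equal_frequency_edges(values)
x_known = encode(4, edges)
x_missing = encode(None, edges)
```

With 16 known points and 8 bins, each bin receives two points; the value 4 falls in the second bin, so only that bin variable gets possibility 1 for its positive literal.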
Evaluation. RIDDLE minimises the MSE (mean squared error) during training (see column 'MSE' in Table 1). Compared with arbitrary linear/ReLU feedforward deep network architectures, RIDDLE performs slightly better (in addition to being explainable). Therefore, we will focus our comparison on the accuracy of the FURIA algorithm [20], which represents the state-of-the-art in propositional Rule Induction with confidence values. We use the FURIA implementation available in Weka [17], and compare it with RIDDLE on classification tasks using the standard definition of accuracy. To use the trained RIDDLE model for classification, we look at the RIDDLE output $(\Pi(\neg t_1), \ldots, \Pi(\neg t_m))$: if $\Pi(\neg t_i) \le 0.4$, then the variable $t_i$ is preferred over its negation and we assume that $t_i$ is predicted to be true (recall that $N(t) = 1 - \Pi(\neg t)$). We carried out our tests on the same benchmark datasets used by the aforementioned rule induction system.
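The decision rule used for classification amounts to a simple threshold on the model output. The sketch below applies it to a made-up output vector; the function names are hypothetical and the values are illustrative only.

```python
def predict_targets(poss_not_t, threshold=0.4):
    """t_i is predicted true when the possibility of not-t_i is at most the threshold."""
    return [p <= threshold for p in poss_not_t]


def necessity_of_targets(poss_not_t):
    """N(t_i) = 1 - Pi(not-t_i)."""
    return [1.0 - p for p in poss_not_t]


output = [0.1, 0.4, 0.75]  # illustrative possibilities of not-t_1, not-t_2, not-t_3
predictions = predict_targets(output)
necessities = necessity_of_targets(output)
```

With the threshold at 0.4, a target is predicted true exactly when its necessity degree is at least 0.6.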
Model selection. We split each dataset into 70%, 10%, and 20% for training, validation, and test, respectively. Finding the best combination of hyperparameters (number of layers, nodes per layer, learning rate, etc.) can be a time-consuming task. However, we noticed that, in general, RIDDLE performs well even with small networks and extremely well with deeper configurations. We adjust the hyperparameters in a grid-search fashion using Tune [23]. The number of hidden layers varies from 1 to 10, the number of nodes per layer from 2 to 500, the batch size from 8 to 64, and the learning rate from 0.1 to 0.001. The final models' sizes correlate positively with the number of variables given as input and with the number of rules that hold in the dataset. On average, the resulting network has 6 layers with 50 nodes. Each model is trained with a batch size of 16 over 100 epochs with a learning rate of 0.01, and with early stopping. That is, we stop the training routine if the validation loss has not decreased by more than 0.01 for 10 subsequent epochs. Moreover, we fixed a weight decay factor of 0.001. This means that the constant value 0.001 is added to the gradients used for updating the parameters.
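The early-stopping criterion can be sketched as below. One detail is an assumption of this sketch: improvement is measured against the best validation loss seen so far, which is one common reading of "has not decreased by more than 0.01 for 10 subsequent epochs". The function name and the loss values are illustrative.

```python
def early_stopping_epoch(val_losses, min_delta=0.01, patience=10):
    """Return the epoch index at which training stops, or None if it never stops."""
    best = float("inf")
    stale_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if best - loss > min_delta:
            best = loss       # sufficient improvement: reset the counter
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                return epoch
    return None


plateau = [1.0, 0.5, 0.3] + [0.295] * 12  # improvement stalls after epoch 2
improving = [1.0, 0.9, 0.8, 0.7, 0.6]     # keeps improving by 0.1 per epoch
stop_at = early_stopping_epoch(plateau)
```

In the plateau example the loss stops improving by more than 0.01 after epoch 2, so the tenth stale epoch is epoch 12 and training stops there.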

Results. Table 1 shows the results of our experiments. The column 'Inst.' shows the number of instances, #V shows the number of variables in the original dataset and #DV is the number of variables after discretization and binarization. 'MSE' is the loss of RIDDLE on the test data. Table 2 shows the statistical results concerning the experiments. For each dataset, we stored the minimal and maximal accuracy. Then, we computed the median and standard deviation (S.D.). From Table 2 we can conclude that RIDDLE has consistent performance in most of the datasets considered.
We remark that another advantage of RIDDLE is that the values provided with the rules have a clear meaning, in terms of necessity. Meanwhile, fuzzy approaches such as FURIA provide a weaker foundation for the interpretation of the values associated with the rules. Additionally, the necessity values also distinguish RIDDLE from approaches such as decision trees which, usually, do not provide a measure of a rule's reliability.

Conclusion
In this work, we introduced RIDDLE: a novel deep learning architecture specialised in performing Rule Induction in the presence of incomplete or uncertain data. RIDDLE is a white-box model, as its trained weights have a clear meaning concerning the decisions that the model takes while performing inference on the input. These weights can be translated into propositionally complete rules that are simpler than the rules found by state-of-the-art algorithms. In addition, each rule is associated with a certainty degree expressing the confidence of the model in the induced rule. Moreover, RIDDLE can seamlessly incorporate background knowledge via rule injection. Thus, RIDDLE provides an efficient, flexible, and interpretable solution for Rule Induction.
Future work. The next step is to optimise the matrix computation in RIDDLE's implementation and speed up both training and inference time. Also, we will evaluate the effect on accuracy of different methods of drawing possibility distributions from imprecise data [10,13].
contains precisely the interpretations in $\mathcal{I}$ that differ only on the unknown values of $I$. Then, from $\mathcal{I}_I$, we count the number of interpretations that satisfy a literal $l$, which is given by $c_{\mathcal{I}_I}(l) := |\{I' \in \mathcal{I}_I \mid I' \models l\}|$. Finally, the possibility associated to a literal $l$, according to the facts in $\mathcal{I}$ and $I$, is defined as $\Pi_{\mathcal{I}_I}(l) := c_{\mathcal{I}_I}(l)/\max(c_{\mathcal{I}_I}(l), c_{\mathcal{I}_I}(\neg l))$. Therefore, from $\mathcal{I}$ we can get the set of possibility degrees for each input literal $D := \{x^1, \ldots, x^d\}$, where for any

Table 1: Accuracy and model complexity of RIDDLE and FURIA compared with different datasets.

Table 2: Additional statistics from the empirical evaluation of RIDDLE obtained by running 40 training instances.