Now we iterate.

From our training set, remove the examples covered by the new rule.

Continue growing new rules — making our ruleset increasingly lenient — until we grow one whose precision is less than 50%.

Here is a visualization of the entire process:Growing-pruning a ruleset using iterative coverage.

My new ruleset package, wittgenstein, implements IREP, as well as another ruleset algorithm called RIPPER.

Besides sounding like a heavy metal band, RIPPER remains the state-of-art for this technique.

The algorithm is quite a bit more complicated than IREP, but here are the major differences:A more theoretically-tight stopping condition: Instead of using the pruning metric to tell us when to stop growing new rules, RIPPER borrows an information theory heuristic used in Quinlan’s C4.

5 decision tree algorithm known as description length.

The idea is that we can measure total complexity (in bits) at any stage in our modeling process as the complexity of our tentative model plus that of all the examples it fails to capture.

As our ruleset grows in length and precision, the model’s complexity increases, and the complexity of the number of examples it fails to capture decreases.

To prevent overfitting, we stop growing rules once total complexity has grown beyond some threshold.

Description length guides the balancing act between minimizing training error and minimizing model complexity.

Calculating a model’s description length is complicated and expensive.

But the gist is that model complexity is based on discriminativity.

A rule with more conditionals is more complex than one with fewer, as is a rule that selects its conditionals from a larger pool of possibilities.

k is the number of conditions in the rule, n the number of possible conditions.

r is k/n.

||k|| is the number of bits needed to send k (i.

e.

, log2(k)).

The 0.

5 factor is to account for possible redundancies.

The exceptions description length formula is more straightforward.

We choose the combinations of false positives and of false negatives from our positive and negative predictions.

(The formula uses combinations instead of permutations since order doesn’t matter).

Log₂ converts from decimal values into bits:p is the number of examples classified as positive, n the number classified as negative.

fp is the number of false positives, fn false negatives.

Model optimization: Once we’ve grown an initial ruleset, we can actually use our model to reevaluate each rule’s contribution in a more holistic way.

We consider replacing each rule with a couple of alternatives: a completely new grown-and-pruned replacement, as well as a grown-and-pruned revision of the original.

Our optimized model uses whichever of the three — original, replacement, or revision — is best.

(“Best” here is a bit complicated and somewhat hideous to implement.

It means whichever rule would result in the smallest description length for the ruleset, based on the smallest-possible description length we could get if we removed all the other rules that increase description length.

) We can repeat the optimization phase as many times as we want, but the original paper suggests two iterations.

And is there not also the case where we play and — make up the rules as we go along?- Ludwig WittgensteinWrap things up: If necessary, grow a few more rules to cover any positive training examples our optimized model no longer covers.

Finally, get rid of any rules that don’t improve description length.

Using rulesets in your codeJava users who want to use a ruleset learner can use Weka’s RIPPER implementation, JRip.

There are also Weka wrappers for Python and R.

Python users can also try wittgenstein.

(There might be other Python packages for these particular algorithms out there, but I couldn’t find any.

) The github repo is here.

To install from the command-line:pip install wittgensteinHere’s a quick usage example, using the delightful poisonous mushroom dataset.

Our goal is to produce a set of rules that can discern which mushrooms are poisonous.

Let’s start by loading our dataframe into pandas:>>> import pandas as pd>>> df = pd.

read_csv(mushroom.

csv)And train-test-splitting our data:>>> from sklearn.

model_selection import train_test_split>>> train, test = train_test_split(df, test_size=.

33, .

random_state=42)Wittgenstein uses fit-predict-score syntax that’s similar to scikit-learn’s.

We’ll train a RIPPER classifier, with the positive class defined as poisonous.

>>> import wittgenstein as lw>>> clf = lw.

RIPPER()>>> clf.

fit(train, class_feat='Poisonous/Edible', pos_class='p', .

random_state=42)During initialization/fitting, we can pass a few optional parameters:prune_size: change the grow/prune proportion.

If you want to skip the pruning stage (exciting, but not recommended!) you can set this to None when using IREP.

k: the number of optimization runsdl_allowance: the complexity stopping thresholdverbosity (1–5): Use this if you’re curious to see how rules get generated.

(Each level of verbosity is explained in the docstrings.

)n_discretize_bins: wittgenstein will automatically detect and discretize continuous features for you — use this if you want to control the number of bins.

You can test a model using the default metric (accuracy), or by passing in your own scoring metric.

Let’s import precision and recall from scikit-learn.

We’ll also examine model complexity by counting the number of conditions.

>>> # Split target class feature>>> X_test = test.

drop('Poisonous/edible', axis=1)>>> y_test = test['Poisonous/edible']>>> # Collect performance metrics>>> from sklearn.

metrics import precision_score, recall_score>>> precision = clf.

score(X_test, y_test, precision_score)>>> recall = clf.

score(X_test, y_test, recall_score)>>> cond_count = clf.

ruleset_.

count_conds()>>> print(f'precision: {precision} recall: {recall}.

conds: {cond_count}')precision: 0.

9938.

, recall: 0.

9977.

, conds: 32We can access our trained model using the clf.

ruleset_ attribute.

A trained ruleset model represents a list of “and’s” of “or’s”:>>> clf.

ruleset_.

out_pretty()[[Stalk-surface-above-ring=k^Gill-spacing=c] V[Gill-size=n^Stalk-root=?^Stalk-shape=t] V[Gill-size=n^Population=s] V[Sport-print-color=h^Cap-surface=s] V[Gill-size=n^Cap-surface=s^Stalk-shape=e] V[Gill-size=n^Habitat=g] V[Population=v^Stalk-shape=e^Bruises?=t] V[Gill-size=n^Stalk-root=b^Gill-spacing=c] V[Gill-size=n^Population=c] V[Gill-size=n^Cap-color=p] V[Gill-size=n^Gill-color=u^Cap-surface=f] V[Gill-size=n^Cap-color=g^Gill-spacing=w] V[Gill-color=g^Stalk-root=b]]To generate new predictions, use the predict method:>>> clf.

predict(mysterious_unseen_mushrooms)[True, False, False, True, False.

We can also ask our model to tell us why it made each positive prediction that it did:>>> clf.

predict(mysterious_unseen_mushrooms, give_reasons=True)([True, False, False, True, False.

, [[<Rule object: [Gill-size=n^Population=s]>, <Rule object: [Gill-size=n^Cap-surface=s^Stalk-shape=e]>], [], [], [<Rule object: [Gill-size=n^Population=s]>], [].

)Pretty cool!How well does it work?I ran my IREP and RIPPER implementations on repeated tests across an initial 11 categorical datasets (mostly from UCI), using scikit-learn’s DecisionTreeClassifier as well as grid-search-optimized RandomForestClassifier as baselines.

(I threw out the results from two datasets for which sklearn and wittgenstein, respectively, refused to make positive predictions.

)Comparing rulesets against an ensemble technique like random forest is a bit of an unfair comparison — and grid search-tuning Forest makes it even more unfair — but I wanted to see how well wittgenstein could compete with the best comparable alternative.

Even though it’s a decision tree classifier, scikit-learn’s tree implementation doesn’t actually take categorical data.

But that’s okay — we’ll just need to do a little preprocessing to get our data into a format that DecisionTreeClassifier will accept.

First, let’s use scikit’s LabelEncoder to transform our categorical features into numerical ones:>>> from sklearn.

preprocessing import LabelEncoder>>> le = LabelEncoder()>>> df_le=df.

apply(le.

fit_transform)>>> df_le.

head()Then, we use one hot encoding to create dummy variables.

Otherwise we’d be stuck with ordinal features instead of nominal ones!>>> from sklearn.

preprocessing import OneHotEncoder>>> encoder = OneHotEncoder(sparse=False)>>> encoder.

fit(df_le)>>> df_hot = enc.

transform(df_le)>>> df_hot.

head()Preprocessing complete, we’re now ready to split our data…>>> train, test = train_test_split(df_hot, test_size=.

33, .

random_state=random_state)>>> train_X = train[:,n_classes:]>>> train_y = train[:,0]>>> test_X = test[:,n_classes:]>>> test_y = test[:,0]…and train our model:>>> tree_clf = DecisionTreeClassifier(random_state=random_state)>>> tree_clf.

fit(train_X, train_y)To score our lovely tree:>>> predictions = tree_clf.

predict(test_X)>>> precision = precision_score(test_y, predictions)>>> recall = recall_score(test_y, predictions)>>> print(f'precision: {precision} recall: {recall} node_count: .

{tree_clf.

tree_.

node_count}').

Here is the code for tuning and fitting a random forest:>>> forest = RandomForestClassifier(random_state=random_state)>>> n_features = train_X.

shape[1]>>> grid_params = { 'n_estimators': [50,100,200], 'max_depth': [1,3,5,8,10,None], 'min_samples_leaf': [3,6,10,13,16,20] }>>> clf = GridSearchCV(forest, grid_params, cv=5)>>> clf.

fit(train_X, train_y)>>> best_params = clf.

best_params_>>> forest = RandomForestClassifier( .

n_estimators=best_params['n_estimators'], .

max_depth=best_params['max_depth'], .

min_samples_leaf=best_params['min_samples_leaf'], .

random_state=random_state)>>> forest.

fit(train_X, train_y)ResultsMy package was competitive with sklearn, at least on these datasets.

(A detailed Jupyter notebook with the tests can be found here.

)Here is a comparison for how frequently each ruleset model beat each sklearn model, scored by precision:Here is a recall comparison:I also compared their compactness, as measured by the total number of conditions or nodes.

Some areas wittgenstein did well:Imbalanced classes: Both ruleset algorithms handled imbalanced classes better than tree-based methods; IREP and RIPPER beat Tree in precision and recall, and both did substantially better than Forest in recall for heavily imbalanced datasets.

Overfitting: On datasets with fewer training examples and fewer examples per feature, both ruleset algorithms beat Tree on precision, and RIPPER beat Forest.

(The advantage didn’t extend to recall.

)Compactness/Interpretability: IREP and RIPPER models were orders of magnitude more compact than (out-of-the-box) Tree and (tuned) Forest.

Potential drawbacks to consider:Speed: IREP and decision trees share the same time-complexity — O(anlogn), where a is the number of attributes, and n, examples.

But RIPPER’s optimization stage can become time-consuming on larger datasets, at O(anlog²n).

More specific to this package, until I optimize key portions of wittgenstein into C++, RIPPER usually takes the longest to train of the four.

IREP, on the other hand, runs pretty speedily, even in Python.

Continuous features: sklearn’s trees implement CART, which uses a more sophisticated discretization algorithm than the one I have so-far implemented.

For now you may get better performance out of sklearn for datasets with lots of continuous features, though this could change fairly soon.

Performance: I would generally expect C5.

0 trees, random forest, and partial decision trees (a hybrid tree-ruleset approach) to perform better than ruleset models on most — but not every — dataset; against C4.

5 and CART trees, the winner should be more up-in-the-air.

As with any machine learning model, your specific data and the specific problem you face determines the best tool for the job.

Ruleset learners are a machine-learning approach that have been interesting to implement, and which can in some cases be useful to have in your toolset.

I’d love to hear your thoughts, so please feel free to reach out to me on LinkedIn or Github!References[1] J.

Furnkrantz and G.

Widmer, Incremental Reduced Error Pruning (1994), Machine Learning 1994 Proceedings of the Eleventh Annual Conference[2] J.

Ross Quinlan, MDL and Categorical Theories (continued) (1995) Machine Learning 1995 Proceedings of the Twelfth International Conference[3] W.

Cohen, Fast Effective Rule Induction (1995) Machine Learning 1995 Proceedings of the Twelfth International Conference[4] E.

Frank and I.

H.

Witten, Generating Accurate Rule Sets Without Global Optimization (1998) Machine Learning 1998 Proceedings of the Twelfth International Conference[5] T.

Wang et.

al, A Bayesian Framework for Learning Rule Sets for Interpretable Classification (2017) Journal of Machine Learning Research[6] Ludwig Wittgenstein, Philosophical Investigations (1958).