Discretisation methods fall into two categories: supervised and unsupervised.

Unsupervised methods do not use any information other than the variable's distribution to create the contiguous bins in which the values will be placed.

Supervised methods typically use target information to create the bins or intervals.

This article covers only one supervised method: discretisation using decision trees. Further below, we will load a dataset and perform the discretisation on it.

Discretisation with decision trees

Discretisation with decision trees consists of using a decision tree to identify the optimal split points that determine the bins or contiguous intervals:

Step 1: Train a decision tree of limited depth (2, 3 or 4) using the variable we want to discretise to predict the target.

Step 2: Replace the original variable values with the probability returned by the tree. The probability is the same for all observations within a single bin, so replacing values by the probability is equivalent to grouping the observations within the cut-offs decided by the decision tree.

Advantages:

The probabilistic predictions returned by the decision tree are monotonically related to the target.

The new bins show decreased entropy; that is, the observations within each bucket/bin are more similar to each other than to those in other buckets/bins.

The tree finds the bins automatically.

Disadvantages:

It may cause over-fitting.

More importantly, some tuning of the tree parameters might be needed to obtain the optimal splits (e.g., depth, the minimum number of samples in one partition, the maximum number of partitions, and a minimum information gain). This can be time-consuming.

Let's see how to perform discretisation with decision trees using the Titanic dataset.

1. Import useful libraries

IN[1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
# Needed for the tree built in IN[4]
from sklearn.tree import DecisionTreeClassifier

2. Load the dataset

IN[2]:
data = pd.read_csv('titanic.csv', usecols=['Age', 'Fare', 'Survived'])
data.head()

3. Separate the data into train and test sets

IN[3]:
X_train, X_test, y_train, y_test = train_test_split(
    data[['Age', 'Fare', 'Survived']], data.Survived, test_size=0.3)

From here on, we assume that the dataset has no missing values (or that any missing values have already been imputed). I am skipping that part because my main goal is to show how discretisation works.

Now let's visualize our data to gain some insight into it and understand the variables.

4. Build a classification tree that uses Age to predict Survived, in order to discretise the Age variable

IN[4]:
tree_model = DecisionTreeClassifier(max_depth=2)
tree_model.fit(X_train.Age.to_frame(), X_train.Survived)
# Column 1 of predict_proba is the probability of class 1 (Survived = 1)
X_train['Age_tree'] = tree_model.predict_proba(X_train.Age.to_frame())[:, 1]
X_train.head(10)

We now have a classification model that uses the Age variable to predict the Survived variable. The newly created variable Age_tree contains the predicted probability that the data point belongs to class 1 (Survived = 1).

5. Check the number of unique values in the Age_tree variable

IN[5]:
X_train.Age_tree.unique()

Why only four probabilities? In IN[4] we set max_depth = 2. A tree of depth 2 splits the data at two levels (one split at the root and one in each of its two children), producing at most 4 leaves. Each leaf is a bucket with its own probability, which is why we see 4 different probabilities in the output above.

6. Check the relationship between the discretised variable Age_tree and the target Survived

IN[6]:
fig = plt.figure()
# DataFrame.plot() returns a matplotlib Axes, so we name it ax
ax = X_train.groupby(['Age_tree'])['Survived'].mean().plot()
ax.set_title('Monotonic relationship between discretised Age and target')
ax.set_ylabel('Survived')

Here we can see a monotonic relationship between the discretised variable Age_tree and the target variable Survived. The plot suggests that Age_tree is a good predictor of Survived.

7. Check the number of passengers per probability bucket/bin to understand the distribution of the discretised variable

IN[7]:
X_train.groupby(['Age_tree'])['Survived'].count().plot.bar()

Let's check the age limits of the buckets generated by the tree by capturing the minimum and maximum age in each probability bucket, to get an idea of the bucket cut-offs.

8. More details
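
The bucket cut-offs mentioned above could be inspected along the following lines. This is a minimal sketch, not the article's own cell: it runs on a small synthetic stand-in for the Titanic training set so that it is self-contained, and the survival rule, the random seeds, and the names bucket_limits and thresholds are assumptions made for illustration. With the real data, the same two steps would apply to the X_train and tree_model built in IN[3] and IN[4].

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training set (an assumption for illustration;
# the article itself reads the real data from titanic.csv).
rng = np.random.default_rng(0)
age = rng.uniform(1, 80, size=500).round()
# Hypothetical rule: passengers under 15 survive more often.
survived = (rng.uniform(size=500) < np.where(age < 15, 0.7, 0.3)).astype(int)
X_train = pd.DataFrame({"Age": age, "Survived": survived})

# Same discretisation as in IN[4]: a depth-2 tree on Age alone.
tree_model = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_model.fit(X_train[["Age"]], X_train["Survived"])
X_train["Age_tree"] = tree_model.predict_proba(X_train[["Age"]])[:, 1]

# Minimum age, maximum age and passenger count in each probability bucket.
bucket_limits = X_train.groupby("Age_tree")["Age"].agg(["min", "max", "count"])
print(bucket_limits.sort_values("min"))

# The cut-offs can also be read from the fitted tree: internal nodes carry
# a split threshold, while leaves are marked with a negative feature index.
tree = tree_model.tree_
thresholds = np.sort(tree.threshold[tree.feature >= 0])
print(thresholds)
```

Each row of bucket_limits corresponds to one probability value, so sorting by the minimum age lists the buckets in age order. The thresholds array gives the split points the tree chose, which fall between the maximum age of one bucket and the minimum age of the next.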