Deep dive into multi-label classification..!

Toxic comments classification.

Fig-1: Multi-label classification to find genre based on plot summary.

With the continuous increase in available data, there is a pressing need to organize it, and modern classification problems often involve the prediction of multiple labels simultaneously associated with a single instance. Known as multi-label classification, it is a task that is omnipresent in many real-world problems. In this project, using a Kaggle problem as an example, we explore different aspects of multi-label classification.

DISCLAIMER FROM THE DATA SOURCE: the dataset contains text that may be considered profane, vulgar, or offensive.

Bird's-eye view of the project:

Part-1: Overview of multi-label classification.
Part-2: Problem definition & evaluation metrics.
Part-3: Exploratory data analysis (EDA).
Part-4: Data pre-processing.
Part-5: Multi-label classification techniques.

Part-1: Overview of Multi-Label Classification:

Multi-label classification originated from the investigation of the text categorisation problem, where each document may belong to several predefined topics simultaneously. Multi-label classification of textual data is an important problem; analogous tasks arise with images as well, such as predicting all the objects present in a picture (this enters the realm of computer vision).

In multi-label classification, the training set is composed of instances, each associated with a set of labels, and the task is to predict the label sets of unseen instances by analyzing training instances with known label sets.

The difference between multi-class classification and multi-label classification is that in multi-class problems the classes are mutually exclusive, whereas in multi-label problems each label represents a different classification task, but the tasks are somehow related. For example, multi-class classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear, but not both at the same time. Whereas an instance of multi-label classification can be
that a text might be about any of religion, politics, finance or education at the same time, or none of these.

Part-2: Problem Definition & Evaluation Metrics:

Problem Definition:

Toxic comment classification is a multi-label text classification problem with a highly imbalanced dataset. We are challenged to build a multi-label model that is capable of detecting different types of toxicity, like threats, obscenity, insults, and identity-based hate.

Part-4: Data Pre-processing:

TF-IDF weighs how often a word occurs in a document against how commonly it occurs across the entire corpus of documents. Words in the document with a high TF-IDF score occur frequently in the document and provide the most information about that specific document.

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

train, test = train_test_split(data, random_state=42, test_size=0.30, shuffle=True)

train_text = train['comment_text']
test_text = test['comment_text']

vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='word',
                             ngram_range=(1, 3), norm='l2')
# Fit on the training text only; fitting on the test text as well
# would leak test-set vocabulary into the model.
vectorizer.fit(train_text)

x_train = vectorizer.transform(train_text)
y_train = train.drop(labels=['id', 'comment_text'], axis=1)
x_test = vectorizer.transform(test_text)
y_test = test.drop(labels=['id', 'comment_text'], axis=1)
```

TF-IDF is easy to compute, but its disadvantage is that it does not capture position in text, semantics, or co-occurrences across different documents.

Part-5: Multi-Label Classification Techniques:

Most traditional learning algorithms are developed for single-label classification problems. Therefore, many approaches in the literature transform the multi-label problem into multiple single-label problems, so that existing single-label algorithms can be used.

1. OneVsRest

Traditional two-class and multi-class problems can both be cast into multi-label ones by restricting each instance to have only one label. An intuitive approach to solving a multi-label problem is to decompose it into multiple independent binary classification problems (one per category). In a "one-vs-rest" strategy, one could build multiple independent
classifiers and, for an unseen instance, choose the class for which the confidence is maximized. The main assumption here is that the labels are mutually exclusive.

Classifier Chains

In a classifier chain, the first classifier is trained on just the input data, and each subsequent classifier is trained on the input space plus the outputs of all the previous classifiers in the chain. This way the method, also called classifier chains (CC), can take into account label correlations. The total number of classifiers needed for this approach is equal to the number of classes, but the training of the classifiers is more involved. Following is an illustrated example with a classification problem of three categories {C1, C2, C3} chained in that order.

Fig-13: Classifier Chains

```python
# using classifier chains
from skmultilearn.problem_transform import ClassifierChain
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# initialize classifier chains multi-label classifier
classifier = ClassifierChain(LogisticRegression())

# Training the logistic regression models on train data
classifier.fit(x_train, y_train)

# predict
predictions = classifier.predict(x_test)

# accuracy
print("Accuracy = ", accuracy_score(y_test, predictions))
```

Adapted Algorithm

Algorithm adaptation methods for multi-label classification concentrate on adapting single-label classification algorithms to the multi-label case, usually by changes in cost/decision functions. Here we use a multi-label lazy learning approach named ML-KNN, which is derived from the traditional k-nearest neighbor (KNN) algorithm. The skmultilearn.adapt module implements algorithm adaptation approaches to multi-label classification, including but not limited to ML-KNN.

```python
from skmultilearn.adapt import MLkNN
from sklearn.metrics import accuracy_score
from scipy.sparse import lil_matrix

classifier_new = MLkNN(k=10)

# Note that this classifier can throw errors when handling sparse matrices,
# so the inputs are converted to dense arrays first.
x_train = lil_matrix(x_train).toarray()
y_train = lil_matrix(y_train).toarray()
x_test = lil_matrix(x_test).toarray()

# train
classifier_new.fit(x_train, y_train)

# predict
predictions_new = classifier_new.predict(x_test)

# accuracy
print("Accuracy = ", accuracy_score(y_test, predictions_new))
```

Output: Accuracy =
0.88166666667

Conclusion:

There are two main methods for tackling a multi-label classification problem: problem transformation methods and algorithm adaptation methods. Problem transformation methods transform the multi-label problem into a set of binary classification problems, which can then be handled with single-label classifiers, whereas algorithm adaptation methods adapt the algorithms to perform multi-label classification directly.
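The one-vs-rest strategy described in Part-5 has no code sample above; the following is a minimal sketch, assuming scikit-learn's OneVsRestClassifier and a small synthetic dataset standing in for the toxic-comment features (the variable names mirror those used in Part-4):

```python
# Hypothetical one-vs-rest sketch: one independent logistic-regression
# classifier is fitted per label, using scikit-learn.
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score

# Synthetic multi-label data with six labels, standing in for the
# TF-IDF features and toxicity labels of the real dataset.
X, Y = make_multilabel_classification(n_samples=500, n_features=20,
                                      n_classes=6, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, Y, random_state=42,
                                                    test_size=0.30)

# One binary classifier per label, trained independently of the others.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(x_train, y_train)

predictions = clf.predict(x_test)

# Subset accuracy: a sample counts as correct only if every label matches.
print("Accuracy = ", accuracy_score(y_test, predictions))
```

As with the classifier-chain example, the accuracy reported here is subset accuracy, so a prediction counts as correct only when all six labels for a sample are right; unlike classifier chains, this approach cannot exploit correlations between labels.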
