Will Badr · Feb 22

Weka is a sturdy brown bird that doesn’t fly.
It is endemic to the beautiful island of New Zealand, but this is not what we are discussing in this article.
In this article, I want to introduce you to the Weka software for Machine Learning.
WEKA is short for Waikato Environment for Knowledge Analysis.
It is developed by the University of Waikato, New Zealand.
It is an open source Java software that has a collection of machine learning algorithms for data mining and data exploration tasks.
It is a very powerful tool for understanding and visualizing machine learning algorithms on your local machine.
It contains tools for data preparation, classification, regression, clustering, and visualization.
Use Cases:
If you have just started learning about machine learning and algorithms, WEKA is one of the best tools for getting started: you can explore the different algorithms and see which one applies best to your problem. Sometimes you have a classification problem but do not know which algorithm will solve it most accurately. WEKA makes it easy to apply many different algorithms to your data and see which one gives the best results.
Installation:
Installing the software is quite simple: you just need Java 8 installed as a prerequisite, then download the right executables for your platform from HERE.
After installation, open the package manager from the GUI Chooser to install any additional learning schemes and tools. The first package I will install and demonstrate is Auto-Weka.
Auto-weka is the AutoML implementation for Weka.
It automatically finds the best model with its best hyperparameter settings for a given classification or regression task.
Once the installation is finished, you will need to restart the software in order to load the library then we are ready to go.
NOTE: There is a known issue.
In order to use Auto-WEKA through the WEKA GUI on OSX, WEKA must be run from the command line instead of the Mac application.
For example:

cd /Applications/weka-3-8-1-oracle-jvm.app/Contents/Java
java -jar weka.jar

Loading a Dataset:
Weka also comes with a few datasets that you can use for experimentation.
Now, let’s load our first dataset and explore it a little.
In this example, I will use the Iris Dataset.
For Mac OSX, click on “Explorer” → “Open File” → /Volumes/weka-3-8-3/weka-3-8-3/data.
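The bundled datasets are stored in Weka’s native ARFF (Attribute-Relation File Format): a header declaring the attributes, followed by the data rows. Abridged, the bundled iris.arff looks like this:

```
@relation iris
@attribute sepallength numeric
@attribute sepalwidth numeric
@attribute petallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}
@data
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
```

The last attribute declared is treated by default as the class to predict.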
After loading the dataset, you can see that Weka automatically shows some statistics for the attributes (listed in the left-side window), and when you select one feature, it shows the class distribution for that specific feature.
You can also visualize the class distribution in relation to all the other features by clicking on “Visualize All”.
Now, let’s train an actual classifier.
Weka comes with many classifiers that can be used right away.
When you select the “Classify” tab, you can see a few classification algorithms organized in groups. Here is a summary of each group:

bayes: classification algorithms based on Bayes’ theorem, such as Naive Bayes and Naive Bayes Multinomial.
functions: a set of regression functions, such as Linear and Logistic Regression.
lazy: lazy learning algorithms, such as Locally Weighted Learning (LWL) and k-Nearest Neighbors.
meta: a set of ensemble methods and dimensionality reduction algorithms, such as AdaBoost and Bagging (which reduces variance).
misc: miscellaneous classifiers, such as SerializedClassifier, which loads a pre-trained model to make predictions.
rules: rule-based algorithms, such as ZeroR.
trees: decision tree algorithms, such as Decision Stump and Random Forest.
Now, let’s first classify the Iris dataset using a Random Forest Classifier.
Random Forest is an ensemble learning algorithm that can be used for classification, regression and other tasks.
It works by constructing many decision trees at training time and outputting the class predicted by the majority of the individual trees.
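As a rough illustration (not Weka’s actual implementation), the two core ideas, bootstrap sampling and majority voting, can be sketched in a few lines of Python:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Sample the training set with replacement; each tree trains on one such sample."""
    return [rng.choice(data) for _ in data]

def majority_vote(per_tree_predictions):
    """Combine one class prediction per tree into the forest's final prediction."""
    return Counter(per_tree_predictions).most_common(1)[0][0]

rng = random.Random(42)
sample = bootstrap_sample(["a", "b", "c", "d"], rng)      # toy "training set"
print(majority_vote(["setosa", "setosa", "versicolor"]))  # setosa
```

Because each tree sees a slightly different bootstrap sample, the trees disagree in different ways, and the vote averages their errors out.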
In order to use RF in Weka, select the Random Forest from the trees group.
Once you select the Random Forest algorithm, it will automatically load the default set of the hyperparameters.
You can customize the hyperparameters by clicking on the command line shown next to the “Choose” button.
To evaluate the training, I will use 15-fold cross-validation; then we are ready to train. The Iris dataset is quite small, so training will take a fraction of a second.
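The bookkeeping behind k-fold cross-validation is simple: split the instances into k folds, then train on k−1 folds and test on the remaining one, rotating through all folds. A minimal stdlib-Python sketch (not Weka’s code), using the Iris dataset’s 150 instances and 15 folds:

```python
def kfold_indices(n_samples, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    start = 0
    for fold in range(k):
        # Distribute any remainder across the first few folds.
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# The Iris dataset has 150 instances, so 15 folds of 10 instances each:
folds = list(kfold_indices(150, 15))
print(len(folds), len(folds[0][1]))  # 15 10
```

Every instance is tested exactly once, which is why cross-validated error is a less optimistic estimate than training error.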
It will generate an output summary for the training, as below. You can see metrics like the confusion matrix, ROC area, precision, recall, etc.
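Per-class precision and recall can be read straight off the confusion matrix. A small Python sketch, with a hypothetical 3-class matrix (the numbers are made up for illustration, not taken from the Weka run):

```python
def precision_recall(cm, c):
    """Per-class precision and recall from a confusion matrix cm[actual][predicted]."""
    tp = cm[c][c]
    predicted_c = sum(row[c] for row in cm)  # column sum: everything predicted as c
    actual_c = sum(cm[c])                    # row sum: everything actually c
    return tp / predicted_c, tp / actual_c

# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted):
cm = [[50, 0, 0],
      [0, 47, 3],
      [0, 4, 46]]
p, r = precision_recall(cm, 1)
print(round(p, 3), round(r, 3))  # precision = 47/51, recall = 47/50
```

Precision asks “of everything predicted as class c, how much really was c?”; recall asks “of everything that really was c, how much did we find?”.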
What Algorithm Should I Use For My Problem?
Generally speaking, it is hard to know which algorithm will work best for the problem you are trying to solve.
Once you narrow down the scope of the problem (MultiClass Classification, Regression, Binary Classification), you can start trying a set of algorithms that are designed to tackle that scope of the problem.
During that process, you may discover an algorithm that picks up the hidden structure in your data.
Weka is a really good tool to achieve that because you can quickly switch between algorithms and train them on a portion of your dataset then compare the results without having to write much code.
Once you settle on an algorithm, you can start implementing a production-level version of the one that worked best on your data. Alternatively, you can use a smarter approach that automatically selects the right algorithm with the right hyperparameters for your data. This smart approach is called AutoML.
Using AutoML (Auto-Weka):
Auto-WEKA is the AutoML implementation package for Weka.
It is used much like any other WEKA classifier.
After loading a dataset into WEKA, you can use Auto-Weka to automatically determine the best WEKA model and its hyperparameters.
It does so by intelligently exploring the space of classifiers and parameters using the SMAC tool.
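Auto-WEKA’s actual search uses SMAC, a sequential model-based optimizer. As a deliberately simplified stand-in, plain random search over a joint classifier-plus-hyperparameter space conveys the idea (the search space and error values below are hypothetical, purely for illustration):

```python
import random

# Hypothetical joint search space: which classifier, and which hyperparameters for it.
SEARCH_SPACE = {
    "RandomForest": {"numIterations": [50, 100, 200], "maxDepth": [0, 5, 10]},
    "LMT": {"minNumInstances": [3, 15, 30]},
}

def random_config(rng):
    """Draw one (classifier, hyperparameters) configuration at random."""
    clf = rng.choice(sorted(SEARCH_SPACE))
    params = {k: rng.choice(v) for k, v in SEARCH_SPACE[clf].items()}
    return clf, params

def search(evaluate, n_trials, seed=0):
    """Keep the configuration with the lowest estimated error."""
    rng = random.Random(seed)
    return min((random_config(rng) for _ in range(n_trials)),
               key=lambda cfg: evaluate(*cfg))

# Toy error function standing in for cross-validated error:
toy_error = lambda clf, params: 0.04 if clf == "LMT" else 0.06
print(search(toy_error, 20))
```

SMAC improves on this by fitting a model of which regions of the space look promising, so later trials are spent where good configurations are likely, rather than uniformly at random.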
Auto-WEKA has only a few options that can be tweaked. For most cases, only two of them are relevant:

timeLimit: the time in minutes Auto-WEKA will take to determine the best classifier and configuration. Turn this knob to a higher value if you cannot get good results within the default time (15 minutes).
memLimit: the memory limit in megabytes for running classifiers. If you have a very large dataset, you should increase this value.

A third option, parallelRuns, sets the number of runs to perform in parallel. While Auto-WEKA is running, it shows the number of configurations evaluated so far and the estimated error of the best configuration in the status bar, as below.
When the time limit is reached, the Auto-Weka process stops and displays the best configuration for your dataset.
Now, let’s explore some of the Auto-Weka results.
Below is the summary of the Auto-Weka output. It says that the best classifier is the Logistic Model Trees (LMT) algorithm, with the hyperparameters specified above as “arguments”.
You can interpret those arguments by using the documentation for the LMT classifier.
Here is the meaning of each argument:

-C: use cross-validation for boosting at all nodes.
-P: use error on probabilities instead of misclassification error for the stopping criterion.
-M 3: set the minimum number of instances at which a node can be split to 3.
-W 0: set beta for weight trimming for LogitBoost; 0 means no weight trimming.
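If you prefer the command line, the same LMT configuration can be run directly through Weka’s CLI, where -t names the training file (this assumes weka.jar is on your classpath and you adjust the dataset path for your machine):

```shell
java -cp weka.jar weka.classifiers.trees.LMT -C -P -M 3 -W 0 -t data/iris.arff
```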
You can also see that Auto-Weka found that LMT gives better results than Random Forest: LMT correctly classified 96% of instances compared to 94% for RF, and incorrectly classified only 4% compared to 6% for RF. This is a small difference, but it can have a huge impact on larger datasets.
Summary:
This article is a quick-start guide to using Weka to explore and train machine learning algorithms on your dataset through the GUI, without having to write any code. It is very useful for gathering insights into your data, learning a new algorithm, or finding out which algorithm works best for your dataset.