", false));For example, lets run different combination of supervised algorithms on the Ionosphere dataset (https://archive.

ics.

uci.

edu/ml/datasets/Ionosphere!!)AdaBoostM1 / MultiBoostAB +/ Pruned TreeSVM+/AdaBoostingM1+/PolyKernel+/Normalized Polykernel)IBK +/ k=3 +/ crossValidation)MLP (with non-linear sigmoid transfer function) and so on.

Then we have the interesting observations:MLP achieves 99% accuracy but takes 1.

45 s.

Its very useful for large number of heterogeneous features.

MLP is slower to learn nonlinear functions with complex local structures due to global nature of functional approximations.

During gradient descent, the plateaus in which the parameter is trapped in the process of learning, take long time to get rid of them.

SVM with Normalized PolyKernel achieves 97.

14% accuracy in 0.

10 s.

SMO Algo is very much resilient to over]fitting.

Looks like Normalized-PolyKernel performs better on test datasets than PolyKernel as it normalizes some sparseness in feature vectors and finds a better model.

J48 with numFolds = 5 and reducedErrorPruning offers higher accuracy say 91.

22%.

The accuracy further improves to 97.

8% when AdaBoostM1 / MultiBoostAB boosting is applied.

KNN with k=3 and with cross validation offers very good result.

But shows some inconsistency due to unbalanced data distribution problem which is not handled properly.

Experimenting Unsupervised AlgorithmNow lets apply Unsupervised Algorithm on the datasets to understand how clustering techniques and feature selections works by creating an Wrapper classes like ICA , NN-ICA, Kmeans-ICA, NN-KMeans, EM-ICALets check the Fun Facts with the following dataset:Users are interviewed to provide opinion about purchase behavior against a large product brand.

prefer_emails, quality_products_priced_high, consult_multiple_sites_before_buy, prefer_television_ads, prefer_mobile_ads,watching_ads_makes_me_buy, prefer_online_networking, women_main_decision_maker, prefer_kids_friendly_ads, prefer_big_brands, frequently_visit_malls, prefer_online_offers, prefer_installment_payments, prefer_restaurant_coupons_shopping, prefer_movie_coupons_shoppingUser Feedback data is captured as numeric values:1,3,1,2,3,1,2,2,3,2,2,1,1,1,1 (1 = Strongly Agree, 2 = Agree, 3 = Neither Agree nor Disagree, 4 = Disagree, 5= Strongly Disagree)Lets first try to apply KMeans ALgorithm which provides iterative distance-based disjoint sets of clusters .

DataSet ds = CSVFilereader.

read(“/resources/marketing_survey.

csv”, skipHeader); KMeansClusterer km = new KMeansClusterer(6); km.

estimate(ds);Weka Output is simple to interpret.

Evidently , Cluster 0 is the best one across all the features.

We can make other interesting observations like there are customers in Cluster 5 providing stronger opinion about ‘prefer_mobile_ads‘ than other clusters.

Next lets run Expectation Maximization Algorithm which finds clusters with prior probabilities.

It takes considerably longer time to find the clusters as it needs to calculate the cluster membership probability of any instance.

We observe that Cluster 2 offers best market segment.

EMClusterer em = new EMClusterer(); em.

estimate(ds);Now lets focus on Dimensionality reduction algorithms.

Principal Component Analysis: Dimensionality Reduction is achieved by choosing enough Eigen vectors to account for 95% of variance of original data.

PrincipalComponentAnalysis filter =new PrincipalComponentAnalysis(dataset); System.

out.

println(set.

getDescription()); System.

out.

println(filter.

getProjection()); Matrix reverse = filter2.

getProjection().

transpose(); for (int i = 0; i < set.

size(); i++) { Instance instance = set.

get(i); instance.

setData(reverse.

times(instance.

getData()).

plus(filter.

getMean())); }Once the Covariance Matrix is printed , we can find the mutual feature correlation.

PCA answers the questions — which of these M parameters explain a significant amount of variation contained within the data set ?Then we select Top 5 Ranked Attributes.

0.

864 1 0.

443v5+0.

377v4+0.

354v11+0.

327v9+0.

298v6…0.

7509 2 -0.

461v4–0.

422v5+0.

382v9+0.

377v11+0.

339v10…0.

6513 3 0.

519v8–0.

505v12–0.

415v11+0.

388v9–0.

294v13…0.

5564 4 0.

529v6–0.

473v3+0.

361v5–0.

327v10–0.

318v4…0.

4744 5 -0.

446v2–0.

373v8–0.

362v7–0.

36v3+0.

326v10…Similarly, we can run Independent Component Analysis: It generates enough Eigen Vectors in the newly generated feature set to provide an idea about features with higher variations that can be selected for further classifications.

System.

out.

println("Before ICA"); System.

out.

println(set.

getDescription()); System.

out.

println(set); IndependentComponentAnalysis filter = new IndependentComponentAnalysis(set, 8); filter.

filter(set);System.

out.

println("After ICA");System.

out.

println(set);As we mentioned earlier our goal is to mix and match the ALgorithms.

So next we run KMeans Clustering with PCAPrincipalComponentAnalysis filter = new PrincipalComponentAnalysis(dataset, 15); filter.

filter(dataset); System.

out.

println(dataset.

getDescription()); KMeansClusterer km = new KMeansClusterer(4); km.

estimate(dataset);then we run KMeans Clustering with ICAIndependentComponentAnalysis filter = new IndependentComponentAnalysis(set); filter.

filter(set); System.

out.

println(set.

getDescription()); KMeansClusterer km = new KMeansClusterer(3); km.

estimate(set);java -cp unsupervised_algo.

jar:lib/ABAGAIL.

jar com.

ml.

unsupervised.

tests.

KMeansWithIndependentComponentAnalysisTestWe can also apply couple of other Weka dimensionality reduction techniques like InsignificantComponentAnalysis and RandomProjectionRandomProjection reduces the dimensionality of the data by projecting it onto a lower dimensional subspace using a random matrix with columns of unit length (i.

e.

It will reduce the number of attributes in the data while preserving much of its variation like PCA, but at a much less computational cost).

It first applies the NominalToBinary filter to convert all attributes to numeric before reducing the dimension.

Ideally we should encapsulate the code into suitable composite model builder wrapper and test all different variations.

To have more fun , we can run Neural Network with PCA//Feature Namesobs#, chk_acct,duration,history,new_car, used_car, furniture,radio_tv, education,retarining,crd_amount,sav_accnt,employment,install_rate,male_div, male_single, coapplication,guarantor, present_redisence, real_estate,prop_unknown,age,other_install,rent,own_res,num_credits,job,num_dependents,telephone,foreign,response//Feature Values2,1,48,2,0,0,0,1,0,0,5951,0,2,2,0,0,0,0,0,2,1,0,22,0,0,1,1,2,1,0,0,05,0,24,3,1,0,0,0,0,0,4870,0,2,3,0,1,0,0,0,4,0,1,53,0,0,0,2,2,2,0,0,0PrincipalComponentAnalysis filter = new PrincipalComponentAnalysis(set, 15); filter.

filter(set); network = factory.

createClassificationNetwork( new int[] {inputLayer, hiddenLayer, outputLayer}); ConvergenceTrainer trainer = new ConvergenceTrainer( new BatchBackPropagationTrainer(set, network, new SumOfSquaresError(), new RPROPUpdateRule())); trainer.

train(); System.

out.

println(“Convergence in “+trainer.

getIterations()+”iterations”);Instance[] patterns = set.

getInstances(); int true_positives_num = 0; int actual_class = 0; int predicted_class = 0; for (int i = 0; i < patterns.

length; i++) { network.

setInputValues(patterns[i].

getData()); network.

run(); actual_class = Math.

round(Float.

parseFloat(patterns[i].

getLabel().

toString())); predicted_class = Math.

round(Float.

parseFloat(network.

getOutputValues().

toString())); if(actual_class == predicted_class) { true_positives_num++; } }double true_positives = ((true_positives_num*100/(patterns.

length)));So essentially, we can very easily blend all different techniques and build a very powerful set of composite models like Kmeans-ICA, KMeans-PCA, NN-LDA, NN-ICA, NN-KMeans, NN-EMC etc.

There is no specific right or wrong model here, its all subject to experimentation by using Weka or any other library like SciKit.

Now lets switch gears and look into Randomized Searches and build a framework using ABAGAIL and Weka library.

Sometimes MLP is not able to converge to a solution for a complex dataset due to: (1) unbiased estimators but probably with high variance (due to over-fitting) / (2) rigid model in contrast , leading to small variance but high bias / (3) takes longer time and also runs the risk of getting caught into a local optima.

So lets apply Randomized optimization instead of Gradient descent in Back-Propagation NetworkLocal Random Search Algo using ABAGAIL.

jar (in classpath).

Here we apply 3 types of Optimization Algorithms on training samples.

We can combine multiple techniques and also test in isolated manner with different hyper-parameters.

oa[0] = new RandomizedHillClimbing(nnop[0]);oa[1] = new SimulatedAnnealing(1E11, .

1, nnop[1]); oa[2] = new StandardGeneticAlgorithm(100, 50, 10, nnop[2]);RandomizedHillClimbing is “a loop that continuously moves towards increasing value”.

It can randomly choose among the set of best successors to avoid local minima.

Simulated Annealing is an example of random local search method to incorporate sideways and downhill moves at the beginning with a probability based on the size of the change in objective function.

The idea is to do enough exploration of the whole space early on so that the final solution is relatively insensitive to the starting state.

This should lower the chances of getting caught at a local maximum, a plateau, or a ridge.

”Genetic Algorithm is an optimization and search technique based on the principles of genetics and natural selection.

A GA allows a population composed of many individuals to evolve under specified selection rules to a state that maximizes the “fitness” (i.

e.

, minimizes the cost function).

train (OptimizationAlgorithm oa, BackPropagationNetwork network, String oaName) { // for(int i = 0; i < trainingIterations; i++) { oa.

train(); double error = 0; for(int j = 0; j < instances.

length; j++) { if(instances[j] == null || instances[j].

getData() == null) continue; network.

setInputValues(instances[j].

getData()); network.

run(); Instance output = instances[j].

getLabel(), example = new Instance(network.

getOutputValues()); example.

setLabel(new Instance(Double.

parseDouble(network.

getOutputValues().

toString()))); error += measure.

value(output, example); } } }Once we run the combination of algorithms , we observe something very interesting w.

r.

t.

current data set.

GA provides a list of optimum variables, not just a single solution.

Its Random exploration (via CrossOver) technique can find solutions that other local search algo can’t find.

Random mutations avoid the problems of getting caught into a local minima.

GA Simultaneously searches from a wide sampling of the cost surface, dealing with a large number of variables.

So it is well suited for parallel computing,GA optimizes variables with extremely complex cost surfaces (they can jump out of a local minimum)GA takes longer time to achieve high accuracy with a sample size 200, mating nodes to 100, nodes to mutate 10GA works faster if we set population size to 100, mating nodes to 50, nodes to mutate 20 and achieves same high accuracy.

In some cases, calculus-based derivatives to find a solution of a well-behaved convex function can outperform Genetic Algorithm as GA could be still spending cycles to analyze the costs of the initial population, as it needs to come up with more bits for real-valued data.

The main benefit of Hill Climbing search is that it requires only a limited memory.

We need to select a local change that should improve the current value of Objective Function.

The main benefit of Hill Climbing search is that it requires only a limited memory.

Here we need to select a local change that should improve the current value of Objective Function.

Like gradient descent algorithms in continuous spaces it approaches a local minima but instead of using the gradient, hill climbing uses random local search.

HA differs from Gradient Descent Algo which adjust all of the values in random variable X at each iteration according to the gradient of the hill.

Once the number of iterations is 100, RHC correctly classified 95% instances as it can better explore more neighbors and move past ridges and plateaus .

Simulated Annealing is an example of random local search .

If the temperature is decreased by 0.

1 (much more slowly) and number of iterations is 10, the performance of SA improves a lot.

The main reason for improvement of performance is the ‘frequency of moves to explore non-optimal paths and the size of such non-optimal space (plateau) decreases’ as we lower the temperature.

So while GA performs better when nodes_to_mutate is increased, RHC needs large number of iterations and SA converges with smaller number of iterations and slower temperature change.

So far we have focussed on developing a Java-based frameworks using Weka and ABAGAIL for experimenting with supervised, unsupervised and randomized search algorithms.

In next blog, we shall discuss approaches for experimenting with Reinforcement Learning algorithms.

.. More details