However, due to the sparsity of the data (less than 10% of the records had most of the required fields), this approach was not followed. Rather, records with most of the required fields were chosen, and the missing values, such as the country an ICO was launched in, were then manually collected from various websites.

Conflicting data

Some of the data was conflicting; for example, websites listed different start and end dates for an ICO, or different countries of launch. This was handled by checking multiple sources and taking the values that were in consensus.

Categorical data

One-hot encoding was applied to the categorical data fields: ICO date, ICO launch month and ICO country.

The workflow for the Data collection and Preparation phases

4. Exploring and attempting to understand the data

We can explore the relationships between the inputs and the outputs by calculating their correlation coefficients and drawing scatter plots to visualise these relationships.

The price of an ICO in USD/BTC at launch was closely correlated with the price of that ICO six months later (obviously!), while the other inputs were not closely correlated with the output.

Furthermore, it was critical to ensure that the distribution of the ICO data was representative of the current market, so that the model generalises well. This was achieved: the data collected is consistent with statistics on the distribution of ICOs by country.

Distribution of ICO Data Per Country in ICO Omen.

5. Choosing a Machine Learning Model

Ridge Regression

Once one-hot encoding had been applied to the data, the input matrix formed what is known as an underdetermined or fat matrix. This essentially means that there are more features than examples (128 features versus 109 data points), which makes our regression model susceptible to overfitting and multicollinearity.

“Multicollinearity is a state of very high intercorrelations or inter-associations among the independent variables. It is therefore a type of disturbance in the data, and if present in the data the statistical inferences made about the data may not be reliable.”
Source: Statistics Solutions

To avoid this, we apply regularisation. In our case we apply Ridge Regression (L2 regularisation), which penalises very large weights. You can read more about Ridge Regression here.

Ridge Regression Formula (Source: Penn State)

Neural Network

We also used a Neural Network to compare the results with those achieved by the Regression model. For the Neural Network, we used the tanh activation function, and for solvers we used both the Adam solver and the Gradient Descent solver.

6. Measuring the performance of the Model

Two measures of performance were used: R-squared (R²) and Root Mean Squared Error (rMSE).

R² measures the “percentage of the variance” that the model can explain. A high R² score is usually good, but not always, as it could mean your model is overfitting to the data. You can read more about R² here.

rMSE measures the square root of the average squared error our model achieves, where the error is the difference between the values our model predicted and the actual values.

Results

All the results were computed using the holdout method: performance is measured on test data (i.e. data the model has never seen before).

The graphs plot predicted outcome vs measured outcome. This displays the correlation between what was predicted and what was measured. Values closer to the dotted line indicate better correlation.

The Linear Regression Model got an rMSE score of 0.86 and an R² score of 0.62.

The Neural Network Model got an rMSE score of 0.58 and an R² score of 0.73.

7. Saving the Model

Once we have a model that we are happy with, we should save it so that we can re-use it to make predictions later.

I used Joblib, a standalone Python library from the SciPy ecosystem (not part of the SciPy package itself) that provides utilities for pipelining operations. It makes saving and loading models simple.

8. Using the Model to make predictions

Once you have a saved model, you can load it and make predictions without having to retrain.

Here is an example of how you would use your saved model to make a prediction:

Conclusion and Final Thoughts

Machine learning in the real world is largely dependent on your data. You often spend more time cleaning, preparing and aggregating the data than you do working with your model.

In this article, we managed to create a model that can predict the price of ICOs reasonably well. We also showed general steps one could follow to apply Machine Learning to real-world problems.

Thanks for reading this article; let me know if you have any thoughts or comments. :)

I can be reached on Twitter or here.

Source Code

All source code is available in a dockerized format. More details
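
For readers who want something to run straight away, the steps described above (one-hot encoding, Ridge Regression, holdout evaluation with rMSE and R², and saving and loading with Joblib) can be sketched roughly as follows. This is not the project's source code. It is a minimal sketch that assumes scikit-learn, pandas and joblib are installed, and it uses synthetic data: the column names (`launch_price`, `launch_month`, `country`), the target, the `alpha` value and the file name are all made up for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data. Column names and values are hypothetical,
# NOT the ICO Omen dataset; only the row count (109) mirrors the article.
rng = np.random.default_rng(0)
n_rows = 109
df = pd.DataFrame({
    "launch_price": rng.uniform(0.01, 5.0, n_rows),
    "launch_month": rng.choice(["Jan", "Feb", "Mar", "Apr"], n_rows),
    "country": rng.choice(["USA", "Russia", "Singapore", "UK"], n_rows),
})
# Made-up target: a "future price" loosely tied to the launch price.
y = 1.5 * df["launch_price"] + rng.normal(0.0, 0.3, n_rows)

# One-hot encode the categorical fields.
X = pd.get_dummies(df, columns=["launch_month", "country"], dtype=float)

# Holdout method: keep 20% of the rows aside as unseen test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Ridge Regression (L2 regularisation); alpha is an arbitrary choice here.
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)

# Evaluate on the held-out data with rMSE and R².
pred = model.predict(X_test)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = float(r2_score(y_test, pred))
print(f"rMSE: {rmse:.3f}, R²: {r2:.3f}")

# Save the fitted model, then load it back without retraining.
model_path = os.path.join(tempfile.mkdtemp(), "ridge_model.joblib")
joblib.dump(model, model_path)
loaded_model = joblib.load(model_path)

# Predict for a single example using the reloaded model.
single_prediction = loaded_model.predict(X_test.iloc[:1])
print("Prediction for one held-out example:", single_prediction[0])
```

When predicting on genuinely new data, the new rows would need to be encoded into the same one-hot columns, in the same order, as the training matrix. Saving the column list alongside the model is one simple way to do that.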