Machine Learning 101 – An introduction

Michele Preziuso
Dec 29

If you are a software engineer, I’m sure at some point you wanted to do ‘some machine learning’, crack the secrets of the Universe and find the ultimate answer to life, the Universe and everything.

However, machine learning can be a pretty big and intimidating topic: a very different paradigm from what you use on a day-to-day basis, driven by big data and mathematical models. In short, it’s way out of your usual comfort zone.

But that’s OK. In fact, this is part one of a series of articles in which I’ll try to walk you through the main concepts of machine learning.

At its core, in machine learning, you have big data on one side and, on the other, a problem that you think big data can solve or provide an answer for.

Big data

Big data is, generally speaking, a set of data that has the following characteristics:

Volume
The quantity of generated and stored data. The size of the data determines its value and potential insight, and whether it can be considered big data or not.

Variety
The type and nature of the data. Knowing this helps the people who analyze the data to use the resulting insight effectively. Big data draws from text, images, audio and video, and it completes missing pieces through data fusion.

Velocity
In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real time. Compared to small data, big data is produced more continually. Two kinds of velocity related to big data are the frequency of generation and the frequency of handling, recording and publishing.

Veracity
An extension of the original definition of big data that refers to data quality and data value. The quality of captured data can vary greatly, which affects the accuracy of the analysis.

How can you tackle your big data problem using machine learning?

Before you can start thinking about your
machine learning solution, you have to start from the problem: you have to fully understand it and describe it.

What is the problem?
Why do you need to solve it?
How would you solve it?

1. What is the problem?

a. Describe it informally

Soft start: describe the problem as if you were describing it to a friend.

b. Describe it formally

I find it very useful to describe the problem following the formal language that Tom Mitchell uses in his book Machine Learning:

“The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience. […] A computer is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

We should be able to write up a formal description of the problem using E, T and P or, even better, list our problems in a table with headers E, T and P.

c. Research and list similar problems

Research and list any problems that you think are similar to the one you are trying to solve. This can help you reach a solution, limit the domain of your problem and anticipate potential problems before you encounter them in your journey.

d. List any assumptions and additional information

List any assumptions that are important to the phrasing of your problem but haven’t made it into the description. They might be subtle, but they can help you reach the desired result sooner.

Taking the classical example of the analysis of a clickstream to predict traffic spikes, you could say that:

Response size and user agent are irrelevant
Referrals can be relevant to the model
Date and time are a fundamental dimension

2. Why do you need to solve the problem?

Simply think about the solution: why you need to build it and the benefits that you (or your customers) will gain from solving this problem. Finally, you should also think about how the solution will be used, both in the short term and in the longer run.

It’s OK to hack your way
towards the solution, but you should always keep the long term in mind and build a solution that is, or can easily be transformed into, a future-proof service (unless, of course, you are doing this as a learning exercise).
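
To make steps 1b and 1d more concrete, here is a minimal Python sketch of how the clickstream example could be written down in terms of E, T and P, together with its assumptions. The wording of the task, experience and performance measure is my own illustration, and the class, function and field names (LearningProblem, apply_assumptions, response_size, user_agent and so on) are hypothetical: they do not come from any real log format or library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LearningProblem:
    """A problem statement using task T, experience E and performance measure P."""
    task: str
    experience: str
    performance: str
    assumptions: tuple[str, ...] = ()


# Illustrative wording for the clickstream example; adjust it to your own problem.
traffic_spikes = LearningProblem(
    task="Predict whether the next hour will see a traffic spike",
    experience="Historical clickstream records with timestamps and referrers",
    performance="Share of hours in which the spike/no-spike prediction was correct",
    assumptions=(
        "Response size and user agent are irrelevant",
        "Referrals can be relevant to the model",
        "Date and time are a fundamental dimension",
    ),
)

# Assumed field names for a clickstream record (hypothetical, not a real log schema).
IRRELEVANT_FIELDS = {"response_size", "user_agent"}


def apply_assumptions(record: dict) -> dict:
    """Drop the fields that the listed assumptions mark as irrelevant."""
    return {key: value for key, value in record.items() if key not in IRRELEVANT_FIELDS}


def as_table(problems: list[LearningProblem]) -> str:
    """Render the problems as a plain-text table with headers E, T and P."""
    rows = [("E", "T", "P")]
    rows += [(p.experience, p.task, p.performance) for p in problems]
    return "\n".join(" | ".join(row) for row in rows)


if __name__ == "__main__":
    print(as_table([traffic_spikes]))
    sample = {
        "timestamp": "2024-01-01T10:00:00",
        "referrer": "https://example.com",
        "user_agent": "some-browser",
        "response_size": 512,
    }
    print(apply_assumptions(sample))
```

Writing the assumptions next to the E, T and P description keeps them visible when you later decide which fields to feed into a model. Here they simply translate into dropping two fields and keeping the timestamp and the referrer.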