What’s wrong with SQL?
I feel ya.
Nothing is wrong with SQL other than it’s been around for decades.
The unstructured data craze was an opportunity to do something different and scale wildly in ways not possible before.
However, I guess more folks have concluded there is value in keeping SQL around.
It makes analytics much easier.
So much, in fact, that many NoSQL and “big data” technologies have scrambled to add a SQL layer in some shape or form.
After all, SQL is a pretty universal language even if some people find it difficult to learn.
So what I am gathering here is that NoSQL is not critical anymore to learn as a data scientist, unless somehow my job requires it.
It sounds like I am safe just knowing SQL.
The more I think about it, yes, I suppose you are right, unless you gravitate towards being a data engineer.
Data engineer?
Yeah, data scientists kind of broke up into two professions.
Data engineers work with production data systems and help make data usable, but do less machine learning and mathematical modeling work, which is left to the data scientists.
This was probably necessary since most HR and recruiters cannot see past the “data scientist” title.
Come to think of it, if you want to be a data engineer I would prioritize learning Apache Kafka more than NoSQL.
Apache Kafka is pretty hot right now.
Here, this Venn diagram may help you.
To get a “data scientist” title, you should be somewhere in the Math/Statistics circle ideally on an overlap with another discipline.
Data Science Venn Diagram
Alright, I have no idea whether I want to be a data scientist or data engineer at this point.
Let’s just move on.
So going back, why are we scraping Wikipedia pages?
Well, to serve as data inputs for natural language processing, and do things like create chatbots.
Like Microsoft’s Tay? Is this bot going to be smart enough to forecast sales and help me launch new products with the right amount of inventory? Is there an inherent risk it becomes racist?
Theoretically, it might.
If you ingest news articles, maybe you can create some models that identify trends that result in business decision recommendations.
But this is really REALLY hard to do.
Come to think of it, this may not be a good place to start.
Okay, so… natural language processing and unstructured text data is probably not going to be my thing?
Probably not, but note that’s a lot of data science nowadays.
Silicon Valley companies like Google and Facebook deal with a lot of unstructured data (like social media posts and news articles), and obviously they have a lot of influence in defining what “data science” is.
Then there are the rest of us using business operational data in the form of relational databases, and using less exciting technologies like SQL.
Yeah, that sounds about right.
I guess they also spend a lot of time mining user posts, emails, and stories for advertising and other nefarious purposes.
It is what it is.
But you might find Naive Bayes interesting and somewhat useful.
You can take bodies of text and predict a category for it.
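To make "predict a category" concrete, here is a minimal sketch using only the standard library. The training sentences, the category names, and the `train` and `predict` helpers are all made up for illustration, and add-one (Laplace) smoothing is an assumption on my part rather than anything prescribed above.

```python
import math
from collections import Counter, defaultdict


def train(labeled_texts):
    """Count words per category and documents per category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labeled_texts:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    return word_counts, doc_counts


def predict(text, word_counts, doc_counts):
    """Return the category with the highest log-probability score."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    scores = {}
    for category, counts in word_counts.items():
        # Log prior: how common the category is in the training data.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one smoothing so an unseen word does not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[category] = score
    return max(scores, key=scores.get)


# Made-up training data, purely for illustration.
training_data = [
    ("quarterly revenue and profit forecast", "business"),
    ("inventory and sales numbers for the new product", "business"),
    ("the team won the match in overtime", "sports"),
    ("the coach praised the players after the game", "sports"),
]

word_counts, doc_counts = train(training_data)
print(predict("sales forecast for the quarter", word_counts, doc_counts))
print(predict("the players lost the game", word_counts, doc_counts))
```

Working in log-probabilities avoids multiplying many tiny numbers together, which would underflow on longer texts.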
It is pretty easy to implement from scratch too:
Categorizing bodies of text with Naive Bayes
You are right, Naive Bayes is kind of cool.
But I don’t see any value in unstructured data beyond this.
We will move on then.
So you are working with a lot of tabular data.
Spreadsheets, tables, and lots of recorded numbers.
It almost sounds like you want to do some forecasting or statistical analysis.
Yes, finally we are getting somewhere.
Solving real problems.
Is this where neural networks and deep learning come in?
Whoa, hold your horses.
I was going to suggest starting with some normal distributions with means and standard deviations.
Maybe calculate some probabilities with z-scores, and a linear regression or two.
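For a feel of what that starting point looks like in a script, here is a small sketch with Python's built-in `statistics` module. The sales figures and the threshold of 115 are invented for the example, and it assumes the data is roughly normally distributed.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical weekly unit sales, made up for illustration.
weekly_sales = [102, 98, 110, 95, 105, 99, 108, 103]

mu = mean(weekly_sales)
sigma = stdev(weekly_sales)  # sample standard deviation


def z_score(x, mu, sigma):
    """How many standard deviations x sits from the mean."""
    return (x - mu) / sigma


threshold = 115
z = z_score(threshold, mu, sigma)

# Probability of a week exceeding the threshold, assuming normality.
p_above = 1 - NormalDist().cdf(z)

print(f"mean={mu:.1f}, stdev={sigma:.2f}, z={z:.2f}, P(sales > {threshold})={p_above:.4f}")
```

The same probability comes out of `NormalDist(mu, sigma).cdf(threshold)` directly; going through the z-score just makes the standardization step visible.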
But again, I can do all that in Excel! What am I missing here?
Well… um… yes that’s correct, you can do a lot of this in Excel.
But you get a lot more flexibility when you write scripts.
Like VBA? Visual Basic?
Okay, I’m going to start over and pretend you didn’t say that.
Excel does have great statistical operators and decent linear regression models.
But if you need to do a separate normal distribution or regression for each category of items, it is much easier to script in Python than to create hellish formulas whose length can become a distance-to-the-moon metric.
When you become advanced at Excel, you inflict pain on everyone who works with you.
You can also use the amazing library scikit-learn.
You get a lot more powerful options for different regression and machine learning models.
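Here is a rough sketch of the "one regression per category" idea. It uses plain Python so it runs anywhere; in practice you would more likely reach for scikit-learn or a data frame library. The rows, category names, and the `fit_line` helper are hypothetical.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical rows: (category, week, units_sold)
rows = [
    ("widgets", 1, 12), ("widgets", 2, 14), ("widgets", 3, 16), ("widgets", 4, 18),
    ("gadgets", 1, 30), ("gadgets", 2, 27), ("gadgets", 3, 24), ("gadgets", 4, 21),
]

# Group the points by category, then fit one line per group.
grouped = defaultdict(list)
for category, week, units in rows:
    grouped[category].append((week, units))

models = {}
for category, points in grouped.items():
    weeks, units = zip(*points)
    models[category] = fit_line(weeks, units)

for category, (slope, intercept) in models.items():
    print(f"{category}: units = {slope:.2f} * week + {intercept:.2f}")
```

Adding a third category means adding rows, not rewriting a formula, which is the flexibility being argued for here.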
Okay, fair enough.
So I guess this segues into mathematical modeling territory.
When it comes to the math stuff, where do I start?
Well, conventional wisdom says linear algebra is the building block for a lot of data science, and this is where you should start.
Multiplying matrices together (each entry of the result is a dot product of a row and a column) is something you will do all the time, and there are other important concepts like determinants and eigenvectors.
3Blue1Brown is pretty much the only place you will find an intuitive explanation of linear algebra.
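If it helps to see it before watching anything, matrix multiplication can be sketched in a few lines of plain Python: each entry of the product is the dot product of a row from the first matrix and a column from the second. The `dot` and `matmul` helpers are illustrative names, and real work would use a library like NumPy instead.

```python
def dot(u, v):
    """Multiply matching entries and add them up."""
    return sum(a * b for a, b in zip(u, v))


def matmul(a, b):
    """Matrix product of a (m x n) and b (n x p), both lists of rows."""
    columns = list(zip(*b))  # transpose b so its columns are easy to grab
    return [[dot(row, col) for col in columns] for row in a]


a = [[1, 2],
     [3, 4]]
b = [[5, 6],
     [7, 8]]

print(matmul(a, b))  # [[19, 22], [43, 50]]
```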
So… taking a grid of numbers and multiplying/adding it against another grid of numbers is something I will be doing a lot? This sounds really meaningless and boring.
Can you give me a use case?
Well… machine learning! When you do a linear regression or build your own neural network, you will be doing a lot of matrix multiplication and scaling with randomized weight values.
Okay, so do matrices have anything to do with data frames? They sound similar.
Actually, hold on… I’m thinking about this for the first time.
Let me walk that statement back.
In practice, you will not need to do linear algebra.
Oh come on! Seriously? Do I learn linear algebra or not?
In practice, no, you probably do not need to learn linear algebra.
Libraries like TensorFlow and scikit-learn do it all for you.
It’s tedious and it’s boring anyway.
Ultimately, you might want to get a little bit of insight on how these libraries work.
But for now, just start using the machine learning libraries and let your curiosity guide how much linear algebra you learn.
Your uncertainty is unsettling me.
Can I trust you?
Also, before I forget.
Don’t actually use TensorFlow.
Use Keras because it makes TensorFlow much easier to work with.
Speaking of machine learning, does linear regression really qualify as machine learning?
Yes, linear regression is lumped into the “machine learning” tool bag.
Awesome, I do that in Excel all the time.
So can I call myself a machine learning practitioner too?
*Sigh* technically, yes.
But you might want to expand your breadth a bit.
You see, machine learning (regardless of the technique) is often two tasks: regression or categorization.
Technically, categorization is regression.
Decision trees, neural networks, support vector machines, logistic regression, and yes… linear regression all execute some form of curve-fitting.
Each model has pros and cons depending on the situation.
Wait, so machine learning is just regression? They all are effectively fitting a curve to points?
Pretty much.
Some models like linear regression are crystal clear to interpret while more advanced models like neural networks are by definition convoluted, and are difficult to interpret.
Neural networks are really just multi-layered regressions with some nonlinear functions.
It may not seem that impressive when you have only 2–3 variables, but when you have hundreds or thousands of variables that is when it starts to sound impressive.
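To see the "multi-layered regressions with some nonlinear functions" idea in code, here is a tiny forward pass. Each neuron is a weighted sum plus a bias (a linear regression) passed through a sigmoid. The layer sizes and weight values are arbitrary placeholders, not trained values, and the helper names are hypothetical.

```python
import math


def linear(inputs, weights, bias):
    """A plain linear regression: weighted sum of inputs plus a bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


def sigmoid(z):
    """Nonlinear squashing function mapping any number into (0, 1)."""
    return 1 / (1 + math.exp(-z))


def forward(inputs, hidden_layer, output_neuron):
    """One hidden layer, one output: regressions feeding a regression."""
    hidden = [sigmoid(linear(inputs, w, b)) for w, b in hidden_layer]
    w_out, b_out = output_neuron
    return sigmoid(linear(hidden, w_out, b_out))


# Arbitrary placeholder weights: two hidden neurons, each with two inputs.
hidden_layer = [([0.5, -0.2], 0.1), ([-0.3, 0.8], 0.0)]
output_neuron = ([1.2, -0.7], 0.05)

print(forward([1.0, 2.0], hidden_layer, output_neuron))
```

Training is the part left out here: it is the process of nudging those weights until the outputs fit the labeled data.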
Well when you put it that way, sure.
And image recognition is just regression too?
Yes.
Each image pixel basically becomes an input variable with a numeric value.
That reminds me, you have to be wary of the curse of dimensionality.
This basically means the more variables (dimensions) you have, the more data you need to keep it from becoming sparse.
This is one of many reasons why machine learning can be so unreliable and messy, and can require ridiculous amounts of labeled data you will likely not have.
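A back-of-the-envelope way to see the sparsity: chop each variable into 10 bins and count how many cells the space has. With a fixed number of data points, the share of cells that can possibly contain a point collapses as dimensions are added. The bin count, the 1,000 points, and the function name are assumptions picked for illustration.

```python
def max_occupied_fraction(n_points, bins_per_dimension, n_dimensions):
    """Upper bound on the share of grid cells that can hold a data point."""
    total_cells = bins_per_dimension ** n_dimensions
    return min(1.0, n_points / total_cells)


for d in (1, 2, 3, 6, 10):
    frac = max_occupied_fraction(1000, 10, d)
    print(f"{d:>2} dimensions: at most {frac:.6%} of cells contain a point")
```

Even in the best case where every point lands in its own cell, a thousand rows cover almost nothing once you get past a handful of variables.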
I now have a lot of questions.
(Here we go)
What about problems like scheduling staff or transportation? Or solving a Sudoku?