The absolute beginner’s guide for data science rookies

Well, my friend, I hope my experience over the last couple of months can help clear up your doubts.

Let’s start, shall we?

1. What to learn?

A data scientist is the player who can start the play from their own goal, dribble past a couple of defenders, deliver a precise cross to the penalty spot and head the ball into the net for a gorgeous goal.

Sorry for the football reference, I can’t help it. I just wanted to picture how you’ll master a diverse set of skills that will make you useful in almost any data-related problem.

Now, I’ll divide these requirements into two approaches.

First, from a technical point of view, we’ll review the foundations, that is, the fields that data science relies on in one way or another.

Second, from a more practical point of view, I’ll show you which programming libraries you should focus on in order to get your hands on real data projects.


1.1 Data Science Foundations

Programming: Your first task will be to choose whether you’ll use Python or R (I’ll leave you some help here, here and here) and then immerse yourself in coding.

Linear Algebra: As you’ll be working with data, you’ll want to know how to represent data sets as matrices and understand concepts like vectorization and orthogonality.

Calculus: Many of the models you’ll write and use rely on tools like derivatives, integrals and optimization to find a solution to your problem more quickly.

Probability: In data science you’ll often be trying to predict something in the future, so you’ll want to know how likely something is to happen or whether two events are related.

Statistics: To describe the information you’ll be analyzing, things like the mean or percentiles will come in handy, and tests to check your hypotheses will also appear along the way.

Machine Learning: Maybe the core of data science. At some point during your project you’ll want to predict something, and that’s when machine learning kicks in.
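To make the Statistics item above a bit more concrete, here is a minimal sketch using only Python’s built-in statistics module. The page_views numbers are made up for illustration, and this is just one way to compute these summaries, not the way.

```python
import statistics

# Hypothetical sample: daily page views for a small site (made-up numbers).
page_views = [120, 135, 150, 110, 160, 145, 130, 155, 140, 125]

mean_views = statistics.mean(page_views)      # arithmetic average
median_views = statistics.median(page_views)  # middle value of the sorted data

# quantiles(n=4) returns the three cut points that split the data into quarters
# (the 25th, 50th and 75th percentiles, using the default "exclusive" method).
q1, q2, q3 = statistics.quantiles(page_views, n=4)

print(f"mean={mean_views}, median={median_views}")
print(f"quartiles: q1={q1}, q2={q2}, q3={q3}")
```

Later on you would normally get the same summaries from Pandas or Numpy, but the ideas are identical.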


1.2 Data Science Libraries

As you’ll discover after some time coding on your own, every programming language comes with a series of packages or libraries that provide functions and methods to perform diverse tasks more easily.

Here you’ll find a table with the most popular and useful libraries for Data Science in Python with some brief guidelines below.

In case you’ve gone the R way, don’t worry, I’ll also leave you a very good article with a similar table for R libraries here.

Top Python Libraries for Data Science

The Starter Kit is all you need to start doing data science. Numpy provides the base for working with data, but you’ll handle it more easily with Pandas.

Scipy provides functions and methods to perform advanced calculations on top of Numpy, and Matplotlib will allow you to plot your findings.

Finally, Scikit-learn is the starting point for machine learning. It contains what you need to apply the classical regression, classification and clustering methods.

On the other hand, Deep Learning frameworks will help you build artificial neural networks to perform more complex machine learning tasks like image recognition.

Then there are other Data Visualization alternatives, which let you create more stylized and interactive plots, even in web applications.

Natural Language Processing (NLP) is a very popular field within data science which, for example, allows Alexa or Siri to understand what you’re saying.

When looking for data to analyze, the Internet is a nearly unlimited source, so Web Scraping tools will come in handy to collect and retrieve this data on a regular basis.

Last but not least, Statsmodels (statistical analysis) and XGBoost (gradient boosting) will help you in some more specific tasks.


1.x Data Science (not so) Bonus Tracks

So far we’ve talked about working with data, but what type of data? Well, it can range from very pretty, clean and structured csv files to gigantic datasets with millions of examples of very unstructured data. Yes buddy, we’re not in Kansas anymore.

Depending on where you end up working, your company might handle data in different ways, but they’ll most likely handle very large data sets, so the two tools coming up next are a must for every data scientist.

Databases: Big companies store their data in databases.

Why not a spreadsheet? Well, basically databases are a much better way to handle large chunks of data while ensuring data integrity and security, and they also allow easy querying and updating.

Now, the thing is there are 2 types of databases: relational DBs (SQL) and non-relational DBs (NoSQL). You might want to learn the differences (this post may help you) and hopefully learn to work with both.
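As a rough illustration of the relational (SQL) side, here is a minimal sketch using sqlite3, the small database engine that ships with Python. The customers table and its rows are hypothetical, and a real company database would be a separate server, but querying, updating and integrity checks look much the same.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        id      INTEGER PRIMARY KEY,
        email   TEXT NOT NULL UNIQUE,  -- integrity rule: no duplicate emails
        country TEXT NOT NULL
    )
    """
)

# Hypothetical rows, inserted with placeholders instead of string formatting.
rows = [("ana@example.com", "AR"), ("bob@example.com", "US"), ("cleo@example.com", "AR")]
conn.executemany("INSERT INTO customers (email, country) VALUES (?, ?)", rows)

# Querying: count customers per country.
per_country = conn.execute(
    "SELECT country, COUNT(*) FROM customers GROUP BY country ORDER BY country"
).fetchall()
print(per_country)

# Updating: change one record in place.
conn.execute("UPDATE customers SET country = 'UY' WHERE email = ?", ("cleo@example.com",))

# Integrity: the UNIQUE constraint rejects a second row with the same email.
try:
    conn.execute("INSERT INTO customers (email, country) VALUES (?, ?)", ("ana@example.com", "BR"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print("duplicate rejected:", duplicate_rejected)
```

A spreadsheet would happily accept the duplicate row; the database refuses it, which is the kind of integrity guarantee mentioned above.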

Big Data: When it comes to working with a lot of data, you also have to think about how you’ll process and refine that amount of information.

When you have thousands of rows you may need just a few seconds to perform simple tasks, but when you need to run a highly complex model on millions of records, it can take hours or even days.

For these purposes you use parallel and distributed computing models, which perform multiple tasks simultaneously across different cores, CPUs or machines.

The 2 most popular frameworks are Hadoop MapReduce and Spark (read about their differences here).
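To get a feel for the MapReduce programming model mentioned above, here is a toy word count in plain Python. It runs in a single process, so it only illustrates the map, shuffle and reduce steps and is not how Hadoop or Spark are actually used. The chunks list and the function names are made up for this sketch.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input: each string stands in for one chunk of a much larger file.
chunks = ["data science is fun", "big data is big", "science is data"]

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk."""
    return [(word, 1) for word in chunk.split()]

def shuffle_phase(mapped_chunks):
    """Group all emitted values by key (the word)."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Combine the values of each key into a single total."""
    return {word: reduce(lambda a, b: a + b, counts) for word, counts in grouped.items()}

# On a real cluster each map call could run on a different machine;
# here they simply run one after another.
mapped = [map_phase(chunk) for chunk in chunks]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

The key point is that every map call only needs its own chunk, which is what makes it possible to spread the work across many cores or machines.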


2. Where to learn?

2.1 Bootcamps

Bootcamps are the easiest pick when it comes to learning Data Science. They are an all-in-one bundle of several modules that give you knowledge in almost everything you should know to do data analysis.

People may argue about the depth with which these topics are covered, but truth be told, they excel at giving a good introduction and an intermediate level of expertise in data-related subjects. More advanced levels will come when you start doing your own projects or working, so don’t worry.

Now, there seem to be many options, but the two most popular online bootcamps are Dataquest and Datacamp, which have sparked a strong debate about which one is better. I personally don’t think it’s that simple, but I’ll try to show where each is better or worse and their main differences, to make your choice easier.

Also you can check this interesting Reddit thread.

Datacamp (DC) ⚙: Founded in 2013, they can be called the creators of data science bootcamps and sure have a good reputation nowadays.

They started with R-based courses but have added Python content in the last few years.

Their courses are based on short videos followed by a mix of questions and programming exercises; the latter basically consist of completing existing code.

Besides the programming exercises, they also provide several projects where you can apply multiple skills you’ve learned before.

There are a few courses for free, but to access all the content you have to pay $29 monthly.

It also has an XP-based progress system which helps you keep the rhythm.

Dataquest (DQ): Founded 2 years after Datacamp, they’re maybe not as well known, but they sure are as good.

Mirroring DC, they started with Python courses and added R courses along the way.

They offer text-based courses where, after every bit of information you learn, you have to code to put everything into practice, so it’s very application-oriented.

After every module ends you have to do a real-life project from scratch in Jupyter Notebook (we’ll talk about this later), so you can build a portfolio to showcase your skills.

The first coding courses on Python are fully free, but then you’ll have to pay if you want to follow one of their paths: $29 for Data Analyst and $49 for the others.

They also have an active Slack community where you can solve your doubts or share your achievements.

Data Science Bootcamps Comparison

To summarize:

Datacamp is maybe more established than DQ and has a wider offer of courses, allowing you to go deeper into other subjects besides the core knowledge for Data Science.

Dataquest, on the other hand, is more focused on building strong data science foundations by practicing with very demanding exercises and projects.


2.2 MOOCs

As I said at the beginning, online learning is at its peak, with more and more students enrolling to expand their knowledge. With more than 100 million students worldwide and about 11.4k courses available, MOOCs are kings.

Now, where’s the magic? They offer self-paced courses, normally free to audit, in an extensive list of topics and subjects.

If you want to learn, you can dedicate 1–2 hours daily and advance whenever you can. Education has never been so within reach of everyone.

Below you’ll find a very brief comparison of the top 4 (imo) MOOC platforms for Data Science.

If you want to dig deeper into the differences, pros and cons of each one of them, I’d recommend reading this post, which compares them in detail across several factors.

Later I also explain in more detail which platform fits you better depending on what you’re looking for.

MOOC Platforms Comparison

Coursera: The big boss of MOOCs. Founded by Andrew Ng and Daphne Koller and partnering with top universities and companies, it was always going to be a hit.

Courses include quizzes and assignments to evaluate your progress, while specializations also have capstone projects.

Coursera offers two main formats: courses, which are free to audit but charge a one-time fee for a certificate; and specializations, which are sets of related courses that you can try free for 7 days and then pay a monthly fee (~$50) until you complete them and obtain a certificate.

Choose if: You are clear enough about what you want to learn to hand-pick your courses, and you want a wide variety of courses with a good balance between theory and application.

Udacity: What started as some free computer science courses offered by Stanford University is now one of the most respected MOOC platforms, with a clear orientation towards technology-related content, which is made in conjunction with really top-notch companies like Google or Amazon.

Udacity has several free courses on specific subjects, but its specialty is its nanodegrees™, bundles of courses that cover all the necessary skills for a certain track (e.g. Data Analyst).

Choose if: You want job-oriented courses that aim to develop everything you need for a data position, and you don’t care too much about the price.

EdX: Completing the podium we have the MOOC provider created by MIT and Harvard back in 2012.

Launched only one month after Coursera, it shares basically the same educational offer and approach towards learning, with a well-crafted platform and an active community.

Their certification system is quite similar too: their verified courses correspond to Coursera’s courses, and their programs are equivalent to Coursera’s specializations.

Choose if: You didn’t find what you were looking for on Coursera. Although the courses are equally good, there is less variety, and programs are one-time fees, so they can be a bit more expensive depending on your pace.

Udemy: This one is a bit of an oddball compared to the other three. Even though Udemy has the largest offer of courses, with tens of thousands of them, they’re all offered by individuals: anyone can be an instructor! This means their certificates are not backed by any college, company or organization.

All this means content is much easier to create, but quality courses may be harder to find. Still, you can find some very good stuff.

Their prices can go up to $200 per course but you can find them on sale for $10–15 pretty much all the time, so don’t despair and wait for it.

Choose if: You’re just interested in learning the content, don’t care that much about the certificate, and are looking for affordable options.

Top Data Science and Machine Learning Courses

Of course, these are my personal choices. Although most people will agree with almost all of them, I encourage you to seek other opinions and research more; this post from Class Central is a very good start.


2.3 Books

The good old books! Though they actually never get old, and they can be a very valuable resource to support your learning if you pick wisely.

Data Science books normally come in 2 “formats”: plain theory or code-oriented, though you’ll sometimes find a mix of both.

Also, at this point you should already know which programming language suits you better, because authors have already made up their minds and each book is usually written around a single language.


3. What’s next?

We’ve come a long way already, covering what to learn and where to learn everything you need to become a data scientist, but what do you do now? I’ll show you how to put your skills into practice the right way, so you’re in tune with the data science community.

Jupyter: The scientific world is a bit messy. Tons of math, graphs and code may look like you’re doing some really avant-garde stuff, but if you’re not able to show it clearly then it’s worthless.

That’s why the awesome Jupyter Notebooks appeared. They allow you to run your code, draw interactive plots, import any kind of data, write beautiful equations and comment everything appropriately with Markdown, all in a well-presented notebook, so you can show your work without it being an indecipherable mess.

Here you can see an example of a Jupyter Notebook to understand what I’m talking about.

Kaggle: Competitive data science may sound like the major leagues, but Kaggle (acquired by Google in 2017) is an awesome data science platform and community that lets companies and organizations create competitions (sometimes around a problem they actually want to solve) where anyone can participate and even win juicy prizes, with live rankings showing the best results so far.

Besides, it has a lot of learning material to get you started, and a huge repository of datasets of all kinds and for different purposes, so basically you have everything you need to start working on real projects!

GitHub: Within software development, something called Git was created: version control software that basically helps keep track of changes in a project where multiple people work.

Then GitHub appeared, a web-based hosting service for Git repositories that helps, in very simple words, keep projects organized and logged in the cloud.

On GitHub you’ll find all sorts of awesome repos that you can fork and modify for your personal projects but, most importantly, GitHub will help you keep an organized portfolio to showcase your work when looking for a job.

Stackoverflow ❓: The coding road is not free of bumps. In fact, handling all the errors and debugging hundreds of lines of code can become very annoying, but as part of human nature, we’ve learned from our mistakes and learned not to trip over the same stone twice.

Stackoverflow is a Q&A community of developers from all over the world where you can find explanations for common (and not so common) programming problems, in threads where everyone can ask and respond easily.

Remember this when you’re desperately Ctrl+C/Ctrl+V-ing an excerpt of your code into Google and it redirects you to your glorious answer on Stackoverflow. Just magic.

Stackoverflow saving the day

Some final thoughts

I strongly believe a good professional is always learning. In a world as dynamic as ours you can become obsolete in no time, and that’s where the power of learning resides: being able to keep up to date and reinvent yourself will help you grow not only as a pro but also as a person.

Finally, learning is a very personal process. There isn’t (yet) a magic formula that works for everyone, and even though throughout this post I’ve tried to present multiple choices in the form of suggestions, the final call is up to you. Just make sure you enjoy the ride.

Updated March, 2019.

