This means anywhere you call a method it can come from the same block of code, whereas Jupyter notebooks can’t share code as easily between notebooks.
Solely using Jupyter resulted in multiple notebooks with a lot of copy-pasted code that needed to be kept in sync with its source — not what you want when trying to track down a bug in your data pipeline!Not to say Jupyter doesn’t have a place in a Kaggle project — I would certainly continue to use it for data exploration and visualization.
But I’d also define most of the methods in my main Python files and import them into my notebook — then I’ll know code that works in the notebook will also work in my pipeline.
Regardless of how you go about it, visualizing the data at each step is very important (cue #3…)3.
Visualize EVERY transformation you perform on the dataMS Paint seems to be this site’s visualization tool of choice.
I can’t stress this enough.
Remember I mentioned trying to track down a bug in my pipeline?.Well that was after about a month of not knowing it existed, which was a month of confusing errors and poor results.
In scrounging for documentation I came across this post, where a single pixel offset cost the developer two months of debugging!Always visualize the data before and after each change to ensure the operation is being performed as you expect.
If it’s an image, show it.
If it’s CSV data, do a sanity check on the summary statistics and make sure to replace NaN values.
Even with NLP data, you can print the text before and after stemming, or plot the embedding vectors on a 2D or 3D graph.
Example of word embeddings reduced to 3D vectors.
If instead, “man” is to “woman” as “king” is to “taco”, then you may have a bug to squash (source).
These small checks will save you a lot of time debugging and means you don’t have to backtrack and do it later anyway.
Plus you can keep the visualization code in a separate Jupyter notebook so it doesn’t get in the way!4.
Optimize for the right metricNot quite as intuitive as accuracy or MSE (source).
This can be a big ask, but is important nonetheless.
Many online ML (that’s what the PROS call machine learning) tutorials provide code to load data and train a model, and they almost always use one of the built-in metrics, such as categorical accuracy or mean squared error.
But if you’re trying to detect tumors in brain scans, and the competition values recall over accuracy (a.
you want to catch as many cases as you can, at the expense of some false positives), then accuracy isn’t going to train your model to perform well!See?.Deep learning as easy as Copy-Paste!.(source)Always review the competition metric before you begin working with the data, and certainly before starting on the model.
Sometimes one of the standard metrics will suffice, and sometimes you’ll have to write a custom objective function in some random source code file and encounter errors that are near-undiagnosable due to Tensorflow’s very FUN graph system (my first StackOverflow post, sadly unanswered).
Wow, 3 whole upvotes!.(Not going to link you to my embarrassing StackOverflow questions)Long story short, you can’t expect to edge out hundreds of ML PhDs when you’re not even optimizing for the right thing.
Choose a competition you’re psyched aboutWhen you learn coding just for the THRILLS (source)This may come as a surprise, but my passions do not lie in labeling “amateur” (crappy) smartphone images of historical landmarks.
As I kept working on the project, the initial interest faded and I found it harder and harder to convince myself to open my laptop and invest a couple more hours on it.
In stark contrast, just last week I whipped up a quick March Madness bot for an office competition; the results were underwhelming (I found a bug hours after the submission deadline, but so it goes) but I found the time flying by, Deep Work style, even after a full day of programming at work.
I’ll take this as a lesson for future competitions, or even a personal note regarding Kaggle in general: it’s nice to have a rank on a leaderboard to show the fruits of your labor, but you won’t be able to do your very best on it if you can’t find a reason to stay engaged and keep improving.
TakeawaysWriting coherent English is apparently harder than writing code.
But YOUR takeaway should be that while Kaggle does a great job making data science seem easy, the pros are posting state-of-the-art scores due to solid foundations and intuition built from many failures, many of which I haven’t yet encountered.
Trying to teach yourself data science and machine learning can be brutal, but I’ve also never seen more collaboration and high-quality free resources in any other field.
Stick with it!.Thanks for reading!PS: If anyone wants to take a stab at the Google Landmark Recognition Challenge, check out my kernel to download the dataset to TFRecord files in S3!.