“Microsoft Malware Prediction” and its 9 million machines

Everyone should!

When there are 8 million records and 100 columns, you have about 6.4 gigabytes of information you are trying to store.

Also, you need to load a test set, which would be another 6.3 gigabytes.

Suddenly, you are looking at 12.7 GB of data, most of it wasted space.
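The arithmetic behind those figures can be sketched in a few lines. This assumes every value is stored in a default 8-byte type (int64 or float64), which is a guess at how the numbers were derived rather than a measurement; the function name is made up for illustration.

```python
# Rough memory estimate for a table, assuming a fixed byte width per value.
# The 8-byte default (int64/float64) is an assumption, not a measured figure.

def estimate_gb(rows: int, cols: int, bytes_per_value: int = 8) -> float:
    """Return the approximate size in gigabytes (1 GB = 10**9 bytes)."""
    return rows * cols * bytes_per_value / 1e9


train_gb = estimate_gb(8_000_000, 100)          # default 8-byte values
train_int8_gb = estimate_gb(8_000_000, 100, 1)  # if every column fit in 1 byte

print(f"8-byte columns: {train_gb:.1f} GB")
print(f"1-byte columns: {train_int8_gb:.1f} GB")
```

The gap between the two numbers is the "wasted space": most columns do not need 8 bytes per value.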

Therefore, I wouldn’t have even been able to load that into a consumer GPU.

An NVIDIA GTX 1080 Ti tops out at 11 GB.

By using dtypes, we can reduce it down quite a bit.

The first trick is setting the data types to smaller sizes where it makes sense, just enough so that you can load in the data without a memory error.

To do so, you need to make some sanity checks.

You need to balance how far you can shrink the dtype against still having enough room for all of your unique values.

Remember that while there are 8 million records in the city column, there are only X unique city identifiers.

We are passed an id for each city, so we could see city “42” 7,000 times while city “28,930” appears only 2 times.

The same thing happens for categories.

This data often comes in as strings, so we use the category dtype to say that the values are a limited set of unique labels and to group them.

If a column has fewer than 32,768 unique values, you are safe to turn it into an int16.

Unique cities number over 100,000, too many for an int16, so we use float32 instead of int64.

IsBeta is boolean, so we can make it int8.
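As a minimal sketch of those choices, here is how dtypes can be passed to pandas at load time. The tiny CSV, the MachineId and Platform columns, and all the values are illustrative assumptions rather than the real competition schema; only IsBeta and the city identifier come from the discussion above.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV. Column names and values are illustrative
# assumptions, not the actual competition schema.
csv_text = """MachineId,CityIdentifier,IsBeta,Platform
a1,42,0,windows10
b2,28930,0,windows10
c3,42,1,windows8
d4,,0,windows10
"""

# Smaller types chosen up front, so the wide defaults are never allocated.
dtypes = {
    "MachineId": "object",        # unique per row, so left as a plain string
    "CityIdentifier": "float32",  # too many ids for int16; a float column
                                  # can also hold the missing id as NaN
    "IsBeta": "int8",             # 0/1 flag fits in one byte
    "Platform": "category",       # repeated strings stored once, plus codes
}

small = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

print(small.dtypes)
print("bytes used:", small.memory_usage(deep=True).sum())
```

On four rows the savings are negligible, and the bookkeeping for a category column can even outweigh them. The effect shows up at millions of rows.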

If we make a dtype too small, we have problems, but if it is too big, we waste space. That brings in my love of reduce_mem_usage (found here), which finds the smallest dtype that fits the values in each column.

You still have to set dtypes manually to get the dataset to load in the first place, but after that you don’t have to spend more time finding the optimal dtype for every variable.
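The core idea can be sketched without any libraries. This is not the reduce_mem_usage helper itself, only an illustration of the range check such a helper could perform for integer columns; the function and table names are hypothetical.

```python
# Signed integer widths from narrowest to widest, with their value ranges.
INT_RANGES = [
    ("int8", -2**7, 2**7 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("int32", -2**31, 2**31 - 1),
    ("int64", -2**63, 2**63 - 1),
]


def smallest_int_dtype(values):
    """Return the name of the narrowest signed dtype that holds every value."""
    lo, hi = min(values), max(values)
    for name, dtype_min, dtype_max in INT_RANGES:
        if dtype_min <= lo and hi <= dtype_max:
            return name
    raise OverflowError("values do not fit in int64")


print(smallest_int_dtype([0, 1]))          # a 0/1 flag such as IsBeta
print(smallest_int_dtype(range(32_768)))   # codes 0..32767
print(smallest_int_dtype(range(100_001)))  # more than 100,000 ids
```

A real helper would also need to handle floats and missing values, which this sketch leaves out.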

After running reduce_mem_usage, you can see a dramatic cut in memory.

Suddenly we have something much smaller.

My size drops from 7 GB down to 1.67 GB, which then allows me to put more data into the data frame for my GPU.

Great stuff!

Splitting with Pickles

Another method I enjoyed was splitting the program into two sections.

The data builder- where I explore, create features, move, and manipulate data. Use pickle to save the information and move it over for the learner.

The learner- for building models and running the learning epochs.

By splitting the builder and the learner, I can keep the code cleaner, and it allows me to run it with different options more easily.

Plus experiments go faster.

If something crashes in the learner, I can shave 10 minutes of loading off of the restart.
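A minimal sketch of that hand-off, using only the standard library. The function names, the file name, and the toy data are all placeholders; the real builder would produce a data frame with engineered features.

```python
import pickle
import tempfile
from pathlib import Path


def build_data():
    """Stand-in for the builder: explore, create features, shape the data."""
    return {"features": [[0.1, 1.0], [0.7, 0.0]], "labels": [0, 1]}


def save_for_learner(data, path: Path) -> None:
    """Builder side: write the prepared data to disk once."""
    with path.open("wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_for_learner(path: Path):
    """Learner side: reload the prepared data without rebuilding it."""
    with path.open("rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    handoff = Path(tmp) / "prepared.pkl"  # hypothetical file name
    save_for_learner(build_data(), handoff)
    restored = load_for_learner(handoff)

print(restored["labels"])
```

For a pandas data frame, `DataFrame.to_pickle` and `pd.read_pickle` do the same job in one call each. Either way, only unpickle files you created yourself, since loading a pickle can execute arbitrary code.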

How did we do?

We end with a list of probabilities of each machine having malware.

Ideally, we should see a saddle where the highest number is either on 0 or 1.

If you run just a few training epochs, you see this Space Invaders-looking thing, suggesting that with more training we would get there.
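One way to check for that shape without a plot is to bin the predicted probabilities and look at the counts. The sketch below uses made-up predictions and a hypothetical helper name; it only illustrates what "high at 0 and 1, low in the middle" looks like as numbers.

```python
def text_histogram(probs, bins=10, width=40):
    """Count predictions per equal-width bin on [0, 1] and draw text bars."""
    counts = [0] * bins
    for p in probs:
        # min() keeps p == 1.0 in the last bin instead of overflowing.
        counts[min(int(p * bins), bins - 1)] += 1
    peak = max(counts) or 1
    lines = []
    for i, c in enumerate(counts):
        bar = "#" * round(width * c / peak)
        lines.append(f"{i / bins:.1f}-{(i + 1) / bins:.1f} | {bar} {c}")
    return counts, lines


# Made-up predictions: confident near 0 and 1, few in the middle.
example = [0.02, 0.05, 0.08, 0.11, 0.48, 0.52, 0.91, 0.93, 0.97, 1.0]
counts, lines = text_histogram(example)
print("\n".join(lines))
```

A model that is still undecided would pile its counts into the middle bins instead of the two ends.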

It was not as effective as the top public kernels, which is somewhat disappointing.

Things I should have done better:

More data exploration and feature work- Probably the biggest problem was not understanding better how significant some of the features were to the problem.

I should have checked more for interdependence.

Data subsampling- Typically, as a dataset gets this big, you should go back and break the data into subsamples and then later aggregate the results.

Adversarial analysis- I need to do a better job understanding how to compare my training set, validation set, and public test set.

It didn’t seem like these were as close to each other as I would have liked.

AUCROC- I used cross-entropy for a loss function.

However, I should have looked further into using AUCROC instead.

While I could print it off as a metric, I was not able to explore how feasible it would be as a replacement for the loss.

Embedding Sizes- It seems like some embeddings exploded outward, with many unique values that have only 1 or 2 examples in the test set.

For example, if Chicago only has a sample size of 2, should that feature be used? It would not help you determine whether there was an infection or not.

y_range- I should have kept it in and as a tensor to keep moving things forward quickly.
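Two of the items above lend themselves to a small sketch: computing ROC AUC from labels and scores, and finding category values with too few examples to be worth an embedding. Both functions are illustrative, written with the standard library only, and the sample data and the `min_count` threshold are made up. The AUC here is a metric only; using it as a training loss would need a differentiable surrogate, which is not shown.

```python
from collections import Counter


def roc_auc(labels, scores):
    """Rank-based AUC: the chance a random positive outscores a random
    negative, counting ties as half. O(n_pos * n_neg), fine for a sketch."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rare_values(column, min_count=3):
    """Return the values that appear fewer than min_count times."""
    return {v for v, c in Counter(column).items() if c < min_count}


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print("AUC:", roc_auc(labels, scores))

cities = ["42", "42", "42", "42", "28930", "28930"]  # made-up ids
print("rare:", rare_values(cities))
```

Values flagged as rare could be merged into a single "other" bucket before building embeddings, which is one common way to keep a city with two samples from getting its own vector.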

