You can use a “multi-hot vector” which is exactly the same as a one-hot vector except more than one entry can be equal to 1.
Here’s an example (modified from the previous example; the additional entries are highlighted in yellow):

Categorical variables that can take on more than one value like this are common in medicine.
For example, diagnoses, procedures, and medications extracted from a patient’s medical record over ten years are likely to take on more than one value.
Many patients will have several past diagnoses, several past procedures, and several past medications.
Thus, you will need a multi-hot vector for diagnoses, another multi-hot vector for procedures, and another multi-hot vector for medications.
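As a minimal sketch of what this encoding looks like in pandas (the column names and values here are made up for illustration): pd.get_dummies() one-hot encodes a single-valued categorical column, and Series.str.get_dummies() produces multi-hot columns when a cell holds several comma-separated values.

```python
import pandas as pd

# Hypothetical patient data: "animal" takes exactly one value (one-hot),
# while "diagnoses" can take several values at once (multi-hot).
df = pd.DataFrame({
    "animal": ["dog", "cat", "dog"],
    "diagnoses": ["flu,asthma", "asthma", "flu"],
})

# One-hot: pd.get_dummies creates one 0/1 column per category.
one_hot = pd.get_dummies(df["animal"], prefix="animal")

# Multi-hot: Series.str.get_dummies splits on the separator, so a single
# row can have a 1 in more than one column.
multi_hot = df["diagnoses"].str.get_dummies(sep=",")

encoded = pd.concat([one_hot, multi_hot], axis=1)
print(encoded)
```

Note that the first patient ends up with a 1 in both the "flu" and "asthma" columns, which is exactly the multi-hot behavior described above.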
Here’s a snippet of the provided code that transforms each categorical variable into a one-hot or multi-hot vector as appropriate, using the pandas function get_dummies().

Representing Quantitative Variables: The Importance of Normalization (i.e., “How to Feed Your Neural Network”)

Now we are going to switch gears and talk about representing quantitative variables.
It seems like you could just throw these variables straight into the model without any extra work, because they’re already numbers.
However, if you’re building a neural network model, you don’t want to feed in raw quantitative variables, because they will likely have very different scales, and giving a neural network numbers of different scales will make it sad (i.e., it will be more difficult for the neural network to learn anything). What I mean by “different scales” is that some variables may take on small values (e.g., tumor diameter = 0.5 cm) and other variables may take on large values (e.g., weight = 350 pounds). A neural network will train more effectively if you feed it only small, normalized values.
Here’s a common procedure for normalizing a quantitative variable before training a neural network:

1. Split your data into training, validation, and test sets
2. Calculate the mean of your quantitative variable in the training set
3. Calculate the standard deviation of your quantitative variable in the training set
4. Normalize every original value of your quantitative variable across ALL your data (train, validation, and test) using the mean and standard deviation you just calculated on the training set: normalized_value = (original_value − training_mean) / training_standard_deviation

You must do this separately for each quantitative variable in your data set (e.g., separately for “tumor diameter” and “weight”), because each quantitative variable will have a different mean and a different standard deviation.
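The procedure above can be sketched with scikit-learn’s StandardScaler (made-up data, and just a train/test split for brevity). The key point is that fit() sees only the training set, while transform() is applied to every split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical quantitative features: [tumor_diameter_cm, weight_lb],
# deliberately on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0.1, 5.0, 100), rng.uniform(100, 350, 100)])

# Step 1: split BEFORE computing any statistics.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Steps 2-3: the mean and standard deviation come from the training set only...
scaler = StandardScaler().fit(X_train)

# Step 4: ...but are applied to every split, so no test-set information leaks in.
X_train_norm = scaler.transform(X_train)
X_test_norm = scaler.transform(X_test)

# Each training column is now centered on 0 with unit variance.
print(X_train_norm.mean(axis=0), X_train_norm.std(axis=0))
```

Each column is scaled independently, matching the “separately for each quantitative variable” rule.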
The effect of performing the above steps is that all your quantitative variables will now be represented as small numbers centered on 0, which will make them good food for your neural network.
Why do we calculate the mean and standard deviation using ONLY the training set? Why not use the entire data set? Answer: If we included the test set in our calculation of mean and standard deviation, we would be leaking information about the test set into our training data, which is cheating.
Here’s a snippet of the provided code that normalizes continuous variables, using StandardScaler from scikit-learn.

Imputation to Deal with Missing Data

Frequently, data values are missing.
Perhaps a survey participant didn’t answer all the questions, or a patient received care in a different state and their diagnoses, procedures, and medications didn’t make it into their local medical record.
If we read in a data file with missing values, these values will be “NaNs,” or “not a number.” We have to replace them with a number in order to train. Filling in missing values is called “imputation.”

There are different strategies for imputation.
Here’s one reasonable strategy:

- Replace all missing values for a categorical variable with the training set mode. Thus, if the mode (most commonly chosen value) for animal is “dog,” we replace all of the missing answers to “What is your favorite animal?” with “dog.”
- Replace all missing values for a continuous variable with the training set median. Thus, if the median height is 5.2 feet, we replace all of the missing entries for “height” with 5.2.
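This strategy can be sketched with pandas fillna() (made-up data; for illustration, the first few rows stand in for the training set). As with normalization, the mode and median are computed on the training rows only:

```python
import pandas as pd

# Hypothetical data with missing values (None becomes NaN when read in).
df = pd.DataFrame({
    "animal": ["dog", "cat", "dog", None, "dog"],
    "height_ft": [5.2, None, 4.9, 5.6, None],
})
train = df.iloc[:3]  # pretend the first three rows are the training set

# Categorical variable: fill with the training-set mode.
animal_mode = train["animal"].mode()[0]

# Continuous variable: fill with the training-set median.
height_median = train["height_ft"].median()

# fillna accepts a dict mapping each column to its fill value.
imputed = df.fillna({"animal": animal_mode, "height_ft": height_median})
print(imputed)
```

Every NaN is now a real value, so the data is ready for training.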
Here’s a snippet of the provided code that performs imputation of missing values, using the pandas function fillna().

Conclusion

Data preparation is critical for achieving good performance with machine learning methods.
It took me a few months during the first year of grad school to gather all of the information contained in this blog post, so I hope that aggregating it here and providing my code will help you in preparing your own interesting data sets :).
About the Featured ImageThe featured image is a Samoyed dog, which happens to be my all-time favorite kind of dog.
This dog made a cameo appearance in the “example categorical variables” table.
Fun facts about Samoyeds:

- Samoyeds were bred by nomadic reindeer herders in Siberia
- Samoyeds are one of the oldest dog breeds
- Shed Samoyed fur can be used to knit clothing
- Due to their thick coats, Samoyeds can stay nice and warm in temperatures well below freezing

Originally published at http://glassboxmedicine.com on June 1, 2019.