and from where?DatasetLuckily, one of the greatest creation in history, the internet, is filled with many open datasets that we can download and use to train our model.
In this case, we will be using The USDA National Nutrient Database for Standard Reference (SR) from the US Department of Agriculture’s Agricultural Research Service.
In this tutorial, we will be using the abbreviated excel file that you can download on here.
The dataset contains the main source of food composition data in the United States with more than 8,000 data points with over 50 variables of micronutrients in each food (such as carbohydrate, protein, etc.
Now that we got the data, the first thing to do is data cleansing.
Data CleansingNaturally, data does not always come in the form of a clean CSV or text file that is ready to be fed in a machine learning model.
Most of the food that we take from nature are first to be cleaned and processed before we can buy them from the supermarket.
This is also the same with data.
But first, let’s look at our data.
Since the data is in the form of an Excel file, we will use Pandas read_excel function to open the file.
import pandas as pdpath_to_data = "datasets/ABBREV.
xlsx" # Path to the Excel filedataset = pd.
head()If you run on the code on Jupyter Notebook, then the table should look like this (might look different if you run a python script on terminal)dataset.
head() displays the first 5 rows of the dataset.
By doing this, we can get some idea about what our data looks like.
As we can see, the table contains data about the nutrient compositions of each food (such as Butter, Cheese, atc.
The columns contain the name of each micronutrient and macronutrients of the food (Protein, Carbohydrate, etc.
To see all the variable of the data, you can run the following code below.
columns# OutputIndex(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)', 'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)', 'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)', 'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)', 'Copper_mg)', 'Manganese_(mg)', 'Selenium_(µg)', 'Vit_C_(mg)', 'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Panto_Acid_mg)', 'Vit_B6_(mg)', 'Folate_Tot_(µg)', 'Folic_Acid_(µg)', 'Food_Folate_(µg)', 'Folate_DFE_(µg)', 'Choline_Tot_ (mg)', 'Vit_B12_(µg)', 'Vit_A_IU', 'Vit_A_RAE', 'Retinol_(µg)', 'Alpha_Carot_(µg)', 'Beta_Carot_(µg)', 'Beta_Crypt_(µg)', 'Lycopene_(µg)', 'Lut+Zea_ (µg)', 'Vit_E_(mg)', 'Vit_D_µg', 'Vit_D_IU', 'Vit_K_(µg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)', 'Cholestrl_(mg)', 'GmWt_1', 'GmWt_Desc1', 'GmWt_2', 'GmWt_Desc2', 'Refuse_Pct'], dtype='object')Before we move on, I always like to check the datatype of each column.
This is important because only float type will be used during training.
I will not get into details about what all of those variables mean.
But if you are interested to find out, you can check the documentation that is included in the zip file when you downloaded the dataset.
Luckily, Pandas’ DataFrame class has an attribute dtypes that returns a Series with the data type of each column.
dtypes#OutputNDB_No int64Shrt_Desc objectWater_(g) float64Energ_Kcal int64Protein_(g) float64Lipid_Tot_(g) float64Ash_(g) float64Carbohydrt_(g) float64Fiber_TD_(g) float64Sugar_Tot_(g) float64Calcium_(mg) float64Iron_(mg) float64Magnesium_(mg) float64Phosphorus_(mg) float64Potassium_(mg) float64Sodium_(mg) float64Zinc_(mg) float64Copper_mg) float64Manganese_(mg) float64Selenium_(µg) float64Vit_C_(mg) float64Thiamin_(mg) float64Riboflavin_(mg) float64Niacin_(mg) float64Panto_Acid_mg) float64Vit_B6_(mg) float64Folate_Tot_(µg) float64Folic_Acid_(µg) float64Food_Folate_(µg) float64Folate_DFE_(µg) float64Choline_Tot_ (mg) float64Vit_B12_(µg) float64Vit_A_IU float64Vit_A_RAE float64Retinol_(µg) float64Alpha_Carot_(µg) float64Beta_Carot_(µg) float64Beta_Crypt_(µg) float64Lycopene_(µg) float64Lut+Zea_ (µg) float64Vit_E_(mg) float64Vit_D_µg float64Vit_D_IU float64Vit_K_(µg) float64FA_Sat_(g) float64FA_Mono_(g) float64FA_Poly_(g) float64Cholestrl_(mg) float64GmWt_1 float64GmWt_Desc1 objectGmWt_2 float64GmWt_Desc2 objectRefuse_Pct float64dtype: objectAhah, look at that.
It seems that all of the columns are floats except for NDB_No, Shrt_Desc, Energy_Kcal, GmWt_Desc1, and GmWt_Desc2.
Also, you can see that some of these columns have the datatype object.
This is because they contain text for each row entry.
For that reason, Pandas will consider them as categorical data.
I’d like to use the term false categorical variable because they are not actually categorical data and just contain text within them.
To make the data cleansing process more simple, let’s separate these false categorical variables with the other numerical variable.
Since Pandas do not provide such a feature, we have to do it manually.
tot_cols = dataset.
columnsnum_cols = dataset.
columnscat_cols = list(set(tot_cols)-set(num_cols))# Output of cat_cols['GmWt_Desc1', 'GmWt_Desc2']Great!.Now we can move one to our first step.
First of all, note that NBD_No contains the index of each food and Shrt_Desc contains the name of the food itself.
In this case, we can drop these variables since they won’t be used to build the model.
drop_cols = ['NDB_No', 'Shrt_Desc']dataset = dataset.
drop(drop_cols, axis=1)Next thing we need to set the labels for our training.
Since we are dealing with a multivariate regression problem, that means the output variable will be more than one.
In other words, this means we have multiple labels.
As I mentioned before, the UNICEF wants you to make a prediction about important micronutrients within certain foods.
But as we see in the columns, there are many of them.
How do we know which of them are going to be our labels?Well, there’s no forward answer to this and it depends on the problem you are trying to solve.
Since we don’t know much about nutrition (I assumed) we can check out the UNICEF’s official website.
According to the UNICEF’s official website, there are seven main deficiencies when it comes to micronutrients: Iodine, Vitamin A, Iron, Zinc, Calcium, Vitamin D, and Folate.
All of them are provided in our dataset except for Iodine.
But after discussing with our supervisor in the team, he says it’s okay to proceed for the first iteration of building the model.
Just to give a visualization of what we are trying to do.
For now, we will just put the name of the outcome variable in a list.
We can do that in Python by doing the following.
labels = ['Calcium_(mg)', 'Iron_(mg)','Zinc_(mg)', 'Vit_A_IU', 'Vit_D_IU', 'Folate_Tot_(µg)']Alright, now we need to go back to those false categorical variables.
Since we’ve already droppedNDB_No, Shrt_Desc, and all we need to do is what to do with GmWt_Desc1, and GmWt_Desc2.
Now there are two things you can do here.
First, you can just drop them because they are just descriptions of the GmWt_1 and GmWt_2 variables.
A second alternative is to clean the text and just includes the numbers of each row entry.
I would recommend the first choice since these variables would most like do not have any significant impact on the performance of the model.
However, for the sake of this blog post, I will include them.
Data on the internet is not always clean.
They can have redundant or chattery words in them.
So it is good practice to somehow clear these text and represent them with numbers.
Computers can only compute numbers and not text type data.
For that reason, we need to remove those text in GmWt_Desc1 and GmWt_Desc2.
In this case, that includes anything within parentheses.
Luckily, I’ve already a written a function to do just that.
def clean_text(data): # If value is NaN we return the same value for cleansing later on if isinstance(data, str): # Remove any punctuations including words within parentheses clean = re.
sub(r'([^)]*)', '', data) # Get numbers only clean = re.
sub("D","",clean) return clean else: return dataWe can use this function to remove punctuations and words within parentheses.
Also, we want to get the number within sentences.
For example, if the entry in the row is “1 pat, (1” sq, 1/3" high)”, then we only left with the number 1.
Pandas allows us to that simply with the attribute apply.
This where the list of false categorical variables comes in handy.
# Remove text in GmWt_Desc1dataset[cat_cols] = dataset[cat_cols].
apply(clean_text)# Remove text in GmWt_Desc2dataset[cat_cols] = dataset[cat_cols].
apply(clean_text)Fantastic!.Looks like we’re done, right?.Not really.
There is a couple of things that we need to do before we are ready to feed the data.
If you go back to the clean_text function, there was a comment that state “If value is NaN we return the same value for cleansing later on.
”Another common step in cleaning your dataset is dealing with missing data.
This very common since many datasets that can be found on the internet is not completely filled.
Those missing data are replaced with NaN values.
Same like text, computers cannot compute NaN values.
Let’s see if our dataset has any missing values for each column.
sum()# OutputWater_(g) 1Energ_Kcal 0Protein_(g) 0Lipid_Tot_(g) 0Ash_(g) 325Carbohydrt_(g) 0Fiber_TD_(g) 594Sugar_Tot_(g) 1832Calcium_(mg) 348Iron_(mg) 144Magnesium_(mg) 739Phosphorus_(mg) 579Potassium_(mg) 426Sodium_(mg) 83Zinc_(mg) 706Copper_mg) 1257Manganese_(mg) 2160Selenium_(µg) 1700Vit_C_(mg) 818Thiamin_(mg) 634Riboflavin_(mg) 616Niacin_(mg) 637Panto_Acid_mg) 2242Vit_B6_(mg) 905Folate_Tot_(µg) 1261Folic_Acid_(µg) 2039Food_Folate_(µg) 1768Folate_DFE_(µg) 2057Choline_Tot_ (mg) 4016Vit_B12_(µg) 1193Vit_A_IU 711Vit_A_RAE 1535Retinol_(µg) 1806Alpha_Carot_(µg) 3258Beta_Carot_(µg) 3162Beta_Crypt_(µg) 3270Lycopene_(µg) 3292Lut+Zea_ (µg) 3315Vit_E_(mg) 2889Vit_D_µg 3262Vit_D_IU 3211Vit_K_(µg) 3563FA_Sat_(g) 349FA_Mono_(g) 666FA_Poly_(g) 665Cholestrl_(mg) 410GmWt_1 300GmWt_Desc1 299GmWt_2 3965GmWt_Desc2 3965Refuse_Pct 50dtype: int64Wow, it seems we have plenty of missing data in our dataset.
Again, there are two things we can do here.
First is by dropping those the rows that contain missing data for any columns.
This means if a column has a NaN value, we drop them right away.
The second option is to do imputing, which means replacing the missing values.
This option is more preferable because we are not throwing away other data that might be important during the training of the model.
The question is, what do we replace it with?There’s no one answer to this.
It depends on the problem you are facing and the model you are using.
Some models are robust toward missing data, while others cannot handle them.
For example, later on, we are going to use a popular machine learning algorithm called Random Forest.
This model has its own way of dealing with missing data.
However, we will not cover that in this series.
But if you are interested to know about it, this video can be very helpful.
For simplicity, we will be using one of the most popular technique to replace missing data.
And that is by using the median.
Why?.That is an explanation for another blog post.
In the meantime, let’s move forward.
To get the median value for each column, we can use the attribute median() provided by the DataFrame class.
Then we can replace the missing value by using the fillna attribute.
get_median = dataset.
median()dataset = dataset.
fillna(get_median)An alternative way is using Scikit-learn’s SimpleImputer class.
Now we need to check if no missing data values are in our dataset.
any()# OutputFalseLooks like all the missing values are gone.
Now, for the final step.
Since we are dealing with a supervised learning task, we need to separate the dataset into input data and labels.
But before that, I want to mention something very quick.
If you pay attention to the list of variables we have, for Vitamin A and Vitamin D, we have two variables representing each of them.
For Vitamin A we have Vit_A_IU and Vit_A_RAE, and for Vitamin D we have Vit_D_IU and Vit_D_µg.
The IU stands for International Standard.
I’m not going to explain about what IU is, but if you want to know more then you can check it out here.
I decided to choose Vit_A_IU and Vit_D_IU because I try to avoid too much variation in the difference of units of each variable.
Now that we got that out of the way, let’s move on to separating the input data and the labels.
Remember the list of label names that initially prepared before?.This is where it’s going to come handy.
# List of names of the columns that we will drop for the input datadrop_cols = ['Vit_D_µg', 'Vit_A_RAE'] + labels# Get the data for labelsy = dataset[labels].
values# Get the data for inputsX = dataset.
valuesThe value attribute converts the data from a DataFrame object into Numpy arrays.
Awesome!.We’re done with cleaning the dataWe are making progress as part of our data science team.
Now that the data is rid of text and missing values, we are ready to start building the model.
Join me in Part 2 of this series where we will finally take another step on the fight against world hunger by creating a machine learning model using Scikit-learn.
I hope this blog post was helpful and can’t wait to see you in part 2.
Until then, cheers!About Me:An Indonesian student currently studying in Korea and has a deep interest in the world of AI.
Backend Developer at a startup called 88spares.
Believe in continuous learning.
Want to give back to the AI that has contributed so much in my pursuit of knowledge.
When I’m not coding, I like to read up the latest trends in the space industry and play for my college’s American football team.