Machine Learning has never been this easy: Feature Engineering Concepts in 6 questions

Anson Shiu, May 26

Key terms: feature normalization, categorical features, one hot representation, feature crosses, text representation, TFIDF, N-gram, Word2Vec

This article is written for people who are keen to master machine learning concepts and skills required for machine learning jobs quickly by going through a set of popular and useful questions.

Any comments and suggestions are welcome.

Background

For all machine learning problems, the data and its features are critical to algorithm performance.

Feature engineering is a series of data-preprocessing steps that aim to remove impurities and redundancy from the data.

The questions below focus on two types of data:

- Structured Data: It is like a table in a relational database. Each column has its own definition and holds a basic data type such as numerical or categorical. Each row represents one data sample.
- Unstructured Data: It includes text, images, audio, and video data, which cannot be represented by a single numerical value. Each data sample can also have a different size/dimension, and the category definitions are vague.

Question 1: Why do we need to normalize numerical features?

We need to make features comparable.

For example, suppose we want to analyze whether height or weight has more influence on a person's health.

Let's say we use the meter (m) and the kilogram (kg) as the units of height and weight respectively.

The range of height is about 1.5 to 1.9 meters, while the range of weight is about 50 to 100 kilograms.

Since the result of the analysis is dominated by features with larger numerical differences, weight will most probably appear to have a larger impact than height.

Therefore we normalize the features to bring them onto the same scale before further analysis.
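As a minimal sketch of what this looks like on the height and weight example, the snippet below applies two widely used schemes, min-max scaling and z-score normalization. The sample values and the function names are made up for illustration, and the population standard deviation is an assumption (the sample standard deviation is an equally common choice).

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Linearly map values onto [0, 1]: (x - min) / (max - min).
    Assumes the values are not all identical (max > min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def z_score(values):
    """Shift to mean 0 and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]

# Made-up samples in the ranges mentioned in the text.
heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]
weights_kg = [50, 60, 75, 90, 100]

# After scaling, both features live on the same [0, 1] range.
print(min_max_scale(heights_m))
print(min_max_scale(weights_kg))

# After z-scoring, both features have mean 0 and standard deviation 1.
print(z_score(heights_m))
print(z_score(weights_kg))
```

Either way, a 0.4 m spread in height and a 50 kg spread in weight end up on comparable scales, so neither feature dominates just because of its unit.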

Stochastic gradient descent also converges faster after normalization.

Therefore we should normalize numerical features for models that are trained with gradient descent. (Decision tree models are an exception: their splits do not depend on feature scale.)

Follow-up: What are the common ways to normalize numerical features?

1. Min-max Scaling
2. Z-score Normalization

Question 2: How should we handle categorical features?

Categorical features are features such as gender (M/F) and blood type (A, B, AB, O).

The input type is usually in “string” format.

For most machine learning models, categorical features should be converted into numerical features before feeding into the model.

Follow-up: What are the common methodologies?

Ordinal Encoding: It applies to categorical features with an ordinal relationship. For example, for academic grades in high school (A, B, C), we can assign A: 3, B: 2 and C: 1.

One-hot Encoding: It applies to categorical features without an ordinal relationship. For example, blood types (A, B, AB, O) can be expressed as sparse vectors, in which A: (1,0,0,0), B: (0,1,0,0), AB: (0,0,1,0) and O: (0,0,0,1).

Reminder: when a categorical feature has a large number of categories
- Store the features in a sparse matrix format in order to save space
- It might be necessary to go through a feature selection process to lower the dimension of the features, because 1) it is difficult to measure the distance between points in a high-dimensional space (KNN), 2) the number of parameters grows with the dimension, which leads to overfitting (Logistic Regression), and 3) only some of the dimensions are useful for prediction/classification

Binary Encoding: There are 2 steps in total. First, we assign a unique id to each category. Second, we convert the id into its binary representation. Compared with the one-hot representation, it has a lower dimension, so it helps to save space.

Others: There are other encoding methods, including Helmert Contrast, Sum Contrast, Polynomial Contrast, Backward Difference Contrast, etc.
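The first three encodings can be sketched in a few lines of plain Python. This is only an illustration: the function names are hypothetical, and numbering the ids from 1 in list order for binary encoding is an assumption (any unique id assignment would work).

```python
def ordinal_encode(value, order):
    """Map a category to its rank; `order` lists categories from lowest to highest."""
    return order.index(value) + 1

def one_hot_encode(value, categories):
    """A vector with a single 1 at the position of the category."""
    return [1 if value == c else 0 for c in categories]

def binary_encode(value, categories):
    """Step 1: assign ids 1..n in list order (an assumption).
    Step 2: write the id as fixed-width binary digits."""
    width = len(categories).bit_length()
    category_id = categories.index(value) + 1
    return [int(bit) for bit in format(category_id, f"0{width}b")]

grades = ["C", "B", "A"]             # lowest to highest
blood_types = ["A", "B", "AB", "O"]

print(ordinal_encode("A", grades))          # rank 3
print(one_hot_encode("AB", blood_types))    # 4 dimensions
print(binary_encode("AB", blood_types))     # 3 dimensions
```

With four blood types the saving is small (3 dimensions instead of 4), but the binary width grows only logarithmically with the number of categories, while the one-hot width grows linearly.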

Question 3: What are feature crosses? How should we handle high-dimensional feature crosses?

Answer: To improve a model's ability to fit complex relationships, it is common in feature engineering to pair up two discrete features to form “feature crosses”.

[Figure: example of a feature cross]

Usually, it is practical to apply feature crosses to ordinary discrete features.

Yet, it becomes troublesome when it comes to “id” type features.

Assume there are m users and n items; the number of parameters to learn then becomes m x n.

In an e-commerce scenario with millions of users and items, it is really difficult to learn so many parameters.

To solve this problem, we can represent each user and each item as a low-dimensional vector of dimension K.
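Here is a small sketch to make both ideas concrete: building a feature cross from two discrete features, and comparing how many parameters a full user x item cross needs against K-dimensional vectors. The feature values, the choice of K = 50, and reading the low-dimensional form as a dot product of a user vector and an item vector are assumptions for illustration, not details given in the article.

```python
from itertools import product

def cross(values_a, values_b):
    """Every (a, b) combination; each one becomes its own binary feature."""
    return list(product(values_a, values_b))

def cross_one_hot(a, b, crossed):
    """One-hot vector over the crossed feature space."""
    return [1 if (a, b) == pair else 0 for pair in crossed]

def crossed_param_count(m, n):
    """One weight per (user, item) pair."""
    return m * n

def low_dim_param_count(m, n, k):
    """One k-dimensional vector per user and one per item."""
    return (m + n) * k

def pair_weight(user_vec, item_vec):
    """Assumed form: the weight of a (user, item) pair is the dot product
    of the two low-dimensional vectors."""
    return sum(u * v for u, v in zip(user_vec, item_vec))

crossed = cross(["trial", "paid"], ["skincare", "food"])
print(crossed)                                  # 2 x 2 = 4 crossed features
print(cross_one_hot("paid", "food", crossed))

m = n = 1_000_000
print(crossed_param_count(m, n))                # 10**12 weights
print(low_dim_param_count(m, n, 50))            # 10**8 weights
```

For ordinary discrete features the crossed space stays small, which is why the direct cross is practical there and only becomes a problem for id-like features.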

The reason is that the number of parameters to learn drops from m x n to (m + n) x K.

Question 4: How can we find feature crosses in an efficient way?

Since we usually need to process many high-dimensional features, simply pairing up every combination of features leads to too many parameters and to overfitting.

To solve the problem, it is necessary to select meaningful feature pairs.

A decision tree is one solution.

Suppose we have four raw input features: age, gender, user type (trial, paid) and item type (skincare, food, etc.). We construct a decision tree based on the “click” or “not click” labels.

Every path from the root node to a leaf node can be treated as one meaningful feature cross.
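The idea of reading feature crosses off a tree can be sketched as follows. The tree here is a hypothetical toy tree written by hand, not the tree from the article's figure and not one learned from data; in practice the tree would be trained on the click labels first. The helper names are made up as well.

```python
# Hypothetical toy tree. Internal node: (test name, predicate, subtree if the
# test is true, subtree if it is false). Leaf: a label string.
TREE = (
    "age <= 35", lambda s: s["age"] <= 35,
    ("gender = female", lambda s: s["gender"] == "female", "click", "not click"),
    ("user type = paid", lambda s: s["user_type"] == "paid", "click", "not click"),
)

def root_to_leaf_paths(node, prefix=()):
    """Yield each root-to-leaf path as a tuple of (name, predicate, expected outcome)."""
    if isinstance(node, str):  # reached a leaf
        yield prefix
        return
    name, predicate, if_true, if_false = node
    yield from root_to_leaf_paths(if_true, prefix + ((name, predicate, True),))
    yield from root_to_leaf_paths(if_false, prefix + ((name, predicate, False),))

def describe(path):
    """Human-readable conjunction of the tests along a path."""
    return " & ".join(name if expected else f"not ({name})"
                      for name, _, expected in path)

def encode(sample, paths):
    """One binary feature per path: 1 if the sample satisfies every test on it."""
    return [int(all(pred(sample) == expected for _, pred, expected in path))
            for path in paths]

PATHS = list(root_to_leaf_paths(TREE))
for path in PATHS:
    print(describe(path))

sample = {"age": 28, "gender": "female", "user_type": "trial"}
print(encode(sample, PATHS))
```

Because this sketch uses complete root-to-leaf paths, each sample matches exactly one of them. If only selected conditions along a path are kept as the cross, a sample can satisfy several crosses at once.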

[Figure: decision tree built from the “click” / “not click” labels]

From the tree in the figure, there are 4 possible feature crosses:

- Age <= 35 & Gender = female
- Age <= 35 & Item type = skincare
- User type = paid & Item type = food
- User type = paid & Age <= 40

Suppose we have two new data samples:

[Table: two new data samples]

Question 5: What are the common text representation models?

Bag of Words and N-gram Model

Bag of words

1) The most basic method of producing a text representation is bag of words.
2) Divide the entire document into terms, removing the meaningless ones (stop words).
3) Represent each article as a long vector, in which each dimension stands for one term in the vocabulary.
4) The weight of each dimension can be calculated using term frequency-inverse document frequency (TF-IDF):

TF-IDF(t, d) = TF(t, d) x IDF(t), where TF(t, d) is the frequency of term t in document d and IDF(t) is the inverse document frequency of t, which shrinks as t appears in more documents.

From this formula, a term is more important to an article when:

- The term appears many times in the document
- The term seldom appears in other documents
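The bag-of-words steps with TF-IDF weighting can be sketched with the standard library alone. Several choices here are assumptions rather than the article's exact definitions: TF is the raw count of the term in the document, IDF is log(number of documents / number of documents containing the term) (smoothed variants are also common), and the stop-word list and example documents are made up.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "is", "a", "of"}  # tiny illustrative list

def tokenize(text):
    """Lowercase, split on whitespace, drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

docs = [
    "machine learning is fun",
    "deep learning is a branch of machine learning",
    "cooking is fun",
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In the second document, "deep" occurs once but in no other document, so it outweighs "machine", which also occurs once but is shared with the first document.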

Therefore, when a term also appears in many other documents, its TF-IDF weight is discounted through the IDF factor.

N-gram

Yet some vocabulary items are made up of several consecutive terms (e.g. Natural Language Processing).

Therefore we can group n consecutive terms (n << N, where N is the total number of terms) into one term-set, which we call an N-gram.
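Extracting N-grams from a tokenized document is a short sliding-window loop. A minimal sketch, with a made-up example sentence and a hypothetical helper name:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "natural language processing is fun".split()

print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams; the first one keeps the full phrase together
```

Each tuple can then be treated as a single dimension of the document vector, exactly like a single term in bag of words.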

If a document is represented with N-grams as its features, the representation is called the N-gram model.

Stemming

For grammatical reasons, documents use different forms of a word, such as organize, organizes and organizing. There are also families of derivationally related words with similar meanings, such as democracy, democratic and democratization.

The purpose of stemming is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.

Topic Model

This model is used to discover representative topics in documents and to calculate the topic distribution of each document (details will be discussed in a later post).

Word Embedding (Deep Learning Model)

- Map each term to a vector in a low-dimensional space (K = 50 ~ 300); each dimension can be treated as a latent topic
- If there are N terms in a document, the representation of the document has N x K dimensions
- If we simply feed this raw representation into a machine learning model, the result is usually unsatisfactory, because the model requires a higher-level representation
- For common machine learning models, we can usually achieve good performance by leveraging feature engineering

- In deep learning models, each hidden layer can automatically extract higher-level features at a different level of abstraction, so they usually perform better on this kind of input than traditional machine learning models
- Compared with an MLP (fully connected network), CNNs and RNNs can better capture the properties of a document, and they have fewer parameters to learn, which means less training time and a lower risk of overfitting

Question 6: How does Word2Vec work? What are the differences between Word2Vec and LDA and how are they related?

There are two types of network structure in the Word2Vec model.

[Figure: the CBOW and Skip-gram network structures. Source: https://skymind.ai/wiki/word2vec]

- w(t) = target word
- w(t+1), w(t+2), w(t-1), w(t-2) = context words
- sliding window size = 2

1) Continuous Bag of Words (CBOW): predict the probability of the target word based on its context words
2) Skip-gram: predict the probability of the context words based on the target word

Both network architectures consist of 1) an input layer, 2) a projection layer and 3) an output layer:

1) Input Layer: one-hot encoding (Yes: 1, No: 0), dimension = 1 x N (N = total number of words in the vocabulary)
1.5) Weighting Matrix: N x K matrix
2) Projection Layer: K hidden units, output = (1 x N) * (N x K) = 1 x K
- For CBOW, we take the sum of the 1 x K vectors of the context words
- For Skip-gram, no sum is needed, because the input is the single 1 x K vector of the target word
2.5) Weighting Matrix: K x N matrix
3) Output Layer: output = (1 x K) * (K x N) = 1 x N; each of the N dimensions corresponds to one word in the vocabulary
4) Softmax activation function, applied to the output of the output layer. Purpose: turn each value in the vector into a probability

How to train the weights within the neural network?

- There are two weighting matrices in the aforementioned network architecture: 1) the N x K matrix (between the input and projection layers) and 2) the K x N matrix (between the projection and output layers)
- They are trained with back-propagation (optimized through gradient descent). Because the softmax normalizes over all output values, every update requires looping through the entire vocabulary
- Therefore the following improved methods are used: Hierarchical Softmax and Negative Sampling

Final step: after training

- We get the two weighting matrices (matrix 1 = N x K, matrix 2 = K x N)
- We can choose one of the matrices as the representation of the N terms (each term is a vector of dimension K)

What is the connection between Word2Vec and LDA?

1) What is Latent Dirichlet Allocation (LDA)?

It is used to classify text in a document to a particular topic.

It builds a topic-per-document model and a words-per-topic model, both modeled as Dirichlet distributions. (Learn more about LDA on Medium.)

From the meaning perspective:

LDA:
i) It performs topic clustering on words based on the co-occurrence relationship between words in documents
ii) It yields the probability distributions of document-topic and topic-word

Word2Vec:
i) Its learning goal is to get a trained “context-word” model
ii) The trained word representation is more likely to contain context information
iii) If two words are similar in terms of their Word2Vec representations, it means both words usually appear in similar contexts

From the method perspective:

LDA: a generative model based on a probabilistic graphical model; it is derived from a product of consecutive conditional probabilities

Word2Vec: a neural network representation; the word representation is derived by learning the network weights

WooHoo, thanks for reading! Hope you now have a better understanding of feature engineering on different types of data.

Feel free to give me any comments and see you next time!