Using word2vec to Analyze News Headlines and Predict Article Success

word2vec can help us answer these questions, and more.

Charlene ChamblissBlockedUnblockFollowFollowingMar 3Word embeddings are a powerful way to represent the latent information contained within words, as well as within documents (collections of words).

Using a dataset of news article titles, which included features on source, sentiment, topic, and popularity (# shares), I set out to see what we could learn about articles’ relationships to one another through their respective embeddings.

The goals of the project were:Preprocess/clean the text data, using NLTKUse word2vec to create word and title embeddings, then visualize them as clusters using t-SNEVisualize the relationship between title sentiment and article popularityAttempt to predict article popularity from the embeddings and other available featuresUse model stacking to improve the performance of the popularity model (this step was not successful, but was still a valuable experiment!)The entire notebook is hosted here, using nbviewer.

Imports and PreprocessingWe’ll begin with imports:import pandas as pdimport gensimimport seaborn as snsimport matplotlib.

pyplot as pltimport numpy as npimport xgboost as xgbThen read in the data:main_data = pd.



head()# Grab all the titles article_titles = main_data['Title']# Create a list of strings, one for each titletitles_list = [title for title in article_titles]# Collapse the list of strings into a single long string for processingbig_title_string = ' '.

join(titles_list)from nltk.

tokenize import word_tokenize# Tokenize the string into wordstokens = word_tokenize(big_title_string)# Remove non-alphabetic tokens, such as punctuationwords = [word.

lower() for word in tokens if word.

isalpha()]# Filter out stopwordsfrom nltk.

corpus import stopwordsstop_words = set(stopwords.

words('english'))words = [word for word in words if not word in stop_words]# Print first 10 wordswords[:10]Next, we need to load in the pre-trained word2vec model.

You can find several such models here.

Since this is a news dataset, I used the Google News model, which was trained on about 100 billion words (wow).

# Load word2vec model (trained on an enormous Google corpus)model = gensim.




bin', binary = True) # Check dimension of word vectorsmodel.

vector_sizeSo the model will generate 300-dimensional word vectors, and all we have to do to create a vector is to pass it through the model.

Each vector looks like this:economy_vec = model['economy']economy_vec[:20] # First 20 componentsword2vec (understandably) can’t create a vector from a word that’s not in its vocabulary.

Because of this, we need to specify “if word in model.

vocab” when creating the full list of word vectors.

# Filter the list of vectors to include only those that Word2Vec has a vector forvector_list = [model[word] for word in words if word in model.

vocab]# Create a list of the words corresponding to these vectorswords_filtered = [word for word in words if word in model.

vocab]# Zip the words together with their vector representationsword_vec_zip = zip(words_filtered, vector_list)# Cast to a dict so we can turn it into a DataFrameword_vec_dict = dict(word_vec_zip)df = pd.


from_dict(word_vec_dict, orient='index')df.

head(3)Dimensionality Reduction with t-SNENext, we’re going to squish (read: do dimensionality reduction on) these word vectors using t-SNE, to see if any patterns emerge.

If you’re not familiar with t-SNE and its interpretations, check out this excellent, interactive distill.

pub article on t-SNE.

It’s important to play around with the parameters for t-SNE, as different values can produce very different results.

I tested several values between 0 and 100 for perplexity, and found that it produced roughly the same shape each time.

I tested several learning rates between 20 and 400 as well, and decided to leave the learning rate at its default (200).

For the sake of visibility (and processing time), I used 400 word vectors instead of the full set of 20,000 or so.

from sklearn.

manifold import TSNE# Initialize t-SNEtsne = TSNE(n_components = 2, init = 'random', random_state = 10, perplexity = 100)# Use only 400 rows to shorten processing timetsne_df = tsne.

fit_transform(df[:400])Now we’re ready to plot our reduced array of word vectors.

I used adjust_text to intelligently push words apart, for improved readability:sns.

set()# Initialize figurefig, ax = plt.

subplots(figsize = (11.

7, 8.


scatterplot(tsne_df[:, 0], tsne_df[:, 1], alpha = 0.

5)# Import adjustText, initialize list of textsfrom adjustText import adjust_texttexts = []words_to_plot = list(np.

arange(0, 400, 10))# Append words to listfor word in words_to_plot: texts.


text(tsne_df[word, 0], tsne_df[word, 1], df.

index[word], fontsize = 14)) # Plot text using adjust_text (because overlapping text is hard to read)adjust_text(texts, force_points = 0.

4, force_text = 0.

4, expand_points = (2,1), expand_text = (1,2), arrowprops = dict(arrowstyle = "-", color = 'black', lw = 0.


show()If you’re interested in trying out adjust_text for your own plotting needs, you can find it here.

Be sure to import it using the camelcase adjustText, and please note that adjustText is currently not compatible with matplotlib 3.

0 or higher.

It’s encouraging to see that even when the vector embeddings have been reduced to 2 dimensions, we see certain items clustering together.

For example, we have months in the left/upper left, we have corporate finance terms near the bottom, and we have more generic, non-topical words (like ‘full’, ‘really’, ‘slew’) in the middle.

Note that if we were to run the t-SNE again with different parameters, we may observe some similarities to this result, but we’re not guaranteed to see the exact same patterns.

t-SNE is not deterministic.

Relatedly, tightness of clusters and distances between clusters are not always meaningful.

It is meant primarily as an exploratory tool, rather than as a decisive indicator of similarity.

Averaging Word EmbeddingsWe’ve gotten a sense of how word embeddings work as applied to this dataset.

Now we can move on to some more interesting ML applications: finding titles that cluster together, and seeing what patterns emerge.

Instead of using Doc2Vec, which does not have pre-trained models available and so would require a lengthy training process, we can use a simpler (and sometimes even more effective) trick: averaging the embeddings of the word vectors in each document.

In our case, a document refers to a title.

We’ll need to redo the preprocessing step to keep titles intact — as we’ll see, this is somewhat more involved than splitting on words.

Thankfully, Dimitris Spathis has created a series of functions that I found to work perfectly for this precise use case.

Thanks, Dimitris!def document_vector(word2vec_model, doc): # remove out-of-vocabulary words doc = [word for word in doc if word in model.

vocab] return np.

mean(model[doc], axis=0)# Our earlier preprocessing was done when we were dealing only with word vectors# Here, we need each document to remain a document def preprocess(text): text = text.

lower() doc = word_tokenize(text) doc = [word for word in doc if word not in stop_words] doc = [word for word in doc if word.

isalpha()] return doc# Function that will help us drop documents that have no word vectors in word2vecdef has_vector_representation(word2vec_model, doc): """check if at least one word of the document is in the word2vec dictionary""" return not all(word not in word2vec_model.

vocab for word in doc)# Filter out documentsdef filter_docs(corpus, texts, condition_on_doc): """ Filter corpus and texts given the function condition_on_doc which takes a doc.

The document doc is kept if condition_on_doc(doc) is true.

""" number_of_docs = len(corpus) if texts is not None: texts = [text for (text, doc) in zip(texts, corpus) if condition_on_doc(doc)] corpus = [doc for doc in corpus if condition_on_doc(doc)] print("{} docs removed".

format(number_of_docs – len(corpus))) return (corpus, texts)Now we’ll use these to do the processing:# Preprocess the corpuscorpus = [preprocess(title) for title in titles_list]# Remove docs that don't include any words in W2V's vocabcorpus, titles_list = filter_docs(corpus, titles_list, lambda doc: has_vector_representation(model, doc))# Filter out any empty docscorpus, titles_list = filter_docs(corpus, titles_list, lambda doc: (len(doc) != 0))x = []for doc in corpus: # append the vector for each document x.

append(document_vector(model, doc)) X = np.

array(x) # list to arrayt-SNE, Round 2: Document VectorsNow that we’ve successfully created our array of document vectors, let’s see if we can get similarly interesting results when plotting them with t-SNE.

# Initialize t-SNEtsne = TSNE(n_components = 2, init = 'random', random_state = 10, perplexity = 100)# Again use only 400 rows to shorten processing timetsne_df = tsne.

fit_transform(X[:400])fig, ax = plt.

subplots(figsize = (14, 10))sns.

scatterplot(tsne_df[:, 0], tsne_df[:, 1], alpha = 0.

5)from adjustText import adjust_texttexts = []titles_to_plot = list(np.

arange(0, 400, 40)) # plots every 40th title in first 400 titles# Append words to listfor title in titles_to_plot: texts.


text(tsne_df[title, 0], tsne_df[title, 1], titles_list[title], fontsize = 14)) # Plot text using adjust_textadjust_text(texts, force_points = 0.

4, force_text = 0.

4, expand_points = (2,1), expand_text = (1,2), arrowprops = dict(arrowstyle = "-", color = 'black', lw = 0.


show()Pretty interesting!.We can see that the t-SNE has collapsed the document vectors into a dimensional space where the documents are spread out based on whether their content has more to do with countries, world leaders, and foreign affairs, or has more to do with technology companies.

Let’s now explore article popularity.

Consensus has it that the more sensationalized or clickbait-y an article’s title is, the more likely it is to be shared, right?.Next, we’ll see if there’s evidence for that in this particular dataset.

Popularity and Sentiment AnalysisFirst, we need to drop all articles for which we don’t have a popularity measurement or a source.

Null measurements for popularity are represented in this data as -1.

# Drop all the rows where the article popularities are unknown (this is only about 11% of the data)main_data = main_data.


Facebook == -1) | (main_data.

GooglePlus == -1) | (main_data.

LinkedIn == -1)].

index)# Also drop all rows where we don't know the sourcemain_data = main_data.




shapeWe still have 81,000 articles to work with, so let’s see if we can find an association between sentiment and number of shares.

fig, ax = plt.

subplots(1, 3, figsize=(15, 10))subplots = [a for a in ax]platforms = ['Facebook', 'GooglePlus', 'LinkedIn']colors = list(sns.

husl_palette(10, h=.

5)[1:4]) for platform, subplot, color in zip(platforms, subplots, colors): sns.

scatterplot(x = main_data[platform], y = main_data['SentimentTitle'], ax=subplot, color=color) subplot.

set_title(platform, fontsize=18) subplot.

set_xlabel('') fig.

suptitle('Plot of Popularity (Shares) by Title Sentiment', fontsize=24)plt.

show()It’s a bit hard to make out whether there’s any relationship here, since a few articles are significant outliers in terms of their share counts.

Let’s try log-transforming the x-axis to see if we can reveal any patterns.

We’ll also use a regplot, so seaborn will overlay a linear regression for each plot.

# Our data has over 80,000 rows, so let's also subsample it to make the log-transformed scatterplot easier to readsubsample = main_data.

sample(5000)fig, ax = plt.

subplots(1, 3, figsize=(15, 10))subplots = [a for a in ax]for platform, subplot, color in zip(platforms, subplots, colors): # Regression plot, so we can gauge the linear relationship sns.

regplot(x = np.

log(subsample[platform] + 1), y = subsample['SentimentTitle'], ax=subplot, color=color, # Pass an alpha value to regplot's scatterplot call scatter_kws={'alpha':0.

5}) # Set a nice title, get rid of x labels subplot.

set_title(platform, fontsize=18) subplot.

set_xlabel('') fig.

suptitle('Plot of log(Popularity) by Title Sentiment', fontsize=24)plt.

show()Contrary to what we might expect (from our idea of highly emotional, clickbaity headlines), in this dataset we find no relationship between headline sentiment and article popularity as measured by number of shares.

To get a clearer sense of how popularity looks on its own, let’s make a final plot of log(Popularity) by platform.

fig, ax = plt.

subplots(3, 1, figsize=(15, 10))subplots = [a for a in ax]for platform, subplot, color in zip(platforms, subplots, colors): sns.


log(main_data[platform] + 1), ax=subplot, color=color, kde_kws={'shade':True}) # Set a nice title, get rid of x labels subplot.

set_title(platform, fontsize=18) subplot.

set_xlabel('') fig.

suptitle('Plot of Popularity by Platform', fontsize=24)plt.

show()As our final segment of exploration, let’s take a look at sentiment by itself.

Does it seem to differ between publishers?# Get the list of top 12 sources by number of articlessource_names = list(main_data['Source'].


index)source_colors = list(sns.

husl_palette(12, h=.

5))fig, ax = plt.

subplots(4, 3, figsize=(20, 15), sharex=True, sharey=True)ax = ax.

flatten()for ax, source, color in zip(ax, source_names, source_colors): sns.


loc[main_data['Source'] == source]['SentimentTitle'], ax=ax, color=color, kde_kws={'shade':True}) ax.

set_title(source, fontsize=14) ax.

set_xlabel('') plt.


75, 0.


show()The distributions look fairly similar, but it’s a little hard to tell how similar when they’re all on different plots.

Let’s try overlaying them all on one plot.

# Overlay each density curve on the same plot for closer comparisonfig, ax = plt.

subplots(figsize=(12, 8))for source, color in zip(source_names, source_colors): sns.


loc[main_data['Source'] == source]['SentimentTitle'], ax=ax, hist=False, label=source, color=color) ax.

set_xlabel('') plt.


75, 0.


show()We see that the sources’ Sentiment distributions for article titles are very similar — it doesn’t look like any one source is an outlier in terms of positive or negative titles.

Instead, all 12 of the most common sources have distributions centered around 0 with modestly sized tails.

But does that tell the full story?.Let’s take one more look at the numbers:# Group by Source, then get descriptive statistics for title sentimentsource_info = main_data.


describe()# Recall that `source_names` contains the top 12 sources# We'll also sort by highest standard deviationsource_info.


sort_values('std', ascending=False)[['std', 'min', 'max']]WSJ has both the highest standard deviation and the largest range.

We can see at a glance that WSJ has both the highest standard deviation and the largest range, with the lowest minimum sentiment compared to any other top source.

This suggests that WSJ could be unusually negative in terms of its article titles.

To verify this rigorously would require a hypothesis test, which is beyond the scope of this post, but it’s an interesting potential finding and future direction.

Popularity PredictionOur first task in preparing the data for modeling is to rejoin the document vectors with their respective titles.

Thankfully, when we were preprocessing the corpus, we processed the corpus and titles_list simultaneously, so the vectors and the titles they represent will still match up.

Meanwhile, in main_df, we have dropped all of the articles that had -1 popularity, so we'll need to drop the vectors that represent those article titles.

Training a model on these enormous vectors as-is will not be possible on this computer, but we’ll see what we can do with a little dimension reduction.

I’ll also engineer a new feature from publish date: “DaysSinceEpoch”, which is based on Unix time (read more here).

import datetime# Convert publish date column to make it compatible with other datetime objectsmain_data['PublishDate'] = pd.

to_datetime(main_data['PublishDate'])# Time since Linux Epocht = datetime.

datetime(1970, 1, 1)# Subtract this time from each article's publish datemain_data['TimeSinceEpoch'] = main_data['PublishDate'] – t# Create another column for just the days from the timedelta objects main_data['DaysSinceEpoch'] = main_data['TimeSinceEpoch'].


describe()As we can see, all of these articles were published within about 250 days of each other.

from sklearn.

decomposition import PCApca = PCA(n_components=15, random_state=10)# as a reminder, x is the array with our 300-dimensional vectorsreduced_vecs = pca.

fit_transform(x)df_w_vectors = pd.

DataFrame(reduced_vecs)df_w_vectors['Title'] = titles_list# Use pd.

concat to match original titles with their vectorsmain_w_vectors = pd.

concat((df_w_vectors, main_data), axis=1)# Get rid of vectors that couldn't be matched with the main_dfmain_w_vectors.

dropna(axis=0, inplace=True)Now we need to drop non-numeric and non-dummy columns so we can feed the data to a model.

We’ll also apply scaling to the DaysSinceEpoch feature, since it is wildly larger in magnitude compared to the reduced word vectors, sentiment, etc.

# Drop all non-numeric, non-dummy columns, for feeding into the modelscols_to_drop = ['IDLink', 'Title', 'TimeSinceEpoch', 'Headline', 'PublishDate', 'Source'] data_only_df = pd.

get_dummies(main_w_vectors, columns = ['Topic']).

drop(columns=cols_to_drop)# Standardize DaysSinceEpoch since the raw numbers are larger in magnitude from sklearn.

preprocessing import StandardScalerscaler = StandardScaler()# Reshape so we can feed the column to the scalerstandardized_days = np.


reshape(-1, 1)data_only_df['StandardizedDays'] = scaler.

fit_transform(standardized_days)# Drop the raw column; we don't need it anymoredata_only_df.

drop(columns=['DaysSinceEpoch'], inplace=True)# Look at the new rangedata_only_df['StandardizedDays'].

describe()# Get Facebook data onlyfb_data_only_df = data_only_df.

drop(columns=['GooglePlus', 'LinkedIn'])# Separate the features and the responseX = fb_data_only_df.

drop('Facebook', axis=1)y = fb_data_only_df['Facebook']# 80% of data goes to trainingX_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.

2, random_state = 10)Let’s run a non-optimized XGBoost on the data and see how it works out-of-the-box.

from sklearn.

metrics import mean_squared_error# Instantiate an XGBRegressorxgr = xgb.

XGBRegressor(random_state=2)# Fit the classifier to the training setxgr.

fit(X_train, y_train)y_pred = xgr.

predict(X_test)mean_squared_error(y_test, y_pred)Underwhelming results, to say the least.

Can we improve this performance with hyperparameter tuning?.I’ve pulled in and repurposed a hyperparameter tuning grid from this Kaggle article.

from sklearn.

model_selection import GridSearchCV# Various hyper-parameters to tunexgb1 = xgb.

XGBRegressor()parameters = {'nthread':[4], 'objective':['reg:linear'], 'learning_rate': [.

03, 0.

05, .

07], 'max_depth': [5, 6, 7], 'min_child_weight': [4], 'silent': [1], 'subsample': [0.

7], 'colsample_bytree': [0.

7], 'n_estimators': [250]}xgb_grid = GridSearchCV(xgb1, parameters, cv = 2, n_jobs = 5, verbose=True)xgb_grid.

fit(X_train, y_train)According to xgb_grid, our best parameters were as follows:{'colsample_bytree': 0.

7, 'learning_rate': 0.

03, 'max_depth': 5, 'min_child_weight': 4, 'n_estimators': 250, 'nthread': 4, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.

7}Try again with the new parameters:params = {'colsample_bytree': 0.

7, 'learning_rate': 0.

03, 'max_depth': 5, 'min_child_weight': 4, 'n_estimators': 250, 'nthread': 4, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.

7}# Try again with new paramsxgr = xgb.

XGBRegressor(random_state=2, **params)# Fit the classifier to the training setxgr.

fit(X_train, y_train)y_pred = xgr.

predict(X_test)mean_squared_error(y_test, y_pred)It’s better by around 35,000, but I’m not sure that’s saying a whole lot.

We might infer, at this point, that the data in its current state seems insufficient for this model to perform.

Let’s see if we can improve it with a little more feature engineering: we’ll train some classifiers to separate the two main groups of articles: Duds (0 or 1 share) vs.

Not Duds.

The idea is that if we can give the regressor a new feature (the probability that the article will have extremely low shares), it may perform more favorably on predicting highly-shared articles, thus lowering the residual values for those articles and reducing mean squared error.

Detour: Detect Dud ArticlesFrom the log-transformed plots we made earlier, we can note that in general, there are 2 chunks of articles: 1 cluster at 0, and another cluster (the long tail) going from 1 onwards.

We can train a few classifiers to identify whether the article will be a “dud” (be in the 0–1 shares bin), and then use the predictions of those models a feature for the final regressor, which will predict probability.

This is called model stacking.

# Define a quick function that will return 1 (true) if the article has 0-1 share(s)def dud_finder(popularity): if popularity <= 1: return 1 else: return 0# Create target column using the functionfb_data_only_df['is_dud'] = fb_data_only_df['Facebook'].

apply(dud_finder)fb_data_only_df[['Facebook', 'is_dud']].

head()# 28% of articles can be classified as "duds"fb_data_only_df['is_dud'].

sum() / len(fb_data_only_df)Now that we have our dud feature made, we’ll initialize the classifiers.

We’ll use a Random Forest, an optimized XGBClassifier, and a K-Nearest Neighbors classifier.

I will leave out the part where I tune the XGB, since it looks essentially the same as the tuning we did earlier.

from sklearn.

ensemble import RandomForestClassifierfrom sklearn.

neighbors import KNeighborsClassifierfrom sklearn.

model_selection import train_test_splitX = fb_data_only_df.

drop(['is_dud', 'Facebook'], axis=1)y = fb_data_only_df['is_dud']# 80% of data goes to trainingX_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.

2, random_state = 10)# Best params, produced by HP tuningparams = {'colsample_bytree': 0.

7, 'learning_rate': 0.

03, 'max_depth': 5, 'min_child_weight': 4, 'n_estimators': 200, 'nthread': 4, 'silent': 1, 'subsample': 0.

7}# Try xgc again with new paramsxgc = xgb.

XGBClassifier(random_state=10, **params)rfc = RandomForestClassifier(n_estimators=100, random_state=10)knn = KNeighborsClassifier()preds = {}for model_name, model in zip(['XGClassifier', 'RandomForestClassifier', 'KNearestNeighbors'], [xgc, rfc, knn]): model.

fit(X_train, y_train) preds[model_name] = model.

predict(X_test)Test the models, get the classification reports:from sklearn.

metrics import classification_report, roc_curve, roc_auc_scorefor k in preds: print("{} performance:".

format(k)) print() print(classification_report(y_test, preds[k]), sep='!.')The top performance in terms of f1-score came from the XGC, followed by the RF and finally the KNN.

However, we can also note that the KNN actually did the best job in terms of recall (successfully identifying duds).

This is why model stacking is valuable — sometimes even an otherwise excellent model like XGBoost can underperform on tasks like this one, where evidently the function to be identified can be locally approximated.

Including the KNN’s predictions should add some much-needed diversity.

# Plot ROC curvesfor model in [xgc, rfc, knn]: fpr, tpr, thresholds = roc_curve(y_test, model.

predict_proba(X_test)[:,1]) plt.

plot([0, 1], [0, 1], 'k–') plt.

plot(fpr, tpr) plt.

xlabel('False Positive Rate')plt.

ylabel('True Positive Rate')plt.

title('ROC Curves')plt.

show()Popularity Prediction: Round 2Now we can average out the probability predictions from the three classifiers, and use it as a feature for the regressor.

averaged_probs = (xgc.

predict_proba(X)[:, 1] + knn.

predict_proba(X)[:, 1] + rfc.

predict_proba(X)[:, 1]) / 3X['prob_dud'] = averaged_probsy = fb_data_only_df['Facebook']This is followed by another round of HP tuning with the new feature included, which I’ll leave out.

Let’s see how we do on performance:xgr = xgb.

XGBRegressor(random_state=2, **params)# Fit the classifier to the training setxgr.

fit(X_train, y_train)y_pred = xgr.

predict(X_test)mean_squared_error(y_test, y_pred)Uh oh!.This performance is essentially the same as it was before we even did any model stacking.

That said, we can keep in mind that the MSE as an error measurement tends to overweight outliers.

In fact, we can also calculate the mean absolute error (MAE), which is used for assessing performance on data with significant outliers.

In mathematical terms, the MAE calculates the l1 norm, essentially the absolute value, of the residuals, rather than the l2 norm used by the MSE.

We can compare the MAE with the square root of the MSE, also known as the root mean squared error (RMSE).

mean_absolute_error(y_test, y_pred), np.

sqrt(mean_squared_error(y_test, y_pred))The mean absolute error is only about 1/3 of the RMSE!.Perhaps our model is not as bad as we might have initially thought.

As a final step, let’s take a look at each feature’s importance according to the XGRegressor:for feature, importance in zip(list(X.

columns), xgr.

feature_importances_): print('Model weight for feature {}: {}'.

format(feature, importance))Aha!.prob_dud was found to be the most important feature.

Neat!.Our model found prob_dud to be the most important feature, and our custom StandardizedDays feature was the second most important.

(Features 0 through 14 correspond to the reduced title embedding vectors.

)Even though overall performance didn’t improve through this round of model stacking, we can see that we did successfully capture an important source of variability in the data, which the model picked up on.

If I were to continue expanding on this project to make the model more accurate, I’d probably consider augmenting the data with outside data, including Source as a variable via binning or hashing, running the models on the original 300-dimensional vectors, and using the “time-sliced” data (a companion dataset to this data) of each article’s popularity at various time points to predict the final popularity.

If you found this analysis interesting, please feel free to use the code, and to expand on it further.The notebook is here (note that some of the cells may be in a slightly different order than they were presented here), and the original data used for this project is here.


. More details

Leave a Reply