Using Dimensionality Reduction to Visualize Job Polarization

Onyi Lam, May 28

PC1 and PC2 extracted from the MDS embedding using 2003 data.

Each point represents a job, and each color represents a job zone.

The lower the job zone, the less education and experience the job requires.

In this post, we illustrate how dimensionality reduction techniques, including principal component analysis (PCA) and multidimensional scaling (MDS), can be used to visualize the polarization of work activities since 2003, using data from O*NET, a US government-sponsored website with a rich set of variables that describe work and worker characteristics.

Let’s briefly go over the data structure of O*NET to understand why using dimensionality reduction techniques could be useful.

The dataset is organized around the mix of knowledge, skills, abilities and work activities required to perform each job.

It contains hundreds of standardized and occupation-specific variables on almost 1,000 occupations covering the entire U.S. economy.

For this analysis, we used the Work Activity files from 11/2003 and 8/2018 to compare how activities performed at different jobs changed in this span of 15 years.

There are 41 different work activities such as “Handling and Moving Objects” and “Resolving Conflicts and Negotiating with Others”.

In the file, a data value is assigned to each occupation-activity-scale combo.

In general, the higher the data value, the more important the activity is to the job.

For example, in the 2018 file (shown in the screenshot below), the data value for the job “Chief Executive” to perform activity “Getting Information” using the scale “Importance” is 4.72.
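One plausible way to turn such occupation-activity-scale rows into a job-by-activity table is a pivot. The pandas sketch below is only an illustration: the column names (`Title`, `Element Name`, `Scale ID`, `Data Value`) and the scale code `IM` for importance are assumptions about the file layout, and every row except the 4.72 value quoted above is invented.

```python
import pandas as pd

# Toy rows in the long layout: one row per occupation-activity-scale combo.
# Column names and the "IM"/"LV" scale codes are assumptions; all values
# other than 4.72 are made up for illustration.
records = [
    {"Title": "Chief Executive", "Element Name": "Getting Information", "Scale ID": "IM", "Data Value": 4.72},
    {"Title": "Chief Executive", "Element Name": "Getting Information", "Scale ID": "LV", "Data Value": 5.00},
    {"Title": "Chief Executive", "Element Name": "Handling and Moving Objects", "Scale ID": "IM", "Data Value": 1.50},
    {"Title": "Cashier", "Element Name": "Getting Information", "Scale ID": "IM", "Data Value": 3.20},
    {"Title": "Cashier", "Element Name": "Handling and Moving Objects", "Scale ID": "IM", "Data Value": 3.90},
]
long_df = pd.DataFrame(records)

# Keep a single scale so each job-activity pair has exactly one value.
importance = long_df[long_df["Scale ID"] == "IM"]

# One row per job, one column per activity.
wide = importance.pivot(index="Title", columns="Element Name", values="Data Value")
features = list(wide.columns)
```

Here `wide` plays the role of the dataframe of activity values per job, and `features` lists the activity columns.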

Principal Component Analysis

To understand how the work activities performed at different jobs have changed, one could follow each of these 41 activities separately over time for each job, but that quickly gets out of hand.

PCA can help address this issue by reducing the number of activities that we need to keep track of.

Specifically, it does so by transforming the data to a new coordinate system such that the first coordinate (i.e. the first principal component) captures the direction of greatest variance in the data, the second coordinate captures the second greatest variance, and so on.
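To make the variance ordering concrete, here is a minimal NumPy sketch that computes principal components through a singular value decomposition. The data is randomly generated, and all variable names are illustrative rather than taken from the analysis in this post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 200 "jobs" described by 5 correlated "activities".
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
data = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

# PCA operates on mean-centered data.
centered = data - data.mean(axis=0)

# Rows of vt are the principal directions, ordered by singular value.
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Coordinates of each job in the new coordinate system.
scores = centered @ vt.T

# Variance captured by each coordinate, largest first.
explained_variance = singular_values**2 / (len(data) - 1)
```

The variance of each column of `scores` equals the matching entry of `explained_variance`, so keeping only the first few columns keeps the directions along which the jobs differ the most.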

The following code implements PCA in Python using the scikit-learn library.

It does so by first standardizing the feature scale:

    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # df is the dataframe that contains the value of each of the 41 activities for all jobs
    x = df.loc[:, features].values
    x = StandardScaler().fit_transform(x)

    # num_dim is the number of dimensions we want to examine. We set num_dim to be 2.
    num_dim = 2
    pca = PCA(n_components=num_dim)
    principalComponents = pca.fit_transform(x)

The figures below plot the principal components in the two periods.

To illustrate how jobs that require different education levels differ by the work activities they do, we color-coded each job (denoted by the points on the scatter plot) by the job zone they belong to.

Education is one of the criteria in determining job zone, along with other measures such as experience and training.

The higher the job zone, the more preparation the job requires.

Occupations that belong to job zone 1 are jobs that sometimes require a high school diploma, whereas job zone 5 occupations usually require a graduate degree.

If polarization is growing, the activities that high- and low-skill groups perform should become more different from each other.

And this is exactly what we see when we compare the plots for these two periods.

In 2018, jobs of similar color are more closely clustered together, suggesting that there is less overlap in work activities across jobs that belong to different zones.

Actually doing the calculation reveals that jobs that require less preparation (those in zone 1 and 2) have a larger PC1 value in 2018 than in 2003.

In 2003, jobs in zone 4 (colored in yellow), which demand relatively more education, were more scattered along the x-axis.

In contrast, in 2018 they mostly lie to the left of 0, which means that they are more clustered together.

PC1 and PC2 using PCA, 2003 on the left, 2018 on the right

To examine this pattern more rigorously, we can calculate the distance between jobs in different job zones.

We do so by first determining the centroid of each job zone cluster, then computing the distance between the centroids of the job zones.

In 2003, the distance between job zones 1 and 5 was 5.7, and it increased to 6.7 in 2018.

A similar increase in distance also took place between job zones 1 and 4, and between job zones 2 and 5.
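The centroid comparison can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: it uses Euclidean distance on the two plotted components, the helper names `zone_centroid` and `centroid_distance` are hypothetical, and the coordinates are invented rather than taken from the O*NET results.

```python
import numpy as np

# Made-up (PC1, PC2) coordinates and job zones for six jobs.
pcs = np.array([
    [3.0, 0.5],
    [2.0, -0.5],
    [0.0, 0.0],
    [-1.0, 1.0],
    [-3.0, 1.0],
    [-2.0, 2.0],
])
zones = np.array([1, 1, 3, 3, 5, 5])


def zone_centroid(points, labels, zone):
    """Mean position of all jobs that belong to one job zone."""
    return points[labels == zone].mean(axis=0)


def centroid_distance(points, labels, zone_a, zone_b):
    """Euclidean distance between the centroids of two job zones."""
    diff = zone_centroid(points, labels, zone_a) - zone_centroid(points, labels, zone_b)
    return float(np.linalg.norm(diff))


d_1_5 = centroid_distance(pcs, zones, 1, 5)
d_1_3 = centroid_distance(pcs, zones, 1, 3)
```

Running the same function on the coordinates from each year gives one distance per pair of zones, which can then be compared across years.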

Interpreting the Principal Components

The PCs themselves do not have any substantive meaning, so to relate the principal components to the work activities, we can examine the standard Pearson correlations between the PCs and the work activities themselves.

For the year 2003, PC1 is most strongly correlated with “Inspecting Equipment, Structures, or Material”, and other physical labor intensive activities.

On the other hand, PC2 is most strongly correlated with cognitive activities such as “Processing Information” and “Getting Information”.

Note that while the top 3 activities that have the highest correlation with PC1 are the same in both 2003 and 2018, that is not the case for PC2.

For PC2, “Making Decisions and Solving Problems” is present in 2018 but not in 2003, whereas “Getting Information” is present in 2003 but not in 2018.

So while the two dimensions capture activities of similar nature in both periods, they are not exactly the same.
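The correlations between a principal component and the individual activities can be computed along these lines. This is a minimal sketch on randomly generated data, with placeholder activity names and without the standardization step, so it does not reproduce the rankings reported here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder activity names and made-up data for 100 "jobs".
activity_names = ["Activity A", "Activity B", "Activity C"]
x = rng.normal(size=(100, 3))
x[:, 1] += 2.0 * x[:, 0]  # make Activity B move together with Activity A

# First principal component of the centered data.
centered = x - x.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = centered @ vt[0]

# Pearson correlation between PC1 and each activity column.
correlations = np.array(
    [np.corrcoef(pc1, x[:, j])[0, 1] for j in range(x.shape[1])]
)

# Activities ordered by absolute correlation, strongest first.
# The sign of a principal component is arbitrary, so only the magnitude is used.
ranking = [activity_names[j] for j in np.argsort(-np.abs(correlations))]
```

The first few entries of `ranking` are the activities that best describe what the component captures.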

PC1 using PCA:

PC2 using PCA:

Multidimensional Scaling

Another method that allows researchers to visualize growing polarization across jobs is MDS.

What MDS attempts to do is preserve the pairwise distances between the jobs while projecting the matrix into lower dimensions.

The input to the MDS model is a matrix of distances between the jobs, instead of the data values for each activity-job as in the PCA.

So to convert the job activity data into a matrix that is suitable for MDS, we first multiply the matrix of data values by its transpose.
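A side note on what this product contains: each entry of the data matrix multiplied by its transpose is the inner product between two jobs' activity vectors, and squared Euclidean distances can be derived from those inner products. The NumPy sketch below shows that relationship on made-up numbers. It is an illustration only and not a step in the analysis described in this post.

```python
import numpy as np

# Made-up matrix: 3 "jobs" described by 2 "activities".
x = np.array([
    [1.0, 2.0],
    [4.0, 6.0],
    [1.0, 0.0],
])

# Inner products between every pair of jobs.
gram = x @ x.T

# Squared distance identity: d_ij^2 = g_ii + g_jj - 2 * g_ij.
sq_norms = np.diag(gram)
sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram

# Clip tiny negative values caused by rounding before taking the root.
dist = np.sqrt(np.maximum(sq_dist, 0.0))
```

The resulting `dist` matrix is symmetric with a zero diagonal, which is the shape of input that a precomputed-dissimilarity MDS expects.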

The code below illustrates the matrix multiplication as well as the fitting of the resulting matrix into its embedding:

    import numpy as np
    from sklearn.manifold import MDS

    # multiply the x matrix by its transpose
    t = np.dot(x, np.transpose(x))

    # n_dimension is the number of dimensions of the embedding
    mds = MDS(n_components=n_dimension, max_iter=3000, eps=1e-9,
              random_state=12345, dissimilarity="precomputed", n_jobs=1)
    pos = mds.fit(t).embedding_

After that, we can extract the first two principal components from the position of the dataset in the MDS embedding space as before:

    # select the top 2 dimensions of data
    clf = PCA(n_components=2)
    pos = clf.fit_transform(pos)

Similar to the PCA analysis, the first principal component in 2018 captures the more manual labor intensive activities such as “Handling and Moving Objects”.

In 2003, however, the first principal component captures a different type of activity than in the previous PCA analysis.

Instead of the manual labor intensive activities, PC1 is most correlated with more managerial activities such as “Developing Objectives and Strategies” and “Provide Consultation and Advice to Others”.

This difference reveals another dimension that is not apparent from using PCA.

PC1 using MDS:

PC2 using MDS:

The figures below show the principal components of the MDS embedding for 2003 (left) and 2018 (right).

At first glance, jobs of different colors are more mixed in with each other in 2003 than in 2018, which means that along the two main dimensions, there is more overlap in activities across jobs belonging to different job zones.

In addition, none of the jobs that belong to job zone 4 or 5 have a PC1 > 0 in 2018 (all red and green dots lie to the left of 0 on PC1).

These two patterns suggest that the activities that each job performs have become more separated over time.

PC1 and PC2 using MDS, 2003 on the left, 2018 on the right

To conclude, both methods illustrate a growing polarization in work activities performed by jobs that require different levels of preparation, but they can also reveal differences in the subtleties of the data.
