Finding Meaning: The Data Problem with Small Area Predictive Analysis in the Social Sciences (Part 2)Joanne JordanBlockedUnblockFollowFollowingJun 21If you haven’t read Part 1 of this project of mine yet, feel free to take a look.

While it isn’t necessary to understand the problems I set forth here, I explain what Small Area Analysis is, my inspiration for this project, how I gathered and prepared my data, and the struggles I faced while doing it.

This is a work in progress and there will be a Part 3 to this series in days upcoming.

This article was originally published in August 2018.

This is a slightly edited copy of the original, which was taken down.

Here, I will talk about my difficulties modeling the public health, environmental, and educational data I found for the 59 designated Community Districts across the five boroughs of New York City.

Community Districts of New York CityI will be discussing these three main themes throughout this post.

Different models I considered and triedThe problem of overfitting with limited dataUsing machine learning for interpretation and analysis rather than predictionBefore even beginning with a model, I found the correlations each feature had with the educational attainment columns.

Some were quite interesting.

Take a look at the health-related indicators for the different community districts— especially the correlation that “sugary drinks” has.

Air quality and “community health” indicators (indicators of whether a community is thriving) also had strong correlations in some cases.

Finally, race had correlations on par with national rates…Modeling the DataThe most obvious model to begin with was simple Linear Regression.

While it is possible to use scikit-learn’s LinearRegression with multiple outputs (in this case rate of adults without a high school diploma, rate of adults with a high school diploma and maybe some college, and rate of adults with a college degree or higher), the model assumes the outputs are independent, and therefore essentially creates three completely separate linear regression models, one for each output.

I went ahead and did it anyway, completed a linear regression over all my data and got an R² value of 0.

95 (averaged over the three outputs).

This is all well and good but there were two glaring issues…The first issue was the fact that the three outputs are obviously not independent; in fact they are so connected such that the three rates should sum to 100 in each case (as is the tendency for percentages to do).

So creating three completely independent linear models isn’t actually describing the situation accurately.

seaborn pairplot of the educational attainment dataThe second issue is that, while this model seems to tell us that the features included in the model do seem to correlate with the educational outcome data, when I separated my data into train and test sets, no matter the size, I invariably got an R² value of greater than 0.

95 for the train data and usually a negative value for the test data — the highest I ever got was 0.

49, and that was sheer luck based on the random splitting of the data.

So there’s definitely a problem of overfitting going on here.

With only 59 samples, since there are only 59 Community Districts, this seems an inescapable issue no matter the model.

To make headway on the first issue of correlated educational attainment features in a few different ways.

First, I researched models that could handle multiple outputs while taking into account the fact that they are related.

The only promising avenue seemed to be cutting edge Neural Networks.

However, these were only mentioned in research papers, and no productionized models have yet been created, so it’s really only in the theoretical stage at this point… so a dead end for me.

The only other way around this was then to consolidate my three outputs into a single score.

The simplest was to do this was to assign each level of educational attainment a value: 0 for did not complete high school, 1 for high school degree and maybe some college, and 2 for college degree or higher.

For each district I multiplied each percent by its corresponding value and added them together.

This created a single, though somewhat abstract, “education score” for each district with which I could then run a single linear regression.

educational_attainment['edu_score'] = educational_attainment.

apply( lambda row: row['Eduhsdegreeorsomecollege_rate'] + 2 * row['Educollegedegreeandhigher_rate'], axis=1)Over the entire dataset, this linear regression produced an R² value of 0.

96; however it fell into the same overfitting trap when separated into train and test sets.

Ordinal RegressionNot expecting better results about the overfitting issue, I decided to try some ordinal regression using the mord module.

So I used my educational scores to split the data into 4 categories using the mean and standard deviation.

# Get the mean and standard deviation (indices 1 and 2 respectively)statistics = educational_attainment['edu_score'].

describe()# Define my categoriesdef get_ordinal(row): if row.

loc['edu_score'] >= (statistics[1] + statistics[2]): ordinal = 1 elif row.

loc['edu_score'] >= (statistics[1]): ordinal = 2 elif row.

loc['edu_score'] >= (statistics[1] – statistics[2]): ordinal = 3 else: ordinal = 4 return ordinal educational_attainment['ordinal'] = educational_attainment.

apply( lambda row: get_ordinal(row), axis=1)To my surprise, this faired slightly better on the over-fitting when I used mord.

LogisticIT() (ordinal logistic regression using the immediate-threshhold variant).

When used on the dataset as a whole, the R² value was 0.

93.

When separated into train and test sets, the R² value of the test dataset was at least always positive, sometimes as high as 0.

5 — still nothing to brag about but quite an improvement.

An alternative way to combat this overfitting might be to find the same data but for different years, or to find the same data for more sets of neighborhoods.

However, seeing how difficult it was just to find this data, I will save that for another day.

Therefore, while it seems we can find correlations, any predictive powers of the model are essentially meaningless, so no cause and effect could plausibly be established.

The more promising path seems to be to use L2 regularized linear regression (also known as Ridge Regression) to compute feature importances to accompany our original correlations in pursuit of understanding and interpreting the data, instead of attempting to make a purely predictive model.

Other models useful for this kind of analysis are Randomized L1 (Lasso) Regression and scikit-learn’s Extra Trees Regressor.

Below are the results I found when implementing these strategies.

Ridge Regression, Randomized Lasso, and Extra TreesThese models are great tools for interpreting data because the models are stable and useful features tend to have non-zero coefficients/scores — as a result of redundancies in the algorithms and averaging trials and/or penalizing high coefficients to make sure features aren’t overrepresented.

As another layer, I used the eli5 package which includes support for sklearn to implement PermutationImportance in which, column by column, values are shuffled to see how well the model can still predict the targets.

# Using robustly scaled columns.

def get_scores(X, Y, scale_method): index_tuples = [] model_data = [] for level in Y.

columns: ridge_model = RidgeCV() et_model = ExtraTreesRegressor(n_estimators=50, bootstrap=True) randL_model = RandomizedLasso() models = [ridge_model, et_model] y = Y[level] for model in models: model.

fit(X, y) score = model.

score(X, y) model_name = f'{model}'.

split('(')[0] try: coefs = model.

coef_ except: try: importances = model.

feature_importances_ except: importances = np.

array(None) else: importances = np.

absolute(coefs) finally: perm = PermutationImportance(model).

fit(X, y) perm_importances = perm.

feature_importances_ index_tuple1 = (level, 'importances', score, model_name) index_tuple2 = (level, 'perm_importances', score, model_name) if importances.

any(): index_tuples.

append(index_tuple1) model_data.

append(importances) index_tuples.

append(index_tuple2) model_data.

append(perm_importances)When I averaged the scores from these different models, and filtered for features that were indicators of community and environmental health, I got the results below.

The most important factors of Environmental and Community Health in predicting Educational Attainment of adults in the community.

“Extremes” is an aggregation of rates for “Did Not Complete HS” and “College Degree or Higher”In continuing this research, I plan to cut down on the number of features by looking at correlations and p-values to select appropriate features in hopes of minimizing the effects of overfitting.

I also plan to continue searching for other sources of data — especially data that can show variations across time.

This is a work in progress.

As is life.

.