Identifying regional differences of Perinatal Mental Health indicators in the UK with R

Identifying regional differences of Perinatal Mental Health indicators in the UK with RMatt.

0BlockedUnblockFollowFollowingApr 25The fingertipsR package provides an easy interface to access the fingertips API.

This repository contains a large variety of public health indicators managed by Public Health England.

I will be focusing on the data related to Perinatal Mental Health as our laboratory is interested in (among other things) the epigenetic embedding of early adversity.

Perinatal mental health problems are those which occur during pregnancy or in the first year following the birth of a child.

Perinatal mental illness affects up to 20% of women.

Perinatal mental health problems can also have long-standing effects on the children’s emotional, social and cognitive development.

If left untreated, it can have significant and long lasting effects on the woman and her family.

I thought it would be interesting to use Principle Component Analysis and Partitioning around medioids on monitoring indicators to identify regional variations that could provide improvements for control in those areas (like has been done with TB).

# load the needs package to load a bunch of libraries quickly (or install them if you don't already have them)library(needs)needs(purrr, dplyr, tibble, fingertipsR, tidyr, magrittr, FactoMineR, impute, DT, broom, stringr, skimr, cluster, ggplot2, ggfortify, viridis, hrbrthemes, ggthemes, cluster, maptools, gpclib )Let’s find out what is within the fingertips APIThe profiles() function is used to search profiles — indicators related to a specific disease or risk factor.

profs <- profiles()DT::datatable(profs, rownames= FALSE)As mentioned before I’m interested in Perinatal Mental Health so I’ll select those profiles.

sel_profs <- profs[grepl("Perinatal Mental Health", profs$ProfileName),]DT::datatable(sel_profs, rownames = FALSE)There are four Domains related to Perinatal Mental Health.

Within these domains are a number of indicators presented at different time periods, geographies, sexes, age ranges etc.

peri_inds <- indicators(ProfileID = sel_profs$ProfileID)DT::datatable(peri_inds, rownames= FALSE)This gives us 49 Perinatal mental health indicators, which we can now extract using the fingertips_data() function combined with a call to purrr::map().

peri_df <- peri_inds$IndicatorID %>% map(~fingertips_data(IndicatorID = .

))This results in 49 tibbles; however, many are empty and need to be removed.

I will use dimension reduction (PCA) and clustering (PAM) on these 49 indicators to generate clusters of counties with similar characteristics of Perinatal Mental Health indicators.

This may help to identify regional variation and possibly a future framework for improvements in research and control efforts.

I will choose Depression: % recorded prevalence (aged 18+) from the Risk & related factors domain as the key indicator for which all others will be matched.

The following code extracts this at the county-level, re-codes the value variable as recent incidence rates, and pulls the overall incidence of cases.

I filtered out missing data and adjusted the time period to represent the final year for each rolling average.

peri_inc <- as_tibble(peri_df[[10]]) %>% dplyr::filter(AreaType %in% "County & UA") %>% dplyr::select(AreaName, Timeperiod, rec_inc_rate = Value) %>% mutate(Timeperiod = Timeperiod %>% str_split("/") %>% map_chr(first) %>% as.

numeric() %>% {.

+ 2} %>% as.

character())peri_df_extraction <- function(peri_df, var_name, area_type = "County & UA") { df <- peri_df %>% filter(AreaType %in% area_type) %>% dplyr::select(AreaName, Value, Timeperiod) %>% rename_at(.

vars = vars(Value), funs(paste0(var_name))) return(df)}var_names <- c("fertility_rate", "booking_under_20", "reproductive_age", "non_UK", "birth_under_20", "booking_over_40", "booking_BAME", "poverty", "IDACI", "homelessness", "contraceptions_under_18", "single_parent", "sever_mental_illness", "drug_treatment", "alcohol_treatment", "child_protection", "child_in_need", "infant_mortality", "child_looked_after", "sole_birth", "complex_social_factors", "multiparity", "caesarean", "preterm", "stillbirth", "abuse", "skin_contact", "distress_lb", "distress_ub", "mild_depressive_lb", "mild_depressive_ub", "chronic_SMI", "postpartum_psychosis", "PTSD", "severe_depressive", "booking_detection", "booking_substance", "booking_support", "booking_alcohol", "booking_complex_social_factors", "booking_early", "less_14_days", "review_8_weeks", "review_12_months", "antenatal_MH_detection", "postnatal_MH_detection", "postnatal_emotional_change", "postnatal_MH_support" )extracted_tb <- map2(peri_df[-10], var_names, ~peri_df_extraction(.

x, .

y)) %>% reduce(full_join, by = c("AreaName", "Timeperiod"))com_peri_df <- peri_inc %>% left_join(extracted_tb, by = c("AreaName", "Timeperiod")) %>% mutate(year = Timeperiod %>% as.

numeric) %>% mutate_if(is.

character, as.

factor) %>% dplyr::select(-Timeperiod)skimr::skim(com_peri_df)We need to check completeness of the data:get_frac_missing <- function(df) { df %>% nest() %>% mutate(missing = map(data,~map_dfr(.

,~sum(is.

na(.

))/length(.

)))) %>% dplyr::select(-data) %>% unnest(missing) }## Get the proportion missing per variableby yearperi_miss_per_year <- com_peri_df %>% group_by(year) %>% get_frac_missing %>% mutate_all(~round(.

, 2)) %>% arrange(year)## Drop full missing variablesperi_partial_miss_year <- peri_miss_per_year %>% select_if(~!sum(.

) == length(.

))## Full missing variablescom_miss_vars <- setdiff(names(peri_miss_per_year), names(peri_partial_miss_year))## Which year has the most complete dataperi_complete_years_all_vars <- com_peri_df %>% group_by(year) %>% nest() %>% mutate(missing = map(data,~mean(colSums(is.

na(.

))/nrow(.

)))) %>% dplyr::select(-data) %>% unnest(missing) %>% mutate(missing = round(missing, 2)) %>% arrange(year)DT::datatable(peri_complete_years_all_vars, rownames = FALSE)Usually I would go with the most recent data available; however, in this case it’s quite sparse so let’s look at the data from 2015.

peri_df_2015 <- janitor::remove_empty(com_peri_df) %>% filter(year == 2015) %>% filter(rec_inc_rate > 7.

39) %>% select(-year)DT::datatable(peri_df_2015, rownames = FALSE)This leaves us with a tidy dataset on 8 maternal mental health indicators for 38 counties from 2015.

Dimension reductionWe are now ready to conduct some clustering analysis on the data.

The first step is to reduce the dimensionality of the data using PCA.

We use the estim_ncp function (which uses a method outlined in this paper from the FactoMineR package to estimate the number of principal components required.

We then perform PCA and plot the variance explained by each component as a check on estim_ncp.

All of the following analysis is done using nested tibbles and so can be easily generalized to higher dimensional use cases.

Let’s calculate the optimal number of principal components is using the estim_ncp function, perform PCA using the prcomp function and plot the variance explained by each component as a check on estim_ncp.

df <- as.

data.

frame(peri_df_2015)FactoMineR::estim_ncp(df[2:9], ncp.

min = 2, ncp.

max = 10) The optimal number of components is 4.

peri_pca <- peri_df_2015 %>% nest() %>% mutate( numeric_data = map(data, ~select_if(.

, is.

numeric) %>% as.

data.

frame()), optimal_pca_no = map(numeric_data, ~estim_ncp(.

, scale = TRUE, ncp.

min = 2, ncp.

max = 6)) %>% map_dbl(~.

$ncp), pca = map(numeric_data, ~prcomp(.

x, center = TRUE, scale = TRUE)), pca_data = map(pca, ~.

$x), pca_aug = map2(pca, data, ~augment(.

x, data = .

y)))## Variance explainedvar_exp <- peri_pca %>% dplyr::select(-optimal_pca_no) %>% unnest(pca_aug) %>% summarize_at(.

vars = vars(contains("PC")), .

funs = funs(var)) %>% gather(key = pc, value = variance) %>% mutate(var_exp = variance/sum(variance) * 100, cum_var_exp = cumsum(var_exp), pc = str_replace(pc, ".

fitted", "") %>% str_replace("PC", ""))## Plot variance explainedvar_exp %>% rename( `Variance Explained` = var_exp, `Cumulative Variance Explained` = cum_var_exp ) %>% gather(key = key, value = value, `Variance Explained`, `Cumulative Variance Explained`) %>% mutate(key = key %>% factor(levels = c("Variance Explained", "Cumulative Variance Explained"))) %>% mutate(value = value / 100) %>% mutate(pc = factor(pc, levels = as.

character(1:max(var_exp$pc %>% as.

numeric)))) %>% ggplot(aes(pc, value, group = key)) + geom_point(size = 2, alpha = 0.

8) + geom_line(size = 1.

1, alpha = 0.

6) + facet_wrap(~key, scales = "free_y") + theme_bw() + labs( title = "Variance Explained by Principal Component", subtitle = paste0("The optimal number of principal components suggested by estim_ncp was ", peri_pca$optimal_pca_no, " which explains ", round(var_exp$cum_var_exp[[2]], 0), "% of the data.

"), x = "Principal Component", y = "Variance Explained (%)", caption = "@MattOldach Source: Public Health England (fingertipsR)" )## Perform pam on pca data 1 to 6 groupsperi_pca_pam <- peri_pca %>% mutate(centers = list(2:10)) %>% unnest(centers, .

preserve = everything()) %>% dplyr::select(-centers, centers = centers1) %>% group_by(centers) %>% mutate( pam = map(pca_data, ~ pam(x = .

x[, 1:optimal_pca_no], k = centers, stand = TRUE)), clusters = map(pam, ~.

$clustering), avg_silhouette_width = map(pam, ~.

$silinfo$avg.

width), data_with_clusters = map2(.

x = data, .

y = clusters, ~mutate(.

x, cluster = factor(.

y, ordered = TRUE))) ) %>% ungroup## Get max silhouette widthmax_silhouette_width <- peri_pca_pam %>% dplyr::select(centers, avg_silhouette_width) %>% unnest(avg_silhouette_width) %>% arrange(desc(avg_silhouette_width)) %>% slice(1) ## Plot average silhouette widthperi_pca_pam %>% dplyr::select(centers, avg_silhouette_width) %>% unnest(avg_silhouette_width) %>% ggplot(aes(x = centers, y = avg_silhouette_width)) + geom_line(size = 2, alpha = 0.

4) + geom_point(size = 3, alpha = 0.

8) + theme_bw() + scale_x_continuous(breaks = seq(1, 10, 1), minor_breaks = NULL) + scale_y_continuous(limits = c(NA, NA), breaks = seq(0, 1, 0.

01), minor_breaks = NULL) + labs(title = "Average Silhouette Width by Number of PAM Clusters", subtitle = paste0("The optimal number of clusters identifed by avg.

silhouette width was ", max_silhouette_width$centers, " with an avg.

silhouette width of ", round(max_silhouette_width$avg_silhouette_width, 2) ), x = "Clusters", y = "Avg.

Silhouette Width", caption = "@MattOldach Source: Public Health England (fingertipsR)")## Plot clusterspca_plot <- peri_pca_pam %>% filter(centers == max_silhouette_width$centers) %>% select(data_with_clusters, pca) %>% mutate(pca_graph = map2(.

x = pca, .

y = data_with_clusters, ~ autoplot(.

x, x = 1, y = 2, loadings = TRUE, loadings.

label = TRUE, loadings.

label.

repel = TRUE, loadings.

label.

size = 2, loadings.

alpha = 0.

8, loadings.

label.

vjust = -1, data = .

y, label = TRUE, label.

label = "AreaName", label.

size = 1.

5, label.

vjust = -1, alpha = 0.

3, frame = TRUE, frame.

type = 'convex', frame.

alpha= 0.

05, colour = "cluster", size = "rec_inc_rate") + theme_bw() + labs(x = paste0("Principal Component 1 (Variance Explained: ", round(var_exp$var_exp[[1]], 1), "%)"), y = paste0("Principal Component 2 (Variance Explained: ", round(var_exp$var_exp[[2]], 1), "%)")) + guides(colour=guide_legend(title = "Cluster", ncol = 2), fill=guide_legend(title= "Cluster", ncol = 2), size = guide_legend(title = "Depression: % recorded prevalence", ncol = 2)) + scale_color_manual(name = "Cluster", values = c("#6b5b95", "#feb236", "#d64161", "#ff7b25", "#87bdd8")) + scale_fill_manual(name = "Cluster", values = c("#6b5b95", "#feb236", "#d64161", "#ff7b25", "#87bdd8")) + theme(legend.

position = "bottom", legend.

box = "horizontal") + labs( title = "Maternal Mental Health in England", subtitle = "The arrows are variable loadings and points are counties coloured by cluster membership", caption = "@MattOldach Source: Public Health England (fingertipsR)" ) )) %>% pull(pca_graph) %>% firstpca_plotsum_peri_df <- peri_pca_pam %>% filter(centers == max_silhouette_width$centers) %>% pull(data_with_clusters) %>% map(~ gather(.

, key = "Variable", value = "value", -AreaName, -cluster)) %>% first %>% rename(Cluster = cluster)plot_cluster_diff <- sum_peri_df %>% ggplot(aes(x = Variable, y = value, col = Cluster, fill = Cluster)) + geom_violin(draw_quantiles = c(0.

025, 0.

5, 0.

975), alpha = 0.

2, scale = "width") + geom_jitter(position = position_jitterdodge(), alpha = 0.

3) + coord_flip() + theme_minimal() + scale_y_continuous(breaks = seq(0, 100, 6), minor_breaks = NULL) + scale_fill_manual(name = "Cluster", values = c("#6b5b95", "#feb236", "#d64161", "#ff7b25", "#87bdd8")) + scale_color_manual(name = "Cluster", values = c("#6b5b95", "#feb236", "#d64161", "#ff7b25", "#87bdd8")) + theme(legend.

position = "bottom") + labs( title = "Maternal Mental Health in England; Summarised by Cluster", subtitle = "Violin plots are scaled by width, with the 2.

5%, 50% and 97.

5% quantiles shown.

", x = "Variable", y = "Recorded % depression prevalance (aged 18+) for rec_int_rate, otherwise proportion (0-100%)", caption = "@MattOldach Source: Public Health England (fingertipsR)")plot_cluster_diffFrom these plots we see that the thrid cluster showing a higher recent increase rate has a highger proportion of postpartum psychosis, mild depression and distress.

We can plot the cluster membership for each county on a map with the outlines of the counties in England from the UK data service (download the ShapeFiles here and here).

peri_cluster_map <- function(peri_pca_pam) {gpclibPermit()england_counties <- rgdal::readOGR("england_ct_2011.

shp") %>% fortify(region = "code") %>% as_tibbleengland_urban_areas <- rgdal::readOGR("england_urb_2001.

shp") %>% fortify(region = "name") %>% as_tibble %>% filter(id %in% c("Sheffield Urban Area", "Plymouth", "Skelton (Redcar and Cleveland)", "Brighton/Worthing/Littlehampton", "Leicester Urban Area" ))## Make custom positions for urban area labelsurban_area_labels <- england_urban_areas %>% group_by(id) %>% slice(100) %>% ungroup() %>% mutate(long = long – 200000, lat = lat + 20000) peri_cluster_results <- peri_pca_pam %>% filter(centers == max_silhouette_width$centers) %>% pull(data_with_clusters) %>% firstperi_cluster_results <- peri_df[[14]] %>% dplyr::select(AreaName, AreaCode, AreaType) %>% filter(AreaType %in% "County & UA") %>% unique %>% left_join(peri_cluster_results, by = "AreaName") %>% left_join(england_counties, by = c("AreaCode" = "id"))peri_cluster_results %>% rename(Cluster = cluster) %>% drop_na(Cluster) %>% dplyr::select(long, lat, Cluster, group) %>% ggplot( aes(x = long, y = lat, fill = Cluster)) + geom_polygon(data = england_urban_areas, aes(group = group, fill = NULL), alpha = 0.

4) + geom_path(data = peri_cluster_results, aes(group = group, fill = NULL), alpha = 0.

4) + geom_polygon(data = peri_cluster_results, aes(group = group, fill = NULL), alpha = 0.

1) + geom_polygon(aes(group = group), alpha = 0.

6) + geom_line(data = urban_area_labels %>% bind_rows(urban_area_labels %>% mutate(long = long + 200000, lat = lat – 20000)), aes(fill = NA, group = id), alpha = 0.

8) + geom_label(data = urban_area_labels, aes(label = id), fill = "grey") + scale_fill_manual(name = "Cluster", values = c("#6b5b95", "#feb236", "#d64161", "#ff7b25", "#87bdd8")) + coord_equal() + theme_map() + theme(legend.

position = "bottom") + labs(title = "Maternal Mental Health Indicators; Map of County Level Clusters in England", subtitle = "Using data from 2015", caption = "Selected urban areas are shown (dark grey) and labelled.

@MattOldach Source: Public Health England (fingertipsR)Contains National Statistics data © Crown copyright and database right 2019.

Contains OS data © Crown copyright and database right 2019")}plot_peri_cluster_map <- peri_cluster_map(peri_pca_pam)plot_peri_cluster_map.. More details

Leave a Reply