Utilizing UN Session Data to Better Understand Global Trends
Sanjana Rao
May 11

Introduction
The United Nations, formed in the aftermath of World War II, is one of the most significant steps taken towards creating and maintaining a global community.
Since the UN was founded in 1945, its annual sessions have served as a platform for addressing pressing global issues and, as a result, have documented major issues and trends over time.
Through this project, I wanted to explore these global trends over time using a Kaggle data set called “UN General Debates”.

Understanding the Data Set
The data set had few but meaningful features.
Each instance contained the year, the session number, the country speaking, and a transcript of the session.
The data set contained session data from 1970 to 2015, a range of 46 years.
My first step was to aggregate the sessions per year and generate a text file for each year.
I then used Sklearn’s CountVectorizer class to perform Bag of Words and output the top 10 words per year.
With this information, I used Illustrator to generate a visual depiction of the most common words per year.
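The aggregation and counting steps described above can be sketched roughly as follows. This is a minimal, dependency-free illustration rather than the code used for the project: it uses `collections.Counter` as a stand-in for Sklearn's CountVectorizer, and the record layout, the tiny stop-word list, and the sample rows are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, tiny stop-word list; a real run would use a full English list.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "for", "we", "our"}


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens of 2+ characters."""
    return re.findall(r"[a-z0-9]{2,}", text.lower())


def top_words_per_year(records, n=10, stop_words=STOP_WORDS):
    """Aggregate speeches by year, then return the n most frequent words per year.

    `records` is an iterable of (year, country, text) tuples, a simplified
    stand-in for the rows of the data set.
    """
    counts = defaultdict(Counter)
    for year, _country, text in records:
        counts[year].update(t for t in tokenize(text) if t not in stop_words)
    return {
        year: [word for word, _ in counter.most_common(n)]
        for year, counter in sorted(counts.items())
    }


# Made-up sample rows, only to show the shape of the input and output.
records = [
    (1970, "AAA", "Peace and development for the world. Development is our goal."),
    (1970, "BBB", "We call for peace and development."),
    (1971, "AAA", "Disarmament and peace; disarmament is urgent."),
]
top = top_words_per_year(records, n=2)
print(top)
```

Grouping by year before counting means each year behaves like a single long document, which matches the idea of generating one text file per year.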

Thought Process for Generating Data Visualization
I had a few main points in mind for the visual:
Readable Timeline: The viewer should be able to intuitively follow the chronology.
Focus on Trends: The main takeaway from the data visualization is that the viewer should be able to quickly and fluently understand major trends and anomalies present in the data.
Access to Data: While I wanted to focus on trends, I still wanted the viewer to be able to see the data (the most frequent words for each year) and potentially draw their own conclusions based on prior knowledge of world history or via research sparked by interesting instances present in the data.
I came up with two models:
One that sections off the data by decade: easier to digest, and it allows the audience to pick and choose the details they want to focus on.
One that contains every year between 1970 and 2015: a bit more to take in, but it does not visually section trends into arbitrary categories when, in reality, year-to-year trends are continuous in nature.
Below is the second data visualization, which covers every year between 1970 and 2015.

Design and Data Representation
Using Color to Convey Trends
Some terms appeared consistently throughout the years whereas some appeared only once or twice.
Certain terms appeared in the middle of the data set and continued to be present until the very end.
In order to highlight these major trends and outliers, I assigned colors to each of these terms to tag them.
This way a viewer can visually follow the presence and the shifting popularity of a term over time.
Terms that were consistent throughout the data set were assigned darker colors that offered just enough contrast against the background (community, rights, development).
Terms deemed noteworthy but not too significant were also assigned darker colors (reform, 2015).
Terms that emerged in the middle of the data set's timeline and then remained buzzwords were given slightly brighter color values (global, climate).
In addition, terms related to world crises and causes of human suffering were given bright and jarring colors (terrorism, nuclear).
Lastly, powerful words that were mentioned only a few times were given bright and pastel colors (millennium, hope).
The last two categories were assigned the aforementioned color types so that they stood out the most to a scanning viewer.
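The tagging scheme above can be written down as a small lookup table. This is only an illustrative sketch: the category names are my paraphrases, the style labels are descriptive placeholders rather than the actual colors, and only the example terms mentioned above are included.

```python
# Category names paraphrase the article; the style labels are descriptive
# placeholders, not the actual colors used in the Illustrator file.
CATEGORIES = {
    "consistent": {
        "style": "dark, low contrast",
        "terms": {"community", "rights", "development"},
    },
    "noteworthy": {"style": "dark", "terms": {"reform", "2015"}},
    "emerging": {"style": "slightly brighter", "terms": {"global", "climate"}},
    "crisis": {"style": "bright, jarring", "terms": {"terrorism", "nuclear"}},
    "rare_powerful": {"style": "bright pastel", "terms": {"millennium", "hope"}},
}


def style_for(term, default="untagged"):
    """Return (category, style) for a term, or (None, default) if it has no tag."""
    for category, info in CATEGORIES.items():
        if term in info["terms"]:
            return category, info["style"]
    return None, default


for term in ["rights", "terrorism", "cooperation"]:
    print(term, style_for(term))
```

Keeping the mapping in one place makes the subjective choices explicit, since moving a term between categories is a one-line change.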

Notes on Subjective Data Analysis
I would like to note some brief points on the subjective choices I made in this data representation.
Cleaning the Transcripts: While cleaning the transcripts I first removed general English stop words.
However, upon calculating the top ten words per year, I found that most of the terms in the top ten stayed the same across all 46 years.
These were all domain-specific terms.
In order to dig a little deeper into the content of the sessions, I removed many of them.
I chose which terms to remove based on how often they appeared, but the threshold separating terms frequent enough to remove from terms to keep was arbitrary.
Changing the threshold would undoubtedly change the results from Bag of Words.
Looking back on the finished work, a lower removal threshold would perhaps give a more specific and unique year-to-year data set.
A lower threshold would eliminate terms such as ‘community’, ‘rights’, and ‘development’.
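One way to make such a threshold explicit is sketched below. The rule used here is an assumption for illustration, not the rule applied in this project: a term counts as domain specific if it ranks in the top `top_k` words in at least `min_share` of the years. The function name, parameters, and sample counts are all hypothetical.

```python
from collections import Counter


def domain_terms(yearly_counts, top_k=10, min_share=0.8):
    """Return terms that rank in the top_k in at least min_share of the years."""
    years_in_top = Counter()
    for counts in yearly_counts.values():
        years_in_top.update(word for word, _ in counts.most_common(top_k))
    cutoff = min_share * len(yearly_counts)
    return {word for word, n in years_in_top.items() if n >= cutoff}


# Made-up word counts for three years, only to show the effect of the threshold.
yearly_counts = {
    1970: Counter({"nations": 9, "peace": 7, "independence": 3}),
    1971: Counter({"nations": 8, "peace": 6, "disarmament": 4}),
    1972: Counter({"nations": 9, "detente": 5, "peace": 2}),
}

strict = domain_terms(yearly_counts, top_k=2, min_share=1.0)
loose = domain_terms(yearly_counts, top_k=2, min_share=0.6)
print(strict)  # terms in the top 2 every year
print(loose)   # terms in the top 2 in at least 60% of years
```

Lowering `min_share` removes more terms, which is the trade-off described above: fewer evergreen words remain, and each year's list becomes more distinctive.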
Some words that were removed: international, nations, united, we, world, peace, countries, people, organization, session, security, economic, government
Category Assignment: My process for assigning terms to categories was very qualitative in nature.
Choices like deciding that “cooperation” is relevant and meaningful and “support” is not were thought out, yet still subjective.
Roots and Parts of Speech: In my current implementation of this project terms like “developing” and “development” are treated as the same term for simplicity and directness.
However, it is important to note that there is meaningful information encoded in the difference between the two.
Equipped with a background in linguistics, one could potentially draw conclusions about the time period based on the sentence structures that a noun or verb form of a word suggests (i.e., perhaps a higher frequency of the verb form suggests a likelier chance of taking action).
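The trade-off between merging word forms and keeping them apart can be shown with a toy example. The suffix rule below is deliberately crude and hypothetical; it is not a real stemmer and is not necessarily how the terms were merged in this project.

```python
from collections import Counter

SUFFIXES = ("ment", "ing")  # hypothetical, minimal list for illustration


def crude_root(term):
    """Strip one known suffix if at least four characters remain."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 4:
            return term[: -len(suffix)]
    return term


tokens = ["developing", "development", "development", "rights"]

merged = Counter(crude_root(t) for t in tokens)  # forms collapsed to one root
separate = Counter(tokens)                       # noun and verb forms kept apart

print(merged)
print(separate)
```

The merged counts are simpler to chart, while the separate counts preserve the noun versus verb distinction that a linguistic analysis would need.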

Interesting Data Findings
1989, environment: First mention of the environment in the top 15.
1990–1991, hope: While we saw ‘hope’ in 1988, the term reappeared in the top 15 in 1990 and 1991.
1991, democratic: ‘democratic’ appeared in the top 15.
1996, weapons: ‘weapons’ appeared in the top 15.
2001, September: This term appeared in the top 15 for 2001, suggesting that 9/11 and terrorism were among the major topics.
2004, today: The term ‘today’ made it into the top 10 for 2004. After that, ‘today’ began to appear frequently in the top 15, possibly suggesting a shift in perspective.
Climate and Sustainability: These terms began appearing frequently in the 2000s.
2015: ‘2015’ appeared more times in 2013 and 2014 than it did in 2015. This may suggest a policy taking effect in 2015 or a goal being reached in 2015. The 2015 session likely discussed the effects or findings of the 2015 agenda.
Community: The most frequently used word over the years, followed by ‘rights’.
While I included this section as interesting data findings, the fact that many of the interesting takeaways came from the top 15 suggests that removing some more buzzwords would perhaps tell a more unique story per year.
At the same time, it was interesting to see the evergreen discussions such as human rights and community remain consistently important yet fluctuate in priority on the scale.

Fun Extension
I ran a topic generation algorithm (Latent Dirichlet Allocation) on the year-to-year session data just to see what the algorithm would output as the number one topic per year.
Here are some of the highlights:
1977: new rights africa south nuclear relations time situation independence east struggle that disarmament great israel
1990: new south africa rights region europe war time hope state end crisis relations kuwait east
1995: new rights nuclear social cooperation war time work process many important member region year weapons
2000: new millennium rights summit cooperation process global state globalization time work first role poverty social
2001: terrorism new global cooperation year poverty terrorist secretary september many process work africa state rights
2009: global change crisis climate new financial that president challenges cooperation nuclear 09 time year rights
2014: global new that climate sustainable agenda 2015 rights change state year today goals post many
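For readers curious how a "number one topic" can be produced, here is a small sketch of Latent Dirichlet Allocation fitted with collapsed Gibbs sampling in plain Python. It is not the implementation used for the results above (libraries such as scikit-learn and gensim provide LDA). The hyperparameters, the toy documents, and the choice to treat each year as one document and report its dominant topic are all assumptions.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(n_topics)]   # tokens per (topic, word)
    topic_total = [0] * n_topics                        # tokens per topic
    assign = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, assign[d]):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


def dominant_topic_words(doc_topic_row, topic_word, n=5):
    """Top n words of the topic holding the most tokens in one document."""
    best = max(range(len(doc_topic_row)), key=doc_topic_row.__getitem__)
    return [w for w, c in topic_word[best].most_common(n) if c > 0]


# Toy "one document per year" corpus, made up for illustration.
docs = {
    2001: "terrorism global terrorism september cooperation terrorist".split(),
    2009: "climate crisis financial climate change global".split(),
    2014: "climate sustainable agenda goals climate change".split(),
}
doc_topic, topic_word = lda_gibbs(list(docs.values()), n_topics=2)
for year, row in zip(docs, doc_topic):
    print(year, dominant_topic_words(row, topic_word))
```

Because Gibbs sampling is stochastic, the word lists depend on the seed and the number of iterations, so output like the highlights above is best read as suggestive rather than definitive.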