Can We Use Social Media to Locate Legitimate Power Outages?
An exploration into using Natural Language Processing to classify tweets based on cosine similarity.
Jen Hill, Apr 29
We all know how disruptive power outages can be to our lives.
Not only do outages affect millions of people in our country every year; according to Blackout Tracker’s annual report, they also cost billions in lost revenue every year.
And because of that, we need better mechanisms in place for tracking them when they happen so they can be dealt with quickly.
Utilities have been adding new smart grid technologies to their Outage Management Systems to supplement traditional methods for detecting and reporting power outages.
However, the new technologies won’t be completely rolled out until 2030.
While this new technology is rolled out, supplemental efforts will be needed to help fill the gaps in identifying and reporting power outages.
A team of us decided to see if we could build a tool to help.
We wanted to see if we could use social media to correctly identify power outages.
DATA COLLECTION
Our approach was to first decide what social media platforms to pull from.
We found Twitter to be the most viable platform, both in terms of the data we would have access to and how quickly news gets shared there.
We used TwitterScraper to scrape five years of tweets matching targeted keywords, by location, across 12 cities.
While we would have preferred to use Twitter’s API here because it includes geolocations, it only provides one month of historical data.
To select our keywords, we spent time really diving into how people talk about power online.
We explored the variations in vocabulary between how energy companies and consumers post online about outages, making sure to watch for potential misclassifications (e.g., Blackout is a video game and Power is a TV show).
To narrow down the regions, we chose the ten most populated cities in the US, but had to adjust slightly which ones we scraped.
New York City pulled in quite a lot of noise and we didn’t want to train our model on that.
So we removed that city from our list and added the eleventh most populated city, Austin.
We also decided to include Detroit and Columbus because Michigan and Ohio rank high for yearly power outage rates via the Blackout Tracker reports.
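The keyword-by-city scraping plan described above can be sketched as a list of search jobs. This is only an illustration: the keyword list, the three example cities, and the 2014 to 2018 date range are assumptions (the article does not list them), and the sketch assumes Twitter's `near:` search operator for city-level filtering. The resulting query strings would then be handed to whatever scraper is in use.

```python
from datetime import date
from itertools import product

# Hypothetical inputs: the article does not list its exact keywords,
# and only some of its 12 cities are named.
KEYWORDS = ["power outage", "power is out", "without power"]
CITIES = ["Detroit", "Columbus", "Austin"]


def build_query(keyword: str, city: str) -> str:
    """Combine an exact-phrase keyword with a city-level location filter."""
    # `near:` is a Twitter search operator; it narrows to a city,
    # not to exact coordinates.
    return f'"{keyword}" near:"{city}"'


def yearly_windows(first_year: int, last_year: int):
    """Yield (begin, end) date pairs, one per calendar year (end exclusive)."""
    for year in range(first_year, last_year + 1):
        yield date(year, 1, 1), date(year + 1, 1, 1)


# One scraping job per keyword, city, and year (assumed range: 2014-2018).
jobs = [
    (build_query(kw, city), begin, end)
    for (kw, city), (begin, end) in product(
        product(KEYWORDS, CITIES), yearly_windows(2014, 2018)
    )
]

print(len(jobs))
print(jobs[0])
```

Splitting the work into one job per year keeps each scrape small enough to retry on its own if it fails.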
From reviewing the tweets we pulled, we hypothesized that weather was the biggest influencer of a power outage.
We used the NOAA API wrapper to pull historical weather data by date for our select cities.
Not all weather stations report the same data, so we focused on what was consistently collected across all states: high temperature, low temperature, and precipitation.
We converted these numbers into words based on ranges of temperature and precipitation in order to be able to append the words to our tweets.
We knew we wanted to use Natural Language Processing and needed words instead of numbers to do that.
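As a rough sketch of that number-to-word conversion, the snippet below bins the three weather values into tokens and appends them to a tweet. The thresholds, the token names, and the `append_weather` helper are all assumptions made for illustration; the article does not state the ranges it used.

```python
def temp_to_word(temp_f: float) -> str:
    """Map a Fahrenheit temperature to a coarse word (assumed cut-offs)."""
    if temp_f < 32:
        return "freezing"
    if temp_f < 55:
        return "cool"
    if temp_f < 80:
        return "warm"
    return "hot"


def precip_to_word(precip_in: float) -> str:
    """Map daily precipitation in inches to a coarse word (assumed cut-offs)."""
    if precip_in == 0:
        return "noprecip"
    if precip_in < 0.5:
        return "lightprecip"
    return "heavyprecip"


def append_weather(tweet: str, high_f: float, low_f: float, precip_in: float) -> str:
    """Append weather tokens so they become part of the tweet's vocabulary."""
    words = [
        "high" + temp_to_word(high_f),  # prefix keeps high and low distinct
        "low" + temp_to_word(low_f),
        precip_to_word(precip_in),
    ]
    return tweet + " " + " ".join(words)


print(append_weather("power is out again", high_f=28, low_f=10, precip_in=0.8))
```

The tokens are deliberately single words with no punctuation, so they survive later cleaning steps such as punctuation removal.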
EXPLORATORY DATA ANALYSIS
Once we had a complete dataset, we wanted to examine what kind of story the data we collected tells us.
To do this, we looked at frequency of words.
We removed common stopwords, along with mentions of Twitter, Instagram, and anything URL-related, because those got in the way of seeing the true picture of what the data is telling us.
In the chart below, you’ll see the top 10 single-word terms on the left, while on the right you’ll see multi-word terms.
We took a look at both because it tells a more complete story this way.
We were not surprised to see “power,” “outage,” and “without” showing up a lot since those align with our targeted webscraping keywords.
But digging deeper, we can also see that we’ve pulled in a lot of what we think of as two kinds of messaging: customer complaints and energy companies reporting on customer outages.
Frequency of Words in our Tweet Dataset
We also found it interesting that Detroit showed up among the top terms, because Michigan is one of the top states for the number of power outages experienced each year.
Also, note that “internet outage” shows up high here and that’s not a power outage.
Its appearance helped us identify a misclassification type that we kept in mind when moving on to our model.
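The single-word and multi-word frequency counts discussed in this section can be approximated with a few lines of standard-library Python. The tweets and the stopword list below are made up for illustration and are not taken from the project's dataset.

```python
from collections import Counter

# Hypothetical stopword list; platform names are included as the article describes.
STOPWORDS = {"the", "a", "in", "is", "of", "to", "and", "twitter", "instagram"}

# Made-up, already-cleaned example tweets.
tweets = [
    "power outage in detroit thousands without power",
    "still without power in detroit",
    "internet outage again is anyone else down",
]


def content_tokens(text: str) -> list:
    """Split on whitespace and drop stopwords."""
    return [tok for tok in text.split() if tok not in STOPWORDS]


def bigrams(tokens: list) -> list:
    """Pair each token with its successor (after stopword removal)."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


unigram_counts = Counter()
bigram_counts = Counter()
for tweet in tweets:
    tokens = content_tokens(tweet)
    unigram_counts.update(tokens)
    bigram_counts.update(bigrams(tokens))

print(unigram_counts.most_common(10))
print(bigram_counts.most_common(10))
```

Because the bigrams are built after stopwords are dropped, a pair can join two words that were not adjacent in the original tweet. Even in this toy data, "internet outage" shows up as its own term, which is the kind of signal that helps spot non-power outages.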
MODELING
Based on what we learned when reviewing our data, we set out to prepare the text for modeling by lowercasing all words and removing punctuation and stopwords.
We also removed complete URL strings before tokenizing, so that no URL fragments, especially from link shorteners, were left behind as tokens.
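A minimal sketch of these cleaning steps is shown below. The regular expression, the stopword list, and the `preprocess` name are illustrative assumptions rather than the project's actual code. The point it demonstrates is the ordering: URLs are stripped while they are still intact, before punctuation removal can break them into stray tokens.

```python
import re
import string

# Assumed URL pattern: full links, bare www links, and Twitter image links.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+|pic\.twitter\.com/\S+")

# Hypothetical stopword list for illustration.
STOPWORDS = {"a", "an", "the", "is", "in", "on", "of", "to", "and"}

# Translation table that deletes ASCII punctuation (curly quotes are not covered).
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(tweet: str) -> list:
    """Lowercase, drop URLs, strip punctuation, then remove stopwords."""
    text = tweet.lower()
    text = URL_PATTERN.sub(" ", text)   # remove whole URLs first
    text = text.translate(PUNCT_TABLE)  # then strip punctuation
    return [tok for tok in text.split() if tok not in STOPWORDS]


print(preprocess("Power is OUT in Austin!! Updates: https://bit.ly/2abcXYZ"))
```

If punctuation were removed first, the shortened link would collapse into a single junk token such as `httpsbitly2abcxyz` and end up in the vocabulary.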
For modeling, we chose to use Word2Vec because of the way it focuses on the relationship of words and gives weight to that value.
It brings context of word choices into play, which will give us a better understanding of the group of words used in a tweet to talk about a power outage.
While machines have no problem working in high-dimensional space, we needed to reduce the Word2Vec vectors to two dimensions using a t-SNE model in order to examine and understand them.
Then we used cosine similarity to compare our words to targeted keyword lists to help determine if the word was related to a legitimate power outage or not.
Cosine similarity score results were manually reviewed to find misclassifications and tune our model.
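To make the cosine similarity step concrete, here is a small sketch. It uses hand-written three-dimensional toy vectors in place of real Word2Vec embeddings, and the two keyword lists and the "higher mean similarity wins" rule are assumptions for illustration; the article does not spell out its exact scoring rule.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors standing in for learned embeddings (not real model output).
vectors = {
    "outage": [0.9, 0.1, 0.0],
    "transformer": [0.8, 0.2, 0.1],
    "tornadoes": [0.7, 0.3, 0.0],
    "game": [0.0, 0.2, 0.9],
    "show": [0.1, 0.1, 0.8],
    "season": [0.1, 0.3, 0.9],
}

# Hypothetical keyword lists for the two classes.
OUTAGE_KEYWORDS = ["outage", "transformer"]
NOT_OUTAGE_KEYWORDS = ["game", "show"]


def mean_similarity(word: str, keywords: list) -> float:
    """Average cosine similarity between a word and a keyword list."""
    scores = [cosine_similarity(vectors[word], vectors[k]) for k in keywords]
    return sum(scores) / len(scores)


def classify(word: str):
    """Return (label, score) for whichever keyword list the word is closer to."""
    outage_score = mean_similarity(word, OUTAGE_KEYWORDS)
    other_score = mean_similarity(word, NOT_OUTAGE_KEYWORDS)
    if outage_score >= other_score:
        return "outage", outage_score
    return "not_outage", other_score


for word in ("tornadoes", "season"):
    print(word, classify(word))
```

With a trained model, the toy dictionary would be replaced by lookups into the learned word vectors, and the returned score could drive something like the dot size in the plot.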
The chart pictured below shows the t-SNE plot, where we’ve selected the word “tornadoes”.
The dots in blue represent words related to the power being out, while gray dots are associated with it not being a legitimate power outage.
The size of the dot also shows us how high its cosine similarity score is.
Larger dots have larger scores, meaning a stronger relationship to their classification.
The interactive HTML file of this plot can be downloaded on GitHub.
EVALUATION
To evaluate our model’s performance, we had to manually confirm whether each tweet was classified correctly.
Ideally, we’d be able to go through all the tweets to confirm this but we ran out of time.
We reviewed one thousand tweets and found six misclassifications.
We’d need to review more to confirm whether this misclassification rate is what to expect or an anomaly.
From there we’d need to fine-tune the model to reduce misclassifications.
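One way to see how much a 1,000-tweet sample can tell us is to put an interval around the observed rate. The sketch below uses the Wilson score interval, a standard method for proportions. This is an added illustration, not something the team reports doing, and it assumes the reviewed tweets were a random sample.

```python
import math


def wilson_interval(errors: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


errors, reviewed = 6, 1000  # figures from the manual review
rate = errors / reviewed
low, high = wilson_interval(errors, reviewed)

print(f"observed misclassification rate: {rate:.1%}")
print(f"approximate 95% interval: {low:.2%} to {high:.2%}")
```

The interval comes out at roughly 0.3% to 1.3%, which is wide compared with the observed 0.6%. That supports the point that more tweets would need to be reviewed before treating the rate as typical.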
When we started this process, we believed that weather played a large role in power outages, and it does.
However, we also discovered that the word “snakes” scored high for flagging legitimate power outages.
And that’s how we learned that snakes can climb into transformers and cause power outages.
If you are interested in other oddities that cause power outages, you should check out Blackout Tracker’s annual reports.
It’s a fun rabbit hole to head down.
MAPPING
Since we had to use a scraper for collecting Twitter data, we were unable to pull locations that were more detailed than a city.
Plotting that on a map means we’d have a lot of dots on top of each other, which is not ideal.
However, we thought it would be interesting to plot outages by time for our selected cities.
The image below shows an example with one city, Detroit, highlighted.
You can see that in January 2014, there were 24 tweets related to power outages, based on the data we scraped.
The interactive HTML file of this map can be downloaded on GitHub.
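The per-city, per-month counts behind a map like this can be produced with a simple aggregation. The records, field names, and timestamp format below are made up for illustration; only tweets already flagged as legitimate outages are counted.

```python
from collections import Counter
from datetime import datetime

# Made-up classified tweets; the field names are hypothetical.
records = [
    {"city": "Detroit", "timestamp": "2014-01-06 18:22:10", "is_outage": True},
    {"city": "Detroit", "timestamp": "2014-01-07 07:45:03", "is_outage": True},
    {"city": "Detroit", "timestamp": "2014-01-09 12:00:00", "is_outage": False},
    {"city": "Detroit", "timestamp": "2014-02-02 21:15:44", "is_outage": True},
    {"city": "Austin", "timestamp": "2014-01-15 16:30:00", "is_outage": True},
]


def month_key(timestamp: str) -> str:
    """Reduce a timestamp to its year and month, e.g. '2014-01'."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return parsed.strftime("%Y-%m")


# Count outage tweets per (city, month) pair.
monthly_counts = Counter(
    (rec["city"], month_key(rec["timestamp"]))
    for rec in records
    if rec["is_outage"]
)

for (city, month), count in sorted(monthly_counts.items()):
    print(city, month, count)
```

Each (city, month) count can then be attached to the city's single map marker, which avoids stacking many dots on the same location.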
CONCLUSION AND NEXT STEPS
In conclusion, we were able to successfully create a prototype that can classify whether a tweet refers to a legitimate power outage.
However, there are some limitations to it that would need to be addressed before rolling it out to classify more data.
Because we had to use a Twitter scraper instead of Twitter’s API, our location data is not as exact as it could be.
And even if we could use the API, not all users list location data on their Twitter profile out of privacy concerns.
Plus, evaluating accuracy is a manual process, so rolling the tool out countrywide would demand resources and time for review.
However, we do see great possibilities in a more widespread rollout, based on live data.
Had we more time, the next steps we would have taken, and recommend considering, are the following:
Test the model on live data: tweets from Twitter’s API and weather data from Dark Sky’s API.
Use K-means clustering to group power outage findings to better be able to confirm an entire area is without power.
Sectioning out clusters by region and by weather would be interesting to look at as well.
We only used the weather options that were consistently available for all our select cities, but looking in more detail at other options (e.g., wind speed) would be beneficial here.
Explore other dimensionality/data reduction methods besides t-SNE, such as using principal component analysis before data preprocessing.
Full code for this project is available on GitHub.
Cheers!