A Basic Python Tweet Class
Simple strategies for processing tweet data
Matthew Clifford
May 27
Photo by Ray Hennessy on Unsplash

Motivations
Twitter is an amazing source of data with all kinds of opportunities for analysis.
NLTK, spaCy, and other Python NLP tools have many powerful, applicable features, and pandas makes it easy to wrangle tabular data.
Still, there are some challenges.
Tweets, while short, often contain special elements common to social media like handles, hashtags, links and emojis.
Depending on the data idiosyncrasies (e.g., JSON from a Twitter API vs. a CSV from an intermediary source), these elements may be untagged and may also push the text field far over the ostensible 280-character limit.
Available tools may not have Python implementations and it may not be easy to assess how much time might be required to fit them into a Python workflow.
Depending on our objectives, special elements can sometimes just be set aside, leaving the ‘static’ text ready for a straightforward NLP pipeline.
Often, however, these elements are grammatically active or integrated in creative ways which we might want to preserve.
Perhaps they contain relevant signal, perhaps not; either way, we might want to keep our options open.
When I started analyzing tweet text data, I was surprised to find that no single tool seemed to accomplish all of the seemingly simple tasks I had planned.
Ultimately I decided that in order to combine the functionality I wanted in an organized, reusable way, writing a class was a reasonable step.
Without a doubt there are more sophisticated approaches out there, but when learning there is often value in doing things from scratch.
Below, we will walk through this process and discuss some of the design considerations that arise.
Writing a class allows us to work in pure Python, to carefully consider what functionality we care about, and to implement cheap, simple versions of methods which can be extended later.
While Jupyter notebooks have many fantastic features, they tend to encourage a linear, one-off style of coding.
For example, it is common to create cells that produce errors or unwanted effects if run more than once, or to destructively edit or re-assign objects to conserve memory or avoid creating many versions.
Working with classes brings out the more object-oriented side of Python and lends itself to more persistent, explicit ways of interacting with data.
It’s not necessarily better, but it’s worth exploring.

Getting Started
First, let’s define a basic Python class.
We may want to create an instance without necessarily incurring the cost of invoking all the methods, so let’s provide a ‘lazy’ initialization with a separate fit() method to spin up richer features.
Some of our design decisions may depend on how much we are going to rely on pandas and other tools.
For example, when using the pandas .apply() method, it is often easiest (though not essential) to have no required parameters besides the content of the field in question, so we may later decide we do want to fit by default.
Since we are writing our own class, we can make this change easily.
For now, we can just make fitting on initialization optional.
We will also make the tweet_id optional, and let an instance report whether it has been fit, mirroring the convention used in scikit-learn and elsewhere of marking post-fit attributes with a terminal underscore.
(is_fit_ is a bit of an exception, as it is available, and equal to False, before fitting, but we’ll let it stand for now.)
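The skeleton described above might look something like the following. This is a minimal sketch, not the original code: the attribute name `text`, the `fit` keyword argument, and the placeholder `tokens_` attribute are assumptions made for illustration.

```python
class Tweet:
    """Minimal tweet container with optional ('lazy') fitting."""

    def __init__(self, text, tweet_id=None, fit=False):
        self.text = text            # raw tweet text (attribute name is an assumption)
        self.tweet_id = tweet_id    # optional identifier
        self.is_fit_ = False        # available, and False, before fitting
        if fit:
            self.fit()

    def fit(self):
        """Populate the richer post-fit attributes (names end in an underscore)."""
        # Placeholder work only: a fuller version would extract hashtags, handles, etc.
        self.tokens_ = self.text.split()
        self.is_fit_ = True
        return self


lazy = Tweet("Hello #world")                           # cheap: nothing computed yet
eager = Tweet("Hello #world", tweet_id=42, fit=True)   # fit on initialization
print(lazy.is_fit_, eager.is_fit_)
```

Returning `self` from fit() is a small convenience that allows chaining such as `Tweet(text).fit()`.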

Filling out methods
Now that we have our basic functionality sketched out, let’s implement our methods.
These functions will use the re and emoji libraries imported above.
With the exception of the url regex with source link, these are crude methods depending entirely on the presence of initial ‘#’ and ‘@’ characters.
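As a rough illustration of how crude such marker-based methods are, here is a sketch using only the standard-library re module. The patterns and function names are assumptions; in particular, the URL pattern is a deliberately simple stand-in rather than the sourced regex mentioned above, and emoji detection is left out because it would depend on the third-party emoji package.

```python
import re

# Crude patterns that rely entirely on the leading '#' and '@' markers.
HASHTAG_RE = re.compile(r"#\w+")
HANDLE_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")  # simple stand-in, not a robust URL matcher


def find_hashtags(text):
    """Return every '#'-prefixed token in order of appearance."""
    return HASHTAG_RE.findall(text)


def find_handles(text):
    """Return every '@'-prefixed token in order of appearance."""
    return HANDLE_RE.findall(text)


def find_urls(text):
    """Return every substring that starts with http:// or https://."""
    return URL_RE.findall(text)


sample = "RT @somehandle: loving the #python talk https://example.com/a #NLP"
print(find_hashtags(sample))
print(find_handles(sample))
print(find_urls(sample))
```

Patterns this simple will misfire on inputs such as email addresses or URLs containing a '#' fragment, which is part of why they should be treated as a first draft.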
This observation may seem painfully obvious (after all, that’s just how hashtags and handles are) but we may encounter additional, unmarked occurrences of these elements (e.g., “RT @somehandle: somehandle claims…”).
These marked/unmarked occurrences will usually but not necessarily occur within the same tweet.
Why does this matter?
Broadly, it matters because tweets are short and we want to handle their content with care.
More specifically, if we train a model in spaCy, it matters for how we tag these terms (e.g., Out of Vocab vs. a specific Part of Speech).
As another example, if we pursue signal within some elements, for instance by using a statistical method to tokenize multi-word hashtags (which could sometimes be complete sentences and usually lack a delimiter), we will want to have the same facility with unmarked tags.
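One hedged way to look for the unmarked occurrences discussed above is to strip the marker from each known element and search for the bare form. The function name `find_unmarked` and the choice of case-insensitive matching are assumptions for illustration, not part of the original class.

```python
import re


def find_unmarked(text, marked_elements):
    """Return bare occurrences of known handles/hashtags that lack '@' or '#'."""
    found = []
    for element in marked_elements:
        bare = element.lstrip("@#")
        if not bare:
            continue
        # The lookbehind rejects matches preceded by a marker or a word character;
        # the lookahead rejects matches that continue into a longer word.
        pattern = re.compile(
            r"(?<![@#\w])" + re.escape(bare) + r"(?!\w)", re.IGNORECASE
        )
        found.extend(pattern.findall(text))
    return found


text = "RT @somehandle: somehandle claims the #data is fine"
print(find_unmarked(text, ["@somehandle", "#data"]))
```

Because the list of marked elements is passed in separately, it can come from other tweets as well, which covers the case where marked and unmarked forms do not share a tweet.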
While avoiding try/except wrappers may be desirable in general, we’ll use them here to avoid failures when processing large lists of tweets.
Finally, we’ll include a simple status report method to show the values for each attribute as available.
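The following sketch shows one way the try/except guard and a status report could fit together. The class body, the attribute names, and the decision to catch only TypeError are assumptions; the report here simply returns whichever attributes currently exist.

```python
import re


class Tweet:
    def __init__(self, text, tweet_id=None):
        self.text = text
        self.tweet_id = tweet_id
        self.is_fit_ = False

    def fit(self):
        try:
            self.hashtags_ = re.findall(r"#\w+", self.text)
            self.handles_ = re.findall(r"@\w+", self.text)
            self.is_fit_ = True
        except TypeError:
            # Non-string input (e.g. a missing value) leaves the instance unfit
            # instead of halting a long batch run.
            self.is_fit_ = False
        return self

    def report(self):
        """Return a {name: value} dict of whatever attributes exist so far."""
        return dict(vars(self))


tweets = [Tweet(t).fit() for t in ["hi @a #b", None]]
for tweet in tweets:
    for name, value in tweet.report().items():
        print(f"{name}: {value!r}")
```

Catching a narrow exception type keeps genuinely unexpected failures visible while still letting a batch continue past bad rows.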
Some additional design considerations are:

These four methods all have the same form (generate a sublist from a pattern), and so could be combined into a more general form.
Here we might consider whether we want to be able to call them separately, bypassing our one-stop fit() method, or modify them separately as needs arise.

Do we want our class instances to store the results of these methods at all, or do we want to call them on demand and perhaps just expose them as attributes, e.g. hashtags_ = self.find_hashtags()?
If we are going to store these outputs in dataframes, we probably want to avoid storing them elsewhere.

Do we want to initialize all attributes explicitly in our __init__() or fit() methods, or just let them emerge from the functions that populate them?
This is more of a readability/consistency consideration.

Each of these considerations may be minor taken separately, but these kinds of tradeoffs impact overall usability.
In most cases the local optimum probably matters less than stylistic consistency.
We are not going to attempt PEP8 compliance here, but when writing code for reuse, it is definitely worth keeping in mind.
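The first two considerations above can be sketched together: one general pattern-driven method, thin wrappers that preserve the separate entry points, and a property that computes its value on demand instead of storing it. The `PATTERNS` table and the method name `find` are hypothetical.

```python
import re

# Hypothetical pattern table; the regexes are the same crude marker-based ones.
PATTERNS = {
    "hashtags": r"#\w+",
    "handles": r"@\w+",
    "urls": r"https?://\S+",
}


class Tweet:
    def __init__(self, text):
        self.text = text

    def find(self, kind):
        """General form: generate a sublist of matches for a named pattern."""
        return re.findall(PATTERNS[kind], self.text)

    def find_hashtags(self):
        """Thin wrapper so hashtags can still be requested (or modified) separately."""
        return self.find("hashtags")

    @property
    def hashtags_(self):
        """Computed on demand rather than stored on the instance."""
        return self.find_hashtags()


tweet = Tweet("Reading about #python and #NLP with @somehandle")
print(tweet.hashtags_)
print(tweet.find("handles"))
print("hashtags_" in vars(tweet))
```

The tradeoff is recomputation on every access; for short tweets that cost is small, but caching may be worth considering if the patterns become expensive.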

Cleaning up text
Now that we can identify special elements (have we overlooked any possibilities?), let’s generate a ‘clean’ version of the text.
It is easy enough to remove elements, but as noted above, we may want to honor the way these items are employed in the text in some cases.
There is probably a better strategy, but for now, leaning on the word-order tendencies of English, we’ll simply enumerate the tokens returned by split() and see where elements occur.
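A minimal sketch of this enumeration idea follows. It assumes that special elements before the first plain word or after the last one can be dropped, while elements between plain words are kept because they may be grammatically active. The function names and the returned flag are illustrative and not necessarily how the original clean() method works.

```python
import re

# A token counts as 'special' if it is a hashtag/handle (optionally followed by
# punctuation) or a URL. These are the same crude patterns assumed earlier.
SPECIAL_RE = re.compile(r"[#@]\w+\W*|https?://\S+")


def is_special(token):
    return SPECIAL_RE.fullmatch(token) is not None


def clean(text):
    """Trim leading/trailing special tokens; report whether any remain inside."""
    tokens = text.split()
    plain = [i for i, tok in enumerate(tokens) if not is_special(tok)]
    if not plain:
        return "", False  # nothing but special elements
    first, last = plain[0], plain[-1]
    kept = tokens[first:last + 1]
    has_interior = any(is_special(tok) for tok in kept)
    return " ".join(kept), has_interior


print(clean("@somehandle Loving the #python talk today https://example.com/a #NLP"))
print(clean("Good morning #mondaymotivation"))
```

In the first example the interior hashtag is kept because it functions as a word in the sentence, while the surrounding handle, link, and trailing hashtag are trimmed.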
It is not ideal that the clean() method iterates through split_text.
We could find() / rfind() the first/last ‘clean’ text tokens to cut down on iteration.
We could also rewrite our find functions to use enumeration, or even store the original self.text_ attribute as an enumerated dictionary.
We’ll set these aside for now and see what other demands arise later.

Wrapping Up
There are a number of tests we should run our class against, but here let’s confirm just the basic functionality on one example.
It appears that all attributes are displayed; hashtags, handles and urls were identified, and is_complex_ has the expected value.
Our class seems to be a reasonable first draft for a base class on which to build more complex functionality.