Well, the CH suffix is the Alpha-2 code for Switzerland in the international ISO country code standard.
Code values (or reference values) are an essential ingredient of any program that encodes information.
Best practice is to use standardized ones where possible (don’t reinvent the wheel here).
For countries, we therefore decided to use the Alpha-2 codes of the ISO 3166-1 standard.
As you will see later for encoding languages, we take a similar approach.
ISO, by the way, stands for “International Organization for Standardization.”

Refactoring and Enhancing

The GovernmentSocialMediaAnalyzer Class

For our generalized program, we do a first refactoring step.
Code refactoring is the process of restructuring existing computer code without changing its external behavior.
So we rename our class in sample1.py to GovernmentSocialMediaAnalyzer and extend its constructor, the __init__ method, with a country_code parameter.
We took a first design decision:

Design Decision 1: An instance of our class GovernmentSocialMediaAnalyzer will encapsulate the data and behavior of a dedicated country.
The code enhancements work as follows: the country_code parameter (e.g., CH) passed in during class creation is stored as a private instance variable __country_code and used to create the YAML configuration file_name, from which we load the configuration data and store it in the private variable __cfg.

So now we are ready to generalize our get_government_members method by reading the Twitter account and list name out of the configuration data, which is stored in the self.__cfg instance variable.
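To make the constructor changes concrete, here is a minimal sketch of how the country code could drive the configuration loading. It assumes the PyYAML package and the file name pattern config-<country_code>.yaml (as in config-CH.yaml). The helper names config_file_name and __load_config are our own illustration and not necessarily what the original class uses.

```python
def config_file_name(country_code):
    """Build the per-country configuration file name (hypothetical helper)."""
    return "config-" + country_code + ".yaml"


class GovernmentSocialMediaAnalyzer:
    def __init__(self, country_code):
        # Private instance variables, as described in the text.
        self.__country_code = country_code
        self.__cfg = self.__load_config(config_file_name(country_code))

    def __load_config(self, file_name):
        # Assumption: PyYAML is installed. It is imported here so that the
        # file name helper above stays usable without it.
        import yaml

        with open(file_name, "r", encoding="utf-8") as ymlfile:
            return yaml.safe_load(ymlfile)


print(config_file_name("CH"))
```

With a config-CH.yaml file in the working directory, GovernmentSocialMediaAnalyzer("CH") would load the Swiss configuration into its private __cfg variable.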
But let’s first finalize the refactoring and enhancement of our __init__ method.
We take another design decision:

Design Decision 2: The __init__ method should encapsulate the loading of all Twitter accounts of the list from Twitter, as well as the conversion of the relevant attributes into attribute columns (= arrays of strings). The columns should be made available as private instance variables.

That means when we create a GovernmentSocialMediaAnalyzer instance for a dedicated country, the initialization phase of the class will include all the necessary steps to get the data from Twitter into our internal data structures (represented as columns).

We will do this step in a dedicated helper method called __extract_columns.
We defined it as a private method (__ prefix) because it shouldn’t be used by anybody outside of this class.
The refactored class from lesson one now looks like this.
We made the column attributes more descriptive and defined them as instance variables so that the columns can be used by any method within our class.
So we have finalized and refactored the class instance creation:

5–12: code block to load the country-specific configuration file
16–21: code block to read the Twitter security token and keys from the secret configuration file and then connect to the Twitter API
39: call to the __extract_columns method to retrieve the data and convert it into columns.
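The column extraction idea can be sketched without any Twitter access. The class below is a hypothetical stand-in for our analyzer: it receives ready-made member objects instead of calling the Twitter API, and the attribute names (screen_name, description, followers_count, friends_count) are assumptions modeled on the Twitter user object. The followers_total method exists only for illustration.

```python
from types import SimpleNamespace


class ColumnExtractionSketch:
    """Hypothetical stand-in for the analyzer class, without Twitter access."""

    def __init__(self, members):
        # One list per column, kept as private instance variables.
        self.__col_screen_name = []
        self.__col_description = []
        self.__col_followers_count = []
        self.__col_friends_count = []
        self.__extract_columns(members)

    def __extract_columns(self, members):
        # Convert a list of member records into parallel columns.
        for member in members:
            self.__col_screen_name.append(member.screen_name)
            self.__col_description.append(member.description)
            self.__col_followers_count.append(member.followers_count)
            self.__col_friends_count.append(member.friends_count)

    def followers_total(self):
        # Any method of the class can use the columns.
        return sum(self.__col_followers_count)


# Made-up sample members, for illustration only.
members = [
    SimpleNamespace(screen_name="example_one", description="Nationalrat FDP",
                    followers_count=1200, friends_count=100),
    SimpleNamespace(screen_name="example_two", description="Conseiller national",
                    followers_count=300, friends_count=50),
]
sketch = ColumnExtractionSketch(members)
print(sketch.followers_total())
```

Note that the __ prefix does not make a method truly inaccessible in Python. It triggers name mangling (the method becomes _ColumnExtractionSketch__extract_columns), which is mainly a strong hint that the method is internal.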
Our check_for_party algorithm from tutorial one hard-coded the party abbreviations in the code itself.
Well, let’s refactor the code and move the party information to our configuration file.
Thanks to the flexibility of the YAML file, this can be done quite easily.
Design Decision 3: We want to use several party abbreviations (potentially in multiple languages) and keywords (e.g., the party’s Twitter screen name) per party to try to identify the party affiliation of a politician.
So our configuration file config-CH.yaml will require configuration information per party.
For Switzerland, a list of parties and their abbreviations can be found on parlament.ch in four languages.
In YAML you can quickly build up a list of configuration items (e.g., a party configuration item).
List members are denoted by a leading hyphen (-), with each member spanning one or more lines, or are enclosed in square brackets ([ ]) and separated by a comma and a space (, ).
In our case, a party list member is denoted by the hyphen notation.
Each party list member has a twitter and an abbrs attribute.
The abbrs (abbreviations) attribute is itself a list of strings, indicated by the square bracket notation.
config-CH.yaml (with parties list)

If we inspect the loaded configuration (stored in the self.__cfg variable) in the Python debugger, it becomes clear how the data structure is represented using Python lists and dictionaries.
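As a rough illustration of that structure, the sketch below shows a YAML fragment (in the comments) and the Python object a YAML parser would produce for it. The top-level key parties and the screen names are placeholders we made up. Only the twitter and abbrs attributes and the FDP/PLR abbreviations come from the text.

```python
# Hypothetical YAML fragment, shown as a comment for reference:
#
# parties:
#   - twitter: example_fdp_account
#     abbrs: [FDP, PLR]
#   - twitter: example_sp_account
#     abbrs: [SP, PS]
#
# A YAML parser turns this into nested dictionaries and lists:
cfg = {
    "parties": [
        {"twitter": "example_fdp_account", "abbrs": ["FDP", "PLR"]},
        {"twitter": "example_sp_account", "abbrs": ["SP", "PS"]},
    ]
}

# Hyphen list -> Python list of dicts; bracket list -> Python list of strings.
for party in cfg["parties"]:
    print(party["twitter"], party["abbrs"])

# The first abbreviation serves as the code value for a party.
first_code = cfg["parties"][0]["abbrs"][0]
```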
A side remark on the abbrs attribute: we introduced a list of party abbreviations there because having multiple national languages also means that a party has several abbreviations (e.g., for German and French), in our case “FDP” and “PLR”.
We want to check for all of them.
In other countries there may be just one abbreviation, but with this decision we are future-proof.
Our improved check_for_party method now looks as follows.
It iterates over all Twitter accounts and each party’s configuration record and checks whether the Twitter account matches a party in its description or screen name.
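The matching logic described here can be sketched as a standalone function that handles a single account. This is our own simplified reconstruction, not the article’s original listing, so the line numbers cited in the text do not apply to it. It assumes plain substring matching and the party record layout with twitter and abbrs attributes. The sample party records are placeholders.

```python
def check_for_party(description, screen_name, parties):
    """Return the first abbreviation of the first matching party, else None."""
    for party in parties:
        # abbrs is a list, so we iterate over all abbreviations.
        for abbr in party["abbrs"]:
            if abbr in description or abbr in screen_name:
                return party["abbrs"][0]
        # twitter is a single value, so we fetch it directly.
        if party["twitter"].lower() in description.lower():
            return party["abbrs"][0]
    return None


# Placeholder party records, for illustration only.
parties = [
    {"twitter": "example_fdp_account", "abbrs": ["FDP", "PLR"]},
    {"twitter": "example_sp_account", "abbrs": ["SP", "PS"]},
]

print(check_for_party("Conseiller national PLR", "jdupont", parties))  # FDP
print(check_for_party("Loves hiking", "hiker42", parties))             # None
```

Plain substring matching is deliberately simple. A short abbreviation such as “SP” can also match inside unrelated words, so a stricter word-boundary check may be worth considering.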
On lines 6, 10 and 19 we get the data out of our configuration structure.
Depending on the attribute type, we have to iterate over the list of values (6, 10) or fetch the data directly (19).

If we have a match, the first abbreviation will be returned as our code value to identify parties: res = party['abbrs'][0]

Fine Tuning the Algorithm

Introducing a Second Plotly Table: Grouping Accounts by Party

To fine-tune our algorithm, we have to check how effective it is at finding a party for each Twitter account.
For that, we have to introduce a second table, which will group our Twitter accounts according to their party allocation.
The powerful pandas package provides us with the necessary tooling.
You can refer to the pandas API description for all the details on how to group data.
Some comments on the code fragment:

4–8: Here we create a panda_data record consisting of four columns, built from __col_party, __col_followers_count and __col_friends_count. The __col_party is used twice: the first column labels each row (as you see on line 11, we group by party), and in the second column we add up the number of rows that have the same party.
9: We create a first pandas data frame from this table with the four columns.
11: Here we transform the created data frame by using the groupby function. We also define the aggregation (agg) operations for the 2nd, 3rd and 4th columns.
15–19: The basic stuff to create a nice plotly table.
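The grouping step can be tried out in isolation. The following is a minimal sketch, assuming pandas is installed. The column labels (Party, Accounts, Followers, Friends) and all numbers are made up for illustration, and the plotly table creation is left out.

```python
import pandas as pd

# Made-up sample columns, for illustration only.
col_party = ["FDP", "SP", "FDP", "unknown"]
col_followers_count = [1200, 800, 300, 50]
col_friends_count = [100, 200, 50, 10]

# The party column is used twice: once as the row label to group by,
# and once as the column in which we count the accounts per party.
panda_data = {
    "Party": col_party,
    "Accounts": col_party,
    "Followers": col_followers_count,
    "Friends": col_friends_count,
}
df = pd.DataFrame(panda_data)

# Group by party and aggregate the remaining three columns.
grouped = df.groupby("Party").agg(
    {"Accounts": "count", "Followers": "sum", "Friends": "sum"}
)
print(grouped)
```

Each row of grouped then holds one party with its number of accounts and the summed follower and friend counts.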
Let’s run the program and check the accuracy of our party assignment allocation algorithm.
As a result of the program execution, you will now have two tables in your plotly account (some grid tables will also be created, which are not relevant at the moment).

Enhance our Config File with Keyword Attribute

Our first run shows that the majority of the politicians (65) don’t mention their party abbreviations or party Twitter screen_name in their account description/screen_name.
So let’s try to fine-tune our algorithm.
Let’s go through the list once again and check for other keywords which could help us to identify their party relationship.
We found the following keywords:

socialist (SP)
glp (GLP), Grüne (GLP)
Lega

So let’s add these to our configuration file with a new attribute: keywords.
That’s the beauty of YAML: you can easily extend it with additional attributes, in our case another list attribute.
And we add the additional check in our check_for_party method (23–28).

Et voilà, we could identify 13 more Twitter accounts with over 20’000 followers.
Still, 52 accounts can’t be mapped to a party; for that, we have to connect another data source, which will be done in a later tutorial.
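The additional keyword check could look roughly like the sketch below. It is a hypothetical helper, assuming that each party record may carry an optional keywords list next to abbrs, and that matching is a case-insensitive substring test on the account description. The keyword-to-party mapping follows the list above.

```python
def check_keywords(description, parties):
    """Return the party code if one of its keywords occurs in the description."""
    text = description.lower()
    for party in parties:
        # Parties without a keywords attribute are simply skipped.
        for keyword in party.get("keywords", []):
            if keyword.lower() in text:
                return party["abbrs"][0]
    return None


# Placeholder party records, for illustration only.
parties = [
    {"abbrs": ["FDP", "PLR"]},
    {"abbrs": ["SP", "PS"], "keywords": ["socialist"]},
    {"abbrs": ["GLP"], "keywords": ["glp", "Grüne"]},
]

print(check_keywords("Parti socialiste vaudois", parties))  # SP
```

Using party.get("keywords", []) keeps the attribute optional, so existing party entries in the configuration file do not have to be changed.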
As a final step for today, we refactor the create_politican_table method.
Mainly, we standardize the file name used in plotly by including the country code in it.
That allows us to generate tables for different countries while ensuring that they do not overwrite each other in our plotly account (20).
There we are: we have now generalized and refactored our overall application and have a good foundation, backed by a configuration file, for further build-out.
We can now instantiate a GovernmentSocialMediaAnalyzer for a dedicated country (supposing that we have provided the necessary configuration file) and extract the relevant Twitter data into a plotly table for further processing.
Visualized as a UML sequence diagram, the interaction flow of our class can be represented as follows:

If you want to understand UML sequence diagrams in more detail, refer to the following tutorial.
It’s an excellent technique to visualize various aspects of a program.
In the above diagram, the return messages are depicted in blue.
As an example, the message line to the pandas package to create a data frame is depicted in red (createDataFrame), and its return message line with a dataFrame object in blue.
Exercise

Use one of the lists offered via the Government Twitter account (https://twitter.com/TwitterGov), for example, the list of US Cabinet members (https://twitter.
Work out the corresponding YAML configuration file.
Check out what kind of information could be used to identify the party of the politician. Enhance the keywords with your findings.
Enhance the main program with a user input question, something like “Which government do you want to analyze?”. Provide a list of available configurations and then run the program with the user selection.
Think about the changes necessary to analyze multiple politician lists per country, i.e., we want to differentiate various government bodies per country and have that generalized in the configuration files.
The source code can be found here: https://github.com/talfco/clb-sentiment

Originally published at dev.
net on February 3, 2019.