Instagram Data Analysis Using Panoply and ModeKa Hou SioBlockedUnblockFollowFollowingMay 27picture credits to panoply.
ioBackgroundThis project is built on top of the data challenge that Panoply has released in Apr 2019.
Panoply is a cloud data warehouse that you could gather data from different data sources (i.
AWS S3, Google analytics and etc.
) easily into one place and then connect to different Business Intelligence tools (i.
Chartio, Mode and etc.
) for analytics and insights.
Panoply has recently integrated their data warehouse with Instagram API to collect data.
This challenge is about using Panoply as an ETL tool to explore Instagram data for marketing use (i.
promotion, segmentation and etc.
In this challenge, challengers are asked to set up aPanoply account, and connect to self Instagram or provided instagram data to perform analyses, draw insights and build visualization for storytelling.
If you have an Instagram account, you can use the data from your own instagram account.
Or if you prefer to use the data provided by Panoply, you can choose from 2 accounts:Kirsten Alana: A top Instagram Influencer.
Shinesty: An innovative and edgy apparel brand.
Then you can use any of your choice of BI tools for data visualization.
The final deliverable of this challenge will be in English, data visualizations for communicating your findings and the SQL queries that you have used.
Project DesignSteps of my project would be:Explore Panoply, using resources it provided for the challenge and documentations on its own website.
Create Panoply free trial account, connect to provided data source, connect to BI tool.
Explore the datasets, understand the schemas and relationships.
Research and come up with questions/KPI that an Instagram owner would like to track and have them answered.
So they could get insights to grow their brand.
Use Mode to build visualizations and submit to the challenge.
ToolsThe tools and technologies that I am going to use in this project would be Panoply, Instagram data, SQL and Mode.
ProcessI started by reading the resources that Panoply provided and explore their website to better understand what Panoply is and how it plays a role in data analytics process.
Then I created the free trial of Panoply account and follow their documentations to connect to Amazon S3 to collect the provided instagram data.
Create a Panoply accountBelow is the flow for singing up an Panoply account:go to https://panoply.
io landing page screenshot2.
sign up a free 21-day trail account with an email account.
sign-up page screenshotConnect to data sources and collect dataBelow is the flow for connecting to Amazon S3 and collect the Instagram data using APIs.
In this challenge, I used the data from Shinesty, an innovative and edgy apparel brand as my analytic target.
Go to Data Sources tab.
Data Sources tab screenshot2.
Select the data source you want to connect to, for this project we used Amazon S3.
Input your credentials and start to collect data from your source to Panoply data warehouse.
Amazon S3 api info page screenshot3.
Your tables are ready to go.
Tables tab screenshotTables for analysisAfter collection of data, I found out that there are 31 tables in total.
After exploration and examination, I identified the 5 most useful tables and I used them to draw meaningful insights.
Below are the tables and the columns that I have used.
shinestyinstagram_instagram_user table screenshotshinestyinstagram_instagram_media table screenshotshinestyinstagram_instagram_comment table screenshotshinestyinstagram_instagram_media_tags table screenshotshinestyinstagram_instagram_media_location table screenshotMetricsAfter my research and I identified some metrics and questions that an Instagram account owner would like to know in order to improve their account awareness given the data the API provided.
Engagement per each hashtag.
Performance over time.
Performance by day of week using when a user comment.
Performance by hour of day using when a user made a comment.
Performance by hour of day using when the post is created.
Performance by types of media.
Performance for different media by filter.
Performance by location tag used.
Performance difference for posts that has location tag versus there is no tag.
Top active commenters.
Connect Mode for visualizationGo to https://mode.
com to create a free non-business account.
Mode home page screenshot2.
After you are in your account dashboard, hit the down-arrow on the left that is next to your name.
home dashboard screenshot3.
Then a new drop down menu will come up and select Connect a Database tab.
drop down menu for connecting a database screenshot4.
Then you can connect to different database using your credentials.
connect to database screenshotFor this project, we used Amazon Redshift because that is the database that Panoply used to store our tables.
After you entered and collected to your Panoply database, Mode will start to collect all the tables to your account database in Mode.
Once it is done, you can start using SQL to analyze your data in Mode.
using SQL to analyze the data in Mode screenshotAnalysisBelow are the queries that I have used for answering the questions in the Metric section.
Finding engagement for each hashtag used:with t AS(SELECTvalue as hashtag,likes_count as likes,comments_count as commentsFROM public.
shinestyinstagram_instagram_media mleft JOIN public.
shinestyinstagram_instagram_media_tags mtON m.
id = mt.
instagram_media_id) select hashtag, AVG(likes) as avg_likes, AVG(comments) as avg_commentsfrom twhere hashtag is not nullgroup by 1Below visual tells you the average likes for each hashtag, #housetonstrong and #theperfectcrime have the most average likes performance.
Average Likes vs.
HashtagBelow visual tells you the average comments with respects to each hashtag used, #buttstuff and #macrobrews have the best performance.
Average Comments vs.
HashtagBelow visual combine the above twos.
Performance By HashtagFind out performance over time:SELECTDATE_TRUNC(‘week’, created_time)::DATE as week,SUM(comments_count) AS total_comments,AVG(comments_count) AS avg_comments,SUM(likes_count) AS total_likes,AVG(likes_count) AS avg_likes,count(distinct id) as nums_of_postFROMpublic.
shinestyinstagram_instagram_mediaGROUP BY1ORDER BY 1Performance Over TimeFinding performance by day of week using when a user comment:select TO_CHAR(created_time, ‘DY’) as day,COUNT(distinct media_id) AS nums_of_post_got_commented,COUNT(distinct from_username) AS nums_of_commenter,ROUND((nums_of_commenter/ cast(nums_of_post_got_commented as FLOAT)), 0) as average_commenter_per_postfrom public.
shinestyinstagram_instagram_commentsgroup by 1order by 1Below visual shows that Thursday and Friday are the time when the users like to comment the most.
Engagement Performance by Day of WeekFinding performance by hour of day using when a user made a comment:select TO_CHAR(created_time, ‘HH24’) as hour,COUNT(distinct media_id) AS nums_of_post_got_commented,COUNT(distinct from_username) AS nums_of_commenter,ROUND((nums_of_commenter/ cast(nums_of_post_got_commented as FLOAT)), 0) as average_commenter_per_postfrom public.
shinestyinstagram_instagram_commentsgroup by 1order by 1Engagement Performance by HourFinding performance by hour of day using when the post is created:SELECTTO_CHAR(created_time, ‘HH24’) as hour,SUM(comments_count) AS total_comments,AVG(comments_count) AS avg_comments,SUM(likes_count) AS total_likes,AVG(likes_count) AS avg_likes,count(distinct id) as nums_of_postFROMpublic.
shinestyinstagram_instagram_mediaGROUP BY1ORDER BY 1The purpose of me showing this query is that it is not appropriate to use this insight to make a recommendation to post a IG post during the 7am to 4pm of a day.
Because this query is using when a post is created to calculate the numbers of likes and comments.
In contrast, the last visual used when a user has commented to calculate the performance, which I think would be more accurate.
So I would like to suggest @shinestythreads to post during 11pm to 2am of a day to get more comment engagement.
Unfortunately the API didn’t provide the same info for likes, which I would like to use to calculate for like engagement.
Performance by HourFinding performance by types of media:SELECTtype,SUM(likes_count) as total_likes,AVG(likes_count) as avg_likes,SUM(comments_count) as total_comments,AVG(comments_count) as avg_comments,COUNT(distinct id) as nums_of_postFROMpublic.
shinestyinstagram_instagram_mediaGROUP BY1There are three types of medias that IG offers right now: image, video and carousel.
Below visuals show the video media has the most average comments and the carousel media has the most average likes.
And the video media shows itself is the best media to get engagement.
Performance by Media Type using average commentsPerformance by Media Type using average likesFinding performance for different media by filter:SELECTfilter,type,SUM(likes_count) as total_likes,AVG(likes_count) as avg_likes,SUM(comments_count) as total_comments,AVG(comments_count) as avg_comments,COUNT(distinct id) as nums_of_postFROMpublic.
shinestyinstagram_instagram_mediaGROUP BY1, 2In below visual, we can see the filter named Crema has the best performance beside the Normal filter for image media.
Performance by Filter for Image MediaIn below visual, we can see the filter named Ashby has the best performance beside the Normal filter for video media.
Performance by Filter for Video MediaFor carousel media, we can do the same to get insight but I didn’t do it in here because @shinestythreads only has normal filter for carousel media.
Find out performance by location tag used:SELECTlocation,SUM(likes_count) as total_likes,AVG(likes_count) as avg_likes,SUM(comments_count) as total_comments,AVG(comments_count) as avg_commentsFROM(SELECTname as location,m.
shinestyinstagram_instagram_media_location lLEFT JOIN public.
shinestyinstagram_instagram_media mON l.
instagram_media_id = m.
id) as tGROUP BY1This visual shows us that among all the posts that has a location tag, Augusts National Golf Club has the most average likes and comments engagement.
Performance by Location TagFind out performance difference for posts that has location tag versus there is no tag:WITH t AS(SELECTm.
name as locationFROMpublic.
shinestyinstagram_instagram_media mLEFT JOINpublic.
id = l.
instagram_media_id),w as(SELECT*,(CASE WHEN location IS NULL THEN 0 ELSE 1 END) AS have_locationFROM t)SELECThave_location,SUM(likes_count) as total_likes,AVG(likes_count) as avg_likes,SUM(comments_count) as total_comments,AVG(comments_count) as avg_commentsFROMwGROUP BY1Posts with location tag has more average likes but slightly less average comments compare to posts that has no location tag.
Performance of Location Tag vs.
No Location TagFind out top active commenters:SELECT*FROM(SELECTfrom_username as username,COUNT(media_id) as nums_of_comments,RANK() OVER(ORDER BY nums_of_comments DESC)FROMpublic.
shinestyinstagram_instagram_commentsGROUP BY1ORDER BY2 DESC) as tWHERErank >1 and rank <=15This visual shows us who are the top active commenters (users), not including the user @shinestythreads.
Most Active CommentersCohort Analysis:with t AS(selectmedia_id,from_username as username,DATE_TRUNC('week', created_time) as weekfrompublic.
shinestyinstagram_instagram_comments),w AS(selectusername,min(week) as first_time_commentingfromtGROUP by1)SELECTx.
cohort::DATE AS week,MAX(x.
week_number) OVER (PARTITION BY x.
cohort) AS total_nums_of_week,x.
nums_of_commenter) OVER (PARTITION BY x.
cohort) AS nums_of_new_commenter,x.
nums_of_commenter) OVER (PARTITION BY x.
cohort)::FLOAT AS retention_rateFROM(SELECTw.
first_time_commenting as cohort,FLOOR(EXTRACT('day' FROM t.
week – w.
first_time_commenting)/7) AS week_number,COUNT(DISTINCT t.
username) AS nums_of_commenterFROMt tLEFT JOINw wONt.
username = w.
usernameGROUP BY1,2) as xORDER BY 1,2,3Above are all the analyses I have done for the project.
For the data I have used, I didn’t upload them to my Github because of confidential purposes.
If you are interested in learning what resources I have used for this project, please visit my Github repo.
If you have any questions, feel free to comment below.
Thank you so much for reading!.