GitHub User insights using GitHub API — Data Collection and Analysis

Exploring GitHub API

Karan Bhanot · Jun 24

As I was working with GitHub Pages, I decided that I'd like to display some statistics about my GitHub projects on it.

Thus, I decided to use GitHub’s own API to draw insights.

As this might be useful to others as well, I decided to create it as a project and publish it on GitHub itself.

Check out the repository below: https://github.com/kb22/GitHub-User-Insights-using-API

There are two parts to this project:

Data Collection — I used GitHub's API with my credentials to fetch my repositories and some key information about them.

Data Analysis — Using the data collected above, I drew some insights from the data.

You can also use this project for your own data collection.

Add your credentials to the file credentials.json. If your username is userABC and password is passXYZ, the JSON file should look like:

{ "username": "userABC", "password": "passXYZ" }

Once the changes to the JSON file are made, save the file.

Then, simply run get_github_data.py to get data from your profile and save it to the files repos_info.csv and commits_info.csv. Use the following command to run the Python file:

python get_github_data.py

Data Collection

Importing libraries and credentials

I first saved my credentials inside the credentials.json file.

After reading the credentials file, I used the username and password to create the authentication variable which I’ll use for GitHub API authentication.

Authenticating while accessing our own account allows us to make 5000 calls per hour.
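As a minimal sketch, the credentials could be read and turned into the basic-auth pair that the `requests` library accepts via its `auth` parameter. The write step below only exists to make the snippet self-contained; in the real project the credentials.json file already exists.

```python
import json

# Create a sample credentials.json in the format described above
# (placeholder values; in practice the file is written by hand)
with open("credentials.json", "w") as f:
    json.dump({"username": "userABC", "password": "passXYZ"}, f)

# Read the credentials back and build the basic-auth tuple,
# e.g. requests.get(url, auth=auth)
with open("credentials.json") as f:
    credentials = json.load(f)

auth = (credentials["username"], credentials["password"])
print(auth)  # ('userABC', 'passXYZ')
```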

User information

I'll use the https://api.github.com/users/<USERNAME> API to get data for my account.

There are several keys in the response.

From the JSON, I'll extract user information such as name, location, email, bio, public_repos, and public_gists.

I'll also keep some of the URLs handy, including repos_url, gists_url, and blog.

At the time of this article, I have 36 public repositories and 208 public gists.
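The extraction step can be sketched as plain dictionary work over the JSON response. The mock response below is trimmed and uses placeholder values; the real response from the users endpoint contains many more keys.

```python
# Trimmed, mock response from https://api.github.com/users/<USERNAME>
# (placeholder values; the real response has many more keys)
user_json = {
    "name": "userABC",
    "location": "Somewhere",
    "email": None,
    "bio": "Placeholder bio",
    "public_repos": 36,
    "public_gists": 208,
    "repos_url": "https://api.github.com/users/userABC/repos",
    "gists_url": "https://api.github.com/users/userABC/gists{/gist_id}",
    "blog": "",
}

# Keep only the fields discussed in the article
fields = ["name", "location", "email", "bio", "public_repos", "public_gists"]
user_info = {key: user_json.get(key) for key in fields}
print(user_info["public_repos"], user_info["public_gists"])  # 36 208
```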

Repositories

I'll now use the repos_url to fetch all repositories.

The URL, however, returns at most 30 repositories per call, so I had to handle pagination myself.

I make a call to the endpoint, and if 30 repositories are returned, there might be more, so I check the next page. I do this by appending a page parameter to the API call, setting its value to 2, 3, 4, and so on, depending on the page I want. If fewer than 30 repositories are returned, there are no more repositories and I end the loop.

As I have 36 repositories, I was able to fetch them all in two API calls and save the result in repos_data.
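The pagination loop can be sketched as follows. Here `fetch_page` is a stand-in for the real GET request to repos_url with the `page` query parameter; it simulates an account with 36 repositories returned in batches of at most 30.

```python
# Simulated repository list: 36 repositories, served 30 per page
ALL_REPOS = [{"id": i} for i in range(36)]
PER_PAGE = 30

def fetch_page(page):
    """Stand-in for GET <repos_url>?page=<page>."""
    start = (page - 1) * PER_PAGE
    return ALL_REPOS[start:start + PER_PAGE]

repos_data = []
page = 1
while True:
    batch = fetch_page(page)
    repos_data.extend(batch)
    if len(batch) < PER_PAGE:  # fewer than 30 -> no more pages
        break
    page += 1

print(len(repos_data), page)  # 36 repositories fetched in 2 calls
```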

To get any further insights, I had to take a look at the response.

So, I checked out the first repository information.

As we can see, there is a lot of information about each repository.

I decided to select the following for each repository:

1. id: The unique ID of the repository.
2. name: The name of the repository.
3. description: The description of the repository.
4. created_at: The date and time when the repository was first created.
5. updated_at: The date and time when the repository was last updated.
6. login: The username of the repository's owner.
7. license: The license type (if any).
8. has_wiki: A boolean that signifies whether the repository has a wiki document.
9. forks_count: The total number of forks of the repository.
10. open_issues_count: The total number of issues opened in the repository.
11. stargazers_count: The total number of stars on the repository.
12. watchers_count: The total number of users watching the repository.
13. url: The URL of the repository.
14. commits_url: The URL for all commits in the repository.
15. languages_url: The URL for all languages in the repository.
For the commits URL, I removed the templated value inside the braces (including the braces themselves). I constructed the languages_url myself from the repository URL.
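Both URL fixes are simple string operations. The sample values below use a placeholder owner and repository name, but the `{/sha}` template suffix matches the pattern GitHub returns in the commits_url field.

```python
# Sample repository entry (userABC/sample-repo is a placeholder)
repo = {
    "url": "https://api.github.com/repos/userABC/sample-repo",
    "commits_url": "https://api.github.com/repos/userABC/sample-repo/commits{/sha}",
}

# Drop the templated "{/sha}" suffix, braces included
commits_url = repo["commits_url"].split("{")[0]

# Build the languages URL from the repository URL
languages_url = repo["url"] + "/languages"

print(commits_url)    # .../sample-repo/commits
print(languages_url)  # .../sample-repo/languages
```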

The dataframe repos_df now has all the repository information I needed.

However, I wanted to go a step further and decided to extract all the languages right here and append them to the dataframe.

Languages for each repository can have multiple values, so I decided to combine all languages in the form of a comma separated list.
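The languages endpoint returns a JSON object mapping each language name to its byte count, so the combination step is a single join over the keys. The counts below are placeholders.

```python
# Mock response from a repository's languages endpoint
# (byte counts are placeholder values)
languages_json = {"Jupyter Notebook": 120000, "Python": 45000, "HTML": 3000}

# Combine the language names into one comma-separated string,
# ready to be stored in a single dataframe column
languages = ", ".join(languages_json.keys())
print(languages)  # Jupyter Notebook, Python, HTML
```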

Once this was complete, I saved the dataframe to the file repos_info.csv.

Commits

I also had access to the commits URL for each repository.

I decided that I could collect the commits for each repository and save them to their own file too.

Just like the repositories API, the commits API is also limited to 30 commits in one call.

So, using the same page-parameter technique, I retrieved all commits.

I took a look at the response json.

For each commit, I saved the ID of the repository the commit belongs to, the SHA value of the commit, the date of the commit, and the commit message.
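Picking those four values out of a commit entry could look like the sketch below. The mock entry is trimmed and uses placeholder values, but it mirrors how the commits endpoint nests the author date and message under the "commit" key.

```python
# Trimmed, mock entry from a repository's commits endpoint
# (placeholder values; the real response has many more keys)
commit_json = {
    "sha": "a1b2c3d",
    "commit": {
        "author": {"date": "2019-03-02T10:15:30Z"},
        "message": "Update README.md",
    },
}

repo_id = 12345  # placeholder id of the repository this commit belongs to
row = {
    "repo_id": repo_id,
    "sha": commit_json["sha"],
    "date": commit_json["commit"]["author"]["date"],
    "message": commit_json["commit"]["message"],
}
print(row["sha"], row["message"])  # a1b2c3d Update README.md
```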

I saved the dataframe to the file commits_info.csv.

Data Analysis

Now that the complete data is available, I decided to draw some insights out of the data.

Basic Analysis

I noticed that I had 36 repositories and 408 commits.

I then decided to use the describe() method to take a look at the forks, watchers, issues and stars.

I noticed the following:

The maximum number of forks on a repository is 67, while the minimum is 0.

The numbers of watchers and stars go hand in hand.

No issues have been reported in any repository.

I also observed that the two most common commit messages I've used are Update README.md and Initial commit. It appears that I sometimes update readme files on GitHub itself and use its default message as the commit message.

Commits per repository

Next, I wanted to see how the commits were distributed amongst my various repositories.

I combined the two datasets (repos and commits) based on the repository id and created a plot.
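The join-and-count step can be sketched with the standard library alone. The toy records below stand in for repos_info.csv and commits_info.csv, and the commit counts are made up; only the repository names come from the discussion that follows.

```python
from collections import Counter

# Toy stand-ins for the two datasets, joined on the repository id
repos = [
    {"id": 1, "name": "IEEE MyEvent App"},
    {"id": 2, "name": "Coursera_Capstone"},
]
commits = [
    {"repo_id": 1}, {"repo_id": 1}, {"repo_id": 1},
    {"repo_id": 2}, {"repo_id": 2},
]

# Count commits per repository id, then map ids to repository names
counts = Counter(c["repo_id"] for c in commits)
name_of = {r["id"]: r["name"] for r in repos}
commits_per_repo = {name_of[i]: n for i, n in counts.items()}
print(commits_per_repo)
```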

Commits per repository

From the figure above, we can clearly see that I've had the most commits in IEEE MyEvent App, which is an Android application for event management. The second most committed-to repository is Coursera_Capstone, the one associated with IBM's Applied Data Science Capstone course.

Yearly analysis

It's been a long time since I started working on projects and pushing them to GitHub. I've worked the most during the years 2018 and 2019 and expect the yearly analysis to reflect that.

Commits in each year

I have made the most commits this year, even though it's only June. Second place goes to the year 2016. I was pursuing my Bachelor's in Computer Science back then and had started working on my own projects, hence the high number of commits. I expected more commits in the year 2018, but I started late that year, so there were probably fewer commits in total.

Let’s now break down the year 2019 and see how I progressed there.

Monthly analysis of 2019

I broke down the year into months and visualized the data on a bar plot.

Commits in each month of 2019

It appears that I made the most commits in the month of March. There are still 6 days left in June, but it has already taken second place. The fewest commits were in January. Let's break down March 2019 further.
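Grouping commits by month is a matter of parsing the ISO timestamps saved earlier. The dates below are placeholders chosen only to illustrate the grouping, not my actual commit history.

```python
from collections import Counter
from datetime import datetime

# Placeholder commit dates in the ISO format returned by the GitHub API
dates = [
    "2019-01-10T09:00:00Z",
    "2019-03-02T10:15:30Z",
    "2019-03-05T18:40:00Z",
    "2019-06-20T12:00:00Z",
]

# Count 2019 commits per month name
months = Counter(
    datetime.strptime(d, "%Y-%m-%dT%H:%M:%SZ").strftime("%B")
    for d in dates
    if d.startswith("2019")
)
print(months.most_common(1))  # March leads in this sample
```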

Day-by-day analysis of March 2019

Let's also see the commit history within the month of March.

Commits in March, 2019

I made the maximum number of commits on March 2, 2019.

Most popular languages

As my interest grew in Data Science, I worked on many projects using Python. Thus, Python should be the most dominant language.

Language distribution amongst all repositories

I've worked with a variety of languages, including HTML, CSS, C++, Java, and others. However, the most common is Jupyter Notebook. The code in my Jupyter notebooks is written in Python 3, so Python is effectively the most common language across all my projects.

Conclusion

In this article, I discussed the steps I used to collect GitHub data for my profile and then used the data to draw insights.

Hope you liked this.

Do share your thoughts, ideas and suggestions.

I’d love to hear from you.
