@headwayioWhen Excel isn’t enough: Using Python to clean your Data, automate Excel and much more…How a Data Analyst can survive in a spreadsheet-driven organizationAndres VourakisBlockedUnblockFollowFollowingMar 23Excel is a very popular tool in many companies, and Data Analysts and Data Scientists alike often find themselves making it part of their daily arsenal of tools for data analysis and visualization, but not always by choice.
This was certainly my experience at my first job as a Data Analyst, where Excel was part of everyone’s workflow.
My team was using Excel’s data tool, Power Query, to aggregate and manipulate CSV files as well as connect to our database.
Eventually, this data would be displayed as a pivot table or dashboard and shared with the rest of the company.
Almost every report I got my hands on lived on Excel, and it didn’t take me long to realize that this was a big issue.
Just to paint you a picture, here are some of the things I heard my co-workers say multiple times about Excel, and that I eventually started saying myself:“It crashed again!!”Refreshing data on Excel reports was a daily task, and sometimes, it would be the only one we could perform at once.
Even though our computers had decent hardware, we knew that as soon as we opened other programs (Did someone say Chrome?) while Excel was refreshing, it was almost guaranteed that it will crash.
“It is still refreshing…”Not only we couldn’t use other applications while refreshing excel, but some of our reports would take 30 minutes or even a few hours to finish refreshing.
Yes, Excel loved holding our computers hostage!“We can’t load that much data.
”Our biggest frustration with Excel was not being able to load as much data as we needed.
Everyone in the company was demanding more and we simply couldn’t deliver.
It was clear that something needed to be done.
We were wasting too much time working around these issues, and there was very little time left to do any actual analysis or forecasting.
Luckily, I was very much fluent in Python and and it’s tools to manipulate CSV files, so my team and I began the long-overdue task of optimizing our reports.
Since we needed to keep reporting in Excel, and there was no budget for a BI tool, we decided to use Python to do all of the heavy lifting and let Excel take care of displaying the data.
So with the help of Python and Windows Task Scheduler, we automated the entire process of gathering our data, cleaning it, saving the results, and refreshing the Excel reports.
Since everyone’s workflow is different, and I want to make this article as useful as possible, I will keep things high level and include the links to some great tutorials in case you are looking to dig deeper.
Keep in mind that some of these tips may only work on a Windows machine, which is what I was using at the time.
Downloading Data from an FTP serverUsing the ftplib module in Python, you can connect to an FTP server and download files into your computer.
This was a module I used almost daily, since we were receiving CSV reports from an outside source.
Here is some sample code:To learn more about FTP servers and how to use ftplib check out this tutorial.
Running SQL queriesUsing the pyodbc module in Python, you can easily access ODBC databases.
In my case, I used it to connect to Netsuite and extract data using SQL queries.
Here is some sample code:Just note that you’ll need to have the appropriate ODBC driver installed in order for the module to work correctly.
For more information check out this tutorial.
Cleaning DataUsing the pandas module in Python, you can manipulate and analyze data very easily and efficiently.
This one is without a doubt one of the most valuable tools I posses.
Here is some sample code:This tutorial is a great place to get started with pandas.
If you are dealing with large files then you might want to also check out this article on Using pandas with Large Data Sets.
It helped me to reduce my memory usage by a lot.
Refreshing ExcelUsing the win32com module in Python, you can open up Excel, load a workbook, refresh all data connections and then save the results.
Here is how that’s done:I haven’t yet stumbled upon any good tutorials for the win32com module but this Stack overflow thread might be a good starting point.
Running your scripts at prescribed timeWith the help of Windows Task Scheduler you can run your python scripts at prescribed times and automate your work.
Here is how you can do that:Launch Task Scheduler, and find the Create Basic Task action, located under the Actions pane.
Clicking on Create Basic Task opens a wizard where you define the name of your task, the trigger (when it runs), and the action (what program to run).
The screenshot below shows the Action tab where you specify the name of your Python script to run as well as any arguments to the script.
For more details on creating a task, check out this tutorial.
By introducing Python into the equation, my team and I were able to dramatically reduce the time we spent processing data.
Also, including historical data into our analysis was no longer an unattainable task.
These improvements did not only freed us to think more analytically but also to spend more time collaborating with other teams.
I hope you found this article useful.
if you have any questions or thoughts, I’ll be happy to read them in the comments :).