Automating Scientific Data Analysis Part 1: Why and How You Can Write Python Programs that Automatically Analyze Scientific Data Sets

Peter Grant
Feb 22

[Photo: Teaching Each Other New Coding Skills]

Everybody reading Towards Data Science is likely familiar with the typical application of data science techniques.
A company with an incredibly large data set asks somebody to mine the data set for understanding, develop algorithms trained to the data set, and let the company use their models to drive business decisions.
Data science writing typically focuses on this valuable application, but there are other applications where people can benefit from these techniques and mindsets.
For instance, scientific researchers.
Scientific research has a lot in common with data science.
There are often large data sets to study.
Those data sets typically contain the answers to important questions.
Those answers are often important in decision making.
The main difference is that scientific researchers typically do their data analysis manually in spreadsheets, whereas data scientists typically leverage the many powerful packages available in Python.
The purpose of this post is to introduce scientists to some of the ways data science techniques and mindsets can improve scientific research, and why scientists should consider using these techniques over their current methods.
The fundamental principle is simple: the data analysis portion of most scientific research is routine, and can be automated with Python scripts.
That automation enables the scientist to process larger data sets than their competition, with fewer mistakes, in a fraction of the time.
Why would I want to automate my data analysis?

This is perhaps the most important question.
Nobody is going to learn a new skill, in this case the two new skills of Python programming and data analysis automation, if they don’t think it will benefit them.
Fortunately, there are many reasons scientists should automate data analysis, including the following:

Faster processing of data: Analyzing scientific data sets can consume weeks or months of every year.
Each project, whether it includes lab experiments, field studies, or simulation studies, can yield hundreds if not thousands of data files.
Each of these files must be opened, studied to ensure that the test/monitoring/simulation proceeded correctly, and analyzed to find the result contained in that file.
Then the result must be added to another file and saved for later analysis.
Manually doing this takes a lot of time.
It’s repetitive and boring.
Automation solves all of those problems.
If the project is planned out in advance, scientists can write a Python script that performs all of these tasks on every data file automatically.
Then this process can be performed in minutes instead of months.
Reduced error potential: Humans make mistakes.
That’s simply part of being human.
Analyzing hundreds of test files requires thousands of calculations.
It involves creating hundreds of plots.
It requires saving hundreds of data points in the right location.
Each of these actions has the potential for typos, for incorrectly remembered constants, for files to be saved in the wrong location, for inconsistent plot axis labels, and so on.
This has always been part of the process, and avoiding it requires significant care and time.
Again, automation has the potential to avoid this issue completely.
Instead of ensuring that all calculations and plots in hundreds of data files are correct individually, a scientist only needs to ensure that a single Python script is correct.
Then that script is applied to each file.
And if there’s a mistake in the script there’s no need to dig through hundreds of files checking to see where else the mistake was made; simply update the script and re-run it on all files.
While getting a cup of coffee.
Access to Python packages: There are many Python packages designed specifically to make life easier for scientists.
Scikit-learn is an excellent package for scientists needing to make regressions, or implement machine learning.
Numpy is a numerical package capable of performing most calculations that scientists would need.
Matplotlib and Bokeh both offer plotting options with different features allowing flexibility in plot creation.
Pandas replaces the Excel table with DataFrames, enabling the data to be structured and manipulated in a familiar manner.
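To give a flavor of how these packages fit together, here is a minimal sketch of pandas and numpy replacing a spreadsheet table. The column names and values are hypothetical:

```python
import numpy as np
import pandas as pd

# A DataFrame fills the role of a spreadsheet table: labeled columns,
# with calculations applied to whole columns instead of cell by cell.
data = pd.DataFrame({
    "time_s": np.arange(5),
    "temp_C": [20.0, 22.5, 25.1, 27.4, 29.8],
})

# A derived column is one line, applied to every row at once
data["temp_K"] = data["temp_C"] + 273.15
print(data)
```

Instead of copying a formula down a spreadsheet column, the unit conversion is written once and applied to every row.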
Time available for other purposes: Since automated data analysis allows you to complete that part of your job in less time, suddenly you have time available for other activities.
Maybe you’d rather spend the time on business development and proposal writing.
Or maybe you have a staff member that you’d like to be mentoring.
Or customer relationships that you’d like to spend more time on.
Regardless of what activity you find more meaningful, automating your data analysis will help you spend more time there.
I believe that these reasons provide a solid justification for learning to automate data analysis, and that it would be wise for any scientist to do so.
But I’m sure that these aren’t all of the reasons.
What additional benefits do you think you could gain?

Since laboratory experimentation and the associated data analysis are a common part of scientific research, this series of posts will focus on how to automate this process.
What steps do I need to take to automate laboratory data analysis?

First, we’ll present the structure and big-picture design of a project before moving on to discuss several of the topics in significantly more depth.
This series of posts will focus on the planning and data analysis aspects of the process.
Unfortunately, each project must be approached individually, and a detailed yet generic solution doesn’t exist.
However, there is a fundamental approach that can be applied to every project, with the specific programming (primarily the calculations) changing between projects.
The following general procedure provides the structure of an automated data analysis project.
1. CREATE THE TEST PLAN

Determine what tests need to be performed to generate the data set needed to answer the research question.
This ensures that a satisfactory data set is available when generating regressions at the end of the project, and avoids needing to perform extra tests.
2. DESIGN THE DATA SET TO ALLOW AUTOMATION

This includes specifying what signals will be used to identify the most important sections of the tests, or the sections that will be analyzed by the script.
This ensures that there will be an easy way to structure the script to identify the results of each individual test.
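As a sketch of what such a signal might look like, a status column written by the test rig can mark the rows the script should analyze. The `valve_open` column name and the values here are hypothetical:

```python
import pandas as pd

# Hypothetical test data: a status signal flags the active section
data = pd.DataFrame({
    "time":       [0, 1, 2, 3, 4, 5],
    "valve_open": [0, 0, 1, 1, 1, 0],
    "flow":       [0.0, 0.1, 2.4, 2.5, 2.5, 0.2],
})

# Boolean filtering slices out the analysis window in one line
window = data[data["valve_open"] == 1]
print(window["flow"].mean())
```

Because the signal is part of the data set itself, the script never has to guess where the interesting section of each test begins and ends.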
3. CREATE A CLEAR FILE NAMING SYSTEM

Either create a data printing method that makes identification of the test conditions in each test straightforward, or collaborate with the lab tester to do so.
This ensures that the program will be able to identify the conditions of each test, which is necessary for analyzing the data and storing the results.
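One common approach is to encode the conditions directly in the file name, so the script can parse them back out. The naming convention below (underscore-separated name-value pairs) is hypothetical; substitute whatever convention your lab agrees on:

```python
def parse_conditions(filename):
    """Extract test conditions from a name like 'Test_Flow-2.5_Temp-40.csv'."""
    stem = filename.rsplit(".", 1)[0]   # drop the extension
    parts = stem.split("_")[1:]         # drop the 'Test' prefix
    conditions = {}
    for part in parts:
        name, value = part.split("-")
        conditions[name] = float(value)
    return conditions

print(parse_conditions("Test_Flow-2.5_Temp-40.csv"))
```

The parsed dictionary then tells the analysis script both what the test conditions were and where the results should be stored.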
4. STORE THE RESULTING DATA FILES IN A SPECIFIC FOLDER

This allows use of the Python package “glob” to sequentially open and analyze the data from each individual test.
5. ANALYZE THE RESULTS OF INDIVIDUAL TESTS

Create a program to automatically cycle through all of the data files and analyze each data set.
This program will likely use a for loop and glob to automatically analyze every data file.
It will likely use pandas to perform the calculations to identify the desired result of the test, and create checks to ensure that the test was performed correctly.
It will also likely include plotting features with either bokeh or matplotlib.
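The skeleton of such a loop might look like the following. The column names, the folder name, and the mean-temperature "result" are hypothetical stand-ins for a real calculation:

```python
import glob

import matplotlib
matplotlib.use("Agg")  # write plots to files; no display window needed
import matplotlib.pyplot as plt
import pandas as pd

def analyze_file(path):
    """Analyze one test file and save a plot next to it."""
    data = pd.read_csv(path)
    result = data["temperature"].mean()  # stand-in for the real calculation

    # Save a plot for visual error checking later on
    fig, ax = plt.subplots()
    ax.plot(data["time"], data["temperature"])
    ax.set_xlabel("Time (s)")
    ax.set_ylabel("Temperature (C)")
    fig.savefig(path.replace(".csv", ".png"))
    plt.close(fig)
    return result

# The loop itself: glob finds the files, the function handles each one
results = {path: analyze_file(path) for path in sorted(glob.glob("lab_data/*.csv"))}
```

Whether the project has ten files or ten thousand, the loop is the same; only the runtime changes.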
6. INCLUDE ERROR CHECKING OPTIONS

Any number of errors can occur in this process.
Maybe some of the tests had errors.
Maybe there was a mistake in the programmed calculations.
Make your life easier by ensuring that the program provides ample outputs to check the quality of the test results and the following data analysis.
This could mean printing plots from the test that allow visual inspection, or adding an algorithm that compares the measured data and calculations to expectations and reports errors.
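A comparison against expectations can be as simple as the sketch below. The 5% tolerance is a hypothetical default; the right threshold depends on the measurement uncertainty of the experiment:

```python
def check_result(measured, expected, tolerance=0.05):
    """Flag a result that deviates from expectation by more than the
    given fractional tolerance (hypothetical 5% default)."""
    deviation = abs(measured - expected) / abs(expected)
    if deviation > tolerance:
        return f"WARNING: deviates by {deviation:.1%}"
    return "OK"

print(check_result(98.0, 100.0))  # within tolerance
print(check_result(80.0, 100.0))  # flagged for manual review
```

Printed warnings like these tell you which of the hundreds of files deserve a manual look, instead of forcing you to inspect all of them.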
7. STORE THE DATA LOGICALLY

The calculated values from each test need to be stored in tables and data files for later use.
How these values are stored can either make the remaining steps easy or impossible.
The data should often be stored in different tables that provide the data set needed to later perform regressions.
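One simple pattern is to collect one dictionary per test as each file is analyzed, then build a single table at the end. The condition names, values, and output file below are hypothetical:

```python
import pandas as pd

# Collect one dict per test, then build the table in a single step
rows = []
for conditions, efficiency in [
    ({"flow_rate": 2.5, "temperature": 40.0}, 0.87),  # hypothetical results
    ({"flow_rate": 5.0, "temperature": 40.0}, 0.91),
]:
    rows.append({**conditions, "efficiency": efficiency})

results = pd.DataFrame(rows)
results.to_csv("results.csv", index=False)  # saved for the regression step
```

Keeping one row per test, with one column per condition plus the calculated result, gives the regression step exactly the structure it needs.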
8. GENERATE REGRESSIONS FROM THE RESULTING DATA SET

Create a program that will open the stored data from Step 7 and create regressions.
It should include an algorithm to create each desired regression, matching the data storage structure determined in Step 7.
Ensure that this program provides adequate outputs, both statistical and visual, to allow thorough validation of the results.
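As a sketch with a hypothetical data set, a simple least-squares fit plus a coefficient of determination covers the statistical side of that validation; scikit-learn offers the same and much more (regularization, cross-validation) when the problem is harder:

```python
import numpy as np

# Hypothetical data set: efficiency measured at several flow rates
flow = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
efficiency = np.array([0.70, 0.78, 0.84, 0.88, 0.90])

# Least-squares linear fit
slope, intercept = np.polyfit(flow, efficiency, 1)

# Statistical output for validation: coefficient of determination
predicted = slope * flow + intercept
ss_res = np.sum((efficiency - predicted) ** 2)
ss_tot = np.sum((efficiency - np.mean(efficiency)) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.3f}")
```

The R² value should never be the whole story; pairing it with a plot of predictions against measurements catches patterns a single statistic can hide.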
9. VALIDATE THE RESULTS

Validate the resulting regressions using the statistical and visual outputs provided in Step 8.
Determine whether the model is accurate enough or not.
If not, either return to Step 7 and generate different regressions, or Step 1 and add additional tests to create a more comprehensive data set.
If the model is accurate enough, publish detailed descriptions of its strengths and weaknesses so that future users understand the situations when the model should/should not be used.
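A pass/fail check along these lines can make the "accurate enough" decision explicit and repeatable. The 5% worst-case error threshold and the data values are hypothetical; the right criterion depends on how the model will be used:

```python
import numpy as np

def validate_fit(measured, predicted, max_error=0.05):
    """Hypothetical acceptance test: worst-case fractional error must
    stay under max_error for the regression to be accepted."""
    errors = np.abs((predicted - measured) / measured)
    worst = float(np.max(errors))
    return worst <= max_error, worst

measured  = np.array([0.70, 0.78, 0.84, 0.88, 0.90])
predicted = np.array([0.72, 0.77, 0.82, 0.87, 0.92])
print(validate_fit(measured, predicted))
```

Publishing the threshold alongside the model also documents exactly what "should/should not be used" means for future users.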
Next Steps

This post presented the concept of, motivation for, and procedure for automating scientific data analysis using Python scripts.
The remaining posts in the series will guide you through the 9 steps presented above.
The next post will discuss steps 1 through 6, leaving you with a firm understanding of how to automate analysis of individual laboratory tests.
The third and final post will discuss ways to store your data from each test, and combine it to form regressions.
When the topics covered in the two posts are combined, you’ll be able to write scripts that automatically perform the entire data analysis process for a particular project.
I hope to see you there, and I hope you find the posts useful.