Checking Automated Data Analysis for Errors

How to Check for Errors, both Manually and Automatically, when Automating Data Analysis

Peter Grant · Mar 6

This is the fourth article in a series teaching you how to write programs that automatically analyze scientific data.
The first presented the concept and motivation, then laid out the high-level steps.
The second taught you how to structure data sets to make automated data analysis possible, and automatically identify the conditions of each test.
The third article discussed creating a for loop that automatically performs calculations on each test result and saves the results.
This fourth post will cover perhaps the most important part of the entire process: checking the data and analysis results for errors, to avoid letting them impact the final results.
Checking the Quality of Testing and Analysis Results

Perhaps the strongest argument against automating data analysis is reliability.
A computer algorithm that mindlessly performs calculations and analysis won’t identify poorly performed tests or mistakes in analysis, and these oversights will lead to errors in the final project results.
This isn’t simply people being afraid of new ways of doing things, it’s a legitimate concern.
It’s especially true in the case of laboratory testing; lab tests don’t always progress as planned, and sometimes those mistakes need to be identified during data analysis.
Fortunately, it’s possible to add error checking methods to the program.
Some of these techniques, namely printing intermediate outputs and plotting all data, still require manual data checking, though with significantly less human effort than fully manual data analysis. Others, namely creating an automated data checking algorithm, allow the program to do the error checking on its own, significantly reducing the time spent checking data.
Combined with the natural repeatability of computer programs, these approaches can lead to data analysis processes that are more robust than manual calculations, while also being significantly faster.
The following sections will describe these three approaches to data quality checking in Python programs.
Printing Intermediate Outputs

Printing intermediate outputs is analogous to showing all calculations.
This is a good practice because it both facilitates debugging when the program is created, and allows others to check the results and build confidence in the automated data analysis program.
Because many people won’t want to review Python code directly, they won’t be able to check the formulas as they might with an Excel spreadsheet.
This makes it especially important to provide as many intermediate outputs as possible so people can check the calculations on their own.
The basic premise behind printing intermediate outputs is to present as many steps in the calculations as possible in a format similar to Excel.
This facilitates checking the results from the code by making it easier for others to understand.
Then they can easily perform their own calculations, and compare their results to the Python output.
This is generally performed using the following two steps:

1. Present as many calculation details as possible within the data frame. External variables or lists may be necessary, but should be used sparingly. Containing all of the data and calculations in a single data frame makes it easier for others to understand and check the calculations.
2. Print the data frame from each test to a unique .csv file.
Some aspects of presenting calculation details within the data frame come naturally.
Most calculations will be performed on the data frame, and the results will be stored in the data frame as a direct result.
Other aspects require a little extra effort.
For instance, constants would more naturally be used as independent variables, but adding a column for them in the data frame allows others to check these constants while examining only the output table.
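As a minimal sketch of this idea (the column names and the specific-heat constant here are illustrative, not taken from the article's data set), a constant can be stored as its own column so it appears in the printed output alongside the measurements:

```python
import pandas as pd

# Illustrative measurements; in the real program these come from the test file
Data = pd.DataFrame({'Hot Inlet Temperature (deg F)': [100.3, 100.5],
                     'Hot Outlet Temperature (deg F)': [80.1, 80.3]})

# Store the constant as a column, repeated on every row, so a reviewer can
# verify it directly from the printed output table
Data['Specific Heat (BTU/lb-F)'] = 1.0

# Intermediate calculation stored as its own column
Data['Hot Side Temperature Drop (deg F)'] = (Data['Hot Inlet Temperature (deg F)']
                                             - Data['Hot Outlet Temperature (deg F)'])
```

Repeating the constant on every row costs little, and it lets someone checking the output .csv see every value used in the calculation without opening the code.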
Saving a new .csv file for each test is a useful way to ensure that all calculations throughout an entire project are saved, rather than overwriting old calculations as results from new tests are analyzed. This is generally done by 1) creating a new folder as each test is analyzed, to store the results from that test, and 2) saving the results to a .csv file using a dynamic file name that represents the conditions of the test currently being analyzed.
These two goals can be accomplished using the following code.
Note that the example code uses the techniques shown in Part 2 to track the conditions of the test.
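For reference, a Find_Between helper like the one from Part 2 can be sketched as follows; the helper's internals and the example file name are illustrative, not the article's exact code:

```python
# Sketch of the Find_Between helper: it returns the substring of 'text'
# between the markers 'first' and 'last'
def Find_Between(text, first, last):
    start = text.index(first) + len(first)   # position just after 'first'
    end = text.index(last, start)            # next occurrence of 'last'
    return text[start:end]

# Hypothetical file name following the naming convention from Part 2
filename = 'PerformanceMap_FlowHot=3.5_FlowCold=3.5_TemperatureHot=100.4_TemperatureCold=50.csv'

Flow_Hot = Find_Between(filename, '_FlowHot=', '_FlowCold')
Temp_Cold = Find_Between(filename, '_TemperatureCold=', '.csv')
```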
```python
Flow_Hot = Find_Between(filename, '_FlowHot=', '_FlowCold')
Flow_Cold = Find_Between(filename, '_FlowCold=', '_TemperatureHot')
Temp_Hot = Find_Between(filename, '_TemperatureHot=', '_TemperatureCold')
Temp_Cold = Find_Between(filename, '_TemperatureCold=', '.csv')

Folder = (r'C:\Users\JSmith\DataAnalysis\Data' + '\\Flow_Hot=' + Flow_Hot +
          '_Flow_Cold=' + Flow_Cold + '_Temp_Hot=' + Temp_Hot +
          '_Temp_Cold=' + Temp_Cold)

if not os.path.exists(Folder):
    os.makedirs(Folder)

# Perform all calculations on the data set here

Data.to_csv(Folder + '\\Flow_Hot=' + Flow_Hot + '_Flow_Cold=' + Flow_Cold +
            '_Temp_Hot=' + Temp_Hot + '_Temp_Cold=' + Temp_Cold + '.csv',
            index=False)
```

Breaking the preceding code into five steps, it:

1. Uses the techniques from Part 2 to identify the nominal conditions of the data set currently being analyzed. In the context of this article, this ensures that the program has the information it needs to name folders and files, so that the results are stored in an orderly fashion that can easily be understood later.
2. Uses the conditions read in Step 1 to create a variable named Folder. The path of Folder is unique to this test, and uses the conditions of the test in the name of the folder.
3. Checks to see if there is already a folder matching that name and, if not, creates one.
4. Performs all calculations on the data. To simplify this discussion, and keep it focused on printing intermediate outputs, the calculations are represented by a single comment above.
5. Writes the data frame to a new .csv file located within the new folder; both the folder name and the file name contain the nominal test conditions. This rigorous naming structure ensures that the correct data set can easily be located later, as needed.
Using Plots to Check Results

One convenient way to check the quality of test data and the performance of the associated data analysis is via plots.
The advantage of automated data analysis programs is that they can rapidly create plots for hundreds of tests, reducing the manual labor required to generate them individually.
This allows the user to quickly peruse the plots and confirm data quality. Automatically generating and saving plots for each test is now something the reader can do: the general process is to use the techniques described in Automating Scientific Data Analysis Part 2 and Automating Analysis of Scientific Data Sets to create a program that cycles through all the data sets, performs the needed calculations, generates the needed plots, and saves the results.
The key concept here is to plot enough features of the data set to be able to quickly and visually check that the test proceeded correctly. Returning to the heat exchanger example described in Part 2, this means that the saved plots must allow the user to quickly determine:

1. That the hot-side and cold-side flow rates closely matched the nominal test conditions called for in the test plan,
2. That the hot-side and cold-side inlet water temperatures matched the nominal test conditions called for in the test plan,
3. That all parameters were steady enough to ensure quality, steady-state operation,
4. That the filters used to identify the steady-state period of the test selected the correct portion of the data set, and
5. That the final steady-state effectiveness value is steady and reasonable.
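The plotting step itself can be sketched as below; the column names, data values, and output file names are placeholders rather than the article's actual code, and in the real program the data frame would come from the file being analyzed:

```python
import matplotlib
matplotlib.use('Agg')  # save plots to files without opening windows
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data; in practice this is the filtered test data frame
Data = pd.DataFrame({'Hot Inlet Flow Rate (gal/min)': [3.45, 3.52, 3.48],
                     'Cold Inlet Flow Rate (gal/min)': [3.50, 3.49, 3.51],
                     'Effectiveness (-)': [0.33, 0.34, 0.34]})

def save_plot(data, columns, ylabel, path):
    # One figure per quantity: plot each listed column against the index
    fig, ax = plt.subplots()
    for col in columns:
        ax.plot(data.index, data[col], label=col)
    ax.set_xlabel('Measurement Index')
    ax.set_ylabel(ylabel)
    ax.legend()
    fig.savefig(path)
    plt.close(fig)

save_plot(Data, ['Hot Inlet Flow Rate (gal/min)', 'Cold Inlet Flow Rate (gal/min)'],
          'Flow Rate (gal/min)', 'FlowRates.png')
save_plot(Data, ['Effectiveness (-)'], 'Effectiveness (-)', 'Effectiveness.png')
```

In the full program, calls like these would sit inside the analysis loop, saving each figure into the per-test folder created earlier so the plots stay organized by test condition.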
This goal can be accomplished with three plots.
Figure 1 presents an example plot showing the water flow rate across both sides of the heat exchanger.
Assume for the sake of this example that the nominal flow rate condition for this test was 3.5 gal/min on both sides of the device.
In Figure 1, both flow rates range between 3.4 and 3.6 gal/min. This is a small amount of variation, within the anticipated range for a test, and thus closely matches the nominal test conditions.
Figure 1 shows that the first condition is satisfied.
Examples of dissatisfactory operation include an average flow rate of 3.2 gal/min on either side, or scatter ranging from 3.0 to 4.0 gal/min.
Figure 1 also shows that the flow rates meet condition three. While there is a small amount of scatter in the data, as anticipated, the long-term trend is very stable, right around 3.5 gal/min. A test would be considered dissatisfactory if, for example, the flow rate had temporarily dropped to 3.0 gal/min before returning to the set flow rate.
Figure 1: Flow Rate Data from an Example Data Set

Figure 2 serves the same purposes for temperature data.
The example data assumes a 100.4 °F inlet temperature on the hot side, and a 50 °F inlet temperature on the cold side.
Like the flow rate data, the test is considered valid if the recorded inlet temperatures are close to those nominal conditions, and do not vary dramatically throughout the steady state portion of the test.
Both of those conditions are met, so this plot indicates that the recorded temperatures meet conditions two and three.
Figure 2: Temperature Data from an Example Test

Figure 3 presents a plot of the calculated effectiveness of the heat exchanger.
Per the techniques shown in Part 2, this data set is filtered to only show data after the valves switch to the test flow rate.
The data clearly bounces around an effectiveness rating of 0.34, with some variation.
The variation is expected, because there’s some variation in both the temperature and flow rate data.
The first few data points make it clear that the unit was still in a transition period when the filter first applied, but there are only a few such points, so the impact on the calculated average effectiveness is minimal.
Aside from those beginning points, the presented effectiveness is quite stable.
This plot confirms that conditions four and five are met by this data set.
Figure 3: Effectiveness Data from an Example Test

Having these three plots available allows the user to ensure that the test was performed correctly, and that the data is valid, with mere seconds of active involvement.
Including a section in an automated data analysis program that creates and saves the required plots is a very effective and efficient way to ensure the quality of a data set.
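As noted in the effectiveness discussion, a few transition points can slip past a steady-state filter. One simple guard, sketched here with hypothetical values rather than the article's data, is to drop a fixed number of points after the filter first applies before computing the average:

```python
import pandas as pd

# Hypothetical effectiveness series: the first two points are still in the
# transition period, the rest are steady state
Effectiveness = pd.Series([0.21, 0.29, 0.34, 0.33, 0.35, 0.34, 0.34])

# Drop a fixed number of points after the filter first applies, so the
# transition does not bias the steady-state average
Transition_Points = 2
Steady = Effectiveness.iloc[Transition_Points:]
Average_Effectiveness = Steady.mean()
```

The number of points to trim is a judgment call; plotting the effectiveness, as in Figure 3, is the easiest way to confirm the trim captured the whole transition.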
Automated Data Checker

The most detailed and automated way to check the quality of the data in a test is to create an automated data checker. This is a script that reviews the measured data to determine what was recorded, compares it to the nominal test conditions, determines whether the results are acceptable, and reports unacceptable results to the user.
This can be advantageous for the user as it limits the number of plots that need to be manually reviewed.
Projects with hundreds of tests can easily yield thousands of plots, and reducing the number that need to be reviewed saves significant amounts of time, tedium, and project budget.
The following code provides an example of how this process can be accomplished.
Assume that the program already has a data frame named “Temp” to temporarily store information about questionable results, and a second data frame named “SuspiciousTests” holding the full list.
```python
if abs(Temperature_Hot - np.mean(Data['Hot Inlet Temperature (deg F)'])) > Threshold_Difference_Temperature:
    Temp.loc[0, 'Filename'] = filename
    Temp.loc[0, 'Test Parameters'] = ('H' + str(Flow_Hot) + '-C' + str(Flow_Cold) +
                                      '-T' + str(Temperature_Hot))
    Temp.loc[0, 'Code'] = 'Temperature_HotInlet_Avg'
    Temp.loc[0, 'Value'] = np.mean(Data['Hot Inlet Temperature (deg F)'])
    # Add the flagged test to the running list (pd.concat replaces the
    # deprecated DataFrame.append)
    SuspiciousTests = pd.concat([SuspiciousTests, Temp], ignore_index=True)

# After all tests have been processed (file name illustrative):
SuspiciousTests.to_csv('SuspiciousTests.csv', index=False)
```

That code works as follows.
First, it compares the nominal hot-side inlet temperature, Temperature_Hot, to the average measured hot-side inlet temperature.
If the difference is greater than a pre-specified value, set with the variable Threshold_Difference_Temperature, it determines that there was an issue with the test.
If there was an issue with the test, it records the parameters of the test.
These parameters include the filename, nominal test conditions, a code representing the test condition that was not met, and the measured value of that condition.
These entries are all added to the SuspiciousTests data frame.
After the program processes all the data, SuspiciousTests is printed to a .csv file, creating a record of which tests appear to be unsatisfactory.
The example showed a single check, comparing the average hot-side inlet temperature to the nominal condition for the test.
A thorough program would include checks for the other test parameters, as well as checks of the standard deviation to ensure that all parameters were stable.
In general, it’s best to remember the following guidelines when creating an automated data checking algorithm:

- It should check all nominal test conditions, to ensure that they are adequate in both average and standard deviation,
- It should check the results of any filters, to ensure that they caught the correct range of data,
- It should examine the final calculated output, to ensure that it is within the expected range and that the test provided a reliable signal, and
- It needs to be checked thoroughly itself, to ensure that it identifies problematic tests but not adequate ones, before being trusted.
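As a sketch of the standard deviation guideline (the threshold value and column name are hypothetical, not from the article), a stability check can compare the measured scatter against a pre-specified limit in the same way the average was checked:

```python
import numpy as np
import pandas as pd

# Hypothetical threshold and data; real values come from the test plan
Threshold_Std_Temperature = 0.5   # deg F, maximum acceptable scatter
Data = pd.DataFrame({'Hot Inlet Temperature (deg F)': [100.3, 100.5, 100.4, 100.6]})

# Stability check: flag the test if the measured scatter exceeds the threshold
Std_Hot_Inlet = np.std(Data['Hot Inlet Temperature (deg F)'])
Unstable = Std_Hot_Inlet > Threshold_Std_Temperature
```

If Unstable is True, the test can be recorded in SuspiciousTests with its own code, exactly as the average-temperature check did above.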
Next Steps

The articles thus far have taught you how to structure data sets for automation, automatically open and analyze all files, and check the data sets for errors.
The next step is to store these results in a way that allows you to naturally develop regressions.
That will be covered in my next article.