Automating Scientific Data Analysis Part 2:How to Structure Your Project so Python Programs Can Understand Your Data FilesPeter GrantBlockedUnblockFollowFollowingFeb 27The first step in automating scientific data analysis is ensuring that your programs are able to understand the data contained in each file.
There are three main challenges associated with this step:First, the data from the test must be structured in a way that the program can identify the important portion of the data.
Many tests will have warm-up periods, conditioning periods,and/or transient periods before getting to the part of the test with usable data.
To analyze that data set there must be a consistent signal the program can use to identify the correct period to analyze.
This article will present several potential solutions you can use to overcome this challenge.
Second, the program must have a way of identifying the conditions of each test.
The conditions are vital for both performing calculations to make sense of the data, and storing the results in an appropriate location.
This article will present a few different ways you can instruct your program to obtain the conditions of each test.
Third, the data for each test must have its own unique file for the previous suggestions to work.
Sometimes data files from an entire project come in a single file, making it much harder to identify the conditions of each test.
This article will provide two solutions to help you split these files apart, creating a separate data file for each test.
Designing Data Sets for Automated Laboratory Data AnalysisRepeatable data sets allow creation of programs which process the data set in the exact same way, every single time, with no modification.
At the same time, all experimental and/or simulation data sets will be complex, with changes throughout the test, or periods that are of more interest than others.
Therefore, the goal is to create repeatable signals in a potentially non-repeatable process.
There are several ways to do this.
They include using valve status or a control setpoint as a signal stating that the test has changed phases, comparing a measured condition to a setpoint, or using the time of the test.
These options will be described using the example of a single test on a heat exchanger.
Introducing an Individual Data SetThis topic will be described using an example data set emulating what occurs during testing of counter flow heat exchangers.
This is not actual laboratory data; instead it is an artificial data set created specifically for use demonstrating these concepts.
When characterizing a heat exchanger the main output of interest is the steady state effectiveness.
The equation to identify effectiveness is shown in Equation 1.
Per Equation 1, the important measurements in this testing are the flow rate, inlet temperature, and outlet temperature on both sides of the device.
Equation 1: Heat Exchanger EffectivenessFigures 1 and 2 show example data that could be from a typical test, with Figure 1 showing water flow rate data and Figure 2 showing temperature data.
The data sets are emulating a test with identical flow rates across both sides of the device, 100.
4 degree hot-side water temperature, and 50 degree cold-side water temperature.
To emulate the variability inherent in experimental data some randomness has been added to the artificial data set.
For the sake of this discussion, the important thing to note is that there are three phases to this test.
The first phase is the conditioning period of the test.
During this phase, the entire system is being flushed with 70-degree water to ensure that the unit has a stable starting condition.
The first phase can be identified because the temperature data is all approximately 70 degrees, and the flow rates are close to 0.
5 gallons/minute This phase continues for the first 410 seconds of the test, before the second phase begins.
The second phase is the warm-up period, as the hot and cold water enter the device, and the device approaches steady state operation.
It can be identified because the flow rates suddenly jump from roughly 0.
5 gallons/minute to roughly 3.
The third phase is the steady state effectiveness portion of the test.
In the third phase, the flow rates are fairly stable at 3.
5 gallons/minute, the hot-side inlet temperature is fairly stable at 100.
4 degrees, and the cold-side inlet temperature is fairly stable at 50 degrees.
Because the system is operating steadily, it is used to identify the performance of the device under steady state conditions.
This phase progresses from 440 seconds to the end of the test.
Figure 1: Example Water Flow Data from a Typical TestFigure 2: Example Temperature Data from a Typical TestFigure 3 shows the effectiveness calculated using the data shown in Figures 1 and 2 and Equation 1.
The effectiveness calculated during the first two phases of the test is not valuable.
This is because the system is not operating at steady state during those portions of the test, and steady state effectiveness is what we care about.
Figure 3: Calculated Effectiveness From the Sample DataFigure 3: Calculated Effectiveness Using the Sample DataNote that there are extreme differences in steady state effectiveness between the phases of the test.
Those differences present a challenge when automating the data analysis.
Since the desired outcome from each test is to calculate the average effectiveness of the device in steady state including the data from the first two phases will create erroneous results.
This necessitates filtering the data set such that the analysis focuses on the desired portion and designing the experiments to enable it.
Methods of Isolating the Desired Data SetThere are an infinite number of ways to isolate the desired data set.
The trick is to find one that works well for any given application, based on one’s needs and control over the available data.
This section will describe a few different methods, while comparing their strengths and weaknesses.
USING A CONTROL SIGNALThe most powerful method of isolating the desired data is printing a control signal in the data set.
This control signal states some condition of the test that identifies the appropriate phase of the test.
Filtering the data set using that control signal allows the analyzer to reduce the data set to include only the relevant data.
Using the example heat exchanger analysis, printing the status of flow control valves would achieve this result.
A valve status must change in order for the flow rate through the device to suddenly change from 0.
5 gallons/minute to 3.
5 gallons/minute at the end of the first phase.
Printing this control valve signal gives the program a solid point to identify that the conditioning phase has ended and the testing phase has begun.
Figure 4 shows the same flow data from a typical test with the control valve signal added in.
Figure 4: Valve Status and Flow Data in a Typical TestFigure 5 shows the effectiveness data from the same data set, filtered by the valve status signal.
Because of the use of that valve signal this effectiveness data set includes only data when there is flow through the device (Phases 2 and 3).
The inclusion of data from Phase 2 would cause a minor amount of error, though nowhere near as much as including the data from Phase 1.
The steadiness of the effectiveness data shown in Figure 5 indicates that the average effectiveness over the steady state period would be an accurate calculation.
Figure 5: Effectiveness Data Filtered by a Valve SignalUSING A KNOWN TIMEA second way of isolating the desired portion of the data is filtering the data to select a specific time period.
This method works well because it gives the user the most control of the resulting data set.
Compared to using a control signal this method can isolate the data set to solely the third phase without including the second phase.
Removing the data from the second phase will slightly increase the accuracy of the result.
The downside of using time-based control is that it requires the user to identify a time period which is identical in every test.
If there is any deviation from the original test plan, the analysis script must be modified accordingly.
The third phase in the example data set began at 440 seconds so that would be a logical time condition to use to analyze the data.
However, there is no way to be certain that each test would proceed identically, and that the third phase would begin by 440 seconds each time.
Thus, setting the filter a bit later increases the safety of the analysis.
The downside is that the analysis is completed using a smaller quantity of data, but the impact should be negligible as the steady state portion of the test is 935 seconds long.
Figure 6 shows the impact of filtering the data to the last 900 seconds of the test.
Figure 6: Effect of Filtering the Data to the Last 900 SecondsUsing the data from the final 900 seconds of the test would isolate the resulting data set to only the third phase of the test.
It would also exclude a small portion of the third phase in the sample data in case the second phase is slower in other tests.
At the same time, 900 seconds converts to 15 minutes, which is more than enough time to identify the steady state operation of the device.
Figure 7 shows the effectiveness data when it is filtered to include only the last 900 seconds of the sample data set.
Figure 7: Effectiveness Data Filtered to the Last 900 SecondsHow to Identify the Conditions of Laboratory Tests and Split Large Data FilesThe next step in the process is to program your script so that it knows the conditions of each test it’s analyzing.
The program must be able to determine the test conditions to perform the necessary calculations and to correctly store data for further analysis later.
There are several ways to provide the test condition information to the program, including the following methods:Writing the test conditions in the name of the data file,Putting the test number in the name of the data file and referencing the test plan, andReading the test conditions from the data set.
These solutions all share the assumption that each test comes in an individual data file.
However, sometimes data sets have several tests contained in a single data file.
In that case, another algorithm must be developed to split the large data file into individual data files.
Some possible approaches to this issue include:Splitting the data set based on a known conditioning period between tests, andUsing control information printed in the test.
The following sections provide details on the three listed methods of identifying the test conditions.
Writing the Conditions in the Name of the Data FileThis strategy uses the name of the data file to store the test conditions.
This way the analysis program can read the conditions while opening the file.
One major strength of this approach is that the test conditions are also easily understood by the reader.
The downside is that file names can get long and cumbersome.
As an example, assume that a test is performed with the following parameters:Hot and cold-side flow rates are both 3 gal/min, andHot-side inlet temperature is 100.
4 deg F, andCold-side inlet temperature is 50 deg F.
The following is an example file name that contains this information:FlowHot=3_FlowCold=3_TemperatureHot=1004_TemperatureCold=50.
csvNote that decimal points cannot be used in the filename, so the hot-side inlet temperature is written as “1004” instead of “100.
To read the conditions from this filename the data analysis script will need an algorithm to read the contents of the file name.
The easy way is to create an algorithm that reads the values located between two identifiers.
These identifiers are the characters used to separate the parameters from the descriptive terms in the file name.
In the case of our example, they’re “=” and “_”.
Continuing the heat exchanger example, the values for the four parameters can be identified using the following code.
def Find_Between(String, First, Last): try: Start = String.
index(First) + len(First) End = String.
index(Last, Start) return String[Start:End]Flow_Hot = float(Find_Between(filename, ‘FlowHot=’, ‘_FlowCold))Flow_Cold = float(Find_Between(filename, ‘_FlowCold=’, ‘_TemperatureHot’))Temp_Hot = float(Find_Between(filename, ‘_TemperatureHot=’, ‘_TemperatureCold’))/10Temp_Cold = float(Find_Between(filename, ‘_TemperatureCold=’, ‘.
csv’))Find_Between requires three inputs: 1) The full string that is being searched, 2) A section of the string prior to the section of interest, and 3) A section of the string following the section of interest.
It then returns the string located between the descriptors.
If it doesn’t find one of the specified descriptors it returns an error.
The four lines of code following the definition of Find_Between call the function to identify the values of the four parameters as specified in the file name, and converts them to float type.
In the case of T_Hot, the value is divided by 10 so the program treats it as 100.
4 instead of 1004.
Referencing the Test Number in the File NameAn alternative solution is to print the test number in the file name and have the program open the test file to identify the conditions.
This approach is less complicated to implement and leads to data files with much shorter names.
However, the filenames are much less descriptive making it more challenging to identify the contents of each test manually if that proves to be important.
For example, consider a file named “14 HX 30Jan2018.
This file represents heat exchanger test number 14, performed on 30, January 2018.
In order to identify the test conditions, the program needs to identify that it’s test number 14 and identify the conditions of test 14 by referencing the test matrix.
The following line of code can identify the test number:Number_Test = int(filename[0:3])This code reads the first two indices in the filename, in this case ‘14’, and converts them to an integer value.
Note that in tests with only a single digit, the returned string would be followed by a space.
That space is dropped when converting to an integer.
The test matrix must then be referenced to identify the conditions of the test, based on the test matrix.
It can be opened using the pandas .
read_csv function, as shown below.
Test_Matrix = pd.
csv’)Note that the path in that code is completely made up, and must be modified to read the location of your test matrix.
The final step is reading the conditions of the test out of the test matrix.
This is accomplished using the .
loc function from pandas.
Assume that the test matrix appears as shown in Table 1.
The rows showing “…” for test number indicate that the matrix has additional tests but they aren’t shown.
The conditions of the test can be identified using the following code.
Flow_Hot = Test_Matrix.
loc[Number_Test, ‘Hot-Side Flow Rate (gal/min)’]Flow_Cold = Test_Matrix.
loc[Number_Test, ‘Cold-Side Flow Rate (gal/min)’]Temperature_Hot = Test_Matrix.
loc[Number_Test, ‘Hot-Side Inlet Temperature (deg F)’]Temperature_Cold = Test_Matrix.
loc[Number_Test, ‘Cold-Side Inlet Temperature (deg F)’]This code will search the Test_Matrix data frame created when using pandas.
read_csv to read the data frame for the 14th row, where Test_Number = 14, and pull the value from the appropriate column, as specified in each line of code.
In this way the conditions of each test can be identified, and are ready for use in later analysis.
Reading the Test Conditions from the Data FileA final option is to read the data that’s in the file and identify the test conditions based on what occurred.
This option is the least accurate as it runs the risk of misidentifying the conditions, and should be considered as a last resort.
The strength of this approach is that it doesn’t require assistance from the lab tester, and can identify the test conditions of unspecified data files.
Remember that the steady state portion of the test is performed in the final 920 seconds of the test.
The nominal conditions of the test can be approximated by calculating the average value of each point during this period.
The four required parameters can be estimated using the following code.
Flow_Hot = Data_File[‘Hot-Side Flow Rate (gal/min)’][-50:].
mean()Flow_Cold = Data_File[‘Cold-Side Flow Rate (gal/min)’][-50:].
mean()Temperature_Hot = Data_File[‘Hot-Side Inlet Temperature (deg F)’][-50:].
mean()Temperature_Cold = Data_File[‘Cold-Side Inlet Temperature (deg F)’][-50:].
mean()These four lines of code select the last 50 entries in each data column and calculate the mean.
This provides an average value for that data point during the steady state period of the test, and an approximate value representing the test conditions.
The next step is identifying the nominal conditions used when running the test.
Because measured data never perfectly matches the nominal conditions there will be slight differences between the values specified in the test matrix and the values identified by the analysis script.
Using the test matrix to identify the nominal values in the script allows easier tracking of results.
This is accomplished by comparing the identified values from the test to the nominal values in the test matrix and identifying the case with the least difference.
Consider the following code that can be used to identify the nominal Flow_Hot value.
Test_Matrix = pd.
csv’)Test_Matrix[‘Difference Flow Hot (gal/min)’] = abs(Test_Matrix[‘Hot-Side Flow Rate (gal/min)’] — Flow_Hot)Flow_Hot = Test_Plan.
idxmin(), ‘Hot-Side Flow Rate (gal/min)’]Three things happen in this code:1.
The first line reads the test matrix.
The second line creates a new column in the table that identifies the absolute value of the difference between the nominal conditions of each test and the estimated conditions of the current test.
The third line identifies the row with the least difference between the nominal conditions and measured conditions, and sets Flow_Hot equal to the hot-side flow rate specified for that test.
Keep in mind that this code returns the nominal conditions that most closely match what occurred during the test.
In situations with high measurement uncertainty or with small differences in nominal conditions between each test this code may identify the incorrect nominal conditions.
Therefore, this approach should be considered the last resort.
Splitting Data FilesAll the previously described approaches assume that each test has an independent data file.
That doesn’t work if the data is delivered with several tests included in a single large data file.
In that case, a script must split the different tests out of the data set and create individual data files.
This can be done by identifying a repeatable signal and creating new data files whenever that signal appears.
Two examples of repeatable signals in our heat exchanger example include:Each test features a flush period before beginning.
The flush period is used to ensure that the device starts each test at a repeatable condition.
It consists of 4.
15 gal/min at 70 deg F on both sides of the device.
The flow rate and/or temperature conditions can be used to identify this flush period.
Control valves are used to direct water flow through the different parts of the test apparatus.
One valve configuration indicates a flush period, while another indicates the test period.
The flow valve configuration is another option for identifying the flush period.
Note that the second option, using control valve status, is more reliable than using the temperature and/or flow conditions.
This is because some tests may use conditions similar to the flush period, and there’s a risk that the script could confuse the two.
Once the identifier is selected, code must be written to locate the instances where this occurs and create new test files accordingly.
This can be done by removing the data representing the flush period and identifying the instances where the index of the table changes.
The following code shows an example where the flush occurs when “Solenoid Valve 1 Status (-)” = 0, and the test occurs when “Solenoid Valve 1 Status (-)” = 1.
Data_File = pd.
csv’)Data_File[‘Index’] = Data_File.
indexData_File = Data_File.
drop(Data_File[Data_File[‘Solenoid Valve 1 Status (-)’] == 0].
index)Data_File = Data_File.
reset_index()Data_File[‘Index_Shift’] = Data_File[‘Index’].
shift(periods = -1)Data_File[‘Index_Shift’] = Data_File[‘Index_Shift’] — Data_File[‘Index’]End_Of_Tests = End_Of_Tests = End_Of_Tests + Data_File[Data_File[‘Index_Shift’] > 1].
index[-1])That code performs the following:1.
The first two lines read the data file, and create a column titled ‘Index’ which matches the index of the table.
The third line removes all rows of the table where ‘Solenoid Valve 1 Status (-)’ = 0, thus removing all portions of the code representing the flush period.
The fourth line resets the index of the table, such that it appears as if no rows have been removed.
Note that the column titled ‘Index’ is not changed in this step.
The fifth line creates a new column titled ‘Index_Shift’ which replicates the values from ‘Index’, but shifted up one row in the table.
It allows an easy way to identify how much ‘Index’ changes from one row to the next.
The sixth line changes ‘Index_Shift’ to equal ‘Index’ minus ‘Index_Shift’ in any given row.
During the steady state portion of a test, this value will equal 1.
In cases where the flush period was removed and the data now jumps from one test to the next, ‘Index_Shift’ will be greater than 1.
The last three lines create an array called ‘End_Of_Tests’ with entries representing the index of the row which corresponds to the last data readings in any given test.
It is filled with the index of any row where ‘Index_Shift’ is greater than 1.
It also accepts the final index in the data set, because the ‘Index_Shift’ method will not identify that row as a final reading in a test.
After this code runs, the ‘End_Of_Tests’ array provides a list of all the indexes corresponding to the end of each test.
The code then needs to split the data set into individual files, and save them as new .
csv files with descriptive names.
This can be accomplished using the following steps:1.
Create a for loop that cycles through the ‘End_Of_Tests’ array.
Use the techniques described in Reading the Test Conditions from the Data File to identify the nominal conditions of each test.
Create a file name for each data file that describes the conditions of the test.
The code to do so will look something like:File_Name = “FlowHot=” + str(Flow_Hot) + “_FlowCold=” + str(Flow_Cold) + “_THot=” + str(T_Hot) + “_TCold=” + str(T_Cold)This creates a set of individual data files that can be analyzed individually with file names containing the conditions of the test.
They are ready to be opened, and understood using the techniques descried in Writing the Conditions in the Name of the Data File.
Next StepsThe previous post has focused on analyzing independent results from independent tests.
So far we’ve discussed how to design tests that allow automation, how to identify the conditions of tests from data files, how to split files as needed, some useful Python packages that can be used to analyze the data, and how to automatically ensure that the tests and analysis are valid.
The next post will focus on how to logically store those data files to facilitate later analysis of the data sets, how to generate regressions from the stored results, and how to validate/document the validation of those regressions.