Using advanced and “smart” analytics to boost profitability in a cyclic chemical process

rishabh jain, Apr 18

Objective: Increase profits by optimizing the controllable parameters to boost yield in a typical chemical plant running a cyclic process.

Background

I believe that analytics will have a significant impact on many areas of the chemical industry, primarily through gains in manufacturing performance.

Chemical industries have already invested in IT systems and infrastructure that generate and track enormous volumes of data with high velocity.

However, they have lacked the vision and capability to take advantage of this potential intelligence.

With cheaper and more advanced analytics tools on the market, one can leverage machine learning and visualization to optimize plant parameters and increase profitability.

Advanced analytics can help them understand what happens inside a chemical process, which may be unknown even to many practicing chemical engineers.

This, in turn, helps them overcome various bottlenecks and break some stereotypes (traditional thinking) in process monitoring and operation.

In this article, I will talk about the three main smart and advanced analytics tricks we incorporated to create a stable model, which the industry then employed to maximize profits.

1. Change modeling for time series data: Sometimes the data exhibits an inherent trend (like a continuously falling yield) which makes these trends very difficult for a model to learn.

Hence, we observe very poor performance in the out of time test data.

To overcome this, we predicted the change in yield instead of the absolute values of yield.

These changes in values are relatively stable and thus easier for the model to learn.

2. Intelligent feature engineering: Feature engineering is the most vital step of any predictive model.

This is where a true data scientist spends most of his/her energy.

With the help of industry experts and basic mathematics, we created a set of intelligent (but not intuitive) features which proved to be highly important in predicting the yield.

At the same time, these features helped chemical engineers understand how the plant functions.

3. Special or non-traditional techniques that come in handy in time-series modeling: While building a predictive model, we often need to perform analyses that help us understand the data and the underlying process.

In this section, I will talk about some non-traditional methods and techniques which can be handy for understanding the data and hence lead to a stable and robust model.

Let’s deep dive into the first part, i.e. change modeling, and see how it is better than traditional time series modeling.

Change Modeling

For cases involving trends in the data, it becomes extremely difficult to train a model without features capable of capturing that trend.

Every day, trending features take on values the model has never seen before, adding noise and inaccuracy to any ML model.

To handle such scenarios, data scientists generally create a synthetic variable capturing the trend, like the month of the year, elapsed time, or simply the row number.

In scenarios involving decay or auto-regressive properties with an unknown rate, this strategy tends to fail.

To overcome this trend problem, we introduce change modeling.

We predict change happening in yield between given time intervals.

With change modeling, we take the derivative of both x and y features with respect to time to smooth out the trend.

This method enabled us to increase our models’ predictive power significantly.

△P(t) = P(t) − P(t−1)

where P is the distribution that depends on time.

Note that outlier treatment and missing-value imputation should now be applied to these derivatives.

[Figure: the declining yield of a plant over the course of 2 years.] This is called ‘non-stationary’ behavior, i.e. a changing mean and variance, which is not ideal for time series modeling.

When we take a derivative of such a trend, the result looks more stable and appropriate for time series modeling: a constant mean and variance over time, which also mitigates noise arising from data collection.
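This differencing step can be sketched with pandas (a minimal illustration on a synthetic trending series; all names and numbers are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical daily yield with a downward trend (non-stationary)
rng = np.random.default_rng(1)
t = np.arange(200)
yield_abs = 95.0 - 0.03 * t + rng.normal(scale=0.5, size=t.size)
df = pd.DataFrame({"yield": yield_abs})

# Change modeling: model the first difference instead of the absolute value
# △P(t) = P(t) - P(t-1)
df["yield_change"] = df["yield"].diff()

# The differenced series has a roughly constant mean (near the daily trend)
# instead of a drifting one, so it is far easier for a model to learn
```

The first row of the differenced series is NaN by construction, which is one reason the imputation note above applies to the derivatives rather than the raw values.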

Intelligent feature engineering

I will talk about three broad types of features beyond the traditional ones.

1. Features whose definitions change because of change modeling: When you take derivatives of both x and y, some surprising things happen to the features.

By studying bivariate relationships and consulting industry experts, data scientists often need to create transformations of the features.

For instance, the logarithm of a feature, multiplying two of them, taking the square of one and many more.

Now, this transformation will change completely in the light of change modeling.

Here are the changes:

y = log(x)  →  △y = △x / x
y = x²  →  △y = 2x·△x
y = p·q  →  △y = p·△q + q·△p
y = 1/x  →  △y = −△x / x²

2. Some other cool features which can be helpful from both a business and a data science point of view: features like lags of the change variables, a second difference to capture the rate of change, and lags of the absolute variables.

Feeding all of these variables into the model can sometimes surprise you with its performance.
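These lag and second-difference features can be built with pandas shifts; a small sketch on toy numbers (column names are placeholders):

```python
import pandas as pd

# Toy series of absolute yield values (illustrative numbers)
df = pd.DataFrame({"yield": [70.1, 69.8, 69.9, 69.5, 69.6, 69.2]})

# Change variable: first difference of the yield
df["d_yield"] = df["yield"].diff()

# Lag of the change variable
df["d_yield_lag1"] = df["d_yield"].shift(1)

# Second difference: captures the rate of change of the change
df["d2_yield"] = df["d_yield"].diff()

# Lag of the absolute variable
df["yield_lag1"] = df["yield"].shift(1)

# diff() and shift() only look backwards in time, so none of these
# features peek at future rows
```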

However, one should be cautious of data leakage, which is a very serious issue in time-series modeling.

Data leakage is when information from outside the training dataset is used to create the model.

This additional information can allow the model to learn or know something that it otherwise would not know, and in turn invalidate the estimated performance of the model being constructed.

The tried and tested method of preventing data leakage is to ask yourself: any feature whose value would not actually be available in practice at the time you’d want to use the model to make a prediction is a feature that can introduce leakage into your model.
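The simplest structural guard is a strictly time-ordered train/test split; a minimal sketch (data and names are placeholders):

```python
import numpy as np
import pandas as pd

# Placeholder time-indexed data
df = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) * 2.0})

# Time-ordered split: train on the past, test on the future.
# A random shuffle here would leak future information into training.
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]

# Every training row precedes every test row
assert train.index.max() < test.index.min()
```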

3. Leveraging advanced algorithms to fetch creative features, especially for time series modeling: There are continual advancements in automated feature engineering modules which capture special features for time-series data.

One of such modules is “tsfresh” which is available in Python.

It automatically calculates a large number of time series characteristics.

Further, the package contains methods to evaluate the explaining power and importance of such characteristics for regression or classification tasks.

To understand the workings and implementation of the package in detail, please refer to the tsfresh documentation [5].

One thing to note: when using such packages, it becomes extremely difficult to explain a feature and its practical significance to the business.

Hence, we should not use such libraries when predictions are not the sole purpose of the model and the drivers behind those predictions also matter.

pip install tsfresh

```python
from tsfresh.examples.robot_execution_failures import (
    download_robot_execution_failures,
    load_robot_execution_failures,
)

download_robot_execution_failures()
timeseries, y = load_robot_execution_failures()
```

Smart or non-traditional techniques which can come in handy while solving time-series modeling

Here, we will discuss some techniques which I learned and implemented in various time series problems.

These techniques are non-traditional, and it is very difficult to find useful content about them on the web.

However, they turned out to be super handy in answering complex questions involving chemical processes, especially cyclic ones.

1. Impulse response function: To understand the time taken by one variable to change when another variable is changed. From this analysis, you can answer questions of the following types:

a. How much time is required to stabilize the yield if I make a small change in the temperature?
b. After how long does the system register the effect of a change in the chlorine level?

[Figure: impulse response of selectivity to a change in chlorine level.] The y-axis shows the change in selectivity given a change in chlorine level initiated at t = 0 hours. The x-axis shows the time since the change in chlorine level. The grey region is the confidence interval of the changes recorded in selectivity at t = T hours after the change in chlorine. The graph shows that the effect on selectivity is first recorded 8 hours after the change in chlorine, and it finally stabilizes after 12 hours.

In R, the library “vars” has the function irf (impulse response function), which does the job described above.

Here is an illustration of the code (R, tidyverse) used:

```r
library(vars)

# p represents the lag order; the model uses 17 lags of all variables
m.var <- df.reg %>% VAR(p = 17, type = "const")

irf.var.cum.ccf <- irf(
  m.var,
  n.ahead = 12,        # number of steps ahead
  runs = 1000,         # runs for the bootstrapping
  impulse = "d_ccf",   # variable which changes
  response = "d_S44"   # variable whose effect is recorded
)

# Generates the plot
irf.var.cum.ccf %>% plot()
```

2. Basis expansion: This technique enables us to capture the recent impact of one variable while maintaining the longer-term impact of other variables in a linear model.

Use it when, in a time series model, you want the coefficient of one variable to be driven by a more recent sample while the other coefficients come from the longer-term sample.

You can try other traditional techniques as well:

- Weighting the sample set so that more recent time periods have a higher weight (might result in a loss of overall predictive power)
- Creating features, or trying to understand why this positive/negative behavior changes over time (very difficult and data-dependent)
- Changing the length of the training period (might not produce the desired results, e.g. coefficients tending to zero)

Sometimes, the effect of one X variable on Y changes with time.

For instance, for 6 months it has a positive effect on the Y variable and for the next 4 months, the effect is negative.

You should make sure that the other variables have stable relationships with the Y variable.

A simple linear model would look like:

y(t) = β·X(t) + ε

where y(t) is some output (yield) that changes over time t. You can have many other time-varying signals, but for simplicity let’s assume there’s just the one: X(t).

You are trying to learn the relationship between X(t) and Y(t) according to some linear model with some error, epsilon.

This relationship is described by the coefficient, beta.

Complication: Depending on the period of data you use, the value of beta changes, suggesting the relationship between X and y may actually change over time.

Resolution: In this case, we need to introduce a new function called an “Indicator function”.

An indicator function is simply a function that is “1” when some conditions are met and “0” otherwise.

Let’s call the set of months January to June set “A”. We can use an indicator function to describe it:

I_A(t) = 1 if t is in A, 0 otherwise

Similarly, we can create an indicator function I_B(t) for the set of months July to December, calling this set “B”. Now, using these indicator functions, we can account for a relationship (beta) that changes over time by rewriting our equation:

y(t) = β_A·(X(t)·I_A(t)) + β_B·(X(t)·I_B(t)) + ε

To implement this in practice, you’ll engineer the features as described previously, creating 2 copies of X and zeroing out values for the appropriate times.

Then you can fit the model the same way you always do.
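A minimal sketch of this feature engineering with pandas (the January-to-June / July-to-December split follows the example above; all names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical monthly data with a single driver X
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "month": pd.date_range("2018-01-01", periods=12, freq="MS"),
    "X": rng.normal(size=12),
})

# Indicator-based basis expansion: set A = Jan-Jun, set B = Jul-Dec
in_A = df["month"].dt.month <= 6
df["X_A"] = df["X"].where(in_A, 0.0)    # X(t) * I_A(t): zeroed outside A
df["X_B"] = df["X"].where(~in_A, 0.0)   # X(t) * I_B(t): zeroed outside B

# Each original value lands in exactly one of the two copies, so fitting a
# linear model on X_A and X_B yields one beta per regime
```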

New complication: You don’t just have a single feature in your regression model.

Your regression equation is actually something more like this (albeit more complex):

y(t) = β_1·X_1(t) + β_2·X_2(t) + β_3·f(X_1(t), X_2(t)) + ε

Here it is simplified to only 2 independent variables; let’s say X_1 is the one we’ve been discussing so far.

X_2 is some other signal from the plant and f(X_1, X_2) is some feature you have engineered based on the interaction of X_1 and X_2.

If we implement the changes suggested above and split up the impact of X_1 into two features, you have:

y(t) = β_A1·(X_1(t)·I_A(t)) + β_B1·(X_1(t)·I_B(t)) + β_2·X_2(t) + β_3·f(X_1(t), X_2(t)) + ε

“If we believe the relationship between X_1 and y changes with time, do we also need to re-evaluate the relationship between X_1 and X_2 over time?”

New resolution: The answer is that it’s up to you to figure out whether it’s necessary to change this.

There’s no quick answer, but you can certainly try some things:

- Only change the single problematic feature (X_1) and leave the other engineered features the same
- Change the single problematic feature (X_1) and a few of the higher-impact engineered features that incorporate X_1
- Change all features that rely on X_1

All of these approaches are completely plausible and may or may not improve the fit of your model.

Of course, your interpretation will change depending on which approach you take.

If you only implement the single basis function described on X_1, you are effectively saying: “We believe that the impact of X_1 on yield changes over time, and we can demonstrate this. However, other processes impacted by X_1, e.g. f(X_1, X_2), have a constant relationship over time, and their relationship with yield does not change over time.” You will need to use your knowledge of the data and the processes to determine whether such a conclusion is sensible.

3. Quantile regression: This technique is not commonly used, but it has its own advantages over traditional linear regression.

One advantage of quantile regression, relative to the ordinary least squares regression, is that the quantile regression estimates are more robust against outliers in the response measurements.

Quantile regression has been proposed by various statisticians and used as a way to discover more useful predictive relationships between variables in cases where there is no relationship, or only a weak relationship, between the means of those variables.
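To see this robustness concretely, here is a self-contained sketch on synthetic data with a few large outliers in the response, using statsmodels’ quantreg (all names and numbers are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y = 2x + 1 + noise, with 5% large outliers in the response
rng = np.random.default_rng(42)
n = 500
x = rng.uniform(0, 10, n)
y = 2.0 * x + 1.0 + rng.normal(0, 1, n)
y[:25] += 40.0  # a handful of large outliers
df = pd.DataFrame({"x": x, "y": y})

# Median (q = 0.5) regression: the slope stays near the true value 2.0
# even with outliers that would drag an OLS fit away from it
res = smf.quantreg("y ~ x", df).fit(q=0.5)
slope = res.params["x"]
```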

Quantile regression implementation in Python:

```python
# Quantile regression package
import statsmodels.formula.api as smf
from sklearn.metrics import r2_score

# X_columns, df_train, df_test, y_test come from the plant dataset

# Building the formula
formula = 'Conversion_Change ~ ' + ' + '.join(X_columns)

# Training the model (median regression, q = 0.5)
mod = smf.quantreg(formula, df_train)
res = mod.fit(q=.5)

# Evaluation metrics on test dataset
r2_test = r2_score(y_test, res.predict(df_test))
```

References

[1] https://en.wikipedia.org/wiki/Quantile_regression#Advantages_and_applications
[2] https://web.stanford.edu/~hastie/Papers/ESLII.pdf
[3] https://en.wikipedia.org/wiki/Indicator_function
[4] https://machinelearningmastery.com/data-leakage-machine-learning/
[5] https://tsfresh.readthedocs.io/en/latest/