Early Detection of Sepsis Using Physiological Data

karan sindwani
Jul 5

What is Sepsis?

Sepsis is a potentially life-threatening condition caused by the body’s response to an infection.
Normally, the body releases chemicals into the bloodstream to neutralise an infection.
Sepsis occurs when the body’s response to these chemicals is out of balance, triggering changes that can damage multiple organ systems.
Sepsis is caused by infection and can happen to anyone.
Sepsis is most common and most dangerous in:

- Older adults
- Pregnant women
- Children younger than 1
- People who have chronic conditions, such as diabetes, kidney or lung disease, or cancer
- People who have weakened immune systems

Statistics

- In the USA, 270,000 people die from sepsis each year
- Internationally, 6 million people die from sepsis each year
- US hospitals spend $24 billion each year on sepsis (13% of the health budget)
- Each hour of delay in treatment can increase mortality by roughly 4–8%

Source: https://www. org/diseases-conditions/sepsis/symptoms-causes/syc-20351214

Objective

The goal of this blog is the early detection of sepsis using physiological data.
The early prediction of sepsis is potentially life-saving, and we aim to predict sepsis 6 hours before the clinical prediction of sepsis.
Conversely, the late prediction of sepsis is potentially life-threatening, and predicting sepsis in non-sepsis patients (or predicting sepsis very early in sepsis patients) consumes limited hospital resources.
Challenge Data

The Challenge data repository contains one file per patient (e.
Each training data file provides a table with measurements over time.
Each column of the table provides a sequence of measurements over time (e.g., heart rate over several hours), where the header of the column describes the measurement.
Each row of the table provides a collection of measurements at the same time (e.g., heart rate and oxygen level at the same time).
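The table layout described above can be sketched with a small reader. This is a minimal sketch, assuming the patient files are pipe-delimited text with a header row and with NaN marking a missing measurement; the column names (HR, O2Sat, SepsisLabel) and the sample values are illustrative assumptions, not taken from the article.

```python
import csv
import io
import math

# Hypothetical stand-in for one patient file: one row per hour, one column
# per measurement, pipe-delimited, with NaN marking a missing measurement.
SAMPLE_PATIENT_FILE = """HR|O2Sat|SepsisLabel
80|98|0
NaN|97|0
92|NaN|1
"""


def read_patient_table(handle):
    """Return a list of {column name: float} rows, one per hourly record."""
    reader = csv.DictReader(handle, delimiter="|")
    return [{name: float(value) for name, value in row.items()} for row in reader]


rows = read_patient_table(io.StringIO(SAMPLE_PATIENT_FILE))

# A column is the sequence of one measurement over time.
heart_rate_over_time = [row["HR"] for row in rows]

# A row is the set of measurements taken at the same hour.
first_hour = rows[0]

print(len(rows), "hourly records")
print("missing heart-rate values:", sum(math.isnan(v) for v in heart_rate_over_time))
```

For a real file, the same function could be called on an opened file handle instead of the in-memory sample.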
Features:

- Vital Signs: Heart Rate, Temperature, Blood Pressure, Respiratory Rate
- Laboratory Values: Platelet Count, Glucose, Calcium, etc.
- Demographics: Age, Gender, Time in ICU, Hospital Admit Time
- Label: 0 (Non-sepsis) and 1 (Sepsis)

Note: Feature descriptions and data can be downloaded from https://physionet.org/challenge/2019/

Plan Of Action

The data for the problem is an hourly time sequence record for each patient.
However, the records do not have a time label associated with them, which opens up the option of treating this as a non-temporal problem (ignoring the time component).

There are two ways to approach this problem:

- Temporal Approach: Take the time component of the data into account. Sepsis is predicted for each patient at each hour using that patient's past data.
- Non-temporal Approach: Ignore the time component and treat the records as independently and identically distributed. This approach allows predicting sepsis at each hour for any patient (with or without the patient's past data).

In this blog I am going to talk only about the non-temporal approach.

Non-Temporal Approach

In this approach we ignore the time component associated with each patient's hourly record and treat the records as independently and identically distributed.
Train-Validation-Test Split

The data repository has data from two hospitals and a total of 40 thousand patients.
The actual number of records is higher, as each patient stayed in the hospital for a variable amount of time and contributes one record per hour.
These records are split into train, validation and test sets.
While splitting, I made sure that each patient is fully contained in exactly one of the splits.
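A patient-level split of this kind can be sketched as follows. This is a minimal sketch rather than the script from the repository: the function name, the patient identifiers, the seed and the default fractions (75% / 12.5% / 12.5%) are assumptions made for illustration.

```python
import random


def split_patients(patient_ids, train_frac=0.75, val_frac=0.125, seed=0):
    """Shuffle patient IDs and cut them into train, validation and test groups.

    Splitting on patient IDs (not on individual hourly records) guarantees
    that all records of a patient land in the same split.
    """
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    train = set(ids[:n_train])
    val = set(ids[n_train:n_train + n_val])
    test = set(ids[n_train + n_val:])
    return train, val, test


# Hypothetical patient identifiers; the real ones would come from the file names.
patient_ids = [f"p{i:05d}" for i in range(40)]
train_ids, val_ids, test_ids = split_patients(patient_ids)

# No patient appears in more than one split.
assert not (train_ids & val_ids or train_ids & test_ids or val_ids & test_ids)
print(len(train_ids), len(val_ids), len(test_ids))
```

Each hourly record is then assigned to the split that contains its patient, so no patient's records leak between train, validation and test.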
- Train: 30K patients
- Test: 5K patients
- Validation: 5K patients

Note: The script that divides the data into the train, validation and test splits can be found at https://github.com/kskaran94/Sepsis_Identification

Exploratory Data Analysis

After performing descriptive data analysis on the train data, the following concerns stood out.

Concerns

Extremely imbalanced data: As the bar plot shows, the records are extremely imbalanced (less than 1% vs. 99%+), with the minority class being Sepsis (1).
Figure: Label Distribution

Missing data: There is a high percentage of missing data in most of the features.

Note: Detailed EDA and a baseline can be found at https://github.com/kskaran94/Sepsis_Identification

Handling Class Imbalance

There are various established ways of handling class imbalance in machine learning, which have proven successful in many scenarios.
However, most of them cannot be applied to this problem.
Because the imbalance is extreme, undersampling the majority class would lead to roughly 99% data loss.
Oversampling or SMOTE (Synthetic Minority Over-sampling Technique) can be applied to the minority class, but this would not be conceptually sound, as we are dealing with real-world health care records and want to preserve the original distribution of the data.
For the non-temporal approach to sepsis identification, we go ahead with the original distribution of the data and choose an appropriate evaluation metric for modeling.
Handling Missing Data

There are various established ways of imputing continuous data, such as the median or mean, which have proven successful in many scenarios.
However, continuous imputation is not applied in this problem.
It would not be conceptually sound, as we are dealing with real-world health care records and want to preserve the original distribution of the data.
For the non-temporal approach to sepsis identification, instead of imputing continuous features, we engineer categorical features out of the existing continuous features and represent missing values with a new category.
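As a rough illustration of turning a continuous measurement into a categorical one with its own category for missing values, here is a minimal sketch. The function name and the heart-rate range are assumptions chosen for the example; they are not the thresholds used in the repository.

```python
import math


def categorize(value, low, high):
    """Map a continuous measurement to 'Normal', 'Abnormal' or 'Missing'."""
    if value is None or math.isnan(value):
        return "Missing"
    return "Normal" if low <= value <= high else "Abnormal"


# Illustrative range only (beats per minute); not the author's thresholds.
HEART_RATE_RANGE = (60, 100)

heart_rates = [72.0, float("nan"), 130.0, None, 55.0]
categories = [categorize(hr, *HEART_RATE_RANGE) for hr in heart_rates]
print(categories)
```

Because a missing value becomes a category of its own, no value has to be invented for it and the observed values stay exactly as recorded.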
Feature Selection

The data has 40 features, which can broadly be classified into:

- Demographics
- Vital Signs
- Laboratory Values

According to credible sources such as www. gov, the symptoms of sepsis include high fever, abnormal blood pressure and a high respiratory rate.
These symptoms suggest that features like Heart Rate, Temperature and Blood Pressure may be important when predicting sepsis.
Sepsis is also most prevalent in infants and older patients, which makes age an important feature.
As pointed out in the previous section, we cannot handle missing data in the usual way.
The features in the Laboratory Values category have 90% or more missing data.
Even after engineering them as categorical features, imputing them would produce features with low variance, since almost every record would fall into the Missing category.
Such features may not add much information to the model, so features with more than 80% missing data are ignored.
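The 80% rule can be sketched as a simple filter over the columns. This is a minimal sketch on hypothetical toy data; the helper names and the column values are assumptions, and only the 0.8 cut-off comes from the text.

```python
import math


def missing_fraction(values):
    """Fraction of entries that are None or NaN."""
    missing = sum(1 for v in values if v is None or math.isnan(v))
    return missing / len(values)


def select_features(columns, max_missing=0.8):
    """Keep the columns whose missing fraction does not exceed max_missing."""
    return [name for name, values in columns.items()
            if missing_fraction(values) <= max_missing]


nan = float("nan")
# Hypothetical toy columns: a vital sign and a sparsely measured lab value.
columns = {
    "HR": [80.0, 82.0, nan, 85.0, 90.0, 88.0, 79.0, 81.0, 84.0, 86.0],
    "Glucose": [nan, nan, nan, nan, 110.0, nan, nan, nan, nan, nan],
}
print({name: missing_fraction(values) for name, values in columns.items()})
print(select_features(columns))
```

In this toy example the lab value is missing in 9 of 10 records, so it is dropped, while the vital sign is kept.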
Feature Engineering

The feature engineering for the selected features is described below:

- Heart Rate: Converted to categorical (Normal, Abnormal, Missing)
- O2Sat: Converted to categorical (Normal, Abnormal, Missing)
- Temperature: Converted to categorical (Normal, Abnormal, Missing)
- SBP/DBP: Converted to categorical (Normal, Abnormal, Missing)
- Respiratory_Rate: Converted to categorical (Normal, Abnormal, Missing)
- Age: Converted to categorical (Old, Infant, Child/Adult)
- Gender: Unchanged
- HospAdmTime (Hospital Admission Time): Unchanged
- ICULOS (ICU Length of Stay): Unchanged

Note: Code for feature selection and engineering can be found at https://github.com/kskaran94/Sepsis_Identification

Model Training and Evaluation

Evaluation Metric

Choosing a suitable evaluation metric is more important than we think.
In this case a metric like accuracy would not be useful, as a model that always predicts the majority class would have high accuracy.
For imbalanced data problems one can choose either Average_precision or F1_weighted, since they give a more complete picture of the confusion matrix.
We go ahead with average precision as the primary metric, but still look at other metrics like precision and recall.
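To make the contrast between accuracy and average precision concrete, here is a minimal sketch on hypothetical toy data with 1% positives. The hand-written `average_precision` is a simplified version (the mean of the precision values at the rank of each positive) that assumes at least one positive label and no tied scores; it is not the implementation used in the repository.

```python
def accuracy(y_true, y_pred):
    """Fraction of records whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive.

    Assumes at least one positive label and no tied scores.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


# Hypothetical toy data: 99 non-sepsis records and 1 sepsis record.
y_true = [0] * 99 + [1]

# A model that always predicts the majority class still looks accurate.
majority_predictions = [0] * 100
print("accuracy of majority-class model:", accuracy(y_true, majority_predictions))

# A scoring model that ranks the single sepsis record fourth from the top.
scores = [i / 100 for i in range(99)] + [0.955]
print("average precision of scoring model:", average_precision(y_true, scores))
```

The majority-class model reaches 99% accuracy without finding a single sepsis record, whereas average precision only rewards a model for ranking the sepsis records near the top.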
Machine Learning Models

The categorical features are one-hot encoded and the continuous features are scaled.
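The two preprocessing steps can be sketched in plain Python as follows. This is a minimal sketch: the article does not say which scaling was used, so standardization to zero mean and unit variance is an assumption, and the sample values are made up. Libraries such as scikit-learn provide `OneHotEncoder` and `StandardScaler` for the same purpose.

```python
from statistics import mean, pstdev


def one_hot(values, categories):
    """Encode each value as a 0/1 vector with one slot per category."""
    return [[1 if value == category else 0 for category in categories]
            for value in values]


def standardize(values):
    """Scale values to zero mean and unit variance (population std)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical sample: an engineered categorical feature and a continuous one.
heart_rate_category = ["Normal", "Missing", "Abnormal"]
icu_length_of_stay = [1.0, 2.0, 3.0]

encoded = one_hot(heart_rate_category, ["Normal", "Abnormal", "Missing"])
scaled = standardize(icu_length_of_stay)
print(encoded)
print(scaled)
```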
The performance of the different models can be seen in the table below.

Deep Learning Models

Auto-encoders have proven useful for anomaly detection use cases that involve high class imbalance.
This blog post offers a comprehensive approach to using auto-encoders for extremely rare event classification.

Blog link: https://towardsdatascience.com/extreme-rare-event-classification-using-autoencoders-in-keras-a565b386f098

Auto-encoders were expected to perform better than traditional machine learning models, as they model the behavior of the majority (non-sepsis) class and treat the rare sepsis class as an anomaly.
Auto-encoders increased the average precision to 7 percent, which is a good number for the health care domain, considering that we cannot compromise on either false positives or false negatives.
The precision-recall curve shows that there is no threshold at which both precision and recall take high values.
This explains our relatively low average precision value.

Figure: Precision-recall curve

We can confirm this by plotting the reconstruction error against a single threshold.
As we can see, the red line (threshold) cannot cleanly separate the two classes.
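The effect of a single threshold on the reconstruction error can be sketched as below. This is a minimal sketch that assumes, as in the linked blog post, that records with a reconstruction error above the threshold are flagged as the rare class; the error values, labels and thresholds are hypothetical and chosen so that the two classes overlap.

```python
def precision_recall_at(errors, labels, threshold):
    """Flag records whose reconstruction error exceeds the threshold as sepsis."""
    predicted = [1 if e > threshold else 0 for e in errors]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical reconstruction errors; the two classes overlap on purpose.
errors = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
labels = [0, 0, 1, 0, 0, 1, 0, 1]

for threshold in (0.25, 0.55, 0.75):
    precision, recall = precision_recall_at(errors, labels, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

When the error distributions of the two classes overlap like this, raising the threshold trades recall for precision, and no single threshold gives high values for both.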
Note: Code for model training and evaluation can be found at https://github.com/kskaran94/Sepsis_Identification

Conclusion

After looking at both machine learning and deep learning models, we can conclude that we need more features or more data for better model performance, or even a switch to the temporal approach.
Connect with me on LinkedIn: https://www. com/in/karansindwani/