Predicting environmental carcinogens with logistic regression, knn, gradient boosting and molecular fingerprintingBalancing imbalanced data, exploring accuracy metrics, and an introduction to cheminformaticsGurkamal DeolBlockedUnblockFollowFollowingJul 10Introduction and backgroundWhat is Tox 21?The Toxicology in the 21st Century program, or Tox21, is a unique collaboration between several federal agencies to develop new ways to rapidly test whether substances adversely affect human health.
Substances assayed in Tox21 include a diverse range of products such as: commercial chemicals, pesticides, food additives/contaminants, and medical compounds.
Why the p53 protein?The p53 gene encodes a protein of the same name and is known as a tumor-suppressor protein.
The p53 protein is expressed in cells when they undergo DNA damage — which can transform a normal cell into a cancerous one.
To counteract the effects, p53 can cause growth arrest, repair DNA, or begin the process of cell death.
Therefore, when DNA damage occurs, there is a significant increase in p53 expression.
This increase in protein expression is a good indicator of irregular cell health.
The Tox21 data was generated by testing cell lines which produce a florescent reporter gene product under the control of p53 cellular machinery.
By measuring levels of the reporter gene product against various compounds, researchers were able to determine whether a compound was an agonist (activator) of the p53 pathway or not.
Predicting agonists using molecular fingerprintingFingerprinting is a way to represent molecular structure and properties as binary bit strings (0’s and 1's).
This representation was initially developed and applied to searching databases for molecules with a specific substructure — but it can also be applied to machine learning.
A hash function is a random number generator and is applied to each feature of a molecule, such as the types of bonds and molecules present, which means that they act as seeds to the function.
The following image illustrates how a two molecules can have their similarity assessed (using The Tanimoto index) by generating molecular fingerprints.
Each molecule first has the hash function applied to it and then a fingerprint is generated based on features.
The fingerprint generator in the image below looks at a certain bond distance radius and the features within that distance.
Types of fingerprintsThis project will be using four different types of fingerprints and comparing their predictive strength.
Morgan circular fingerprint — This fingerprint is part of the Extended-Connectivity Fingerprints (ECFPs) family and are generated using the Morgan algorithm [2,3].
These fingerprints represent molecular structures and the presence of substructures by means of circular atom neighborhoods (bond radius).
Another important feature is their ability to determine the absence or presence of molecular functionality, which can further help discriminate when classifying molecules.
Daylight-like fingerprint — This fingerprint generator (using RDKit) produces a fingerprint similar to the fingerprint generated using the Daylight fingerprinting algorithm.
An oversimplified explanation: the algorithm hashes bonds along a path within a molecule (topological pathways).
The image below shows how bond paths are identified and described.
Atom-pair fingerprint — The atom-pair fingerprint is constructed using — as the name suggests — pairs of atoms (as features) as well as their topological distances.
The atom-pair fingerprint is generated as shown below by first identifying heavy atoms and the shortest distance between them.
Features of the individual atoms within a pair (like the number of bonds) are encoded .
These encoded features are then converted into bit strings and represented as one number.
This concatenated string of integers is then passed to the hash function which then assigns it as a 1 or 0.
Topological torsion fingerprints — are 2D structural fingerprints that are generated by identifying bond paths of four non-hydrogen atoms.
These four-path fragments then have their features, such as the number of π electrons of each atom, atomic type and number of non-hydrogen branches calculated.
Each atom is characterized by 8 bits and the 4 atoms have their bits stored in a 32-bit array.
Preparing the dataThe Tox21 data is already labeled with active (1) and inactive (0) states so there really isn’t much to do in this step other than load, format column names, and generate mol files (these files are generally classified as data files that contain molecular data information, atom, bonds, coordinates and connectivity information) for each molecule.
To follow along, edit the path to where the data is on your machine.
We load an sdf file — which is a structure-data file and contains associated data for compounds.
We actually don’t need these details as they wont add to classifying p53 agonists, hence we drop them.
For example the “formula column” gives us no usable information at all since its the empirical formula and doesn’t contain the valuable structural information a SMILES string provides.
We do need mol representations of the molecules so we generate those.
We should also ensure that there aren’t any duplicates in the data.
This data set in particular has a total number of 8629 records and only 6826 unique records, meaning that ~20% of the data is duplicated and needs to be dropped.
Fingerprint generationHere is an example of fingerprint generation using the Morgan algorithm.
A fingerprint is generated for each compound in the “mol” column with a radius of 2 and a bit length of 2048.
A list with the same parameters is also generated for each mol.
A for loop then iterates through the list to create a numpy array and append a new list denoted by “np” in the name.
These lists will become our x variables when running the classifiers.
Setting the x and y variables:Sampling: Are the classes balanced?When the classification labels (in our case: active = 1; inactive = 0) are skewed in proportion, bias is introduced into the model.
For example, if the data set contains 95% active labels then the model will likely classify an inactive molecule as an active one, simply put, the accuracy will reflect the class class distribution.
Determining if class imbalance is straight forward — count the distribution of labels.
Once we determine how prevalent the minority class is in our data set, we can move forward with balancing it out.
The count plot above shows us that there is a disproportionate distribution of classes in our data, a ratio of 15:1 (inactive to active) meaning approximately ~93.
84% of our classes being inactive.
Methods of class balancingThere are two ways to balance the data: over-sampling and under-sampling.
Over sampling brings balance by generating synthetic data points using using one of many algorithms.
The one we will use today is a variation of the SMOTE (Synthetic Minority Over-Sampling Technique) algorithm.
SMOTE works by searching for the k-nearest neighbors for a given point of data and then generates data points along the line between neighbors.
Since these points fall on a line between the real data points, there is some correlation, which is why we will use the ADASYN algorithm (Adaptive Synthetic).
ADASYN works essentially the same way SMOTE does, however, it adds small randomly generated values to the points to increase variance and simulate more realistic data.
Note — I opted not to use under-sampling since it balances the classes by removing data points from the majority class and therefore there is a loss of data, possibly affecting the classifier’s ability to discriminate.
Applying the ADASYN algorithm:Here we apply the ADASYN algorithm to each of the x variables which will ultimately result in different minority class counts for each fingerprint but the range is tight and variance in count is minimal so it shouldn’t pose a problem.
We can plot one of the re-sampled distributions to visually represent class balance:Creating training, testing, and validation test setsIn order to train, test, and then test again we need to create three separate data partitions.
The training data will be used to train our model — essentially the model will use this data to learn which fingerprints (the 2048 bit patterns) likely belong to p53 agonists.
The test set will be used to evaluate the accuracy of our model, and lastly a validation set will be used as unbiased data set to test the predictive ability of our model.
Our data will be split into 85/15 train/test sets and then the training data will be further split into an 85/15 train/validation set.
Logistic RegressionThe logistic regression algorithm classifies a categorical response (outcome) variable between 1 and 0 based on its relationship with predictive features.
In contrast, linear regression outputs response variables that are continuous and can be any real number.
Most importantly, linear regression does not output probabilities and instead fits the best hyperplane.
Therefore logistic regression is the natural choice for a binary classification problem such as ours.
Cross validation and model fittingCross validation is a statistical technique to reduce the risk of over-fitting the data.
Over-fitting occurs when the classifier memorizes the data and fits noise and error instead of the underlying relationship between variables.
This can occur if the model is overly complex and if there is limited data to train on.
Our model doesn’t seem to suffer from these two limitations but its good practice to perform cross validation anyway.
The top image illustrate how a model may “memorize” and essentially fit the data too well.
The bottom image shows how cross validation creates 10 folds, each of which is split into a test and train set to fit a model, and then averages the error from each fold.
Over-fitting (source) | cross validation (source)The following lines of code show how to perform logistic regression, obtain cross validations scores for each fold (just one iteration on the training data to estimate scores using the chosen classifier).
We then model logistic regression over 1000 iterations while performing k-fold (10) cross validation.
For the rest of the fingerprints we can simply write a function to model the data:Model Metrics and predictionTest set predictions and confusion matrixA confusion matrix is a table which describes the performance of the trained model on test data.
There are four quadrants comprising the matrix:true positives (TP): These are compounds that were predicted positive (active p53 agonist) and actually are positive.
true negatives (TN): These compounds were predicted to be inactive and actually are inactive.
false positives (FP): Cases where the compounds are predicted to be active but are actually inactive (Type I error).
false negatives (FN): Cases where the compounds are predicted to be inactive but are actually active (Type II error)We can plot the confusion matrix for both test and validation set predictions along with accuracy — which is determined through (TP +TN)/Total cases.
The confusion matrix on the left is that of the test set predictions and the right is for the validation set.
K-nearest neighborThe k-nearest neighbor algorithm (knn) takes a data point and looks at the k closest data points (k =7 would mean the 7 nearest), this data point is then labeled 0 or 1 according to the majority label of the k closest points.
For example, if given a point of data with k=7 and the closest points happen to be five 0’s and two 1’s then the point would be classified as 0 (inactive).
Knn performs best when working with fewer features, in turn resulting in lower chances of over-fitting.
Normally we would have to perform PCA or another form of dimension reduction, however since we’re using a single predictive feature we don’t need to.
knnGrid searching and model fittingPerform a grid search with cross validation to determine the optimal number of neighbors for our data.
We will test 9 candidate numbers (3, 5, 7, 9, 11, 13, 15, 17, 19) and perform a 5 fold cross validation.
This however may take a while since 9 neighbors x 5-fold validation for each amounts to 45 fits.
The k-fold can be reduced so that we can keep a decent range of neighbors to try.
Note: set “n_jobs” to -1 — this allows for parallel processing with max CPU usageThe grid search suggests that the best number of neighbors to use is 3.
To test out a range of k neighbors numbers we can fit different fingerprint data with varying k values.
Using the Atom-pair training data I tested k = 3, 5, 7, 9 which resulted in following accuracy scores in order:0.
785It seems that 3 and 5 had the same scores, however 3 was likely chosen due to less computation time.
Now to finish off, train the models with k=3 and compute their accuracy scores on first the test data and then validation dataGradient BoostingSo far we have fit a single model to the training data for each classier.
Gradient boosting combines many individual models to create a single accurate one — this is known as an ensemble.
Boosting is the method used to create an ensemble.
It works by fitting the data with an initial model, then a subsequent model is built which works on accurately predicting the labels that the initial model classified incorrectly.
Essentially, every subsequent model works to minimize the prediction error of the previous model which leads to an overall decrease in prediction error.
Therefore predictions made by the final model are the result of the weighted sum of predictions calculated by all previous models.
By Alex Rogozhnikov (source)Grid searching, model training, and metricsPerform a grid search to tune parameters as we’ve done before:Check out the accuracy on the test and validation dataAUC values and F1-scoresAUCThe Area Under Curve (AUC) value is another measure to evaluate the performance of a model on a binary classification task and is one of the most widely used metrics for evaluation.
Two properties of a given model are needed to calculate the AUC — sensitivity and specificity.
Sensitivity is the proportion of correctly classified positive points of data compared to all data points classified as positive.
Sensitivity is the proportion of data points that were correctly classified as negative compared to all data points classified as negative.
Sensitivity of a model: True positive / (False Negative + True positive).
Specificity of a model : True negative / (False positive + True negative)When the true positive rate is plotted against the false positive rate, the AUC of this curve can be calculated and essentially scores the ability of the model to distinguish between the two classes.
A higher AUC means that molecules classified as p53 agonists are indeed more likely to be agonists.
F1-scoreThe f1-score of a model is calculated using recall (calculated like sensitivity) and precision, which is the rate of:true positive / (false positive + true positive)The F1-score can then be found using:A F1 Score finds a balance between a models recall (fewer predicted points turning out to be false negatives) and precision (fewer predicted points turning out to be false positives).
Results and conclusionThe table below shows the predictive power of each classifier when using the four fingerprints.
Metrics included are accuracy scores are shown for training, test, and validation data sets, AUC scores for validation data, and the f1 scores of the validation data.
The highest scores are highlighted in green for each fingerprint and classifier.
Which is the best classifier?When looking at validation accuracy, logistic regression has the consistently highest scores.
Next, when evaluating AUC — arguably the most important metric — gradient boosting produced the highest score but logistic regression had the highest average scores.
Lastly, the highest individual and average f1-scores belong to logistic regression, suggesting that it produces models with the greatest balance between precision and robustness.
winner: Logistic regressionWhich is the best fingerprint?Literature suggests that although topological torsion may not be as popular as the Daylight-like and Morgan fingerprints, they show higher performance in comparison to other fingerprints .
The results show that the Daylight-like fingerprint had the highest AUC score for both logistic regression and knn — the Atom-pair fingerprint scored only marginally higher when using gradient boosting.
However when looking at gradient boosting, Atom-pair fingerprints outscored the Daylight-like fingerprints in validation accuracy, AUC, and F1-scores.
winner: Daylight-like fingerprintPossible issuesSometimes accidental collisions between patterns can occur in fingerprints due to similar molecular substructures.
This happens when the same bit is set by multiple patterns (hashing a specific feature of two similar structures as the same bit).
For example, the topological torsion fingerprint produced eight identical fingerprints, this can be avoided by increasing the bit length (from 2048 to 4096).
References Lo, Y.
, Rensi, S.
, Torng, W.
, & Altman, R.
Machine learning in chemoinformatics and drug discovery.
Drug Discovery Today, 23(8), 1538–1546.
010 Morgan, H.
The Generation of a Unique Machine Description for Chemical Structures — A Technique Developed at Chemical Abstracts Service.
1965, 5: 107–112.
 Rogers, D.
; Hahn, M.
2010, 50(5): 742–754.
html Jelínek, J.
, Škoda, P.
, & Hoksza, D.
Utilizing knowledge base of amino acids structural neighborhoods to predict protein-protein interaction sites.
BMC Bioinformatics, 18(S15).
1186/s12859-017-1921-4 Skoda, P.
, & Hoksza, D.
Exploration of topological torsion fingerprints.
2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).
7359792 Lee, H.
, Kim, J.
, & Kim, S.
Gaussian-Based SMOTE Algorithm for Solving Skewed Class Distributions.
INTERNATIONAL JOURNAL of FUZZY LOGIC and INTELLIGENT SYSTEMS, 17(4), 229–234.
229.. More details