Predicting Lung Cancer Mutations with Machine Learning

Jerry Wei, Jun 27

I read a recent Nature Medicine article in which the authors used deep learning to predict lung cancer gene mutations (link).

How did they do it?

Photo by Ousa Chea on Unsplash

Lung Cancer.

There are two key subtypes of lung cancer: adenocarcinoma and squamous cell carcinoma.

Being able to distinguish between these subtypes is extremely important because each subtype has its own treatment options — targeted therapies differ for adenocarcinoma and squamous cell carcinoma.

In particular, adenocarcinoma requires analysis of gene mutations; the primary mutations that are targeted include the epidermal growth factor receptor (EGFR), anaplastic lymphoma receptor tyrosine kinase (ALK), tumor protein 53 (TP53), and KRAS mutations.

It is vital to identify these mutations because there are tailored treatments for each mutation.

For example, EGFR and ALK mutations already have FDA-approved targeted therapies available.

The current method of analyzing lung cancer tissue samples (manual visual inspection of each sample) is both laborious and, at times, inaccurate.

Furthermore, it can be difficult to distinguish between adenocarcinoma and squamous cell carcinoma by eye.

An automated machine learning model that can accurately analyze lung cancer tissues would thus be extremely beneficial.

Number of whole-slide images for each class, where LUSC represents squamous cell carcinoma and LUAD represents adenocarcinoma.

Image credits to Coudray et al., the original authors of the paper.

Lung Cancer Image Dataset.

The authors used data from the NCI Genomic Data Commons; they retrieved about 1,600 whole-slide images, of which 609 were positive for squamous cell carcinoma, 567 were positive for adenocarcinoma, and 459 were normal.

They used a sliding-window algorithm to generate about one million 512×512 pixel windows from those whole-slide images.

Essentially, they slid an imaginary “window” over the entire tissue sample (which can be up to 100,000 pixels by 100,000 pixels) and used each of those windows as an individual sample.
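To make the sliding window concrete, here is a minimal sketch that only computes where each window would start. The function name `tile_origins` and the choice of non-overlapping windows (stride equal to the window size) are my assumptions; the article does not say whether the windows overlap, and actually reading pixels from a slide file is left out.

```python
def tile_origins(width, height, tile=512, stride=512):
    """Yield (x, y) top-left corners of every full tile that fits in the slide."""
    for y in range(0, height - tile + 1, stride):
        for x in range(0, width - tile + 1, stride):
            yield x, y

# A hypothetical 100,000 x 100,000 pixel slide, tiled without overlap.
origins = list(tile_origins(100_000, 100_000))
print(len(origins), "windows from one slide")
```

Each `(x, y)` pair would then be used to crop one 512×512 sample out of the slide.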

They then split the resulting 1 million windows such that 70% was used as a training set, 15% was used for validation, and 15% was used as a test set.
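A 70/15/15 split like this can be sketched in a few lines. The helper `split_indices` and the fixed seed are hypothetical and are not taken from the paper's code. In practice you would probably want all windows from one slide to land in the same partition so that the test set stays truly unseen; the article does not say how this was handled.

```python
import random

def split_indices(n, train=0.70, valid=0.15, seed=0):
    """Shuffle n sample indices and cut them into train/validation/test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(n * train)
    n_valid = round(n * valid)
    return (idx[:n_train],
            idx[n_train:n_train + n_valid],
            idx[n_train + n_valid:])  # whatever is left becomes the test set

train_idx, valid_idx, test_idx = split_indices(1000)
print(len(train_idx), len(valid_idx), len(test_idx))
```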

The data processing strategy used in the paper.

Image credits to Coudray et al., the original authors of the paper.

Machine Learning with Inception v3.

The authors based their model on the Inception v3 architecture, which uses inception modules made of convolutions of different kernel sizes and a max pooling layer.
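Here is a toy sketch of the idea behind an inception module: run several branches over the same input in parallel and stack their outputs as channels. Everything in it is a simplification I chose for illustration. The kernels are fixed averaging filters rather than learned weights, the helper names are hypothetical, and the real Inception v3 modules are considerably more elaborate.

```python
def pad(img, p, fill=0.0):
    """Surround a 2D list with p rows/columns of `fill`."""
    blank = [fill] * (len(img[0]) + 2 * p)
    rows = [[fill] * p + list(row) + [fill] * p for row in img]
    return [blank[:] for _ in range(p)] + rows + [blank[:] for _ in range(p)]

def conv_same(img, k):
    """k x k averaging convolution with zero padding (output size = input size)."""
    padded = pad(img, k // 2)
    return [[sum(padded[y + dy][x + dx] for dy in range(k) for dx in range(k)) / (k * k)
             for x in range(len(img[0]))] for y in range(len(img))]

def maxpool_same(img, k=3):
    """k x k max pooling with stride 1, so the output keeps the input size."""
    padded = pad(img, k // 2, fill=float("-inf"))
    return [[max(padded[y + dy][x + dx] for dy in range(k) for dx in range(k))
             for x in range(len(img[0]))] for y in range(len(img))]

def inception_like(img):
    # Parallel branches over the same input, stacked as output channels.
    return [conv_same(img, 1), conv_same(img, 3), conv_same(img, 5), maxpool_same(img)]

img = [[float(4 * y + x) for x in range(4)] for y in range(4)]
channels = inception_like(img)
print(len(channels), "channels of size", len(channels[0]), "x", len(channels[0][0]))
```

The point is only that small and large kernels look at the same patch at different scales, and the next layer gets to see all of them at once.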

What’s this convolution you speak of?

I’m basically talking about convolutional neural networks (CNNs); these neural networks are particularly good at image processing, which happens to be what the paper is trying to do!

Transfer Learning.

The paper also used transfer learning for the classification between adenocarcinoma and squamous cell carcinoma.

But what is transfer learning?

Transfer learning is basically a way to use someone else’s model.

Neural networks have weights between layers, and these weights facilitate the actual functioning of the model.

So if you can get those exact weights, you are essentially copy-pasting a model.

And that’s what transfer learning is: using weights that someone else trained and just fine-tuning them for your own purpose.


Steal Yo Model.

In this case, the authors took weights that had performed best on the ImageNet competition and fine-tuned them on the lung cancer data.
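As a rough picture of what fine-tuning can look like, the sketch below keeps a "pretrained" layer frozen and trains only a small new output layer on top of it. This is one common variant of transfer learning, not necessarily what the authors did. The weights in `PRETRAINED`, the four training examples, and all function names are invented for illustration.

```python
import math

# Stand-in for weights that "someone else trained"; these numbers are invented.
PRETRAINED = [[0.8, -0.3], [0.1, 0.9]]

def features(x, weights):
    """Frozen layer: a linear map followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]

def predict(x, weights, head):
    """Probability of the positive class from the frozen features and the new head."""
    z = sum(h * f for h, f in zip(head, features(x, weights)))
    return 1.0 / (1.0 + math.exp(-z))

def fine_tune_head(data, weights, head, lr=0.5, epochs=200):
    """Train only the head with cross-entropy gradient steps; `weights` stays untouched."""
    head = list(head)
    for _ in range(epochs):
        for x, y in data:
            err = predict(x, weights, head) - y  # gradient of the loss w.r.t. z
            head = [h - lr * err * f for h, f in zip(head, features(x, weights))]
    return head

# Made-up two-feature "tiles" with binary labels.
data = [([1.0, 0.0], 1), ([0.9, 0.2], 1), ([0.0, 1.0], 0), ([0.1, 0.8], 0)]
head = fine_tune_head(data, PRETRAINED, [0.0, 0.0])
print([round(predict(x, PRETRAINED, head), 2) for x, _ in data])
```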

Of course, there are some other hyperparameters that they used for their model: loss function (cross entropy), learning rate (0.1), weight decay (0.9), momentum (0.9), and optimizer (RMSProp).
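To show how those numbers fit together, here is a minimal sketch of one common formulation of RMSProp with momentum, applied to a single weight. I am assuming that the 0.9 "weight decay" is the decay rate of the running average of squared gradients, and the `eps` value of 1.0 is my own choice; the article gives no epsilon. The toy objective is also made up.

```python
import math

def rmsprop_step(w, grad, state, lr=0.1, decay=0.9, momentum=0.9, eps=1.0):
    """One RMSProp-with-momentum update for a single scalar weight.

    eps=1.0 is an assumed value, not one quoted in the article.
    """
    # Running average of squared gradients.
    state["ms"] = decay * state["ms"] + (1 - decay) * grad * grad
    # Momentum-smoothed step, scaled by the root of that average.
    state["mom"] = momentum * state["mom"] + lr * grad / math.sqrt(state["ms"] + eps)
    return w - state["mom"]

# Toy problem: minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, state = 0.0, {"ms": 0.0, "mom": 0.0}
for _ in range(500):
    w = rmsprop_step(w, 2 * (w - 3), state)
print(w)
```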

Heatmap showing what the model is looking at.

Image credits to Coudray et al., the original authors of the paper.


Because they had two different tasks (predicting adenocarcinoma vs. squamous cell carcinoma and predicting gene mutations for the slides with adenocarcinoma), they trained multiple variations of their model.

For the first task, they trained their model to predict normal tissue vs adenocarcinoma vs squamous cell carcinoma.

For the second task, they trained their model to predict each gene mutation as its own binary (present or absent) outcome rather than treating the mutations as mutually exclusive classes in a multiclass classifier.

This means that their implementation allowed for each 512×512 patch of lung cancer tissue to be positive for more than one gene mutation.
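The difference between the two setups is easy to see with a few numbers. In the sketch below, the raw scores for one tile are made up, and the use of a sigmoid per mutation versus a single softmax is my assumption about how such outputs are typically produced.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores (logits) for one tile; the numbers are invented.
logits = {"EGFR": 2.0, "KRAS": 1.5, "TP53": -1.0}

# Independent binary outputs: each mutation gets its own probability.
binary = {gene: sigmoid(z) for gene, z in logits.items()}
positives = [gene for gene, p in binary.items() if p > 0.5]

# One multiclass output: the probabilities compete and sum to 1.
multiclass = dict(zip(logits, softmax(list(logits.values()))))

print("binary positives:", positives)
print("multiclass winner:", max(multiclass, key=multiclass.get))
```

With independent outputs, two mutations can both clear the 0.5 threshold for the same tile; with a softmax, at most one class can.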

For both tasks, they trained the model for 500,000 iterations.


A few methods were used to validate their model’s effectiveness.

First, they compared their model to pathologists.

On an independent test set, 50% of the slides that were misclassified by their model were also misclassified by at least one pathologist, and 83% of the slides that were misclassified by at least one pathologist were correctly classified by the model.
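Note that the two percentages use different denominators: the first is a share of the model's errors, the second a share of the pathologists' errors. A tiny worked example with made-up slide IDs (chosen only so the arithmetic lands on the same percentages) shows how both can hold at once.

```python
# Made-up slide IDs; only the set sizes matter here.
model_wrong = {"s01", "s02", "s03", "s04"}
pathologist_wrong = {"s03", "s04"} | {f"p{i:02d}" for i in range(10)}

both_wrong = model_wrong & pathologist_wrong

# Share of the model's errors that at least one pathologist also got wrong.
share_of_model_errors = len(both_wrong) / len(model_wrong)

# Share of the pathologists' errors that the model classified correctly.
model_right_share = len(pathologist_wrong - model_wrong) / len(pathologist_wrong)

print(f"{share_of_model_errors:.0%} and {model_right_share:.0%}")
```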

This was seen as evidence that the model’s performance was on par with pathologists.

The authors also calculated the accuracy of their model for each gene mutation, finding that the model did much better than guessing for all of the mutations.

Area under the receiver operating characteristic curve (AUC) scores achieved by the model for each mutation.

Image credits to Coudray et al., the original authors of the paper.
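If AUC scores are unfamiliar, one way to think about them is as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, so 0.5 is guessing and 1.0 is perfect. The sketch below computes it that way by brute force; the function name `roc_auc`, the labels, and the scores are all made up for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count a win as 1, a tie as 0.5, a loss as 0, over every positive/negative pair.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]              # 1 = mutation present (invented data)
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]  # model outputs (invented data)
auc = roc_auc(labels, scores)
print(round(auc, 3))
```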

What does this mean?

The authors created a machine learning model that was able to both classify lung cancer gene mutations with reasonable accuracy and identify the difference between two subtypes of lung cancer.

This shows how powerful machine learning is and how it has a huge variety of applications.

For now, the model is primarily useful for assisting pathologists in diagnosis, so the diagnosis process remains semi-manual.

Well, what else could be done with this model?

In the future, the authors would apply the model to try to classify less common lung cancers, including large-cell carcinoma and small-cell lung cancer.

The introduction of their model could also potentially lead to fully automated analysis of lung cancer tissues with high accuracy, which would reduce both the time needed for analysis and the potential for human error.

Perhaps in the future, we’ll be able to have a computer diagnose diseases for us through machine learning…

I’ll list some additional resources below that may be interesting:

Original paper

GitHub repository for the paper

Some more information about lung cancer
