Machine learning applied to geophysical well log data

In recent years, machine learning (ML) has become very popular, and a wide range of industries are applying it to their datasets. This article applies ML to well-log data and looks at how ML can help in learning the lithology in a well.

Figure: An offshore drilling rig

What are well-logs?

Well-logs are point measurements of the physical properties of the subsurface, recorded in a well, that vary vertically with depth. The properties are random and aperiodic and depend on factors such as mineral composition or lithology, porosity, cementation and compaction, and the presence of fluids. A suite of conventional well-logs recorded from an offshore area is shown. Each log shows how a property varies with depth. Here, we are looking at sediments that were deposited 5 to 16 million years ago! Isn’t that impressive?

Well-log properties

Some logs, like gamma ray (GR) and Poisson’s ratio, are excellent lithology indicators, while others, like density and P-wave velocity, are useful for understanding the rock type, the types of pore fluids, and the pressure and compaction trends in the deposited sediments. Density porosity, a physical property derived from the density log, measures the amount of pore space in a rock and can be compared within one rock type or across different rock types.

Sediments are generally transported by rivers and deposited in a basin, so the lithology depends on which rocks were weathered and which areas the rivers carried the sediments through. The well-log shown here is from a fluvial-deltaic system that is dominantly sands and shales.
The common rock types encountered are sandstones (reservoir) and shalestones (non-reservoir):

# Sandstone/Sand: rock made up of quartz grains, feldspars, calcite, heavy minerals and other rock fragments.
# Shalestone/Shale: rock made up of clay minerals, some of which are radioactive.
# Siltstone/Silt: a rock composed of both sand and shale, which can be present too.

Special Note: This study would not have been possible without Ramya Ravindranathan. The dataset and interpretation were provided by her.

Figure: Fluvial-deltaic system

The log responses at bed boundaries differ between lithologies. Clean sands are characterized by lower GR, Poisson’s ratio, density and velocity values and by higher porosity values. Shales tend to have higher GR values than sands; their porosity is extremely low compared to sands, while their density, Poisson’s ratio and velocity values are higher. Silt has properties in between those of sand and shale.

Figure: Suite of well logs extending from 2800 ft to 10250 ft

The individual sand beds encountered in this well range from 10 to 150 ft (about 3 to 46 m) in thickness and are separated by thick or thin shale beds. The reservoir quality of a sand is measured by its clay percentage: the best reservoirs have the lowest clay content. It would be useful if lithology could be classified based on clay content. Ideally, many wells would be drilled to get a clear picture of the subsurface, but that is a very costly affair because drilling a single well can cost many millions of dollars.

Reliably predicting lithology is therefore one of the key problems in reservoir characterization. In practice, a combination of physical models, local geological knowledge, and experience is used to reduce large seismic and well-log datasets into low-dimensional models of the Earth*. Unfortunately, these simplified physical and geological assumptions do not always hold true, making the inferred model highly uncertain and biased.

This problem can be reformulated using general machine learning models.
Hence it would be a good idea to predict the lithologies in the large well dataset of a basin algorithmically.

Unsupervised Learning with well dataset

Identifying lithologies from well data is a classic example of an unsupervised problem with unlabeled data. Unsupervised machine learning is applied to the well logs to obtain clusters that can be correlated with the lithology of the well. Unsupervised methods are particularly useful when the inferred structure is lower-dimensional than the original data. Conveniently, the interpreted images and geological maps produced by geoscience workflows are substantially lower-dimensional than the original field data.

This post addresses the following questions:

· Can unsupervised learning categorize the data based on lithology in the region?
· Do the clusters match up with the depths at which the well-logs change in character?

Two unsupervised learning tasks are explored here: (a) clustering the data into groups by similarity, for which k-means clustering is used in this study, and (b) reducing dimensionality to compress the data while maintaining its structure and usefulness, for which PCA and t-SNE are used.

The dataset consists of various well logs with a wide range of units, so it needs to be normalized before the clustering algorithm is applied. In this problem, there are more than 7000 discrete depth points.

k-means clustering creates groups of data points such that points in different clusters are dissimilar while points within a cluster are similar. Here, the well-log data points are grouped into k groups that represent different lithologies. A larger k creates smaller groups with more granularity, while a smaller k creates larger groups with less granularity. The “elbow” method computes a score for a range of cluster numbers, such as the within-cluster sum of squared distances.
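The normalization and elbow steps described above can be sketched roughly as follows. This is a minimal NumPy-only illustration, not the code used in this study: the synthetic three-log dataset, the function names (`standardize`, `kmeans`, `elbow_scores`) and the choice of the within-cluster sum of squares as the score are all assumptions made for the example. In practice a library implementation such as scikit-learn's `KMeans` would normally be used instead of a hand-written loop.

```python
import numpy as np


def standardize(logs):
    """Scale each log (column) to zero mean and unit variance.

    Assumes no column is constant, otherwise the division would fail.
    """
    return (logs - logs.mean(axis=0)) / logs.std(axis=0)


def kmeans(data, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns labels, centroids and inertia."""
    rng = np.random.default_rng(seed)
    # Start from k distinct depth points chosen at random.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance of every point to every centroid, shape (n, k).
        dist2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist2.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if empty).
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    dist2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dist2.argmin(axis=1)
    # Inertia: sum of squared distances of points to their own centroid.
    inertia = dist2[np.arange(len(data)), labels].sum()
    return labels, centroids, inertia


def elbow_scores(data, k_values, n_init=5):
    """Best (lowest) inertia over n_init random starts for each k."""
    return {
        k: min(kmeans(data, k, seed=s)[2] for s in range(n_init))
        for k in k_values
    }


# Synthetic stand-in for a well-log table (hypothetical values):
# columns loosely mimic GR, density and P-wave velocity in different units.
rng = np.random.default_rng(42)
centers = np.array([[60.0, 2.20, 2500.0],
                    [90.0, 2.40, 3000.0],
                    [120.0, 2.60, 3500.0]])
spreads = np.array([3.0, 0.02, 50.0])
logs = np.vstack([c + rng.normal(size=(200, 3)) * spreads for c in centers])

scaled = standardize(logs)
scores = elbow_scores(scaled, range(1, 7))
for k, score in scores.items():
    print(k, round(score, 1))
```

Plotting `scores` against k gives the elbow curve. Taking the best of several random starts is one simple way to reduce the dependence of k-means on its initial state.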
Plotted against the number of clusters, the score forms a line chart that resembles an arm, and the “elbow” (the point where the curve bends most sharply) is a good indication of the optimal number of clusters.

In the present analysis of the elbow plot, the elbow can be placed at 6 or 7 clusters. Elbow plots often do not show a single clear bend, so further verification is essential.

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique for embedding high-dimensional data in a low-dimensional space of two or three dimensions for visualization. It models each high-dimensional object by a two- or three-dimensional point in such a way that, with high probability, similar objects are modeled by nearby points and dissimilar objects by distant points.

In the t-SNE embedding, the raw data separates into 7 classes. The axes are the coordinates generated when the high-dimensional data is mapped onto two dimensions. The figure shows that t-SNE generated distinct, well-separated clusters. The color scheme follows the labels generated by the k-means algorithm. A few points from the k-means clusters do not agree with the t-SNE clusters (a few purple points are grouped with green points).

Figure: t-SNE clusters mapped onto 2-D space. Color scheme according to the labelled clusters of k-means.

A final verification of the optimal number of clusters is performed with PCA. Principal component analysis (PCA) is an unsupervised method that reduces the dimensionality, and hence the complexity, of a dataset while maintaining its structure (variance). It performs a rotation of the data that maximizes the variance along the new axes and projects the high-dimensional data into a low-dimensional subspace (visualized in 2 or 3 dimensions).

In the PCA projection, the dataset also separates into 7 classes, and the boundaries between the classes are distinct.
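The rotation-and-projection idea behind PCA can be illustrated with a short sketch. It is a NumPy-only example under stated assumptions, not the code used in this study: the function name `pca_project` and the synthetic five-log dataset are hypothetical, and a library class such as scikit-learn's `PCA` (or `TSNE` for the t-SNE embedding) would normally be used in practice.

```python
import numpy as np


def pca_project(data, n_components=2):
    """Project data onto its leading principal components via SVD."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Coordinates of every depth point in the rotated, reduced space.
    projected = centered @ components.T
    variance = singular_values ** 2
    explained_ratio = variance[:n_components] / variance.sum()
    return projected, components, explained_ratio


# Hypothetical stand-in for standardized well logs: 500 depth points and
# 5 logs that are driven by 2 hidden factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 5))
logs = latent @ mixing + 0.05 * rng.normal(size=(500, 5))

projected, components, explained_ratio = pca_project(logs)
print(projected.shape, explained_ratio.round(3))
```

The two columns of `projected` are the axes of a 2-D scatter plot like the one described here, and coloring the points by their k-means labels shows how well the clusters separate in the projection.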
The cluster centroids are also distinctly visible in the PCA projection.

Figure: PCA clusters mapped onto 2-D space

Extracting labeled clusters

Based on the three methods (elbow plot, t-SNE and PCA), the optimal number of clusters in the well log is seven. The k-means method is run again with seven clusters to generate the cluster labels, which assigns a cluster label to every depth point in the dataset.

The histogram shows the number of depth points in each of the 7 clusters. Its color scheme and labels are then used to plot the well-logs alongside the lithologies generated from the clusters. The interpretation table uses the same color scheme and labels.

Figure: Histogram showing the distribution of points in the clusters generated by k-means. The color scheme and labels match the table.

One of the challenges of the k-means algorithm is that the cluster sizes change if the initial state is not the same. Here, about 10–15 points differ between runs, which is about 0.2% of the total number of points in the logs.

Interpretation

Two depth ranges, 7000 to 8500 ft and 8500 to 10250 ft, are selected to check whether the clustering of the data matches the lithologies interpreted from the well-log data. The table summarizes the cluster labels correlated with well-log characteristics and lithologies.

The major end-member lithologies found were sands and shales, with various gradational lithologies such as sandy shales and shaly sands. Clean sands with good porosity are what we categorize as good reservoir (Clusters 1 and 6) in this study area. Such sands containing hydrocarbons (oil or gas) are generally very few in a vertical column, which correlates well with the clustering result: Cluster 6 contains very few points. Shales are present as very thick columns, and correspondingly the shale cluster (Cluster 3) contains the largest number of points.
It is interesting to see how the clustering method can distinguish between gradational lithologies such as shaly sands (Cluster 5) and sandy shales (Cluster 0). This approach appears to be very useful in distinguishing potential reservoirs from non-reservoirs.

Figure: Well log data from 7000–8500 ft. Clusters are interpreted from k-means unsupervised learning.

Figure: Well log data from 8500–10250 ft. Clusters are interpreted from k-means unsupervised learning.

Was integration of ML algorithm to lithologies analysis useful?

In summary, this study suggests that ML can open up a vast arena for interpreting large well-log datasets. Geophysical data do fall within the regime of machine learning models. Unsupervised learning can be applied to extract useful information directly from the data. The unsupervised learning method did classify the dataset into useful clusters. When matched with depth, these clusters corresponded to meaningful lithologies and provided useful rock characterization.

In future studies, geological interpretations can serve as labels to train a classifier and make predictions on similar datasets. If an adequate dataset were available, more rigorous benchmarking on geophysical problems would be possible*.

I welcome feedback and constructive criticism. I can be reached through LinkedIn. The code for this study can be found here.

_______________________________________________________________

*Bougher, B. B. (2016). Machine learning applications to geophysical data analysis (Doctoral dissertation, University of British Columbia).