How machine learning could help find ancient artifacts, a walk through

In this blog post we will use archaeological data from the British Museum and the Staatliche Museen zu Berlin to train a gradient boosting model.
The first part is the data analysis and data structuring part.
In the second part, we will take a closer look at the gradient boosting algorithm and use it on the data.
Finally, the algorithm will be used to discover potential new archaeological sites.
Matt de Koning, Jun 5

Ancient ruins of Laodikeia, Turkey (Own work)

Examining the data

Before we start programming to retrieve and structure the data, it’s useful to understand the data.
The data consists of 5763 rows.
Each row has a longitude and latitude and represents a place in ancient Greece where an ancient artifact has or has not been found.
The artifacts are mostly coins, amphorae, weapons and small statues.
The following map shows the location of the found artifacts:

The first feature (column) is whether the location is near an ancient Greek temple.
These temples vary from the well-known Oracle of Delphi to the lesser-known Temple of Dionysus in Teos.
Temples attracted pilgrims and commerce in ancient times, so there is a substantial chance that a location close to a temple yields an archaeological find.
The following map shows the locations of the various ancient Greek temples:

The second feature indicates whether the location is near an ancient Greek or Roman battle site.
In ancient times, the Greeks were at war with the Persians, Athens was at war with Sparta, and the Romans waged war against the Macedonians.
After these battles, weapons and armor were lost and forgotten, waiting to be discovered in later times.
The following map shows the location of the most important ancient battles in modern Greece and Turkey:

The third feature indicates whether the location is near an ancient harbor.
This can be a natural harbor, the remains of an ancient jetty or a (bigger) harbor that is mentioned in ancient literature.
As you can see below, there are a lot of ancient harbors:

The fourth feature checks if the location is near an ancient polis or Roman city.
Greek culture is famous for its polis culture, where there was no central Greek government, but rather a collection of independent city-states.
The most famous of the Greek poleis is Athens.
The Romans also founded/upgraded cities in the area, the most famous being Constantinople.
The ancient cities had a radius around them where they used the land for farming, commerce, hunting and mining.
Since these areas around the cities saw human activity, it is likely that people left traces there.
The following map shows the location of the various ancient cities:

The final feature indicates whether the location is near a modern city:

Retrieving the data

Now we know what our data looks like.
The next step is to retrieve the data so we can analyze it.
Open your Python development environment.
I work with PyCharm; the community edition is free to use and can be downloaded here.
After you have opened PyCharm, press ctrl+alt+s and click the plus sign (+).
Install the following libraries: numpy, pandas, scikit-learn (imported as sklearn) and matplotlib.
If you are using a different application to code in Python, use pip install numpy pandas scikit-learn matplotlib to install the libraries.
Next we have to download the data set.
The data set can be found here.
Right click on the web page, and click save as, or press ctrl+s.
Save the file as csv.
Open a new Python file.
First we have to import the libraries we added above.
Pandas is a library used for data retrieval and structuring.
Numpy is a scientific Python library used to do mathematical calculations.
Sklearn is a library that contains various machine learning algorithms.
Matplotlib is a library that is useful to visualize data, by plotting graphs.
Type the following to import the libraries:

    # libraries needed to run algorithms
    import pandas as pd
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_curve, auc
    import matplotlib.pyplot as plt

Good, so now we have all the libraries we need.
Next, create a class named Trainer.
This class will get the data, do analysis, train the gradient boosting algorithm and use the trained algorithm to predict new cases.
Create a method in the Trainer class called createABT as well.
In the first line of the method, create a data frame by reading the csv.
In the second line, shuffle the data rows.
This removes any ordering present in the file, so that the later split into training and test data is not biased by row order.
Finally, return the data frame.
    class Trainer:
        # method to get data and build the Analytical Base Table
        def createABT(self):
            df = pd.read_csv("….csv")  # path of the CSV file you saved above
            df = df.reindex(np.random.permutation(df.index))  # shuffle the rows
            return df

Create an instance of the class called algo and, on the next line, let it call the createABT method.
Now the data is retrieved and put into an analytical base table.
    algo = Trainer()
    df = algo.createABT()

Analyze the data

Let’s inspect the data by using the pandas describe command.
This command shows us the statistical information of each column.

    print(df.describe())

The mean of each column tells something about the distribution of the data, since all the features are binary (0 or 1) features.
The feature ‘Near ancient harbor’ has a mean of 0.94.
This means that 94% of all datapoints are near an ancient harbor.
This is a high number, so the influence of this feature will probably be limited.
The other features are more evenly distributed.
The classification column has a mean of 0.54, which means the positive and negative training examples are roughly evenly distributed.
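To see why the mean of a binary column can be read as a percentage, here is a minimal sketch on made-up data. The column name near_harbor and the ten values are assumptions for illustration only; they are not taken from the real data set.

```python
import pandas as pd

# Hypothetical toy column: 1 = near an ancient harbor, 0 = not near one.
toy = pd.DataFrame({"near_harbor": [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]})

# The mean of a 0/1 column ...
share_of_ones = toy["near_harbor"].mean()

# ... is the same number as the fraction of rows that contain a 1.
count_based = (toy["near_harbor"] == 1).sum() / len(toy)

print(share_of_ones, count_based)
```

Both values are the share of rows equal to 1, which is how a mean of 0.94 translates into "94% of all datapoints".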
The second analysis step is to compare the feature means against the classification column.
In this way we can inspect two means for each feature: one when the classification column is 1, i.e. an object has been found, and one when the classification column is 0.
In our previously created Trainer class, we will create a new method called analyzeABT, which takes the data frame as input.
First we strip the data frame of the unnecessary columns.
Then we create two empty arrays where the positive and negative means are stored for each column.
Then we use two for loops to determine the mean in each column while the data is filtered on positive or negative classification.
After this code, two arrays (pos_mean and neg_mean) are filled with all the means of the features.
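The steps just described can be sketched as follows. This is not the original analyzeABT method: it is a standalone function that uses pandas boolean filtering instead of two for loops, and the function name class_means, the column names and the toy rows are all assumptions for illustration.

```python
import pandas as pd

def class_means(df, label_col="classification"):
    """Return (pos_mean, neg_mean): the mean of every feature column,
    once for rows labelled 1 and once for rows labelled 0."""
    features = df.drop(columns=[label_col])          # keep only the feature columns
    pos_mean = features[df[label_col] == 1].mean()   # rows where an object was found
    neg_mean = features[df[label_col] == 0].mean()   # rows where nothing was found
    return pos_mean, neg_mean

# Hypothetical toy data with two binary features.
toy = pd.DataFrame({
    "near_temple":      [1, 1, 0, 0, 1, 0],
    "near_modern_city": [0, 0, 1, 1, 0, 1],
    "classification":   [1, 1, 1, 0, 0, 0],
})

pos_mean, neg_mean = class_means(toy)
print(pos_mean)
print(neg_mean)
```

Each result is a pandas Series indexed by feature name, so the two can be plotted side by side as a grouped bar chart.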
Using the Matplotlib library, the following code visualizes the result:

Make this method executable by calling it on the algo instance of the Trainer class at the bottom of the Python file.

    algo.analyzeABT(df)

The code will produce the following graph:

The ‘Temple nearby’ feature and the ‘Polis nearby’ feature will probably be the most important features for the algorithm.
This graph shows that proven archaeological sites are on average more often near temples, ancient harbors and ancient poleis, while sites with no artifacts are on average closer to ancient battle sites and modern cities.
So the total code so far is:

How does gradient boosting work?

Before diving into the gradient boosting code, it is useful to get a little understanding of the gradient boosting algorithm.
Gradient boosting is basically about boosting many weak predictive models into a strong one, in the form of an ensemble of weak models.
A weak predictive model can be any model that works just a little better than a random guess.
To build the gradient boosting model, it is important that the weak models are optimally combined.
Weak models are trained in an adaptive way, as is seen in the following steps:

1) Train a weak model by using a data sample from your population.
2) Increase the weights of the samples that are misclassified by the model of step 1 and decrease the weights of the samples that are correctly classified.
3) Train the next weak model using new samples with the updated weight distribution from step 2.

In this way, boosting always trains on the data samples that were difficult to learn in previous rounds, which results in an ensemble of models that are good at learning different parts of the data.
Strictly speaking, this reweighting scheme is how AdaBoost works; gradient boosting reaches the same goal by fitting each new weak model to the errors (the negative gradient of the loss) that the current ensemble still makes.
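To make the idea of "each round concentrates on what earlier rounds got wrong" concrete, here is a minimal sketch of gradient boosting for a regression problem with squared-error loss, where every new decision stump is fit to the residuals of the current ensemble. It is an illustration on synthetic data, not the internals of sklearn's GradientBoostingClassifier (which applies the same loop to the gradient of a classification loss). The function names, the learning rate and the number of rounds are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_stumps(X, y, n_rounds=50, learning_rate=0.1):
    """Fit an ensemble of depth-1 trees, each trained on the current residuals."""
    base = y.mean()                          # round 0: predict the mean everywhere
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction           # negative gradient of 0.5 * (y - F)^2
        stump = DecisionTreeRegressor(max_depth=1)
        stump.fit(X, residuals)              # weak model learns what is still wrong
        prediction += learning_rate * stump.predict(X)
        trees.append(stump)
    return base, trees

def predict_boosted(base, trees, X, learning_rate=0.1):
    """Sum the base value and the scaled contributions of all stumps.
    learning_rate must match the value used during fitting."""
    return base + learning_rate * sum(tree.predict(X) for tree in trees)

# Synthetic data: one feature, a smooth target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0])

base, trees = fit_boosted_stumps(X, y)
mse_start = np.mean((y - base) ** 2)
mse_end = np.mean((y - predict_boosted(base, trees, X)) ** 2)
print(mse_start, mse_end)
```

A single stump is a very weak model, yet the training error of the ensemble keeps shrinking because every stump only has to explain the part the previous ones missed.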
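To measure the quality of our model, we will use the ROC curve and the area under it (AUC), referred to below as the ROC-value.
The ROC curve plots the true positive rate (the share of actual positives that are predicted positive) against the false positive rate (the share of actual negatives that are predicted positive) for every classification threshold.
A ROC-value of 1 is the best possible outcome; 0.5 corresponds to random guessing.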
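As a small worked example of this metric, the sketch below computes the ROC curve and the area under it with the roc_curve and auc functions we imported earlier. The four labels and scores are made-up values, not output from our model.

```python
from sklearn.metrics import roc_curve, auc

# Made-up example: true classes and the scores a model assigned to them.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# roc_curve returns the false positive rate and true positive rate per threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# auc integrates the curve: 1.0 is perfect, 0.5 is a random guess.
roc_auc = auc(fpr, tpr)
print(roc_auc)  # 0.75 for this toy data
```

Three of the four positive/negative pairs are ranked correctly here (only 0.35 versus 0.4 is in the wrong order), which is exactly what the value 0.75 expresses.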
Training our model

The code so far retrieves the data and converts it into a data frame.
Now we will build a gradient boosting classifier to predict whether a potential site is good for archaeological investigation.
First create a new method named trainingModel, which takes the data frame as input.
In the method, create a data frame for the X values (the features) and the Y values (the classification).
Split the X and Y data into train and test data and train a gradient boosting classifier from the sklearn library on the data.
Next, calculate the ROC-value by determining the true positive and false positive rate.
Print the feature importance as well, to see which features (X) are most important for the model to predict the outcome.
Finally, return the false positive rate, the true positive rate and the trained gradient boosting model, in that order.
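A minimal sketch of such a training step is shown below. It is written as a standalone function on synthetic data rather than as the original Trainer method; the function name training_model, the column names, the 30% test size and the rule used to generate the labels are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

def training_model(df, label_col="classification", seed=0):
    """Train a gradient boosting classifier and return (fpr, tpr, model)."""
    X = df.drop(columns=[label_col])     # the features
    y = df[label_col]                    # the classification
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)

    gb = GradientBoostingClassifier(random_state=seed)
    gb.fit(X_train, y_train)

    # Score the held-out rows with the predicted probability of class 1.
    scores = gb.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    print("Area under the ROC curve:", auc(fpr, tpr))
    print("Feature importance:", dict(zip(X.columns, gb.feature_importances_)))
    return fpr, tpr, gb

# Synthetic stand-in for the real data: three hypothetical binary features,
# with a label that is 1 whenever a temple or a polis is nearby.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "near_temple": rng.integers(0, 2, n),
    "near_polis": rng.integers(0, 2, n),
    "near_modern_city": rng.integers(0, 2, n),
})
toy["classification"] = ((toy["near_temple"] + toy["near_polis"]) >= 1).astype(int)

fpr_gb, tpr_gb, gb = training_model(toy)
```

The ROC curve is computed on the test rows only, so the reported value reflects how the model does on data it has not seen during training.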
The following code is an example of how to manage this:

Add the following code at the bottom of your Python file to execute the trainingModel method (fpr = false positive rate, tpr = true positive rate, gb = the trained gradient boosting model):

    fpr_gb, tpr_gb, gb = algo.trainingModel(df)

Results of the trained model

The following code visualizes the result of the trained gradient boosting model.
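A ROC plot of this kind could be sketched as below. This is a standalone function, not the original visualizeResults method; the function name visualize_results, the labels and the example rates are assumptions for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc

def visualize_results(fpr, tpr):
    """Plot a ROC curve with the random-guess diagonal and return (fig, ax)."""
    fig, ax = plt.subplots()
    ax.plot(fpr, tpr, label=f"Gradient boosting (AUC = {auc(fpr, tpr):.2f})")
    ax.plot([0, 1], [0, 1], linestyle="--", label="Random guess")
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")
    ax.set_title("ROC curve")
    ax.legend(loc="lower right")
    return fig, ax

# Made-up rates, only to show the call; use the output of the training step instead.
fig, ax = visualize_results([0.0, 0.0, 0.5, 0.5, 1.0], [0.0, 0.5, 0.5, 1.0, 1.0])
```

Call plt.show() afterwards to open the plot in a window, or fig.savefig with a file name of your choice to write it to disk.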
Add the following code at the bottom of your Python file to execute the visualizeResults method:

    algo.visualizeResults(fpr_gb, tpr_gb)

The code generates the following graph, which shows the ROC curve:

As you can see, the area under the ROC curve is almost 1 (it’s 0.97), which means our model is performing really well! Next, we are going to inspect the individual features.
After 1000 iterations, the mean feature importance looks like this:

As we expected in the analysis step of part 1, the ‘Near ancient harbor’ feature is not important, because 94% of all the data is near an ancient harbor.
It turns out the ‘Near temple’ and ‘Near polis’ features are the most important for the model to determine the classification of the location.
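One way to obtain a mean feature importance is to retrain the model several times on different random train/test splits and average the feature_importances_ attribute of the fitted classifiers. The sketch below does that on synthetic data. Whether this matches how the averages above were computed is an assumption, as are the function name, the column names and the small default of 10 runs (raise n_runs to 1000 to mirror the number mentioned above).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def mean_feature_importance(df, label_col="classification", n_runs=10):
    """Average feature_importances_ over n_runs retrainings on different splits."""
    X = df.drop(columns=[label_col])
    y = df[label_col]
    importances = []
    for seed in range(n_runs):
        X_train, _, y_train, _ = train_test_split(
            X, y, test_size=0.3, random_state=seed)
        gb = GradientBoostingClassifier(random_state=seed).fit(X_train, y_train)
        importances.append(gb.feature_importances_)
    return pd.Series(np.mean(importances, axis=0), index=X.columns)

# Synthetic stand-in: the harbor feature is 1 for about 94% of the rows and is
# unrelated to the label, which depends only on temple and polis.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "near_temple": rng.integers(0, 2, n),
    "near_polis": rng.integers(0, 2, n),
    "near_harbor": (rng.random(n) < 0.94).astype(int),
})
toy["classification"] = ((toy["near_temple"] + toy["near_polis"]) >= 1).astype(int)

mean_imp = mean_feature_importance(toy)
print(mean_imp.sort_values(ascending=False))
```

In this toy setup the harbor feature ends up at the bottom of the ranking because it carries no information about the label, which is the same pattern described above.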
Using our model to identify potential archaeological sites

We have a trained gradient boosting model.
After some analysis, everything looks fine for the model to go into production.
To use the trained model for new predictions, use the sklearn method ‘predict’.
It will need the X features as input, and will predict the ŷ output.
The following code will ask the user for the input features and the model will give the prediction as a response.
If the potential search location is good, it will return ‘this is a good place to investigate’.
Otherwise, the code will return ‘Better look somewhere else!’.
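The prediction step can be sketched as follows. This is not the original testAlgo method: instead of asking the user interactively, the hypothetical advise function takes the answers as a dictionary, and the model is trained on synthetic data. The feature names and the rule behind the toy labels are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical feature names, in the order the model is trained on.
FEATURES = ["near_temple", "near_battle", "near_harbor", "near_polis", "near_modern_city"]

def advise(model, answers):
    """Turn one set of 0/1 answers into the advice text."""
    X_new = pd.DataFrame([answers], columns=FEATURES)   # one row, same column order
    prediction = model.predict(X_new)[0]
    if prediction == 1:
        return "this is a good place to investigate"
    return "Better look somewhere else!"

# Synthetic training data: the label is 1 when a temple or a polis is nearby.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 2, size=(500, len(FEATURES))), columns=FEATURES)
toy["classification"] = ((toy["near_temple"] + toy["near_polis"]) >= 1).astype(int)
model = GradientBoostingClassifier(random_state=0).fit(toy[FEATURES], toy["classification"])

good_spot = advise(model, {"near_temple": 1, "near_battle": 0, "near_harbor": 1,
                           "near_polis": 1, "near_modern_city": 0})
bad_spot = advise(model, {"near_temple": 0, "near_battle": 1, "near_harbor": 1,
                          "near_polis": 0, "near_modern_city": 1})
print(good_spot)
print(bad_spot)
```

To make this interactive, the dictionary could be filled from input() calls, one question per feature, before it is passed to advise.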
For convenience, the following code is the complete final version of the code.
The testAlgo method is the newly added part:

We used the algorithm to look for new potential archaeological sites in Turkey.
We went to the Turkish coast near the Greek island of Rhodes.
This place was near an ancient temple, near an ancient polis and not near a modern city.
The trained algorithm gave positive advice.
We snorkeled in this area looking for amphorae and walked the coast looking for new traces.
Opposite the Greek island of Simi we found the first unknown traces of ancient times.
Ten meters underwater, we found a broken piece of an amphora.
Some miles further, we tested the positive outcome of the algorithm again.
After wandering near the coast, we found the following part of an amphora:

So far the algorithm seems to give a good indication of whether a spot is good to look for ancient relics.
We have left the amphora pieces behind, in accordance with Turkish law.
As amphorae are not rare enough, Turkish museums are not interested.
Obviously the algorithm is not precise enough to detect the exact spots of new archaeological sites.
But it does give a good first indication.
Following this research, we want to combine this algorithm with the use of satellite image recognition.
To be continued...

A big thank you to:
Ancient harbor data from: http://www.[…]com/the-catalogue/greece-islands/
Staatliche Museen zu Berlin: https://www.[…]
[…]org/wiki/List_of_Ancient_Greek_temples
Google Maps API: https://developers.[…]