Simply put, yes; we will look into the benefits of approximating the route first later on.

Direction — this is more complicated. Fortunately for this example we do not need a high level of directional accuracy; put simply, we need to know whether the user is travelling north or south.

A moving average should deal with any natural variations in the data set and provide a sufficiently accurate change in the Long/Lat, from which the resulting vector (in red below) can be taken; this vector gives the direction.
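As a minimal sketch of the idea (illustrative only — the window size and coordinates are made up, and no real train or phone data is used), we can smooth noisy GPS fixes with a simple moving average and read the direction off the smoothed latitude change:

```python
# Sketch: smooth noisy position fixes with a moving average, then take
# the vector between the first and last smoothed points as the rough
# direction of travel (the red vector in the figure).

def moving_average(values, window=3):
    """Simple moving average over a list of floats."""
    out = []
    for i in range(len(values) - window + 1):
        out.append(sum(values[i:i + window]) / window)
    return out

def heading(lats, lons, window=3):
    """Return 'north' or 'south' from the smoothed position change."""
    smooth_lats = moving_average(lats, window)
    smooth_lons = moving_average(lons, window)
    dlat = smooth_lats[-1] - smooth_lats[0]
    dlon = smooth_lons[-1] - smooth_lons[0]
    # (dlon, dlat) is the direction vector; north/south only needs dlat.
    return "north" if dlat > 0 else "south"

# Noisy fixes drifting northwards (latitudes increasing overall).
lats = [51.500, 51.502, 51.501, 51.504, 51.506]
lons = [-0.270, -0.269, -0.271, -0.268, -0.267]
print(heading(lats, lons))  # -> north
```

Note that the individual fixes wobble (51.502 then 51.501), but the smoothed endpoints still give the correct overall direction.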

Strictly speaking this gives an indication of speed as well; however, we are not going to get sufficient data from the train data to make use of speed as a predictor.

Time — simple, this is provided by the phone.

Train location — the train has a highly accurate route but sparse location-with-respect-to-time data (i.e. location with respect to time is only known at stations).

At certain points on the route (i.e. stations) the train's location is known with a high degree of certainty.

A simple way to approximate the location with respect to time of the train is to assume a linear relationship between time and distance.

Assuming travel from South Acton to Acton Central

Level Up

This is a very simple assumption and will not hold true for a significant number of cases; the effects, and possible mitigations, we will examine later.
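The linear time–distance assumption can be sketched as straightforward interpolation between consecutive stations. The station coordinates and timings below are made up for illustration, not real TfL data:

```python
# Sketch: approximate the train's position between two stations by
# assuming the distance covered is linear in time.

def interpolate_position(t, dep, arr):
    """dep/arr are (time_s, lat, lon) tuples for consecutive stations."""
    t0, lat0, lon0 = dep
    t1, lat1, lon1 = arr
    frac = (t - t0) / (t1 - t0)          # fraction of the leg completed
    return (lat0 + frac * (lat1 - lat0),
            lon0 + frac * (lon1 - lon0))

# Hypothetical leg: depart South Acton at t=0s, arrive Acton Central
# at t=120s (coordinates and timings are illustrative assumptions).
south_acton   = (0,   51.4996, -0.2700)
acton_central = (120, 51.5088, -0.2634)
print(interpolate_position(60, south_acton, acton_central))
# -> roughly (51.5042, -0.2667), i.e. halfway along the leg
```

At t=60s, exactly half the journey time, the model places the train at the midpoint of the leg — which is precisely the assumption that breaks down when the train accelerates, brakes or is held at a signal.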

Model

One of the major modelling constraints is speed: users expect very high levels of responsiveness, so the model must be kept simple enough to run quickly (which will also keep processing costs down).

This will become a particular challenge when considering more complicated routes.

Let's quickly review the two data sets and their attributes:

Train
- Highly precise route and station location data
- Sparse but relatively precise time data

User
- Frequent, variable-accuracy location data and high-accuracy time data

The aim of the model is to determine which class a data point belongs to; this can be seen in the figure below.

Also included is another possible mode of transport, which the model will be required to distinguish between.

In this simple example it is easy to see how simply comparing the user's series of data points to the transport modes in a similar location allows for easy identification between different modes/routes.

The data points in the blue box can easily be labelled; the ones in the red box, however, are more challenging. This is where model tuning becomes really important, especially when using smaller datasets.

Level Up — Model Selection

A simple but effective model for this simple case is the k-Nearest Neighbours classification model, where a data point is compared to its nearest data points and labelled with the class of which there are the greater number (more information here).

Level Up — Higher Dimensionality

This is an oversimplification: to be able to get the instance (i.e. the specific train) and not just the mode (the route), the actual model will have at least 5 dimensions (Long, Lat, Change Long, Change Lat, Time) rather than the 2 shown in the plot above. These allow the model to account for the three predictor characteristics we are currently using: location, direction and time.
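A bare-bones sketch of k-Nearest Neighbours on the 5-dimensional feature vector might look like the following. The feature values, train/bus labels and choice of k are all made up for illustration (a production version would also need to scale the features, since raw time differences would otherwise dominate the distance):

```python
# Sketch: majority-vote kNN over (long, lat, dlong, dlat, time) points.
import math
from collections import Counter

def knn_predict(point, labelled, k=3):
    """Label `point` by majority vote of its k nearest labelled points."""
    nearest = sorted(
        labelled,
        key=lambda row: math.dist(point, row[0]),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labelled set: (long, lat, dlong, dlat, time) -> instance.
labelled = [
    ((-0.27, 51.50,  0.001, 0.002, 100), "train_A"),
    ((-0.27, 51.50,  0.001, 0.002, 110), "train_A"),
    ((-0.26, 51.51,  0.001, 0.002, 105), "train_A"),
    ((-0.30, 51.49, -0.002, 0.000, 100), "bus_7"),
    ((-0.31, 51.49, -0.002, 0.000, 108), "bus_7"),
]
user_point = (-0.27, 51.50, 0.001, 0.002, 104)
print(knn_predict(user_point, labelled))  # -> train_A
```

The user's point sits close to the train_A cluster in both position and direction of travel, so two of its three nearest neighbours are train_A points and the vote labels it accordingly.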

The model can be configured as a classifier in either direction: user -> transport instance, or transport instance -> user. In this case the transport instance is the labelled data set, so users should be matched to transport instances.

Below is an example data set that would be the likely input into this simple model.

Example data input into model

Level Up — Model Validation

One of the challenges in improving the model is a lack of labelled user data: the App does not necessarily get any validation that the train it predicted the user was on was in fact the train the user was on; in other words, the data set does not get “labelled”.

The longer the model is run for an individual user, the more confidently it can determine whether its prediction was accurate; for example, if the user rides 10 stops and the user's track matches the train at each of them, it is much more likely that the user is on that train.

This improved probability can be used to give an assumed label.
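One very simple way to turn that growing confidence into an assumed label is to require the same train to win the prediction for several consecutive stops. This is only an illustrative sketch — the 5-stop threshold is an arbitrary choice, not a value from the article:

```python
# Sketch: assign an assumed label once the same train has been
# predicted for `threshold` consecutive stops.

def assumed_label(predictions, threshold=5):
    """Return a train id once it has won `threshold` stops in a row."""
    streak, current = 0, None
    for train in predictions:
        streak = streak + 1 if train == current else 1
        current = train
        if streak >= threshold:
            return current
    return None  # not yet confident enough to label the journey

rides = ["train_A", "train_A", "train_B", "train_A", "train_A",
         "train_A", "train_A", "train_A"]
print(assumed_label(rides))  # -> train_A (5 consecutive matches)
```

A single disagreeing stop (train_B above) resets the streak, so a journey is only labelled once the evidence has been consistent for the full threshold.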

iNterpret

So we've successfully identified which instance of the transport mode the user is on (in this simple case, which train the user is on); how do we use this?

The user is likely to have little interest in being told they are on a train and not a bus (hopefully they are already aware of this), but what we can do is start to solve the problem highlighted earlier: generating a route plan for a user who is already in motion.

By using the data we have gathered, explored and modelled, we are able to indicate that the user is already on the Piccadilly line; instead of “Board the next Circle line train at Hammersmith”, the far more useful instruction “Remain on board the Piccadilly line at Hammersmith” can be given, preventing a recommendation which takes 40% longer than the optimal!

Key opportunities for improvement

Opportunity 1

An overly simple model for predicting the train's location with respect to time between stations.

In reality the relationship is not likely to be linear, as the speed of the train will vary between stations.

One mechanism to overcome this is to improve the model of location with respect to time between stations; this can be done with the assistance of user data (with permission, of course) to provide additional data points.

There is also the potential to get additional data from TfL, i.e. for line sections rather than just stations.

Level Up — Opportunity 2

An improved model: whilst k-Nearest Neighbours is a good model for simple, low-dimensional cases like this, more advanced models will be needed for higher-dimensional cases; and to improve accuracy there is likely to be a need, at least at first, to increase the amount of information and consequently the dimensionality.

Logistic regression is a good next step; however, it relies on being able to linearly separate the label classes. This is not impossible, but as the number of dimensions increases it likely means using Principal Component Analysis (PCA) or another dimensionality-reduction technique to reduce the number of dimensions without losing the information contained within; we did, after all, increase the number of dimensions for good reasons.
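To make the PCA step concrete, here is a small sketch on synthetic data (the correlated "5 features from 2 underlying factors" setup is an assumption for illustration, not our real feature set). When the extra dimensions are largely redundant, almost all of the variance lands in the first couple of components:

```python
# Sketch: PCA via the eigendecomposition of the covariance matrix,
# compressing 5 correlated features down to 2 components.
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))              # 2 underlying factors
X = base @ rng.normal(size=(2, 5))            # spread into 5 features
X += rng.normal(scale=0.01, size=X.shape)     # small measurement noise

Xc = X - X.mean(axis=0)                       # centre the data
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]             # largest variance first
explained = eigvals[order] / eigvals.sum()

# Project onto the top two components: 5-D -> 2-D with minimal loss.
X2 = Xc @ eigvecs[:, order[:2]]
print(explained.round(3))                     # first two entries dominate
print(X2.shape)                               # (200, 2)
```

The reduced 2-D representation could then be fed into logistic regression, keeping the information we added without the cost of the extra dimensions.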

Using decision trees (either in an ensemble or with gradient boosting) has distinct benefits: they can cope with non-linearly separable classes as well as categorical predictor variables, and are suitable for dealing with large input datasets and high dimensionality. However, they can be far more challenging to implement, with a number of hyper-parameters to optimise.
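The "non-linearly separable" advantage is easiest to see on a toy XOR-style layout, which no single linear boundary (and hence no plain logistic regression) can split correctly, but which a two-level tree handles trivially. This hand-rolled tree is purely illustrative:

```python
# Sketch: a tiny hand-written two-level decision tree on the classic
# XOR pattern, which is not linearly separable.

def xor_tree(x, y):
    """Root split on x, then leaf splits on y."""
    if x <= 0.5:
        return 1 if y > 0.5 else 0
    return 0 if y > 0.5 else 1

# The four XOR corners with their true labels.
points = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
print(all(xor_tree(x, y) == label for (x, y), label in points))  # -> True
```

Real tree ensembles learn these splits from data rather than having them written by hand, which is exactly where the hyper-parameter tuning burden comes in.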