We will end up adding many additional features, most of which will be useless, and we are still likely to miss what really matters.
What if, instead of hand-crafting aggregate features, we could have the algorithm learn them?
While this approach was studied in the past (see, for example, Cascade-Correlation Approach), when there is no correlation between transactions, we can use a simple neural network that will learn the aggregate functions using some standard TensorFlow functions.
The neural network’s architecture in our case is very simple.
We have one fully connected layer, followed by the segment_sum function, which performs the group-by operation and sums the outputs of the first layer within each group.
The second fully connected layer is connected to the output, which in our case is an aggregate function we are trying to learn.
For simplicity, in this POC we used linear units, but for more complex situations we have to introduce nonlinearity.
Sample neural network architecture
The number of units in the first layer gives us the number of aggregate functions it can simultaneously learn.
Let’s consider two examples: learning count and sum.
If we set all the weights in the first layer to zero and all the biases to 1, then (assuming linear units) the aggregate sum of the outputs will give us the count of transactions for every user.
If, however, we set the bias and all the weights to zero, but set the weight of purchase_amount to 1, we will get the total purchase amount for every user.
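The two hand-set weight configurations above can be checked numerically. The following is a minimal sketch, not the article's TensorFlow code: it uses NumPy, with `np.add.at` standing in for segment_sum, and the transaction data, the `forward` helper and the identity second layer are all assumptions made for illustration.

```python
import numpy as np

def segment_sum(data, segment_ids, num_segments):
    """NumPy stand-in for segment_sum: add up the rows that share a segment id."""
    out = np.zeros((num_segments, data.shape[1]))
    np.add.at(out, segment_ids, data)  # unbuffered scatter-add, handles repeated ids
    return out

def forward(x, user_ids, num_users, w1, b1, w2, b2):
    hidden = x @ w1 + b1                                # layer 1: linear units
    pooled = segment_sum(hidden, user_ids, num_users)   # group by user and sum
    return pooled @ w2 + b2                             # layer 2: linear read-out

# Hypothetical transactions: columns are [purchase_amount, category].
x = np.array([[10.0, 1.0],
              [20.0, 2.0],
              [ 5.0, 1.0],
              [ 7.5, 3.0],
              [ 2.5, 5.0]])
user_ids = np.array([0, 0, 1, 1, 1])  # one sorted segment id per transaction

# Assumed identity second layer, so the output equals the pooled value.
w2, b2 = np.array([[1.0]]), np.array([0.0])

# Count: all first-layer weights zero, bias one.
count = forward(x, user_ids, 2, np.zeros((2, 1)), np.array([1.0]), w2, b2)

# Sum: bias zero, weight one on purchase_amount only.
total = forward(x, user_ids, 2, np.array([[1.0], [0.0]]), np.array([0.0]), w2, b2)

print(count.ravel())  # [2. 3.]
print(total.ravel())  # [30. 15.]
```

User 0 has two transactions totalling 30, and user 1 has three totalling 15, which is what the two configurations return.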
Let's demonstrate our ideas in TensorFlow.
The function segment_sum works as follows:
(Image taken from the TensorFlow documentation)
It accepts a separate tensor of segment IDs, with one ID labeling each row of the data.
It groups the data by segment ID and sum-reduces over dimension zero.
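To make the grouping behaviour concrete, here is a small NumPy sketch that mimics it. The helper below is an illustrative stand-in written for this example, not the TensorFlow implementation. In current TensorFlow the corresponding call is `tf.math.segment_sum(data, segment_ids)`, which expects the segment IDs to be sorted.

```python
import numpy as np

def segment_sum(data, segment_ids):
    """Sum the rows of `data` along dimension zero within each segment id."""
    data = np.asarray(data)
    segment_ids = np.asarray(segment_ids)
    num_segments = int(segment_ids.max()) + 1
    out = np.zeros((num_segments,) + data.shape[1:], dtype=data.dtype)
    np.add.at(out, segment_ids, data)  # accumulate each row into its segment
    return out

data = np.array([[1, 2, 3, 4],
                 [4, 3, 2, 1],
                 [5, 6, 7, 8]])
segment_ids = np.array([0, 0, 1])  # rows 0 and 1 form segment 0, row 2 forms segment 1

result = segment_sum(data, segment_ids)
print(result)
# [[5 5 5 5]
#  [5 6 7 8]]
```

The output has one row per segment: the first two input rows are added together, and the third row is passed through unchanged.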
Cost after epoch 0: 187.700562
Cost after epoch 100: 0.741461
Cost after epoch 200: 0.234625
Cost after epoch 300: 0.346947
Cost after epoch 400: 0.082935
Cost after epoch 500: 0.197804
Cost after epoch 600: 0.059093
Cost after epoch 700: 0.057192
Cost after epoch 800: 0.036180
Cost after epoch 900: 0.037890
Cost after epoch 1000: 0.048509
Cost after epoch 1100: 0.034636
Cost after epoch 1200: 0.023873
Cost after epoch 1300: 0.052844
Cost after epoch 1400: 0.024490
Cost after epoch 1500: 0.021363
Cost after epoch 1600: 0.018440
Cost after epoch 1700: 0.016469
Cost after epoch 1800: 0.018164
Cost after epoch 1900: 0.016391
Cost after epoch 2000: 0.011880
MSE loss vs. iterations
Here we plotted the cost function after each iteration.
We see that the algorithm learns the count function pretty quickly.
By tuning the hyperparameters of the Adam optimizer we may be able to get even better accuracy.
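As a rough, dependency-light illustration of learning the count function, the sketch below swaps in plain gradient descent on an MSE loss instead of TensorFlow and Adam. It relies on the observation that, with linear units and an identity second layer, the network collapses to segment_sum(x @ w + b), so the features can be summed per user up front. The synthetic data, the step-size rule and all names are assumptions for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 transactions with 2 features, spread over 20 users.
num_users, num_tx = 20, 200
x = rng.normal(size=(num_tx, 2))
user_ids = np.sort(rng.integers(0, num_users, size=num_tx))

def segment_sum(data, ids, n):
    """NumPy stand-in for segment_sum along dimension zero."""
    out = np.zeros((n,) + data.shape[1:])
    np.add.at(out, ids, data)
    return out

# Target aggregate: number of transactions per user.
target = segment_sum(np.ones(num_tx), user_ids, num_users)

# Collapsed linear model: y_hat = segment_sum(x) @ w + count * b.
# The per-user count column is what the first-layer bias gets multiplied by.
design = np.column_stack([segment_sum(x, user_ids, num_users), target])

# Step size 1 / (largest Hessian eigenvalue) keeps plain gradient descent stable.
hessian = (2.0 / num_users) * design.T @ design
lr = 1.0 / np.linalg.eigvalsh(hessian).max()

params = np.zeros(3)  # [w_0, w_1, b]
losses = []
for epoch in range(2001):
    residual = design @ params - target
    losses.append(np.mean(residual ** 2))                    # MSE over users
    params -= lr * (2.0 / num_users) * design.T @ residual   # gradient step

print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.6f}")
print("learned [w_0, w_1, b]:", np.round(params, 3))
```

Because the count target is reproduced exactly by zero weights and a bias of one, the learned parameters should drift toward that solution. The loss curve of this simplified setup is not expected to match the Adam run logged above.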
Cost after epoch 0: 8.718903
Cost after epoch 100: 0.052751
Cost after epoch 200: 0.097307
Cost after epoch 300: 0.206612
Cost after epoch 400: 0.060864
Cost after epoch 500: 0.209325
Cost after epoch 600: 0.458591
Cost after epoch 700: 0.807105
Cost after epoch 800: 0.133156
Cost after epoch 900: 0.026491
Cost after epoch 1000: 3.841630
Cost after epoch 1100: 0.423557
Cost after epoch 1200: 0.209481
Cost after epoch 1300: 0.054792
Cost after epoch 1400: 0.031808
Cost after epoch 1500: 0.053614
Cost after epoch 1600: 0.024091
Cost after epoch 1700: 0.111102
Cost after epoch 1800: 0.026337
Cost after epoch 1900: 0.024871
Cost after epoch 2000: 0.
MSE loss vs. iterations
We see that the cost also goes down, but there are spikes in the cost that can be explained by large gradients as we feed new data to the algorithm.
Perhaps we can tune the hyperparameters to improve the convergence of the learning procedure.
Conclusion and next steps
We have demonstrated a simple neural network that can learn basic aggregate functions.
While our demonstration used linear units, in practice we have to use non-linear units in layer 1 to enable the network to learn more complex aggregate functions.
For example, if we want to learn the total amount for category2 = 5, then the linear units will not work.
But if we use, for example, a sigmoid function, then we can set the bias to -100, set the weight for category2 = 5 to +100, and set the weight for purchase_amount to a small positive value ω.
In the second layer we can set the bias to zero and the weight to 1/ω.
This architecture does not learn the mean function directly.
But it does learn both of its components: sum and count.
If our decision boundary depends on average sales, this is the same as if it depended on the number of transactions and the total amount.
This architecture also will not learn more complex functions such as variance and standard deviation.
This can be important in the financial sector, where you may want to make decisions based on market volatility.
An additional layer before aggregation may be required to make that feasible.
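One way to see why something extra is needed before aggregation: the population variance can be written as E[x^2] - (E[x])^2, so besides the sum and the count we also need the per-user sum of squared amounts, which a purely linear first layer cannot produce. The NumPy sketch below illustrates this on made-up purchase amounts. The data and helper names are assumptions, and the squaring step is hard-coded rather than learned.

```python
import numpy as np

def segment_sum(data, ids, n):
    """NumPy stand-in for segment_sum on a 1-D array."""
    out = np.zeros(n)
    np.add.at(out, ids, data)
    return out

# Hypothetical purchase amounts for two users.
amount = np.array([10.0, 20.0, 5.0, 7.5, 2.5])
user_ids = np.array([0, 0, 1, 1, 1])

count = segment_sum(np.ones_like(amount), user_ids, 2)
total = segment_sum(amount, user_ids, 2)
# The square is a non-linear per-transaction feature computed BEFORE aggregation.
total_sq = segment_sum(amount ** 2, user_ids, 2)

mean = total / count                     # mean from its two components
variance = total_sq / count - mean ** 2  # population variance: E[x^2] - (E[x])^2
std = np.sqrt(variance)

print("mean:", mean)
print("variance:", variance)
print("std:", std)
```

The mean falls out of the sum and the count that the network already learns, while the variance additionally requires the squared feature, which is the role a non-linear layer before segment_sum would have to play.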
Finally, in the example the learning was slow because we had to present the data in the aggregate form.
It is possible to improve speed by pre-aggregating the data and then resampling it.
All code used in this article can be found in my GitHub repo.