The input of the Critic network is the user state s, generated by the user state representation module, and the action a generated by the policy network; the output is the Q-value Q(s, a). Based on this Q-value, the parameters of the Actor network are updated in the direction that improves the performance of action a. I am not going to write out all the gradients, primarily because I use PyTorch. The loss function is plain MSE: we are estimating real-valued rewards that are not typically normalized, so this is a regression problem.

The State Module Zoo

DRR-p utilizes the pairwise dependency between items. It computes the pairwise interactions between the n items in the history by using the element-wise product operator. (The user-item interactions are neglected!)

In DRR-u, the user embedding is also incorporated. In addition to the local dependency between items, the pairwise user-item interactions are taken into account.

When we work with a large, long-term batch of news, we don't expect the positions of items to matter. Memorizing the positions, however, may lead to overfitting if the sequence H is a short-term one. Since an average pooling layer is adopted, this structure is called DRR-ave. We can see from Figure 6 that the embeddings of the items in H are first transformed by a weighted average pooling layer. The resulting vector is then leveraged to model the interactions with the input user. Finally, the embedding of the user, the interaction vector, and the average pooling result of the items are concatenated into a single vector that denotes the state representation.

In the next article, I will try to implement this network in PyTorch using Deep Deterministic Policy Gradient, as described in the original paper.
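To make the Critic update concrete, here is a minimal PyTorch sketch. The network shape, the dimensions, and the random stand-in target y are placeholders of my own, not the paper's actual architecture: the point is only that the Critic is a regression network trained with plain MSE.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
state_dim, action_dim = 8, 4  # made-up dimensions for illustration

# Toy Critic: maps a concatenated (state, action) pair to a single Q-value.
critic = nn.Sequential(
    nn.Linear(state_dim + action_dim, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
)
optimizer = torch.optim.Adam(critic.parameters(), lr=1e-2)

s = torch.randn(32, state_dim)   # batch of user states
a = torch.randn(32, action_dim)  # batch of actions from the Actor
y = torch.randn(32, 1)           # random stand-in for the regression target

q = critic(torch.cat([s, a], dim=-1))
loss = nn.functional.mse_loss(q, y)  # plain MSE, as in the text

optimizer.zero_grad()
loss.backward()   # PyTorch computes all the gradients for us
optimizer.step()
```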
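The DRR-p and DRR-u states can be sketched in a few lines of NumPy. The dimensions and embeddings below are random placeholders, and the exact concatenation layout is my own assumption; the essential operation is the element-wise product between embedding pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 4                          # n items in the history H, embedding size d
items = rng.standard_normal((n, d))  # item embeddings
user = rng.standard_normal(d)        # user embedding (used by DRR-u only)

# Pairwise element-wise products between items: e_i * e_j for i < j.
item_pairs = np.stack([items[i] * items[j]
                       for i in range(n) for j in range(i + 1, n)])

# DRR-p: item embeddings plus their pairwise interactions; no user embedding.
state_p = np.concatenate([items.ravel(), item_pairs.ravel()])

# DRR-u: additionally the element-wise user-item interactions and the user itself.
user_items = user * items            # broadcasts the user over all n items
state_u = np.concatenate([user, user_items.ravel(), item_pairs.ravel()])
```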
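DRR-ave follows the same pattern, with a pooling step in front. In the paper the pooling weights are learned; here uniform weights stand in, purely for illustration, and all embeddings are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 4
items = rng.standard_normal((n, d))
user = rng.standard_normal(d)

# Weighted average pooling over the item embeddings. The weights would be
# learned parameters; uniform weights are used here for illustration only.
weights = np.full(n, 1.0 / n)
ave = weights @ items                # pooled item vector, shape (d,)

interaction = user * ave             # element-wise user-item interaction

# Concatenate user embedding, interaction vector, and pooled item vector.
state = np.concatenate([user, interaction, ave])
```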