Leveraging the Present to Anticipate the Future in Videos

Predict future action labels instead of predicting pixel-level information: a summarization of a research paper by Facebook AI.

Sharmistha Chatterjee, Jun 23

Motivation

Anticipating actions before they are executed serves a wide range of practical applications, including autonomous driving and robotics.

Prior work done in this field requires partial observation of executed actions.

In contrast, this blog concentrates around anticipating actions seconds before they start.

The proposed approach discussed here is the fusion of a purely anticipatory model with a complementary model constrained to reason about the present.

The complementary model predicts present action and scene attributes, and reasons about how they evolve over time.

Figure 1: Action anticipation.

Examples of action anticipation in which the goal is to anticipate future actions in videos seconds before they are performed.

Automatic video understanding has improved significantly over the last few years, involving action recognition, temporal action localization, video search, video summarization and video categorization.

The goal of action recognition is to recognize what action is being performed in a given video. Action anticipation, in contrast, must predict an action before it begins.

For instance, an autonomous car should be able to recognize the intent of a pedestrian to cross the road well before the action is actually initiated, in order to avoid an accident.

Other examples of future prediction involve predicting the entire sequence of locations of a human activity using recurrent neural networks (LSTMs).

This is most frequent in the domain of sports analytics, such as basketball, water polo, tennis and soccer, where the aim is to anticipate the intention and future trajectories of the ball and individual players, for example by combining an on-wrist motion accelerometer with a camera.

In practical applications where we seek to act before an action gets executed, being able to anticipate the future given the present is critical.

Anticipating long-term future actions is a challenging task because the future is not deterministic: several outcomes are possible given the current observation.

Limitations of Previous Work

- Requires a restricted setup with a strong “action grammar” specific to the context; for example, videos of cooking need predefined recipes.
- Can only be applied to videos with annotated sequences of actions, whereas the work summarized here can be applied to any type of video dataset.

This blog is structured as follows:

- Anticipating future actions in videos, as illustrated in the figure above.
- A new framework for the task of anticipating human actions several seconds before they are performed, even when no partial observation of the action is available.

The model is decomposed into two complementary models.

The first, named the predictive model, anticipates action directly from the visual inputs.

The second one, the transitional model, is first constrained to predict what is happening in the observed time interval and then leverages this prediction to anticipate the future actions.

Modeling Process

Among future pixel, motion, and semantic mask prediction, future frame prediction has recently attracted many research efforts.

It offers different mechanisms to generate future video by:

- Predicting future frames of a video using a multi-scale network architecture.
- Using an adversarial training approach to minimize an image gradient difference loss.
- Generating future video frames using a transformation of pixels from the past.
- Using probabilistic modeling to generate future frames from a single image.

Action Anticipation Model

The action anticipation model works on the principle of anticipating an action T seconds before it starts.

If V denotes a video, then Va:b denotes the segment of V starting at time a and ending at time b, and Yc denotes the label of the action that starts at time c. The objective is to find a function f such that f(V0:t) predicts Yt+T.

The model f is decomposed as a weighted average of two functions, a predictive model fpred and a transitional model ftrans:

f(V0:t) = α · fpred(V0:t) + (1 − α) · ftrans(V0:t),

where α is a dataset-dependent hyper-parameter chosen by validation.
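The weighted fusion of the two models can be sketched as follows (the class count, scores and α value are toy assumptions for illustration, not values from the paper):

```python
import numpy as np

def fuse_predictions(pred_scores, trans_scores, alpha):
    """Weighted average of the predictive and transitional model outputs.

    pred_scores, trans_scores: per-class probability arrays of shape [K].
    alpha: dataset-dependent weight chosen by validation.
    """
    return alpha * np.asarray(pred_scores) + (1 - alpha) * np.asarray(trans_scores)

# Toy example with K = 3 action classes.
f_pred = np.array([0.6, 0.3, 0.1])   # output of the predictive model f_pred
f_trans = np.array([0.2, 0.5, 0.3])  # output of the transitional model f_trans
fused = fuse_predictions(f_pred, f_trans, alpha=0.7)
predicted_action = int(np.argmax(fused))  # anticipated future action class
```

Because both inputs are probability distributions and the weights sum to one, the fused scores again sum to one.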

The first function fpred is trained to predict the future action directly from the observed segment.

On the other hand, ftrans is first constrained to compute high-level properties of the observed segment (e.g., attributes or the action performed in the present).

Then, in a second stage, ftrans uses this information to anticipate the future action.

In the next subsections, the prediction strategies to learn fpred and ftrans are explained.

Model combination of two complementary modules: the predictive model and the transitional model.

The above figure illustrates the task of predicting an action T seconds before it starts to be performed.

While the predictive model directly anticipates the future action, the transitional model is first constrained to output what is currently happening.

Then, it uses this information to anticipate future actions.

Transitional Models

The figure above illustrates the two transitional models.

Upper: the Action Recognition (AR) based transitional model learns to predict future actions from the predictions of an action recognition classifier applied to current/present frames (clips).

Lower: the Visual Attributes (VA) based transitional model learns to predict future actions from visual attributes of the current/present frames (clips).

Predictive model (fpred)

The goal of the predictive model fpred is to directly anticipate the future action from the visual input.

As opposed to ftrans, fpred is not subject to any specific constraint.

For a training video V with action labels Yt0+T, …, Ytn+T, the objective for each label Yti+T is to minimize the loss

l(fpred(Vs(ti):ti), Yti+T), where s(ti) = max(0, ti − tpred),

in which l is the cross-entropy loss and tpred is a dataset-dependent hyper-parameter, also chosen by validation, that represents the maximum temporal interval of a video fpred has access to.

This hyper-parameter is essential because looking too much in the past may add irrelevant information that degrades prediction performance.

This loss is then summed up over all videos from the training dataset.

fpred is a linear model which takes as input a video descriptor.
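A minimal sketch of the clipped observation window s(ti) and the cross-entropy objective for a linear fpred (the descriptor, weight matrix and time values are illustrative assumptions):

```python
import numpy as np

def window_start(t_i, t_pred):
    """Start of the observed window: look back at most t_pred seconds."""
    return max(0.0, t_i - t_pred)

def cross_entropy(logits, label):
    """Cross-entropy loss l between linear-model logits and the true label."""
    z = logits - logits.max()               # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

# Hypothetical numbers: an action starts at t_i + T with t_i = 10 s,
# and f_pred may look at most t_pred = 4 s into the past.
s_ti = window_start(10.0, 4.0)              # observe the segment V_{6:10}
W = np.zeros((3, 5))                        # linear model over a 5-d video descriptor
descriptor = np.ones(5)                     # stand-in for the pooled clip features
loss = cross_entropy(W @ descriptor, label=1)
```

With all-zero weights the logits are uniform, so the loss equals log K for K = 3 classes.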

Transitional model (ftrans)

The transitional model ftrans splits the prediction into two stages: gs and gt.

The first stage gs aims at recognizing a current state s, describing the observed video segment.

The state s can represent an action or a latent action-attribute.

The second stage gt takes as input the current state s, and anticipates the next action given the current state s.

gs is a complex function extracting high-level information from the observed video segment, while gt is a simple linear function operating on the state s and modeling the correlation between the present state and the future action.

There are two different approaches for the transitional model: one based on Action Recognition (AR) and one based on Visual Attributes (VA), as illustrated in Figure “Transitional Models”.

Transitional Model based on Visual Attributes

This model leverages visual attributes to anticipate the future.

Visual attributes have been previously used for action recognition.

The idea is to first predefine a set of visual attributes describing the presence or absence of objects, scenes or atomic actions in a video.

The Facebook AI Research team trains a model to recognize these visual attributes, which then constitute the state used by the transitional model.

The current state s ∈ [0, 1]^a predicted by gs is then a vector of visual-attribute probabilities, where a is the number of visual attributes.

Given the presently observed visual attributes s, gt predicts the future action, with gt modeled as a low-rank linear model. These parameters are learned, in the same manner as the predictive model, by minimizing the cross-entropy loss between the predicted action given by gt(s) and the future action ground truth.

Implementing gt through a low-rank model reduces the number of parameters to estimate and leads to better accuracy.

The lower part of Figure “Transitional Models” illustrates this case.
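As a toy sketch of such a low-rank gt (the factorization W = U·V, the sizes, and the softmax output are illustrative assumptions, not the paper's exact parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)
a, K, r = 100, 20, 5          # attributes, action classes, rank (r << min(a, K))

# A full-rank linear map would need a*K parameters; the low-rank
# factorization W = U @ V needs only K*r + r*a.
U = rng.normal(size=(K, r))
V = rng.normal(size=(r, a))

def g_t(s):
    """Anticipate the future action from attribute probabilities s."""
    logits = U @ (V @ s)
    z = np.exp(logits - logits.max())
    return z / z.sum()        # softmax over the K future action classes

s = rng.random(a)             # current state: visual-attribute probabilities
future_probs = g_t(s)
```

Here the parameter count drops from 2000 (full rank) to 600, which is the source of the regularization effect the text describes.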

Transitional Model based on Action Recognition

Real-world videos often consist of a sequence of elementary actions performed by a person in order to reach a final goal such as Preparing coffee, Changing car tire or Assembling a chair.

Many datasets come with a training set where each video has been annotated with action labels and segment boundaries for all occurring actions.

When this is available, action labels can be used instead of predefined visual attributes for the state s.

The anticipation of the next action significantly depends on the present action being performed.

With a Markov assumption on the sequence of performed actions, an ordered sequence of action annotations can be represented as

(V0, a0), (V1, a1), …, (VN, aN),

where an defines the action class performed in video segment Vn.

The model P(an+1 = i | Vn), for all n ∈ {0, …, N−1} and i ∈ {1, …, K}, is decomposed in terms of two factors:

- An action recognition model gs(Vn) that predicts P(an = j | Vn), i.e., the action being performed in the present;
- A transition matrix T that captures the statistical correlation between the present and the future action, i.e., such that Tj,i = P(an+1 = i | an = j).

gt takes as input the probability scores of each action given by gs to anticipate the next action in a probabilistic manner:

P(an+1 = i | Vn) = Σj Tj,i · P(an = j | Vn).

The transition matrix T can be computed by estimating the conditional probabilities between present and future actions from the sequences of action annotations in the training set.
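A toy sketch of estimating T from annotated training sequences and applying gt (the sequences, class count and recognizer scores are illustrative assumptions):

```python
import numpy as np

def estimate_transition_matrix(sequences, num_classes):
    """Estimate T[j, i] = P(a_{n+1} = i | a_n = j) by counting consecutive
    action pairs in the annotated training sequences."""
    counts = np.zeros((num_classes, num_classes))
    for seq in sequences:
        for j, i in zip(seq[:-1], seq[1:]):
            counts[j, i] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows never observed fall back to a uniform distribution.
    return np.divide(counts, row_sums,
                     out=np.full_like(counts, 1.0 / num_classes),
                     where=row_sums > 0)

# Toy annotations: three videos, actions encoded as class indices 0..2.
train_sequences = [[0, 1, 2], [0, 1, 1], [1, 2, 0]]
T = estimate_transition_matrix(train_sequences, num_classes=3)

# g_t: anticipate the next action from the recognizer's scores g_s(V_n).
g_s_probs = np.array([0.7, 0.2, 0.1])   # P(a_n = j | V_n)
next_action_probs = g_s_probs @ T       # P(a_{n+1} = i | V_n)
```

Each row of T is a conditional distribution over next actions, so the anticipated scores again form a valid distribution.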

Prediction Explainability

The transitional model ftrans provides interpretable predictions that can be easily analyzed: the function gt of the transitional model takes the form of a simple linear model applied to the state s, both when using visual attributes and when using action predictions.

The linear weights of gt can be interpreted as conveying the importance of each element in s for the anticipation of the action class.

When given an action class k to anticipate, the linear weights of gt can be analyzed to understand which visual attributes or action class are most responsible for the prediction of action class k.

It also provides an easy way to diagnose the source of mispredictions.

For example, suppose the transitional model wrongly anticipates an action k; the reason behind such a misprediction can then be traced either to a recognition problem (i.e., a wrong detection score for the visual attribute or action class) or to the learned transition weights.
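To illustrate how the linear weights of gt could be inspected in this way (a hypothetical helper; the weight matrix, state and sizes are assumptions for illustration):

```python
import numpy as np

def top_contributors(weights, state, k_class, topn=3):
    """Rank which elements of state s contribute most to anticipating class k.

    weights: linear weights of g_t, shape [K, a]; the contribution of
    attribute j to class k is weights[k, j] * state[j].
    """
    contributions = weights[k_class] * state
    order = np.argsort(contributions)[::-1]   # largest contribution first
    return order[:topn], contributions[order[:topn]]

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 6))   # toy sizes: 4 action classes, 6 attributes
s = rng.random(6)             # current attribute probabilities
idx, contrib = top_contributors(W, s, k_class=2)
```

A low contribution from the expected attribute points to a recognition error in s, while a misleading weight points to the learned transitions.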

Conclusion

Facebook AI Research proposes a new model for future action anticipation.

The main motivating idea behind the method is to model action anticipation as a fusion of two complementary modules.

The predictive approach is a purely anticipatory model, which aims at directly predicting future action given the present.

The transitional model is first constrained to recognize what is currently happening in the present and then uses this output to anticipate future actions.

For the transitional model ftrans, one can analyze which visual attributes are responsible for the anticipation of each action class.

The Action Recognition (AR) transitional model performs better than the Visual Attributes (VA) transitional model.

However, both are outperformed by the purely-predictive model (fpred).

Combining the predictive model with either of the two transitional models (Predictive + Transitional (VA) / Predictive + Transitional (AR)) yields further accuracy gains.

Video Representation

To perform action prediction, the observed video segment V must first be represented as a feature vector.

The overall strategy is to split the video into clips, extract clips representation and perform pooling over these clips.

The video is uniformly split into small clips V = [V1, …, VN] of 8 or 16 frames each, which are fed into a pre-trained video CNN C.

From the penultimate layer of the CNN, an L2-normalized one-dimensional representation C(Vi) is extracted for each clip Vi.

Then a temporal aggregation, Aggregate([C(V1), …, C(VN)]), of the extracted features is performed in order to get a one-dimensional video representation for V.
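This pipeline step could be sketched as follows (mean pooling is only one possible choice of Aggregate, and the feature dimension and clip count are assumptions, not the paper's exact setup):

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """L2-normalize one clip feature vector."""
    return x / (np.linalg.norm(x) + eps)

def video_representation(clip_features):
    """Mean-pool L2-normalized clip features into one video descriptor."""
    normalized = [l2_normalize(f) for f in clip_features]
    return np.mean(normalized, axis=0)

# Stand-in for C(V_i): penultimate-layer features of N = 4 clips, 512-d each.
rng = np.random.default_rng(0)
clips = [rng.normal(size=512) for _ in range(4)]
video_descriptor = video_representation(clips)
```

Because each clip feature is normalized to unit length before pooling, no single clip can dominate the aggregated representation.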









