Deep into End-to-end Neural Coreference Model

This article walks through the end-to-end neural coreference resolution model and the earlier approaches it builds on.

It includes formulas for readers who want the details, but I have tried to make the theoretical parts of the papers more accessible.

Medium does not support superscripts, subscripts, or LaTeX-like syntax, which causes some inconvenience when reading this article.

Before looking at the model itself, it helps to review several concepts from coreference resolution.


1. Research on Coreference Resolution

Several important models have been built for coreference resolution, such as the mention-pair model and the mention-ranking model.


1.1 Mention-pair Model

The mention-pair model is supervised.

It relies on a coreference dataset in which the coreference relations between noun phrases (NPs) are labeled.

The model trains a binary classifier to predict whether two NPs are coreferent.

However, this model is impractical because of the following issues.

First, the transitivity of the coreference relation cannot be guaranteed: the classifier may decide that A and B are coreferent and that B and C are coreferent, yet that A and C are not.

Second, because most NP pairs are not coreferent, non-coreferent labels far outnumber coreferent ones.

Thus, a labeled dataset may have a skewed class distribution.

To build the model, we need three things: features, a method for creating training instances, and a clustering algorithm.
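To make the role of the clustering algorithm concrete, here is a minimal sketch that turns pairwise decisions into clusters by taking their transitive closure with a union-find structure. Transitive closure is only one simple choice of clustering step, and the function name and the predicted pairs are hypothetical.

```python
def cluster_from_pairs(num_mentions, coreferent_pairs):
    """Merge mentions into clusters via the transitive closure of pairwise decisions."""
    parent = list(range(num_mentions))

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in coreferent_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for mention in range(num_mentions):
        clusters.setdefault(find(mention), []).append(mention)
    return sorted(clusters.values())


# Hypothetical classifier output: (0, 1) and (1, 2) were predicted coreferent,
# while (0, 2) was predicted non-coreferent, which violates transitivity.
predicted_pairs = [(0, 1), (1, 2)]
clusters = cluster_from_pairs(4, predicted_pairs)
print(clusters)
```

The closure puts mentions 0 and 2 in the same cluster even though the classifier rejected that pair, which is exactly the kind of inconsistency the pairwise classifier cannot rule out by itself.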


1.2 Mention-ranking Model

Given an NP to be resolved, the mention-ranking model ranks its candidate antecedents and selects the most probable one.

For each mention, a pairwise coreference score is calculated between the mention and each of its antecedent candidates.

The antecedent with the highest score is chosen for this mention.

The pairwise coreference score is composed of mention scores and antecedent scores.

The mention score indicates how likely an expression is to be a mention.

Similarly, the antecedent score indicates how likely an antecedent candidate is to be the true antecedent of this mention.
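The ranking step itself is small. The sketch below assumes the pairwise scores have already been computed by some model; the scores and the function name are hypothetical and only illustrate picking the highest-scoring candidate.

```python
def rank_antecedents(pairwise_scores):
    """Sort candidate antecedents from the highest to the lowest pairwise score."""
    return sorted(pairwise_scores, key=pairwise_scores.get, reverse=True)


# Hypothetical pairwise scores for the mention "the company".
scores_for_the_company = {
    "General Electric": 3.4,
    "the Postal Service": 0.7,
}
ranking = rank_antecedents(scores_for_the_company)
best_antecedent = ranking[0]
print(best_antecedent)
```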

Machine learning methods learn the ranking criteria from the dataset.

This allows us to train a mention ranker rather than ranking all the candidate antecedents.

The mention-ranking model outperforms the mention-pair model.

However, it cannot exploit cluster-level features.

An improved model, the cluster-ranking model, was proposed by Rahman and Ng (2009).

Rather than ranking only the candidate antecedents, it ranks the preceding clusters.

One drawback of the mention-ranking model is that it only ranks candidate antecedents, so a non-anaphoric NP, which has no true antecedent among the candidates, will still be resolved to one of them by mistake.

This is not what we want.

The model itself cannot determine whether the mention is anaphoric.

Several other approaches have been proposed to identify non-anaphoric NPs.


1.3 State-of-the-art Coreference Resolution Models

Clark and Manning (2016a) propose a neural, entity-based model that produces high-dimensional vector representations for pairs of coreference clusters.

The system produces a high-scoring final coreference partition by using a learning-to-search algorithm to learn how to merge clusters.

Because the observations depend on previous actions, the common i.i.d. assumption does not hold in this case.

The learning-to-search algorithm addresses this by optimizing a policy score.

The final result is an average F1 score of 65.


In Clark and Manning (2016b), the mention-ranking model is optimized with two methods: reinforcement learning and a reward-rescaled max-margin objective.

The same mention-ranking model described in Clark and Manning (2016a) is used.

Instead of a learning-to-search algorithm, reinforcement learning is proposed as the learning algorithm to optimize the model directly for coreference metrics.

Finally, the model using the reward-rescaled max-margin objective outperforms both the REINFORCE algorithm and the previous paper of Clark and Manning, reaching 65.73% on the English task.


2. The End-to-end Neural Coreference Model

The first end-to-end coreference resolution model outperforms previous state-of-the-art models, which rely on manual mention detection, syntactic parsers, and heavy feature engineering.

It considers all spans (i.e., expressions) as potential mentions and finds possible antecedents for each span.

The spans are represented by combining context-dependent boundary representations with a head-finding attention mechanism.

For each span, a span-ranking model decides which of the previous spans is a good antecedent.

A pruning function is trained to eliminate less likely mentions.

The final model is a 5-model ensemble with different parameters.

Compared to the mention-ranking model, the span-ranking model searches a larger space of possible mentions.

Spans are represented by word embeddings.

The representations consider two important parts: the context surrounding the mention span and the internal structure within the span.

The vector representations are built from LSTMs and a 1-dimensional convolutional neural network (CNN) over characters.

Instead of the syntactic parser commonly used in coreference resolution, a head-finding attention mechanism over the words in each span is applied.

In the learning process, the marginal log-likelihood of all correct antecedents in the gold clustering is optimized.

The spans are pruned during the process of optimizing the objective.

The maximum span length and the number of antecedents to be considered are fixed.

The spans are then ranked, and only those with the highest mention scores are kept.

The final result of the ensemble model is 68.8%, which outperforms all previous papers.

Since then, a better result has been reported by Higher-order Coreference Resolution with Coarse-to-fine Inference (Lee et al., 2018).

For practical reasons, we implemented the end-to-end model rather than the newest model.


2.1 Introduction to the Task

End-to-end coreference resolution is formulated over every possible span in the document.

The task is to find the most likely antecedent yi for each span i.

The set of possible antecedents consists of a dummy antecedent ε and all preceding spans.

Two situations lead to the dummy antecedent ε: (1) the span is not an entity mention, or (2) the span is an entity mention but is not coreferent with any previous span.

Let D be a document containing T words along with metadata for features.

The number of possible spans in the document is N = T(T + 1) / 2.

We denote the start and end indices of a span i in D by START(i) and END(i), for 1 ≤ i ≤ N.
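The span count can be checked by brute force. This short sketch enumerates every (START, END) pair for a toy sentence and compares the total with T(T + 1) / 2; the whitespace tokenization and the enumeration order are assumptions of the example.

```python
def enumerate_spans(num_words):
    """List every span as (start, end), using 1-based inclusive word indices."""
    return [
        (start, end)
        for start in range(1, num_words + 1)
        for end in range(start, num_words + 1)
    ]


words = "General Electric said the Postal Service contacted the company".split()
T = len(words)
spans = enumerate_spans(T)
N = T * (T + 1) // 2
print(T, len(spans), N)
```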


2.2 Span Representations

Span representations are the core of the end-to-end neural coreference model.

Powerful span representations capture semantic and syntactic information from both the context surrounding the mention span and the internal structure within the span.

The model can learn the relationships between words from the word similarities that the span representations provide.

First of all, vector embeddings are crucial.

Each word has its own vector embedding.

The vector representations, {x1, …, xT}, are composed of fixed pre-trained word embeddings (300-dimensional GloVe embeddings and 50-dimensional Turian embeddings) and the output of 1-dimensional convolutional neural networks (CNN) over characters.
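As a rough illustration of how one word vector could be assembled, the sketch below concatenates a 300-dimensional vector, a 50-dimensional vector and the output of a toy character CNN. The window width, the number of filters, the max pooling, the zero padding and all numeric values are assumptions made for the example, not details taken from the paper.

```python
def char_cnn(char_vectors, filters, width=3):
    """1-D convolution over character vectors followed by max-over-time pooling."""
    char_dim = len(char_vectors[0])
    # Pad short words with zero vectors so that at least one window exists.
    padding = [[0.0] * char_dim] * max(0, width - len(char_vectors))
    padded = char_vectors + padding
    pooled = []
    for weights in filters:  # each filter holds width * char_dim weights
        activations = []
        for start in range(len(padded) - width + 1):
            window = [value for vec in padded[start:start + width] for value in vec]
            activations.append(sum(w * v for w, v in zip(weights, window)))
        pooled.append(max(activations))
    return pooled


def word_representation(glove_vec, turian_vec, char_features):
    """Concatenate the two pre-trained embeddings with the character features."""
    return glove_vec + turian_vec + char_features


# Toy inputs: the real embeddings would come from pre-trained lookup tables.
glove_vec = [0.1] * 300
turian_vec = [0.2] * 50
char_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # 4 characters
filters = [[1.0] * 6, [-1.0] * 6]  # 2 filters, each of width 3 over 2-dim chars
char_features = char_cnn(char_vectors, filters)
x_t = word_representation(glove_vec, turian_vec, char_features)
print(len(x_t), char_features)
```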


2.3 Bidirectional LSTMs

A one-directional LSTM can only perceive information from the past.

However, the preceding words cannot provide all the information about an expression or what it refers to, which causes ambiguities.

A bidirectional LSTM, by contrast, can obtain information from both the past and the future.

That is a great advantage for coreference resolution, because understanding the relationships between words depends largely on the surrounding context.

A bidirectional LSTM has almost the same components as an LSTM, except that it consists of two LSTMs.

One of them reads the sequence in the forward direction, while the other takes the reversed sequence as its input.

The architecture is shown in the figure.

[Figure: Bi-LSTMs (credit: Colah’s Blog)]

Each layer of the bidirectional LSTM is an independent LSTM, and the output is the concatenation of the two output vectors.

The formulas of the bidirectional LSTM therefore differ from those of the LSTM, because they also depend on the direction.

We denote the direction by a direction indicator δ ∈ {−1, 1}.
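The role of the direction indicator can be sketched without a full LSTM. In the toy code below, a single tanh recurrence stands in for the LSTM cell, and I assume the convention that the state at position t is computed from the state at position t + δ; both choices are simplifications for illustration only.

```python
import math


def toy_cell(x, neighbour_state):
    # Stand-in for an LSTM cell: one tanh recurrence with fixed weights.
    return math.tanh(0.5 * x + 0.5 * neighbour_state)


def run_direction(inputs, delta, step):
    """Compute one state per position, each depending on position t + delta."""
    n = len(inputs)
    # delta = -1: position t depends on t - 1, so sweep left to right.
    # delta = +1: position t depends on t + 1, so sweep right to left.
    order = range(n) if delta == -1 else range(n - 1, -1, -1)
    states = [0.0] * n
    for t in order:
        neighbour = t + delta
        previous = states[neighbour] if 0 <= neighbour < n else 0.0
        states[t] = step(inputs[t], previous)
    return states


def bidirectional(inputs, step):
    """Concatenate the states of both directions at every position."""
    states_plus = run_direction(inputs, 1, step)
    states_minus = run_direction(inputs, -1, step)
    return [[p, m] for p, m in zip(states_plus, states_minus)]


inputs = [1.0, 0.0, -1.0]
outputs = bidirectional(inputs, toy_cell)
print(outputs)
```

Each position ends up with two numbers, one summarizing the words on its left and one summarizing the words on its right, which is the concatenation described above.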

[Figure: formulas of Bi-LSTMs]

2.4 Attention Mechanism

Syntactic heads, which carry the most important syntactic information in a span, are detected by an attention mechanism.

In previous research, syntactic heads were represented as features.

The basic idea of the attention mechanism is to find the most heavily weighted part of a span, namely the most important information in the span.

The input of the attention mechanism is the output of the bidirectional LSTM.

A feed-forward neural network turns the vector representation of each word into a word score αt.

Next, the weight ai,t of each word is computed by an alignment model, which measures how important the word is in the span.

The weighted sum of the word vectors is the final result of the attention mechanism for a span.
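The attention step over one span can be sketched as a softmax over the word scores followed by a weighted sum. In this example the word scores are hypothetical numbers standing in for the output of the feed-forward network, and the word vectors are toy 2-dimensional vectors.

```python
import math


def soft_head(word_vectors, word_scores, start, end):
    """Attention-weighted sum of the word vectors in span [start, end] (0-based, inclusive)."""
    scores = word_scores[start:end + 1]
    vectors = word_vectors[start:end + 1]
    max_score = max(scores)  # subtracted for numerical stability
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # one weight per word in the span
    dim = len(vectors[0])
    head = [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]
    return weights, head


# Toy span "the Postal Service" with hypothetical word scores.
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
word_scores = [0.0, 1.0, 2.0]
weights, head = soft_head(word_vectors, word_scores, 0, 2)
print(weights, head)
```

The word with the highest score ("Service" here) receives the largest weight, so it dominates the resulting vector in the way a syntactic head would.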

[Figure: formulas of Attention Mechanism]

The final span representation is the combination of the boundary representations, the soft head word vector and a feature vector.

[Figure: formula of the final span representation]

2.5 Scoring and Pruning Strategies

Recall that the task is to find the most likely antecedent for each span.

The antecedent candidates are ranked according to pairwise coreference scores s, which are composed of mention scores sm and antecedent scores sa.

The mention score indicates whether a span is a mention.

The antecedent score indicates whether a span is an antecedent of another span.

Both the mention score sm and the antecedent score sa are calculated via standard feed-forward neural networks.

[Figure: formulas of the scores sm and sa]

The pairwise coreference score considers a pair of spans, span i and span j:

[Figure: formula of the coreference score]

The dummy antecedent ε is used in two situations.

The first is that the span is not an entity mention.

The second is that the span is an entity mention but is not coreferent with any previous span.

Once we have the coreference scores, the output layer, a softmax, decides which antecedent is most likely for span i.
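The scoring and the softmax can be sketched together. The code below assumes that the pairwise score is the sum of the two mention scores and the antecedent score, and that the dummy antecedent ε has a fixed score of 0; this is my reading of Lee et al. (2017), and all numeric scores are hypothetical.

```python
import math


def coreference_score(sm_i, sm_j, sa_ij):
    """Pairwise score of span i and a real (non-dummy) antecedent j."""
    return sm_i + sm_j + sa_ij


def antecedent_distribution(sm_i, candidates):
    """Softmax over the dummy antecedent and the candidate antecedents of span i.

    candidates maps an antecedent name to the pair (sm_j, sa_ij).
    """
    scores = {"ε": 0.0}  # assumption: the dummy antecedent always scores 0
    for name, (sm_j, sa_ij) in candidates.items():
        scores[name] = coreference_score(sm_i, sm_j, sa_ij)
    max_score = max(scores.values())  # subtracted for numerical stability
    exps = {name: math.exp(score - max_score) for name, score in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


# Hypothetical scores for the span "the company" (sm_i = 1.0).
candidates = {
    "General Electric": (1.5, 2.0),
    "the Postal Service": (1.2, -3.0),
}
probs = antecedent_distribution(1.0, candidates)
best = max(probs, key=probs.get)
print(best, probs)
```

Because ε competes in the same softmax, a span whose real candidates all score below 0 is resolved to the dummy antecedent, which covers both situations listed above.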

The model does not keep all the spans generated in the first step, during either training or evaluation.

The reason is that the memory complexity of the full model is O(T^4).

The pruning strategy deletes spans that are unlikely to be included in coreference clusters.

Whether a span is pruned depends on its mention score sm.

We only consider spans with a width of at most 10 and calculate their mention scores sm.

Only up to λT spans with the highest mention scores are kept.

For each span, only up to K antecedents are considered.

According to the paper, mention recall stays high, over 92% when λ = 0.4, even with these aggressive pruning strategies.
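The pruning rules can be sketched in a few lines. This example assumes that λT is rounded down and that the K antecedents kept for a span are its nearest preceding surviving spans; the mention scores are hypothetical, and any further constraint the paper may apply is left out.

```python
def prune_spans(spans, mention_scores, num_words, max_width=10, lam=0.4):
    """Keep at most lam * T spans of width <= max_width, ranked by mention score."""
    candidates = [s for s in spans if s[1] - s[0] + 1 <= max_width]
    budget = int(lam * num_words)  # assumption: round lam * T down
    ranked = sorted(candidates, key=lambda s: mention_scores[s], reverse=True)
    return sorted(ranked[:budget])  # restore document order


def candidate_antecedents(kept_spans, index, max_antecedents):
    """Up to K surviving spans that precede the span at position index."""
    return kept_spans[max(0, index - max_antecedents):index]


# Hypothetical mention scores for (start, end) spans in a 12-word document.
mention_scores = {
    (0, 1): 3.0,
    (3, 5): 2.5,
    (7, 8): 2.0,
    (2, 2): -1.0,
    (6, 6): -2.0,
    (0, 11): 9.0,  # width 12, removed by the width limit despite its score
}
kept = prune_spans(list(mention_scores), mention_scores, num_words=12)
antecedents = candidate_antecedents(kept, index=3, max_antecedents=2)
print(kept, antecedents)
```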


2.6 Learning and Optimization Step

During the learning process, we optimize the marginal log-likelihood of all correct antecedents implied by the gold clustering:

[Figure: formula of the marginal log-likelihood for the learning process]

The output layer of the model is a softmax that depends on the pairwise coreference scores.

The model learns a conditional probability distribution P(y1, …, yN | D) whose most likely configuration produces the correct clustering.
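The objective described above can be sketched as follows: for every span, sum the softmax probabilities of its gold antecedents, take the logarithm, and add the results over all spans. I assume here that a span without a gold antecedent has ε as its only correct answer; the scores and gold sets are hypothetical.

```python
import math


def antecedent_probs(scores):
    """Softmax over the pairwise scores of one span (ε included)."""
    max_score = max(scores.values())
    exps = {name: math.exp(score - max_score) for name, score in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def marginal_log_likelihood(all_scores, gold_antecedents):
    """Sum over spans of the log of the total probability of the gold antecedents."""
    total = 0.0
    for span, scores in all_scores.items():
        probs = antecedent_probs(scores)
        total += math.log(sum(probs[name] for name in gold_antecedents[span]))
    return total


# Hypothetical pairwise scores; ε always has score 0.
all_scores = {
    "the company": {"ε": 0.0, "General Electric": 2.0, "the Postal Service": -1.0},
    "said": {"ε": 0.0, "General Electric": -2.0},
}
gold_antecedents = {
    "the company": {"General Electric"},
    "said": {"ε"},  # not a mention, so the dummy antecedent is correct
}
log_likelihood = marginal_log_likelihood(all_scores, gold_antecedents)
print(log_likelihood)
```

Training would maximize this quantity, so the gold antecedents are pushed to score higher than the other candidates.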

Because the process of finding the antecedent of each span is independent of the other spans, we can decompose this distribution into a product of multinomials, one for each span:

[Figure: formula of the distribution of the coreference clustering]

2.7 Architecture of the Model

Now we know how the model works inside.

As for its architecture, the model is composed of two parts.

One part builds the span representations; the other computes the scores.

[Figure: The Architecture of Generating Span Representations]

Take the sentence “General Electric said the Postal Service contacted the company.” as an example.

First of all, every word in this sentence is represented as a vector embedding composed of a word embedding and a CNN character embedding.

In the next step, these vector embeddings serve as the inputs of the bidirectional LSTM, which outputs another vector embedding for each word.

The attention mechanism takes the outputs of the bidirectional LSTM as its inputs at the span level and produces a vector embedding.

Additionally, a feature embedding of a certain dimension also plays a role in the span representation.

[Figure: The Score Architecture]

Finally, we obtain a span representation containing the boundary information of the span, a representation generated by the attention mechanism and a feature embedding.

We only consider spans of up to a certain number of words when calculating the mention scores.

By applying the pruning strategies, we keep up to a certain number of spans.

Then we calculate the antecedent scores and the coreference scores.

The softmax output layer decides which antecedent to choose for each span.

References:

[1] Kevin Clark and Christopher D. Manning. Improving coreference resolution by learning entity-level distributed representations. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016.

[2] Kevin Clark and Christopher D. Manning. Deep reinforcement learning for mention-ranking coreference models. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.

[3] Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. End-to-end neural coreference resolution. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017.
