Truecasing in natural language processing

word: I upos: PRON xpos: PRP
word: think upos: VERB xpos: VBP
word: that upos: SCONJ xpos: IN
word: john upos: PROPN xpos: NNP
word: stone upos: PROPN xpos: NNP
word: is upos: AUX xpos: VBZ
word: a upos: DET xpos: DT
word: nice upos: ADJ xpos: JJ
word: guy upos: NOUN xpos: NN
word: . upos: PUNCT xpos: .
word: there upos: PRON xpos: EX
word: is upos: VERB xpos: VBZ
word: a upos: DET xpos: DT
word: stone upos: NOUN xpos: NN
word: on upos: ADP xpos: IN
word: the upos: DET xpos: DT
word: grass upos: NOUN xpos: NN
word: . upos: PUNCT xpos: .
word: i upos: PRON xpos: PRP
word: 'm upos: AUX xpos: VBP
word: fat upos: ADJ xpos: JJ
word: . upos: PUNCT xpos: .
word: are upos: AUX xpos: VBP
word: you upos: PRON xpos: PRP
word: welcome upos: ADJ xpos: JJ
word: and upos: CCONJ xpos: CC
word: smart upos: ADJ xpos: JJ
word: in upos: ADP xpos: IN
word: london upos: PROPN xpos: NNP
word: ? upos: PUNCT xpos: .
word: is upos: AUX xpos: VBZ
word: this upos: DET xpos: DT
word: martin upos: PROPN xpos: NNP
word: 's upos: PART xpos: POS
word: dog upos: NOUN xpos: NN
word: ! upos: PUNCT xpos: .

The POS tags obtained with StanfordNLP look great. The first instance of the word stone is now correctly recognized as part of a person's name, which allows correct capitalization, as shown below.



# Capitalize every token tagged as a proper noun ("NNS" is an xpos tag, so only "PROPN" can match upos)
[w.text.capitalize() if w.upos in ["PROPN","NNS"] else w.text for sent in doc.sentences for w in sent.words]

['I', 'think', 'that', 'John', 'Stone', 'is', 'a', 'nice', 'guy', '.',
 'there', 'is', 'a', 'stone', 'on', 'the', 'grass', '.',
 'i', "'m", 'fat', '.',
 'are', 'you', 'welcome', 'and', 'smart', 'in', 'London', '?',
 'is', 'this', 'Martin', "'s", 'dog', '?']

Stanford CoreNLP also provides a set of powerful tools.

It can detect the base forms of words (lemma), parts of speech, names of companies, people, etc.

It can also normalize dates, times and numeric quantities.

It is also used to mark up phrases and syntactic dependencies, to indicate sentiment, and to get the quotes people said.

StanfordNLP takes only a few lines of code to start using CoreNLP's sophisticated API.

For those who want to go deeper, check the post linked here.
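The proper-noun rule above leaves sentence-initial words ('there', 'are', 'is') and the pronoun 'i' in lowercase. As a minimal sketch of how the same POS tags could cover those cases too, the helper below works on plain (text, upos) pairs, so it runs without any model. The name truecase_tokens, the SENTENCE_END set, and the rules themselves are illustrative assumptions, not part of StanfordNLP.

```python
from typing import Iterable, List, Tuple

# Assumption: these tokens mark the end of a sentence.
SENTENCE_END = {".", "?", "!"}


def truecase_tokens(tagged: Iterable[Tuple[str, str]]) -> List[str]:
    """Capitalize proper nouns, the pronoun "i", and sentence-initial words.

    `tagged` is a sequence of (text, upos) pairs, e.g. built from a parsed
    document with ((w.text, w.upos) for s in doc.sentences for w in s.words).
    """
    result = []
    start_of_sentence = True
    for text, upos in tagged:
        is_first_word = start_of_sentence and upos != "PUNCT"
        if upos == "PROPN" or text == "i" or is_first_word:
            # Upper-case only the first character; unlike str.capitalize(),
            # this leaves the rest of the token untouched.
            result.append(text[:1].upper() + text[1:])
        else:
            result.append(text)
        if upos == "PUNCT":
            # Sentence-final punctuation opens a new sentence; other
            # punctuation (commas, quotes) leaves the state unchanged.
            if text in SENTENCE_END:
                start_of_sentence = True
        else:
            start_of_sentence = False
    return result


example = [
    ("i", "PRON"), ("'m", "AUX"), ("fat", "ADJ"), (".", "PUNCT"),
    ("is", "AUX"), ("this", "DET"), ("martin", "PROPN"),
    ("'s", "PART"), ("dog", "NOUN"), ("?", "PUNCT"),
]
print(truecase_tokens(example))
```

The rules are deliberately simple: abbreviations ending in a period, for instance, would wrongly trigger a new sentence, so a real pipeline would rely on the sentence boundaries that the tokenizer already provides.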

Conclusion

In this post, we investigated case restoration for text without case information.

All the techniques used operate at the word level, using the NLTK, spaCy and StanfordNLP toolkits.

An approach using character-level recurrent neural networks (RNN) is proposed in the article linked here, for some heroes among us.
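The linked article is not reproduced here, so the following is not its method. It only sketches one common way to frame case restoration at the character level: each character of the lowercased text gets a binary label saying whether it was uppercase in the original, and a sequence model such as an RNN would then be trained to predict those labels. The function names are hypothetical, and the sketch assumes text whose length does not change when lowercased (true for plain English).

```python
from typing import List, Tuple


def char_case_labels(text: str) -> Tuple[str, List[int]]:
    """Return the lowercased text and one label per character (1 = was uppercase)."""
    labels = [1 if ch.isupper() else 0 for ch in text]
    return text.lower(), labels


def apply_case_labels(lowered: str, labels: List[int]) -> str:
    """Rebuild cased text from lowercased characters and predicted labels."""
    return "".join(ch.upper() if lab else ch for ch, lab in zip(lowered, labels))


lowered, labels = char_case_labels("John Stone")
print(lowered, labels)
print(apply_case_labels(lowered, labels))
```

Here the labels come straight from cased text, which is how training pairs could be produced; at prediction time they would come from the model instead.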

