We can get some clues by looking at the context words of the predicted materials and seeing which of them have high similarity to both the material and the application keyword “thermoelectric”.
Some of the top contributing context words for 3 of our top 5 predictions are shown below.
Figure 5: Context words for 3 of our top 5 predictions that contribute the most to the predictions.
The width of the connecting lines is proportional to the cosine similarity between the words.
Figure borrowed from the original publication.
Effectively, the algorithm captures context words (or, more precisely, combinations of context words) that are important for a material to be a thermoelectric.
As materials scientists, we know, for instance, that chalcogenides (a class of materials) are often good thermoelectrics and that the presence of a band gap is crucial most of the time.
We can see that the algorithm has learned this from the co-occurrences of the words.
The graph above captures only the first-order connections, but higher-order connections could also be contributing to the predictions.
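The first-order connections described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the figure: the three-dimensional vectors are made up, `material_X` is a hypothetical placeholder for a predicted material, and scoring a context word by the smaller of its two similarities is an assumption chosen here to reward words that are close to both the material and the keyword.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_context_words(material, keyword, context_words, vectors, n=3):
    """Rank context words by their similarity to BOTH the material and the keyword.

    The score is min(sim to material, sim to keyword); this choice is an
    assumption made for this sketch, not the method of the original study.
    """
    scored = []
    for word in context_words:
        sim_material = cosine(vectors[word], vectors[material])
        sim_keyword = cosine(vectors[word], vectors[keyword])
        scored.append((word, sim_material, sim_keyword))
    scored.sort(key=lambda item: min(item[1], item[2]), reverse=True)
    return scored[:n]


# Made-up toy embeddings, for illustration only (real embeddings have
# hundreds of dimensions and come from a trained Word2vec model).
vectors = {
    "thermoelectric": [1.0, 0.0, 0.0],
    "material_X": [0.8, 0.6, 0.0],  # hypothetical predicted material
    "chalcogenide": [0.9, 0.4, 0.1],
    "band gap": [0.7, 0.7, 0.0],
    "polymer": [0.0, 0.1, 1.0],
}

ranked = top_context_words(
    "material_X",
    "thermoelectric",
    ["chalcogenide", "band gap", "polymer"],
    vectors,
)
for word, sim_material, sim_keyword in ranked:
    print(f"{word}: material={sim_material:.2f}, keyword={sim_keyword:.2f}")
```

In a plot like Figure 5, the two similarity values per context word would correspond to the widths of the two lines attached to that word.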
For scientific applications, natural language processing (NLP) is almost always used as a tool to extract already known facts from the literature, rather than to make predictions.
This is different from other areas such as stock value predictions, where, for instance, news articles about the company are analysed to predict how the value of its stock will change in the future.
But even then, most of the methods feed the features extracted from the text into other, larger models that use additional features from structured databases.
We hope that the ideas described here will encourage direct, unsupervised, NLP-driven inference methods for scientific discovery.
Word2vec is not the most advanced NLP algorithm, so a natural next step could be to replace it with newer, context-aware embeddings such as BERT and ELMo.
We also hope that since the methods described here require minimal human supervision, researchers from other scientific disciplines will be able to use them to accelerate machine-assisted scientific discoveries.
Notes
† A crucial step in obtaining good predictions was to use output embeddings (output layer of the Word2vec neural network) for materials and word embeddings (hidden layer of the Word2vec neural network) for the application keyword.
This effectively translates to predicting co-occurrences of words in the abstracts.
Therefore, the algorithm is identifying potential “gaps” in the research literature, such as chemical compositions that researchers should study in the future for functional applications.
See the supplementary materials of the original publication for more details.
The code we used for Word2vec training and the pre-trained embeddings are available at https://github.
The default hyperparameters in the code are the ones used in this study.
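The note above about mixing output embeddings and word embeddings can be sketched as follows. This is a minimal illustration under stated assumptions, not the released code: the two dictionaries stand in for the hidden-layer and output-layer weights of a trained Word2vec model, the two-dimensional numbers are made up, and `material_A` and `material_B` are hypothetical names.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rank_materials(keyword, materials, word_emb, output_emb):
    """Rank materials by the cosine similarity between each material's
    OUTPUT embedding and the keyword's WORD (hidden-layer) embedding."""
    keyword_vec = normalize(word_emb[keyword])
    scores = {}
    for material in materials:
        material_vec = normalize(output_emb[material])
        scores[material] = sum(a * b for a, b in zip(material_vec, keyword_vec))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up toy weights, for illustration only.
word_emb = {"thermoelectric": [1.0, 2.0]}  # hidden-layer vector of the keyword
output_emb = {
    "material_A": [2.0, 4.0],   # hypothetical; parallel to the keyword vector
    "material_B": [2.0, -1.0],  # hypothetical; orthogonal to the keyword vector
}

ranking = rank_materials(
    "thermoelectric", ["material_A", "material_B"], word_emb, output_emb
)
for material, score in ranking:
    print(f"{material}: {score:.2f}")
```

In the skip-gram model, the dot product between a target word's hidden-layer vector and a context word's output-layer vector is what feeds the predicted co-occurrence probability, which is why this pairing can be read as a co-occurrence prediction rather than a plain similarity between two word embeddings.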
Disclaimer
The work discussed here was performed while I was a postdoc at Lawrence Berkeley National Laboratory, working alongside an amazing team of researchers: John Dagdelen, Leigh Weston, Alex Dunn, Ziqin Rong, Olga Kononova, Kristin A. Persson, Gerbrand Ceder and Anubhav Jain.
Also, a big thanks to Ani Nersisyan for the suggested improvements to this story.
[…] Corrado & J. Dean, Efficient Estimation of Word Representations in Vector Space (2013), https://arxiv.
[…] Corrado & J. Dean, Distributed Representations of Words and Phrases and their Compositionality (2013), https://arxiv.
[…] Ceder & A. Jain, Unsupervised word embeddings capture latent knowledge from materials science literature (2019), Nature 571, 95–98
L. Maaten & G. Hinton, Visualizing Data using t-SNE (2008), Journal of Machine Learning Research
J. […] Lee & K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding (2018), https://arxiv.
[…] Zettlemoyer, Deep contextualized word representations (2018), https://arxiv.