Human learning happens largely through deductive reasoning, but what we learn is not an arbitrary scattering of categories.
When we teach machines to learn, it’s crucial we employ these ideas, as well.
ImageNet is a standard, benchmark dataset of high-resolution images for deep learning.
The explosively large dataset (with over 14 million images and counting!) has its labels organized into hierarchical categories.
It employs the structure of WordNet, a lexical reference system which organizes English nouns, verbs, and adjectives by semantic relations.
WordNet’s design, inspired by Collins and Quillian’s work, stacks sets of synonyms (“synsets”), each of which represents one concept, into a hierarchy of relations: antonymy, hyponymy (IS-A relations), and meronymy (HAS-A relations).
ImageNet’s labelling uses these IS-A relations among nouns.
It impressively allows for disambiguated labels, where two meanings of one word do not interfere with image labelling.
Images classified as plants, meaning “living organism[s] lacking the power of locomotion,” won’t be confused with those classified as plants, meaning “buildings for carrying on industrial labor.”
The original paper, published at the 2009 IEEE Conference on Computer Vision and Pattern Recognition by Jia Deng et al., boasts higher accuracies when a classifier considers all child labels; for instance, in classifying an image, instead of considering the classification score of just “dog,” it takes the max of the scores of “dog,” “German shepherd,” “English terrier,” and so on.
The rationale behind this is straightforward — a prediction classifying a picture of a tree (correctly) as being a maple should not be dismissed for being too specific.
Hierarchical labelling at such a massive scale proved to be a breakthrough in benchmarking the complexity of models (5).
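The “consider all child labels” trick can be sketched in a few lines: a label’s effective score is the max over its own classifier score and the scores of all of its descendants. The hierarchy and the scores below are toy assumptions for illustration, not ImageNet’s actual data.

```python
# Toy IS-A hierarchy: each label maps to its child labels.
CHILDREN = {
    "animal": ["dog", "cat"],
    "dog": ["German shepherd", "English terrier"],
    "cat": [],
    "German shepherd": [],
    "English terrier": [],
}

def subtree_score(label, scores):
    """Max classifier score over the label and every descendant label."""
    best = scores.get(label, 0.0)
    for child in CHILDREN.get(label, []):
        best = max(best, subtree_score(child, scores))
    return best

# A confident prediction for a specific breed also counts as evidence
# for "dog": the maple-vs-tree logic from the paper.
scores = {"dog": 0.35, "German shepherd": 0.90, "cat": 0.10}
print(subtree_score("dog", scores))  # 0.9
```

The same call works at any level of the hierarchy, so `subtree_score("animal", scores)` also picks up the German shepherd score.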
Hierarchical organization is our best bet for achieving the ability to abstract patterns into general categories.
If we want our models to understand the Form of Uncheerfulness as well as we do, we would like them to understand that anger and anguish both fall under that umbrella (6).
In a world where supervised deep learning will continue to be bogged down by needing massive amounts of examples, hierarchical labelling is a win for abstraction.
Despite ImageNet’s advantages, its labelling still has its shortcomings.
Playing around with some resources ImageNet makes available, I discovered a key problem: ample opportunities for mislabeling (7).
While the labels are organized hierarchically, the images for each synset are gathered via search queries built only from words in the target synset itself or in its parent synset, and the results can still differ wildly from the intended meaning of the word.
The images are annotated and labelled by Amazon Mechanical Turk workers, who are given the definition of the target synset.
They are not provided with the hierarchy of the synset (8).
Human error introduced from crowdsourced labels raises problems for truly disambiguated labelling.
Consider the word “beanbag.” Its synset contains only the one word, with the WordNet gloss “a small cloth bag filled with dried beans; thrown in games.” Its direct ancestors are “bag” and “container.” However, the ImageNet page for “beanbag” consists almost entirely of images of beanbag chairs. There is no “beanbag” synset under the entities “furniture,” “seat,” or “chair.” The result is a misclassification of all images queried under the entire synset for a different, more common meaning of the same word (which happens not to have its own synset in WordNet).
This kind of error would produce erroneous classifications of both seats and containers: given an image of a man lounging in a beanbag chair at a campsite, the model predicts it is seeing a small bag.
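The beanbag problem is easy to see by walking the hypernym (IS-A) chain. The sketch below uses a simplified toy mirror of the WordNet graph, not the real database; synset names follow WordNet’s `word.pos.nn` convention for flavor only.

```python
# Toy WordNet-style graph: each synset maps to its hypernym (IS-A parent).
HYPERNYMS = {
    "beanbag.n.01": "bag.n.01",
    "bag.n.01": "container.n.01",
    "container.n.01": "artifact.n.01",
    "chair.n.01": "seat.n.01",
    "seat.n.01": "furniture.n.01",
    "furniture.n.01": "artifact.n.01",
    "artifact.n.01": None,
}

def hypernym_chain(synset):
    """Walk IS-A links from a synset up to the root."""
    chain = [synset]
    while HYPERNYMS.get(synset) is not None:
        synset = HYPERNYMS[synset]
        chain.append(synset)
    return chain

print(hypernym_chain("beanbag.n.01"))
# The only chain for "beanbag" runs through bag and container, never
# through seat or furniture, so images of beanbag chairs have no
# correct synset to land in.
```

With the real database, NLTK’s WordNet interface (`nltk.corpus.wordnet`) exposes the same walk via `synset.hypernyms()`.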
Another example of a synset where the instinctive human association of the word (and thus the image labelling) mismatches the gloss is artefact/artifact, “a man-made object taken as a whole”; ImageNet’s image collection for “artifact” is almost exclusively remnants of ancient cultures. MTurk workers have little reason to realize they should really be labelling the data as “archaeological remains.”
Still, despite my nitpicking after a day’s fooling around on ImageNet, I’m blown away by the elegance of its design.
A great next step for the image-recognition community is to ensure our machines learn not only the labels of the examples they see but also, by harnessing the structure, their place in a hierarchy.
When training a model to learn a specific category, we can include all images labelled with descendant categories to build more generalized classifiers, explicitly telling the model the hierarchy (9).
In other cases, the hierarchy might be able to be learned (6).
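Pooling descendant images is a one-function sketch, shown below over a toy hierarchy and made-up image filenames (both are assumptions for illustration).

```python
# Toy hierarchy and per-synset image lists (illustrative only).
CHILDREN = {
    "dog": ["German shepherd", "English terrier"],
    "German shepherd": [],
    "English terrier": [],
}
IMAGES = {
    "dog": ["dog_001.jpg"],
    "German shepherd": ["gsd_001.jpg", "gsd_002.jpg"],
    "English terrier": ["terrier_001.jpg"],
}

def training_images(label):
    """All images labelled with `label` or any descendant category."""
    pool = list(IMAGES.get(label, []))
    for child in CHILDREN.get(label, []):
        pool.extend(training_images(child))
    return pool

# A generalized "dog" classifier trains on every breed's images too.
print(sorted(training_images("dog")))
```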
How an AI agent should learn about the world and build models of it may have laughably little to do with how our brains perform the same tasks.
But it’s important we allow ourselves to take inspiration from them.
Abstracting, generalizing, and hierarchizing are crucial to our learning, and they can help our machines’ learning, as well.
If we consider these techniques seriously, we may find a world where better decisions are made — by both ourselves and our machines.
Footnotes

And despite the instinct to believe texts can’t convey sarcasm, we’re actually pretty good at understanding Forms of emotions in colloquial written correspondence.
Textisms have evolved in our language as substitutes for extra-linguistic cues with more use of social media and online communication.
Extra ellipses, intentional misspellings, emoticons, and a period at the end of a sentence can add meaning to a message in the same way body language and visual cues do.
Psycholinguists have poked holes in the hierarchical network model.
It fails to explain, for example, why “A robin is a bird” can be confirmed faster than “A chicken is a bird” when robins and chickens sit at the same depth in the hierarchy.
In some particular cases, it does not seem to hold: “A dog is an animal” is confirmed faster than “A dog is a mammal.”
If you’re thinking this model seems too elegant to be true, you’re right.
It is based on Quillian’s work in artificial intelligence, which has a knack for oversimplifying mental processes.
Despite response time not always lining up, most psycholinguists today agree that nouns in English follow a hierarchical organization in semantic memory.
This paper might be a zoologist’s nightmare.
Of course, not all animals have skin (I’d be willing to bet my mental model of the world doesn’t count insects’ exoskeletons as skin), and while plants don’t have lungs, they still take in carbon dioxide and produce oxygen. Still, the idea that traits are stored in hierarchical categories roughly holds.
The authors call this saving of space “cognitive economy.”
This point has been contested by several other psycholinguists, who argue fast confirmations may just be due to strong noun-property associations.
Collins and Quillian’s original paper even includes a caveat that the assumptions regarding organization and cognitive economy are not meant to hold in all cases.
In 2012, AlexNet, a model now considered one of the most influential achievements for computer vision, had the best accuracy in the ImageNet Large Scale Visual Recognition Challenge, kicking off the global deep-learning frenzy we are still in today.
Hierarchical clustering of DeepMoji predictions shows that the model, a 2017 emotional-intelligence effort from MIT trained on 1,246 million tweets containing an emoji, using the emojis as labels, learns a reasonable grouping of emojis into categories and subcategories.
It is better able to predict sentiment in tweets than humans.
Other, less subtle issues are the existence of ImageNet’s “Misc.” subtree, with over 20,000 descendant synsets (upon first inspection, many of which seem to belong under “plant, flora, plant life”), and the question of how best to handle WordNet’s entire “abstract entity” subtree of English nouns.
The hierarchies themselves, of course, are also contestable.
Atomic elements and compounds, for example, are considered “abstract entities” in WordNet, and many of them don’t have physical-entity synset counterparts.
Is aluminum oxide abstract in the same way conflict is?

Though they are given a link to a Wikipedia page for the concept, they likely wouldn’t care to click on it to ensure correct labelling, since they 1) are paid small, fixed amounts of money for completing quick tasks, so the extra effort and time would be wasted with respect to the reward, and 2) would have no reason to suspect they are misidentifying the synset.
As with most things, the relevance will depend on the situation: it would be immensely helpful for all the various types of “fungus,” but could be less appropriate for “scavenger,” which has a WordNet child synset, “bottom-feeder.”