The easiest way to understand something really complicated is to examine where it fails and why. At the heart of any good information retrieval system (a.k.a. search) lies the ability to disambiguate between terms that look identical but mean different things.
Take this ‘innocent’ phrase for example: “trained SEAL”. I have playfully augmented it a little in the image above by adding “my brother is a trained SEAL” to show just what kind of misunderstandings could arise when language is used imprecisely.
The thing is that, among people, language is a very imprecise form of communication, one that allows for huge redundancies and for interpretations based on context. My brother, for instance, could be said to be like a trained, performing seal if I think he is too subservient, with little free will of his own. Or he could be a member of one of the world’s most accomplished special forces. When I write out the phrase “my brother is a trained SEAL”, the capitalization is a strong clue as to which one I am actually referring to.
In conversation, however, for someone to completely understand what I am saying, they will have to take into account:
- The context of our conversation (what have we been talking about until that moment, e.g. the armed forces or mindless slavery in a corporate job?)
- The style of our conversation (playful and wry or intelligent and literal)
- What they know about me (do I even have a brother? Am I the kind of person who uses irony and similes? Could my parents have been involved in some bizarre genetic experiment in the past involving seal DNA?)
- The quality of our command of English (are we both native speakers? Is there a language barrier?)
- Any previous history we might have had (is this a new meeting? Are there any clues from previous ones that help decode my utterance?)
- Our shared culture (if I am from Tibet it is highly probable that I do not understand what a SEAL is)
Obviously the very words “my brother” are a strong contextual clue that helps narrow things down a bit, and semantic search would use them in exactly the same way to give a better answer. Consider, however, how much harder this becomes if I use only the phrase “trained SEAL”, taking away even the little helpful, contextual information provided by “my brother”, and you begin to understand a little of the enormity of the task that semantic search indexing is presented with.
In semantic search a machine must not only be able to discern that something is an entity (i.e. a real-world or fictional object associated with very specific attributes, qualities and values), but it must also understand that one entity is not like another even if they share the same name, just as one word is not like another even if the two sound the same.
The way this is solved at the individual word level is also an indication of how it can be solved at the entity level. Recently Google and Microsoft announced a competition to help push research further into this domain, particularly where “short text” search queries (like my “trained SEAL” example) are concerned.
This shows that the challenge is far from solved. Progress, however, is being made.
Two Approaches to Solving Disambiguation
There are two ways this problem can be solved. One is the dictionary, or rule-based, approach: basically, a dictionary is fed into search. Dictionaries cover all definitions, including examples, and this allows for a high degree of accuracy. Because dictionaries are governed by precise rules (terms appear in alphabetical order; verbs, nouns, adjectives and idioms are clearly marked; etc.), this provides an easy first fix.
To better understand the approach consider how a term-checker was designed within a closed data set:
There were rules, definitions and meanings, all ascribed to it, forming an internal dictionary of sorts against which specific words could be checked and the correct definition identified.
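The dictionary-based idea can be sketched in a few lines of code. What follows is a simplified version of the classic Lesk algorithm: pick the sense whose dictionary definition shares the most words with the surrounding context. The two glosses, the sense labels and the stopword list are all invented for illustration; a real system would use a full dictionary such as WordNet.

```python
# A minimal sketch of dictionary-based disambiguation (simplified Lesk).
# Glosses and sense labels are invented for illustration.

STOPWORDS = {"a", "an", "the", "of", "and", "to", "for", "at", "or", "in", "my"}

GLOSSES = {
    "SEAL (special forces)": "member of an elite navy special forces unit trained for combat missions",
    "seal (animal)": "marine mammal often trained to perform tricks at a zoo or aquarium",
}

def content_words(text):
    """Lower-case the text and drop common stopwords."""
    return set(text.lower().split()) - STOPWORDS

def disambiguate(context):
    """Return the sense whose gloss overlaps most with the context."""
    ctx = content_words(context)
    return max(GLOSSES, key=lambda sense: len(ctx & content_words(GLOSSES[sense])))

print(disambiguate("my brother joined the navy and became a trained seal"))
# the military gloss wins on the overlap {"navy", "trained"}
```

Note how a single extra context word (“navy”, “zoo”) is enough to tip the overlap one way or the other, which is exactly why stripping away context, as in the bare query “trained SEAL”, makes the problem so much harder.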
But just as you cannot learn a language simply by learning its vocabulary, there are limitations to this approach. They were highlighted very recently when IBM’s Watson, fed all the terms found in the Urban Dictionary, delivered an inappropriate answer to a question (by calling it BS; the Urban Dictionary had, sadly, to be removed from Watson’s knowledge base).
The second approach involves machine learning. This allows the indexing of masses of data and the relationships between indexed terms (and entities) which the machine then has to try and understand.
The image below, drawn from the Apache Stanbol site, helps illustrate the approach:
The idea behind this is that if you provide a small set of very basic rules that allow a machine (or a search engine) to index data and sort it, then, as it scales, its understanding of the process becomes ever more refined.
To help it along, a supervised data set is usually used; that is to say, a human operator feeds in a number of specific words and the contextual information that helps disambiguate their usage in the first instance (a little like starting out with SEAL = elite special forces and seal = cute marine mammal, and then allowing the machine to scale this to all the other sorts of “seal” that can be found).
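That supervised seeding can be sketched as a toy classifier: a human labels a handful of example sentences, the machine counts which content words co-occur with each sense, and new sentences are scored against those counts. All the training sentences, sense labels and the stopword list below are invented for illustration; real systems train on far larger labelled corpora.

```python
from collections import Counter

# A toy sketch of supervised disambiguation: count which content words
# co-occur with each human-assigned sense label, then score new text.
# Training sentences and labels are invented for illustration.

STOPWORDS = {"a", "an", "the", "and", "my", "at", "he", "its"}

LABELED = [
    ("the navy deployed a special forces team", "SEAL (special forces)"),
    ("he completed military combat training", "SEAL (special forces)"),
    ("the seal balanced a ball at the aquarium show", "seal (animal)"),
    ("marine mammals swim under the ice", "seal (animal)"),
]

counts = {sense: Counter() for _, sense in LABELED}
for sentence, sense in LABELED:
    counts[sense].update(w for w in sentence.lower().split() if w not in STOPWORDS)

def classify(sentence):
    """Score each sense by how often its training words appear."""
    words = [w for w in sentence.lower().split() if w not in STOPWORDS]
    return max(counts, key=lambda sense: sum(counts[sense][w] for w in words))

print(classify("my brother joined the military and the navy"))
```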
Such an approach, as you can imagine, is expensive, particularly when you consider the size and constant expansion of the web. To get around this, two additional methods have been applied: semi-supervised machine learning (minimal human intervention that creates a set of rules which can then be expanded upon) and unsupervised machine learning.
The latter requires a machine (in this case search) to learn independently of human intervention, based on its indexing of data and its own understanding of usage. A research paper published by the University of Pennsylvania showed that, given a small set of rules (like, for example, dictionary definitions) and then let loose on a complex (but in this case limited) set of data, an algorithm is capable of disambiguating meanings from identical words with a high degree of accuracy.
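A rough sketch of how a small set of seed rules can be expanded without human help is bootstrapping in the spirit of the Yarowsky algorithm: start from two seed indicator words, label whatever sentences they match, then treat every content word in a labelled sentence as a new (weak) indicator for the next round. The sentences, seeds and stopword list below are invented for illustration; a real system would also score and prune the mined indicators.

```python
from collections import Counter

# A rough sketch of unsupervised bootstrapping (Yarowsky-style):
# seed indicators label some sentences, labelled sentences yield new
# indicators, which label more sentences in the next round.
# Sentences and seeds are invented for illustration.

STOPWORDS = {"a", "an", "the", "its", "under", "six"}

SENTENCES = [
    "the navy unit finished its combat mission",
    "combat training lasted six months",
    "the zoo keeper fed the marine mammal",
    "the marine mammal dived under the ice",
]

SEEDS = {"navy": "SEAL", "zoo": "seal"}

def bootstrap(sentences, seeds, rounds=2):
    indicators = dict(seeds)
    labels = {}
    for _ in range(rounds):
        # Label each sentence by majority vote of its known indicators.
        for s in sentences:
            votes = Counter(indicators[w] for w in s.split() if w in indicators)
            if votes:
                labels[s] = votes.most_common(1)[0][0]
        # Mine new (weak) indicators from the sentences just labelled.
        for s, sense in labels.items():
            for w in s.split():
                if w not in STOPWORDS:
                    indicators.setdefault(w, sense)
    return labels

print(bootstrap(SENTENCES, SEEDS))
```

After two rounds the seeds “navy” and “zoo” have propagated, via mined words like “combat” and “marine”, to sentences that contain neither seed word.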
One way this could happen is shown in a Digital Library Research magazine article on Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries.
The result is an approach that is scalable and highly accurate, and that allows search to constantly learn and expand.
How Does All This Affect You?
There are several practical takeaways here. First of all, when you consider the sheer amount of data that is placed on the web and the constant evolution of human language and culture, the task is an ever-evolving one; hence, easy it ain’t.
Second, although having structured data on your website will help, you can see that Google is quite capable of extracting what it needs regardless.
To help it along you should have content on your website that:
- Is complete in the sense that it reads as an intelligible resource to human visitors
- Links to entities or authoritative sources that can provide additional definitions (where necessary)
- Follows a logical structure in its presentation of topics
- Contains sufficient information to help clarify the meaning of what is being discussed
To understand the importance of this, consider that Facebook, with its treasure trove of data and social connections, never thought about disambiguation until very recently. As a result Facebook search has always sucked, and it is only now that they are beginning to address it, though, as Microsoft has said, there are only so many good search engineers to go around and Google tends to have most of them.
So, think about how you will best clarify what your online business does through the content on your website, and then set about creating it. Think about how you create clarity for the human visitor (first), and apply it. And think about how you establish the expertise of what you do, and implement it. Basically, consider just how you will establish your uniqueness in what you do and how you do it, and then set to it.
This way you’re giving your website a really good chance in semantic search indexing. (Plus you have now understood one of the hardest concepts in search).