The value of any piece of information is only as good as its truthfulness. Data that’s out of date, inaccurate or just plain false is next to useless. Worse, it can lead to seriously flawed decisions.
The Roman Emperors knew that, which is why they kept Soothsayers from the Sibylline Sisterhood around (rejoice, Dr Who fans, for the reference) to guide them in their choices.
We don’t have Soothsayers around anymore, but we have Google and its semantic search engine. Semantic search is a Big Data solution to the problem of ever-increasing amounts of information accumulating on the web. As such, its success is defined, precisely, by its ability to deal with the four Big Data vectors.
The sheer amount of information being pumped out (Volume), its spread across the web (Velocity) and its ability to be repurposed (Variety) are signals in their own right. Each piece of information, each data point, each node, experiences these vectors in a unique combination that also makes the digital identity of that piece of information unique. What they all have in common, however, is the “yes” or “no” answer that’s provided by the fourth Big Data vector: Veracity.
Veracity is a measure of the truthfulness or provenance of the information you find on the web, and the ability of the computationally driven answer engine of semantic search to gain our trust rests completely upon it.
How Veracity Works
A patent recently awarded to IBM shows that the mechanics of this process rely upon semantic relationships, each of which has a very specific meaning and intent, which is very much what Google's semantic search is all about. The patent shows how semantic analysis ascribes a trustworthiness value to each connection and a total value to the semantic network of connections. If more than one semantic network is involved in the calculation of the answer, the search engine ascribes a trustworthiness value to each of them.
The moment a conflict is detected between two pieces of data, the search engine uses a trustworthiness algorithm to determine which is the correct one, annotates the obsolete or inaccurate document on the fly and presents the end user with both the correct version and the annotated inaccurate version.
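To make the idea concrete, here is a minimal Python sketch of trust-weighted conflict resolution. It is not the patent's algorithm: the `Edge` type, the use of a simple mean as the network's total value, and the sample sources and scores are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One semantic connection, e.g. (product, release_date, some date)."""
    subject: str
    relation: str
    value: str
    trust: float  # assumed per-connection trustworthiness in [0, 1]


def network_trust(edges: list[Edge]) -> float:
    """Total value of a network; a plain mean is an assumption of this sketch."""
    return sum(e.trust for e in edges) / len(edges) if edges else 0.0


def resolve_conflict(networks: dict[str, list[Edge]],
                     subject: str, relation: str) -> tuple[str, str]:
    """Return (source, value) from the most trusted network asserting the fact."""
    candidates = [
        (network_trust(edges), source, e.value)
        for source, edges in networks.items()
        for e in edges
        if e.subject == subject and e.relation == relation
    ]
    if not candidates:
        raise LookupError(f"no network says anything about {subject}/{relation}")
    _, source, value = max(candidates)
    return source, value


# Hypothetical sources and scores, invented for the example.
networks = {
    "vendor-site": [
        Edge("WidgetPhone", "release_date", "2014-03-01", 0.9),
        Edge("WidgetPhone", "maker", "Acme", 0.95),
    ],
    "old-blog": [
        Edge("WidgetPhone", "release_date", "2013-12-01", 0.4),
        Edge("WidgetPhone", "maker", "Acme", 0.6),
    ],
}

winner = resolve_conflict(networks, "WidgetPhone", "release_date")
print(winner)
```

In this toy setup the vendor's network carries the higher total value, so its release date wins and the blog's date would be the one flagged for annotation.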
The description of IBM’s patent on Veracity detection says:
With existing search mechanisms, a user search for the product will include web page documents that contain the original release date of the product, without providing any indication in these returned documents that the release date has changed. Thus, the user may be presented with erroneous information. In contrast, the annotation mechanism in the illustrative embodiments identifies obsolete data within a web page document, annotates the document with up-to-date information to override the obsolete data in the document, and returns the annotated document in the search results to the user. Thus, when a user clicks on the annotated document, the obsolete data is annotated in the document with up-to-date data also obtained from the repository. When a document comprises data whose veracity is in question due to the existence of conflicting data in other documents within the repository, a “trustworthiness” algorithm is used to dictate which data source contains the correct or most reliable data. Any data that is deemed to be incorrect or unreliable on the displayed web page is annotated automatically. In an alternative embodiment, rather than determining which data is correct via a trustworthiness algorithm, a link may be provided in the annotation which provides a summary of all the conflicting data, along with sources of the conflicting data. In this case, users may determine for themselves which data source is most reliable.
In plain English, it will determine whether a piece of data is true and will then do one of the following:
A. Display an answer (based upon it)
B. Display the correct web page and the incorrect, annotated one
C. Display the conflict regarding a web page’s truthfulness and then provide a link to all the sources that the search engine has determined create the conflict so that the end user can decide.
Which of these appears is most probably determined by the complexity of the search query and the degree of conflict experienced.
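How the choice between A, B and C is actually made is not spelled out, so the following Python sketch is only a guess at the shape of the decision. It looks at the degree of conflict alone (it ignores query complexity), and the `margin` threshold and the scores passed in are hypothetical.

```python
from enum import Enum


class Presentation(Enum):
    DIRECT_ANSWER = "A"      # display an answer
    ANNOTATED_PAGES = "B"    # correct page plus the annotated incorrect one
    CONFLICT_SUMMARY = "C"   # show the conflict and link to all the sources


def choose_presentation(trust_scores: list[float],
                        margin: float = 0.2) -> Presentation:
    """Pick an outcome from the trust score of each competing value.

    trust_scores holds one score per distinct value found for the same fact.
    margin is an assumed threshold: how far ahead the best value must be
    before the engine is willing to call the others wrong.
    """
    if not trust_scores:
        raise ValueError("at least one candidate value is required")
    ranked = sorted(trust_scores, reverse=True)
    if len(ranked) == 1:
        return Presentation.DIRECT_ANSWER      # nothing conflicts
    if ranked[0] - ranked[1] >= margin:
        return Presentation.ANNOTATED_PAGES    # clear winner, annotate the rest
    return Presentation.CONFLICT_SUMMARY       # too close to call


print(choose_presentation([0.9]))
print(choose_presentation([0.9, 0.5]))
print(choose_presentation([0.6, 0.55]))
```

The wider the gap between the best-trusted value and the runner-up, the more confidently the engine can annotate; when the gap is small, handing the sources to the end user is the safer option.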
‘Simple’ Questions Have Complicated Answers
To illustrate just how important this is, consider the ‘simple’ question of “Can a hippopotamus swim?” or “Can a kiwi bird fly?” As the image below shows, Google’s semantic search right now (December 2013) simply doesn’t know.
The reason that this is not an easy question for search to answer lies in the knowledge necessary to actually determine this from the page. In the traditional Boolean search a webpage that states: “The hippopotamus, a creature indigenous to parts of Africa, is the only mammal that cannot swim. It is also the only mammal that does not have hair.” has the same opportunity to come up in search as one that says: “There are a number of animals in the Edinburgh zoo, including penguins, zebras, and hippopotamuses. Visitors can feed the penguins, but they cannot swim in the penguin pool.”
There is simply insufficient information for search to distinguish the correct page. The keyword occurrence and frequency are the same. Thanks to this new approach, however, Google applies a relevancy algorithm that creates a semantic network out of the content of the page that looks a little like this:
The image shows the breakdown of the semantic relationships, with words being nodes (i.e. data points) and connections being edges. Once the relevancy algorithm has produced this breakdown, the answer begins to emerge, especially if the same breakdown is applied to the web page that Boolean search would have marked as identical:
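For readers who prefer code to diagrams, here is a rough Python sketch of the same contrast. The keyword check stands in for Boolean search. The two networks are written out by hand as (node, edge, node) triples, which is an assumption of the sketch: a real engine would have to extract them from the text, and the relation names are made up.

```python
import re

PAGE_A = ("The hippopotamus, a creature indigenous to parts of Africa, "
          "is the only mammal that cannot swim.")
PAGE_B = ("There are a number of animals in the Edinburgh zoo, including "
          "penguins, zebras, and hippopotamuses. Visitors can feed the "
          "penguins, but they cannot swim in the penguin pool.")


def keyword_match(text: str, keywords: list[str]) -> bool:
    """Boolean-style check: every keyword (or a plural of it) is in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return all(any(w.startswith(k) for w in words) for k in keywords)


# Hand-built semantic networks: words are nodes, relations are edges.
NETWORK_A = [
    ("hippopotamus", "is_a", "mammal"),
    ("hippopotamus", "indigenous_to", "africa"),
    ("hippopotamus", "cannot", "swim"),
]
NETWORK_B = [
    ("zoo", "has", "hippopotamus"),
    ("zoo", "has", "penguin"),
    ("visitors", "can_feed", "penguin"),
    ("visitors", "cannot", "swim"),
]


def relations_between(network, subject: str, obj: str) -> list[str]:
    """Edges that directly connect the subject node to the object node."""
    return [rel for s, rel, o in network if s == subject and o == obj]


query = ["hippopotamus", "swim"]
print(keyword_match(PAGE_A, query), keyword_match(PAGE_B, query))
print(relations_between(NETWORK_A, "hippopotamus", "swim"))
print(relations_between(NETWORK_B, "hippopotamus", "swim"))
```

Both pages pass the keyword test, but only the first network has an edge running from the hippopotamus to swimming. In the second, the edge to "swim" starts at the visitors, so that page says nothing about the question being asked.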
So, What Have We Learnt About Semantic Search?
A few things. First, Google uses a layering of algorithms to arrive at the answer. In this example alone it would have to create a semantic network out of each web page and then use:
- A Relevancy Algorithm – to determine the relevancy and meaning of connections (edges) within the semantic network
- A Trustworthiness Algorithm – to compare data between two documents with similar but conflicting information.
- A Deductive Reasoning Algorithm – if an outright answer is warranted in search.
- An Annotation Engine – that could enable Google search to annotate and correct wrong information on the fly and display the changes.
How Is This Applied?
Well, think of a major event that is heavily publicised by your company and that gets reblogged and mentioned everywhere across the web. Then, just a few days before the event, for whatever reason, the date has to change. While changing the information on your company website is easy, changing it across the web is next to impossible. There is no way you can realistically identify, and then chase up for a correction, all the websites where the original date of the event appears.
Applying a very similar approach, Google could now actually do that right on the search page, updating the out-of-date event listing with the correct date and showing what’s changed.
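A minimal Python sketch of that kind of annotation is below. It assumes the engine already knows which date is obsolete and which is current, which is the hard part. The function name, the bracketed annotation format and the sample event are all hypothetical.

```python
def annotate_obsolete(text: str, obsolete: str, current: str) -> str:
    """Keep the obsolete value visible but flag it with the current one."""
    return text.replace(obsolete, f"{obsolete} [updated: {current}]")


# A made-up third-party page that still carries the original date.
snippet = "Join us at AcmeConf on 12 March 2014 in London."
annotated = annotate_obsolete(snippet, "12 March 2014", "19 March 2014")
print(annotated)
# Join us at AcmeConf on 12 March 2014 [updated: 19 March 2014] in London.
```

Keeping the old date next to the new one, rather than silently overwriting it, matches the idea of showing the end user what has changed.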
There are wider implications. As information across the web scales (and we pump out more and more data each day), Google can begin to examine the semantic networks in its storage and determine the correctness of the information presented on the search page. This makes it a natural filter of false or incorrect information, which will begin to transform Google from a search engine to a Truth Engine.
As semantic search scales across the web, Google search will start to become the go-to place for verifying information, just as it has now become the world’s favourite means of spell-checking.
In the age of semantic search the Sibylline Sisterhood would be out of a job. The Soothsayers in need of retraining.