David Amerland

Understanding Some of the Complexities of Semantic Search

Semantic search layers

A McKinsey & Company report conservatively estimates the global value of search at $780 billion. With stakes that high it’s no surprise that Google, Apple, Microsoft, Amazon, Facebook and even Wolfram Alpha are competing to develop the perfect semantic search.

While search technology is only imperfectly understood at its point of implementation, its effects are clearly visible. Just as with a vehicle rolling down a road, specific assumptions can be made that allow a reverse engineering of sorts, so that what’s under the hood can be understood a little better.

There are six specific components we are going to look at here as they relate to semantic search:  

  • Worst-Case Execution Time (WCET)
  • Analysis of Algorithms
  • Real-time components in semantic search
  • Machine Learning Models
  • Linear scale semantic mining
  • Knowledge-based semantic analysis (using linear functionals)

Although each is independent as a concept, when it comes to search they are all related. In explaining what each is, I will also show how they fit together and why and, eventually, why all this matters beyond the “that’s nice to know” stage.

The Problem of Verification

Worst-case execution time (WCET) is a concept used to determine whether a piece of software is functioning within its required time limits. It is standard computing industry practice when assessing critical real-time systems, and it is increasingly used in information retrieval (IR), where certain search queries may take a long time to return a result.

Because lengthy computations are expensive, WCET is defined as the maximum length of time a computation should take to execute on a specific hardware platform. A good end-user experience requires a way of defining the boundaries of a worst-case execution time. But with information retrieval being as broad as the people who use it, this in itself is problematic, and there are no established best practices that can be applied.
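
To make the WCET idea concrete, here is a minimal sketch in Python of how such a boundary might be enforced, assuming a hypothetical 200-millisecond per-query budget and a stand-in retrieval function; real systems enforce this at a much lower level than application code.

```python
# Minimal sketch: capping a search computation at a worst-case budget.
# The 200 ms budget, the stand-in retrieval function and the fallback
# behaviour are illustrative assumptions, not production figures.
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

WCET_BUDGET_SECONDS = 0.2  # hypothetical per-query time budget

def expensive_retrieval(query):
    # Stand-in for a potentially long-running retrieval computation.
    time.sleep(1.0)
    return ["doc-a", "doc-b"]

def search_with_budget(query):
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(expensive_retrieval, query)
    try:
        return future.result(timeout=WCET_BUDGET_SECONDS)
    except TimeoutError:
        # Budget exceeded: degrade gracefully instead of hanging.
        return ["<fallback: partial or cached results>"]
    finally:
        pool.shutdown(wait=False)

print(search_with_budget("yellow marbles"))  # fallback after ~0.2 s
```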

To solve the problem, compilers are written which apply a semantic analysis model to the algorithms used in search to determine, on a per-need basis, the WCET boundaries. Semantic search applies machine learning models to understand:

  • Search queries (employing Natural Language Processing methodologies)
  • Context (by collecting relevant end-user data)
  • Relevance (by identifying the semantic concepts in the information contained in documents)

To illustrate the magnitude of the challenge and how it differs from Boolean search, which relied on keywords to retrieve information, consider that the phrases “guys shooting the hoop in the car park” and “boys playing basketball in the gym” are conceptually very similar, yet there are no keywords linking one with the other. The quality of the response here requires that the concept is understood and then the shortest path to the answer calculated while the retrieval algorithm is running.
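
To see the gap in code, consider this toy sketch: the two phrases share only stopwords as keywords, yet map onto the same concepts once a concept table is applied. The table is hand-built purely for illustration; a real system would derive such mappings from ontologies and usage data.

```python
# Toy illustration: zero meaningful keyword overlap, strong conceptual
# overlap. CONCEPTS is an invented stand-in for learned mappings.
CONCEPTS = {
    "guys": "person", "boys": "person",
    "shooting": "basketball", "hoop": "basketball", "basketball": "basketball",
    "car": "place", "park": "place", "gym": "place",
}

def concepts_of(phrase):
    return {CONCEPTS[w] for w in phrase.split() if w in CONCEPTS}

q1 = "guys shooting the hoop in the car park"
q2 = "boys playing basketball in the gym"

print(set(q1.split()) & set(q2.split()))  # {'the', 'in'} -- stopwords only
print(concepts_of(q1) & concepts_of(q2))  # {'person', 'basketball', 'place'}
```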

Many Roads Lead to Rome

The assumption with semantic search is that its relational analysis of concepts and context reveals many paths to the correct answer. This is where analysis of algorithms comes in. Essentially this is a way of choosing the best ‘tool’ for the job based on a correct assessment of the job at hand.

Search is a complex amalgamation of a large number of algorithmic processes, each of which does something specific and all of which are subordinate to the task of providing the best possible answer to a search query.
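
As a rough sketch of that tool selection, the dispatch might look something like the Python below. The query categories and the strategies attached to them are invented for illustration; a production system would classify queries with far richer signals.

```python
# Sketch: classify the query, then dispatch to a matching retrieval
# strategy. Categories and strategies are illustrative assumptions.
def classify(query):
    if query.endswith("?"):
        return "factual"
    if any(word in query for word in ("news", "latest", "today")):
        return "fresh"
    return "general"

STRATEGIES = {
    "factual": lambda q: f"knowledge-graph lookup for {q!r}",
    "fresh":   lambda q: f"recency-weighted index scan for {q!r}",
    "general": lambda q: f"standard ranked retrieval for {q!r}",
}

def answer(query):
    return STRATEGIES[classify(query)](query)

print(answer("latest football scores"))
print(answer("Who is the President of the United States?"))
```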

Real-Time is Challenging

Even before taking into account the impact of breaking news and real-time developments on particular search queries, or the need to verify the accuracy of information that is itself changing all the time, semantic search is a huge challenge.

Trust and trustworthiness are a fluid dynamic, with specifically ascribed scores based on the assessment of specific connections. Recency and freshness play their part. Authority and expertise are an integral part of the equation. There is a real-time computation component even to the most historically stable of search queries.
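
One way to picture such a fluid score, purely as a sketch, is a weighted blend of authority, the assessment of connections and freshness, with freshness decaying as content ages. The weights and half-life below are invented for illustration.

```python
import math

# Hypothetical trust score: weighted blend of authority, connection
# assessment and freshness; freshness halves every 30 days. All
# weights and the decay rate are invented for illustration.
def trust_score(authority, connections, age_days,
                w_auth=0.5, w_conn=0.3, w_fresh=0.2, half_life_days=30):
    freshness = math.exp(-math.log(2) * age_days / half_life_days)
    return w_auth * authority + w_conn * connections + w_fresh * freshness

print(round(trust_score(authority=0.9, connections=0.7, age_days=2), 3))
print(round(trust_score(authority=0.9, connections=0.7, age_days=365), 3))
```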

This means that someone asking “Who is the President of the United States?” will get an answer that is correct in the year 2025 even when it is different from the answer to the exact same query in the year 2015. 
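
A minimal sketch of how that can work: store facts with validity intervals and answer relative to the date the question is asked. The entries below are a small, deliberately incomplete sample.

```python
from datetime import date

# Facts with validity intervals; None marks a still-open interval.
FACTS = [
    ("president_of_us", date(2009, 1, 20), date(2017, 1, 20), "Barack Obama"),
    ("president_of_us", date(2021, 1, 20), date(2025, 1, 20), "Joe Biden"),
    ("president_of_us", date(2025, 1, 20), None, "Donald Trump"),
]

def lookup(predicate, as_of):
    for pred, start, end, value in FACTS:
        if pred == predicate and start <= as_of < (end or date.max):
            return value
    return None  # no fact valid on that date in this sample

print(lookup("president_of_us", date(2015, 6, 1)))  # Barack Obama
print(lookup("president_of_us", date(2025, 6, 1)))  # Donald Trump
```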

If we leave the run-time of the computation unbounded, we run the risk of even the simplest of queries taking a long time to return a result (which is why WCET limits are important).

Machine Learning and Linear Semantic Mining 

To save time in presenting the computational results of information retrieval, machine learning is employed: usually human-trained systems, capable of scaling their basic knowledge on their own, that apply linear semantic mining methods to web documents. Linear semantic mining looks for direct connections between documents, establishing a primary point of similarity based on attributes like synonyms, structure and subject, before the results are then filtered for relevance.
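
A minimal sketch of that idea, assuming toy document records with subject, structure and synonym attributes, a single linear pass for scoring and a simple threshold as the relevance filter:

```python
# Toy linear semantic mining: one pass over documents, scoring direct
# attribute matches, then filtering. Records and the threshold of 2
# matching attributes are illustrative assumptions.
DOCS = [
    {"id": 1, "subject": "basketball", "structure": "article", "synonyms": {"hoops", "b-ball"}},
    {"id": 2, "subject": "cooking",    "structure": "recipe",  "synonyms": {"cuisine"}},
    {"id": 3, "subject": "basketball", "structure": "video",   "synonyms": {"hoops"}},
]

def mine(profile, docs, threshold=2):
    results = []
    for doc in docs:  # a single linear pass
        score = ((doc["subject"] == profile["subject"])
                 + (doc["structure"] == profile["structure"])
                 + bool(doc["synonyms"] & profile["synonyms"]))
        if score >= threshold:  # relevance filter
            results.append((doc["id"], score))
    return sorted(results, key=lambda r: -r[1])

profile = {"subject": "basketball", "structure": "article", "synonyms": {"hoops"}}
print(mine(profile, DOCS))  # [(1, 3), (3, 2)]
```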

This approach only works when the index in which all this takes place is knowledge-based, as opposed to one built on lexical analysis alone.

Knowledge-Based Searches

Indices like Google’s Knowledge Graph and Knowledge Vault create vast ontologies out of the current collection of indexed web documents. This makes the task of navigating them both faster and more reliable.

The structured data within a knowledge-based index reveals more shortcuts to computing the answer to a search query, because it exposes relational connections between concepts within the ontologies mapped in the system. To understand the advantage offered by structured data and ontologies, consider the magnitude of the task of counting how many yellow marbles exist in a room filled floor-to-ceiling with marbles, and how much easier it becomes when all the marbles in the room have been placed in square boxes, each containing ten marbles, and stacked by colour.
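
The analogy translates directly into code. In the sketch below, scanning every marble stands in for querying an unstructured index, while a one-off grouping by colour stands in for the knowledge-based one; the data is, of course, invented.

```python
import random
from collections import Counter

marbles = random.choices(["yellow", "red", "blue", "green"], k=100_000)

# Unstructured room: inspect every single marble per query.
count_scan = sum(1 for m in marbles if m == "yellow")

# Knowledge-based index: group once, then answer colour queries instantly.
by_colour = Counter(marbles)
count_indexed = by_colour["yellow"]

assert count_scan == count_indexed
print(count_indexed)
```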

The interdependence of all these structures in semantic search is best illustrated by this diagram:

[Diagram: Semantic Search unpacked]

Putting Your Knowledge to Work

This is your TL;DR part. Semantic search is a little bit like rocket science: its index needs to be built for each language and culture separately. Its effects, though, aren’t. To take advantage of its power and gain greater visibility, here’s a list of activities you should keep in mind:

  • Create content that reflects your expertise
  • Establish your authority by creating content depth
  • Use hashtags to mark content
  • Employ structured data (see the sketch after this list)
  • Create ontologies
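
On the structured data point, here is a minimal example: schema.org Article markup expressed as JSON-LD, built and serialized in Python. The field values are placeholders to adapt to your own content.

```python
import json

# Minimal schema.org Article markup as JSON-LD; values are placeholders.
article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Understanding Some of the Complexities of Semantic Search",
    "author": {"@type": "Person", "name": "David Amerland"},
    "about": "semantic search",
}

print(json.dumps(article, indent=2))
```
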
Sources: 
Retrieving and Organizing Web Pages by “Information Unit”
Query Relaxation by Structure and Semantics for Retrieval of Logical Web Documents
A compiler framework for the reduction of worst-case execution times
Information Retrieval by Semantic Similarity
The Worst-Case Execution Time Problem — Overview of Methods and Survey of Tools
Automatic Selection of Machine Learning Models for Compiler Heuristic Generation
A method for computing lexical semantic distance using linear functionals
The Semantics and Proof Theory of Linear Logic
A Linear Operational Semantics for Termination and Complexity Analysis

 

© 2017 David Amerland. All rights reserved