The British police have an interesting interrogation method that’s a ramped-up version of the 20-questions game. Namely, they take turns asking you the same thing in as many different ways as they can imagine. The principle behind it is that if you ask enough questions, enough times, of the same subject, eventually the real truth begins to color their replies.
It’s easier to explain how something works when it no longer does. The reason for this lies in an obvious fact. When everything works as it should we forget about the effects and tend to focus on the mechanics. Because the system in question delivers what it promises we take its function for granted. As a result the “what” is conveniently overlooked and we focus on the “how”.
Let it break down at any point, however, and suddenly we become acutely aware of what it actually does. Email, which is terrific in the way it breaks up messages at the point of origin, transmits the fragments over the internet and then reassembles the message at the receiver’s end, is amazing until it stops. Then we suddenly realize just how huge a chunk of our business relies on emails getting through to us immediately.
It’s the same with cognitive computing and semantic technologies, terms that are increasingly interchangeable. When employed correctly, cognitive computing (which employs machine learning) takes masses of raw data and turns it into usable information by assessing the importance of each piece in relation to all the other pieces around it, and then weighing the importance of a cluster of connected data in relation to all the other, similar clusters found on the web. The result is answers that closely approximate what a person would be able to provide had they had access to all the world’s information and a brain the size of a planet.
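To make “assessing the importance of each piece in relation to all the other pieces” concrete, here is a minimal sketch using TF-IDF, a classical weighting scheme that stands in for the far more sophisticated methods cognitive systems actually use. The corpus and all strings are invented for illustration.

```python
import math

# Toy corpus: each "document" is one piece of raw data (invented examples).
corpus = [
    "data scientist publishes paper on machine learning",
    "machine learning improves search ranking",
    "conference badge and clipboard",
]

def tf_idf(term, doc, docs):
    """Weigh a term in one document relative to all the other documents."""
    words = doc.split()
    tf = words.count(term) / len(words)
    df = sum(1 for d in docs if term in d.split())
    idf = math.log(len(docs) / df)  # rarer across the corpus => heavier weight
    return tf * idf

# "machine" appears in two of three documents, so it carries little weight;
# "clipboard" appears in only one, so it weighs more where it does appear.
print(tf_idf("clipboard", corpus[2], corpus))
print(tf_idf("machine", corpus[0], corpus))
```

The point of the sketch is only the relational idea: no piece of data gets a weight in isolation; its importance is always computed against everything else in view.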
Not As Easy As It Sounds
What sounds easy to explain is hard to do. For a start, the algorithms that do all this have an accepted fail rate that, in the best-case scenario, is around 5% globally. But the global accuracy picture does not take into account what happens when the data required to cross-check and cross-reference the veracity of the connections is not there.
To illustrate what I mean, consider what happens when I turn up at a conference on Big Data and call myself a Data Scientist. Because I play to stereotypes and want to live up to expectations, I have the impressive name badge, the clipboard and the slightly odd professorial attire. To clinch the deal I also have a presentation running behind me and have paid 50 friends to turn up and tell everyone who I am.
In that environment I am a data point. My attire and presentation are my primary footprint and my 50 paid friends are my connections. Anyone entering that environment has no reason to suspect I am lying and no good reason to challenge me on what I am purporting to be.
But a Data Scientist is not a point of data that works in a vacuum. You would expect to at least find a business I am working with that independently verifies my expertise and title. A publication or two. A book maybe. At least one paper. Other publications, excerpts, comments, interviews and appearances that indicate that yes, I am who I say I am and I do what I say I do.
Should there be a doubting Thomas in the audience (and in this case he plays the role of a search engine bot) all he has to do is Google my name to find all the connections, reviews of my books, citations and mentions.
This is what cognitive computing does when it comes to information. Not only does a spider of some description check the complexity and veracity of the immediate web that the interlinked data has created, but it then checks its history across a much wider spectrum of information.
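Returning to the conference example, the corroboration check can be sketched as counting how many independent sources vouch for an entity in a link graph. The graph, the entities and the source names below are all invented; a real spider would do something vastly richer.

```python
# Toy web of (source, target) links: who vouches for whom.
# All names are invented for illustration.
links = [
    ("publisher.example", "alice"),
    ("journal.example", "alice"),
    ("news-review.example", "alice"),
    ("paid-friend-1.example", "bob"),
    ("paid-friend-1.example", "bob"),  # the same source repeating itself
]

def corroboration(entity, links):
    """Count distinct independent sources that vouch for an entity."""
    return len({src for src, dst in links if dst == entity})

print(corroboration("alice", links))  # several independent sources
print(corroboration("bob", links))    # one source: veracity is questionable
```

The self-claimed Data Scientist with 50 paid friends looks like "bob" here: lots of noise, but only one independent root to the claims.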
The 4Vs Rule
Data has a life that is governed by the four Big Data concepts:

- Volume
- Velocity
- Variety
- Veracity
Taken as a whole, the 4Vs represent a living, breathing piece of data (or datum, to be a little pedantic) which, once we get past the metaphorical phase, suggests that the data actually has impact. People are interested in it. It has relative importance and therefore it has some degree of existential truth (which is where the Veracity component comes in).
Lacking that (which is what happens in my closed-world example above), holes develop in the capacity of an algorithm to truly understand what is happening. Its assessment may show that trustworthiness is questionable, but beyond that it cannot really suggest anything.
The weakness here is in the conjecture. While humans can very quickly draw from their understanding of society, its structures and the possible pitfalls, and suggest a motive in the overt absence of evidence of trustworthiness, an algorithm can only present the next ‘best’ answer it has available, and that is rarely good enough.
How Does Google Map Semantic Connections?
Google used to use Google+ and the web at large to track individual posts, link them to websites and personal profiles, map sentiment in comments and compare it all with past profile activity and textbook ‘signature’ styles to see what is real, what is not and what is somewhere in between. It continues to do this across the wider web using machine learning technology to provide it with the only cost-effective means to do so.
Given the ability of computers to do everything faster and better, and their capacity to never forget, it is easy to imagine that there is an always-on, omniscient mega-machine keeping tabs on everything and everybody and assigning some kind of ever-evolving numerical value to everything. Clearly, this is not the case.
The reason lies in both the amount of information that is released every moment on the web and the computational power required to keep tabs on it all. Even a company as big as Google requires some kind of shortcut to make sense of it all, and those shortcuts lie in trusted entities. The problem is that it takes a long time to develop trusted entities in the same class as, say, Wikipedia or the New York Times. With time this problem will become a little smaller, though the amount of fresh data released on the web will only grow.
We Are The Final Link
The final link in the very long chain of processes that establish information as true or false on the web is us. Ultimately, our activities, shortcuts and transparency become key to maintaining veracity across the web. While we may not yet be at the point where everyone is accountable for their actions and feels responsible for what they post, by degrees we will get there, particularly as the divide between online and offline is continuously bridged, first by our mobile devices and now by the advent of virtual reality and augmented reality connections.
What Marketers and Businesses Need to Know
There is good news in all this for both marketers and businesses. If you’ve already got a copy of SEO Help then you’re ahead of the game and are already reaping the benefits. If you haven’t, however, you need, at the very least, to do the following:
- Create a data density in your online presence that at least matches your offline one.
- Find an audience. That means that on the web you need to engage. Do not just broadcast.
- Define your identity. If a guy selling cronuts can do it, anybody can.
- Think like a publisher. In Google Semantic Search I explained how none of us now has a choice. Just like opening up a shop forces you to become an expert on window displays, color psychology and lighting, operating on the web requires you to know what works in terms of text, pictures and video.
- Be personable. If your ‘voice’ and identity do not come across then people are unlikely to want to engage with a blunt, corporate sounding machine.
- Be real. Acknowledge faults and weaknesses and work to set them right.
These are minimum requirements and each takes a lot of effort to get right. But then again, something that requires hardly any effort at all is unlikely to hold much value in the eyes of the beholder, which means it will not really get you anywhere.
The question of whether a strong social signal helps your website rank in search has been at the very top of the questions asked about semantic search. While the intuitive answer is that “yes, social signals should play a role,” there are some practical considerations that prevent that from happening, which Google’s Matt Cutts addressed in a short video in January 2014:
Matt’s explanations feed directly into what Google representatives have repeatedly said at conferences I have attended, where they have explained that social signals are “dirty signals” and “weak” at best, because Google hasn’t got sufficient access to them or cannot always fathom the intent of the posts it crawls.
Incidentally, in the video above Matt talks about semantic search’s ability to extract a person’s identity from their social network profiles even when no real name or other apparent identifying information is mentioned.
While social signals, ‘weak’ or otherwise, are not a ranking factor, that does not mean they have no impact on search, or that they do not help raise the visibility of both personal profiles and the websites associated with them.
One way Google might do this is by applying machine learning algorithms to Twitter that operate under the premise of not-so-distant supervision.
For the record, unsupervised indexing of Tweets leads to the accumulation of errors that eventually degrade the indexing record. Supervised indexing, where human testers actively sample and correct the tagging of the indexed corpus, is also not without problems, as incorrect assumptions, human error and even false-positive tagging lead to drops in accuracy.
Google’s proposed approach of “adapting taggers to Twitter with not-so-distant supervision” takes a hybrid approach that appears to resolve the majority of problems associated with correctly understanding what a Tweet means and how important it is in relation to other things happening on the web. A “tag” is a piece of metadata that allows an indexing bot to correctly label a Tweet so that it can be incorporated into Google’s semantic index.
As Google’s own researchers say, this is far from easy: “The challenges include dealing with variations in spelling, specific conventions for commenting and retweeting, frequent use of abbreviations and emoticons, non-standard syntax, fragmented or mixed language, etc.” Add to this the fact that the 140-character limit, of necessity, gives Tweets a homogeneous quality that makes it hard to separate one from another in terms of structure, and you begin to see the scope of the challenge.
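The hybrid idea behind not-so-distant supervision can be sketched very roughly: a small seed of human-verified tags bootstraps automatic tagging, and only high-confidence predictions are accepted while the rest are routed back to human review. This is a toy illustration under that assumption, not Google’s actual system; the tweets, tags and threshold are all invented.

```python
from collections import Counter, defaultdict

# Seed of (tweet, tag) pairs verified by human testers (invented examples).
seed = [
    ("new phone launch today", "product"),
    ("phone battery review", "product"),
    ("election results are in", "news"),
    ("breaking news on the election", "news"),
]

# Score each word by how strongly it is associated with a tag in the seed.
word_tags = defaultdict(Counter)
for tweet, tag_label in seed:
    for word in tweet.split():
        word_tags[word][tag_label] += 1

def tag(tweet, threshold=0.7):
    """Auto-tag a tweet only when the evidence is confident enough."""
    votes = Counter()
    for word in tweet.split():
        votes.update(word_tags.get(word, Counter()))
    if not votes:
        return None  # no evidence at all: route to a human
    best, count = votes.most_common(1)[0]
    confidence = count / sum(votes.values())
    return best if confidence >= threshold else None  # low confidence: human

print(tag("phone review out now"))  # confidently auto-tagged
print(tag("cat video"))             # None -> human review queue
```

The design choice to illustrate is the middle path: neither fully unsupervised (errors accumulate) nor fully supervised (humans cannot keep up), but automatic where confidence is high and human-checked where it is not.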
How Semantic Search Indexes Tweets
Google can get round all these issues by using its superior computer processing power and search to perform a number of tasks:
- Analysis of a Tweet in relation to similarly structured search queries
- Analysis of a Tweet in relation to accelerating content across the web
- Analysis of a Tweet in relation to lexical content found on the website(s) a Tweet links to
- Analysis of the website(s) a Tweet points to
- Correlation of search queries and associated snippets provided by the search engine with the Tweet lexical structure
- Assessment of Click-Through-Rate (CTR) data on specific search queries as a qualitative guide to creating a tag library that can be applied to Tweet indexing
- Extraction of entities in Tweets and matching with known entities indexed
- Analysis of the Tweet in relation to other Tweets of a similar or directly tangential nature trending across Twitter
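One of the tasks above, comparing a Tweet’s wording with the lexical content of the page it links to, can be sketched as a simple content-word overlap check. The stopword list, Tweet and page text are invented; a real system would use far richer language models.

```python
# Minimal stopword list for the sketch (a real one is much longer).
STOPWORDS = {"the", "a", "an", "to", "of", "and", "in", "on", "is", "for"}

def content_words(text):
    """Lowercased, punctuation-stripped words minus stopwords."""
    return {w.lower().strip(".,!?") for w in text.split()} - STOPWORDS

def overlap(tweet, page_text):
    """Fraction of the Tweet's content words found on the linked page."""
    t, p = content_words(tweet), content_words(page_text)
    return len(t & p) / len(t) if t else 0.0

tweet = "Our new guide to semantic search is live"
page = "A complete guide to semantic search and how Google indexes content"
print(overlap(tweet, page))  # half the Tweet's content words match the page
```

A high overlap gives an indexer lexical evidence that the Tweet and the page are genuinely about the same thing, which is exactly what several of the listed analyses are after.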
In other words it is a little like rocket science.
What Do You Need to Do?
If you are serious about leveraging your Twitter presence to gain in semantic search, here is what you need to do:
- When you Tweet, link to a URL on your site (or a site relevant to your Tweet)
- Ensure that at least 20% of your Tweets contain a URL leading to a specific page on your website
- Make sure the page you Tweet contains sufficient information to allow Google to accurately determine the intent of the Tweet
- Use language in your Tweet that is reflected in the content of the URL you point to
- Remember that, because of the way Tweets can be semantically indexed, Twitter is now more closely linked to search than ever, and Tweet snippets could potentially come up in search
- Frame your Tweets with at least some correlation to potential search queries
- Avoid using syntax in your Tweets that is unusual or hard to fathom
- Create domain authority in your Twitter account by weighing your Tweets with content specific to your expertise
- Make sure that your Tweet and website page have at least one matching word that is not a stopword
- Do not link Tweets to your website’s home page; it is always discounted by Google’s indexing
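Two of the points above (the share of Tweets that carry a URL, and avoiding home-page links) are easy to self-audit. Here is a quick sketch; the Tweet data and URLs are invented for illustration.

```python
from urllib.parse import urlparse

# Invented sample of your recent Tweets.
tweets = [
    {"text": "New post on semantic indexing",
     "url": "https://example.com/blog/semantic-indexing"},
    {"text": "Good morning everyone", "url": None},
    {"text": "Check out our site", "url": "https://example.com/"},
    {"text": "How Google tags Tweets",
     "url": "https://example.com/blog/tweet-tagging"},
]

# Share of Tweets linking to a URL: aim for at least 20%.
with_url = [t for t in tweets if t["url"]]
ratio = len(with_url) / len(tweets)
print(f"Tweets linking to a URL: {ratio:.0%}")

# Tweets pointing at the home page: aim for zero.
home_links = [t for t in with_url if urlparse(t["url"]).path in ("", "/")]
print(f"Tweets linking to the home page: {len(home_links)}")
```

Run against a real export of your Tweets, a check like this turns the checklist from good intentions into numbers you can track.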
While social signals do not directly act as a ranking signal in search, a strong social signal helps Google establish confidence in its indexing of websites, which then does lead to higher visibility in search. That indexing also allows Google to pinpoint what the website does, increasing its understanding of how the website differs from others, which leads to more accurate matching of website content to the search queries entered in Google search.
A robust, carefully crafted Twitter presence can significantly aid your visibility in search by helping define your website’s relevance.
Adapting taggers to Twitter with not-so-distant supervision (Google Research)
Simple and knowledge-intensive generative model for entity recognition (Microsoft Research)
If I told you that you could make a million dollars in two weeks, from home, using Google, your scam-detection radar would go into overdrive. In most cases you would be right, but here’s one scenario where one person actually did just that.
I have, of late, been putting out posts that take us down to the very fundamentals of who we are and what we do. I know, and you know that I know, that you need “three steps that help me do this” content because you’re busy and juggling things, and really, if you give me your trust (which I value), you want me to tell you three, four, five (a number, anyhow) actionable things that will deliver what you want.