How Google Reads Images in Search

We used to have to put words on the web because computers could not see images. When all that a search engine had to go by was the alt text entered by a human and, maybe, the text surrounding an image, understanding exactly what the image showed was a very hit-and-miss affair.

The web is a very visual medium. Yet it has become the wordiest of domains precisely because search engines could deal with text but not images. Though we knew that a picture is always “worth a thousand words”, we needed at least a thousand words to be written in the hope that a web page would rank and a picture would be seen by enough human visitors to work its magic.

The approach created its own conventions that, with time, led to pictures being used mostly as handy visual breaks for online visitors who had to wade through reams of text. Few saw real value in using pictures to market themselves or to increase the layer density of each web page.

Words Are Vectors

Semantic search changes all of that, and Google came up with a brilliant way of tackling the problem that is predicated on language translation. Although the mathematics is complex, the idea behind it is not: every language has a strict logical order in which words can be associated with each other, which is based upon:

  • The internal grammatical logic of the language
  • Popular convention
  • Culture
  • Use & usage paradigms

When, for instance, Arnie (of Terminator fame) begins to say “Hasta la vista…” you know the sentence is not complete until the word “baby” is added to make the complete signature phrase: “Hasta la vista, baby”.

The way words are associated and how they occur is represented computationally as vectors (usually drawn as two-dimensional diagrams, although the actual vectors have many more dimensions). Each word is ascribed numerical values that make sense when viewed within the four parameters mentioned above. Those values can then be reverse-translated into another language using the same four parameters, with the sentence syntax changed to reflect the numerical values rather than a word-for-word translation.
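The idea can be illustrated with a toy example. This is only a minimal sketch, not Google's method: the three-dimensional vectors, the word lists and the function names below are all hypothetical and hand-written for illustration, whereas real systems learn vectors with hundreds of dimensions from very large amounts of text. Words with related meanings sit close together, and a word is "translated" by finding its nearest neighbour among the vectors of the other language.

```python
import math

# Hypothetical, hand-written vectors. Real systems learn these from text.
ENGLISH = {
    "dog": (0.9, 0.1, 0.0),
    "cat": (0.8, 0.3, 0.0),
    "house": (0.0, 0.2, 0.9),
}
SPANISH = {
    "perro": (0.88, 0.12, 0.02),
    "gato": (0.79, 0.32, 0.01),
    "casa": (0.02, 0.18, 0.91),
}


def cosine(a, b):
    """Cosine similarity: close to 1.0 when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(vector, vocabulary):
    """Return the word in `vocabulary` whose vector is most similar."""
    return max(vocabulary, key=lambda word: cosine(vector, vocabulary[word]))


def translate(word):
    """Map an English word to the Spanish word with the closest vector."""
    return nearest(ENGLISH[word], SPANISH)


if __name__ == "__main__":
    for word in ENGLISH:
        print(word, "->", translate(word))
```

Nothing in the lookup depends on spelling or word order. Only the position of each word in the vector space matters, which is why the same trick can, in principle, be applied to things that are not words at all.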

Images Have Words

The traditional approach to image translation is to take an image and try to match the elements within it to already known entities in an attempt to divine the context. This was covered in my Computer Vision article. Because an element can appear in many variants of shape, size and perspective in a picture, a probability assessment would be carried out to deduce whether some elements ‘detected’ in the picture should be there within that context.

The approach worked well enough when the data sets were restricted (when computers worked only with a limited set of categories or when images were restricted to particular data sets). That meant, however, that if you really wanted to go all-out and illustrate an article with a multi-layered picture that played on the theme and acted almost as a metaphor (something like illustrating the idiomatic phrase: “at our place of work we eat our own dogfood”), it was unlikely that a computer would be able to appreciate your ingenuity, and your cleverness would be lost.

Images Can Be Translated

Well, that cleverness need not be lost any more. Google researchers realized that they could combine a computer’s image-element reading ability with advances in entity detection within an image, translate that image into word vectors, and then use a traditional language-to-language translation model to create a caption. The effect is, essentially, a ‘translation’ of an image into language in a way that makes sense for both.

The net result of this approach is that complex images can now be translated in the same way that complex sentences can be translated from one language into another, without loss of meaning.
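To make the pipeline a little more concrete, here is a deliberately simplified sketch of the three steps: detected elements, then a vector for the image, then a caption. Everything in it is an assumption made for illustration. The element labels, the vectors, the candidate captions and the function names are hypothetical, and choosing the closest caption from a fixed list is only a stand-in for the trained translation model, which generates a new caption word by word.

```python
import math

# Hypothetical vectors: one axis per concept, kept tiny for readability.
WORD_VECTORS = {
    "dog": (1.0, 0.0, 0.0, 0.0, 0.0),
    "cat": (0.0, 1.0, 0.0, 0.0, 0.0),
    "frisbee": (0.0, 0.0, 1.0, 0.0, 0.0),
    "grass": (0.0, 0.0, 0.0, 1.0, 0.0),
    "sofa": (0.0, 0.0, 0.0, 0.0, 1.0),
}
DIMENSIONS = 5

# Hypothetical candidate captions. A real model writes its own.
CAPTIONS = [
    "a dog catches a frisbee on the grass",
    "a cat sleeps on a sofa",
    "a dog sleeps on a sofa",
]


def embed(words):
    """Average the vectors of the known words; unknown words are ignored."""
    known = [WORD_VECTORS[w] for w in words if w in WORD_VECTORS]
    if not known:
        return (0.0,) * DIMENSIONS
    return tuple(sum(axis) / len(known) for axis in zip(*known))


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def caption_image(detected_elements):
    """Pick the candidate caption whose vector best matches the image's."""
    image_vector = embed(detected_elements)
    return max(CAPTIONS, key=lambda c: cosine(image_vector, embed(c.split())))


if __name__ == "__main__":
    # Pretend an image detector reported these elements.
    print(caption_image(["dog", "frisbee", "grass"]))
```

The point of the sketch is the shared vector space: once the image and the sentences are expressed in the same numerical terms, matching one to the other becomes the same kind of problem as matching a sentence in one language to a sentence in another.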

The Web Is Visual Again

For marketers this has a significant impact. For a start, if you are good in the visual field, if you have a strong graphic designer, if you can take great photographs, you can use that as a strength to help you build your digital identity in the semantic web. 

More than that, however, there is a long list of changes that will make themselves felt as a result:

  • Pictures can truly support text now and add extra depths to it, thereby further enriching its semantic value.
  • You can have web pages with minimal or no text and only pictures.
  • You can use pictures as a primary marketing and branding aid.
  • The pictorial material you produce begins to become part of your digital identity, informing your authority and expertise just like text does.
  • The web will become increasingly visual.
  • Pictures will become the primary attraction on a web page (particularly with mobile devices and smartphones).
  • Words will have to be curbed to add extra value to a picture or explain things the picture perhaps cannot.
  • Google Image Search can become a powerful driver of traffic for search queries that may not even be reflected in the written content of the page.
  • Originality in the images used is now as important as originality in the written word.
  • Pictures will play an increasing role in overall branding efforts.

All of this is part of Google’s ongoing efforts to make semantic search a powerful tool that gives the end user a seamless experience. We are not quite there yet, but this is incredibly encouraging.


A picture is worth a thousand (coherent) words: building a natural description of images (Google Research)
Learning the meaning behind words (Google Research) 
Building a deeper understanding of images
Deep Visual Semantic Alignments for Generating Image Descriptions
Visual Search and How it Affects Your Marketing in a Semantic Web