In a recent Read Write Web article that was much more myth than reality, Alex Iskold posits that a semantic search engine must dethrone Google (myth 1). Fortunately, by the end of his article he concludes that he was misled into thinking that. I do not think he was misled at all. I just think he is confused about it all.
He posits a few trivial queries (myth 2) to show that there is “not much difference” between Powerset and Hakia and Freebase (myth 3), and that semantic search “is no better than Google search” (myth 4). After that Alex writes that there is a set of problems suitable to semantic search. He says these problems are wonderfully solved (already) by relational databases (myth 5).
It makes one wonder why we should mess with semantic search if the problems are already solved. The reason is that it is not true. Neither was any of the talk about query complexity.
It is not all these myths, exactly, but unclear thinking that leads to false expectations as well as false conclusions. Alex seems to be confused about the semantic web and semantic search. These are two different things but somehow Alex morphs them into one big database. Because I do want this post to impart some information instead of just being critical of a poorly informed post, let me start by debunking the myths.
Myth 1: Semantic Search should dethrone Google
For many search problems, semantics plays no role at all. Semantics plays a very limited role when the query is of a transactional nature, e.g., a search problem of the type: find all x.
Google is a search engine that solves search problems of this type. Yet the Google kingdom is based on being a publisher. Google uses super-fast and superficial keyword search to aggregate dynamic content from the internet for information seekers and advertisers alike. Google’s business does not even lend itself to semantic search for some very obvious reasons having to do with speed and scalability. Google’s best customers know exactly what they want and they certainly do not want any “intellectual” interpretations.
None of the so-called semantic search engine companies that I know of is pursuing a business strategy to dethrone Google as an information-seeker’s destination of choice. Powerset, for example, is not aggregating dynamic content like Google. Its business model does not seem to be based on a publishing or advertising model.
Powerset is using its understanding of semantics to assist the user (of Wikipedia) in relating to that relatively static content from several different mental, rational and conceptual perspectives. This is meant to assist the information-seeker with interpreting the content. That is a good and valid application of semantics.
This is not the position a company seeking to unseat Google would take. A company seeking to unseat Google would be better positioned by producing technology to assist advertisers in classifying, segmenting and targeting buyers.
Myth 2: Trivial and Complex Queries
Unfortunately Alex did not supply any complex examples in his post. He tried to imply that his trivial queries were complex and that the most complex was impossible to solve. This query was the one labeled impossible: “What’s the best vocation for me now?” I will use Alex’s query to debunk his misguided assumptions. First, let’s clarify by looking at the search problems represented by Alex’s natural language queries.
Note 1: Alex offers the first query as impossible to solve. It must be because Alex is expecting a machine and some software to divine his calling based presumably on his mood now and some mind-reading algorithm. I should hope most people would seek a human counselor rather than rely on the counsel of a semantic search engine for addressing their calling. It is fair to use a search engine to find a career or occupation and it is valid to expect a semantic search engine to “understand” the equivalence relationship between the terms occupation and vocation, in this context.
As I suggested, best + vocation, or just vocation alone, is a simple solution that should be easy to satisfy. However, this simple search solution fails on all search engines. Even so-called semantic search engines have a problem with this query (see comparative search results under myth 4 below). It is not because it is a complex query. It is because Alex used the word vocation. This word is not frequent and search engines do not know its synonyms. It is a complex concept in that it takes semantic acuity to “understand” it. No one talks about semantics in terms of acuity, though.
Nonetheless, a search for vocation + best, with the results sorted by most recent date, will create a valid search context in which one can reasonably expect a solution from a semantic search engine. Most people, I am assuming, would have a more reasonable expectation than Alex; one that may be fulfilled by this internet page suggested by Readware:
A semantic search engine needs semantic acuity to “understand” that the concept of a vocation and the concept of an occupation are related. Obviously none of the search engines mentioned in Alex’s article have such acuity of understanding. Some of the search engines tried to process the pronoun me and the word now. Instead of being a solution, it created a problem as can be seen in the search results (under myth 4) below.
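To make the idea of relating vocation to occupation concrete, here is a minimal sketch in Python of synonym-based query expansion. The synonym table, the stop-word list and the page titles are all hypothetical and invented for illustration; this is not a description of how Readware or any engine named in this post actually works.

```python
# Hypothetical synonym table; a real engine would need a far larger resource.
SYNONYMS = {
    "vocation": {"occupation", "career"},
    "occupation": {"vocation", "career"},
}

# Words this sketch simply ignores instead of trying to interpret them.
STOP_WORDS = {"what", "is", "the", "for", "me", "now"}


def expand_query(query):
    """Return the content words of the query plus their known synonyms."""
    words = [w.strip("?.,!'\"").lower() for w in query.split()]
    terms = {w for w in words if w and w not in STOP_WORDS}
    expanded = set(terms)
    for term in terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded


def search(query, documents):
    """Return documents sharing at least one word with the expanded query."""
    terms = expand_query(query)
    return [doc for doc in documents if terms & set(doc.lower().split())]


# Made-up page titles, for illustration only.
PAGES = [
    "Occupation outlook and career training",
    "Cheap vacation packages",
]

print(search("What is the best vocation for me now?", PAGES))
```

In this sketch the pronoun me and the word now are dropped rather than processed, and the page about occupations is found even though the query never uses that word. The hard part, of course, is the synonym table itself, which here is simply assumed.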
Note 2: This query needs a search engine with some more exotic search operators than a simple keyword search engine might provide. The query, however, is not complex. Some search engines may index US Senator as a single item to facilitate such a search. A search engine would need extended Boolean logic to process phrases using a logical AND between them. A more seasoned search engine, such as Google, would parse and logically process the query from a search box, without any specifying logic, and return an acceptable result. NLP-based engines (like Hakia and Powerset) try to do this too. They use propositional logic instead of Boolean logic. The effects are not very satisfying as can be seen below (in the search results listed under myth 4).
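As a rough illustration of what a logical AND between phrases means, here is a small Python sketch. The two phrases are the ones named above; the sample sentences and the function names are invented for this example, and the matching is deliberately naive (whitespace tokens, no punctuation handling).

```python
def contains_phrase(text, phrase):
    """True if the words of `phrase` appear consecutively in `text`."""
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


def phrase_and(documents, phrases):
    """Keep only the documents that contain every phrase (logical AND)."""
    return [d for d in documents if all(contains_phrase(d, p) for p in phrases)]


# Made-up sentences, for illustration only.
DOCS = [
    "A US Senator accepted a gift from a foreign entity",
    "The senator toured the US with a foreign delegation",
]

print(phrase_and(DOCS, ["US Senator", "foreign entity"]))
```

The second sentence contains most of the same keywords but neither phrase, so only the first one is returned. Treating US Senator as one unit is what separates this from a plain keyword match.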
A more sophisticated and indeed “semantic” search engine may interpret foreign entity according to a list of “foreign entities”. It would take some sophisticated semantics to algorithmically interpret what type of labels may be subordinate to foreign entity. For example: A German businessman, a Russian bureaucrat, a Japanese citizen, an American taxpayer. Which is the foreign entity?
Yet, it is also clear that an inventory of labels can be assembled under certain contexts. Building such an inventory constitutes the building of knowledge. A semantic search engine should help inventory and utilize knowledge for future researches. None of the semantic search engines that Alex mentioned do anything like this. Readware does do this.
Note 3: This search would benefit from a search engine that recognizes names. I think Hakia has an onomastic component. I am not sure about Powerset. However, this search works on nearly any search engine because there are plenty of pages on the web that contain the necessary names and terms. Otherwise there is nothing complex about this query.
The reality, as you can see, is that every query Alex offered is trivial. Yet they demonstrate what is wrong with so-called “semantic search”. That is, today’s semantic search products, including the NLP-powered search engines masquerading as “semantic” search, fail at real tests of semantic acuity. Before I get into the evidence, though, let me just say something about semantic search technologies in general.
Semantic Search Technologies
There are no public semantic search engines today. There are search engines and there are search engines with Natural Language Processors (NLP) that work on the indexing and query side of the search equation. Whether or not databases are used to store processed indexes or search results, databases and database technology like RDBMS and SQL have nothing to do with it.
The search engines that have the capacity for natural language processing usually claim to “understand” word and/or sentence semantics in a linguistic sense. This usually means that they understand words by their parts of speech, or they can look up definitions in a resource. Hakia and Powerset fall into this class, as do Cognition and several other search engines both in the U.S. and abroad. These are called semantic search engines and they claim to understand word sense and do disambiguation and so forth, but as I will show below, at questionable acuity.
Google is not a semantic search engine at all. While Hakia and Powerset may represent some small part of the spectrum of semantic search engines, they are hardly representative of semantic search. Along with Freebase and Powerset, more representative of “semantic web” search are SWSE, Swoogle and Shoe.
Besides these semantic web search engines, there are semantic search engines akin to Hakia, such as Cognition, as mentioned in this article at GigaOM, along with my own favorite, Readware. So, in summary, Alex’s comparison is not representative and is really poor evidence.
Myth 3: No difference between Powerset and Hakia and Freebase.
Well, this is just ridiculous. It is not only a myth, it is pure misinformation. Nothing could be further from the truth. While Powerset and Hakia use NLP technology that could be construed as similar, Freebase is essentially an open database that can be queried in flexible ways. Freebase and Powerset happen to be somewhat comparable because Powerset works on Wikipedia and uses RDF to store data, and semantic triples (similar to POS) to perform some reasoning over the data. Freebase also stores Wikipedia-like data in RDF tuples.
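For readers unfamiliar with triples, the following Python sketch shows the general idea of storing facts as (subject, predicate, object) and querying them by pattern. It is only an illustration: the triples and predicate names are made up, and it does not reflect the actual storage or query interfaces of Freebase or Powerset.

```python
# Hypothetical facts in (subject, predicate, object) form.
TRIPLES = [
    ("vocation", "also_known_as", "occupation"),
    ("vocation", "also_known_as", "career"),
    ("Powerset", "indexes", "Wikipedia"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Everything the store says vocation is also known as.
print(match(TRIPLES, subject="vocation", predicate="also_known_as"))
```

A store like this can answer flexible questions by leaving different positions open, which is quite different from parsing a sentence and matching its keywords.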
It is probably also worthwhile to mention that Hakia’s NLP comes from the long time work and mind of the eminent professor Victor Raskin and his students. Powerset’s NLP comes from the work of Ronald Kaplan, Martin Kay and others connected with Palo Alto Research Center, Stanford University and the Center for the Study of Language and Information (CSLI). Cognition’s technology is based on NLP work done by Kathleen Dahlgen.
While Hakia, Powerset and Cognition represent these notable NLP approaches, their search methods and results show they do not know a great deal about search tactics and solutions. They do not seem to be successful in mapping the sentence semantics into more relevant and satisfying search results. It seems, from the evidence of these queries, they only know how to parse a sentence for its subject, object and verb and, a lot like Google, find keywords.
Myth 4: Semantic Search is No Better than Google.
Hakia and Powerset are like neophytes in Google’s world of search. That alone makes these engines no better than Google. Yet, that does not apply to semantic search in general. The truth is that the semantics of the search engines we are talking about (Hakia, Powerset, Freebase and Search Monkey), do not appear to make the results any worse than those from Google. Let’s take a look at the Google search results for ‘What is the best vocation for me now’:
As may be predicted, the results are not very good (because the keyword vocation is not popular). Google also wants to be sure we do not mean ‘vacation’ instead of vocation. Hakia, on the other hand, strictly interpreted the query:
Just like the results from Google, these are not very satisfying. You might think that because Hakia is a semantic search engine, it would have the semantic acuity to “understand” that vocation and occupation are related. As you can see in the following search result, this could not be farther from the truth:
Not one of Hakia’s results had to do with occupational specialties or opportunities for career training and employment. Powerset did not produce any results when the term vocation is used and it really had nothing on occupation so it searched for best + me + now. There is nothing semantic about that and it is a pretty bad search decision as well. The results are useless; I will post them so you can judge for yourself:
When you have results like this, it really does not matter what kind of user interface you have to use. If it is a bad or poor user interface, it only makes the experience that much worse. Even if it is a good, fancy, slick or useful interface, it won’t improve irrelevant results.
Another so-called semantic search engine, Cognition, did not fare any better:
The search result above is useless, though it provides a starting point for further investigation, as does the search for occupation:
I actually was mildly surprised that Cognition related the term occupation to the acronym MOS, which means Military Occupational Specialty. Then I saw that they did not distinguish the acronym from other uses of the three-letter keyword combination. Again, not a very satisfying experience. I did not leave Freebase out, I just left them until last. All Freebase results do is confirm that vocation is an occupation or a career:
It was not possible for Freebase to accept and process the full query. As this result shows, the data indicates that a vocation is also known as an occupation, but none of these engines realize that fact.
Myth 5: Already solved by RDBMS.
If the search problem or the “semantic” problem could be solved by the RDBMS, Oracle would be ten times the size of Google and Google might be using Oracle’s technology, if Google existed at all. None of these problems (aggregated search, semantic parsing of the query and text, attention, relevance) are solved by any RDBMS. But Alex brushed over the real problems to make the claim that it is all up to the User Interface and the semantics only matter there. I suppose that was the point he was trying to make by including Search Monkey in his comments. This is just hogwash though. By that I mean that it is not true and it is in fact misleading.
Conclusions
It is plain to see that a semantic search engine needs acuity to discern the differences and potential relations that can form between existing terms, names and acronyms. It is also plain to see that none of the commercial crop of search engines have it. The Natural Language search engines, which have dictionaries as resources, do not associate vocation to occupation (for example) and therefore cannot offer any satisfying search results.
There are 350,000 words in the English language. How many do you suppose are synonymous and present a case just like this example? Parsing a sentence for its subject, object and verb, is fine. It does not mean it will be useful or helpful in producing satisfying search results.
It is foolish to think that NLP will be all that is needed to obtain more relevant search results. The fact is that search logic and search tactics are arts that are largely unrelated to linguistics or NLP or database technology. While language has semantics, testing of the semantics of so-called semantic search engines has demonstrated that the semantics, if they are semantics, are pretty weak. I have demonstrated that semantic acuity plays a large role in producing more relevant and satisfying search results. A semantic search engine should also help inventory and utilize knowledge for future researches. An informed search produces more satisfying results.

Another Myth is that you can test out the quality of any search engine with only a few queries, or one query, in your case.
Powerset is trying to match the meaning of a query with the meaning of a sentence in Wikipedia. That’s a much more modest goal than an engine that understands any English phrase completely, but a necessary step along that road.
I’d be happy to arrange a detailed demo of Powerset if you’d like more information from us.
{mark} powerset product manager
I’m not sure how you can type in single keywords “vocation” and “occupation” and say that Freebase and Cognition didn’t return useful results. You can see that Cognition clearly understands what those terms mean (acuity) from the thing on the right side of the results page. And Freebase served up a definition of “vocation.” For semantic search to show its value, queries should be longer, not simply one keyword, otherwise, the engines revert to pattern matching.
To the reader, it merely appears as if you’ve presented a significantly biased review of an uneducated review.
“Commonsensical’s” conclusion that Semantic NLP is “weak” is unsupported in his posting. Semantic NLP, as commonly defined, involves knowledge of morphology, ambiguity, phrases, synonymy and hyponymy.
Morphology increases recall by interpreting all the forms of words, such as “baby” and “babies”. The ability to disambiguate words within context enables a search engine to avoid returning irrelevant documents that use the same words in different meanings, such as “occupation” meaning “a military takeover” as opposed to “a job”, thereby improving precision. Knowledge of phrases helps the search engine return only those documents with the phrase (or its synonyms), not the words in the phrase used in different contexts. Synonymy increases the number of relevant retrievals by finding the same concept expressed in all the ways it can be expressed. Hyponymy helps the search engine find specific instances of more general concepts, such as “car” or “truck” in response to a query about “vehicles”. All of these abilities dramatically increase precision and recall of Search results over those returned using pattern search technology.
That said, there is a long way to go to create a Search technology that truly understands meanings the way humans do. It is doubtful that any technology in the foreseeable future will function to evaluate the needs of an individual in terms of life goals, as in the silly query “what vocation is best for me now?” In addition, the features of Semantic NLP described above do not constitute a real knowledge of the world, i.e. an encyclopedia of facts, though such knowledge bases can be used to augment the performance of search engines.
“Commonsensical” asserts that “The Natural Language search engines which have dictionaries as resources, do not associate vocation to occupation (for example) and therefore cannot offer any satisfying search results.” On the contrary, Cognition’s Semantic NLP DOES associate those two terms, because it “understands” the synonymy of word meanings. For Cognition, “occupation” meaning “job” is synonymous with “vocation” meaning “job”, and not “occupation” meaning a “military takeover” or “vocation” meaning a “religious predestination”. No context for the meaning of “occupation” was given in the single-word query, but Cognition guessed it meant “job” because that is the most frequent meaning of the word. Notice that the number of retrieved documents for “occupation” is identical to the number of retrieved documents for “vocation”.
–Kathleen Dahlgren, PhD
CTO, Cognition Technologies
I was waiting for Riza Berkan from Hakia to comment. I guess it is long enough.
Thank you all for your critiques. I do apologize if I showed bias toward or against any of the fine products being discussed here while talking about the NLP methods and semantic search in particular.
I am biased against claims of semantic search where semantic acuity is absent and such absence is verified by a lack of suitable correspondence between the input and the result.
I suppose that anyone using RDF and other elements of the semantic web can call their methods semantic; even that is a pretty low bar. I just don’t see the semantics in the NLP methods used to map search terms onto parts of speech and indexed sentences.
This is a hard topic to address in a few comments so I will write a post on that and put it up a little later today or tomorrow.
-Ken Ewell