Archive for the ‘Semantic Search’ Category

What is “meaning” in questions such as: what is the meaning of life? It is the same as asking what is the truly real significance of life. Any answer is only theoretical.  Intuitively, any answer must be universal.  The truly real significance must, by definition, be significant for everyone.

That makes the notion appear to be either exaggerated or rather improbable.  The universality of such a theory of meaning would rest on the multitude of “real” things that are perceived by the theory as salient, pertinent properties and relations in “real life” and to humanity in general.  It would have to include everything we can imagine in experience.  How could it be possible?

This would also make it necessary to correspond with every “real” experience, in just enough dimensions (and no more) to make such experience “really” meaningful.  Intuitively, it must capture or cover any continuous or discrete distributions or extensions of “real” natural structure, elements or processes, in three dimensions of space and one dimension of real presence or immediate existence x.

It is very complex but not impossible.  On the one hand, one cannot help but wonder how to deal with such complexity.  On the other hand, we notice that very young children do it. Four-year-old children seemingly adapt to complexity with very little problem.  It is the sophistication and obfuscation that come later in life with which they have problems.  At four, children are already able to tell the difference between sensible and nonsensical distributions and extensions of reality, irrespective of whether they are of the continuous or discrete variety.

These continuous or discrete distributions and extensions bear some additional explanation, mainly due to their overarching significance in this context. First, they establish a direct correspondence with our most immediate reality. Every time we open our eyes, we see a real distribution of colored shapes.  Such a real distribution is nature’s way of communicating its messages to consciousness, via real patterns.

Second, perceived distribution patterns directly suggest the most fundamental ontological concept in theoretical physics: a field configuration, which in the simplest example of a scalar field can be likened to a field of variable light intensity.  That life is intense and that meaning is intense is not something one ought to have to prove to anyone. I will come back to intensity in another post, as I want to continue commenting on presence or real and immediate existence x. As you see, we must, in practice and in effect, solve for the real meaning of x.

Meaning in this case, so defined, is literally the significance of truth, or more appropriately, what one interprets as significant or true within the dimensions of intense messages or information pertaining to real life as specified above. So, we must begin, undoubtedly, by defining what true is; then, proceeding to the next step, we ought define the elements and structure of one’s interpretation of this truly significant nature of life. I did it a little backwards in this respect, and this has always created a bit of confusion that I did not see until recently.

One begins any such analysis by examining a subject’s real elements and structures. For the subject of truth, one also searches the literature where it is well represented. Such a search conducted on the subject of truth brings a broad range of ideas. To try and make a taxonomy of ideas from the varied opinion found there would turn out to be an exercise in incoherence, but it ought be acceptable to reference some theories and practices that have been adopted.

Ibn Al-Haytham, who is credited with the introduction of the Scientific Method in the 10th century A.D., believed, “Finding the truth is difficult and the road to it is rough. For the truths are plunged in obscurity” (Pines, 1986, Ibn al-Haytham’s critique of Ptolemy, in Studies in Arabic Versions of Greek Texts and in Medieval Science, Vol. II, Leiden, The Netherlands: Brill, p. 436). While truths are obscured and obfuscated, there can be no doubt that the truth does exist and that it is there to be found by seekers. I do not accept views or opinions that the average layman is too stupid or otherwise not equipped to figure it out for themselves.

The Modern Correspondence Theory of Truth.

While looking for the truth it helps to know what shape it takes or what it may look like when one happens upon it or finds it lying around and exposed to the light. According to some, truth looks like correspondence between one thing or element and another. Scientists have long held a correspondence theory of truth. This theory of truth is at its core an ontological thesis.

It means that a belief (a sort of wispy, ephemeral, mostly psychological notion) is called true if, and only if, there exists an appropriate entity, a fact, to which it corresponds. If there is no such entity, the belief is false. So you see, as we fixate on the “truth of a belief” (a psychological notion, a thought of something, to be sure, but a concrete thing nonetheless), we see that one thing, a belief, corresponds to another thing, an entity called a fact. The point here is that both facts and beliefs are existing, real entities; even though they may also be considered psychological or mental notions, beliefs and ideas are reality.

While beliefs are wholly or entirely psychological notions, facts are taken to be much stronger entities. Facts, as far as neoclassical correspondence theory is concerned, are concrete entities in their own right. Facts are taken to be composed of particulars and properties and relations or universals, at least. But universality has turned out to be elusive and the notion is problematic for those who hold personal or human beliefs to be at the bottom of truth.

Modern theories speak of “propositions,” which may not be any more real, after all. As Russell later says, propositions seem to be at best “curious shadowy things” in addition to facts (Russell, Bertrand, 1956, “The Philosophy of Logical Atomism”, in Logic and Knowledge, R. C. Marsh, ed., London: George Allen and Unwin, pp. 177-281; originally published in The Monist in 1918; p. 223). If only he were around now; one can only wonder how he might feel or rephrase.

In my view, the key features of the “realism” of correspondence theory are:

  1. The world presents itself as “objective fact” or as “a collection of objective facts” independently of the (subjective) ways we think about the world or describe or propose the world to be.
  2. Our (subjective) thoughts are about the objective fact of that world as represented by our claims (facts) which, presumably, ought be objective.

(Wright (1992) quoted at the SEP offers a nice statement of this way of thinking about realism.) This sort of realism together with representationalism is rampant in the high tech industry.  Nonetheless, these theses are seen to imply that our claims (facts) are objectively true or false, depending on the state of affairs actually expressing or unfolding in the world.

Regardless of one’s perspective, metaphysics or ideals, the world that we represent in our thoughts or language is a socially objective world. (This form of realism may be restricted to some social or human subject-matter, or range of discourse, but for simplicity, we will talk only about its global form as related to realism above.)

The coherence theory of truth is not much different than the correspondence theory in respect to this context. Put simply, in the coherence theory of truth: a belief is true when we are able to incorporate it in an orderly and logical manner into a larger and presumably more complex web or system (sic) of beliefs.

In the spirit of American pragmatism, almost every political administration since Reagan has used the coherence theory of truth to guide national strategy, foreign policy and international affairs. The selling of the War in Iraq to the American people is a study in the application of the coherence theory of truth to America’s state of affairs as a hegemonic leader in the world.

Many of the philosophers who argue in defense of the coherence theory of truth have understood “Ultimate Truth” as the whole of reality. To Spinoza, ultimate truth is the ultimate reality of a rationally ordered system that is God. To Hegel, truth is a rationally integrated system in which everything is contained. To the American Bush dynasty, in particular to W., truth is what the leaders of their new world order say that it is.  To Adi, containment is only one of the elementary processes at work creating, enacting (causing) and (re)enacting reality.

Modern scientists break the first rule of their own skepticism by being absolutely certain of information theory.

Let me be more specific.  Modern researchers have settled on a logical definition of truth as a semantic correspondence by adopting Shannon’s communications theory as “information” theory. Those object-oriented computer programmers who use logic and mathematics understand truth as a Boolean table, and correspondence as per Alfred Tarski’s semantic theory of truth.

Modern computer engineers have adopted Shannon’s probabilities as “information theory” even though, on the face of it, the probabilities that form such an important part of Shannon’s theory are very different from messages, which stand for the kinds of things we most normally associate with objects. However, to his credit, the probabilities on which Shannon based his theory were all based on objective counting of relative frequencies of definite outcomes.

Shannon’s predecessor, Samuel Morse, based his communication theory, which enhanced the speed and efficiency with which messages could be transmitted, on studying frequently used letters. It is the communications theory I learned while serving in the United States Army. It was established by counting things, objects in the world: the numbers of different letter types in the printer’s box.
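
To make that counting concrete, here is a minimal Python sketch of my own; it is an illustration, not Morse’s or Shannon’s actual procedure, and the sample string is hypothetical. It tallies relative letter frequencies in a text and computes the Shannon entropy of the resulting distribution:

```python
from collections import Counter
from math import log2

def letter_frequencies(text):
    """Relative frequencies of letters, much as a printer counts type in a box."""
    letters = [c.lower() for c in text if c.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}

def shannon_entropy(freqs):
    """H = -sum(p * log2 p), in bits per letter, for the observed distribution."""
    return -sum(p * log2(p) for p in freqs.values() if p > 0)

sample = "what hath god wrought"  # hypothetical sample message
freqs = letter_frequencies(sample)
print(sorted(freqs.items(), key=lambda kv: -kv[1])[:5])
print(round(shannon_entropy(freqs), 3), "bits per letter")
```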

When I entered the computer industry in 1978, I was somewhat astonished that Shannon’s theory of communications was already established in the field of information science — before word processors and “word” processing were common. I confirmed that belief by joining with information scientists for a while, as a member of the American Society for Information Science (ASIS).

While at ASIS, I found out that Shannon’s probabilities also have an origin in things much like Morse code: although they in no way ought be considered to be symbols that stand for things. Instead, Shannon’s probabilities stand for proportions of things in a given environment.

This is just as true of observationally determined quantum probabilities (from which Shannon borrowed on the advice of the polymath John Von Neumann) as it is for the frequencies of words in typical English, or the numbers of different trees in a forest, or the countable blades of grass on my southern lawn.

Neither Morse Code, nor Shannon’s communications theory, nor any “information” theory, directly addresses the “truth” of things in or out of an environment, save Adi’s. The closest any computer theory or program gets to “interpretation” is by interpreting the logical correspondence of statements in respect to other statements — both with respect to an undefined or unknown “meaning” — the truth or significance or unfolding of the thing in the world. It takes two uncertainties to make up one certainty according to Shannon and Von Neumann, who had two bits of uncertainty, 1 and 0, searching for, or construing, a unity.

That is not us. That is not our scientific program. Our program was not to construe a unity, or “it” from “bit.”  That is the program of the industry, because, almost like clocks, everyone in industry marches in lock step by step, tick by tock, take-stock.

Adi began with the assumption that there is an overarching unity to “it.” He then studied how a distribution of signs of “it” (i.e., symbols that make up human languages describing “it”) manages to remain true to the unity of “it,” despite constant change. Such change, it can be argued, arrives in the guise or form of uneven or unequal adoption, selection, and retention factors, as seen in the overwhelming evidence of a continuous “morphogenesis” in the formation, change and meaning of words, facts and other things, over eons.

To determine how people interpret the intensity and sensibility or “information” projected with language by means of speech acts (with messages, composed of words), Adi investigated the sounds of symbols used to compose a real human language at a time when most people were inventing artificial, specialized, logical and less general languages.  Adi chose to study the unambiguous sounds of Classical Arabic that have remained unchanged for 1400 years to the present day.  That sound affects what we see is in no way incidental trivia or minutiae.

At the least, it helps truth break free of being bound to mere correspondence, a relegation reminiscent of mime or mimicry. Adi’s findings liberate truth to soar to heights more amenable, such as high fidelity, than those that burn out in the heated brilliance of spectacular failure.  In fact, in early implementations of our software we had an overt relevance measure called “fidelity” that users could set and adjust.  It speaks to the core of equilibrium that permeates this approach to conceptual modelling, analysis, searching for relevance and significance, subject and topic classification and practical forms of text analytics in general.

Tom Adi’s semantic theory interprets the intensity, gradient trajectory and causal sensibility of an idea presumably communicated as information in the speech acts of people. This “measure” of Adi’s (or we may call it “Adi’s Measure”) can be understood as a measure of the increase in the magnitude (intensity) of a property of psychological intension (e.g., a temperature or pressure change, or a change in concentration) observable in passing from one point or moment to another. Thus, while invisible, it can be perceived as the rate of such a change.

In my view, it is in the action of amplitude, signifying a change from conceptual, cognitive or imaginative will or possibility, to implementation or actualization in terminal reality. Computationally, it can be used as a vector formed by the operator ∇ acting on a scalar function at a given point in a scalar field. It has been implemented in an algorithm as an operating principle, resonating — acting/reacting (revolving, evolving) as a rule, i.e., being an operator: conditioning (coordinating and re-coordinating) a larger metric system or modelling mechanism (e.g., Readware, and text analytics in general).
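
For readers who want the textbook notation behind that remark, the gradient of a scalar field f at a point is simply the vector of its partial derivatives; this is standard vector calculus, not a statement of Adi’s own formula:

```latex
\nabla f(x, y, z) = \left( \frac{\partial f}{\partial x},\ \frac{\partial f}{\partial y},\ \frac{\partial f}{\partial z} \right)
```

The vector points in the direction of steepest increase of f, and its magnitude gives the rate of that increase, which is the sense of “rate of change” invoked above.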

I mention this to contrast Adi’s work with that of Shannon, who, in order to frame information according to his theory of communications, did a thorough statistical analysis of ONLY the English language. After that analysis, Shannon defined information as entropy or uncertainty on the advice of Von Neumann.  The communication of information (an outcome) involves things which Shannon called messages, and probabilities for those things. Both elements were represented abstractly by Shannon: the things as symbols (binary numbers) and probability simply as a decimal number.

So you see, Shannon’s information represents abstract values based on a statistical study of English. Adi’s information, on the other hand, represents sensible and elementary natural processes that are selected, adopted and retained for particular use within conventional language, as a mediating agency, in an interpersonal or social act of communications. Adi’s information is based upon a diachronic study of the Arabic language and a confirming study in fourteen additional languages, including modern English, German, French, Japanese and Russian, all having suffered the indisputable and undeniable effects of language change — both different from and independent of the evolution of language, or the non-evolution, as it were, of Classical Arabic.

Adi’s theory is a wholly different treatment of language, meaning and information than either Shannon or Morse attempted or carried out on their own merits. It is also a different treatment of language than information statistics gives, as it represents the generation of salient and indispensable rules in something said or projected using language. It is different from NLP, or Natural Language Processing, which depends (heavily) on the ideas of uncertainty and probability.

A “concept search,” in Adi’s calculation and my estimation, is not a search in the traditional sense of matching keys in a long tail of key information.  A “concept search” seeks mathematical fidelity, resonance or equilibrium and symmetry (e.g., invariance under transformation) between a problem (query for information) and possible solutions (i.e., “responses” to the request for information) in a stated frame or window (context) on a given information space (document stack, database).  A search is conducted by moving the window (e.g., the periscope) over the entirety of the information space in a scanning or probing motion.  While it ought be obvious, we had to “prove” that this approach works, which we did in outstanding form, in NIST- and DARPA-reviewed evaluations.
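
The Readware algorithms themselves are not published here, so the following Python sketch is only a generic stand-in of my own: it slides a window over a tokenized document and scores each position against the query with cosine similarity, a crude proxy for the “fidelity” or “resonance” described above. The function and variable names are hypothetical.

```python
from collections import Counter
from math import sqrt

def vectorize(words):
    """Term-frequency vector for a bag of words."""
    return Counter(w.lower() for w in words)

def cosine(a, b):
    """Cosine similarity: a crude stand-in for fidelity between query and window."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def scan(query, document, window=30, step=10):
    """Move the window over the whole information space and score each position."""
    q = vectorize(query.split())
    tokens = document.split()
    scores = []
    for start in range(0, max(1, len(tokens) - window + 1), step):
        w = vectorize(tokens[start:start + window])
        scores.append((cosine(q, w), start))
    return sorted(scores, reverse=True)  # best-matching window positions first
```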

Adi’s theory is not entirely free of uncertainty as it is, after all, only theoretical. But it brings a new functionality, a doctrinal functionality, to the pursuit of certainty by way of a corresponding reduction of doubt. That is really good news. In any case, this is a theory that deserves and warrants consideration as a modern information theory that stands in stark contrast to the accepted norm or status quo.

Read Full Post »

A key is a fundamental or central operative of harmony. The connexion of relevance is recognized concordantly.

A quick read of popular technology news and review sites gives one the impression that the trouble people have with search engines (those called semantic search engines and all other search engines too) is the relevance of the results. This, of course, is beside any trouble people have with the actions of the companies fielding the search technology, e.g. corporate entities such as Google, Yahoo, Microsoft, Powerset, Hakia and Cognition, among about a hundred others.

The problem I want to address is the problem with the relevance of the results, because even with the new crop of semantic search engines using sophisticated Natural Language Processing (NLP) technologies, the problem remains. In fact, NLP technology hardly addresses ambiguity, let alone, relevance. There is a good reason for this that I can now summarize, after dealing with it for more than twenty-five years. The problem actually stems from the abstract ideas about relevance.

By ideas, I just mean people’s thoughts and deliberations over exactly what is relevant. For corporations, the answer is very obvious: what is relevant is nothing but that substance which increases corporate equities. That substance may be abstract to some, but it is very real to corporate shareholders and the business managers they employ. It makes perfect sense that money and wealth, and any investment of the same in assets that generate that substance, are what is relevant to business. Any assets that do not perform or have little or no uptake in that way are dumped. This is why industry, and the corporations and businesses thereof, have overlooked the keys to relevance and have instead held steadfast to their own values of relevance to their own institutions.

However, a quick scan of current events shows that this substance they value so highly is not very real. It can evaporate and disappear before your very eyes. This is because money has a closer connection to fuel than it does to the fundamental keys of relevance. In any case, that shows that such substances (money, wealth, etc.) are not fundamental keys to relevance. The same thing is true of the objects of Natural Language Processing technologies: they are not the substance or keys to relevance; some of the objects they use are the signs of such substance: the semiotic keys.

A word is such a key: a semiotic key, not a harmonic key. A harmonic key is the substance of relevance and meaning of which the word is a semiotic key symbol. A symbol is a sign of something invisible or abstract. A word is a sign of abstract harmonic keys that are the substance and essence of relevance to judgment. As I pointed out in my previous post, these keys are marked by sounds, phonemes, letters.

Harmonic keys would behave just as they sound, and they would not leave anyone reeling in discord and conflicted in such a way as money markets are conflicted today. This is because the keys to relevance are concordant to the essence of relevance in one’s own mind. By that I mean, what is relevant to one’s own judgment. Just what is that?

Those are, of course, the premises required by the rational mind. For the premises of an argument for your judgment are often left unstated or hidden, left for you to figure out, left abstract.

Would anyone like to know more? For example, would you like to know the premise of fear? Why do people feel fearful about the economy today? The premises for this are factual. The signs appear in the outlook or horizon and in the word fear: the semiotic symbol itself. If you know this premise, you already know the reason for your fearfulness. It is like asking what is the fundamental quantity of fearfulness? The fundamental quantity of a physical substance is mass, length or time. What is the fundamental quantity of a conceptual substance like fear? Wouldn’t you want to know this measure?

Leave a comment if you would.

Read Full Post »

Search. I suppose there is no denying that the word “search” ascended to significance in the consciousness of more people since the birth of Information Science than perhaps at any other time in history. This supposition is supported by a recent Pew Foundation internet study stating that:

The percentage of internet users who use search engines on a typical day has been steadily rising from about one-third of all users in 2002, to a new high of just under one-half (49%).

While it may not be obvious, it becomes apparent on closer examination of the phenomenon that the spread and growth in the numbers of words and texts and more formal forms of knowledge, along with the modern development of search technology, had a lot to do with that.

Since people adopted the technology of writing systems, civilizations and societies have flourished. Human knowledge and culture, and technological achievement, have blossomed. No doubt.

Since computers, and ways of linking them over the internet, came along, the numbers of words and the numbers of writers have increased substantially. It was inevitable that search technology would be needed to search through all those words from all those writers. That is what Vannevar Bush was telling his contemporaries in 1945 when he said the perfection of new instruments “call for a new relationship between thinking man and the sum of our knowledge.”

But somewhere along the line things went wrong; some things went very, very wrong. Previous knowledge and the sum of human experience was swept aside. Search technology became superficial, and consequently, writing with words is not considered as any kind of technology at all. That superficiality violates the integrity of the meaning of search, and the classification of words merely as arbitrary strings is also wrong, in my view.

Some scientists I know would argue that the invention of writing is right up there at the top of human technological achievement. I guess we just take that for granted these days, and I am nearly certain that the scientists who were embarking on the new field of information technology in the 1940s and 1950s were not thinking of writing with words as the world’s first interpersonal memory, the original technology of the human mind and its thoughts and interactions.

Most information scientists have not yet fully appreciated words as technical expressions of human experience but treat them as labels instead. By technical, I mean of or relating to the practical knowledge and techniques (of being an experienced human).

Very early in the development of search technology, information scientists and engineers worked out assumptions that continue to influence the outcome, that is, how search technology is produced and applied today. The first time I wrote about this was in 1991 in the proceedings of the Annual Meeting of the American Society of Information Science. There is a copy in most libraries if anyone is interested.

And here we are in 2008, in what some call a state of frenzy and others might call disinformed and confused– looking at the prospects of the Semantic Web. I will get to all that in this post. I will divide this piece into the topics of the passion for search technology, the state of confusion about search technology, and the semantics of search technology.

The term disinformed is my word for characterizing how people are under-served if not totally misled by search engines. A more encompassing view of this sentiment was expressed by Nick Carr in an article appearing in the latest issue of the Atlantic Monthly where he asks: Is Google making us stupid?

I am going to start off with the passion of search.

Writing about the on-line search experience in general, Carmen-Maria Hetrea of Britannica wrote:

… the computing power of statistical text analysis, pattern-matching, and stopwords has distracted many from focusing on (should I say remembering?) what actually makes the world tick. There are benefits and dangers in a world where the information that is served to the masses is reduced to simple character strings, pattern matches, co-location, word frequency, popularity based on interlinking, etc.

( … ) It has been sold to us as “the trend” or as “the way of the future” to be pursued without question or compromise

That sentiment pretty much echoes what I wrote in my last post. You see, computing power was substituted for explanatory power, and the superficiality of computational search was given credibility because it was needed to scale to the size of the world wide web.

This is how “good enough” became state of the art. Because search has become such a lucrative business and “good enough” has become the status quo, it has also become nearly impossible for “better” search technology to be recognized, unless it is adopted and backed by one of the market leaders such as Google or Microsoft or Yahoo.

I have argued in dozens of forums and for more than twenty years that search technology has to address the broader logic of inquiry and the use of language in the pursuit of knowledge, learning and enhancing the human experience. It has to accommodate established indexing and search techniques and it has to have explanatory power to fulfill the search role.

Most who know me know that I am not showing up at this party empty-handed. I have software that does all that, and while my small corporate concern is no market or search engine giant, my passion for search technology is not unique.

In her Britannica Blog post about search and online findability, Carmen-Maria Hetrea summed up her passion for search:

Some of us dared to differ by returning to the pursuit of search as something absolutely basic to the foundations of our human existence: the simple word in all of its complexity — in its semantics and in its findability and its futuristic promise.

You have to ask yourself what you are really searching for before you can find that it is not for keywords or patterns at all. Out in the real world almost everyone is searching for happiness. Some are also searching for truth or relevance. And many search for knowledge and to learn. If your searching doesn’t involve such notions, maybe you don’t mind the tedium of thorough, e.g., exegetical, searching. Or maybe you are someone who doesn’t search at all, but depends on others for information.

How is the search for happiness addressed by online search technology? Should it be a requirement of search technology to find truth or relevance? Should a search be thorough or superficial? Is it about computing power or explanatory power? I am going to try and address each of these questions below as I wade through the causes of confusion, expose the roots of my passion and maybe shed some light on search technology and its applications.

Some people have said that in the online world you have both the transactional search and the research search, which are not the same. They imply that these search objectives require different instruments or plumbing. I don’t think so. I think it is just a crutch vendors use to justify superficial search. Let’s look at an example transactional search, say, searching for a new car. There are so many places where you can carry out that transaction that being thorough and complete is not an issue. Here is a search vendor quiz:

Happiness is a ___________ search experience.

Besides searching for objects of information that we know but don’t have at hand, in cyberspace and on the web, we might search for a pizza place in a new destination. Many search for cheap air fares or computer or car parts, or deals on eBay, while others search for news, music, pictures and many other types of media and information. A few others search for knowledge and for explanation. Happiness in the universe of online search is definitely a satisfying search experience irrespective of what you are searching for.

Relevance is paramount to happiness and satisfaction whether searching for pizza in close proximity or doing research with online resources. Search vendors are delivering hit lists from their search engines, where users are expecting relevance and to be happy with the results. Satisfaction, in this sense, has turned out to be a tall order and nonetheless a necessary benefit of search technology that people still yearn for.

Let’s now turn to the state of confusion.

Carmen-Maria mentions that new search technology has to be backward compatible and she also complains that bad search technology is like the wheel that just keeps getting reinvented:

The wheel is being reinvented in a deplorable manner since search technology is deceptive in its manifestation. It appears simple from the outside, just a query and a hitlist, but that’s just the tip of the iceberg. In its execution, good search is quite complex, almost like rocket science.

… The wealth of knowledge gained by experts in various fields – from linguists to classifiers and catalogers, to indexers and information scientists – has been virtually swept off the radar screen in the algorithm-driven search frenzy.

The wheel is certainly being re-invented; that’s part of the business. I am uncertain what Carmen-Maria means by algorithm-driven search frenzy. Algorithms are the stuff of search technology. I believe that some of the problems with search stem from the use of algorithms that are made fast by being superficial, by cutting corners and by other artificial means. The cutting of corners begins with the statistical indexing algorithms or pre-coordination of text– so retrieval is consequently hobbled by weaknesses in the indexing algorithms. But algorithms are not the cause of the problem.

Old and incorrect assumptions are the real problem.

Modern state-of-the-art search technology (algorithms) invented in the 1960s and 1970s strips text of its dependence on human experience under something information science (IS) professionals justify as the independence assumption. Information retrieval (IR) professionals, those who design software methods for search engine algorithms, are driven by the independence assumption to treat each text as a bag of words without connection to other words in other texts or other words in the human consciousness.
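
To see what that assumption looks like in practice, here is a tiny Python sketch of my own (the documents and weighting are deliberately naive): each text is reduced to an unordered bag of words, and a query is scored by summing per-term counts, with no account taken of how terms relate to one another or to other texts.

```python
from collections import Counter

def bag_of_words(text):
    """Each text becomes an unordered multiset of tokens; order and context are discarded."""
    return Counter(text.lower().split())

def score(query, bag):
    """Each query term contributes independently; no term-to-term relations are modeled."""
    return sum(bag[t] for t in query.lower().split())

docs = {"d1": "the meaning of life", "d2": "life insurance quotes online"}  # hypothetical texts
bags = {doc_id: bag_of_words(text) for doc_id, text in docs.items()}
print(sorted(bags, key=lambda d: score("meaning of life", bags[d]), reverse=True))
```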

I don’t think Rich Skrenta was thinking about this assumption when he wrote:

… – the idea that the current state-of-the-art in search is what we’ll all be using, essentially unchanged, in 5 or 10 years, is absurd to me.

Odds are that he intends to sweep a lot of knowledge out of the garage too, and I would place the same odds that any “new” algorithm Rich brings to the table will implicitly apply that old independence assumption too.

So this illustrates a kind of tug of war between modern experts in search technology and the knowledge of ages of experience. There is also a kind of frenzy or storm over so-called “new” technologies and just what constitutes “semantic” search technology. While some old natural language processing (NLP) technology has debuted on the online search scene, it has not brought any new search algorithms to light. It has only muddied the waters, in my opinion. I have written about this in previous posts.

The underlying current is stirred up by the imbalance between the (significant) history of search technology and the nascence of online search and other modern applications of search technology. Add to that disturbance the dichotomy exacerbated by good (satisfying) and bad (deceptive) search results, multiplied by the number of search engine vendors, monopolistic or otherwise, and you have the conditions where compounding frenzy, absurdity and confusion, rather than relevance, reigns supreme.

I like to think my own view transcends this storm and sets an important development principle that I established when I produced the first concept search technology back in 1987. The subjects of the search may be different but the freedom to search for objects, for answers, or for theories or explanations of unknown phenomena is the right of inquiry.

This right of intellectual inquiry is as important and as basic as the freedom of speech. This is what ignites my passion for search technology. And I cannot stand to have my right of inquiry blocked, limited, biased, restricted, arrested or constrained, whether by others, or by unwarranted procedure (algorithm) or formality, or by mechanical devices.

I wear my passion on my sleeve and it frequently manifests as a rant against the “IT” leaders or so-called experts that Carmen-Maria wrote about:

Many consider themselves experts in this arena and think that information retrieval is this new thing that is being invented and that is being created from scratch. The debate often revolves around casual observations, remarks, and opinions come mostly from an “IT” perspective.

To be fair, not all those with “IT” perspectives are down with all this “new thing” in online search engines. Over at the Beyond Search blog, Stephen Arnold wrote about the problem with the thinking about search technology:

… fancy technology is neither new nor fancy. Google has some rocket science in its bakery. The flour and the yeast date from 1993. Most of the zippy “new” search systems are built on “algorithms”. Some of Autonomy reaches back to the 18th century. Other companies just recycle functions that appear in books of algorithms. What makes something “new” is putting pieces together in a delightful way. Fresh, yes. New, no.

I also think Stephen understands the history of search technology pretty well. He demonstrates this when he writes:

Software lags algorithms and hardware. With fast and cheap processors, some “old” algorithms can be used in the types of systems Ms. Hane identifies; for example, Hakia, Powerset, etc. Google is not inventing “new” things; Google is cleverly assembling bits and pieces that are often well known to college juniors taking a third year math class.

Like Carmen-Maria Hetrea, Stephen Arnold sounds biased against algorithms, “old” algorithms in particular, though I don’t think he intended any bias, as many of the best algorithms we have are “old”. There are really not many “new” algorithms. Augmented, yes. Modified, yes. New, no.

To be involved in IT and biased against algorithms is absurd as long as technology is the application of the scientific method and scientific search methods are understood as collections of investigative steps systematically combined into useful search procedures or algorithms. So there you have my definition of search technology.

The algorithms for most search technology are not rocket science and can be boiled down to simple procedures. At the very least there is an indexing algorithm and a search algorithm; a minimal code sketch follows the two procedures below:

Pre-coordination per-text/document/record/field procedure:

  1. Computerize an original text by reading the entire text or chunks of it into computer memory.
  2. Parse the text into the smallest retrievable atomic components (usually patterns (trigrams, sentences, POS, noun-phrases, etc.) or keywords or a bag (alphabetical list) of infrequent words).
  3. Store the original text with a unique key and store the parsing results as alternate keys in an index.
  4. Repeat for each new text added to a database or collection.

Post-coordination per-query procedure:

  1. Read a string from input, parse the query into keys in the same way as a text.
  2. Search the index to the selected collection or database with the keys.
  3. Assemble (sort, rank) key hits into lists and display.
  4. Choose hit to effect retrieval of the original text.
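
Here is a minimal Python sketch of the two procedures above, using a plain in-memory inverted index. Real products add stemming, stopword lists, ranking formulas and persistent storage; the parsing here is deliberately the simplest possible, and the names are my own.

```python
import re
from collections import Counter

index = {}  # term -> set of document keys (the alternate keys)
store = {}  # document key -> original text

def pre_coordinate(doc_key, text):
    """Indexing steps 1-4: read the text, parse it into atomic keys, store text and index entries."""
    store[doc_key] = text
    for term in set(re.findall(r"[a-z0-9]+", text.lower())):
        index.setdefault(term, set()).add(doc_key)

def post_coordinate(query):
    """Query steps 1-4: parse the query the same way, search the index, assemble and rank hits."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    hits = Counter()
    for term in terms:
        for doc_key in index.get(term, ()):
            hits[doc_key] += 1
    return [(doc_key, store[doc_key]) for doc_key, _ in hits.most_common()]
```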

These basic algorithms are fulfilled differently by different vendors but vendors do not generally bring new algorithms to the table. They bring their methods of fulfilling these algorithms; they may modify or augment regular methods employed in steps 2 and 3 of these procedures as Google does with link analysis.

In addition, vendors fold search technology into a search engine. Most online search engines (those integrated “software systems” or search appliances that process text, data and user queries) are composed of the following components; a rough sketch of how they fit together follows the list:

  1. A crawler for crawling URI’s or files on disk or both.
  2. An indexer that takes input from the crawler and recognizes key patterns or words.
  3. A database to store crawler results and key indexing (parsing) results.
  4. A query language (usually SQL, Keyword-Boolean) to use the index and access keys in the database.
  5. An internet server and/or graphical user interface (GUI) components for getting queries from, and presenting results to, users.
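
A rough sketch of how those five components hang together; the class and method names are my own placeholders, not any vendor’s API, and the crawler and server pieces are stubbed out.

```python
class Crawler:
    """Component 1: yields (key, text) pairs from URIs or files; stubbed here."""
    def fetch(self, sources):
        for key, text in sources:  # a real crawler would download and extract text
            yield key, text

class Indexer:
    """Components 2 and 3: recognizes key words and stores them with the crawled text."""
    def __init__(self):
        self.index, self.store = {}, {}  # an in-memory stand-in for the database

    def add(self, key, text):
        self.store[key] = text
        for term in set(text.lower().split()):
            self.index.setdefault(term, set()).add(key)

    def query(self, q):
        """Component 4: a bare keyword-Boolean (AND) query against the index."""
        sets = [self.index.get(t, set()) for t in q.lower().split()]
        return set.intersection(*sets) if sets else set()

# Component 5, the server or GUI, is omitted; a command-line loop could stand in for it.
```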

Most search engine wizards, as they are called, are working on one or more of these software components of online search engines. You can look at what a representative sample of these so-called wizards have to say about most of these components at the ArnoldIT blog here. If you read through the articles, you won’t find one of them (and I have not read them all) that is working on new indexing methods or new mapping algorithms for mapping the meaning of the query to the universe of text, for example.

Many of the “new search engines,” popping up everywhere, are not rightly called new search technology even though they frequently bear the moniker. They are more rightly named new applications of search technology. But even vendors are confused and confusing about this. Let’s see what Riza Berkin of Hakia is saying in his most recent article where he writes:

But let’s not blind ourselves by the narrowness of algorithmic advances. If we look closely, the last decade has produced specialist search engines in health, law, finance, travel etc. More than that, search engines in different countries started to take over (like Naver, Baidu, Yandex, ect.)…

He had been writing that Search 1.0 began with Alta Vista (circa 1996), Search 2.0 is Google-like and Search 3.0 is semantic search “where the search algorithms will understand the query and text”. I guess all those search engines from Fulcrum, Lexis-Nexis, OpenText, Thunderstone, Verity, Westlaw, and search products from AskSam to Readware ConSearch to ZyIndex, were Search 0.0, or at least P.B. …. You know, like B.C., but Pre-Berkin.

And so this last paragraph (above) makes me think he is confusing search applications with search technology. His so-called specialist search engines are applications of search technology to the field or domain of law, to the field or domain of health, and so on.

Then he confuses me even more, when he writes about “conversational search”:

Make no mistake about it, a conversational search engine is not an avatar, although avatars represent the idea to some extent. Imagine virtual persons on the Web providing search assistance in chat rooms and on messengers in a humanly, conversational tone. Imagine more advanced forms of it combined with speech recognition systems, and finding yourself talking to a machine on the phone and actually enjoying the conversation! That is Search 2.0 to me.

Now I can sympathize with Riza because I used the phrase “conversational search” to describe the kind of conceptual search engine I was designing in 1986. I am not confused about that. I am confused that he calls that Search 2.0 when, earlier, statistically augmenting the inverted index was described as Search 2.0.

He doesn’t stop there. He continues describing Search 3.0, which “will be the ‘Thinking Search’ where search systems will start to solve problems by inferencing.” Earlier he wrote that semantic search was Search 3.0. Semantics requires inferencing, so I began to reckon maybe thinking and semantics are equal in his mind, until he writes: “I do not fool myself with the idea that I will see that happening in my life time” — so now I am confused again. I think it is what vendors want; they want the public to remain confused about the semantics of search and what you get with it.

And that brings me to the semantics of search.

There are only two words that matter here: thorough and explanatory.

When I started tinkering with text processing, search and retrieval software in the early 1980s, I was captivated by the promise of searching and reading texts on computers. The very first thing that I noticed about the semantics of search, before my imagination became involved in configuring computational search technology, was thoroughness. The word /search/ implies thoroughness if not completeness in its definition. Thoroughness is a part of the definition of search. Look at the definition of search for yourself.

You need only look at one or two hit lists from major search engines and you can see that is not what we get from commercial search engines, or from most search technology. Search is not a process that is completed by delivering some hints of where to look, but that is what it has been fashioned into by the technological leaders in the field. Millions of people have accepted it.

Yet, in our hearts we know that search must be complete and it must be explanatory to be satisfying; we must learn from it, and we expect to learn from conducting a search. Whether we are learning the address of the nearest pizza place or learning how to install solar heating, it is not about computational power, it is about explanatory power. Search vendors forgot that words are part of the technique of communicating interpersonal meaning; let’s hope they don’t forget that words have explanatory power too.

Tell me what you think.

Read Full Post »

Peter Mika recently wrote an article about the semantic web and NLP-style semantic search. I should just ignore his claim that there are only two roads to semantic search because he is plainly mistaken on that count. As Peter works for Yahoo, he was mainly discussing data processing with RDF and Yahoo’s Search Monkey. He obviously knows that subject well.

He constructed an example of how to use representational data (such as an address) according to semantic web standards and how to integrate the RDF triples with search results. His claim is that one cannot do “semantics” without some data manipulation, and for that the data must be encoded with metadata, essentially data about the data. In this case, the metadata is what is needed to pick out and show the data at the keyword: address.
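
For concreteness, here is a small sketch of the kind of triple Peter’s example turns on, written with Python’s rdflib library. The resource URI and the vCard-style properties are illustrative placeholders of my own, not Yahoo’s actual SearchMonkey markup.

```python
from rdflib import Graph, Literal, Namespace, URIRef

VCARD = Namespace("http://www.w3.org/2006/vcard/ns#")

g = Graph()
pizza_place = URIRef("http://example.org/marios-pizza")  # hypothetical resource
g.add((pizza_place, VCARD["street-address"], Literal("123 Main St")))  # the address as metadata
g.add((pizza_place, VCARD["locality"], Literal("Springfield")))

print(g.serialize(format="turtle"))  # the triples a search result page could consume
```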

At the end of his article, Peter talks about the way forward and, in particular, about the need for fostering agreements around vocabularies. I suppose that he means to normalize the relationships between words by having publishers direct how words are to be used. He calls this a social process while calling on the community of publishers to play their role. Interesting.

About the time Peter was beginning his PhD candidacy, industry luminary John Sowa wrote in Ontology, Metadata and Semiotics that:

Ontologies contain categories, lexicons contain word senses, terminologies contain terms, directories contain addresses, catalogs contain part numbers, and databases contain numbers, character strings, and BLOBs (Binary Large OBjects). All these lists, hierarchies, and networks are tightly interconnected collections of signs. But the primary connections are not in the bits and bytes that encode the signs, but in the minds of the people who interpret them.

This is the case in the trivial example offered by Peter. The reason one is motivated to list an address in the search result of a search for pizza is that it is relevant to people who are searching for a pizza place close to them. In his paper, John Sowa writes:

The goal of various metadata proposals is to make those mental connections explicit by tagging the data with more signs.

This is the essential nature of the use case and proposal offered by Yahoo with SearchMonkey. It seems a good idea, doesn’t it? Yahoo is giving developers the means to tag such data with more signs. Besides, it has people using Yahoo’s index, exposing Yahoo’s advertisers. Sowa cautions that:

The ultimate source of meaning is the physical world and the agents who use signs to represent entities in the world and their intentions concerning them.

Which resources do investigators or developers use to learn about agents and their intentions when using signs? The resource most developers turn to is language and they begin by defining the words of language in each context in which they appear.

Peter says it is common for IR systems to focus on words or grams and syntax. While some officials may object, NLP systems such as Powerset, Hakia and Cognition use dictionaries and “knowledge bases” to obtain sense data, yet they each focus mainly on sentence syntax and (perhaps with the exception of Powerset) use keyword indexes for retrieval just like traditional IR systems.

Hakia gets keyword search results from Yahoo as a matter of fact. All of these folks treat words, and even sentences, as the smallest units of meaning of a text. Perhaps these are the most noticeable elements of a language that are capable of conveying a distinction in meaning though they certainly are not the only ones. There are other signs of meaning obtainable from textual discourse.

Believe it or not, the signs people use most regularly are known as phonemes. They are the least salient because we use them so often, and frequently they are also largely used subconsciously. Yet, we have found that these particular sounds are instantiations, or concrete signs, of the smallest elements of abstract thought– distinctive elements of meaning that are sewn and strung together to produce words and form sentences. When they take form in a written text they are also called morphemes.

Some folks may not remember that they learned to read words and texts by stringing phonemes together, sounding them out to evoke, apprehend and aggregate their abstract meanings. I mention this because if a more natural or organic semantic model were standardized, the text on the world wide web could become more tractable and internet use might become more efficient.

This would happen because we could rid ourselves of the clutter of so many levels of metalevel signs and the necessity of controlled vocabularies for parsing web pages, blogs and many kinds of unstructured texts. An unstructured text is any free flowing textual discourse that cannot easily be organized in the field or record structure of a database. Neither is it advantageous to annotate the entirety of unstructured text with metalevel signs. Because as John Sowa wrote:

Those metalevel signs themselves have further interconnections, which can be tagged with metametalevel signs. But meaningless data cannot acquire meaning by being tagged with meaningless metadata.

So now it raises the question of whether or not words and their definitions are just meaningless signs to begin with. The common view of words, as signs, is that they are arbitrarily assigned to objects. I am unsure whether linguists could reach consensus that the sounds of words evoke meaning, as it seems many believe that a horse could have been called an egg without any consequence to its meaning or use in a conversation.

Within the computer industry it becomes even more black and white: a word is used to reference objects by way of general agreement or convention, where the objects are things and entities existing in the world. Some linguists and most philosophers recognize abstract objects as existing in the world as well. Though this has not changed the conventional view, which is a kind of de facto standard among search software vendors today.

This view implies that the meaning of a word or phrase, its interpretation, adheres only to conventional and explicit agreements on definitions. The trouble is that it overlooks or ignores the fact that meaning is independently processed and generated (implicitly) in each individual agent’s mind. This is generally very little trouble if the context is narrow and well defined, as in most database and trivial semantic web applications on the scene now.

The problems begin to multiply exponentially when the computer application is purported to be a broker of information (like any search engine) where there is a verbal interchange of typically human ideas in query and text form. This is partly why there is confusion about meaning and about search engine relevance. Relevance is explicit inasmuch as you know it when you see it; otherwise, relevance is an implicit matter.

Implicit are the dynamic processes by which information is recognized, organized, acted on, used, changed, etc. The implicit processes in cognitive space are those required to recognize, store and recall information. Normally functioning, rational, though implicit and abstract, thought processes organize information so that we may begin to understand it.

It is obvious that there are several methods and techniques of organizing, storing and retrieving information in cyberspace as well. While there are IR processes running both in cyberspace and in cognitive space, it is not the same abstract space and the processes are not at all the same. In cyberspace and in particular in the semantic web, only certain forms of logical deduction have been implemented.

Cognitive processes for organizing information induce the harmonious and coherent integration of perceptions and knowledge with experience, desires, the physical self, and so on. Computational processes typically organize data by adding structure that arranges the information in desired patterns.

Neither the semantic web standards, nor microformats, nor NLP, seek the harmony or coherence of knowledge. Oh, yes, they talk about knowledge and about semantics, yet what they deliver is little more than directives, suitable only for data manipulation in well-understood and isolated contexts.

Neither NLP nor semantic web meta data or tools presently have sufficient faculty for abstracting the knowledge that dynamically integrates sense data or external information with the conditions of human experience. The so-called semantic search officials start with names and addresses because these data have conventionally assigned roles that are rather regular.

When it comes down to it, not many words have such regular and conventional interpretations. It would actually be quite alright if we were just talking about a simple database application, but proponents of the semantic web want to incorporate everything into one giant database and controlled vocabulary. Impossible!

While it appears not to be recognized, it should be apparent that adherence to convention is a necessary yet insufficient condition to hold relevant meaning. An interpretation must cohere with its representation and its existence (as an entity or agent in the world) in order to hold. Consider the case of Iraq and weapons of mass destruction. Adhere, cohere, what’s the difference –it’s just semantics– right? Nonetheless, neither harmony nor coherence can be achieved by directive.

A consequence of the conventional view is that such fully and clearly defined directives leave no room for interpretation, even though some strive for underspecification. The concepts and ideas being represented cannot be questioned because, being explicit directives, they go without question. This is why I believe the common view of words and meaning that many linguists, computer and information experts, like Peter, hold, is mistaken.

If the conventional view were correct, the interpretation of words would neither generate meaning nor provide grounds for creating new concepts and ideas. If it were truly the case, as my friend Tom Adi said, natural language semantics would degenerate into taking an inventory of people’s choices regarding the use of vocabulary.

So, I do not subscribe to the common view. And these are the reasons that I debate semantic technologies even though end-users could probably care less about the techniques being deployed. Because if we are not careful we will end up learning and acting by directive too. That is not the route I would take to semantic search. How about you?

Read Full Post »

In looking at the comments on the last post, The Search for Semantic Search, I see there appear to be some interesting interpretations. Let me explain my motives, address any perceived bias and clarify my position.

Alex Iskold wrote about semantic search that we were asking the wrong questions; that this was essentially the root of the problem with semantic search engines; and that they were only capable of handling a narrow range of questions, those requiring inference. Among other things, he also wrote that his question about vocation was unsolvable; impossible was the term he used. These ideas, and the fact that Alex implied Google was a semantic search engine and inferred that vendors must dethrone Google to be successful, motivated me to blog about it myself.

I was criticized, in the comments, for implying that the so-called “semantic search” capability of these NLP-driven search engines is weak, and due to this they do not really qualify as “semantic search” engines. Actually Kathleen Dahlgren introduced a new name in her comments: “Semantic NLP”. I was also criticized for asking a silly question and for posting my brief analysis of this one single question that Alex said was unsolvable without massively parallel computers.

Of course you cannot judge a search engine by the way it mishandles one or even a few queries. But in this case one natural language question reveals a lot about the semantic acuity of NLP, and the multiple query idea is a kind of strawman argument intended to distract us. It almost proves Alex right as it is misleading.

I do not believe people are motivated to ask wrong questions and I do not believe people ask silly questions to computer search engines while expecting a satisfactory set of results or answers. Nevertheless, when any case fails, the problem or fault does not lie with people. The search engine is supposed to inform them. The fault lies with the computer software for failing to inform. You can try and dismiss it with a lot of hand waving but just like that pesky deer fly — it is going to keep coming back.

While NLP front ends and semantic search engines are the result of millions of dollars in funding and the long work of brainiacs, and while they may be capable enough to parse sentences, build an index or other representation of terms and use some rules of grammar, they are not always accommodating or satisfying. In fact they can be quite brittle or breakable. This means they do not always work. But they do work under the right circumstances in narrowly defined situations. One of the questions here is whether they work well enough to qualify them as “semantic search” engines for English language questions.

Any vendor who comes out in public and claims they are doing “semantic search” should prove it by inferring the significance of the input with sufficient quality and acuity such that the result, or search solution or conclusion, satisfies the evidence and circumstance. This is a minimum level of performance. There are tests for this. Many people use a relevance judgment as a measure of that satisfaction as far as any type of search and retrieval method or software is concerned.

With that said, my last post was about debunking the so-called complex query myth, not about "testing" the capabilities of any search engine. It was about semantic search and how any search engine handles this single so-called impossible question. There were results, and on review I see that they were not completely "useless," as I wrote. I apologize for calling them useless.

Both Cognition and Powerset produced relevant results (with one word) that were more comprehensive than the results Google provides, in my opinion. That is not a natural language process of understanding a sentence, though. Having the capacity to look up a word in a dictionary is not the same as the capacity to use the concept referentially or inferentially: in this case, to make some judgments (distinguish the significant relationships, at least) and inform the search process.

This capability to distinguish significant relationships is a key criterion of "semantic" search engines, meaning they should have a capacity to infer something significant from the input and use it. The results of this query tell a different story. You cannot just profess linguistic knowledge, call the question silly and make the reality it represents go away. This kind of problem is systemic.

As far as the so-called "semantic" search engines inferring anything significant from this (full sentence) question (evidence) or circumstance of searching, I treated all the results with equal disaffirmation. What is more, I stand by that, as it is supported on its face. If you look at the results of the full-sentence query at Cognition, you will notice that they are essentially the same as those from Powerset.

I reckon this could be because both engines map the parts of speech and terms from the query onto the already prepared terms and sentences from Wikipedia. This "mapping strategy" clearly fails, in this case, for some pretty obvious reasons. Without pointing out all the evidence I collected, I summed those reasons up as a lack of semantic acuity. That seems to have touched a nerve.

So I will get into the details of this below. Let me first take a moment to address the fact that one inquiry reveals all this information. Really it is not just one inquiry; it is one conceptualization. Dozens of questions can be derived by mapping from the concepts associated with the terms of this single question. For example: Where are the best careers today? Who has the better jobs this year? Where can I work best for now? What occupation should I choose given the times? I tried them all, and more, with varying degrees of success.

One problem is that NLP practitioners are concerned with sentence structure and search engineers are concerned with indexing terms and term patterns. Either way, the methods lack a conceptual dimension and there is no apparent form of any semantic space for solving the problem. The engines have no sense of time points or coordinated space or other real contexts in which things take place. The absence of semantic acuity is not something that only affects a single inquiry. It will infect many inquiries just as a disease infects its host.

Now that I recognize the problem, if I were challenged to a wager, I would wager that I could easily produce 101 regular English language questions that would demonstrate this affliction. The search engines may produce a solution, but the results would be mostly nonsense and not satisfying. It would prove nothing more and nothing less than I have already stated. What say you, Semantic Cop?

I should mention that I have long suspected that there was a problem mapping NLP onto a search process, but I could not put my finger on it. A literature search on evaluations of text retrieval methods will show, in fact, that the value of part-of-speech processing (in text search and retrieval) has long been regarded as unproven. By taking the time to investigate Alex Iskold's complex query theory I gained more insight into the nature and extent of this problem. It is not just a problem of finding a definition or synonyms for a given term, as some readers may infer. Let me explain further.

While Powerset, Cognition and Hakia each had the information that a vocation was a kind of altruistic occupation, and the search circumstance (a hint) that the information seeker could be looking for an occupational specialty or career, they did not really utilize that information. The failure, though, really wasn’t with their understanding of the terms occupation or vocation. Their failure was specifically related to the NLP approach to the search process. That is supported by the fact that these different search products employing NLP fail in the same way.

That should not be taken to mean that the products are bad or useless. Quite to the contrary, the product implementations are really first class productions and they appear to improve the user experience as they introduce new interface techniques. I think NLP technology will continue to improve and will eventually be very useful, particularly at the interface as Alex noted in his post. But does that make them semantic search engines?

Lest I have been ambiguous, let me sum up and clarify by referring back to the original question: Whether you are looking at Powerset, Cognition or Hakia results, they clearly did not understand the subordinate functionality of the terms /best/, /vocation/, /me/ and /now/ in the sentence.

They clearly could not conceptualize 'best vocation' or 'now'; they could only search for those keyword patterns in the index or data structures created from the original sentences. That is not just 'weak' semantics; that is not semantic search at all. Maybe they "understood" the parts of speech, but they did not infer the topic of inquiry, nor did they properly map the terms into the search space. Google did not fare any better in this case, but Google does not claim to be a semantic search engine. So where are the semantics?

By that I mean (for example) that interpreting /now/ in the natural language question 'what is the best vocation for me now' as an adverb does not improve the search result. Treating it as a keyword or arbitrary pattern does not improve the search result either. And it demonstrates a clear and present lack of acuity and understanding of the circumstance.

Finding the wrong sense of /now/ and showing it is of dubious value. An inference from /now/ to today, at present, this moment in time, or to this year or age, and using that as evidence leading to an informed conclusion, would demonstrate some semantic acuity in this case. Most people have this acumen; these NLP search engines obviously do not, according to the evidence in this case.
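
For illustration only, here is a rough Python sketch of the kind of inference I am describing. The mapping from /now/ to temporal concepts and the recency bonus are my own toy assumptions, not the workings of any of these engines.

TEMPORAL_CONCEPTS = {"now": {"today", "at present", "this year"}}

def expand(terms):
    """Treat /now/ as evidence of a temporal context, not as a bare keyword."""
    expanded = set(terms)
    for term in terms:
        expanded |= TEMPORAL_CONCEPTS.get(term, set())
    return expanded

def score(doc, query_terms, current_year):
    hits = sum(1 for t in query_terms if t in doc["text"].lower())
    recency = 1 if doc["year"] == current_year else 0   # conclusion informed by /now/
    return hits + recency

docs = [{"text": "best careers and occupations this year", "year": 2008},
        {"text": "history of monastic vocations", "year": 1999}]
query = expand(["best", "vocation", "now"])
ranked = sorted(docs, key=lambda d: score(d, query, 2008), reverse=True)
print(ranked[0]["text"])   # the recent, career-oriented page comes first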

The NLP vendors defend this defect by accusing people of not asking the right question in the right way or not asking enough questions. That is like me saying to my wife:

If you want some satisfying information from me you better use a lot of words in your question and it better not be silly. Don’t be too precise and confuse me and don’t use an idiom and expect me to satisfy you. I’ll still claim to understand you. It is you that asks silly questions. That not being enough, you also have to nag me with more long and hard questions before you say my responses are rubbish.

If I should either desire or dare to do that at all, what do you think her response would be? More importantly what do prospects say when you tell them their questions are silly?

I do not need to proceed with a hundred questions when, with a dozen or so, I have enough evidence to deduce that these NLP-driven search engines are limited when it comes to inferring the topic of inquiry. In some cases they are simply unable to draw on the significant structures or patterns of input, evidence or circumstance and produce a suitable solution.

What bothers me is that some of these so-called “semantic search engines” claim to “understand” in general. I did that too, a very long time ago. Yes, I was there in the back of the room at DARPA and NIST meetings and I have been at PARC and the CSLI for presentations. I was challenged then. And it enlightened me. If such claims go unchallenged it will only serve to demean the cultural arts and independent critical thinking and confuse prospects about the capabilities regular people expect of semantic products. I do not wish to lower the bar.

In this instance, and there are many similar cases that could be derived from the semiotic characteristics of this instance, the NLP-driven engines do not show the slightest acumen for inferring the topic of inquiry. I hope the discerning reader sees that it is not just about some synonyms. If they could infer the topic of inquiry, that would demonstrate a little understanding… at least a capacity to learn.

The result, in all such cases, is that these so-called "reasoning engines" and semantic search engines do not lead us to a satisfying consequence or conclusion at all. They have technical explanations such as synonymy and hyponymy for any word, yet if the software cannot infer the sense of everyday terms, is it even sensible to call the methods "semantic"? Just because the vendors profess linguistic knowledge does not mean their semantics are any more than just another marketing neologism.

It may be called semantic NLP but that does not qualify as semantic search in my opinion.

Read Full Post »

In a recent Read Write Web article that was much more myth than reality, Alex Iskold posits that a semantic search engine must dethrone Google (myth 1). Fortunately, by the end of his article he concludes that he was misled into thinking that. I do not think he was misled at all; I just think he is confused about it all.

He posits a few trivial queries (myth 2) to show that there is "not much difference" between Powerset and Hakia and Freebase (myth 3), and that semantic search "is no better than Google search" (myth 4). After that, Alex writes that there is a set of problems suitable to semantic search. He says these problems are wonderfully solved (already) by relational databases (myth 5).

It makes one wonder why we should mess with semantic search if the problems are already solved. The answer is that it is not true; that is why. Neither was any of the talk about query complexity true.

It is not all these myths, exactly, but unclear thinking that leads to false expectations as well as false conclusions. Alex seems to be confused about the semantic web and semantic search. These are two different things, but somehow Alex morphs them into one big database. Because I want this post to impart some information instead of just being critical of a poorly informed post, let me start by debunking the myths.

Myth 1: Semantic Search Should Dethrone Google

For many search problems, semantics plays no role at all. Semantics plays a very limited role when the query is of a transactional nature, e.g., a search problem of the type: find all x.
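
As a point of reference, a "find all x" problem needs nothing more than an inverted index and exact keyword matching. The toy documents in this Python sketch are made up; the point is only that no interpretation is involved.

from collections import defaultdict

docs = {1: "vocation training programs and career advice",
        2: "cheap vacation packages for the summer",
        3: "occupation outlook and the best careers"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)

def find_all(keyword):
    """Return every document containing the exact keyword, nothing more."""
    return sorted(index.get(keyword.lower(), set()))

print(find_all("vocation"))   # [1] -- no notion that occupation (doc 3) is related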

Google is a search engine that solves search problems of this type. Yet the Google kingdom is based on being a publisher. Google uses super-fast and superficial keyword search to aggregate dynamic content from the internet for information seekers and advertisers alike. Google’s business does not even lend itself to semantic search for some very obvious reasons having to do with speed and scalability. Google’s best customers know exactly what they want and they certainly do not want any “intellectual” interpretations.

None of the so-called semantic search engine companies that I know of are pursuing a business strategy to dethrone Google as an information-seeker's destination of choice. Powerset, for example, is not aggregating dynamic content like Google, and its business model does not appear to be based on publishing or advertising.

Powerset is using its understanding of semantics to assist the user (of Wikipedia) in relating to that relatively static content from several different mental, rational and conceptual perspectives. This is meant to assist the information-seeker with interpreting the content. That is a good and valid application of semantics.

This is not the position a company seeking to unseat Google would take. A company seeking to unseat Google would be better positioned by producing technology to assist advertisers in classifying, segmenting and targeting buyers.

Myth 2: Trivial and Complex Queries

Unfortunately, Alex did not supply any complex examples in his post. He tried to imply that his trivial queries were complex and that the most complex was impossible to solve. This was the query labeled impossible: "What's the best vocation for me now?" I will use Alex's query to debunk his misguiding assumptions. First, let's clarify by looking at the search problems represented by Alex's natural language queries.

Note 1: Alex offers the first query as impossible to solve. It must be because Alex is expecting a machine and some software to divine his calling, based presumably on his mood now and some mind-reading algorithm. I should hope most people would seek a human counselor rather than rely on the counsel of a semantic search engine for addressing their calling. It is fair to use a search engine to find a career or occupation, and it is valid to expect a semantic search engine to "understand" the equivalence relationship between the terms occupation and vocation in this context.

As I suggested, best + vocation, or just vocation alone, is a simple solution that should be easy to satisfy. However, this simple search solution fails on all search engines. Even so-called semantic search engines have a problem with this query (see the comparative search results under myth 4 below). It is not because it is a complex query; it is because Alex used the word vocation. The word is not frequent and search engines do not know its synonyms. It is a complex concept in that it takes semantic acuity to "understand" it. No one talks about semantics in terms of acuity, though.

Nonetheless, a search for vocation + best, with the results sorted by most recent date, creates a valid search context in which one can reasonably expect a solution from a semantic search engine. Most people, I am assuming, would have a more reasonable expectation than Alex; one that may be fulfilled by this internet page suggested by Readware:

A semantic search engine needs semantic acuity to "understand" that the concept of a vocation and the concept of an occupation are related. Obviously, none of the search engines mentioned in Alex's article have such acuity of understanding. Some of the search engines tried to process the pronoun me and the word now; instead of providing a solution, this created a problem, as can be seen in the search results (under myth 4) below.
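
The acuity I am asking for can be pictured, in miniature, as a concept map consulted before any keywords are matched. The map in this sketch is hand-made for the example; it is not Readware's ConceptBase or any vendor's resource.

CONCEPT_MAP = {"vocation": {"occupation", "career", "calling", "profession"}}

def semantically_expand(query_terms):
    """Broaden the query with related concepts before searching the index."""
    expanded = set(query_terms)
    for term in query_terms:
        expanded |= CONCEPT_MAP.get(term, set())
    return expanded

print(semantically_expand(["best", "vocation"]))
# -> includes 'occupation', 'career', 'calling' and 'profession' alongside the original terms

An engine with even this much could have returned occupational and career pages for Alex's query.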

Note 2: This query needs a search engine with somewhat more exotic search operators than a simple keyword search engine might provide. The query, however, is not complex. Some search engines may index US Senator as a single item to facilitate such a search. A search engine would need extended Boolean logic to process phrases using a logical AND between them. A more seasoned search engine, such as Google, would parse and logically process the query from a search box, without any specifying logic, and return an acceptable result. NLP-based engines (like Hakia and Powerset) try to do this too, using propositional logic instead of Boolean logic. The effects are not very satisfying, as can be seen below (in the search results listed under myth 4).

A more sophisticated and indeed “semantic” search engine may interpret foreign entity according to a list of “foreign entities”. It would take some sophisticated semantics to algorithmically interpret what type of labels may be subordinate to foreign entity. For example: A German businessman, a Russian bureaucrat, a Japanese citizen, an American taxpayer. Which is the foreign entity?

Yet it is also clear that an inventory of labels can be assembled under certain contexts. Building such an inventory constitutes the building of knowledge. A semantic search engine should help inventory and utilize knowledge for future research. None of the semantic search engines that Alex mentioned do anything like this. Readware does.
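
Here is a sketch of what such an inventory might look like, with context-dependent membership. The contexts and labels are invented for the example and do not describe how Readware stores its knowledge.

FOREIGN_ENTITY_INVENTORY = {
    "US":      {"German businessman", "Russian bureaucrat", "Japanese citizen"},
    "Germany": {"Russian bureaucrat", "Japanese citizen", "American taxpayer"},
}

def is_foreign_entity(label, context):
    return label in FOREIGN_ENTITY_INVENTORY.get(context, set())

def learn(label, context):
    """Add a label to the inventory so future searches can use it."""
    FOREIGN_ENTITY_INVENTORY.setdefault(context, set()).add(label)

print(is_foreign_entity("American taxpayer", "US"))       # False
print(is_foreign_entity("American taxpayer", "Germany"))  # True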

Note 3: This search would benefit from a search engine that recognizes names. I think Hakia has an onomastic component; I am not sure about Powerset. However, this search works on nearly any search engine because there are plenty of pages on the web that contain the necessary names and terms. Otherwise there is nothing complex about this query.

The reality, as you can see, is that every query Alex offered is trivial. Yet the exercise demonstrates what is wrong with so-called "semantic search": today's semantic search products, including the NLP-powered search engines masquerading as "semantic" search, fail at real tests of semantic acuity. Before I get into the evidence, though, let me say something about semantic search technologies in general.

Semantic Search Technologies

There are no public semantic search engines today. There are search engines and there are search engines with Natural Language Processors (NLP) that work on the indexing and query side of the search equation. Whether or not databases are used to store processed indexes or search results, databases and database technology like RDBMS and SQL have nothing to do with it.

The search engines that have the capacity for natural language processing usually claim to "understand" word and/or sentence semantics in a linguistic sense. This usually means that they understand words by their parts of speech, or that they can look up definitions in a resource. Hakia and Powerset fall into this class, as do Cognition and several other search engines both in the U.S. and abroad. These are called semantic search engines, and they claim to understand word sense and do disambiguation and so forth, but, as I will show below, at questionable acuity.

Google is not a semantic search engine at all. While Hakia and Powerset may represent some small part of the spectrum of semantic search engines, they are hardly representative of semantic search. Along with Freebase and Powerset, more representative of "semantic web" search are SWSE, Swoogle and SHOE.

Besides these semantic web search engines, there are semantic search engines akin to Hakia, such as Cognition, mentioned in this article at GigaOM, along with my own favorite, Readware. So, in summary, Alex's comparison is not representative and is really poor evidence.

Myth 3: No difference between Powerset and Hakia and Freebase.

Well, this is just ridiculous. It is not only a myth, it is pure misinformation. Nothing could be further from the truth. While Powerset and Hakia use NLP technology that could be construed as similar, Freebase is essentially an open database that can be queried in flexible ways. Freebase and Powerset happen to be somewhat comparable because Powerset works on Wikipedia and uses RDF to store data, with semantic triples (similar to parts of speech) to perform some reasoning over the data. Freebase also stores Wikipedia-like data in RDF tuples.
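
The difference in kind can be seen from what the stored data look like. Below is a simplified picture of subject-predicate-object triples, shown with plain Python tuples rather than an actual RDF store; the facts listed are merely illustrative.

triples = [
    ("vocation", "is_a",          "occupation"),
    ("vocation", "also_known_as", "career"),
    ("Powerset", "indexes",       "Wikipedia"),
]

def match(subject=None, predicate=None, obj=None):
    """Return every triple fitting the pattern; None acts as a wildcard."""
    return [(s, p, o) for (s, p, o) in triples
            if (subject is None or s == subject)
            and (predicate is None or p == predicate)
            and (obj is None or o == obj)]

print(match(subject="vocation"))
# [('vocation', 'is_a', 'occupation'), ('vocation', 'also_known_as', 'career')]

A database of facts like this can answer pattern queries flexibly, but that is not the same thing as parsing and interpreting a natural language question.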

It is probably also worthwhile to mention that Hakia's NLP comes from the long-time work and mind of the eminent professor Victor Raskin and his students. Powerset's NLP comes from the work of Ronald Kaplan, Martin Kay and others connected with the Palo Alto Research Center, Stanford University and the Center for the Study of Language and Information (CSLI). Cognition's technology is based on NLP work done by Kathleen Dahlgren.

While Hakia, Powerset and Cognition represent these notable NLP approaches, their search methods and results show they do not know a great deal about search tactics and solutions. They do not seem to be successful in mapping the sentence semantics into more relevant and satisfying search results. It seems, from the evidence of these queries, they only know how to parse a sentence for its subject, object and verb and, a lot like Google, find keywords.

Myth 4: Semantic Search is No Better than Google.

Hakia and Powerset are like neophytes in Google’s world of search. That alone makes these engines no better than Google. Yet, that does not apply to semantic search in general. The truth is that the semantics of the search engines we are talking about (Hakia, Powerset, Freebase and Search Monkey), do not appear to make the results any worse than those from Google. Let’s take a look at the Google search results for ‘What is the best vocation for me now’:

As may be predicted, the results are not very good (because the keyword vocation is not popular). Google also wants to be sure we do not mean 'vacation' instead of vocation. Hakia, on the other hand, strictly interpreted the query:

Just like the results from Google, these are not very satisfying. You might think that because Hakia is a semantic search engine, it would have the semantic acuity to “understand” that vocation and occupation are related. As you can see in the following search result, this could not be farther from the truth:

Not one of Hakia’s results had to do with occupational specialties or opportunities for career training and employment. Powerset did not produce any results when the term vocation is used and it really had nothing on occupation so it searched for best + me + now. There is nothing semantic about that and it is a pretty bad search decision as well. The results are useless; I will post them so you can judge for yourself:

When you have results like this, it really does not matter what kind of user interface you have to use. If it is a bad or poor user interface, it only makes the experience that much worse. Even if it is a good, fancy, slick or useful interface, it won’t improve irrelevant results.

Another so-called semantic search engine, Cognition, did not fare any better:

The search result above provides little more than a starting point for further investigation, as does the search for occupation:

I actually was mildly surprised that Cognition related the term occupation to the acronym MOS, which means Military Occupational Specialty. Then I saw that they did not distinguish the acronym from other uses of the three-letter combination. Again, not a very satisfying experience. I did not leave Freebase out; I just left them until last. All Freebase results do is confirm that vocation is an occupation or a career:

It was not possible for Freebase to accept and process the full query. As this result shows, the data indicate that a vocation is also known as an occupation, but none of these engines realize that fact.

Myth 5: Already solved by RDBMS.

If the search problem or the "semantic" problem could be solved by an RDBMS, Oracle would be ten times the size of Google, and Google, if it existed at all, might be using Oracle's technology. None of these problems (aggregated search, semantic parsing of the query and text, attention, relevance) is solved by any RDBMS. But Alex brushed over the real problems to claim that it is all up to the user interface and that semantics only matter there. I suppose that was the point he was trying to make by including Search Monkey in his comments. That is just hogwash, by which I mean that it is not true and is in fact misleading.

Conclusions

It is plain to see that a semantic search engine needs acuity to discern the differences and potential relations that can form between existing terms, names and acronyms. It is also plain to see that none of the commercial crop of search engines has it. The natural language search engines, which have dictionaries as resources, do not associate vocation with occupation (for example) and therefore cannot offer any satisfying search results.

There are roughly 350,000 words in the English language. How many do you suppose are synonymous and present a case just like this example? Parsing a sentence for its subject, object and verb is fine. It does not mean it will be useful or helpful in producing satisfying search results.

It is foolish to think that NLP will be all that is needed to obtain more relevant search results. The fact is that search logic and search tactics are arts that are largely unrelated to linguistics, NLP or database technology. While language has semantics, testing of the semantics of so-called semantic search engines has demonstrated that those semantics, if they are semantics, are pretty weak. I have demonstrated that semantic acuity plays a large role in producing more relevant and satisfying search results. A semantic search engine should also help inventory and utilize knowledge for future research. An informed search produces more satisfying results.

Read Full Post »

I would like to address the few questions I received on parts 1, 2 and 3 of the semantics of interpersonal relations. The first and most obvious question was:

I don’t get it. What are the semantics?

This question is about the actual semantic rules that I did not state fully or formally in any of the three parts. I only referred to Dr. Adi’s semantic theory and related how the elements and relations of language (sounds and signs) correspond with natural and interpersonal elements and relations relevant to an embodied human being.

Alright, so a correspondence can be understood as an agreement or similarity and as a mathematical and conceptual mapping (a mapping on inner thoughts). What we have here, essentially, is a conceptual mapping. Language apparently maps to thought and action and vice-versa. So the idea here is to understand the semantic mechanism underlying these mappings and implement and apply it in computer automations.

Our semantic objects and rules are not like those of NLP or AI or OWL or those defined by the semantic web. These semantic elements do not derive from the parts of speech of a language and the semantic rules are not taken from propositional logic. And so that these semantic rules will make more sense, let me first better define the conceptual space where these semantic rules operate.

Conceptually, this can be imagined as a kind of intersubjective space. It is a space encompassing interpersonal relationships and personal and social interactions. This space constitutes a substantial part of what might be called our "semantic space," where life as lived, what the Germans call Erlebnis, and ordinary perception and interpretation (Erfahrung) intersect, and where actions in our self-embodied proximity move us to intuit and ascribe meaning.

Here in this place is the intersection where intention and sensation collide, where sensibilities provoke the imagination and thought begets action. It is where ideas are conceived. This is where language finds expression. It is where we formulate plans and proposals, build multidimensional models and run simulations. It is the semantic space where things become mutually intelligible. Unfortunately, natural language research and developments of “semantic search” and the “Semantic-Web” do not address this semantic space or any underlying mechanisms at all.

In general when someone talks about “semantics” in the computer industry, they are talking either about English grammar, rdf-triples in general or they are talking about propositional logic in a natural or artificial language, e.g., a data definition language, web services language, description logic, Aristotelian logic, etc. There is something linguists call semantics though the rules are mainly syntactic rules that have limited interpretative and predictive value. Those rules are usually applied objectively, to objectively defined objects, according to objectively approved vocabulary defined by objectively-minded people. Of course, it is no better to subjectively define things. Yet, there is no need to remain in a quandary over what to do about this.

We do not live in a completely objective, observable or knowable reality, or in a me-centric or I-centric society; it is a we-centric society. The interpersonal and social experience that every person develops from birth is intersubjective: each of us experiences the we-centric reality of ourselves and others entirely through our own selves and our entirely personal world view.

Perhaps it is because we do not know and cannot know, through first-hand experience at least, what any others know, or are presently thinking, that this sort of dichotomy sets in between ourselves and others. This dichotomy is pervasive and even takes control of some lives. In any case, conceptually, there is a continuum between the state of self-realization and the alterity of others. This is what I am calling the continuum of intersubjective space.

A continuum, of course, is a space that can only be divided arbitrarily. Each culture has its own language for dividing this space. Each subculture in a society has its own language for dividing this space. Every technical field has its own language for dividing the space. And it follows, of course, that each person has their own language, not only for dividing this space, but for interacting within the boundaries of this space. The continuum, though, remains untouched and unchanged by interactions or exchanges in storied or present acts.

The semantics we have adopted for this intersubjective space include precedence rules formulated by Tom Adi. Adi’s semiotic axioms govern the abstract objects and interprocess control structures operating in this space. Cognitively, this can be seen as a sort of combination functional mechanism, used not only for imagining or visualizing, but also for simulating the actions of others. I might add that while most people can call on and use this cognitive faculty at will, its use is not usually a deliberate act; it is mainly used subconsciously and self-reflexively.

We can say that the quality of these semantics determines the fidelity of the sound, visualization, imitation or simulation to the real thing. So when we access and use these semantics in computer software, as we do with Readware technology, we are accessing a measure of the fidelity between two or more objects (among other features). This may sound simplistic, though it is a basic-level cognitive faculty. Consider how we learn through imitation, how we carry the cognitive load of switching roles, and how easily we can take the opposite or other position on almost any matter.

We all must admit, after careful introspection, that we are able to “decode” the witnessed behavior of others without the need to exert any conscious cognitive effort of the sort required for describing or expressing the features of such behavior using language, for example. It may be only because we must translate sensory information into sets of mutually intelligible and meaningful representations in order to use language to ascribe intentions, order or beliefs, to self or others, that the functional mechanism must also share an interface with language. It may also be because language affords people a modicum of command and control over their environment.

Consider the necessity of situational control in the face of large, complex and often unsolvable problems. I do not know about you, but I need situational control in my environment and I must often fight to retain it in the face of seemingly insurmountable problems and daily ordeals.

Now try and recognize how the functional aspects of writing systems fill a semiotic role in this regard. Our theoretical claim is that these mutually intelligible signs instantiate discrete abstract clusters of multidimensional concepts relative to the control and contextualizing of situated intersubjective processes.

Like the particles and waves of quantum mechanics are to physics, these discrete intersubjective objects and processes are the weft and the warp of the weaving of the literary arts and anthropological sciences on the loom of human culture. We exploited this functional mechanism in the indexing, concept-analysis, search and retrieval software we call Readware.

We derived a set of precedence rules that determine interprocess control structures and gave us root interpretation mappings. These mappings were applied to the word roots of an ancient language that were selected because modern words derived from these word roots are used today. These few thousand root interpretations (formulas) were organized into a library of concepts, a ConceptBase, used for mapping expressions in the same language and from different languages. It was a very successful approach for which we designed a pair of ReST-type servers with an API to access all the functionality.
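
To give a feel for how such functionality might be consumed, here is a purely hypothetical client call in Python; the endpoint, parameters and response shape are invented for illustration and are not the actual Readware API.

import requests

def lookup_concepts(expression, base_url="http://localhost:8080/conceptbase"):
    """Hypothetical ReST call: which root interpretations does this expression map onto?"""
    response = requests.get(f"{base_url}/concepts", params={"q": expression})
    response.raise_for_status()
    return response.json()   # assumed: a list of concept identifiers

# Example, assuming such a server were running locally:
# print(lookup_concepts("vocation"))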

To make this multi-part presentation more complete, I have posted a page with several tables drawn up by Tom Adi, along with the formal theory and axioms. There are no proofs here as they were published elsewhere by Dr. Adi. These tables and axioms identify all the key abstract objects, the concepts and their interrelationships. Tom describes the mappings from the base set (sounds) and the axioms that pertain to compositions and word-root interpretations, together with the semantic rules determining inheritance and precedence within the control structures. You can find that page here.

And that brings me to the next question, which was: How can you map concepts between languages with centuries of language change and arbitrary signs? The short answer is that we don't. We map the elements of language to and from the elements of what we believe are interlinked thought processes that form mirror-like abstract and conceptual images (snapshots) of perceptive and sensory interactions in a situated intersubjective space.

That is to say that there is a natural correspondence between what is actually happening in an arbitrary situation and the generative yet arbitrary language about that situation. This brings me to the last question that I consider relevant no matter how flippant it may appear to be:

So what?

The benefits of a shared semantic space should not be underestimated. Particularly in this medium of computing where scaling of computing resources and applications is necessary.

Establishing identity relations is important because it affords the self-capacity to better predict the consequences of the ongoing and future behavior of others. In social settings, the attribution of identity status to other individuals automatically contextualizes their behavior. Contextualizing in this way, knowing that others are acting as we would, effectively reduces the cognitive complexity and the amount of information we have to process.

It is the same sort of thing in automated text processing and computerized content discovery processes. By contextualizing content in this way (e.g., with Readware) we dramatically and effectively reduce the amount of information we must process from text, more directly access and cluster relevant topical and conceptual structure, and support further discovery processes. We have found that a side effect of this kind of automated text analysis is that it clarifies data sources by catching unnatural patterns (e.g., auto-generated spam) and also helps identify duplication and error in data feeds and collections.
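
As a toy illustration of that side effect, texts can be reduced to a small signature and collisions flagged as likely duplicates. The signature in this Python sketch is just a set of content words; a Readware-style analysis would use concept mappings instead, but the principle is the same.

from collections import defaultdict

STOPWORDS = {"the", "a", "is", "for", "and", "of"}

def signature(text):
    """A crude stand-in for a concept signature: the sorted content words."""
    return tuple(sorted(w for w in set(text.lower().split()) if w not in STOPWORDS))

def find_duplicates(feed):
    seen = defaultdict(list)
    for item_id, text in feed.items():
        seen[signature(text)].append(item_id)
    return [ids for ids in seen.values() if len(ids) > 1]

feed = {"a1": "The best vocation for the times",
        "a2": "Best vocation for the times",        # near-verbatim duplicate
        "a3": "Occupational outlook handbook"}
print(find_duplicates(feed))   # [['a1', 'a2']]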

Read Full Post »

Older Posts »