
Archive for the ‘intelligent search’ Category

I am happy to announce that "A New Theory of Cognition and Software Implementations in Information Technology" will be published in the April-June issue of the Journal of Information Technology Research, Vol. 2, Issue 2, 2009.

Abstract

“The Scientific Method means that theories are developed to explain observed phenomena— similar to the task of text analysis—or to search for unobserved phenomena—similar to text retrieval. Theory development means testing a large number of proposed theories (hypotheses) until one is corroborated. To make theory development efficient, a method is needed to construct promising theories—ones more likely to succeed. Such a method is part of a new theory of cognition that is introduced. The theory is implemented in the software Readware. Readware uses theory development methods for text analysis and retrieval. Readware’s development, features and large-scale performance are reviewed. This includes a fast ontology-building system, the cross-lingual word-root theory base, a language to code theories, algorithms and ontology implementations, and software applications and servers that perform text analysis and retrieval using Readware API functions.”

Article copies are available for purchase from InfoSci-on-Demand.com.  You can also look for the publication in your local university library.


Search. I suppose there is no denying that the word "search" has ascended to significance in the consciousness of more people since the birth of Information Science than perhaps at any other time in history. This supposition is supported by a recent Pew Foundation internet study stating that:

The percentage of internet users who use search engines on a typical day has been steadily rising from about one-third of all users in 2002, to a new high of just under one-half (49%).

While it may not be obvious, on closer examination of the phenomenon it becomes apparent that the spread and growth in the numbers of words, texts and more formal forms of knowledge, along with the modern development of search technology, had a lot to do with that.

Since people adopted the technology of writing systems, civilizations and societies have flourished. Human knowledge and culture, and technological achievement, have blossomed. No doubt.

Since computers, and ways of linking them over the internet, came along, the numbers of words and the numbers of writers have increased substantially. It was inevitable that search technology would be needed to search through all those words from all those writers. That is what Vannevar Bush was telling his contemporaries in 1945 when he said the perfection of new instruments "call for a new relationship between thinking man and the sum of our knowledge."

But somewhere along the line things went wrong; some things went very, very wrong. Previous knowledge and the sum of human experience were swept aside. Search technology became superficial, and consequently writing with words is not considered any kind of technology at all. That superficiality violates the integrity of the meaning of search, and the classification of words as merely arbitrary strings is also wrong, in my view.

Some scientists I know would argue that the invention of writing is right up there at the top of human technological achievement. I guess we just take that for granted these days, and I am nearly certain that the scientists who were embarking on the new field of information technology in the 1940s and 1950s were not thinking of writing with words as the world's first interpersonal memory– the original technology of the human mind and its thoughts and interactions.

Most information scientists have not yet fully appreciated words as technical expressions of human experience but treat them as labels instead. By technical, I mean of or relating to the practical knowledge and techniques (of being an experienced human).

Very early in the development of search technology, information scientists and engineers worked out assumptions that continue to influence how search technology is produced and applied today. The first time I wrote about this was in 1991, in the proceedings of the Annual Meeting of the American Society for Information Science. There is a copy in most libraries if anyone is interested.

And here we are in 2008, in what some call a state of frenzy and others might call disinformed and confused– looking at the prospects of the Semantic Web. I will get to all that in this post. I will divide this piece into the topics of the passion for search technology, the state of confusion about search technology, and the semantics of search technology.

The term disinformed is my word for characterizing how people are under-served if not totally misled by search engines. A more encompassing view of this sentiment was expressed by Nick Carr in an article appearing in the latest issue of the Atlantic Monthly where he asks: Is Google making us stupid?

I am going to start off with the passion of search.

Writing about the on-line search experience in general, Carmen-Maria Hetrea of Britannica wrote:

… the computing power of statistical text analysis, pattern-matching, and stopwords has distracted many from focusing on (should I say remembering?) what actually makes the world tick. There are benefits and dangers in a world where the information that is served to the masses is reduced to simple character strings, pattern matches, co-location, word frequency, popularity based on interlinking, etc.

( … ) It has been sold to us as “the trend” or as “the way of the future” to be pursued without question or compromise

That sentiment pretty much echoes what I wrote in my last post. You see, computing power was substituted for explanatory power, and the superficiality of computational search was given credibility because it was needed to scale to the size of the world wide web.

This is how “good enough” became state of the art. Because search has become such a lucrative business and “good enough” has become the status quo, it has also become nearly impossible for “better” search technology to be recognized, unless it is adopted and backed by one of the market leaders such as Google or Microsoft or Yahoo.

I have argued in dozens of forums and for more than twenty years that search technology has to address the broader logic of inquiry and the use of language in the pursuit of knowledge, learning and enhancing the human experience. It has to accommodate established indexing and search techniques and it has to have explanatory power to fulfill the search role.

Most who know me know that I am not showing up at this party empty-handed. I have software that does all that, and while my small corporate concern is no market or search engine giant, my passion for search technology is not unique.

In her Britannica Blog post about search and online findability, Carmen-Maria Hetrea summed up her passion for search:

Some of us dared to differ by returning to the pursuit of search as something absolutely basic to the foundations of our human existence: the simple word in all of its complexity — in its semantics and in its findability and its futuristic promise.

You have to ask yourself what you are really searching for before you can find that it is not for keywords or patterns at all. Out in the real world almost everyone is searching for happiness. Some are also searching for truth or relevance. And many search for knowledge and to learn. If your searching doesn’t involve such notions, maybe you don’t mind the tedium of thorough, e.g., exegetical, searching. Or maybe you are someone who doesn’t search at all, but depends on others for information.

How is the search for happiness addressed by online search technology? Should it be a requirement of search technology to find truth or relevance? Should a search be thorough or superficial? Is it about computing power or explanatory power? I am going to try and address each of these questions below as I wade through the causes of confusion, expose the roots of my passion and maybe shed some light on search technology and its applications.

Some people have said that in the online world you have both the transactional search and the research search, which are not the same. They imply that these search objectives require different instruments or plumbing. I don't think so. I think it is just a crutch vendors use to justify superficial search. Let's look at an example transactional search, say, searching for a new car. There are so many places where you can carry out that transaction that being thorough and complete is not an issue. Here is a search vendor quiz:

Happiness is a ___________ search experience.

Besides searching for objects of information that we know but don’t have at hand, in cyberspace and on the web, we might search for a pizza place in a new destination. Many search for cheap air fares or computer or car parts, or deals on eBay, while others search for news, music, pictures and many other types of media and information. A few others search for knowledge and for explanation. Happiness in the universe of online search is definitely a satisfying search experience irrespective of what you are searching for.

Relevance is paramount to happiness and satisfaction, whether searching for pizza in close proximity or doing research with online resources. Search vendors are delivering hit lists from their search engines, where users are expecting relevance and to be happy with the results. Satisfaction, in this sense, has turned out to be a tall order, yet it remains a necessary benefit of search technology that people still yearn for.

Let’s now turn to the state of confusion.

Carmen-Maria mentions that new search technology has to be backward compatible and she also complains that bad search technology is like the wheel that just keeps getting reinvented:

The wheel is being reinvented in a deplorable manner since search technology is deceptive in its manifestation. It appears simple from the outside, just a query and a hitlist, but that’s just the tip of the iceberg. In its execution, good search is quite complex, almost like rocket science.

… The wealth of knowledge gained by experts in various fields – from linguists to classifiers and catalogers, to indexers and information scientists – has been virtually swept off the radar screen in the algorithm-driven search frenzy.

The wheel is certainly being reinvented; that's part of the business. I am uncertain what Carmen-Maria means by an algorithm-driven search frenzy. Algorithms are the stuff of search technology. I believe that some of the problems with search stem from the use of algorithms that are made fast by being superficial, by cutting corners and by other artificial means. The cutting of corners begins with the statistical indexing algorithms, or pre-coordination of text– so retrieval is consequently hobbled by weaknesses in the indexing algorithms. But algorithms are not the cause of the problem.

Old and incorrect assumptions are the real problem.

Modern state-of-the-art search technology (algorithms) invented in the 1960s and 1970s strips text of its dependence on human experience under something information science (IS) professionals justify as the independence assumption. Information retrieval (IR) professionals– those who design software methods for search engine algorithms– are driven by the independence assumption to treat each text as a bag of words, without connection to other words in other texts or other words in the human consciousness.
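
As a toy illustration (mine, not any vendor's code), here is roughly what the independence assumption looks like in practice, with a text reduced to an unordered bag of words:

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for", "but"}

def bag_of_words(text: str) -> Counter:
    """Reduce a text to bare term frequencies: word order, discourse and
    any connection to other texts or to human experience are discarded."""
    tokens = (w.strip(".,;:!?\"'").lower() for w in text.split())
    return Counter(w for w in tokens if w and w not in STOPWORDS)

doc = "The searcher searches for meaning, but the index sees only strings."
print(bag_of_words(doc))
# Counter({'searcher': 1, 'searches': 1, 'meaning': 1, 'index': 1,
#          'sees': 1, 'only': 1, 'strings': 1})
```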

I don't think Rich Skrenta was thinking about this assumption when he wrote:

… – the idea that the current state-of-the-art in search is what we’ll all be using, essentially unchanged, in 5 or 10 years, is absurd to me.

Odds are that he intends to sweep a lot of knowledge out of the garage too, and I would place the same odds that any "new" algorithm Rich brings to the table will implicitly apply that old independence assumption as well.

So this illustrates a kind of tug of war between modern experts in search technology and the knowledge of ages of experience. There is also a kind of frenzy or storm over so-called "new" technologies and just what constitutes "semantic" search technology. While some old natural language processing (NLP) technology has debuted on the online search scene, it has not brought any new search algorithms to light. It has only muddied the waters, in my opinion. I have written about this in previous posts.

The underlying current is stirred up by the imbalance between the (significant) history of search technology and the nascence of online search and other modern applications of search technology. Add to that disturbance the dichotomy exacerbated by good (satisfying) and bad (deceptive) search results, multiplied by the number of search engine vendors, monopolistic or otherwise, and you have the conditions where compounding frenzy, absurdity and confusion, rather than relevance, reign supreme.

I like to think my own view transcends this storm and sets an important development principle, one I established when I produced the first concept search technology back in 1987. The subjects of the search may be different, but the freedom to search for objects, for answers, or for theories or explanations of unknown phenomena is the right of inquiry.

This right of intellectual inquiry is as important and as basic as the freedom of speech. This is what ignites my passion for search technology. And I cannot stand to have my right of inquiry blocked, limited, biased, restricted, arrested or constrained, whether by others, or by unwarranted procedure (algorithm) or formality, or by mechanical devices.

I wear my passion on my sleeve and it frequently manifests as a rant against the “IT” leaders or so-called experts that Carmen-Maria wrote about:

Many consider themselves experts in this arena and think that information retrieval is this new thing that is being invented and that is being created from scratch. The debate often revolves around casual observations, remarks, and opinions come mostly from an “IT” perspective.

To be fair, not all those with “IT” perspectives are down with all this “new thing” in online search engines. Over at the Beyond Search blog, Stephen Arnold wrote about the problem with the thinking about search technology:

… fancy technology is neither new nor fancy. Google has some rocket science in its bakery. The flour and the yeast date from 1993. Most of the zippy “new” search systems are built on “algorithms”. Some of Autonomy reaches back to the 18th century. Other companies just recycle functions that appear in books of algorithms. What makes something “new” is putting pieces together in a delightful way. Fresh, yes. New, no.

I also think Stephen understands the history of search technology pretty well. He demonstrates this when he writes:

Software lags algorithms and hardware. With fast and cheap processors, some “old” algorithms can be used in the types of systems Ms. Hane identifies; for example, Hakia, Powerset, etc. Google is not inventing “new” things; Google is cleverly assembling bits and pieces that are often well known to college juniors taking a third year math class.

Like Carmen-Maria Hetrea, Stephen Arnold sounds biased against algorithms, "old" algorithms in particular, though I don't think he intended any bias, as many of the best algorithms we have are "old". There are really not many "new" algorithms. Augmented, yes. Modified, yes. New, no.

To be involved in IT and biased against algorithms is absurd, so long as technology is understood as the application of the scientific method, and scientific search methods are understood as collections of investigative steps systematically combined into useful search procedures, or algorithms. So there you have my definition of search technology.

The algorithms for most search technology are not rocket science and can be boiled down to simple procedures. At the very least there is an indexing algorithm and a search algorithm; a minimal sketch of both follows the two procedures below:

Pre-coordination per-text/document/record/field procedure:

  1. Computerize an original text by reading the entire text or chunks of it into computer memory.
  2. Parse the text into the smallest retrievable atomic components: usually patterns (trigrams, sentences, parts of speech, noun phrases, etc.), keywords, or a bag (alphabetical list) of infrequent words.
  3. Store the original text with a unique key and store the parsing results as alternate keys in an index.
  4. Repeat for each new text added to a database or collection.

Post-coordination per-query procedure:

  1. Read a string from input, parse the query into keys in the same way as a text.
  2. Search the index to the selected collection or database with the keys.
  3. Assemble (sort, rank) key hits into lists and display.
  4. Choose hit to effect retrieval of the original text.
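
Here is the minimal sketch promised above. The tokenizer and the hit-count ranking are deliberately the simplest possible stand-ins for whatever a vendor plugs into steps 2 and 3; nothing here is any particular engine's method:

```python
# A pre-coordinated inverted index and a post-coordinated query over it.
index: dict[str, set[int]] = {}   # alternate keys -> document ids
store: dict[int, str] = {}        # unique key -> original text

def parse(text: str) -> list[str]:
    # Step 2: reduce a text (or a query) to its smallest retrievable
    # components -- here, bare lowercase keywords.
    return [w.strip(".,;:!?").lower() for w in text.split() if w]

def add_document(doc_id: int, text: str) -> None:
    # Pre-coordination, steps 1-4: store the original under a unique
    # key and post the parsing results as alternate keys in the index.
    store[doc_id] = text
    for key in parse(text):
        index.setdefault(key, set()).add(doc_id)

def search(query: str) -> list[tuple[int, int]]:
    # Post-coordination, steps 1-3: parse the query like a text, probe
    # the index with the keys, rank the hits by number of matching keys.
    hits: dict[int, int] = {}
    for key in parse(query):
        for doc_id in index.get(key, set()):
            hits[doc_id] = hits.get(doc_id, 0) + 1
    return sorted(hits.items(), key=lambda hit: -hit[1])

add_document(1, "The best vocation for a searcher of knowledge.")
add_document(2, "Careers and occupations: find a vocation now.")
print(search("best vocation now"))   # [(1, 2), (2, 2)]: two key hits each
```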

These basic algorithms are fulfilled differently by different vendors, but vendors do not generally bring new algorithms to the table. They bring their methods of fulfilling these algorithms; they may modify or augment the regular methods employed in steps 2 and 3 of these procedures, as Google does with link analysis.

In addition, vendors fold search technology into a search engine. Most online search engines– those integrated "software systems" or search appliances that process text, data and user queries– are composed of the following components, sketched in skeleton form after the list:

  1. A crawler for crawling URI’s or files on disk or both.
  2. An indexer that takes input from the crawler and recognizes key patterns or words.
  3. A database to store crawler results and key indexing (parsing) results.
  4. A query language (usually SQL, Keyword-Boolean) to use the index and access keys in the database.
  5. An internet server and/or graphical user interface (GUI) components for getting queries from, and presenting results to, users.
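
And here is the promised skeleton of those five components wired together; every name is illustrative and every body is a stub, since real engines differ in all the details:

```python
class Crawler:
    def fetch(self, seeds: list[str]) -> list[tuple[str, str]]:
        # Component 1: yield (uri, raw_text) pairs from URIs or disk.
        return [(uri, f"stub contents of {uri}") for uri in seeds]

class Indexer:
    def __init__(self) -> None:
        self.postings: dict[str, set[str]] = {}   # component 3, in memory

    def add(self, uri: str, text: str) -> None:
        # Component 2: recognize key words in the crawler's output.
        for word in text.lower().split():
            self.postings.setdefault(word, set()).add(uri)

    def query(self, q: str) -> set[str]:
        # Component 4: a bare keyword-Boolean AND over the index.
        results = [self.postings.get(k, set()) for k in q.lower().split()]
        return set.intersection(*results) if results else set()

def serve(indexer: Indexer, q: str) -> None:
    # Component 5: the thinnest possible stand-in for a server or GUI.
    print(q, "->", sorted(indexer.query(q)))

crawler, indexer = Crawler(), Indexer()
for uri, text in crawler.fetch(["file:///a.txt", "file:///b.txt"]):
    indexer.add(uri, text)
serve(indexer, "stub contents")   # both documents match both keywords
```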

Most search engine wizards, as they are called, are working on one or more of these software components of online search engines. You can look at what a representative sample of these so-called wizards have to say about most of these components at the ArnoldIT blog here. If you read through the articles, you won’t find one of them (and I have not read them all) that is working on new indexing methods or new mapping algorithms for mapping the meaning of the query to the universe of text, for example.

Many of the "new search engines" popping up everywhere are not rightly called new search technology, even though they frequently bear the moniker. They are more rightly named new applications of search technology. But even vendors are confused, and confusing, about this. Let's see what Riza Berkin of Hakia is saying in his most recent article, where he writes:

But let’s not blind ourselves by the narrowness of algorithmic advances. If we look closely, the last decade has produced specialist search engines in health, law, finance, travel etc. More than that, search engines in different countries started to take over (like Naver, Baidu, Yandex, ect.)…

He had been writing that Search 1.0 began with AltaVista (circa 1996), Search 2.0 is Google-like, and Search 3.0 is semantic search "where the search algorithms will understand the query and text". I guess all those search engines from Fulcrum, Lexis-Nexis, OpenText, Thunderstone, Verity, Westlaw, and search products from AskSam to Readware ConSearch to ZyIndex, were Search 0.0, or at least P.B. … you know, like B.C., but Pre-Berkin.

And so this last paragraph (above) makes me think he is confusing search applications with search technology. His so-called specialist search engines are applications of search technology to the field or domain of law, to the field or domain of health, and so on.

Then he confuses me even more, when he writes about “conversational search”:

Make no mistake about it, a conversational search engine is not an avatar, although avatars represent the idea to some extent. Imagine virtual persons on the Web providing search assistance in chat rooms and on messengers in a humanly, conversational tone. Imagine more advanced forms of it combined with speech recognition systems, and finding yourself talking to a machine on the phone and actually enjoying the conversation! That is Search 2.0 to me.

Now I can sympathize with Riza, because I used the phrase "conversational search" to describe the kind of conceptual search engine I was designing in 1986. I am not confused about that. I am confused that he calls this Search 2.0 when earlier he described Search 2.0 as statistically augmenting the inverted index.

He doesn't stop there. He continues, describing Search 3.0 as the "Thinking Search" where "search systems will start to solve problems by inferencing." Earlier he wrote that semantic search was Search 3.0. Semantics requires inferencing, so I began to reckon that maybe thinking and semantics are equal in his mind, until he writes: "I do not fool myself with the idea that I will see that happening in my life time"– so now I am confused again. I think this is what vendors want; they want the public to remain confused about the semantics of search and what you get with it.

And that brings me to the semantics of search.

There are only two words that matter here: thorough and explanatory.

When I started tinkering with text processing, search and retrieval software in the early 1980s, I was captivated by the promise of searching and reading texts on computers. The very first thing that I noticed about the semantics of search, before my imagination became involved in configuring computational search technology, was thoroughness. The word /search/ implies thoroughness, if not completeness; thoroughness is part of the very definition of search. Look up the definition for yourself.

You need only look at one or two hit lists from major search engines to see that this is not what we get from commercial search engines, or from most search technology. Search is not a process that is completed by delivering some hints of where to look, but that is what it has been fashioned into by the technological leaders in the field. Millions of people have accepted it.

Yet in our hearts we know that search must be complete, and it must be explanatory to be satisfying; we must learn from it, and we expect to learn from conducting a search. Whether we are learning the address of the nearest pizza place or learning how to install solar heating, it is not about computational power, it is about explanatory power. The technological leaders forgot that words are part of the technique of communicating interpersonal meaning; let's hope search vendors don't forget that words have explanatory power too.

Tell me what you think.


Peter Mika recently wrote an article about the semantic web and NLP-style semantic search. I should just ignore his claim that there are only two roads to semantic search, because he is plainly mistaken on that count. As Peter works for Yahoo, he was mainly discussing data processing with RDF and Yahoo's SearchMonkey. He obviously knows that subject well.

He constructed an example of how to use representational data (such as an address) according to semantic web standards, and how to integrate the RDF triples with search results. His claim is that one cannot do "semantics" without some data manipulation, and for that the data must be encoded with metadata: essentially, data about the data. In this case, the metadata necessary to pick out and show the data at the keyword address.
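
As a hedged sketch of what he is describing (using the rdflib library; the vocabulary and property names below are placeholders of my own, not SearchMonkey's actual schema), the address example might be encoded like this:

```python
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/vocab/")       # hypothetical vocabulary
page = URIRef("http://example.org/tonys-pizza")   # hypothetical page

g = Graph()
g.add((page, EX.businessName, Literal("Tony's Pizza")))
g.add((page, EX.streetAddress, Literal("42 Main St")))
g.add((page, EX.locality, Literal("Springfield")))

# The "semantics" here is just a lookup over data about the data:
# pick out and show whatever is tagged as an address.
for subject, _, address in g.triples((None, EX.streetAddress, None)):
    print(f"{subject} -> address: {address}")
```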

At the end of his article, Peter talks about the way forward and, in particular, about the need for fostering agreements around vocabularies. I suppose that he means to normalize the relationships between words by having publishers direct how words are to be used. He calls this a social process while calling on the community of publishers to play their role. Interesting.

About the time Peter was beginning his PhD candidacy, industry luminary John Sowa wrote in Ontology, Metadata and Semiotics that:

Ontologies contain categories, lexicons contain word senses, terminologies contain terms, directories contain addresses, catalogs contain part numbers, and databases contain numbers, character strings, and BLOBs (Binary Large OBjects). All these lists, hierarchies, and networks are tightly interconnected collections of signs. But the primary connections are not in the bits and bytes that encode the signs, but in the minds of the people who interpret them.

This is the case in the trivial example offered by Peter. The reason one is motivated to list an address in the search result of a search for pizza is that it is relevant to people who are searching for a pizza place close to them. In his paper, John Sowa writes:

The goal of various metadata proposals is to make those mental connections explicit by tagging the data with more signs.

This is the essential nature of the use case and proposal offered by Yahoo with SearchMonkey. It seems a good idea, doesn't it? Yahoo is giving developers the means to tag such data with more signs. Besides, it keeps people using Yahoo's index, exposing them to Yahoo's advertisers. Sowa cautions that:

The ultimate source of meaning is the physical world and the agents who use signs to represent entities in the world and their intentions concerning them.

Which resources do investigators or developers use to learn about agents and their intentions when using signs? The resource most developers turn to is language and they begin by defining the words of language in each context in which they appear.

Peter says it is common for IR systems to focus on words or grams and syntax. While some officials may object, NLP systems such as Powerset, Hakia and Cognition do use dictionaries and "knowledge bases" to obtain sense data, yet they each focus mainly on sentence syntax and (perhaps with the exception of Powerset) use keyword indexes for retrieval, just like traditional IR systems.

Hakia gets keyword search results from Yahoo, as a matter of fact. All of these folks treat words, and even sentences, as the smallest units of meaning of a text. Perhaps these are the most noticeable elements of a language capable of conveying a distinction in meaning, though they certainly are not the only ones. There are other signs of meaning obtainable from textual discourse.

Believe it or not, the signs people use most regularly are known as phonemes. They are the least salient because we use them so often, and largely subconsciously. Yet we have found that these particular sounds are instantiations, or concrete signs, of the smallest elements of abstract thought– distinctive elements of meaning that are sewn and strung together to produce words and form sentences. When they take form in a written text they are also called morphemes.

Some folks may not remember that they learned to read words and texts by stringing phonemes together, sounding them out to evoke, apprehend and aggregate their abstract meanings. I mention this because if a more natural or organic semantic model were standardized, the text on the world wide web could become more tractable and internet use might become more efficient.

This would happen because we could rid ourselves of the clutter of so many levels of metalevel signs and the necessity of controlled vocabularies for parsing web pages, blogs and many kinds of unstructured texts. An unstructured text is any free-flowing textual discourse that cannot easily be organized in the field or record structure of a database. Neither is it advantageous to annotate the entirety of an unstructured text with metalevel signs, because, as John Sowa wrote:

Those metalevel signs themselves have further interconnections, which can be tagged with metametalevel signs. But meaningless data cannot acquire meaning by being tagged with meaningless metadata.

This raises the question of whether words and their definitions are just meaningless signs to begin with. The common view of words– as signs– is that they are arbitrarily assigned to objects. I am unsure whether linguists could reach consensus that the sounds of words evoke meaning, as it seems many believe that a horse could have been called an egg without any consequence to its meaning or use in a conversation.

Within the computer industry it becomes even more black and white: a word is used to reference objects by way of general agreement or convention, where the objects are things and entities existing in the world. Some linguists and most philosophers recognize abstract objects as existing in the world as well. Yet this has not changed the conventional view, which is a kind of de facto standard among search software vendors today.

This view implies that the meaning of a word or phrase– its interpretation– adheres only to conventional and explicit agreements on definitions. The trouble is that it overlooks or ignores the fact that meaning is independently processed and generated (implicitly) in each individual agent's mind. This is generally very little trouble if the context is narrow and well-defined, as in most database and trivial semantic web applications on the scene now.

The problems begin to multiply exponentially when the computer application is purported to be a broker of information (like any search engine), where there is a verbal interchange of typically human ideas in query and text form. This is partly why there is confusion about meaning and about search engine relevance. Relevance is explicit in as much as you know it when you see it; otherwise, relevance is an implicit matter.

Implicit are the dynamic processes by which information is recognized, organized, acted on, used, changed, and so on. The implicit processes in cognitive space are those required to recognize, store and recall information. Normally functioning, rational– though implicit and abstract– thought processes organize information so that we may begin to understand it.

It is obvious that there are several methods and techniques of organizing, storing and retrieving information in cyberspace as well. While there are IR processes running both in cyberspace and in cognitive space, it is not the same abstract space and the processes are not at all the same. In cyberspace and in particular in the semantic web, only certain forms of logical deduction have been implemented.

Cognitive processes for organizing information induce the harmonious and coherent integration of perceptions and knowledge with experience, desires, the physical self, and so on. Computational processes typically organize data by adding structure that arranges the information in desired patterns.

Neither the semantic web standards, nor microformats, nor NLP, seek the harmony or coherence of knowledge. Oh, yes, they talk about knowledge and about semantics, yet what they deliver is little more than directives, suitable only for data manipulation in well-understood and isolated contexts.

Neither NLP nor semantic web metadata or tools presently have sufficient faculty for abstracting the knowledge that dynamically integrates sense data or external information with the conditions of human experience. The so-called semantic search officials start with names and addresses because these data have conventionally assigned roles that are rather regular.

When it comes down to it, not many words have such regular and conventional interpretations. It would actually be quite alright if we were just talking about a simple database application, but proponents of the semantic web want to incorporate everything into one giant database and controlled vocabulary. Impossible!

While it appears not to be recognized, it should be apparent that adherence to convention is a necessary yet insufficient condition to hold relevant meaning. An interpretation must cohere with its representation and its existence (as an entity or agent in the world) in order to hold. Consider the case of Iraq and weapons of mass destruction. Adhere, cohere, what’s the difference –it’s just semantics– right? Nonetheless, neither harmony nor coherence can be achieved by directive.

A consequence of the conventional view is that such fully and clearly defined directives leave no room for interpretation, even though some strive for underspecification. The concepts and ideas being represented cannot be questioned because, being explicit directives, they go without question. This is why I believe the common view of words and meaning that many linguists, computer and information experts, like Peter, hold is mistaken.

If the conventional view were correct, the interpretation of words would neither generate meaning nor provide grounds for creating new concepts and ideas. If it were truly the case, as my friend Tom Adi said, natural language semantics would degenerate into taking an inventory of people’s choices regarding the use of vocabulary.

So, I do not subscribe to the common view. And these are the reasons that I debate semantic technologies, even though end-users probably couldn't care less about the techniques being deployed. Because if we are not careful, we will end up learning and acting by directive too. That is not the route I would take to semantic search. How about you?


Looking at the comments on the last post, The Search for Semantic Search, I see there are some interesting interpretations. Let me explain my motives, address any perceived bias and clarify my position.

Alex Iskold wrote that we were asking semantic search engines the wrong questions; that this was essentially the root of the problem with them, and that they were only capable of handling a narrow range of questions– those requiring inference. Among other things, he also wrote that his question about vocation was unsolvable; impossible was the term he used. These ideas, along with the fact that Alex implied Google was a semantic search engine and that vendors must dethrone Google to be successful, motivated me to blog about it myself.

I was criticized, in the comments, for implying that the so-called "semantic search" capability of these NLP-driven search engines is weak, and that due to this they do not really qualify as "semantic search" engines. Actually, Kathleen Dahlgren introduced a new name in her comments: "Semantic NLP". I was also criticized for asking a silly question and for posting my brief analysis of this one single question that Alex said was unsolvable without massively parallel computers.

Of course you cannot judge a search engine by the way it mishandles one or even a few queries. But in this case one natural language question reveals a lot about the semantic acuity of NLP, and the multiple-query idea is a kind of strawman argument intended to distract us; it is misleading enough that it almost proves Alex right.

I do not believe people are motivated to ask wrong questions, and I do not believe people ask silly questions of computer search engines while expecting a satisfactory set of results or answers. Nevertheless, when any case fails, the problem or fault does not lie with the people. The search engine is supposed to inform them. The fault lies with the computer software, for failing to inform. You can try to dismiss it with a lot of hand waving, but just like that pesky deer fly– it is going to keep coming back.

While NLP front ends and semantic search engines are the result of millions of dollars in funding and the long work of brainiacs, and while they may be capable enough to parse sentences, build an index or other representation of terms, and use some rules of grammar, they are not always accommodating or satisfying. In fact they can be quite brittle, or breakable. This means they do not always work. But they do work under the right circumstances, in narrowly defined situations. One of the questions here is whether they work well enough to qualify as "semantic search" engines for English language questions.

Any vendor who comes out in public and claims they are doing "semantic search" should prove it by inferring the significance of the input with sufficient quality and acuity that the result– the search solution or conclusion– satisfies the evidence and circumstance. This is a minimum level of performance. There are tests for this. Many people use a relevance judgment as a measure of that satisfaction, as far as any type of search and retrieval method or software is concerned.

With that said, my last post was about debunking the so-called complex query myth, not about "testing" the capabilities of any search engine. It was about semantic search and how any search engine solves this single so-called impossible question. There were results, and on review I see that they were not completely "useless," as I wrote. I apologize for calling them useless.

Both Cognition and Powerset produced relevant results (with one word) that were more comprehensive than the results Google provides, in my opinion. That is not a natural language process of understanding a sentence, though. Having the capacity to look up a word in a dictionary is not the same as the capacity to referentially or inferentially use the concept– in this case, to make some judgments (distinguish the significant relationships, at least) and inform the search process.

This capability to distinguish significant relationships is a key criterion for "semantic" search engines– meaning they should have a capacity to infer something significant from the input and use it. The results of this query tell a different story. You cannot just profess linguistic knowledge, call the question silly and make the reality it represents go away. This kind of problem is systemic.

As far as the so-called "semantic" search engines inferring anything significant from this (full-sentence case) question (evidence) or circumstance of searching, I treated all the results with equal disaffirmation. What is more, I stand by that, as it is supported on its face. If you look at the results of the full-sentence query at Cognition, you will notice that they are essentially the same as those from Powerset.

I reckon this could be because both engines map the parts of speech and terms from the query onto the already prepared terms and sentences from Wikipedia. This “mapping strategy” clearly fails –in this case– for some pretty obvious reasons. Without pointing out all the evidence I collected, I summed those reasons up as a lack of semantic acuity. That seems to have touched a nerve.

So I will get into the details of this below. Let me first take a moment to address the fact that one inquiry reveals all this information. Really it is not just one inquiry. It is one conceptualization. Dozens of questions can be derived by mapping from the concepts associated to these terms of this single question. For example: Where are the best careers today?; Who has the better jobs this year?; Where can I work best for now?; What occupation should I choose given the times?; etc. I tried them all and more with varying degrees of success.

One problem is that NLP practitioners are concerned with sentence structure while search engineers are concerned with indexing terms and term patterns. Either way, the methods lack a conceptual dimension, and there is no apparent form of any semantic space for solving the problem. The engines have no sense of time points or coordinated space or other real contexts in which things take place. The absence of semantic acuity is not something that affects only a single inquiry. It will infect many inquiries, just as a disease infects its host.

Now that I recognize the problem, if I were challenged to a wager, I would wager that I could easily produce 101 regular English language questions that would demonstrate this affliction. The search engines may produce a solution, except that the results would be mostly nonsense and not satisfying. It would prove nothing more and nothing less than I have already stated. What say you, Semantic Cop?

I should mention that I have long suspected there was a problem mapping NLP onto a search process, though I could not put my finger on it. A literature search on evaluations of text retrieval methods will show, in fact, that the value of part-of-speech processing (in text search and retrieval) has long been regarded as unproven. By taking the time to investigate Alex Iskold's complex query theory, I gained more insight into the nature and extent of this problem. It is not just a problem of finding a definition or synonyms for a given term, as some readers may infer. Let me explain further.

While Powerset, Cognition and Hakia each had the information that a vocation is a kind of altruistic occupation, and the search circumstance (a hint) that the information seeker could be looking for an occupational specialty or career, they did not really utilize that information. The failure, though, really wasn't with their understanding of the terms occupation or vocation. Their failure was specifically related to the NLP approach to the search process. That is supported by the fact that these different search products employing NLP fail in the same way.

That should not be taken to mean that the products are bad or useless. Quite to the contrary, the product implementations are really first class productions and they appear to improve the user experience as they introduce new interface techniques. I think NLP technology will continue to improve and will eventually be very useful, particularly at the interface as Alex noted in his post. But does that make them semantic search engines?

Lest I have been ambiguous, let me sum up and clarify by referring back to the original question: Whether you are looking at Powerset, Cognition or Hakia results, they clearly did not understand the subordinate functionality of the terms /best/, /vocation/, /me/ and /now/ in the sentence.

They clearly could not conceptualize 'best vocation' or 'now'– they could only search for those keyword patterns in the index or data structures created from the original sentences. That is not just 'weak' semantics; that is not semantic search at all. Maybe they "understood" the parts of speech, but they did not infer the topic of inquiry, nor did they properly map the terms into the search space. Google did not fare any better in this case, but Google does not claim to be a semantic search engine. So where's the semantics?

By that I mean (for example) that interpreting /now/ in the natural language question 'what is the best vocation for me now' as an adverb does not improve the search result. Treating it as a keyword or arbitrary pattern does not improve the search result either. And it demonstrates a clear and present lack of acuity and understanding of the circumstance.

Finding the wrong sense of /now/ and showing it is of dubious value. An inference from /now/ to today, at present, this moment in time, or to this year or age, and using that as evidence leading to an informed conclusion, would demonstrate some semantic acuity in this case. Most people have this acumen; these NLP search engines obviously do not, according to the evidence in this case.
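
To make that concrete, here is a toy illustration (my own, with a made-up grounding table) of the inference being asked for: grounding /now/ to a usable time window instead of matching it as a keyword or tagging it as an adverb:

```python
from datetime import date, timedelta

TEMPORAL_GROUNDINGS = {                # illustrative, not any engine's table
    "now": timedelta(days=365),        # roughly "this year or age"
    "today": timedelta(days=1),
    "currently": timedelta(days=365),
}

def ground_time(term: str, today: date = date(2008, 7, 1)):
    # Map a temporal term to a date range a retrieval step could filter on.
    window = TEMPORAL_GROUNDINGS.get(term.lower())
    if window is None:
        return None                    # not a temporal term
    return (today - window, today)

print(ground_time("now"))
# (datetime.date(2007, 7, 2), datetime.date(2008, 7, 1))
```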

The NLP vendors defend this defect by accusing people of not asking the right question in the right way or not asking enough questions. That is like me saying to my wife:

If you want some satisfying information from me you better use a lot of words in your question and it better not be silly. Don’t be too precise and confuse me and don’t use an idiom and expect me to satisfy you. I’ll still claim to understand you. It is you that asks silly questions. That not being enough, you also have to nag me with more long and hard questions before you say my responses are rubbish.

If I should either desire or dare to do that at all, what do you think her response would be? More importantly what do prospects say when you tell them their questions are silly?

I do not need to proceed with a hundred questions when, with a dozen or so, I have enough evidence to deduce that these NLP-driven search engines are limited when it comes to inferring the topic of inquiry. In some cases they are simply unable to draw on the significant structures or patterns of input, evidence or circumstance and produce a suitable solution.

What bothers me is that some of these so-called "semantic search engines" claim to "understand" in general. I did that too, a very long time ago. Yes, I was there in the back of the room at DARPA and NIST meetings, and I have been at PARC and the CSLI for presentations. I was challenged then. And it enlightened me. If such claims go unchallenged, it will only serve to demean the cultural arts and independent critical thinking, and to confuse prospects about the capabilities regular people expect of semantic products. I do not wish to lower the bar.

In this instance, and there are many similar cases that could be derived from the semiotic characteristics of this instance, the NLP-driven engines do not show the slightest acumen for inferring the topic of inquiry. I hope the discerning reader sees that it is not just about some synonyms. If they could infer the topic of inquiry, that would demonstrate a little understanding… at least a capacity to learn.

The result, in all such cases, is that these so-called "reasoning engines" and semantic search engines do not lead us to a satisfying consequence or conclusion at all. They have technical explanations such as synonymy and hyponymy for any word, yet if the software cannot infer the sense of everyday terms, is it even sensible to call the methods "semantic"? Just because the vendors profess linguistic knowledge does not mean their semantics are any more than just another marketing neologism.

It may be called semantic NLP but that does not qualify as semantic search in my opinion.


In a recent Read Write Web article that was much more myth than reality, Alex Iskold posits that a semantic search engine must dethrone Google (myth 1). Fortunately, by the end of his article he concludes that he was misled into thinking that. I do not think he was misled at all. I just think he is confused about it all.

He posits a few trivial queries (myth 2) to show that there is "not much difference" between Powerset and Hakia and Freebase (myth 3), and that semantic search "is no better than Google search" (myth 4). After that, Alex writes that there is a set of problems suitable to semantic search. He says these problems are wonderfully solved (already) by relational databases (myth 5).

It makes one wonder why we should mess with semantic search if the problems are already solved. It is not true; that is why. Neither was any of the talk about query complexity true.

It is not all these myths, exactly, but unclear thinking that leads to false expectations as well as false conclusions. Alex seems to be confused about the semantic web and semantic search. These are two different things, but somehow Alex morphs them into one big database. Because I want this post to impart some information instead of just being critical of a poorly informed post, let me start by debunking the myths.

Myth 1: Semantic Search should dethrone Google

For many search problems, semantics plays no role at all. Semantics plays a very limited role when the query is of a transactional nature, e.g., a search problem of the type: find all x.

Google is a search engine that solves search problems of this type. Yet the Google kingdom is based on being a publisher. Google uses super-fast and superficial keyword search to aggregate dynamic content from the internet for information seekers and advertisers alike. Google’s business does not even lend itself to semantic search for some very obvious reasons having to do with speed and scalability. Google’s best customers know exactly what they want and they certainly do not want any “intellectual” interpretations.

None of the so-called semantic search engine companies that I know of are pursuing a business strategy to dethrone Google as an information-seeker's destination of choice. Powerset, for example, is not aggregating dynamic content like Google. Its business model does not seem to be based on a publishing or advertising model.

Powerset is using their understanding of semantics to assist the user (of Wikipedia) in relating to that relatively static content from several different mental, rational and conceptual perspectives. This is meant to assist the information-seeker with interpreting the content. That is a good and valid application of semantics.

This is not the position a company seeking to unseat Google would take. A company seeking to unseat Google would be better positioned by producing technology to assist advertisers in classifying, segmenting and targeting buyers.

Myth 2: Trivial and Complex Queries

Unfortunately, Alex did not supply any complex examples in his post. He tried to imply that his trivial queries were complex, and that the most complex was impossible to solve. This was the query labeled impossible: "What's the best vocation for me now?" I will use Alex's query to debunk his misguided assumptions. First, let's clarify by looking at the search problems represented by Alex's natural language queries.

Note 1: Alex offers the first query as impossible to solve. It must be because Alex is expecting a machine and some software to divine his calling, based presumably on his mood now and some mind-reading algorithm. I should hope most people would seek a human counselor rather than rely on the counsel of a semantic search engine for addressing their calling. It is fair to use a search engine to find a career or occupation, and it is valid to expect a semantic search engine to "understand" the equivalence relationship between the terms occupation and vocation in this context.

As I suggested, best + vocation, or just vocation alone, is a simple query that should be easy to satisfy. However, this simple search fails on all search engines. Even so-called semantic search engines have a problem with this query (see the comparative search results under myth 4 below). It is not because it is a complex query. It is because Alex used the word vocation. This word is not frequent, and search engines do not know its synonyms. It is a complex concept in that it takes semantic acuity to "understand" it. No one talks about semantics in terms of acuity, though.

Nonetheless, a search for vocation + best, sorting the results by most recent date, will create a valid search context in which one can reasonably expect a solution from a semantic search engine. Most people, I am assuming, would have a more reasonable expectation than Alex; one that may be fulfilled by this internet page suggested by Readware:

A semantic search engine needs semantic acuity to "understand" that the concept of a vocation and the concept of an occupation are related. Obviously none of the search engines mentioned in Alex's article have such acuity of understanding. Some of the search engines tried to process the pronoun me and the word now. Instead of being part of a solution, this created a problem, as can be seen in the search results (under myth 4) below.
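
A sketch of the missing step might look like the following; it expands a low-frequency term with related concepts via WordNet before searching (using NLTK; the exact synsets returned depend on the installed WordNet version, so treat the output as illustrative):

```python
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

def expand(term: str) -> set[str]:
    related = {term}
    for synset in wn.synsets(term, pos=wn.NOUN):
        related.update(l.replace("_", " ") for l in synset.lemma_names())
        for hypernym in synset.hypernyms():    # broader concepts
            related.update(l.replace("_", " ") for l in hypernym.lemma_names())
    return related

# An engine with this step would also search occupation/career/calling
# pages, instead of failing on the rare keyword 'vocation' alone.
print(expand("vocation"))
```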

Note 2: This query needs a search engine with more exotic search operators than a simple keyword search engine might provide. The query, however, is not complex. Some search engines may index US Senator as a single item to facilitate such a search. A search engine would need extended Boolean logic to process phrases using a logical AND between them. A more seasoned search engine, such as Google, will parse and logically process the query from a search box, without any specifying logic, and return an acceptable result. NLP-based engines (like Hakia and Powerset) try to do this too. They use propositional logic instead of Boolean logic. The effects are not very satisfying, as can be seen below (in the search results listed under myth 4).

A more sophisticated and indeed “semantic” search engine may interpret foreign entity according to a list of “foreign entities”. It would take some sophisticated semantics to algorithmically interpret what type of labels may be subordinate to foreign entity. For example: A German businessman, a Russian bureaucrat, a Japanese citizen, an American taxpayer. Which is the foreign entity?
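
A toy example (the names and country codes are mine) shows why the question resists a fixed list: foreignness is relative to a reference country supplied by context:

```python
ENTITIES = {                        # illustrative inventory of labels
    "German businessman": "DE",
    "Russian bureaucrat": "RU",
    "Japanese citizen": "JP",
    "American taxpayer": "US",
}

def foreign_entities(reference_country: str) -> list[str]:
    # "Foreign" only acquires a value once the point of reference is fixed.
    return [name for name, country in ENTITIES.items()
            if country != reference_country]

print(foreign_entities("US"))   # everyone except the American taxpayer
print(foreign_entities("JP"))   # everyone except the Japanese citizen
```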

Yet, it is also clear that an inventory of labels can be assembled under certain contexts. Building such an inventory constitutes the building of knowledge. A semantic search engine should help inventory and utilize knowledge for future researches. None of the semantic search engines that Alex mentioned do anything like this. Readware does do this.

Note 3: This search would benefit from a search engine that recognizes names. I think Hakia has an onomastic component. I am not sure about Powerset. However, this search works on nearly any search engine, because there are plenty of pages on the web that contain the necessary names and terms. Otherwise there is nothing complex about this query.

The reality, as you can see, is that every query Alex offered is trivial. Yet they demonstrate what is wrong with so-called "semantic search". That is, today's semantic search products, including the NLP-powered search engines masquerading as "semantic" search, fail at real tests of semantic acuity. Before I get into the evidence, though, let me say something about semantic search technologies in general.

Semantic Search Technologies

There are no public semantic search engines today. There are search engines and there are search engines with Natural Language Processors (NLP) that work on the indexing and query side of the search equation. Whether or not databases are used to store processed indexes or search results, databases and database technology like RDBMS and SQL have nothing to do with it.

The search engines that have the capacity for natural language processing usually claim to “understand” word and/or sentence semantics– in a linguistic sense. This usually means that they understand words by their parts of speech, or they can look up definitions in a resource. Hakia and Powerset fall into this class, as does Cognition and several other search engines both in the U.S. and abroad. These are called semantic search engines and they claim to understand word sense and do disambiguation and so forth and so on, but as I will show below: at questionable acuity.

Google is not a semantic search engine at all. While Hakia and Powerset may represent some small part of the spectrum of semantic search engines, they are hardly representative of semantic search. Along with Freebase and Powerset, more representative of "semantic web" search are SWSE, Swoogle and SHOE.

Besides these semantic web search engines, there are semantic search engines akin to Hakia, such as Cognition, as mentioned in this article at GigaOM, along with my own favorite, Readware. So, in summary, Alex's comparison is not representative and is really poor evidence.

Myth 3: No difference between Powerset and Hakia and Freebase.

Well, this is just ridiculous. It is not only a myth, it is pure misinformation. Nothing could be further from the truth. While Powerset and Hakia use NLP technology that could be construed as similar, Freebase is essentially an open database that can be queried in flexible ways. Freebase and Powerset happen to be somewhat comparable because Powerset works on Wikipedia and uses RDF to store data, and semantic triples (similar to POS) to perform some reasoning over the data. Freebase also stores Wikipedia-like data in RDF tuples.

It is probably also worthwhile to mention that Hakia's NLP comes from the long-time work and mind of the eminent professor Victor Raskin and his students. Powerset's NLP comes from the work of Ronald Kaplan, Martin Kay and others connected with the Palo Alto Research Center, Stanford University and the Center for the Study of Language and Information (CSLI). Cognition's technology is based on NLP work done by Kathleen Dahlgren.

While Hakia, Powerset and Cognition represent these notable NLP approaches, their search methods and results show they do not know a great deal about search tactics and solutions. They do not seem to be successful in mapping the sentence semantics into more relevant and satisfying search results. It seems, from the evidence of these queries, they only know how to parse a sentence for its subject, object and verb and, a lot like Google, find keywords.

Myth 4: Semantic Search is No Better than Google.

Hakia and Powerset are like neophytes in Google's world of search. That alone makes these engines no better than Google. Yet that does not apply to semantic search in general. The truth is that the semantics of the search engines we are talking about (Hakia, Powerset, Freebase and SearchMonkey) do not appear to make the results any worse than those from Google. Let's take a look at the Google search results for 'What is the best vocation for me now':

As may be predicted, the results are not very good (because the keyword vocation is not popular). Google also wants to be sure we do not mean 'vacation' instead of vocation. Hakia, on the other hand, strictly interpreted the query:

Just like the results from Google, these are not very satisfying. You might think that because Hakia is a semantic search engine, it would have the semantic acuity to “understand” that vocation and occupation are related. As you can see in the following search result, this could not be farther from the truth:

Not one of Hakia’s results had to do with occupational specialties or opportunities for career training and employment. Powerset did not produce any results when the term vocation is used and it really had nothing on occupation so it searched for best + me + now. There is nothing semantic about that and it is a pretty bad search decision as well. The results are useless; I will post them so you can judge for yourself:

When you have results like this, it really does not matter what kind of user interface you have. A bad or poor interface only makes the experience that much worse, and even a good, fancy, slick or useful interface won't improve irrelevant results.

Another so-called semantic search engine, Cognition, did not fare any better:

The search result above is useless as anything more than a starting point for further investigation, as is the search for occupation:

I was actually mildly surprised that Cognition related the term occupation to the acronym MOS, which means Military Occupational Specialty. Then I saw that it did not distinguish the acronym from other uses of that three-letter combination. Again, not a very satisfying experience. I did not leave Freebase out; I just left it until last. All the Freebase results do is confirm that a vocation is an occupation or a career:

It was not possible for Freebase to accept and process the full query. As this result shows, the data indicates that a vocation is also known as an occupation, yet none of these engines realize that fact.

Myth 5: Already solved by RDBMS.

If the search problem, or the "semantic" problem, could be solved by an RDBMS, Oracle would be ten times the size of Google, and Google, if it existed at all, might be using Oracle's technology. None of these problems (aggregated search, semantic parsing of the query and text, attention, relevance) is solved by any RDBMS. But Alex brushed over the real problems to claim that it all comes down to the user interface and that the semantics only matter there. I suppose that was the point he was trying to make by including Search Monkey in his comments. This is just hogwash, though; by that I mean it is not true and is in fact misleading.

Conclusions

It is plain to see that a semantic search engine needs acuity to discern the differences and potential relations that can form between existing terms, names and acronyms. It is also plain to see that none of the commercial crop of search engines has it. The natural-language search engines, which have dictionaries as resources, do not associate vocation with occupation (for example) and therefore cannot offer any satisfying search results.

There are an estimated 350,000 words in the English language. How many do you suppose are synonymous and present a case just like this example? Parsing a sentence for its subject, object and verb is fine, but it does not mean the parse will be useful or helpful in producing satisfying search results.
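To make the missing step concrete, here is a minimal sketch of query expansion over a hand-built synonym map. The map and function names are hypothetical illustrations, not any engine's actual resource.

    # Toy query expansion: relate a query term to its synonyms before
    # matching, the step the engines above failed on (data invented).
    SYNONYMS = {
        "vocation": {"occupation", "career", "calling", "profession"},
    }

    def expand(query_terms):
        """Add known synonyms of each query term to the search terms."""
        expanded = set(query_terms)
        for term in query_terms:
            expanded |= SYNONYMS.get(term.lower(), set())
        return expanded

    print(expand(["best", "vocation"]))
    # {'best', 'vocation', 'occupation', 'career', 'calling', 'profession'}

Even this crude expansion would have put occupational results in front of the user; real semantic acuity requires relating terms by concept, not just by a static list.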

It is foolish to think that NLP will be all that is needed to obtain more relevant search results. The fact is that search logic and search tactics are arts that are largely unrelated to linguistics, NLP or database technology. While language has semantics, testing the semantics of so-called semantic search engines has demonstrated that those semantics, if they are semantics at all, are pretty weak. I have demonstrated that semantic acuity plays a large role in producing more relevant and satisfying search results. A semantic search engine should also help inventory and utilize knowledge for future research. An informed search produces more satisfying results.

Read Full Post »

I would like to address the few questions I received on parts 1, 2 and 3 of the semantics of interpersonal relations. The first and most obvious question was:

I don’t get it. What are the semantics?

This question is about the actual semantic rules that I did not state fully or formally in any of the three parts. I only referred to Dr. Adi’s semantic theory and related how the elements and relations of language (sounds and signs) correspond with natural and interpersonal elements and relations relevant to an embodied human being.

Alright, so a correspondence can be understood as an agreement or similarity, and as a mathematical and conceptual mapping (a mapping onto inner thoughts). What we have here, essentially, is a conceptual mapping: language apparently maps to thought and action, and vice versa. So the idea is to understand the semantic mechanism underlying these mappings and to implement and apply it in computer automations.

Our semantic objects and rules are not like those of NLP or AI or OWL, or those defined by the Semantic Web. These semantic elements do not derive from the parts of speech of a language, and the semantic rules are not taken from propositional logic. So that these semantic rules will make more sense, let me first define the conceptual space where they operate.

Conceptually, this can be imagined as a kind of intersubjective space. It is a space encompassing interpersonal relationships and personal and social interactions. This space constitutes a substantial part of what might be called our "semantic space", where life as lived (what the Germans call Erlebnis) and ordinary perception and interpretation (Erfahrung) intersect, and where actions in our self-embodied proximity move us to intuit and ascribe meaning.

Here in this place is the intersection where intention and sensation collide, where sensibilities provoke the imagination and thought begets action. It is where ideas are conceived. This is where language finds expression. It is where we formulate plans and proposals, build multidimensional models and run simulations. It is the semantic space where things become mutually intelligible. Unfortunately, natural language research and developments of “semantic search” and the “Semantic-Web” do not address this semantic space or any underlying mechanisms at all.

In general, when someone in the computer industry talks about "semantics", they are talking either about English grammar, about RDF triples, or about propositional logic in a natural or artificial language, e.g., a data definition language, a web services language, description logic, Aristotelian logic, etc. There is something linguists call semantics, though the rules are mainly syntactic rules with limited interpretative and predictive value. Those rules are usually applied objectively, to objectively defined objects, according to objectively approved vocabulary defined by objectively-minded people. Of course, it is no better to define things subjectively. Yet there is no need to remain in a quandary over what to do about this.

We do not live in a completely objective, observable or knowable reality, nor in a me-centric or I-centric society; it is a we-centric society. The interpersonal and social experience that every person develops from birth is intersubjective: each of us experiences the we-centric reality of ourselves and others entirely through our own selves and our entirely personal world view.

Perhaps it is because we do not know and cannot know– through first-hand experience at least– what any others know, or are presently thinking, that there is this sort of dichotomy that sets in between ourselves and others. This dichotomy is pervasive and even takes control of some lives. In any case, conceptually, there is a continuum between the state of self-realization and the alterity of others. This is what I am calling the continuum of intersubjective space.

A continuum, of course, is a space that can only be divided arbitrarily. Each culture has its own language for dividing this space. Each subculture in a society has its own language for dividing this space. Every technical field has its own language for dividing the space. And it follows, of course, that each person has their own language, not only for dividing this space but for interacting within its boundaries. The continuum itself, though, remains untouched and unchanged by interactions or exchanges in storied or present acts.

The semantics we have adopted for this intersubjective space include precedence rules formulated by Tom Adi. Adi’s semiotic axioms govern the abstract objects and interprocess control structures operating in this space. Cognitively, this can be seen as a sort of combination functional mechanism, used not only for imagining or visualizing, but also for simulating the actions of others. I might add that while most people can call on and use this cognitive faculty at will, its use is not usually a deliberate act; it is mainly used subconsciously and self-reflexively.

We can say that the quality of these semantics determines the fidelity of the sound, visualization, imitation or simulation to the real thing. So when we access and use these semantics in computer software, as we do with Readware technology, we are accessing a measure of the fidelity between two or more objects (among other features). This may sound simplistic, though it is a basic-level cognitive faculty: consider how we learn through imitation. Consider, too, the cognitive load of switching roles, and how easily we can take the opposite or other position on almost any matter.

We all must admit, after careful introspection, that we are able to "decode" the witnessed behavior of others without exerting the sort of conscious cognitive effort required for describing or expressing the features of such behavior in language. It may be only because we must translate sensory information into sets of mutually intelligible and meaningful representations, in order to use language to ascribe intentions, order or beliefs to self or others, that the functional mechanism must also share an interface with language. It may also be because language affords people a modicum of command and control over their environment.

Consider the necessity of situational control in the face of large, complex and often unsolvable problems. I do not know about you, but I need situational control in my environment and I must often fight to retain it in the face of seemingly insurmountable problems and daily ordeals.

Now try and recognize how the functional aspects of writing systems fill a semiotic role in this regard. Our theoretical claim is that these mutually intelligible signs instantiate discrete abstract clusters of multidimensional concepts relative to the control and contextualizing of situated intersubjective processes.

As particles and waves are to quantum physics, these discrete intersubjective objects and processes are the weft and the warp of the literary arts and anthropological sciences, woven on the loom of human culture. We exploited this functional mechanism in the indexing, concept-analysis, search and retrieval software we call Readware.

We derived a set of precedence rules that determine interprocess control structures and give us root interpretation mappings. These mappings were applied to the word roots of an ancient language, selected because modern words derived from those roots are still in use today. The few thousand resulting root interpretations (formulas) were organized into a library of concepts, a ConceptBase, used for mapping expressions both within one language and across different languages. It was a very successful approach, for which we designed a pair of REST-type servers with an API to access all the functionality.
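For readers who want to see the shape of the thing, here is a minimal sketch of a root-to-concept library. The roots, the interpretation formulas and the lookup logic are invented stand-ins for illustration, not the actual contents of Readware's ConceptBase.

    # Toy concept base: word roots carry interpretation "formulas", and
    # surface words from different languages resolve to the same root.
    # All entries are invented placeholders, not Adi's actual mappings.
    CONCEPT_BASE = {
        "ktb": "assignment-of-signs",    # hypothetical root formula
        "drs": "engagement-with-order",  # hypothetical root formula
    }

    WORD_TO_ROOT = {
        "kitab": "ktb",   # Arabic surface word
        "maktab": "ktb",  # Arabic surface word
        "scribe": "ktb",  # English word mapped by concept, not spelling
    }

    def concept_of(word):
        """Resolve a surface word to its root's interpretation formula."""
        root = WORD_TO_ROOT.get(word.lower())
        return CONCEPT_BASE.get(root) if root else None

    # Words with different spellings, from different languages, meet
    # in the same concept:
    assert concept_of("kitab") == concept_of("scribe")

The point of the design is that the library is keyed by concept formulas rather than by the arbitrary spellings of any one language, which is what makes cross-lingual mapping possible at all.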

To make this multi-part presentation more complete, I have posted a page with several tables drawn up by Tom Adi, along with the formal theory and axioms. There are no proofs here as they were published elsewhere by Dr. Adi. These tables and axioms identify all the key abstract objects, the concepts and their interrelationships. Tom describes the mappings from the base set (sounds) and the axioms that pertain to compositions and word-root interpretations, together with the semantic rules determining inheritance and precedence within the control structures. You can find that page here.

And that brings me to the next question, which was: how can you map concepts between languages with centuries of language change and arbitrary signs? The short answer is that we don't. We map the elements of language to and from the elements of what we believe are interlinked thought processes that form mirror-like abstract and conceptual images (snapshots) of perceptive and sensory interactions in a situated intersubjective space.

That is to say that there is a natural correspondence between what is actually happening in an arbitrary situation and the generative yet arbitrary language about that situation. This brings me to the last question that I consider relevant no matter how flippant it may appear to be:

So what?

The benefits of a shared semantic space should not be underestimated, particularly in this medium of computing, where scaling of computing resources and applications is necessary.

Establishing identity relations is important because it affords us the capacity to better predict the consequences of the ongoing and future behavior of others. In social settings, the attribution of identity status to other individuals automatically contextualizes their behavior. Knowing that others are acting as we would, for example, effectively reduces the cognitive complexity and the amount of information we have to process.

It is the same sort of thing in automated text processing and computerized content-discovery processes. By contextualizing content in this way (e.g., with Readware), we dramatically reduce the amount of information we must process from text, we more directly access and cluster relevant topical and conceptual structure, and we support further discovery processes. We have found that a side-effect of this kind of automated text analysis is that it clarifies data sources by catching unnatural patterns (e.g., auto-generated spam), and it also helps identify duplication and error in data feeds and collections.
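As a minimal sketch of that side-effect, assume a stand-in concepts() reducer in place of Readware's real analysis: two texts that collapse to the same concept signature are flagged as duplicates even when their wording differs. Everything here is an invented illustration, not Readware's implementation.

    # Toy duplicate detection over concept signatures (illustrative only).
    def concepts(text):
        """Stand-in concept extractor: lowercased, crudely stemmed words."""
        return frozenset(w.lower().rstrip("s") for w in text.split())

    def find_duplicates(docs):
        """Flag documents whose concept signatures coincide."""
        seen, dupes = {}, []
        for doc_id, text in docs.items():
            sig = concepts(text)
            if sig in seen:
                dupes.append((seen[sig], doc_id))
            else:
                seen[sig] = doc_id
        return dupes

    feed = {"a": "economy trends boom", "b": "boom trend economy"}
    print(find_duplicates(feed))  # [('a', 'b')]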

Read Full Post »

I promised in the last post that I could offer a solution to the disconnect between what search engines locate and what people think is relevant. Now there is nothing wrong with search engines as long as you know what you are looking for and it has a uniquely relevant name or handle. Some search engines are better than others and most are very good at what they do. The disconnect happens when you do not know what to look for, or you are looking for interrelationships that are often ambiguously expressed.

Besides the search process itself, there are many applications that could benefit from more intelligent computer processing. The man-machine interface could be vastly improved if computers could only understand what is important to people, how they talk, what they mean. To most computer scientists this means that the terms we use need to be defined.

This is true; yet sometimes a word definition is not enough to inform the computer process. Try looking for trends, for example. Knowing the definition of a trend and how it is used in a sentence is not sufficient for spotting a trend in a series of reports or documents. We need to know the semantics, the significant elements and relations, of trends. Matching the word, or a synonym of the word, against an index of terms is only somewhat helpful.
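Here is a minimal sketch of the difference, assuming an invented list of trend indicators: the spotter looks for the significant elements of a trend across a series of reports, not for the literal word "trend".

    # Toy trend spotter: a trend is indicated by direction-of-change
    # vocabulary recurring across reports (indicator list is invented).
    TREND_INDICATORS = {"increase", "decrease", "boom", "decline",
                        "rising", "falling", "growth"}

    def trend_signal(reports):
        """Fraction of reports containing at least one trend indicator."""
        hits = sum(
            any(w.strip(".,").lower() in TREND_INDICATORS
                for w in report.split())
            for report in reports
        )
        return hits / len(reports)

    series = ["Housing starts are rising again.",
              "Analysts report growth in exports.",
              "No change in unemployment this quarter."]
    print(trend_signal(series))  # ~0.67: a possible trend worth surfacing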

In this post I will tell you about the technology I have helped produce, and mostly about how the underlying techniques address the root problem of semantics, or rather the lack of semantics, that is holding back the development of more intelligent systems. Most experts would agree that a unified semantic theory is necessary for progress in intelligent computing platforms and programs.

Since the mid-1980’s, I have been working on this problem with my associate Dr. Tom Adi. Together, we have produced at least a half-dozen products from the semiotic and intelligent techniques we have developed. Most of these have to do with processing and classifying plain text, web pages, documents, messages, etc., The technology is real, proven scalable and adaptable, and we have paying customers using it in one form or another.

The techniques we are perfecting achieve some, though not all, of the same objectives as the semantic web, particularly in the case of semantic search techniques. We support, or intend to support, all the W3C standards. As it stands today, we have what can be called a RESTful interface delivered over HTTP services. The technology is called Readware technology.

The Readware IpServer is a set of software services and an application programming interface (API) that provides text and data indexing, classification, and search and retrieval. It is a programmable semantic search engine with its own resources. Unlike semantic web technologies, the language and knowledge resources are provided in the package, and the meta-data is automatically generated by the engine.
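To give a feel for what "programmable over HTTP" means, here is a hypothetical client session. The host, paths, payloads and response shape below are my assumptions for illustration, not the actual IpServer API.

    # Hypothetical session against a Readware-style REST service.
    # Endpoint names and JSON shapes are invented for illustration.
    import requests

    BASE = "http://localhost:8080"  # assumed local server instance

    # Index a document; the engine generates its own meta-data.
    requests.post(f"{BASE}/index", json={
        "id": "doc-1",
        "text": "Housing starts are rising again this quarter.",
    })

    # Search; the semantic resources live server-side, not with the caller.
    resp = requests.get(f"{BASE}/search",
                        params={"q": "trends in the economy"})
    for hit in resp.json().get("hits", []):
        print(hit["id"], hit.get("score"))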

Unlike most uninformed search engines (a la Google) that match search terms to indexed terms, Readware categorizes terms and phrases while indexing and can equate them, informing the search with measures of fidelity, kinship and affinity between them. These measures encompass linguistic functions, such as synonymy and word disambiguation, and more. However, Readware does not use the kind of rudimentary logical inference or natural language processing that our modern counterparts Hakia or Powerset do.

Instead, Readware simply and intelligently maps text according to a cognitive model with semantics that are reflective of interpersonal relationships. Readware search algorithms conduct a competitive exegesis to locate relevant bonds, connections and associations between the elements in the information space. We know the connections and associations found will be mostly relevant because they are reflective of interpersonal relationships that are important or significant to people and which are symptomatic of human experience.

Let me define what I mean by interpersonal relationships. Interpersonal relationships are the body of relationships that form (and break):

  1. between people;
  2. between their perceptions and beliefs about the events and happenings around them (often set down in text); and
  3. on account of their own actions and interactions in the world.

Of course everyone thinks and acts differently; each person has their own experiences and point of view, beliefs and perceptions. Even so, in my experience there is a universal and unified framework, a system and semantics, that underlies and rules the ways people tend to think, speak and act, along with the interpersonal relationships that form between people.

I am not saying I can tell what people are thinking and I am not talking about religion. I am talking about the fundamentals of interpersonal relationships. These are ontological relationships that stem from the human condition and what it means to be human. These relationships have more to do with personal control, affinity, consanguinity, engagement and interaction with others, trust, and the making and breaking of bonds, than the truth values of the conventional structure of sentences.

In any case, having stated quite clearly where our semantics are seated, and that they are rather different from those of our contemporaries, let me start back at the beginning.

I started working with computers in the late 70’s. I became involved in the design of “information-systems” that delivered actionable information. At that time computers were being used as a kind of store and forward device. It seems not much has changed. Initially, I was working on the input, editing and communications side. It was not until after 1981, when I became involved with the output side, that I realized computers should do more than only store and forward. They should classify, filter, route and perform other intelligent functions.

I began looking into what it would take to make more intelligent programs, particularly in text and language processing. If I had not been so naive at that time of my life, I might have realized that I did not even know what it means to be intelligent; neither did anyone else. I felt we were all looking for theories about intelligence, human psychology and language use.

When I met Tom Adi in the early 80’s, I was intrigued when he told me his hypothesis about meaning and interpretation of language: that the elements and relations of natural languages should abstractly correspond to elements and relations of other natural phenomena at all times. When I asked him how he defined the elements of the system, he said that he had not figured that out yet. I will come back to this later, let me just say a few things about the qualities of some systems that everyone reading will recognize.

One of the things that defines a system is that it comprises a set or body of elements with specific interconnecting relationships. Examples are our own internal circulatory and metabolic systems, and the solar system. The body, heart, veins and blood are interrelated. Each planet in the solar system has a relationship to every other planet and to the sun.

And of course nearly everyone reading will be familiar with the personal computer and the relationship between the movement of the mouse and the pointer on the computer screen. These sorts of indispensable elements and consequential relationships are what I am talking about: they are the significant, or semantic, elements and relations.

In natural systems the significant (semantic) relationships are fixed or invariant, stable or constant, and universal in their application. That the planets continue to exist day after day, and that they revolve around the sun in the order and way they do, is inherited from the earliest days of planet formation. The initial configuration and relationships are inherited through time and space. There is a "natural" order to the solar system that does not seem to change over time; though everything changes in time, certain elements and relations, as I have described above, seem rather stable, constant, predictable and universal.

It can be hard to think of human behavior as systematic, or to consider that human interactions and activities have specific elements and relationships like those between the planets, their moons, and the sun of our solar system. On the surface it seems not to be so. Yet the truth is that there are fundamental, ontological, elements and relations of interpersonal activities. Because the acts of speaking, reading and writing are natural phenomena and also forms of interpersonal activity, it makes sense that there are elements and relations of natural languages that correspond to elements and relations of interpersonal relationships.

It is probable that, when systems of writing were first devised, the semantic relationships between the sounds comprising a word and the abstract associations inspired by sound combinations were well understood by those devising the writing systems. Yet because sounds degenerate over time, and natural languages change and exchange names and words, those relationships became confused and even forgotten, even though the systems of writing, though reformed, survive today.

This is because phonetic systems of writing, like our own western alphabet, were humanity’s first interpersonal memory. They were not designed to preserve the conventions, trappings and sophistry of the philosophers and keepers of human languages. They were designed to preserve personal (and abstracted) thoughts, experience and human memories.

With the advent of the alphabet, for the first time in history, people were able to transmit and exchange their abstract thoughts and their own desires with other people. The Romans became fond of writing and reading love poems and romance novels soon after they acquired their writing system. Western language, science, law, religion and most artifacts of “human culture” began to blossom and spread with the spread of written languages.

Yet while written languages are widely viewed as "systems of writing," many modern linguists have not considered the relations between the elements of written language (letters), or their relationships to sounds, to words, or to anything else, as worthy of scientific inquiry. This explains the disconnect between the ways computers compute semantic relationships in text and the ways people read, recognize and think about the world at large.

Computer scientists and programmers really cannot be blamed: they depended and relied upon linguists. And linguists, being conventional in all regards to language and preoccupied with the relationships within sentences, were either unable or unwilling to consider the interrelationship between abstract thought and the ways people act, read and write.

Today, our modern writing systems are highly refined instruments and devices clothing the thoughts, opinions, politics and religions of the human race. That brings us back to interpersonal relationships, and beliefs and perceptions, which are the factors driving society and culture, and to our interactions in the world. The interrelationships between all these things can be found in text, in literature and in messages and instructions.

The Readware platform is based on scientific research into the interrelationship between natural languages like Arabic, English, German and Russian, for example, and the ways people use words to preserve ideas, culture and the human society. It is not the same thing as looking for relationships in external artifacts such as the truth values between an arbitrary sentence and a possible world. The technology implements the ontology and semantics in a computing platform that can be used by anyone that can use a computer.

In the next part I will tell you more about how we found the system of elements and relations in language and how they correspond to human thinking and activity. In the meantime, feel free to drop me a comment or ask questions.

Read Full Post »

Things do heat up in the summertime, and some say there is competition brewing among the Natural Language vendors that are offering search services.

Over at the Conceptualist, Sahar Sarid comments on whether 30 years of research is enough to beat Google. Citing Michael Reisman for MIT Technology Review, he thinks semantic search is important, but he believes that digging relationships out of text is not as useful as personalization and understanding the user's intent.

I cannot argue against the importance of understanding the user’s intent, but personally, I don’t think any search technology, with or without a personalization feature, is “enough” to beat Google. Google is so much more than a search engine at this stage, their business will be hard to upset.

On the other hand, there certainly seems to be a competition brewing, at least in the views of some bloggers and the opinions of the technology press. And the competition is about semantic search offerings, or so they say. Over at the Read/Write Web, Bernard Lunn claimed that the money seems to be riding on the NLP systems. It does not feel right to him, and not to me either.

These NLP systems, along with the AI of the Semantic Web and lexical resources such as WordNet, are each in themselves great and powerful systems. Yet each is like the old Roman numbering system: these modern linguistic systems have an effect on people using the Internet similar to the effect the Roman numeral system had on the ancient Romans.

Roman Numerals were a numbering system that prevented an entire civilization from doing any higher math. You can read what Thomas Frey has to say about it here. The proponents of NLP and AI systems from 30 years ago have tried to prevent research into other viable semantic methods.

They have blocked the widespread development of semantic techniques capable of processing real and conceptual relationships between words and names and topics or subjects of interest, in favor of extracting part-of-speech relations. This matters a great deal: because language deals with everything, and human semantics are universal, getting the fundamentals wrong here mucks up the entire works. It makes things more complex than they need to be, and more expensive. That is the state of affairs today.

The ways and means of NLP systems and functional grammars, and all their adherents and proponents, are preventing semantic search from surfacing. This goes unnoticed until someone shouts loud enough to rise above the din of the crowd. There is even greater pressure than the burden of unwieldy systems, and better cover than market confusion.

After pumping giga-tax and industrial dollars into the research labs of the prominent schools and the works of their scientists and students, governments need the venture capitalists to cough up the giga-bucks needed to actually produce something and capture some kind of market. I am not saying that this in itself is good or bad; it is just the way of capitalism, after all, and it meets the objective of the industrio-academia-government partnerships that dominate the field.

Yet by focusing research and market-development funding on NLP and AI-based systems, "gatekeepers" have nearly prevented independent theories and very creative developers from getting funding and from going commercial, just by playing their role as gatekeepers. By such actions, they continue to stymie and hobble viable research directions and other quite feasible possibilities for semantic search. Thomas Frey wrote an essay about that too; you can find it here.

So I predict that although these companies are making inroads, and they are making NLP systems more adaptable and usable, they will fail as "semantic search" systems because they are not doing semantic search at all. Or perhaps the public is as fickle as it seems and can be fooled, in which case I could be wrong.

While Hakia and Lexxe have excellent implementations, and I have no doubt that Powerset's offering will also have strengths, not one of them qualifies as semantic search in my book. In regard to Powerset, what Michael Reisman was reporting was that Barney Pell claims Powerset has innovations that make the system more adaptable, so that it can extract deep relationships from text. No one is saying what that "deep relationship" is, mainly because it is not deep at all; it is a surface-level linguistic feature.

Not one of these so-called NLP wonders can answer a third-grade question, as I previously wrote here. Neither can they pass a simple test of semantic search capabilities, the most revealing of which is the capability to construe the meaning of a query given in another language, like this.

Commercial NLP-based systems such as Hakia, Lexxe and Powerset can only do this with regard to English grammar, and how well they handle all forms of grammar is highly questionable and often disagreeable.

People should remember that the relationships these systems deliver are grammatical relationships. These relations cannot even be classed as semantic, except insofar as they relate terms to parts of speech. Having and knowing the concepts of noun and verb, and extracting the relation between the subject and object in a sentence, reveals little about the possible associations and relevance between words and structures and concepts of the mind.

Read Full Post »

How important do you think it is to recognize when you are about to make an error? If you rate it as pretty important, then you will agree that that sort of recognition would be something very meaningful. The very act of distinguishing the error is of perceptual significance and personally meaningful.

Wouldn’t it be nice if a semantic search algorithm can distinguish a bad or false hit (an error) from a good or positive one just as we can?

Recent research (which you can read about over at Science Daily: Why We Learn From Our Mistakes) shows that our brains are built for recognizing errors.

Science Daily — Psychologists from the University of Exeter have identified an ‘early warning signal’ in the brain that helps us avoid repeating previous mistakes. Published in the Journal of Cognitive Neuroscience, their research identifies, for the first time, a mechanism in the brain that reacts in just 0.1 seconds to things that have resulted in us making errors in the past.

This is so universally applicable to human nature that human language has a built-in semantic domain to identify, distinguish, communicate and control a sense datum such as an error or a deviation.

Some people may be surprised to learn that there are also inheritance rules that require consistency of the semiotic signals as they propagate the properties, characteristics and all variant interpretations of the sense data through all possible symbolic compositions and permutations. The symbols “e r r o r” spell out the property or characteristic to convey, linking the sense datum to the lexical element of the English language in this case. This may seem preposterous without knowing the theoretical basis for that link, but that does not make it so.

Think about it this way: Failures, errors, faults, deviations, in body, mind, health, work, opinion, belief, character: Does the interpretation of these sorts of things occupy any part of your thoughts? Do you talk about them, communicate with others: your spouse, your preacher, your friend, your dog? Do we debate them socially, nationally, culturally? Do they go away if you do nothing or do they stay with you, even grow and propagate? If you don’t get that, consider this: It is clearly an innate function of interpersonal communication, among all species, to observe and alert others of error, deviation, or danger.

It is what we talk about, if you take some time to consider it. Language rests on the ability to interpret such sense data and we have highly refined symbols, semiotics and communications systems to propagate such signals.

The syntactic mechanics and semantics for distinguishing and interpreting sense data make a fine basis for a parsimonious and scalable computational model of human language. Perhaps they also play a significant role in the psychological and social basis of meaning and interpersonal communications. Because this particular semantic mechanism has been developed into a functional computational model, it presents the possibility of a new direction for research on functional models of language and cognition.

Unfortunately, the research scene is tenured and business and government research and development is incestuously imitative. It is almost unheard of to break from the recent past, in computation, in linguistics, and in philosophy and logic. I started blogging about semantic search mainly because I have direct experience developing and fielding semantic search applications since before the Internet was much more than a figment of people’s imaginations. That is a long time– not as long as John Sowa perhaps, but a very long time nonetheless.

For as long as I have been in the business, the semantics of personal meaning and perception have always been a kind of nebulous subject in the computing sciences where they invent and use formal programming languages with formal logic and accompanying semantics.

Nevertheless, without the slightest consideration of interpersonal meaning and human perception, it seems every young computer science graduate thinks they have the stuff to process natural language statements, messages and texts, simply by building a few arrays and parsers and doing some table look up. Those that try fail.

They fail for many varied reasons, but mostly they fail for lack of a well-grounded and universal semantic theory of natural language and human cognition. It has been that way since before computer science began being offered as a course at major universities. It is that way today.

I guess there is hope and we can have some faith that people are built with the faculties to eventually realize they have made a mistake and orient in a new direction. What do you think?

Read Full Post »

The world of semantic searching just got a little bigger thanks to some pretty savvy investors and talented scientists and computational engineers over at Feedster. My hat is off to them.

Applying semantic search to raw RSS feeds is very challenging and no small undertaking. Feedster has more than three hundred million posts in their archives, and they "semantically" index from ten to thirty thousand posts, from a collection of more than 70 million feeds, every ten minutes, using Readware technology.

As a semantic search framework, Readware, the underlying search engine at Feedster, is primarily an indexing, search and retrieval engine built on a theoretical semantic foundation. That means it parses, tokenizes and indexes words from texts as search engines tend to do; but it treats tokens not only as keywords but also as semantic objects that have interconnecting relations with similar objects.
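In code, that dual treatment might look like the following minimal sketch; the concept table and index structures are invented illustrations, not Readware's internals.

    # Toy dual index: every token is stored as a literal keyword and,
    # where known, as a semantic object (concept table is invented).
    from collections import defaultdict

    CONCEPTS = {"economy": "ECONOMY", "boom": "TREND", "increase": "TREND"}

    keyword_index = defaultdict(set)   # literal word  -> post ids
    concept_index = defaultdict(set)   # concept label -> post ids

    def index(post_id, text):
        for token in text.lower().split():
            keyword_index[token].add(post_id)
            if token in CONCEPTS:
                concept_index[CONCEPTS[token]].add(post_id)

    index("p1", "export boom continues")
    index("p2", "steady increase in housing")

    # A concept query finds both posts, though they share no keywords:
    print(sorted(concept_index["TREND"]))  # ['p1', 'p2']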

Let me give you an example. I guess we can begin with the example Marti Hearst brought up about compound noun phrases. I think this is a good example because many search queries take this form. I would rather take some nouns where the search engine results can be a little more revealing of the power of the underlying algorithms. So I will choose trends and economy as my nouns.

The noun "economy" is ambiguous: it could refer to macroeconomic ideas, or it could refer to a state of parsimony. The other noun, "trend", is much less ambiguous but has multiple and even metaphoric means of indication. By that I mean that the idea of a state of affairs that someone might correctly call a "trend" can be represented by many other words and word phrases, such as increase, decrease, boom, etc.

To that we add the pragmatic aspect, in the realm of search. To what end do we search for trends in the economy, say? What is it that we desire to obtain from the exercise? For that is what drives the algorithms that take these signs and foster a search for valid and relevant responses.

My internal, or intellectual, algorithm tells me that what the phrase “trends in the economy” means depends on the subject. Without the subject there is no way to distinguish whether the phrase refers to trends in fuel economy or trends in the macroeconomic indicators like jobs, housing and trade, such as country GDP.

The easy way out is to take the more popular position, on the axiom that you can please some of the people some of the time. It takes more time, and an interaction, to find out what the user really means. In person, if someone asked "trends in the economy?", I might respond: do you mean my car's economy or the general economic situation?

I do not want my search engine to do what is popular, I want it to try to help me. If all search engines pragmatically assume the category or subject is macroeconomics, I will not be able to use the search engine effectively because all it will do is give me pointers to information fitting its narrow interpretation.

In the worst case, the search engine algorithm should take that search phrase and return a representative and relevant set of results. A representative set of semantic results should not consist only of macroeconomic hits; the set should include some things relevant to saving and economy, particularly since the search term is economy and not economic. That is a hint that the popular search engines do not seem to pick up on, although it is obvious and easily distinguished. Let me show you what I mean.
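First, though, here is a minimal sketch of what "representative" could mean mechanically: bucket candidate hits by detected sense and draw from every bucket. The sense cues and the interleaving policy are invented stand-ins, not any engine's actual method.

    # Toy representative ranking across senses of "economy" (cues invented).
    SENSE_CUES = {
        "macro":  {"gdp", "jobs", "housing", "trade"},
        "saving": {"fuel", "saving", "thrift", "mileage"},
    }

    def sense_of(text):
        """Assign a hit to the first sense whose cue words it contains."""
        words = set(text.lower().split())
        for sense, cues in SENSE_CUES.items():
            if words & cues:
                return sense
        return "unknown"

    def representative(hits, k=4):
        """Interleave hits from each sense bucket so every sense shows up."""
        buckets = {}
        for h in hits:
            buckets.setdefault(sense_of(h), []).append(h)
        out = []
        while len(out) < k and any(buckets.values()):
            for bucket in buckets.values():
                if bucket and len(out) < k:
                    out.append(bucket.pop(0))
        return out

    hits = ["GDP and jobs data point upward",
            "Better fuel mileage saves money",
            "Housing trade figures rebound"]
    print(representative(hits))  # both senses appear in the top results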

Below are the results I obtained from Google, Hakia and the semantic search engine at Feedster on the search “trends in the economy”:

First, Google:

“trends in the economy”

As you can see, Google seems to have interpreted the phrase as "economic trends", and these first five hits are representative of those that followed. It seems that Google's pragmatics claim to know better than us what our own words may mean. By changing my search phrase around, they altered the meaning and gave me keyword hits as authority for that meaning: a little too 1984-ish for me, but it did bring hits with both noun forms (economy or economics, and trends).

Because there is nothing about savings or fuel economy in the hits, and due to the fact that both nouns are highlighted, I can definitively claim there is nothing "semantic" in Google's search. Fixing the meaning of the noun economy to denote economics is not meaningful; in my opinion, it is destructive of meaning.

Hakia interpreted the macro economy too. On the other hand, Hakia tried to stick to the form trends in the economy, and their top five results show this quite clearly:

trends in the economy

Still, there is nothing semantic about Hakia's results. This is mainly because Hakia's sort of semantics is not useful for distinguishing semantic objects from nouns, verbs or any other forms of words. Hakia distinguishes sentence meaning using natural language grammar and the literal semantics of sentences.

Their search pragmatics seem to be to fall back on keywords when no sentence exists or when they are unable to decipher the semantics of a phrase. This is evidenced by the top five hits shown above and by the rest that followed.

Now here are the results from Feedster using Readware as the semantic search engine:

trends in the economy

It is telling, I think, that the search term "trend" is not among the titles and summaries of the top five results from Feedster, yet each hit is a relevant hit about trends or a trend in the economy. One hit even refers to a fuel economy increase. Neither Google nor Hakia, and I dare say no other search engine, has a semantic algorithm quite so powerful as the one here.

The pragmatics of the Feedster search are to choose semantically relevant results from the most recent posts collected from the Internet. Because perception plays a huge role in the appreciation of a search result, Readware searches for passages with the same words, just as Google and Hakia do; yet for Readware, the concept of economics is not the same as the concept of economy. Also, in these results, the concept of trend is interpreted as growth, increase, boom, and so on.

Obviously, Readware distinguishes meaning in much more sophisticated ways than either Google or Hakia can. Feedster is using Readware as their search engine because they are committed to the RSS and Blog community and they are determined to bring better tools to everyone interested.

Many people would think that it is right to interpret economy as the macro economy, and that a search engine should not bother collecting a hit about any other sense of the meaning or reference for the term economy. I would like to know what you think. Feel free to comment.

Read Full Post »
