Peter Mika recently wrote an article about the semantic web and NLP-style semantic search. I should just ignore his claim that there are only two roads to semantic search because he is plainly mistaken on that count. As Peter works for Yahoo, he was mainly discussing data processing with RDF and Yahoo’s SearchMonkey. He obviously knows that subject well.
He constructed an example of how to use representational data (such as an address) according to semantic web standards and how to integrate the RDF triples with search results. His claim is that one cannot do “semantics” without some data manipulation, and that for this the data must be encoded with metadata, which is essentially data about the data. In this case, it is the metadata necessary to pick out and show the data at the keyword “address”.
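The kind of integration described here can be sketched in a few lines. This is only an illustration under stated assumptions, not Peter's example or the SearchMonkey API: the `ex:` identifiers, the sample data, and the helper names `lookup` and `enrich` are all hypothetical.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """A subject-predicate-object statement, the basic unit of RDF."""
    subject: str
    predicate: str
    obj: str


# Hypothetical metadata a publisher might attach to its pages.
# The "ex:" names are made up for illustration only.
TRIPLES = [
    Triple("ex:pizzeria1", "ex:name", "Example Pizza"),
    Triple("ex:pizzeria1", "ex:address", "123 Main St, Springfield"),
    Triple("ex:pizzeria2", "ex:name", "Another Pizza Place"),
]


def lookup(triples: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object stored for the given subject and predicate."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]


def enrich(result_id: str, title: str, triples: List[Triple]) -> str:
    """Append the address to a search result title when one is tagged."""
    addresses = lookup(triples, result_id, "ex:address")
    return f"{title} ({addresses[0]})" if addresses else title


if __name__ == "__main__":
    print(enrich("ex:pizzeria1", "Example Pizza", TRIPLES))
    print(enrich("ex:pizzeria2", "Another Pizza Place", TRIPLES))
```

Note that the program only matches the string `ex:address`; that an address matters to someone looking for pizza nearby is a connection made by the person who wrote the tag and the person who reads the result.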
At the end of his article, Peter talks about the way forward and, in particular, about the need for fostering agreements around vocabularies. I suppose that he means to normalize the relationships between words by having publishers direct how words are to be used. He calls this a social process while calling on the community of publishers to play their role. Interesting.
About the time Peter was beginning his PhD candidacy, industry luminary John Sowa wrote in Ontology, Metadata and Semiotics that:
Ontologies contain categories, lexicons contain word senses, terminologies contain terms, directories contain addresses, catalogs contain part numbers, and databases contain numbers, character strings, and BLOBs (Binary Large OBjects). All these lists, hierarchies, and networks are tightly interconnected collections of signs. But the primary connections are not in the bits and bytes that encode the signs, but in the minds of the people who interpret them.
This is the case in the trivial example offered by Peter. The reason one is motivated to list an address in the results of a search for pizza is that it is relevant to people who are searching for a pizza place close to them. In his paper, John Sowa writes:
The goal of various metadata proposals is to make those mental connections explicit by tagging the data with more signs.
This is the essential nature of the use case and proposal offered by Yahoo with SearchMonkey. It seems a good idea, doesn’t it? Yahoo is giving developers the means to tag such data with more signs. Besides, it has people using Yahoo’s index, exposing Yahoo’s advertisers. Sowa cautions that:
The ultimate source of meaning is the physical world and the agents who use signs to represent entities in the world and their intentions concerning them.
Which resources do investigators or developers use to learn about agents and their intentions when using signs? The resource most developers turn to is language and they begin by defining the words of language in each context in which they appear.
Peter says it is common for IR systems to focus on words or grams and syntax. Some may object, but although NLP systems such as Powerset, Hakia and Cognition use dictionaries and “knowledge bases” to obtain sense data, they each focus mainly on sentence syntax and (perhaps with the exception of Powerset) use keyword indexes for retrieval just like traditional IR systems.
As a matter of fact, Hakia gets its keyword search results from Yahoo. All of these folks treat words, and even sentences, as the smallest units of meaning of a text. Perhaps these are the most noticeable elements of a language that are capable of conveying a distinction in meaning, though they certainly are not the only ones. There are other signs of meaning obtainable from textual discourse.
Believe it or not, the signs people use most regularly are known as phonemes. They are the least salient because we use them so often, and largely subconsciously. Yet we have found that these particular sounds are instantiations, or concrete signs, of the smallest elements of abstract thought: distinctive elements of meaning that are sewn and strung together to produce words and form sentences. When they take form in a written text they are also called morphemes.
Some folks may not remember that they learned to read words and texts by stringing phonemes together, sounding them out to evoke, apprehend and aggregate their abstract meanings. I mention this because if a more natural or organic semantic model were standardized, the text on the world wide web could become more tractable and internet use might become more efficient.
This would happen because we could rid ourselves of the clutter of so many levels of metalevel signs and the necessity of controlled vocabularies for parsing web pages, blogs and many kinds of unstructured texts. An unstructured text is any free-flowing textual discourse that cannot easily be organized in the field or record structure of a database. Nor is it advantageous to annotate the entirety of unstructured text with metalevel signs because, as John Sowa wrote:
Those metalevel signs themselves have further interconnections, which can be tagged with metametalevel signs. But meaningless data cannot acquire meaning by being tagged with meaningless metadata.
So now this raises the question of whether or not words and their definitions are just meaningless signs to begin with. The common view of words, as signs, is that they are arbitrarily assigned to objects. I am unsure whether linguists could reach consensus that the sounds of words evoke meaning, as it seems many believe that a horse could have been called an egg without any consequence to its meaning or use in a conversation.
Within the computer industry it becomes even more black and white: a word is used to reference objects by way of general agreement or convention, where the objects are things and entities existing in the world. Some linguists and most philosophers recognize abstract objects as existing in the world as well, though this has not changed the conventional view, which is a kind of de facto standard among search software vendors today.
This view implies that the meaning of a word or phrase, that is, its interpretation, adheres only to conventional and explicit agreements on definitions. The trouble is that it overlooks or ignores the fact that meaning is independently processed and generated (implicitly) in each individual’s (agent’s) mind. This is generally very little trouble if the context is narrow and well-defined, as in most database and trivial semantic web applications on the scene now.
The problems begin to multiply exponentially when the computer application is purported to be a broker of information (like any search engine) where there is a verbal interchange of typically human ideas in query and text form. This is partly why there is confusion about meaning and about search engine relevance. Relevance is explicit inasmuch as you know it when you see it; otherwise, relevance is an implicit matter.
Implicit are the dynamic processes by which information is recognized, organized, acted on, used, changed, etc. The implicit processes in cognitive space are those required to recognize, store and recall information. Normally functioning, rational, though implicit and abstract, thought processes organize information so that we may begin to understand it.
It is obvious that there are several methods and techniques of organizing, storing and retrieving information in cyberspace as well. While there are IR processes running both in cyberspace and in cognitive space, it is not the same abstract space and the processes are not at all the same. In cyberspace and in particular in the semantic web, only certain forms of logical deduction have been implemented.
Cognitive processes for organizing information induce the harmonious and coherent integration of perceptions and knowledge with experience, desires, the physical self, and so on. Computational processes typically organize data by adding structure that arranges the information in desired patterns.
Neither the semantic web standards, nor microformats, nor NLP seek the harmony or coherence of knowledge. Oh, yes, they talk about knowledge and about semantics, yet what they deliver is little more than directives, suitable only for data manipulation in well-understood and isolated contexts.
Neither NLP nor semantic web metadata or tools presently have sufficient faculty for abstracting the knowledge that dynamically integrates sense data or external information with the conditions of human experience. The so-called semantic search officials start with names and addresses because these data have conventionally assigned roles that are rather regular.
When it comes down to it, not many words have such regular and conventional interpretations. It would actually be quite all right if we were just talking about a simple database application, but proponents of the semantic web want to incorporate everything into one giant database and controlled vocabulary. Impossible!
While it appears not to be recognized, it should be apparent that adherence to convention is a necessary yet insufficient condition to hold relevant meaning. An interpretation must cohere with its representation and its existence (as an entity or agent in the world) in order to hold. Consider the case of Iraq and weapons of mass destruction. Adhere, cohere, what’s the difference? It’s just semantics, right? Nonetheless, neither harmony nor coherence can be achieved by directive.
A consequence of the conventional view is that such fully and clearly defined directives leave no room for interpretation, even though some strive for underspecification. The concepts and ideas being represented cannot be questioned because, being explicit directives, they go without question. This is why I believe that the common view of words and meaning, which many linguists and computer and information experts like Peter hold, is mistaken.
If the conventional view were correct, the interpretation of words would neither generate meaning nor provide grounds for creating new concepts and ideas. If it were truly the case, as my friend Tom Adi said, natural language semantics would degenerate into taking an inventory of people’s choices regarding the use of vocabulary.
So, I do not subscribe to the common view. And these are the reasons that I debate semantic technologies even though end-users probably couldn’t care less about the techniques being deployed. Because if we are not careful we will end up learning and acting by directive too. That is not the route I would take to semantic search. How about you?