Search. I suppose there is no denying that the word “search” ascended to significance in the consciousness of more people since the birth of Information Science than perhaps at any other time in history. This supposition is supported by a recent Pew Foundation internet study stating that:
The percentage of internet users who use search engines on a typical day has been steadily rising from about one-third of all users in 2002, to a new high of just under one-half (49%).
While it may not be obvious, it becomes apparent on closer examination of the phenomena, that the spread and growth in the numbers of words and texts and more formal forms of knowledge, along with the modern development of search technology, had a lot to do with that.
Since people adopted the technology of writing systems, civilizations and societies have flourished. Human knowledge and culture, and technological achievement, have blossomed. No doubt.
Since computers, and ways of linking them over the internet, came along, the numbers of words and the numbers of writers have increased substantially. It was inevitable that search technology would be needed to search through all those words from all those writers. That is what Vannevar Bush was telling his contemporaries in 1945 when he said the perfection of new instruments “call for a new relationship between thinking man and the sum of our knowledge.”
But somewhere along the line things went wrong; some things went very, very wrong. Previous knowledge and the sum of human experience was swept aside. Search technology became superficial, and consequently, writing with words is not considered as any kind of technology at all. That superficiality violates the integrity of the meaning of search, and the classification of words merely as arbitrary strings is also wrong, in my view.
Some scientists I know would argue that the invention of writing is right up there at the top of human technological achievement. I guess we just take that for granted these days, and I am nearly certain that scientists that were embarking into the new field of information technology in the 1940’s and 1950’s were not thinking of writing with words as the world’s first interpersonal memory– the original technology of the human mind and its thoughts and interactions.
Most information scientists have not yet fully appreciated words as technical expressions of human experience but treat them as labels instead. By technical, I mean of or relating to the practical knowledge and techniques (of being an experienced human).
Very early in the development of search technology, information scientists and engineers worked out assumptions that continue to influence the outcome, that is, how search technology is produced and applied today. The first time I wrote about this was in 1991 in the proceedings of the Annual Meeting of the American Society of Information Science. There is a copy in most libraries if anyone is interested.
And here we are in 2008, in what some call a state of frenzy and others might call disinformed and confused– looking at the prospects of the Semantic Web. I will get to all that in this post. I will divide this piece into the topics of the passion for search technology, the state of confusion about search technology, and the semantics of search technology.
The term disinformed is my word for characterizing how people are under-served if not totally misled by search engines. A more encompassing view of this sentiment was expressed by Nick Carr in an article appearing in the latest issue of the Atlantic Monthly where he asks: Is Google making us stupid?
I am going to start off with the passion of search.
Writing about the on-line search experience in general, Carmen-Maria Hetrea of Britannica wrote:
… the computing power of statistical text analysis, pattern-matching, and stopwords has distracted many from focusing on (should I say remembering?) what actually makes the world tick. There are benefits and dangers in a world where the information that is served to the masses is reduced to simple character strings, pattern matches, co-location, word frequency, popularity based on interlinking, etc.
( … ) It has been sold to us as “the trend” or as “the way of the future” to be pursued without question or compromise
That sentiment pretty much echos what I wrote in my last post. You see, computing power was substituted for explanatory power and the superficiality of computational search was given credibility because it was needed to scale to the size of the world wide web.
This is how “good enough” became state of the art. Because search has become such a lucrative business and “good enough” has become the status quo, it has also become nearly impossible for “better” search technology to be recognized, unless it is adopted and backed by one of the market leaders such as Google or Microsoft or Yahoo.
I have argued in dozens of forums and for more than twenty years that search technology has to address the broader logic of inquiry and the use of language in the pursuit of knowledge, learning and enhancing the human experience. It has to accommodate established indexing and search techniques and it has to have explanatory power to fulfill the search role.
Most that know me know that I am not showing up at this party empty-handed. I have software that does all that and while my small corporate concern is no market or search engine giant my passion for search technology is not unique.
In her Britannica Blog post about search and online findabillity, Carmen-Maria Hetrea summed up her passion for search:
Some of us dared to differ by returning to the pursuit of search as something absolutely basic to the foundations of our human existence: the simple word in all of its complexity — in its semantics and in its findability and its futuristic promise.
You have to ask yourself what you are really searching for before you can find that it is not for keywords or patterns at all. Out in the real world almost everyone is searching for happiness. Some are also searching for truth or relevance. And many search for knowledge and to learn. If your searching doesn’t involve such notions, maybe you don’t mind the tedium of thorough, e.g., exegetical, searching. Or maybe you are someone who doesn’t search at all, but depends on others for information.
How is the search for happiness addressed by online search technology? Should it be a requirement of search technology to find truth or relevance? Should a search be thorough or superficial? Is it about computing power or explanatory power? I am going to try and address each of these questions below as I wade through the causes of confusion, expose the roots of my passion and maybe shed some light on search technology and its applications.
Some people have said in the online world you have both the transactional search and the research search, which are not the same. They imply that these search objectives require different instruments or plumbing. I don’t think so. I think it is just a crutch vendors use to justify superficial search. Let’s look at an example transactional search, say, searching for a new car. There are so many places where you can carry out that transaction, being thorough and complete is not an issue. Here’s is a search vendor quiz:
Happiness is a ___________ search experience.
Besides searching for objects of information that we know but don’t have at hand, in cyberspace and on the web, we might search for a pizza place in a new destination. Many search for cheap air fares or computer or car parts, or deals on eBay, while others search for news, music, pictures and many other types of media and information. A few others search for knowledge and for explanation. Happiness in the universe of online search is definitely a satisfying search experience irrespective of what you are searching for.
Relevance is paramount to happiness and satisfaction whether searching for pizza in close proximity or doing research with online resources. Search vendors are delivering hit lists from their search engines, where users are expecting relevance and to be happy with the results. Satisfaction, in this sense, has turned out to be a tall order and nonetheless a necessary benefit of search technology that people still yearn for.
Let’s now turn to the state of confusion.
Carmen-Maria mentions that new search technology has to be backward compatible and she also complains that bad search technology is like the wheel that just keeps getting reinvented:
The wheel is being reinvented in a deplorable manner since search technology is deceptive in its manifestation. It appears simple from the outside, just a query and a hitlist, but that’s just the tip of the iceberg. In its execution, good search is quite complex, almost like rocket science.
… The wealth of knowledge gained by experts in various fields – from linguists to classifiers and catalogers, to indexers and information scientists – has been virtually swept off the radar screen in the algorithm-driven search frenzy.
The wheel is certainly being re-invented; that’s part of the business. I am uncertain what Carmen-Maria means by algorithm-driven search frenzy. Algorithms are the stuff of search technology. I believe that some of the problems with search stem from the use of algorithms that are made fast by being superficial, by cutting corners and by other artificial means. The cutting of corners begins with the statistical indexing algorithms or pre-coordination of text– so retrieval is consequently hobbled by weaknesses in the indexing algorithms. But algorithms are not the cause of the problem.
Old and incorrect assumptions are the real problem.
Modern state-of-the-art search technology (algorithms) invented in the 1960’s and 1970’s strip text of its dependence on human experience under something information science (IS) professionals justify as the independence assumption. Information retrieval (IR) professionals– those that design software methods for search engine algorithms– are driven by the independence assumption to treat each text as a bag of words without connection to other words in other texts or other words in the human consciousness.
I don’t think he was thinking about this assumption when Rich Skrenta wrote:
… – the idea that the current state-of-the-art in search is what we’ll all be using, essentially unchanged, in 5 or 10 years, is absurd to me.
Odds are that he intends to sweep a lot of knowledge out of the garage too, and I would place the same odds that any “new” algorithm Rick brings to the table will implicitly apply that old independence assumption too.
So this illustrates a kind of tug of war between modern experts in search technology and the knowledge of ages of experience. There is also a kind of frenzy or storm over so-called “new” technologies and just what constitutes “semantic” search technology. While some old natural language processing (NLP) technology has debuted on the online search scene, it has not brought any new search algorithms to light. They have only muddied the waters in my opinion. I have written about this in previous posts.
The underlying current is stirred up by imbalance existing in the (significant) history of search technology contrasted with the nascence of online search and other modern applications of search technology. Add to that disturbance the dichotomy exasperated by good (satisfying) and bad (deceptive) search results, multiplied by the number of search engine vendors, monopolistic or otherwise, and you have the conditions where compounding frenzy, absurdity and confusion, rather than relevance, reigns supreme.
I like to think my own view transcends this storm and sets an important development principle that I established when I produced the first concept search technology back in 1987. The subjects of the search may be different but the freedom to search for objects, for answers, or for theories or explanations of unknown phenomena is the right of inquiry.
This right of intellectual inquiry is as important and as basic as the freedom of speech. This is what ignites my passion for search technology. And I cannot stand to have my right of inquiry blocked, limited, biased, restricted, arrested or constrained, whether by others, or by unwarranted procedure (algorithm) or formality, or by mechanical devices.
I wear my passion on my sleeve and it frequently manifests as a rant against the “IT” leaders or so-called experts that Carmen-Maria wrote about:
Many consider themselves experts in this arena and think that information retrieval is this new thing that is being invented and that is being created from scratch. The debate often revolves around casual observations, remarks, and opinions come mostly from an “IT” perspective.
To be fair, not all those with “IT” perspectives are down with all this “new thing” in online search engines. Over at the Beyond Search blog, Stephen Arnold wrote about the problem with the thinking about search technology:
… fancy technology is neither new nor fancy. Google has some rocket science in its bakery. The flour and the yeast date from 1993. Most of the zippy “new” search systems are built on “algorithms”. Some of Autonomy reaches back to the 18th century. Other companies just recycle functions that appear in books of algorithms. What makes something “new” is putting pieces together in a delightful way. Fresh, yes. New, no.
I also think Stephen understands the history of search technology pretty well. He demonstrates this when he writes:
Software lags algorithms and hardware. With fast and cheap processors, some “old” algorithms can be used in the types of systems Ms. Hane identifies; for example, Hakia, Powerset, etc. Google is not inventing “new” things; Google is cleverly assembling bits and pieces that are often well known to college juniors taking a third year math class.
Like Carmen-Maria Hetera, Stephen Arnold sounds biased against algorithms, “old” algorithms in particular, though I don’t think he intended any bias, as many of the best algorithms we have are “old”. There are really not many “new” algorithms. Augmented, yes. Modified, Yes. New, no.
To be involved in IT and biased against algorithms is absurd as long as technology is the application of the scientific method and scientific search methods are understood as collections of investigative steps systematically combined into useful search procedures or algorithms. So there you have my definition of search technology.
The algorithms for most search technology are not rocket science and can be boiled down to simple procedures. At the very least there is an indexing algorithm and a search algorithm:
Pre-coordination per-text/document/record/field procedure:
- Computerize an original text by reading the entire text or chunks of it into computer memory.
- Parse the text into the smallest retrievable atomic components (usually patterns (trigrams, sentences, POS, noun-phrases, etc.) or keywords or a bag (alphabetical list) of infrequent words).
- Store the original text with a unique key and store the parsing results as alternate keys in an index.
- Repeat for each new text added to a database or collection.
Post-coordination per-query procedure:
- Read a string from input, parse the query into keys in the same way as a text.
- Search the index to the selected collection or database with the keys.
- Assemble (sort, rank) key hits into lists and display.
- Choose hit to effect retrieval of the original text.
These basic algorithms are fulfilled differently by different vendors but vendors do not generally bring new algorithms to the table. They bring their methods of fulfilling these algorithms; they may modify or augment regular methods employed in steps 2 and 3 of these procedures as Google does with link analysis.
In addition, vendors fold search technology into a search engine. Most online search engines– those integrated “software systems” or search appliances that process text, data and user-queries, are composed of the following components:
- A crawler for crawling URI’s or files on disk or both.
- An indexer that takes input from the crawler and recognize key patterns or words.
- A database to store crawler results and key indexing (parsing) results.
- A query language (usually SQL, Keyword-Boolean) to use the index and access keys in the database.
- An internet server and/or graphical user interface (GUI) components for getting queries from, and presenting results to, users.
Most search engine wizards, as they are called, are working on one or more of these software components of online search engines. You can look at what a representative sample of these so-called wizards have to say about most of these components at the ArnoldIT blog here. If you read through the articles, you won’t find one of them (and I have not read them all) that is working on new indexing methods or new mapping algorithms for mapping the meaning of the query to the universe of text, for example.
Many of the “new search engines,” popping up everywhere, are not rightly called new search technology even though they frequently bear the moniker. They are more rightly named new applications of search technology. But even vendors are confused and confusing about this. Let’s see what Riza Berkin of Hakia is saying in his most recent article where he writes:
But let’s not blind ourselves by the narrowness of algorithmic advances. If we look closely, the last decade has produced specialist search engines in health, law, finance, travel etc. More than that, search engines in different countries started to take over (like Naver, Baidu, Yandex, ect.)…
He had been writing that Search 1.0 began with Alta Vista (circa 1996) Search 2.0 is Google-like and Search 3.0 is semantic search “where the search algorithms will understand the query and text”. I guess all those search engines from Fulcrum, Lexis-Nexis, OpenText, Thunderstone, Verity, Westlaw, and search products from AskSam to Readware ConSearch to ZyIndex, were Search 0.0 or at leat P.B. …. You know like B.C. but Pre-Berkin.
And so this last paragraph (above) makes me think he is confusing search applications with search technology. His so-called specialists search engines are applications of search technology to the field or domain of law, to the field or domain of health, and so on.
Then he confuses me even more, when he writes about “conversational search”:
Make no mistake about it, a conversational search engine is not an avatar, although avatars represent the idea to some extent. Imagine virtual persons on the Web providing search assistance in chat rooms and on messengers in a humanly, conversational tone. Imagine more advanced forms of it combined with speech recognition systems, and finding yourself talking to a machine on the phone and actually enjoying the conversation! That is Search 2.0 to me.
Now I can sympathize with Riza because I used the phrase “conversational search” to describe the kind of conceptual search engine I was designing in 1986. I am not confused about that. I am confused that he calls that Search 2.0 when earlier– statistically augmenting the inverted index –was described as Search 2.0.
He doesn’t stop there. He continues describing Search 3.0 that “will be the ‘Thinking Search’ where search systems will start to solve problems by inferencing. ” Earlier he wrote that semantic search was Search 3.0. Semantics requires inferencing, so I began to reckon maybe thinking and semantics are equal in his mind, until he writes: “I do not fool myself with the idea that I will see that happening in my life time” — so now I am confused again. I think it is what vendors want; they want the public to remain confused about the semantics of search and what you get with it.
And that brings me to the semantics of search.
There are only two words that matter here: Thoroughness and Explanatory.
When I started tinkering with text processing, search and retrieval software in the early 1980’s, I was captivated by the promise of searching and reading texts on computers. The very first thing that I noticed about the semantics of search, before my imagination became involved in configuring computational search technology, was thoroughness. The word /search/ implies thoroughness if not completeness in its definition. Thoroughness is a part of the definition of search. Look at the definition of search for yourself.
You need only look at one or two hit lists from major search engines and you can see that is not what we get from commercial search engines, or from most search technology. Search is not a process that is completed by delivering some hints of where to look, but that is what it has been fashioned into by the technological leaders in the field. Millions of people have accepted it.
Yet, in our hearts we know that search must be complete and it must be explanatory to be satisfying; We must learn from it, and we expect to learn from conducting a search. Whether we are learning of the address to the nearest pizza place or we are learning how to install solar heating, it is not about computational power, it is about explanatory power. They forgot that words are part of the technique of communicating interpersonal meaning, let’s hope search vendors don’t forget that words have explanatory power too.
Tell me what you think.
Read Full Post »