This is another in a series of notes from the 2009 Enterprise Search Summit and the 2009 KM World. Enterprise Search Technologies was a preconference workshop led by independent consultant Miles Kehoe from New Idea Engineering, Inc. These notes are done in near real time, so please excuse any typos or spacing issues. Here is the session description.
“This workshop, by a vendor neutral consultant who has hands-on experience with a broad range of “out of the box,” open source, commercial, and home grown solutions, provides an overview of the enterprise search technology landscape. It reviews technologies currently on the market; discusses pros and cons, strengths and weaknesses, and specific requirements. Kehoe shares case studies that illuminate how search technologies are leveraged in different types of organizations; and provides a good introduction to and understanding of the enterprise search world.”
Miles said great search is conversational, open-ended, flexible, and smart. Conversational search allows you to interact with the engine to focus your quest. This is especially important for enterprise search, as search is much harder inside the enterprise. Never return just "no hits." If the engine cannot find anything, it should ask the user for more. Every search engine is built around a set of indices. This applies even to Google, which creates its index through its spiders. Different search engines just add different stuff around the indices. Every search engine goes through a process. Some expose parts of that process, which gives you added flexibility to pull specific information out.
It used to be you got plain search results pages in enterprise search like basic Google Web search. Now you get what he called enterprise search 2.0 with visualization, navigation, people, facets, etc. strung around the basic results.
There are two basic parts of search: indexing and the actual search. It is better to spend time at indexing (when people are not waiting) than at search time (when they are). However, I asked if the real-time capabilities of Twitter were changing those expectations. People want to see stuff as soon as it exists.
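To make the indexing versus search split concrete, here is a minimal sketch of my own (not something Miles showed). It assumes a toy in-memory inverted index; the function names and sample documents are hypothetical. The expensive tokenizing happens once at index time, so a query only has to intersect precomputed sets.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Index time: do the slow work once, while nobody is waiting."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Query time: intersect the precomputed posting sets (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings)


# Hypothetical sample documents, for illustration only.
docs = {
    1: "Enterprise search is harder than web search",
    2: "Web search relies on link analysis",
    3: "Faceted navigation helps enterprise users",
}
index = build_index(docs)
print(sorted(search(index, "enterprise search")))
```

The trade-off I raised about Twitter shows up here too: a new document is invisible until build_index (or an incremental update) has run over it.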
Before he reviewed vendors, Miles said it is not the technology but the methodology. It is how you implement the search engine. I can agree with this.
As he started to review vendors, Miles mentioned Lucene/Solr, a free open source search engine that is behind a number of search engines, including some commercial ones. It is Java based with an Apache license, prolific documentation, many implementations, and you have total control over search and relevance. However, there is some implementation work required and it is hard to find answers. There are limited enterprise support options. SearchBlox is packaged Lucene. Lucid Imagination is packaged Solr.
Miles’ tier one vendors are: Autonomy, Endeca, Exalead, Fast Search (the original independent version), Google, Vivisimo. I have reviewed Exalead (see Exalead’s CloudView Offers Integrated Search Capabilities and Exalead Provides Ability to Integrate with EMC Documentum). His criteria are: broad enterprise presence, multi-platform search, market penetration, and clear product vision. People like the Google brand so they have a perception that Google enterprise search works well. Not being in Tier One is not necessarily bad, just not meeting all the criteria. Other vendors I have reviewed that Miles also mentioned include Attivio (see Attivio Aligns with Traction and Releases New Features) that is newer and Recommind (see Recommind Provides Axcelerate eDiscovery 3.0 with New Features) which is more vertical focused.
Dates are important but web servers provide bad data so it is hard to trust what you get. Miles gave the example of a 1996 document appearing as new because it had just been re-indexed.
The wifi started working so Miles showed us Web sites with good search capabilities. Globrix is a UK real estate site that uses FAST, and you could see a lot of facets in home listings such as number of bedrooms, bathrooms, price range, etc. Then we looked at Newssift, which displays sentiment on topics. We also looked at Kosmix, which provides an example of exploratory search. It shows things that are related and loosely related.
Next we covered supporting technologies including document filters, connectors, social search, and federation. Document filters are part of the indexing process that converts binary source documents (PDF, Office, etc.) into a stream of text for indexing. Connectors are utility tools to provide a clearly defined interface between a search engine and external content. Some relate to indexing and others to display. Connectbeam is an example (see my reviews: Connectbeam Offers New Social Networking Application Integration Possibilities).
Social search is a popular term that applies to the capability to search corporate personal profiles to find people in an organization with certain skills or experience. It typically requires users to explicitly self-profile in order for searches to return accurate results. Some products now track user behavior to implicitly associate interests with users.
Federation refers to a program that can dispatch user queries to one or more external data sources (search engines, RDBMS systems, etc.) and present the combined results to the user. Federation from unsecure resources is fairly easy. Because relevance from each source is calculated differently, it is sometimes difficult to integrate results in a meaningful way.
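The relevance problem is easier to see in a sketch. The following is my own illustration, not anything from the workshop: it assumes two hypothetical sources that score on different scales and uses min-max normalization, which is only one of several ways to make scores comparable before merging.

```python
def normalize(hits):
    """Rescale one source's scores to [0, 1] with min-max normalization."""
    if not hits:
        return []
    scores = [score for _, score in hits]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: treat every hit as equally relevant.
        return [(doc, 1.0) for doc, _ in hits]
    return [(doc, (score - lo) / (hi - lo)) for doc, score in hits]


def federate(query, sources):
    """Dispatch the query to every source and merge the normalized hits."""
    merged = []
    for name, search_fn in sources.items():
        for doc, score in normalize(search_fn(query)):
            merged.append((score, name, doc))
    # Highest score first; ties are broken by source name, then document.
    merged.sort(key=lambda item: (-item[0], item[1], item[2]))
    return merged


# Hypothetical sources; real ones would call a search engine or a database.
def wiki_search(query):
    return [("wiki/search-howto", 12.0), ("wiki/faq", 3.0)]


def crm_search(query):
    return [("crm/account-17", 90.0), ("crm/account-42", 40.0),
            ("crm/account-8", 65.0)]


results = federate("search", {"wiki": wiki_search, "crm": crm_search})
for score, source, doc in results:
    print(f"{score:.2f}  {source:5}  {doc}")
```

Note that normalization forces the best hit from every source to the top, even if one source had nothing very relevant. That is part of why merged results can be hard to make meaningful.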
Entity extraction recognizes people, places, or things during indexing. In unsupervised extraction entities are recognized through algorithms. In supervised extraction, the process is seeded by human operators prior to processing.
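As a rough illustration of the supervised case, here is a sketch of my own that assumes the "seed" is simply a hand-built list of known names (a gazetteer). Real products are far more sophisticated; the seed list and function name here are hypothetical.

```python
import re

# Hypothetical seed list, curated by a human operator before indexing.
SEED_ENTITIES = {
    "person": ["Miles Kehoe"],
    "organization": ["New Idea Engineering", "Autonomy", "Endeca"],
}


def extract_entities(text, seeds):
    """Return (type, surface form) pairs for seeded names, in text order."""
    found = []
    for entity_type, names in seeds.items():
        for name in names:
            pattern = r"\b" + re.escape(name) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append((match.start(), entity_type, match.group(0)))
    found.sort()  # order by position in the text
    return [(entity_type, surface) for _, entity_type, surface in found]


text = "Miles Kehoe of New Idea Engineering compared Autonomy and Endeca."
print(extract_entities(text, SEED_ENTITIES))
```

The extracted pairs would be stored with the document at index time, so they can later drive facets or filters without rescanning the text.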
Sentiment analysis recognizes positive or negative sentiment algorithmically during indexing. It is easier to tell positive sentiment than negative.
Results clustering groups sets of documents into categories based on content. It looks like facets and entity extraction; however, clustering can be done independent of the query. Clustering is often used in search results to help the user discover additional related terms and content.
Faceted search is the result of assigning documents in a search result list into a pre-defined taxonomy-like order. Unlike clustering, which can appear similar, facets are based on the query and populate pre-defined classes of content (author, location, etc.). Facets are often used to encourage interaction with the user.
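Here is a small sketch of my own showing how facet counts might be computed over a result list. It assumes each result already carries metadata for the pre-defined fields; the field names and sample results are hypothetical.

```python
from collections import Counter

# Pre-defined facet classes (hypothetical field names).
FACET_FIELDS = ("author", "location")


def facet_counts(results, fields=FACET_FIELDS):
    """Count how many results fall under each value of each facet field."""
    counts = {field: Counter() for field in fields}
    for doc in results:
        for field in fields:
            value = doc.get(field)
            if value is not None:
                counts[field][value] += 1
    return counts


def refine(results, field, value):
    """Narrow the result list when the user clicks a facet value."""
    return [doc for doc in results if doc.get(field) == value]


# Hypothetical result list for one query.
results = [
    {"title": "Q3 search audit", "author": "Lee", "location": "London"},
    {"title": "Connector setup", "author": "Lee", "location": "Boston"},
    {"title": "Relevance tuning", "author": "Patel", "location": "London"},
]
counts = facet_counts(results)
print(counts["location"].most_common())
print([doc["title"] for doc in refine(results, "location", "London")])
```

The counts are what the user sees next to each facet value, and refine is the interaction the facets are meant to encourage.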
A key to having good search is to monitor it over time after the initial implementation. Look at what is happening and make corrections. Look at what people are searching for and accommodate them. You need to pull together a diverse collection of skills to have a great search function (e.g., business domain experts and corporate librarians, beyond just technical skills).
Miles mentioned two blogs on the topic that he writes: EnterpriseSearchBlog.com and SearchComponentsOnline.com.