This is the first in a series of notes from the 2009 Enterprise Search Summit and the 2009 KM World. Please forgive typos as this is done in near real time. Fundamentals of Enterprise Search was a preconference workshop. It was led by Avi Rappoport, Principal of Search Tools Consulting and Editor of SearchTools.com. She is an independent consultant not connected with a vendor. Here is the session description.
“Search engines, big and small, have certain standard elements and processes. The more you understand them, the easier to tune them to solve your real information needs. This practical overview provides a big picture view of how search fits within enterprise and websites, and a focused introduction to search technology and user experience. Elements of search covered include robot spiders, database connectors and other tools for locating content, indexing issues, query parsing, retrieval, relevance ranking, and designing usable search interfaces. The workshop addresses common search problems and solutions, security issues, languages, new interface elements, important (and unimportant) features as well as providing tools for choosing a search engine or evaluating an existing one.”
Avi began with the similarities and differences between enterprise search and Web search. The differences include limited scope; fewer meaningful hyperlinks for link analysis; security and access control issues; content locked in databases; more control (for specifying value ranking, etc.); and no search spam.
Next she covered text search versus database search. Text search indexes multiple content sources and uses simple search commands instead of SQL. It offers flexible indexing and retrieval, plus relevance ranking (a major issue). Newer features include spell check, autocomplete, and facets. It works in the real world (e.g., eBay, Google).
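To make the contrast concrete, here is a minimal sketch (with made-up schema and data, not from the workshop) of a structured SQL lookup next to a naive free-text match:

```python
# Hypothetical contrast: database search needs structured SQL against known
# columns; text search takes free-form keywords and matches anywhere.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER, title TEXT, body TEXT)")
conn.execute("INSERT INTO docs VALUES (1, 'Q3 sales report', 'Revenue grew...')")

# Database search: exact, structured, schema-bound.
rows = conn.execute(
    "SELECT id, title FROM docs WHERE title LIKE ?", ("%sales%",)
).fetchall()

# Text search: tokenize everything and match any term (ranking comes later).
def matches(query, text):
    terms = query.lower().split()
    words = text.lower().split()
    return any(t in words for t in terms)

print(rows)
print(matches("quarterly sales", "Q3 sales report"))  # True: one term matched
```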
Then she covered how information architecture works with search. Information architecture is the art and science of organizing information for access and use. It creates order and systems and provides a standard vocabulary. Search can supplement information architecture by capturing user vocabularies and adapting dynamically as new content arrives.
KM and search are opposite ways of approaching content: KM organizes stuff, and search finds it. There are two main types of search: known-item search, with short queries and "good enough" answers, and exploratory search for research purposes. This is what Darwin Ecosystems excels at. It can help you discover relationships between content that you did not know existed and were not explicitly looking for. Avi said that search is an iceberg and people often see it as magic.
It is useful to index everything, as it is hard to know in advance what people will want. Twitter has changed expectations for real-time indexing, even for intranets; three minutes from publication to searchability is now a good expectation. Here is another impact of the consumer Web on enterprise computing.
Index security is an issue (see my post: Attivio Aligns with Traction and Releases New Features). Without the right security, people can see content they should not see. You need to work with security people on search issues to avoid this and to have the right capabilities in the search tool. The first step is knowing what needs to be controlled; then you can determine how to control it. Be aware of privacy laws.
After you determine what to secure, deal with access control. It is best to keep access control information as part of the document store (which is what Attivio does; see the post mentioned above). There are four levels of access control: one, access to the search engine itself; two, collection-level access control (to portions of the search engine); three, locked results that serve as a teaser for a subscription; four, hit-level access control, which links to an access control database at the point of display. The last is the hardest to do but useful when the rules change constantly.
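As a rough illustration of that fourth level, here is a minimal Python sketch of filtering hits against an access control store at display time; the acl_db structure, group names, and documents are all hypothetical:

```python
# Hit-level access control (level four): check each candidate result against
# an access-control store at display time. All names here are hypothetical.

# Hypothetical ACL store: document id -> set of groups allowed to see it.
acl_db = {
    "doc-001": {"hr", "executives"},
    "doc-002": {"everyone"},
    "doc-003": {"engineering"},
}

def filter_hits(hits, user_groups):
    """Drop results the user is not entitled to see before rendering."""
    visible = []
    for hit in hits:
        allowed = acl_db.get(hit["id"], set())
        # Treat every user as a member of "everyone" so public docs match.
        if allowed & (user_groups | {"everyone"}):
            visible.append(hit)
    return visible

hits = [{"id": "doc-001", "title": "Salary bands"},
        {"id": "doc-002", "title": "Holiday schedule"},
        {"id": "doc-003", "title": "API design notes"}]

print(filter_hits(hits, {"engineering"}))  # doc-002 and doc-003 only
```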
Robot spiders start with a base URL for all hosts. For each page, they repeat this process: read the text and convert it to an internal format, save the document in a cache, save the words into the index, extract all links and check them against the rules, and add any new URLs to the list. The process repeats over and over.
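Here is a minimal sketch of that crawl loop in Python; real crawlers add politeness delays, retries, robots.txt checks, and much more:

```python
# A minimal sketch of the crawl loop described above: fetch, cache, index,
# extract links, queue new URLs. Deliberately naive (it indexes raw HTML).
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def crawl(base_url, max_pages=10):
    queue = deque([base_url])
    seen = {base_url}
    cache, index = {}, {}          # document store and inverted index

    while queue and len(cache) < max_pages:
        url = queue.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue
        cache[url] = html                           # save document in cache
        for word in re.findall(r"[a-z0-9]+", html.lower()):
            index.setdefault(word, set()).add(url)  # save words into index
        for link in re.findall(r'href="([^"]+)"', html):
            absolute = urljoin(url, link)           # extract and normalize links
            if absolute.startswith(base_url) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)              # add new URLs to the list
    return cache, index
```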
There are common problems with robots that SEO work tries to avoid. Spiders can be disallowed by robots.txt or a robots meta tag. Spiders also may stumble on URLs with ? and & (though all spiders should handle these now), JavaScript, forms and other interactive dynamic links, session IDs that change, and multiple views of the same data (wikis and Lotus Notes).
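For the robots.txt rules, Python's standard library includes a parser that a well-behaved spider can consult before fetching; the URLs below are just placeholders:

```python
# Checking the robots.txt rules mentioned above with Python's standard
# library before fetching a page.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the rules

if rp.can_fetch("MyCrawler/1.0", "https://example.com/private/report.html"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")
```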
External sources that have APIs, like Twitter, can be brought into enterprise search. You might want to partition this content so it does not clutter standard search. This is one way you can use Darwin Ecosystem inside the enterprise to look outside it. Relevance is relative (this sounds obvious, but you need to remember it when creating relevance rankings).
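As a rough sketch of that partitioning idea, the snippet below tags externally fetched documents with their own collection so default searches skip them; the feed URL and document schema are hypothetical, not any real API:

```python
# Partitioning external feed content into its own collection so it does not
# clutter standard intranet results. URL and schema are hypothetical.
import json
import urllib.request

FEED_URL = "https://api.example.com/feed.json"   # hypothetical external API

def ingest_external(index_docs):
    with urllib.request.urlopen(FEED_URL, timeout=5) as resp:
        items = json.load(resp)
    for item in items:
        index_docs.append({
            "id": item["id"],
            "text": item["text"],
            "collection": "external",    # partition key
        })

def search(index_docs, query, collection="intranet"):
    """Default searches stay in the intranet partition; external is opt-in."""
    terms = query.lower().split()
    return [d for d in index_docs
            if d["collection"] == collection
            and any(t in d["text"].lower() for t in terms)]
```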
Indexing multimedia needs to be dealt with now. There can be internal and external metadata to support this. It is best to use human judgment rather than automated systems alone; automated systems can be a starting point, but they need to be fine-tuned by people. Speech-to-text and other automated capabilities are still buggy.
Stop words are common, ubiquitous terms. Traditionally you excluded them from the index, but there are consequences, such as missing copyright mentions. It is best to index everything, especially since storage is much cheaper now. Avi gave a good example of the cost of excluding stop words: searching for the phrase "whatever will be," a song title. On the other hand, you can get lots of irrelevant results. Another example is the rock band The Who. Here is where relevance can help: if you include stop words you get a lot of matches, but you only need to see the best ten. I tried this on Google and it did work; the top results related to the band even though there were 457,000,000 matches. Avi said that Google may have set up an exception for this term. She also said you might get different results on wordpress.com.
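A deliberately simplified sketch of why a stop list breaks phrase queries like "The Who": once the stop words are filtered, there is nothing left to match.

```python
# Why stripping stop words breaks phrase queries like "The Who": after
# filtering, the query vanishes. The stop list here is a small sample.
STOP_WORDS = {"the", "a", "an", "who", "will", "be", "whatever"}

def tokenize(text, stop_words=frozenset()):
    return [w for w in text.lower().split() if w not in stop_words]

phrase = "The Who"
print(tokenize(phrase))              # ['the', 'who']
print(tokenize(phrase, STOP_WORDS))  # [] -- nothing left to search for
```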
Dealing with duplicate documents can be complex. First you need to decide what counts as a duplicate, and then which copy is the primary when there are slight differences (e.g., typo corrections). Exact match is easy; similarity matching is harder but more useful and worth the effort. It is best to remove duplicates from the index and hide them in results unless requested, which is what Google does. You can create rules for handling duplicates, which is a good idea, but they need human supervision; it is worth it.
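Here is a minimal sketch of both approaches: a content hash for exact duplicates, and word-shingle (Jaccard) overlap for near-duplicates. The documents are made up, and the threshold you would apply to the similarity score is a judgment call:

```python
# Exact duplicates via a content hash, near-duplicates via word-shingle
# overlap (Jaccard similarity).
import hashlib

def content_hash(text):
    """Exact-match dedup: identical bytes hash identically."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def shingles(text, k=3):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard overlap of k-word shingles: 1.0 means identical word runs."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

doc1 = "the quarterly report shows strong growth in all regions"
doc2 = "the quarterly report shows strong growth in most regions"
print(content_hash(doc1) == content_hash(doc2))  # False: not exact copies
print(similarity(doc1, doc2))                    # ~0.56: likely near-duplicate
```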
Avi went through the search process: search form → query parser → query engine (which goes to the inverted index and back) → relevance ranker (which goes to the document store and brings content back) → formatter → search results. This all happens very fast now. Queries come from many sources, not just search fields: alerts, saved searches, automated searches, geographic information systems, and others. You need to balance relevance and completeness (precision and recall); you cannot maximize both.
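A toy version of that pipeline, just to make the stages concrete (the index and documents are made up):

```python
# A toy version of the pipeline above: parse the query, look terms up in an
# inverted index, rank by match count, then format results for display.
index = {                      # inverted index: term -> set of doc ids
    "enterprise": {1, 3},
    "search": {1, 2, 3},
    "security": {2},
}
doc_store = {1: "Enterprise search basics", 2: "Search security",
             3: "Tuning enterprise search"}

def run_query(raw_query):
    terms = raw_query.lower().split()            # query parser
    scores = {}
    for term in terms:                           # query engine
        for doc_id in index.get(term, set()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    ranked = sorted(scores, key=scores.get, reverse=True)  # relevance ranker
    return [doc_store[d] for d in ranked]        # formatter

print(run_query("enterprise search"))
```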
Relevance ranking algorithms work differently. The most common is TF-IDF, or term frequency times inverse document frequency: how often the query word appears in a document, weighted by how rare the word is across the whole index. There are others, but this is the most efficient. Rankers also look at the title, metadata, and the top of the document. Remember that relevance is task specific; there is no such thing as objective relevance, and you can never please everyone. Search is more like berry picking than hunting: try different things instead of locking onto a single goal. Here is where Darwin Ecosystem can help, with correlations of different topics with target key words.
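A minimal sketch of the TF-IDF idea, with made-up documents; real engines add smoothing, length normalization, and field weighting:

```python
# A minimal TF-IDF score for one term in one document, matching the idea
# Avi described: frequent in the document, rare across the index.
import math

def tf_idf(term, doc_words, all_docs):
    tf = doc_words.count(term) / len(doc_words)          # term frequency
    containing = sum(1 for d in all_docs if term in d)   # document frequency
    idf = math.log(len(all_docs) / (1 + containing))     # inverse doc frequency
    return tf * idf

docs = [["enterprise", "search", "basics"],
        ["search", "security"],
        ["holiday", "schedule"]]
print(tf_idf("enterprise", docs[0], docs))  # rare term -> higher score
print(tf_idf("search", docs[0], docs))      # common term -> lower score
```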
Be sure to limit user interface complexity; Google is a great example. Use familiar user interface elements. Put search into the navigation so it appears everywhere. With auto-complete, use a drop-down menu of matching words; base this on search logs and show the 7-10 most popular matches in alphabetical order.
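A rough sketch of that auto-complete recipe, with made-up log counts: take the queries matching the typed prefix, keep the most popular, and present them alphabetically.

```python
# Auto-complete from search logs, as described: find matching queries, keep
# the 7-10 most popular, then present them alphabetically. Sample data only.
from collections import Counter

search_log = Counter({"sales report": 120, "sales forecast": 95,
                      "salary bands": 80, "sales tax": 15})

def suggest(prefix, log=search_log, limit=10):
    matching = [(q, n) for q, n in log.items() if q.startswith(prefix.lower())]
    top = sorted(matching, key=lambda x: x[1], reverse=True)[:limit]
    return sorted(q for q, _ in top)    # alphabetical for the drop-down

print(suggest("sal"))  # ['salary bands', 'sales forecast', 'sales report', 'sales tax']
```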
In summary, with enterprise search you have much more control over the capabilities and decisions of your search system than you do on the Web. Make good use of this control.