I am primarily posting my notes from the 2011 KM World on the Darwin blog. I will share a key session here and then provide a summary of all my notes with links. I attended the workshop – Creating & Applying a Semantic Infrastructure Platform for Knowledge Sharing – led by Tom Reamy, Chief Knowledge Architect at the KAPS Group. Here is the session description.
While the distinction between information and knowledge has often been misused, there is an essential truth that knowledge is richer and deeper than simple information. In the early days of KM, there were numerous attempts to utilize information technologies and approaches (search, taxonomies, etc.) for knowledge sharing, and the results were not pretty. However, today, with a rich array of text analytics and knowledge platform software, we are in a position to support real knowledge creation and sharing, but only if we approach it with the right foundation. That foundation is based on a concept of a semantic infrastructure, an infrastructure needed to support all the ways in which people in organizations use language and meaning. This workshop explores the essential characteristics of creating, refining, and applying a semantic infrastructure for rich knowledge sharing. It starts with how to do a knowledge audit designed to create a platform for knowledge networks that provides the means for enhanced communication within and among self-defined social groups. It then looks at a range of text analytics tools including text mining, auto-categorization, entity and fact extraction, summarization, sentiment analysis, and how these tools can now go beyond mere information applications to form the basis of a range of knowledge-sharing apps.
Tom began by stating that he would first define semantic infrastructure, then cover implementation issues and benefits, and get to tools later. That is a good order. He asked how many people were involved in KM, and everyone raised their hands. Tom is a consultant in this space and has a virtual group that works with him.
Semantic infrastructure has four dimensions. Ideas and content structure is the first. You need to look at all the content, its structure and how it is used. People are the next dimension as creators and consumers of content. Activities are the third dimension. This means the information behaviors associated with the work process more than the process itself. Technology is the fourth dimension.
With content and content structure you need to look at both structured and unstructured content, both inside and outside the organization. It also needs to include the tacit knowledge inside people’s heads. You need to model flexible schemes to handle all this material. These include faceted metadata, simple taxonomies with intelligence built in (such as an auto-categorization tool, since taxonomies can otherwise be static – I would agree), ontologies, and the semantic web, as well as best bets and user metadata. He next went over a framework: level one has such things as keywords, level two has such things as a thesaurus, level three has facets, and level four is a knowledge map.
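As a simple illustration of what faceted metadata plus user metadata might look like on a single document (the field names and values here are my own, not from Tom's framework):

```python
# A hypothetical metadata record for one document. Keywords sit at level one
# of the framework, facets at level three; the facet names are illustrative.
document = {
    "title": "Q3 Call Center Escalation Review",
    "keywords": ["escalation", "call center", "customer churn"],
    "facets": {
        "document_type": "report",
        "business_unit": "Customer Support",
        "region": "North America",
        "audience": "managers",
    },
    "user_metadata": {"tags": ["useful", "q3-review"], "rating": 4},
}

# Faceted search then becomes simple filtering across these fields, e.g.
# "all reports from Customer Support tagged with 'escalation'".
```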
The people dimension includes how to deal with all the “tribes” in an organization. Each group has its own vocabulary, culture, and values. They also have different models of knowledge. There are both individual people and communities to deal with. A central team also needs to be put in place to support the semantic infrastructure; this team needs to be cross-organizational and interdisciplinary. Various technologies come into play here, such as text mining and search-based applications.
The business process analysis needs to include knowledge architecture experts, since process SMEs tend to be bad at knowledge structures. The knowledge map is the foundation for this work, and there are various techniques for putting one together, such as interviews. In Phase One you start with a high-level inventory of the content and its structure, as well as a map of the organization itself. You need to identify the SMEs in each area and secure access to them.
Next, in Phase Two you spider the relevant content to explore it and categorize it. You work with SMEs to better understand the content and how it is used. In Phase Three you develop the knowledge map and refine it by exposing it to people. You also create an expertise map.
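As a very rough sketch of the kind of content exploration this spidering involves (my own illustration, not a description of Tom's method), you could walk a shared repository and tally documents by area before sitting down with the SMEs:

```python
from collections import Counter
from pathlib import Path

def content_inventory(root: str) -> Counter:
    """Count documents per top-level folder as a crude first proxy for
    how content maps onto organizational areas."""
    counts: Counter = Counter()
    base = Path(root)
    for path in base.rglob("*"):
        if path.is_file():
            parts = path.relative_to(base).parts
            area = parts[0] if len(parts) > 1 else "(top level)"
            counts[area] += 1
    return counts

# Example: content_inventory("/shared/marketing") might return something like
# Counter({'archive': 1300, 'campaigns': 412, 'brand': 87})
```

The counts themselves are only a conversation starter; the real knowledge map comes out of the SME reviews described above.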
Tom said that enterprise taxonomies are a dead end. They are often expensive and difficult to apply. Maintenance is also difficult. I would agree here as the world is constantly changing. Tom said that instead of giant taxonomies, you create small ones with intelligence built in so you keep up with changes. Let the tools such as text mining help you keep up. The key is to have a big picture of how things fit together but not a rigid structure. Then you can adapt wisely.
Tom next went over the benefits. First, a semantic infrastructure allows you to handle massive amounts of unstructured content by giving it enough structure to make it useful. Google cannot just do it. Google works well on the Web because of its link structure, which adds a human element, but enterprise content is not linked the way the Web is. Google also has thousands of editors and actually uses taxonomies, though it does not publicize this. Within the enterprise, Google does not bring those Web advantages with it.
Semantic technology makes the enterprise task possible. The benefits can be huge, so large that people do not believe them. People waste too much time looking for information or re-creating information that already exists, and the cost of missing the right content can be even larger. There are figures putting the cost of time wasted looking for information at about $12 million per 1,000 employees, and the cost of recreating existing content at another $4.5 million per 1,000 employees.
Another benefit is getting a better return on technology investments, as most enterprise content management systems under-perform without a semantic infrastructure. For example, authors are usually reluctant to tag their material and are often poor at it. Text analytics tools can suggest tags that authors simply approve, which works better than having authors start from a blank slate. This approach can also be better than buying new content management tools and still having the same problem. Tom said that enterprise search is a static market, but building search-based applications is an expanding one.
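As a sketch of that suggest-then-approve tagging workflow (the controlled vocabulary and the matching logic here are my own simplification, not how any particular tool works):

```python
# Hypothetical controlled vocabulary; a real one would come from the taxonomy.
CONTROLLED_VOCABULARY = ["pricing", "escalation", "warranty", "onboarding", "renewal"]

def suggest_tags(text: str, vocabulary=CONTROLLED_VOCABULARY, max_tags: int = 3) -> list[str]:
    """Suggest tags by counting how often each controlled-vocabulary term
    appears in the document; the author then approves or rejects them."""
    lowered = text.lower()
    scored = [(term, lowered.count(term)) for term in vocabulary]
    ranked = sorted(scored, key=lambda pair: -pair[1])
    return [term for term, hits in ranked if hits > 0][:max_tags]

draft = "The customer asked about warranty coverage and pricing before renewal."
print(suggest_tags(draft))   # ['pricing', 'warranty', 'renewal'] -> author approves or edits
```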
Tom said that companies organize their people. This is not questioned. The same logic indicates that they should organize their content. This is a nice point. At the same time, I like his point that a top down rigid taxonomy does not work for organizing content. This is also consistent with the new ways of managing people.
Tom next covered the text analytics tools that make this semantic analysis possible, as well as how to evaluate them. These tools begin with noun phrase extraction, which is used to build categories and also feeds facet development. In addition, they can summarize documents and extract entities and facts. A new hot area is sentiment analysis.
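As a concrete illustration of noun phrase extraction feeding category and facet development, here is a sketch using the open-source spaCy library; the library choice and the sample notes are mine, not tools Tom named:

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def top_noun_phrases(texts: list[str], top_n: int = 10) -> list[tuple[str, int]]:
    """Extract noun phrases from a batch of documents and return the most
    frequent ones as raw material for categories and facets."""
    counts: Counter = Counter()
    for doc in nlp.pipe(texts):
        counts.update(chunk.text.lower() for chunk in doc.noun_chunks)
    return counts.most_common(top_n)

notes = ["The customer reported a billing error on the renewal invoice.",
         "Another billing error was logged by the call center rep."]
print(top_noun_phrases(notes))   # e.g. [('the customer', 1), ('a billing error', 1), ...]
```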
Auto-categorization is the core of these tools. One issue here is disambiguation. You usually need to train the tool and continue to refine it. A set of rules is built to help determine meaning. Tom showed an example of analyzing call center rep notes, with rules based on such things as the proximity of word combinations to help determine meaning. You can go see how documents are categorized and make refinements. A sample rule: if a word begins with a capital letter and the next word is “says” or “said,” it is likely a person. There may be exceptions to specify.
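Here is a toy rendering of the kind of rule Tom described, written as a regular expression rather than in any vendor's rule language; the exception list is invented for illustration:

```python
import re

# Rule in the spirit of Tom's example: a capitalized word followed by
# "says" or "said" is probably a person's name. Real tools express such
# rules in their own syntax and layer many more exceptions on top.
PERSON_RULE = re.compile(r"\b([A-Z][a-z]+)\s+(?:says|said)\b")
EXCEPTIONS = {"It", "She", "He", "Everyone", "Nobody"}

def likely_people(text: str) -> list[str]:
    """Apply the rule, then filter out the specified exceptions."""
    return [name for name in PERSON_RULE.findall(text) if name not in EXCEPTIONS]

note = "Smith said the unit failed twice. She said a refund was promised."
print(likely_people(note))   # ['Smith']
```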
With sentiment analysis you need to go beyond positive and negative, as there are levels within these dimensions. Most sentiment analysis involves finding positive and negative words; with categorization you can build more complex rules and analyses on top of that.
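As a sketch of what going beyond a flat positive/negative split might look like, here is a graded lexicon I made up for illustration; real tools layer categorization rules (negation, proximity, topic) on top of this kind of scoring:

```python
# Illustrative weighted lexicon: words carry graded sentiment, not just +1/-1.
LEXICON = {
    "outstanding": 2.0, "good": 1.0, "okay": 0.2,
    "slow": -0.5, "bad": -1.0, "furious": -2.0,
}

def sentiment_score(text: str) -> float:
    """Sum graded word weights for a rough intensity, not just a polarity."""
    return sum(LEXICON.get(word.strip(".,!?").lower(), 0.0) for word in text.split())

print(sentiment_score("The rep was good but the refund process was slow and I am furious"))
# 1.0 - 0.5 - 2.0 = -1.5
```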
Tom next covered how to evaluate text analytics tools. As with any tool, you need to start with the business context. Then you look at existing tools to see what is available and what the new tools will need to interact with. Text analytics is different from most software, so the IT department can be challenged by this evaluation task. The business players may know the content but need help on how to use the tools. The team needs to be interdisciplinary.
Tom recommends a two-phased approach. Start like a traditional software evaluation: see what the industry says about the tools, build a feature scorecard, and draw up a short list for demos. Then do a proof of concept, which usually takes four to six weeks. You need to go beyond a vendor’s demo to see what happens with your content. The proof of concept should include several refinement cycles, since the tools need to be trained and will not work out of the box. You need to see whether the auto-categorization works with your content, and for entity extraction you need to look at scale and disambiguation. When you look for vendors you will find standalone text analytics tools as well as tools embedded in content management and search products. You also need to remember that categorization is iterative, so you have to adapt over time and budget for it. There cannot be a single total score in your proof-of-concept comparisons, but rather a matrix of things to score on.
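A minimal sketch of the kind of scoring matrix he described; the criteria, vendor names, and scores are all invented for illustration:

```python
# Hypothetical proof-of-concept comparison: keep the matrix, not one number.
CRITERIA = ["categorization accuracy on our content",
            "entity extraction at scale",
            "ease of rule refinement",
            "integration with existing search/CMS"]

scores = {                       # 0-10 scores from the proof-of-concept trials
    "Vendor A": [7, 8, 5, 6],
    "Vendor B": [6, 5, 8, 7],
}

# Print the matrix itself rather than collapsing it to a single total,
# since the criteria are not really comparable on one scale.
for i, criterion in enumerate(CRITERIA):
    row = "  ".join(f"{vendor}: {scores[vendor][i]}" for vendor in scores)
    print(f"{criterion:<45} {row}")
```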
No aspect of this is simple, and it does not end, as it is infrastructure that continues to evolve. The complexity of semantic infrastructure creation opens up many risks on the people and cultural side, as well as the usual suspects on the technology side. Each area brings its own people baggage, and there are many areas to involve. Tom gave an example of how this can go wrong. It is more about anthropology than library science because of the people issues.
I mentioned that an issue in complex implementations like this is that many people can say no and often no one can say yes. You need to have executive sponsorship lined up to deal with the naysayers.
Next we looked at applications, starting with search. The software should not start and end with search; you need text analytics and text mining. Rich search results come out of a conversation: you start the search, refine it, and let people choose facets for that refinement. Facet navigation is advanced search for non-advanced searchers; it lets them get more sophisticated. Two promising areas for combining semantic technology and search are eCommerce and the enterprise. The Web itself is too complex.
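A stripped-down illustration of facet navigation as iterative filtering; the documents and facet fields are hypothetical:

```python
# Hypothetical document set with faceted metadata.
DOCS = [
    {"title": "Warranty FAQ",        "type": "faq",    "region": "EU", "year": 2011},
    {"title": "Warranty Policy",     "type": "policy", "region": "EU", "year": 2010},
    {"title": "Returns Policy (US)", "type": "policy", "region": "US", "year": 2011},
]

def refine(docs, **facets):
    """Each call narrows the result set by one or more facet values,
    mirroring the click-to-refine conversation in a faceted search UI."""
    return [d for d in docs if all(d.get(k) == v for k, v in facets.items())]

results = refine(DOCS, type="policy")     # first refinement
results = refine(results, region="EU")    # second refinement
print([d["title"] for d in results])      # ['Warranty Policy']
```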
Sample use cases include finding duplicate documents, text mining, content aggregation, and combining with data mining, among others. Getting more specific, you can look at customer sentiment in call center logs, though this can be challenging. Tom went through the development process for sentiment analysis. There is a training process here as well, and it can be complex: you start with general rules and then create more sophisticated ones to get more precise.
Future directions include greater integration between advanced tools such as text mining and predictive analytics. Another area is combining text analytics and text mining. Semantic infrastructure is the way to make effective use of these new tools. Tom said Watson was just the start. It seems like there is a lot of promise in this space and a ways to go to fully realize this promise.