All Keyword Analysis: Introduction

Traditional database query tools allow a user to retrieve records based on the content of each record in isolation. In a hospital database, for example, a user might request all records for hospital stays that are shorter than one day and cost more than $10,000. Each retrieved record is selected because the information in that record, independent of any other record, satisfies the user's query. In contrast, work on knowledge discovery in databases (KDD) provides tools for accessing information based on patterns appearing across records. For example, KDD tools might let a user ask for records of patients whose cost of medical care for some illness is much higher than typical (where "typical" is implicitly defined by the values of other records in the database), or investigate whether statistical patterns exist relating the length of patients' hospital stays to their family circumstances (whether the patient is married, how many children the patient has, etc.).

Although the goal of KDD work is to provide access to patterns and information in online information collections, most efforts have focused on knowledge discovery in
structured databases. However, a tremendous amount of online information
appears only in collections of unstructured text. Most research in Information Retrieval
(IR) has developed methods for providing access to documents based on the information
contained in a document in isolation (analogous to what traditional database query tools provide for databases). In this case, it is assumed that the user knows in advance the
topic of documents of interest. Clustering methods have been used to impose structure over a
collection of documents, enabling the user to browse through the collection and select clusters of documents of interest (e.g., Salton, 1989; Cutting et al., 1993). Visualization
methods have also been used to present additional structure hidden in a document or a set of documents (Williamson and Shneiderman, 1992; Hearst, 1995). However,
there has been little work on providing KDD-style tools for browsing and analyzing text
collections based on information appearing across documents. Applying such tools to
texts means that the system would take an active role in suggesting topics of interest to
the user, as well as supply new browsing methods that rely on inter-document
information. A KDD framework for texts may thus be viewed as an intermediate point
between user-specified retrieval queries and unsupervised document clustering: the user
typically provides some guidance to the system about the type of patterns of interest,
but then the system makes unsupervised decisions in finding specific statistically
motivated patterns.

This paper describes the Knowledge Discovery in Texts (KDT) system, which applies a novel knowledge discovery framework to textual databases. Our goal is to provide, for text collections, the same types of KDD operations previously provided for structured databases. To do so, we rely on a text-categorization paradigm in which each document is labeled with a set of keywords, each drawn from a hierarchy of terms. Unlike in traditional IR work, where keywords (category labels) are used to specify retrieval (or routing) queries, KDT allows a user to access documents and recognize
patterns across them based on the observed co-occurrence distributions of keywords in
documents of the collection. A key insight in this work is that keyword co-occurrence
frequencies (or distributions) can provide the foundation for a wide range of KDD
operations on collections of textual documents, including:

1. Summarization and Browsing: KDT allows the user to view the frequency of occurrence of keywords from some category in a collection of documents that contain particular keywords from some other category, and to browse the collection of documents based on these frequencies.
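
To make the summarization and browsing operation more concrete, here is a minimal Python sketch. It is not the KDT implementation; the toy documents, the two-category hierarchy, and the function names `keyword_distribution` and `browse` are all hypothetical and chosen only for illustration. The sketch selects the documents labeled with some given keywords, then reports what fraction of those documents carry each keyword from another category.

```python
from collections import Counter

# Hypothetical toy collection: each document is labeled with a set of keywords.
documents = {
    "doc1": {"iran", "crude oil"},
    "doc2": {"iran", "crude oil", "gold"},
    "doc3": {"usa", "wheat"},
    "doc4": {"iran", "usa", "wheat"},
}

# Hypothetical one-level hierarchy: category name -> keywords under it.
hierarchy = {
    "countries": {"iran", "usa"},
    "commodities": {"crude oil", "wheat", "gold"},
}


def keyword_distribution(documents, hierarchy, target_category, given_keywords):
    """For documents labeled with all of given_keywords, return the fraction
    of those documents that carry each keyword of target_category."""
    given = set(given_keywords)
    selected = [kws for kws in documents.values() if given <= kws]
    if not selected:
        return {}
    targets = hierarchy[target_category]
    counts = Counter(kw for kws in selected for kw in kws & targets)
    return {kw: n / len(selected) for kw, n in counts.items()}


def browse(documents, given_keywords, keyword):
    """Return the ids of documents labeled with given_keywords and keyword."""
    wanted = set(given_keywords) | {keyword}
    return sorted(doc_id for doc_id, kws in documents.items() if wanted <= kws)


if __name__ == "__main__":
    dist = keyword_distribution(documents, hierarchy, "commodities", {"iran"})
    # Show the most frequent commodities first, then drill into each one.
    for kw, share in sorted(dist.items(), key=lambda item: (-item[1], item[0])):
        print(f"{kw}: {share:.2f} -> {browse(documents, {'iran'}, kw)}")
```

The fractions are computed per document, so they need not sum to one when a document carries several keywords from the target category. A real system would also have to handle deeper hierarchies, which this sketch leaves out.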

December 18, 2013
