December 18, 2013

Keyword Tagging and the Keyword Hierarchy

Applying KDD operations to texts requires that documents be represented in some
structured way. We chose to base the current version of the system on the very simple
representation scheme of annotating (or tagging) each document with a set of
category label keywords. Category labels are commonly used in commercial and scientific
text collections and information feeds, and provide a high-level summary of the content
of the document. For example, articles in high-tech domains may be annotated with sets of
keywords such as {IBM, product announcement, Power PC} and {Motorola,
patent, cellular phone}. The annotation of documents with category labels may be either
manual or automatic. Automatic text categorization has recently been the focus of
substantial research in the IR and text processing communities (e.g. Apte et al.
1994; Finch 1994; Iwayama and Tokunaga 1994). Altogether, we assume that having the
documents of the collection annotated with category labels is a reasonable prerequisite
for the KDT system, one that holds for many text collections on the market.

KDT also requires that the category keywords be organized in a hierarchical structure. This keyword hierarchy is a directed acyclic graph (DAG) of terms, where each term is identified by a unique name. Figure 1 shows a portion of an example keyword hierarchy, the one used in our work with the Reuters data (see below), which will serve as a running example throughout this paper. In such a hierarchy an arc from A to B denotes that A is a more general term than B (e.g., countries → G7 → Japan). We use a general DAG rather than a tree structure so that a keyword may belong to several parent nodes (e.g. Germany is both a European-Community and a G7 country).
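
To make the DAG structure concrete, here is a minimal Python sketch of such a keyword hierarchy. It is not the KDT implementation: the class name `KeywordHierarchy` and its methods are hypothetical, and the arcs are only the few mentioned in the text (countries, G7, European-Community, Japan, Germany), not the full Reuters hierarchy of Figure 1.

```python
from collections import defaultdict


class KeywordHierarchy:
    """Toy keyword hierarchy: a DAG whose arcs run from general to specific terms.

    Hypothetical illustration only; names and layout are assumptions.
    """

    def __init__(self):
        self.children = defaultdict(set)  # term -> more specific terms
        self.parents = defaultdict(set)   # term -> more general terms

    def add_arc(self, general, specific):
        """Record that `general` is a more general term than `specific`."""
        self.children[general].add(specific)
        self.parents[specific].add(general)

    def _reach(self, term, links):
        """Collect every term reachable from `term` by following `links`."""
        seen = set()
        stack = [term]
        while stack:
            for nxt in links[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def ancestors(self, term):
        """All terms more general than `term`, over every parent path."""
        return self._reach(term, self.parents)

    def descendants(self, term):
        """All terms more specific than `term`."""
        return self._reach(term, self.children)


# Arcs taken from the examples in the text.
h = KeywordHierarchy()
h.add_arc("countries", "G7")
h.add_arc("countries", "European-Community")
h.add_arc("G7", "Japan")
h.add_arc("G7", "Germany")
h.add_arc("European-Community", "Germany")

print(sorted(h.parents["Germany"]))    # Germany has two parent nodes
print(sorted(h.ancestors("Germany")))
```

Storing both parent and child links is one possible design choice; it lets a keyword such as Germany sit under several parents, which a tree could not express.
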
It should be emphasized that the sole purpose of the keyword hierarchy is to enable generalizations and partitioning of KDD findings over sibling nodes. The structure of the hierarchy is typically simple, and reflects the basic generalizations common for the domain of interest. Such keyword hierarchies are commonly used by information providers
(e.g. the Dialog service of Knight Ridder Information Inc. or the First service of
Individual Inc.), and resemble in form a “subject index” in a yellow pages book.
Rich hierarchies have been developed for several professional domains, such as the
Medical Subject Headings (MeSH) hierarchy, and have been used to assist and
augment free-text searching. The task of constructing, obtaining and modifying such
hierarchies is thus relatively easy, and should not be confused with the task of
constructing a semantically rich structure, such as a semantic network or a taxonomy
in the “knowledge representation” sense. The KDT system provides a simple GUI for
constructing and editing the hierarchy, supporting additions, deletions and modifications of nodes and links (Figure A is a screen dump of the hierarchy maintenance editor).
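
The stated purpose of the hierarchy, generalizing findings and partitioning them over sibling nodes, can be sketched in Python as follows. This is an assumption-laden illustration rather than KDT's actual method: the function names `descendants` and `count_by_child` are hypothetical, the hierarchy fragment reuses only the arcs named in the text, and the tagged documents (including the keywords "trade" and "interest rates") are made-up sample data.

```python
# Hierarchy fragment from the text: general term -> more specific terms.
CHILDREN = {
    "countries": {"G7", "European-Community"},
    "G7": {"Japan", "Germany"},
    "European-Community": {"Germany"},
}


def descendants(term, children=CHILDREN):
    """Return every term below `term` in the hierarchy."""
    seen = set()
    stack = [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def count_by_child(node, documents, children=CHILDREN):
    """Partition document counts over the children (siblings) of `node`.

    A document counts toward a child if it is tagged with that child
    or with any term below it (generalization up the hierarchy).
    """
    counts = {}
    for child in sorted(children.get(node, ())):
        covered = descendants(child, children) | {child}
        counts[child] = sum(1 for tags in documents if tags & covered)
    return counts


# Made-up documents, each represented by its set of category keywords.
DOCS = [
    {"Japan", "trade"},
    {"Germany", "interest rates"},
    {"Germany", "Japan"},
    {"trade"},
]

result = count_by_child("countries", DOCS)
print(result)
```

Because the hierarchy is a DAG, a document tagged Germany counts toward both G7 and European-Community, so the per-sibling counts need not sum to the number of documents.
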
