December 18, 2013

Keyword Co-Occurrence Distributions

All KDD operations supported by the KDT system are based on an analysis of the
keywords that annotate the articles in the collection. More specifically, KDT
computes the distribution of daughter terms relative to their siblings for all keywords in
the hierarchy. For example, the annotations of documents with daughters of the
keyword node computers may be distributed as follows: mainframes: 0.1;
work-stations: 0.4; PCs: 0.5. In formal terms, we let a node C in the hierarchy
define a discrete random variable whose values are its daughters, where
each occurrence of a daughter provides a data point. We denote the distribution of the
random variable by P(C=c), where c ranges over all daughters of C. The event C=c
corresponds to the annotation of a document with the daughter category c.
P(C=c_i) is the proportion of annotations of documents with c_i among all annotations of
documents with any daughter of C. In the example above we would say that
P(C=mainframes)=0.1, where C denotes the random variable which corresponds to
the node computers.

In KDT we are most interested in conditional keyword distributions of the
form P(C=c|x), where x is a conditioning event that denotes some other category
keyword. Such distributions describe the co-occurrence of the category x with all
daughters of C. Figure B shows an example of such a distribution, where C stands
for the node topics and x stands for Argentina. In other words, the figure
presents the distribution of topic keywords (i.e., keywords that are daughters of
the topics node) in articles that are also annotated with the keyword Argentina.
In Figure B the distribution is presented as a pie chart, along with the absolute
frequency of each slice of the pie:
12 of the articles annotated with Argentina are also annotated with sorghum, 20 with corn,
32 with grain, etc. The KDT system presents distributions in several forms,
graphical (e.g. bar-chart) or alphanumeric (see Figure C), listing absolute frequencies
or probabilities (percentages).

More generally, a keyword co-occurrence distribution may be conditioned on the
joint occurrence of several category keywords, not just one. For example,
Figure C displays the distribution P(C=c|x,y), where C stands for topics, x
for UK, and y for USA. In other words, this is the distribution of topics in articles that
deal with both UK and USA. The distribution is presented in the lower right
window of the screen.

By letting the user specify and display
conditional keyword co-occurrence distributions, as in Figure B and Figure C,
the KDT system provides a powerful browsing mechanism for large subsets of
documents. A traditional document retrieval system enables the user to ask for all
documents containing the keywords UK and USA, but then presents the entire set of
matching documents without describing its internal structure. Typically, the documents
will be sorted either by relevance score, which would be determined in this case by
the frequency and position of the given keywords in the document, or in
chronological order. The KDT system, on the other hand, enables the user to
investigate the contents of this document set by sorting it according to the daughter
distribution of any node in the hierarchy, such as topics, countries, companies, etc.
Once the documents are sorted, and the distribution is displayed, the user can
access the specific documents of each subgroup. In Figure C, for example, the
user chose to click on the 24 documents annotated with trade, which led to the
display of all titles of these documents (those annotated by UK, USA, and trade) in
the upper window of the screen.
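
The definitions above can be sketched in a few lines of Python. This is only an illustration, not the KDT implementation: the hierarchy, the document collection, and the function names (`daughter_counts`, `daughter_distribution`, `drill_down`) are all hypothetical, and the numbers are toy data rather than the figures quoted above. The sketch counts annotations with daughters of a node C, optionally restricted to documents that carry every conditioning keyword, and then lists the documents behind one slice of the distribution.

```python
from collections import Counter

# Hypothetical toy hierarchy: each node maps to its daughter keywords.
HIERARCHY = {
    "topics": {"trade", "grain", "corn", "sorghum"},
    "countries": {"UK", "USA", "Argentina"},
}

# Hypothetical annotated collection: (title, set of keyword annotations).
DOCS = [
    ("doc1", {"UK", "USA", "trade"}),
    ("doc2", {"UK", "USA", "trade", "grain"}),
    ("doc3", {"UK", "USA", "grain"}),
    ("doc4", {"Argentina", "corn", "grain"}),
    ("doc5", {"Argentina", "sorghum"}),
    ("doc6", {"USA", "corn"}),
]


def matching(docs, given=()):
    """Keep only documents annotated with every conditioning keyword."""
    condition = set(given)
    return [(title, kws) for title, kws in docs if condition <= kws]


def daughter_counts(docs, node, given=()):
    """Absolute frequency of each daughter of `node` among matching documents."""
    daughters = HIERARCHY[node]
    counts = Counter()
    for _title, kws in matching(docs, given):
        # Each annotation with a daughter of the node is one data point.
        counts.update(kws & daughters)
    return counts


def daughter_distribution(docs, node, given=()):
    """P(C=c | given): each count divided by the total number of annotations."""
    counts = daughter_counts(docs, node, given)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


def drill_down(docs, daughter, given=()):
    """Titles of the documents behind one slice of the distribution."""
    return [title for title, kws in matching(docs, given) if daughter in kws]


if __name__ == "__main__":
    # Distribution of topics in documents annotated with both UK and USA.
    print(sorted(daughter_distribution(DOCS, "topics", ("UK", "USA")).items()))
    # Documents annotated with UK, USA, and trade.
    print(drill_down(DOCS, "trade", ("UK", "USA")))  # ['doc1', 'doc2']
```

Passing an empty `given` yields the unconditional distribution P(C=c), while one or more conditioning keywords give P(C=c|x) or P(C=c|x,y). Note that a document annotated with two daughters contributes two data points, in line with the annotation-based definition above.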
