December 18, 2013

Distribution Comparison

So far we have seen that the ability to specify keyword co-occurrence distributions provides the user with a useful mechanism for exploring subsets of documents. Taking a
KDD perspective, we are interested not only in displaying an entire distribution to the user
but also in identifying specific points in a distribution which are likely to be “interesting”.
We propose to quantify the degree of “interest” of some data by comparing it to a
given, or an “expected”, model. For example, we may want to compare the data regarding IBM to a model constructed by some averaging of the data regarding other
computer manufacturers. Alternatively, we may want to compare the data regarding IBM
in the last year to a model constructed from the data regarding IBM in previous years.
In our case, we use keyword distributions to describe the data. We therefore need a
measure for comparing the distribution defined by the data to a model distribution.
We chose to use the relative entropy measure (or Kullback-Leibler (KL) distance), defined in information theory, though we plan to investigate other measures as well. The
KL-distance seems to be an appropriate measure for our purpose since it
measures the amount of information that we lose if we model a given distribution p by
another distribution q. Denoting the distribution of the data by p and the model
distribution by q, the distance from p(x) to q(x) measures the amount of “surprise”
in seeing p while expecting q. Formally, the relative entropy between two probability
distributions p(x) and q(x) is defined as:

    D(p || q) = Σ_x p(x) log ( p(x) / q(x) )

The relative entropy is always non-negative and is 0 if and only if p=q.
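
To make the definition concrete, the following Python sketch computes the relative entropy between two keyword distributions; the keyword names and probabilities are invented for illustration and are not taken from the original text.

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(p || q) = sum over x of p(x) * log(p(x) / q(x)).

    p and q are dicts over the same keyword vocabulary; keywords with
    p(x) == 0 contribute nothing to the sum."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# Hypothetical keyword distributions: p from the documents of interest
# (say, those mentioning "IBM"), q from an averaged "expected" model.
p = {"mainframe": 0.40, "chip": 0.25, "lawsuit": 0.30, "merger": 0.05}
q = {"mainframe": 0.30, "chip": 0.40, "lawsuit": 0.10, "merger": 0.20}

print(f"D(p || q) = {kl_divergence(p, q):.4f}")  # 0 only when p == q
```

Natural logarithms give the distance in nats; choosing another base only rescales the result.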
According to this view, interesting distributions will be those with a large
distance to the model distribution. Interesting data points will be those that make a big
contribution to the distance between the given distribution and the model (i.e., x’s
whose contribution to the sum is large). The following sections show how various
interesting patterns can be identified by measuring the relative entropy distance
between a distribution and different baseline models.
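
One way to realize this per-point analysis, continuing the sketch above with the same invented distributions, is to compute each term p(x) log( p(x) / q(x) ) separately and rank the keywords by their contribution to the distance.

```python
import math

def kl_contributions(p, q):
    """Per-keyword terms p(x) * log(p(x) / q(x)) of D(p || q), sorted so
    the keywords contributing most to the distance come first."""
    terms = {x: px * math.log(px / q[x]) for x, px in p.items() if px > 0}
    return sorted(terms.items(), key=lambda kv: kv[1], reverse=True)

# The same hypothetical distributions used above.
p = {"mainframe": 0.40, "chip": 0.25, "lawsuit": 0.30, "merger": 0.05}
q = {"mainframe": 0.30, "chip": 0.40, "lawsuit": 0.10, "merger": 0.20}

for keyword, term in kl_contributions(p, q):
    print(f"{keyword:10s} {term:+.4f}")
# Keywords at the top of the list ("lawsuit" in this toy example) are the
# candidate "interesting" points; negative terms occur less often than
# the model expects.
```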

