Consider a conditional distribution of the form P(C=c | xi), where xi is a conditioning
concept. In many cases, it is natural to expect that this distribution would be similar to
other distributions of this form, in which the conditioning concept is a sibling of xi. For
example, when C denotes the node commercial-activity, and xi=Ford (the car
manufacturer), we could expect a distribution that is quite similar to such distributions
where the conditioning concept is another car manufacturer (a sibling of Ford in the
hierarchy). To capture this reasoning, we use Avg P(C=c | x), the average sibling
distribution, as a model for P(C=c | xi), where x ranges over all siblings of xi
(including xi itself). In the above example, we would measure the distance from the
distribution P(C=c | Ford) to the average distribution Avg P(C=c | x), where x ranges
over all car manufacturers and C denotes the node commercial-activity. The distance
between these two distributions would be large if the activity profile of Ford differs
substantially from the average profile of car manufacturers. Furthermore, the specific
points in the distribution (specific activities) that make a large contribution to the
distance are activities which are associated with Ford much more than with other car
manufacturers.
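The comparison described above can be sketched in a few lines of Python. This is only a
minimal illustration under stated assumptions, not the system's actual implementation: it
assumes the distance is relative entropy (as the figure captions indicate), it assumes
Avg P(C=c | x) is the unweighted mean of the sibling distributions, and the sibling names
(maker_a, maker_b, maker_c), activity names and co-occurrence counts are all invented toy
values. The helper names (normalize, average_distribution, contributions,
relative_entropy) are hypothetical.

```python
import math

def normalize(counts):
    """Turn raw co-occurrence counts into a distribution P(C=c | x)."""
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def average_distribution(dists):
    """Unweighted mean of the sibling distributions (an assumption, see lead-in)."""
    categories = set().union(*dists.values())
    k = len(dists)
    return {c: sum(d.get(c, 0.0) for d in dists.values()) / k for c in categories}

def contributions(p, q):
    """Per-category terms p(c) * log2(p(c) / q(c)) of the relative entropy.

    Because the average includes the sibling itself, q(c) > 0 whenever p(c) > 0.
    """
    return {c: pc * math.log2(pc / q[c]) for c, pc in p.items() if pc > 0}

def relative_entropy(p, q):
    """Relative entropy D(p || q) in bits."""
    return sum(contributions(p, q).values())

# Toy counts (invented): how often each sibling co-occurs with each activity.
counts = {
    "maker_a": {"x": 8, "y": 2},
    "maker_b": {"x": 5, "y": 5},
    "maker_c": {"x": 2, "y": 8},
}
dists = {name: normalize(c) for name, c in counts.items()}
avg = average_distribution(dists)

# Siblings in decreasing order of their distance to the average distribution.
ranking = sorted(dists, key=lambda name: relative_entropy(dists[name], avg), reverse=True)
for name in ranking:
    terms = contributions(dists[name], avg)
    top = max(terms, key=terms.get)  # category contributing most to the distance
    print(name, round(relative_entropy(dists[name], avg), 3), top)
```

In this toy data maker_b coincides with the average, so its distance is zero, while for
maker_a the largest positive term comes from the activity it is associated with more
strongly than its siblings are. A pooled (count-weighted) average would be an equally
plausible reading of Avg P(C=c | x) and would change only average_distribution.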
Figure D demonstrates this type of comparison between the topic distribution of
each G7 country and the average sibling distribution of topics for all G7 countries.
The countries are sorted in decreasing order of their distance to the average distribution,
revealing that Japan is the most “atypical” G7 country (with respect to its topic
distribution) while Germany is the most typical one. The topics that made the largest
contributions to the distance for each country are also displayed. The user can
then click on any class member to get an expanded view of the comparison between
the topic distribution of this member and the average distribution. In Figure D we have
expanded the topic list of the UK (in the bottom-right list box), providing the
statistical detail for the strong associations between the UK and topics such as bonds,
sugar, and cocoa. In addition to their value in finding associations, comparisons of this
type provide a hierarchical browsing mechanism for keyword co-occurrence distributions. For
example, an analyst who is interested in studying the topic distribution in articles
dealing with G7 countries may first browse the average class distribution for G7, using a
presentation as in Figures 2 and 3. This will reveal the major topics that are generally
common to G7 countries. Then, the presentation of Figure D would reveal the
major characteristics that are specific to each country.

Figure D - Comparison of the topic distribution of each member of the G7 organization
with the average topic distribution of the G7. Entries in the top list box are sorted in
decreasing order of their relative entropy distance to the average topic distribution
(2nd column). The 3rd column shows the major topics that contributed to that distance. The
lower-right list box shows detailed information about these topics for a selected country
(UK). The 2nd column shows the contribution of the topic to the relative entropy distance.
The 3rd and 5th columns show, respectively, the percentage that the topic takes from the
topic distribution of the specific country (3rd) and from the average topic distribution
of the G7 countries (5th). The 4th and 6th columns show, respectively, the total number of
articles in which the topic appears with the specific country (4th) and with any G7
country (6th).
Figure E - Country-Topic associations with a high contribution to the relative entropy
distance between the topic distribution of the country and the average topic distribution
for all countries. Associations are sorted in decreasing order of the relative entropy
distance to the global average (3rd column). The 4th and 6th columns show, respectively,
the percentage that the topic takes from the topic distribution of the specific country
(4th) and from the average topic distribution of all countries (6th). The 5th and 7th
columns show, respectively, the total number of articles in which the topic appears with
the specific country (5th) and with any country (7th).