December 18, 2013

The Text Collection

As mentioned above, the KDT system expects as input documents which are
annotated with category labels, where annotation might be achieved either
manually or automatically. In the experiments described here we used the
Reuters-22173 text categorization test collection, containing about 22,000 articles,
totaling 25 megabytes. The documents in this collection appeared on the Reuters
news wire in the late 1980s, and were assembled and indexed with categories by
personnel from Reuters Ltd. and Carnegie Group, Inc. Further formatting and data file
production was done in 1991 and 1992 by David D. Lewis and Peter Shoemaker.
The categories in this collection fall into just five types of tags:
countries, topics, people, organizations and stock exchanges. These five types
gave us the skeleton of the keyword hierarchy, where each type serves as an
intermediate node in a two-level hierarchy. We then enriched the hierarchy with
some additional sub-types of categories, such as agriculture and metals as
daughters of the topics node, and various international organizations (taken
from the CIA Factbook on the Internet) as daughters of the countries node.
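
As a rough illustration of the structure described above, the keyword hierarchy can be sketched as a nested dictionary. This is not the KDT system's actual representation: the names `hierarchy`, `find`, `add_child` and `path_to` are hypothetical, and the leaf keyword `wheat` is an illustrative placeholder, not a claim about how the collection's categories were assigned.

```python
# Hypothetical sketch: the keyword hierarchy as nested dicts.
# Top-level keys are the five tag types; the root node is implicit.
hierarchy = {
    "countries": {},
    "topics": {"agriculture": {}, "metals": {}},  # sub-types named in the text
    "people": {},
    "organizations": {},
    "stock exchanges": {},
}


def find(tree, name):
    """Return the children dict of the node called `name`, or None."""
    for key, children in tree.items():
        if key == name:
            return children
        hit = find(children, name)
        if hit is not None:
            return hit
    return None


def add_child(tree, parent, child):
    """Attach `child` as a daughter of `parent` (no-op if already present)."""
    node = find(tree, parent)
    if node is None:
        raise KeyError(f"unknown node: {parent}")
    node.setdefault(child, {})


def path_to(name, tree, prefix=()):
    """Return the tuple of node names leading to `name`, or None."""
    for key, children in tree.items():
        current = prefix + (key,)
        if key == name:
            return current
        found = path_to(name, children, current)
        if found is not None:
            return found
    return None


# Illustrative only: attach a placeholder keyword under a sub-type.
add_child(hierarchy, "agriculture", "wheat")
print(path_to("wheat", hierarchy))  # ('topics', 'agriculture', 'wheat')
```

International organizations would be attached to the countries node in the same way, with one `add_child` call per organization.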

