All Keyword Analysis

December 21, 2013

Search Strategies

Twenty years ago, Marchionini [15] found that while using a CD-ROM encyclopedia to search, children had difficulty formulating search terms and frequently used
natural language or phrases instead of keywords. The younger children in the study, 8-10 year olds, were much more likely than the 11-12 year old group to use phrases or sentences, leading to unsuccessful searches. Large et al. [13] found that while using multimedia CD-ROM programs, 11- and 12-year-old children preferred to browse for information rather than to search. Schacter et al. [19] suggest that a lack of planning when attempting a
complex search and a desire for the easiest path to the desired information lead children to prefer browsing to keyword search. This finding is consistent with the observation that novice adult users tend to prefer strategies that require less cognitive load [15]. Bilal [3, 4] also found that when looking for information on Yahooligans, a web directory, 12- and 13-year-old
children were better at finding information by browsing than by searching, and that they browsed much more than they searched. Despite this, the children still preferred to
use keywords to search. Thus, while children might use sub-optimal search strategies and fail to find the information they need, they still may want to use keyword search interfaces.

RELATED RESEARCH

The research on children’s search interfaces includes three areas of HCI and information science research: (1) search strategies; (2) typing and spelling; (3) deciphering results. These three aspects also identify the areas of considerable challenge for children. In the sections that follow, we discuss research and challenges in each of these areas.

The Need for Research

Browsing the web exposes children to vast numbers of websites on every imaginable topic and in many different media formats (e.g. web sites, documents, videos, images). Using a search engine to find information from large numbers of disparate web pages is very different
from searching the finite and pre-determined content found in the CD-ROM applications, online digital libraries, and web directories of the past. However, most of what we know about how children search is based on studies using these kinds of sources. Today’s search engines are not only more expansive than past technologies, but more ubiquitous in a child’s world. It is not uncommon for young people to begin to go online with a parent or sibling by the age of three or four. They move from their home computers, to their schools’
computing facilities, to their mobile phones, searching for online games, information for school assignments, and random facts they are curious about because of the world around them. Today’s children are the first generation of what are being called “digital natives” [1]. As a result, it is important to explore how children search the Internet with today’s ubiquitous keyword interfaces. With the knowledge gained from such studies, we hope to design and test new interfaces and algorithms to better support the needs of children.

Author Keywords

Children, Internet, search, search engine, query formulation, typing, search results

ACM Classification Keywords

H.3.3. Information Search and Retrieval; H.5.2 User Interfaces: Graphical user interfaces (GUI) and User-centered design.

INTRODUCTION

The leading activity for all age groups on the Internet is general exploration: activities such as searching, surfing, and reading about interests, sports, and movies [6]. Recent studies in the U.S. have shown that 74% of children ages 8-18 years have access to the Internet [18]. Children make up one of the largest groups of users of computers and the Internet [17]. Despite children’s frequent use of the Internet and exposure to technology at an early age, when asked ‘what frustrates you most about searching on the Internet’, several child participants in our study provided some revealing answers. Child (age 7): “Writing words is hard for me because I'm not really good at the writing.” Child (age 9): “It doesn’t do all the words you say.” Child (age 11): “It's hard because you have to find the
right words to put in the box.”
These challenges were just a few of several we saw when conducting our initial study on how children search the Internet. When, where, what, and how they search were prominent concerns as we interviewed the children and parents who participated. This paper describes our methods, highlights our findings, and offers considerations for the design of future Internet search interfaces for children.

ABSTRACT

Children are among the most frequent users of the Internet, yet searching and browsing the web can present many challenges. Studies over the past two decades on how children search were conducted with finite and predetermined content found in CD-ROM applications,
online digital libraries, and web directories. However, with the current popularity of the open Internet and keyword-based interfaces for searching it, more critical analysis of the challenges children face today is needed. This paper presents the findings of our initial study to
understand how children ages 7, 9, and 11 search the Internet using keyword interfaces in the home. Our research has revealed that although today’s children have been exposed to computers for most of their lives, spelling, typing, query formulation, and deciphering results are all still potential barriers to finding the information they need.

December 18, 2013

Conclusions

We have presented a framework and an implemented system for browsing and
analyzing sets of documents which are annotated with category keyword labels.
The system might be used as a support tool for domain experts that need to analyze
and summarize large document sets. It may also be used in the regular query-and-browse
cycle of a document retrieval session, to support the browsing phase.
Currently, when users face the common response of the type “1000 documents
match your query”, they need to guess in advance how they might restrict their query.
In such cases the KDT system could provide much help in figuring out the
content of these 1000 documents, and narrowing down the sets of target
documents. The KDT system is based on a compact model, which relies on rather modest assumptions. It requires annotation of documents with category keywords which are organized in a simple hierarchy. It also demonstrates the rich variety of KDD
operations that can be based on keyword co-occurrence distributions and their
comparison with the relative entropy distance measure. The simplicity of the
model makes it rather easy to implement, and the pre-computation of keyword co-occurrence distributions makes online computations very efficient. In future work we plan to extend the KDT framework to work also on co-occurrence distributions of terms and
groups of terms that were extracted directly from the texts. This way we hope to
combine these two levels of representation, namely category labels and document
terms, in analogy to the way they are often combined in retrieval queries.

Finding Trends Over Time

One of the most important needs of an analyst is the ability to follow changes
over time in the behavior of entities of interest. For example, a trend analysis tool
should be able to compare the activities that a company performed in some domain
in the past with its current activities in that domain. A possible
conclusion from such an analysis would be that a company is shifting its activities from
one domain to another. The KDT system identifies trends by comparing a distribution of data taken from one period of time to a corresponding model distribution which is constructed from data of another period. Trends are then
discovered by searching for significant deviations from the expected model, as
before. Figure I lists trends that were identified across the different quarters of
the year. The program was directed to search for significant changes in the co-occurrence
distributions of Arab League countries with any other country. For example, the first line of the top listbox shows that in the 3rd quarter there was a large increase in the proportion of articles that mention both Libya and Chad among all articles mentioning Libya (from 0% in the 2nd quarter to 35.29% in the 3rd quarter). The second line shows that the proportion of such articles in the 3rd quarter was also
much higher than in the fourth quarter (a decrease over time, again to 0%). An
analyst might then want to investigate what happened in the 3rd quarter regarding Libya
and Chad. To facilitate such an investigation, the system provides access to
the specific articles that support the trend, by double clicking on the appropriate line.
Then, a listbox containing all titles of the relevant documents appears, as in Figure J,
revealing that the cause for the trend was the fighting between Libya and Chad at that
period.

Figure I - Trends in co-occurrence of Arab League countries with other countries. The
distance is measured from the period (quarter) listed in the second column (P1) to the period in the third column (P2), where each line corresponds to a large contribution to this distance. The last five columns are as in previous figures.
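
The period-to-period comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the KDT implementation: it assumes the distance is the standard relative entropy, the function names and the counts are invented for the example, and the additive smoothing constant `eps` is an assumption added here so that a category that was absent in the model period (such as a jump from 0%) does not cause a division by zero.

```python
import math

def smoothed_distribution(counts, vocabulary, eps=1e-3):
    """Turn raw co-occurrence counts into probabilities.

    eps is a small additive smoothing constant (an assumption of this
    sketch) that keeps every probability strictly positive.
    """
    total = sum(counts.get(k, 0) for k in vocabulary) + eps * len(vocabulary)
    return {k: (counts.get(k, 0) + eps) / total for k in vocabulary}

def trend_contributions(current_counts, model_counts, eps=1e-3):
    """Per-category contributions to the relative entropy from the
    current period's distribution to the model period's distribution."""
    vocabulary = set(current_counts) | set(model_counts)
    p = smoothed_distribution(current_counts, vocabulary, eps)
    q = smoothed_distribution(model_counts, vocabulary, eps)
    return {k: p[k] * math.log2(p[k] / q[k]) for k in vocabulary}

# Invented counts: articles in which one fixed country co-occurs with
# each partner country, for two consecutive quarters.
model_quarter = {"partner_a": 8, "partner_b": 4}
current_quarter = {"partner_a": 7, "partner_b": 4, "partner_new": 6}

contrib = trend_contributions(current_quarter, model_quarter)
biggest = max(contrib, key=contrib.get)
print(biggest, round(contrib[biggest], 3))
```

In this toy data the partner that appears only in the current quarter dominates the distance, which is the kind of line a trend listing would surface first.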

Finally, the user can request a graphical representation of co-occurrence frequencies
of any two categories, at a desired level of granularity of time segments. Figure K
displays the percentage of articles annotated with the category crude within the average
topic distribution of OPEC countries, across different quarters.

Figure K - Crude proportion of the topic distribution of OPEC across the year quarters

Specific comparisons

The mechanism for identifying strong associations relative to a model is also useful
for comparing conditional distributions of two specific nodes in the hierarchy. In Figure
G we measure the distance from the average topic distribution of Arab League countries
to the average topic distribution of G7 countries. This reveals the topics with which
Arab League countries are associated much more than G7 countries, like crude-oil and
wheat. Figure H shows the comparison in the opposite direction, revealing the topics with
which G7 countries are highly associated relative to the Arab League.

Figure G - Topics Profile Comparison of the Arab League countries vs. the G7 countries.
Entries in the top listbox are sorted in decreasing order of their contribution to the relative entropy distance (2nd column). The 3rd and 5th columns show, respectively, the percentage of the topic in the average topic distribution of the Arab League countries and in the average topic distribution of the G7 countries. The 4th and 6th columns show, respectively, the total number of articles in which the topic appears with any Arab League country and any G7 country.

Figure H - Topics Profile Comparison of the G7 countries vs. the Arab League countries. The columns in the upper listbox are the same as in Figure G.
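
The two directions of comparison give different rankings because relative entropy is not symmetric. As a sketch, assuming the distance is the standard relative entropy (this section names the measure but does not spell out its formula), and writing P_AL and P_G7 for the two average topic distributions (notation introduced here only for illustration), the two comparisons are:

```latex
\begin{align*}
D(P_{AL} \,\|\, P_{G7}) &= \sum_{t} P_{AL}(t) \log \frac{P_{AL}(t)}{P_{G7}(t)}, \\
D(P_{G7} \,\|\, P_{AL}) &= \sum_{t} P_{G7}(t) \log \frac{P_{G7}(t)}{P_{AL}(t)}.
\end{align*}
```

Under this reading, the contribution of a topic t is its summand, which is large only when t is much more probable under the first distribution than under the second. That is why the listing in one direction highlights topics tied to the Arab League, and the listing in the other direction highlights topics tied to the G7.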

General associations

Another form of association can be defined by taking as the baseline model the average
distribution of the conditioned category over all possible instantiations of the conditioning
category (in the formulation of the previous sub-section, x would range over all categories of the same type, rather than over all immediate siblings). This form is demonstrated in Figure E, which lists the strongest associations found between some
country and some topic. The system also enables the user to investigate further the
subset of documents which corresponds to a certain association. In Figure E we chose to
explore the set of documents corresponding to the association between South Korea and
trade, presenting the distribution of countries within this set (lower-right listbox, specified
by the “Expand Category” pull-down menu). This reveals which countries are most
prominent in articles dealing with both South Korea and trade, conveniently linking the
browsing mechanism of Figure C to the association display screen.
In many cases, the system generates a very large number of associations, making it
difficult to draw overall conclusions. To summarize the information, the system
groups together correlations whose second component belongs to the same class in the
hierarchy. Figure F shows the clusters that were formed by the system when grouping all
the individual associations of Figure E. For example, in 43 associations of Figure E the
right-hand side of the association (the topic) was a daughter of the node agriculture. The user can examine any cluster and see the specific associations it contains (lower
listbox, for the selected cluster caffeine drinks).
In addition, the system tries to provide a compact generalization for all the categories on the left-hand side of the associations in the cluster. In our example, the system found that all countries that are highly correlated with caffeine drinks belong either to the OAU (African Union) or the OAS (South American countries) organizations.

Figure F - Clustering associations using the category hierarchy. In the upper listbox we can see all association clusters that were formed by the system along with their sizes (in
parentheses). In the lower listbox we see the members of the cluster that was selected in the upper listbox (caffeine drinks).
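
The grouping step can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual code: the two lookup tables, the function names, and the sample associations are all invented for the example, and the hierarchy is reduced to a single parent per category.

```python
from collections import defaultdict

# Toy hierarchy fragments (assumed for illustration): category -> parent.
TOPIC_PARENT = {
    "coffee": "caffeine drinks",
    "tea": "caffeine drinks",
    "wheat": "grains",
    "corn": "grains",
}
COUNTRY_PARENT = {
    "Kenya": "OAU",
    "Brazil": "OAS",
    "Colombia": "OAS",
    "Canada": "G7",
}

def cluster_by_parent(associations, parent_of):
    """Group (left, right) associations by the parent of the right-hand side."""
    clusters = defaultdict(list)
    for left, right in associations:
        clusters[parent_of[right]].append((left, right))
    return dict(clusters)

def generalize_left(cluster, parent_of):
    """Summarize a cluster's left-hand sides by the set of their parents."""
    return sorted({parent_of[left] for left, _ in cluster})

associations = [
    ("Kenya", "tea"),
    ("Brazil", "coffee"),
    ("Colombia", "coffee"),
    ("Canada", "wheat"),
]

clusters = cluster_by_parent(associations, TOPIC_PARENT)
for name, members in clusters.items():
    print(f"{name} ({len(members)})", generalize_left(members, COUNTRY_PARENT))
```

A real hierarchy would allow a category to have several ancestors, so the generalization step would need to pick a level; the sketch sidesteps that by using only immediate parents.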

Associations relative to a class

Consider a conditional distribution of the form P(C=c | xi), where xi is a conditioning
concept. In many cases, it is natural to expect that this distribution would be similar to
other distributions of this form, in which the conditioning event is a sibling of xi. For
example, when C denotes the node commercial-activity, and xi=Ford (the car
manufacturer), we could expect a distribution that is quite similar to such distributions
where the conditioning concept is another car manufacturer (a sibling of Ford in the
hierarchy). To capture this reasoning, we use Avg P(C=c | x), the average sibling
distribution, as a model for P(C=c | xi), where x ranges over all siblings of xi
(including xi itself). In the above example, we would measure the distance from the
distribution P(C=c | Ford) to the average distribution Avg P(C=c | x), where x ranges
over all car manufacturers and C denotes the node commercial-activity. The distance
between these two distributions would be large if the activity profile of Ford differs a
lot from the average profile of other car manufacturers. Furthermore, specific points
in the distribution (specific activities) that make a large contribution to the distance are activities which are associated with Ford much more than with other car
manufacturers.
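
The comparison with the average sibling distribution can be sketched in Python. This is an illustration under assumptions rather than the authors' implementation: it takes the distance to be the standard relative entropy, and the function names and the toy activity profiles are invented for the example. Because the average includes the distribution being compared, the model probability is nonzero wherever the member's probability is nonzero, so no smoothing is needed here.

```python
import math

def average_distribution(distributions):
    """Pointwise average of several distributions (dicts: category -> probability)."""
    keys = set().union(*distributions)
    n = len(distributions)
    return {k: sum(d.get(k, 0.0) for d in distributions) / n for k in keys}

def kl_contributions(p, q):
    """Per-category terms p(c) * log2(p(c) / q(c)) of the relative entropy.

    Categories with p(c) == 0 contribute nothing and are skipped.
    """
    return {c: pc * math.log2(pc / q[c]) for c, pc in p.items() if pc > 0}

# Toy activity profiles for three sibling nodes (invented numbers).
siblings = {
    "Ford": {"sales": 0.5, "recall": 0.4, "merger": 0.1},
    "GM": {"sales": 0.6, "recall": 0.1, "merger": 0.3},
    "Chrysler": {"sales": 0.7, "recall": 0.1, "merger": 0.2},
}

avg = average_distribution(list(siblings.values()))
contrib = kl_contributions(siblings["Ford"], avg)
distance = sum(contrib.values())
top = max(contrib, key=contrib.get)
print(f"distance={distance:.4f}, largest contribution: {top}")
```

In the toy data the activity with the largest contribution is the one where Ford's share most exceeds the sibling average, which matches the interpretation given above for points that contribute heavily to the distance.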
Figure D demonstrates this type of comparison, between the topic distribution of
each G7 country and the average sibling distribution of topics for all G7 countries.
The countries are sorted in decreasing order of their distance to the average distribution,
revealing that Japan is the most “atypical” G7 country (with respect to its topic
distribution) while Germany is the most typical one. The topics that made the largest
contributions to the distance for each country are also displayed. The user can
then click on any class member and get an expanded view of the comparison between
the topic distribution of this member and the average distribution. In Figure D we have
expanded the topic list of the UK (at the bottom-right list box), providing the
statistical detail for the strong associations between the UK and topics like bonds, sugar,
cocoa etc. In addition to their value in finding associations, comparisons of this type
provide a hierarchical browsing mechanism for keyword co-occurrence distributions. For
example, an analyst that is interested in studying the topic distribution in articles
dealing with G7 countries may first browse the average class distribution for G7, using a
presentation as in Figures 2 and 3. This will reveal the major topics that are generally
common for G7 countries. Then, the presentation of Figure D would reveal the
major characteristics which are specific for each country.

Figure D - Comparison of the topic distribution of members of the G7 organization vs. the average topic distribution of the G7. Entries in the top listbox are sorted in decreasing order of their relative entropy distance to the average topic distribution (2nd column). The 3rd column shows the major topics that contributed to that distance. In the lower-right listbox, we can see detailed information about these topics for a selected country (UK). The 2nd column shows the contribution of the topic to the relative entropy distance. The 3rd and 5th columns show, respectively, the percentage that the topic takes from the topic distribution of the specific country (3rd) and from the average topic distribution of the G7 countries (5th). The 4th and 6th columns show, respectively, the total number of articles in
which the topic appears with the specific country (4th) and with any G7 country (6th).

Figure E - Country-Topic associations with a high contribution to the relative entropy
distance between the topic distribution of the country and the average topic distribution
for all countries. Associations are sorted in decreasing order of the relative entropy
distance to the global average (3rd column). The 4th and 6th columns show, respectively, the percentage that the topic takes from the topic distribution of the specific country (4th) and from the average topic distribution of all countries (6th). The 5th and 7th columns show, respectively, the total number of articles in which the topic appears with the specific country (5th) and with any country (7th).
