Pattern-Oriented Access to Document Collections
Garett Dworman
This dissertation investigates pattern-oriented access to collections
of unstructured text documents. A pattern-oriented information search
differs from a more traditional record-oriented search just as the
study of an entire forest differs from the inspection of specific
trees. For example, to enjoy Abraham Lincoln's eloquence, we might
look up a particular speech such as the Gettysburg Address (a trees perspective);
to understand the evolution of Lincoln's ideas, we must seek trends
across the collection of his public statements (a forest perspective).
Data mining seeks this forest perspective by finding statistical
patterns in data. Unfortunately, data mining is applied only to
highly structured data, and therefore ignores much, if not most,
of the world's information, which exists as unstructured text.
Evidence from the Information Retrieval, Information Visualization,
Bibliometrics, and Library Science literatures demonstrates that
pattern-oriented access to document collections is a critically
important task, one in which people often engage even if they do
not have tools designed for this purpose. Informed by these literatures,
a prototypical pattern-discovery system named Homer
is introduced and applied in two empirical studies. The first study
required subjects to answer specific questions about the prose of
a photographer's captions; the second study required subjects to
respond to open-ended medical questions based on a collection of
emergency room medical reports. Results show that Homer
users learned more and took less time, on average, than users
of more traditional record-oriented systems. These results, combined
with evidence from the literature, argue strongly that pattern-oriented
access to document collections is possible, and can potentially
tap vast, previously unavailable sources of knowledge by helping
us find the stories hidden within our document collections.