How to Create a WAIS Query
This document describes how to create a query for the WAIS servers that
are maintained by Tennessee Valley Authority (TVA) for GILS. This document
was copied from the WAIS Inc. administration manual's description of WAIS
queries.
Types of Queries
Natural Language Query
At the heart of the WAISserver product is the WAIS search engine. The
WAIS search engine receives a user's question or query, searches its database
for documents most relevant to the question, and returns a relevance-ranked
list of documents back to the user. Each document is given a score from
1 to 1000, based on how well it matched the user's question (how many
words it contained, their importance in the document, etc.). A question,
or query, is an expression that can contain any combination of natural
language, literal strings, boolean operators, field name-value pairs,
date or numeric ranges, and relevant documents. The WAIS search engine
also supports right truncation (wildcard searching), word stemming, and
relevance ranking. Each of these capabilities is explained below.
Natural Language
The server can be queried using natural language
questions. The server does not understand the question; rather, it takes
the words and phrases in the question and finds documents that have those
words and phrases in them. "Tell me about portable computers." is an example
of a natural language question. In this example, the WAIS server would
search for documents containing the words 'portable' and 'computers'; the
other words, 'tell', 'me', and 'about', are called "stop words" -- words
so common that they occur in almost every document and so they are not
used for searching a document.
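As a rough illustration of the stop-word behavior described above, the following Python sketch reduces a natural language question to the words that would actually be searched. The stop-word list and the function name search_terms are assumptions made for this example; the server's real stop-word list is not given in this document.

```python
import re

# Hypothetical stop-word list; the server's actual list is not documented here.
STOP_WORDS = {"tell", "me", "about", "the", "a", "an", "of", "in", "on"}

def search_terms(question):
    """Return the lowercased words of a question that are not stop words."""
    words = re.findall(r"[A-Za-z0-9]+", question.lower())
    return [w for w in words if w not in STOP_WORDS]

print(search_terms("Tell me about portable computers."))
```

With this assumed list, only 'portable' and 'computers' remain as search words.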
Literal Strings
A similar but more specific kind of query asks to find documents that
contain one or more exact phrases by enclosing them in double quotation
marks. This is known as a literal. For example, the query "search engine
capabilities" returns only documents that contain this exact phrase. The
WAIS search engine performs a literal search exactly as if you had used
the boolean operator ADJ. Thus the above example would yield the same
results as search ADJ engine ADJ capabilities. For this reason, it is
best to stick to noun phrases when using literals; if your literal phrase
includes stopwords, it won't work.
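The equivalence between a literal and a chain of ADJ operators can be sketched in a few lines of Python. The function name literal_to_adj is hypothetical; it only rewrites the query text and does not talk to a server.

```python
def literal_to_adj(literal):
    """Rewrite a double-quoted literal as the equivalent chain of ADJ operators."""
    words = literal.strip().strip('"').split()
    return " ADJ ".join(words)

print(literal_to_adj('"search engine capabilities"'))
```

For the literal used above, this prints the query search ADJ engine ADJ capabilities.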
Relevance Feedback
Relevance feedback is the ability to select a document or a portion
of a document and find a set of documents related to it. For example,
suppose you perform a search on a news database with the natural language
question "What's going on in personal computers?". Scanning the headlines
returned, you see the headline "Personal Computers in K-12" and want to
find more articles related to it. You can then
perform the search again using your original question, selecting this
article, or a portion of the article, for relevance feedback. The search
engine then returns a new list of headlines for related articles. In essence,
relevance feedback adds more words to the original question. These words
are determined by finding the "significant" words in the document that
was fed back to the server; the significant words are those that best
distinguish it from all other documents. The search engine then tries to find other documents
that share these words. One of the primary uses of relevance feedback
is to help users quickly focus their search without the need to learn
complex query languages. For example, you can use natural language to
find a list of document headlines, and then use relevance feedback to
focus your search on the documents most relevant. Since a WAIS search
is fast, you can interactively and iteratively refine your search using
a combination of natural language questions and relevance feedback.
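The idea of picking "significant" words from a fed-back document can be sketched as follows. This document does not say how the WAIS engine measures significance, so the sketch substitutes a common stand-in: a word scores higher the more often it appears in the chosen document and the fewer documents in the collection contain it. The function names and the sample collection are assumptions.

```python
import math
import re
from collections import Counter

def words(text):
    """Split text into lowercased alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def significant_words(doc, collection, k=3):
    """Pick the k words of doc that best distinguish it from the collection."""
    n = len(collection)
    doc_freq = Counter()
    for other in collection:
        doc_freq.update(set(words(other)))   # count documents, not occurrences
    counts = Counter(words(doc))

    def score(w):
        # Assumed scoring: frequency in doc times rarity across the collection.
        return counts[w] * math.log((n + 1) / (doc_freq[w] + 1))

    return sorted(counts, key=lambda w: (-score(w), w))[:k]

doc = "personal computers in classrooms help classrooms"
collection = [doc, "personal computers at home", "personal computers at work"]
print(significant_words(doc, collection, k=2))
```

The words returned would then be added to the original question before the search is run again.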
Boolean Operators
The boolean operators, AND, OR, NOT, and ADJ aid in establishing logical
relationships between concepts expressed in natural language. These operators
are especially useful in narrowing down the search.
AND, &&
The AND operator is helpful in restricting a search when a particular
pair or larger group of terms is known. For instance, when searching for
documents on the weather in Boston, a question such as "weather AND Boston"
would return only those documents that contain both the word "weather"
and the word "Boston". You can use more than one AND in a query, e.g.
"weather AND Boston AND November". Note that the C-like double ampersand
(&&) may be used instead of spelling out the word AND.
OR, ||
The OR operator is often used to join two different phrases of a Boolean
search. A question such as "hurricane OR tornado" would search for all
documents containing either the word "hurricane", or the word "tornado",
or both. You can also use more than one OR in a query. A natural language
question is much like having an implicit OR between the words, except
that the search engine does more work in a natural language query to determine
the relevance of words and their relationships in a phrase. Note that
the C-like double vertical bars (||) may be used instead of spelling out
the word OR.
NOT
NOT is a binary operator. That is, it has to come between two words
or parenthesized clauses. NOT is used to reject any documents that
contain certain words. The question "basketball NOT college" would find
all documents containing the word "basketball", that do not also contain
the word "college". Note, however, that this question would eliminate
articles on any professional players that mention their alma maters; in
other words, be careful not to limit your search too much with the NOT
operator; make sure that you know what you're throwing away.
ADJ
The adjacent operator, ADJ, is used to ensure that one word is followed
by another in the returned document, with no other words in between.
For example, "cordless ADJ telephone" returns only documents containing
"cordless telephone" and ignores documents that only contain one of the
words or that contain both but not adjacent to one another. ADJ will nonetheless
work when stopwords interrupt two words; for example, the preceding example
will find occurrences of "cordless for telephone". Note that the ADJ operator
yields the same results as does a literal query. Also note that ADJ, unlike
AND and OR, is not commutative: "telephone ADJ cordless"
does not work the same as "cordless ADJ telephone".
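One way to picture the four boolean operators is as operations on sets of matching documents. The Python sketch below is a toy model, not the server's implementation: the four sample documents, the stop-word list, and the helper names containing and adj are all assumptions.

```python
# Toy document collection keyed by document id (sample data for illustration).
docs = {
    1: "weather report for boston",
    2: "boston marathon results",
    3: "cordless telephone for sale",
    4: "telephone and cordless drill",
}

STOP_WORDS = {"for", "and", "the", "of"}   # assumed stop-word list

def containing(word):
    """Set of ids of documents that contain the word."""
    return {i for i, text in docs.items() if word in text.split()}

def adj(first, second):
    """Ids of documents where second directly follows first, skipping stop words."""
    hits = set()
    for i, text in docs.items():
        kept = [w for w in text.split() if w not in STOP_WORDS]
        if any(a == first and b == second for a, b in zip(kept, kept[1:])):
            hits.add(i)
    return hits

print(containing("weather") & containing("boston"))   # weather AND boston
print(containing("weather") | containing("boston"))   # weather OR boston
print(containing("boston") - containing("weather"))   # boston NOT weather
print(adj("cordless", "telephone"))                   # cordless ADJ telephone
print(adj("telephone", "cordless"))                   # telephone ADJ cordless
```

In this model the last two queries return different documents, which shows why the order of the words matters for ADJ.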
Mixing Natural Language, Literals, And Booleans
The ability to mix natural language, literals, and boolean operators
is unique to the WAISserver search engine. Combining natural language
and boolean operators enables end users to better target their searches.
For example, suppose you were looking for documents specifically on portable
laptop computers that are not made by Tosuji Corporation. The question
could then be "Tell me about portable laptop computers NOT Tosuji."
Fielded Search
For data sets whose documents have special data fields, selected portions
of the documents can be tagged by the WAIS parser as fields. A client
can then ask a WAIS server to limit its search to those documents containing
a user-specified value of a particular field. This is called a fielded
search.
The mail-or-rmail parse format is an example of a parse format in which
fields are tagged. For this parse format, the WAIS parser detects the
"to" and "cc" fields, the "from" and "sender" fields, the "subject" field,
and the "date" field. An example of a question using natural language,
a boolean operator, and fielded search is: "company picnic AND from =
barbara". The WAIS server would then find email messages about a company
picnic that Barbara sent.
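The example question above can be modeled in Python as a filter on a tagged field combined with a word match. The message records, field names as dictionary keys, and the function fielded_search are assumptions for illustration; the natural language part is treated as an implicit OR of its words.

```python
# Sample mail messages with tagged fields (hypothetical data).
messages = [
    {"from": "barbara", "subject": "company picnic",
     "body": "the company picnic is on friday"},
    {"from": "ben", "subject": "company picnic",
     "body": "who is bringing food to the picnic"},
    {"from": "barbara", "subject": "budget",
     "body": "budget review next week"},
]

def fielded_search(words, field, value):
    """Messages that mention any of the words and whose field equals value."""
    hits = []
    for msg in messages:
        text = (msg["subject"] + " " + msg["body"]).split()
        if msg[field] == value and any(w in text for w in words):
            hits.append(msg)
    return hits

# Models the question: company picnic AND from = barbara
for msg in fielded_search(["company", "picnic"], "from", "barbara"):
    print(msg["subject"], "from", msg["from"])
```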
Date and Numeric Ranges
For a date or numeric field, a range may be specified using the syntax
field-name comparison-operator value, where comparison-operator may be
one of > (greater than), < (less than), <= (less than or equal to),
>= (greater than or equal to), or = (equal to). Currently, dates with the
following formats are supported: m-d-yy, m-d-yyyy, mm-dd-yy, m/d/yy,
mm/dd/yy, m.d.yy, today, and yesterday. Only positive integers are
supported for numeric fields. If the comparison operator is =, then the
range may be specified using the word TO, as in date = 4/15/93 TO 4/14/94.
Both ends of the range are inclusive.
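The inclusive TO range can be sketched in Python as follows. This is only a model of the comparison, assuming the m/d/yy format from the list of supported formats; the function names parse_date and in_range are hypothetical.

```python
from datetime import date, datetime

def parse_date(text):
    """Parse an m/d/yy date such as 4/15/93."""
    return datetime.strptime(text, "%m/%d/%y").date()

def in_range(value, expression):
    """Test a date against 'START TO END'; both ends are inclusive."""
    start, end = (parse_date(part.strip()) for part in expression.split(" TO "))
    return start <= value <= end

print(in_range(date(1993, 4, 15), "4/15/93 TO 4/14/94"))  # first day of the range
print(in_range(date(1994, 4, 15), "4/15/93 TO 4/14/94"))  # one day past the end
```

The first call returns True because the starting date itself is part of the range; the second returns False.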
Right Truncation (Wildcards)
A user can specify right truncation by ending a word with the asterisk
(*) wild card character. This tells the search engine to search on words
whose first several characters match the base characters before the *.
For example, you might use right truncation in a question such as geo*,
which may retrieve documents containing the words: geographer, geography,
geologist, geometry, geometrical, etc.
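Right truncation amounts to prefix matching against the indexed words. A minimal Python sketch, assuming a small sample vocabulary and a hypothetical function name expand_truncation:

```python
def expand_truncation(term, vocabulary):
    """Return the vocabulary words matched by a term, honoring a trailing *."""
    if not term.endswith("*"):
        return [term] if term in vocabulary else []
    prefix = term[:-1]
    return sorted(w for w in vocabulary if w.startswith(prefix))

vocabulary = ["biology", "general", "geography", "geologist", "geometry"]
print(expand_truncation("geo*", vocabulary))
```

Here "general" and "biology" are not matched, because their first characters differ from the base characters before the asterisk.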
Grouping Search Terms
A user can group search terms and phrases together using parentheses.
For example, if you wish to search for information about snowstorms, tornadoes,
or hurricanes in New York City, you might search for "(snowstorms OR tornadoes
OR hurricanes) AND (New ADJ York ADJ City)."
You can also nest your parentheses; for example, "from = ((ben ADJ wais)
OR (brewster ADJ think))" searches for messages from either ben@wais.com
or brewster@think.com. When you're using several boolean operators, you
should always group, to disambiguate how the operators are to be applied.
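To see why grouping matters, the following Python sketch models each search term as a set of matching document ids and compares two ways of applying the same operators. The document ids are invented for the example.

```python
# Hypothetical sets of document ids matching each term.
snowstorms = {1, 2}
tornadoes = {3}
hurricanes = {4}
new_york_city = {2, 4, 5}

# (snowstorms OR tornadoes OR hurricanes) AND (New ADJ York ADJ City)
grouped = (snowstorms | tornadoes | hurricanes) & new_york_city

# snowstorms OR tornadoes OR (hurricanes AND (New ADJ York ADJ City))
regrouped = snowstorms | tornadoes | (hurricanes & new_york_city)

print(sorted(grouped))
print(sorted(regrouped))
```

The first grouping keeps only storm documents that also mention New York City, while the second also returns snowstorm and tornado documents from anywhere.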
Relevance Ranking
Each document is scored based on its relevance to a user's question,
where the most relevant document has the highest score, or rank -- 1000
being the highest, 1 being the lowest. A document receives a higher score
if the words in the question are in the headline, if the words appear
many times, or if phrases occur as they do in the question. A document's
score is derived using techniques such as word weighting, term weighting,
proximity relationships, and word density. These scoring techniques are
outlined below.
Word Weight
If a word in a document is found to match a word in the user's question,
the word is assigned a weight, and this weight adds to the overall score
of the document. The exact weight that a word receives depends on the
emphasis given to the word by the author, and on where in the document
the word was found. For example, a word is weighted normally if it appears
only in the text body with no capitalization, higher if the word has all
capital letters or if the first letter of the word is capitalized, and
highest if it appears in the headline. The WAIS parser determines word
weights as it reads through the original data set.
Term Weight
Each word used in a document is assigned a numerical value, called the
term weight, based on the frequency of occurrence of that word over all
documents in the data set. Words that occur frequently are not weighted
as highly as those that appear less frequently. Very common words are
either ignored or diminished in the scoring. For example, since the term,
"animal", may occur frequently in many of the documents in a zoology data
set, its term weight is small compared to a term such as "hippopotamus",
which may occur only a few times.
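The exact term weight formula is not given here. As a sketch under that caveat, the Python example below uses a logarithmic inverse document frequency, a widely used formula with the same property: words found in many documents receive a lower weight. The function name and the sample data set are assumptions.

```python
import math

def term_weight(term, documents):
    """Assumed weighting: rarer terms across the data set get larger weights."""
    containing = sum(1 for doc in documents if term in doc.split())
    if containing == 0:
        return 0.0
    return math.log(len(documents) / containing) + 1.0

zoology = [
    "the animal eats grass",
    "an animal sleeps",
    "the hippopotamus is an animal",
    "animal tracks",
]
print(term_weight("animal", zoology))
print(term_weight("hippopotamus", zoology))
```

In this sample, "animal" appears in every document and gets the smaller weight, while "hippopotamus" appears in one and gets the larger.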
Proximity Relationships
Proximity relationship scoring specifies that if the words in a natural
language question are located close together in a document, they are given
a higher weight than those found further apart. The idea behind a proximity
relationship is that if a document contains a phrase similar to one in
the user's question, that document is more likely to be relevant.
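How closeness is measured is not spelled out in this document. One simple possibility, shown here as an assumption, is the smallest distance in words between two query words within a document; a smaller distance would then earn a higher weight. The function name min_distance is hypothetical.

```python
def min_distance(first, second, document):
    """Smallest number of word positions separating the two words, or None."""
    words = document.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == first]
    pos_b = [i for i, w in enumerate(words) if w == second]
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)

print(min_distance("portable", "computers", "portable computers are popular"))
print(min_distance("portable", "computers", "portable radios and desktop computers"))
```

The first document, where the words are adjacent, would be favored over the second under this measure.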
Density
The ratio of the number of times a word appears in a document to the
size of the document is called the word density. It is a measure of how
important a word is to the overall content of the document. A higher word
density results in a higher relevance ranking for that document with respect
to that word.
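The density ratio defined above translates directly into code. In this Python sketch, document size is measured in words, which is an assumption; the function name word_density is hypothetical.

```python
def word_density(word, document):
    """Occurrences of word divided by the document length in words."""
    words = document.lower().split()
    if not words:
        return 0.0
    return words.count(word.lower()) / len(words)

print(word_density("dam", "the dam holds water behind the dam"))
```

Here "dam" accounts for two of the seven words in the sample sentence.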
Special Characters
The WAIS server was originally designed to be as general as possible
and, in this spirit, it ignores all characters in a document that are
not either an alphabetical letter or a number. In fact, non-alphanumeric
characters usually separate words for the parser, for example, "F.Y.I."
parses out to three words. This rule also applies to queries used to search
a directory of servers.
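The word-splitting rule can be imitated with a regular expression that keeps only runs of letters and digits. This is a sketch of the rule as described, not the parser's actual code; the function name parse_words is hypothetical.

```python
import re

def parse_words(text):
    """Split text into words, treating every non-alphanumeric character as a separator."""
    return re.findall(r"[A-Za-z0-9]+", text)

print(parse_words("F.Y.I."))
```

The periods act as separators, so the result is the three one-letter words F, Y, and I.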
Stemming
Stemming is a technique used to automatically derive variations of a
queried word. These variations are then used as part of the search. If
stemming is used, then when a data set is indexed, word stems are indexed
where possible. For example, "dancing," "danced," and "dancer" would all
be indexed as "dance." A question containing the word "dancer" would
then turn up documents that may also include "dancing", "danced", and
"dance".
Two types of stemming are supported: Plural and Porter stemming. Plural
stemming attempts to reduce a plural word to its singular form. Porter stemming
attempts to find the real base, or stem, of a word and derive any possible
alternate variations.
Since WAIS Inc. servers allow either form of stemming for a given database,
be sure to ask the administrator of a database whether stemming is used
before you search it. For example, you may search for "lens*" when curious
about telescopes, but a plural stemmer will reduce your query to "len*"
and return undesirable hits. A worst-case scenario is when you search
on "s*" in a plural-stemming database; you will find no results, because
the server stemmed your query to "*", the empty string.
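The interaction between plural stemming and right truncation can be reproduced with a deliberately naive stemmer. The rule used here, dropping one trailing "s", is an assumption chosen to match the two examples above; the server's real plural stemmer is not documented here, and both function names are hypothetical.

```python
def plural_stem(word):
    """Naive plural stemmer: drop one trailing 's' (assumed rule)."""
    return word[:-1] if word.endswith("s") else word

def stem_query(term):
    """Stem a query term, keeping a trailing * wildcard in place."""
    if term.endswith("*"):
        return plural_stem(term[:-1]) + "*"
    return plural_stem(term)

print(stem_query("lens*"))
print(stem_query("s*"))
```

The two calls print len* and *, matching the reduced queries described above.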
