How to Create a WAIS Query


This document describes how to create a query for the WAIS servers that are maintained by the Tennessee Valley Authority (TVA) for GILS. It was copied from the description of WAIS queries in the WAIS Inc. administration manual.

Types of Queries

Natural Language Query

At the heart of the WAISserver product is the WAIS search engine. The WAIS search engine receives a user's question or query, searches its database for documents most relevant to the question, and returns a relevance-ranked list of documents to the user. Each document is given a score from 1 to 1000, based on how well it matched the user's question (how many words it contained, their importance in the document, etc.). A question, or query, is an expression that can contain any combination of natural language, literal strings, boolean operators, field name-value pairs, date or numeric ranges, and relevant documents. The WAIS search engine also supports right truncation (wildcard searching), word stemming, and relevance ranking. Each of these capabilities is explained below.

The server can be queried using natural language questions. The server does not understand the question; rather, it takes the words and phrases in the question and finds documents that have those words and phrases in them. "Tell me about portable computers." is an example of a natural language question. In this example, the WAIS server would search for documents containing the words 'portable' and 'computers'. The other words, 'tell', 'me', and 'about', are called "stop words": words so common that they occur in almost every document, so they are not used for searching.
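As a rough illustration of the stop-word step described above, the sketch below reduces a question to its search terms. The stop list here is a small hypothetical one chosen for the example; the actual list used by a WAIS server is not given in this document.

```python
import re

# Hypothetical stop list for illustration only; a real server has its own list.
STOP_WORDS = {"tell", "me", "about", "the", "a", "an", "of", "in", "on", "is"}

def search_terms(question: str) -> list[str]:
    """Lowercase the question, split on non-alphanumerics, drop stop words."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    return [w for w in words if w not in STOP_WORDS]

print(search_terms("Tell me about portable computers."))
# prints ['portable', 'computers']
```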

Literal Strings

A similar but more specific kind of query finds documents that contain one or more exact phrases, which you enclose in double quotation marks. This is known as a literal. For example, the query "search engine capabilities" returns only documents that contain this exact phrase. The WAIS search engine performs a literal search exactly as if you had used the boolean operator ADJ. Thus the above example would yield the same results as search ADJ engine ADJ capabilities. For this reason, it is best to stick to noun phrases when using literals; if your literal phrase includes stop words, it won't work.
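The equivalence between a literal and an ADJ chain can be sketched as a simple string rewrite. The function name literal_to_adj is made up for this example and is not part of any WAIS interface.

```python
def literal_to_adj(literal: str) -> str:
    """Rewrite a double-quoted phrase as the equivalent chain of ADJ operators."""
    words = literal.strip().strip('"').split()
    return " ADJ ".join(words)

print(literal_to_adj('"search engine capabilities"'))
# prints search ADJ engine ADJ capabilities
```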

Relevance Feedback

Relevance feedback is the ability to select a document or a portion of a document and find a set of documents related to it. For example, suppose you perform a search on a news database with the natural language question "What's going on in personal computers?". Scanning the headlines returned, you see the headline "Personal Computers in K-12" and want to find more articles related to it. You can then perform the search again using your original question, selecting this article, or a portion of the article, for relevance feedback. The search engine then returns a new list of headlines for related articles.

In essence, relevance feedback adds more words to the original question. These words are determined by finding the "significant" words in the document that was fed back to the server; the significant words are those that best distinguish it from all other documents. The search engine then tries to find other documents that share these words.

One of the primary uses of relevance feedback is to help users quickly focus their search without the need to learn complex query languages. For example, you can use natural language to find a list of document headlines, and then use relevance feedback to focus your search on the most relevant documents. Since a WAIS search is fast, you can interactively and iteratively refine your search using a combination of natural language questions and relevance feedback.
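This document does not say how the "significant" words are chosen. One common way to find words that distinguish a document from the rest of a collection is a frequency-times-rarity score, and the sketch below uses that as a stand-in. The scoring formula, the function name, and the tiny corpus are all assumptions made for illustration, not the WAIS algorithm.

```python
import math
from collections import Counter

def significant_words(doc: list[str], corpus: list[list[str]], k: int = 2) -> list[str]:
    """Return the k words of doc that are frequent in doc but rare in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for other in corpus:
        doc_freq.update(set(other))          # count each word once per document
    term_freq = Counter(doc)
    scores = {w: term_freq[w] * math.log(n_docs / doc_freq[w])
              for w in term_freq if doc_freq[w]}
    # Highest score first; ties broken alphabetically so the result is stable.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

corpus = [
    ["personal", "computers", "schools", "k12"],
    ["personal", "computers", "market"],
    ["personal", "finance", "market"],
]
extra = significant_words(corpus[0], corpus)
print(["personal", "computers"] + extra)     # original question plus feedback words
# prints ['personal', 'computers', 'k12', 'schools']
```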

Boolean Operators

The boolean operators AND, OR, NOT, and ADJ help establish logical relationships between concepts expressed in natural language. These operators are especially useful in narrowing down the search.

AND, &&

The AND operator is helpful in restricting a search when a particular pair or larger group of terms is known. For instance, when searching for documents on the weather in Boston, a question such as "weather AND Boston" would return only those documents that contain both the word "weather" and the word "Boston". You can use more than one AND in a query, e.g. "weather AND Boston AND November". Note that the C-like double ampersand (&&) may be used instead of spelling out the word AND.

OR, ||

The OR operator is often used to join two different phrases of a Boolean search. A question such as "hurricane OR tornado" would search for all documents containing either the word "hurricane", or the word "tornado", or both. You can also use more than one OR in a query. A natural language question is much like having an implicit OR between the words, except that the search engine does more work in a natural language query to determine the relevance of words and their relationships in a phrase. Note that the C-like double vertical bars (||) may be used instead of spelling out the word OR.

NOT

NOT is a binary operator; that is, it has to come between two words or parenthesized clauses. NOT is used to reject any documents that contain certain words. The question "basketball NOT college" would find all documents containing the word "basketball" that do not also contain the word "college". Note, however, that this question would also eliminate articles on professional players that mention their alma maters. In other words, be careful not to limit your search too much with the NOT operator; make sure that you know what you're throwing away.
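The AND, OR, and NOT operators described so far can be pictured as set operations on the documents that contain each word. The sketch below builds a toy word-to-document index and evaluates three queries. The documents and the hits helper are invented for the example, and relevance scoring is left out entirely.

```python
# Three made-up documents, keyed by document id.
docs = {
    1: "weather report for Boston in November",
    2: "Boston marathon results",
    3: "hurricane and tornado weather warnings",
}

# Map each lowercased word to the set of document ids that contain it.
index: dict[str, set[int]] = {}
for doc_id, text in docs.items():
    for word in text.lower().split():
        index.setdefault(word, set()).add(doc_id)

def hits(word: str) -> set[int]:
    """Documents containing the word (empty set if the word is unknown)."""
    return index.get(word.lower(), set())

weather_and_boston = hits("weather") & hits("Boston")        # weather AND Boston
hurricane_or_tornado = hits("hurricane") | hits("tornado")   # hurricane OR tornado
boston_not_weather = hits("Boston") - hits("weather")        # Boston NOT weather

print(weather_and_boston, hurricane_or_tornado, boston_not_weather)
# prints {1} {3} {2}
```

Because NOT is modeled as set difference, it needs a left-hand set to subtract from, which matches its description as a binary operator.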

ADJ

The adjacent operator, ADJ, is used to ensure that one word is followed by another in the returned document, with no other words in between.

For example, "cordless ADJ telephone" returns only documents containing "cordless telephone" and ignores documents that contain only one of the words or that contain both words but not adjacent to one another. ADJ nonetheless works when stop words interrupt the two words; for example, the preceding query will also find occurrences of "cordless for telephone". Note that the ADJ operator yields the same results as a literal query. Also note that ADJ, unlike AND and OR, is not commutative: "telephone ADJ cordless" does not work the same as "cordless ADJ telephone".
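One way to picture the ADJ behavior described above is to drop stop words from the text first and then look for the two words side by side, in order. This is only an interpretation of the description; the stop list and the adj_match function are hypothetical.

```python
# Hypothetical stop list for illustration only.
STOP_WORDS = {"for", "the", "a", "of"}

def adj_match(first: str, second: str, text: str) -> bool:
    """True if `first` is immediately followed by `second` once stop words are removed."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return any(a == first and b == second for a, b in zip(words, words[1:]))

print(adj_match("cordless", "telephone", "a cordless telephone"))     # prints True
print(adj_match("cordless", "telephone", "cordless for telephone"))   # prints True
print(adj_match("cordless", "telephone", "cordless desk telephone"))  # prints False
print(adj_match("telephone", "cordless", "a cordless telephone"))     # prints False
```

The last call shows the order dependence: swapping the operands changes the result.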

Mixing Natural Language, Literals, And Booleans

The ability to mix natural language, literals, and boolean operators is unique to the WAISserver search engine. Combining natural language and boolean operators enables end users to better target their searches. For example, suppose you were looking for documents specifically on portable laptop computers that are not made by Tosuji Corporation. The question could then be "Tell me about portable laptop computers NOT Tosuji."

Fielded Search

For data sets whose documents have special data fields, selected portions of the documents can be tagged by the WAIS parser as fields. A client can then ask a WAIS server to limit its search to those documents containing a user-specified value of a particular field. This is called a fielded search.

The mail-or-rmail parse format is an example of a parse format in which fields are tagged. For this parse format, the WAIS parser detects the "to" and "cc" fields, the "from" and "sender" fields, the "subject" field, and the "date" field. An example of a question using natural language, a boolean operator, and fielded search is: "company picnic AND from = barbara". The WAIS server would then find email messages about a company picnic that Barbara sent.
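The picnic example can be sketched as a filter over parsed messages: the field condition must hold, and the message text must mention at least one of the natural language words. The message records, the fielded_search function, and the choice to match on any word are assumptions for illustration.

```python
# Hypothetical parsed mail messages; the keys mirror the tagged fields named above.
messages = [
    {"from": "barbara", "subject": "Company picnic", "body": "Bring a dish to the picnic."},
    {"from": "ben", "subject": "Company picnic", "body": "Who is driving?"},
    {"from": "barbara", "subject": "Budget", "body": "Quarterly numbers attached."},
]

def fielded_search(msgs, words, field, value):
    """Keep messages whose field equals value and whose text mentions any query word."""
    wanted = {w.lower() for w in words}
    results = []
    for msg in msgs:
        text = set((msg["subject"] + " " + msg["body"]).lower().split())
        if msg.get(field, "").lower() == value.lower() and wanted & text:
            results.append(msg)
    return results

# Rough equivalent of: company picnic AND from = barbara
found = fielded_search(messages, ["company", "picnic"], "from", "barbara")
print([m["subject"] for m in found])
# prints ['Company picnic']
```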

Date and Numeric Ranges

For a date or numeric field, a range may be specified using the syntax field-name comparison-operator value, where comparison-operator may be one of > (greater than), < (less than), >= (greater than or equal to), <= (less than or equal to), or = (equal to).

Currently, dates with the following formats are supported: m-d-yy, m-d-yyyy, mm-dd-yy, m/d/yy, mm/dd/yy, m.d.yy, today, and yesterday. Only positive integers are supported for numeric fields.

If the comparison operator is =, then the range may be specified using the word TO, as in date = 4/15/93 TO 4/14/94. Both ends of the range are inclusive.

Right Truncation (Wildcards)

A user can specify right truncation by ending a word with the asterisk (*) wild card character. This tells the search engine to search on words whose first several characters match the base characters before the *. For example, you might use right truncation in a question such as geo*, which may retrieve documents containing the words: geographer, geography, geologist, geometry, geometrical, etc.
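Right truncation amounts to a prefix match against the indexed words. A minimal sketch, assuming a small made-up vocabulary and a hypothetical expand_wildcard function:

```python
def expand_wildcard(term: str, vocabulary: set[str]) -> list[str]:
    """Return the vocabulary words matched by a term, honoring a trailing '*'."""
    if not term.endswith("*"):
        return [term] if term in vocabulary else []
    prefix = term[:-1]
    return sorted(w for w in vocabulary if w.startswith(prefix))

vocabulary = {"geographer", "geography", "geologist", "geometry", "gene", "ocean"}
print(expand_wildcard("geo*", vocabulary))
# prints ['geographer', 'geography', 'geologist', 'geometry']
```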

Grouping Search Terms

A user can group search terms and phrases together using parentheses. For example, if you wish to search for information about snowstorms, tornadoes, or hurricanes in New York City, you might search for "(snowstorms OR tornadoes OR hurricanes) AND (New ADJ York ADJ City)".

You can also nest your parentheses; for example, "from = ((ben ADJ wais) OR (brewster ADJ think))" searches for messages from either ben@wais.com or brewster@think.com. When you're using several boolean operators, you should always group, to disambiguate how the operators are to be applied.
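The effect of grouping can be seen with plain sets of document ids. In the sketch below the ids are invented, and the ungrouped result reflects Python's own operator precedence (& binds tighter than |), not a documented WAIS rule. The point is only that the two readings give different answers, which is why explicit parentheses are safer.

```python
# Hypothetical sets of matching document ids.
snowstorms = {1, 2}
tornadoes = {3}
new_york_city = {2, 3, 4}    # documents matching New ADJ York ADJ City

grouped = (snowstorms | tornadoes) & new_york_city
ungrouped = snowstorms | tornadoes & new_york_city   # read as snowstorms | (tornadoes & nyc)

print(grouped)     # prints {2, 3}
print(ungrouped)   # prints {1, 2, 3}
```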

Relevance Ranking

Each document is scored based on its relevance to a user's question, where the most relevant document has the highest score, or rank -- 1000 being the highest, 1 being the lowest. A document receives a higher score if the words in the question are in the headline, if the words appear many times, or if phrases occur as they do in the question. A document's score is derived using techniques such as word weighting, term weighting, proximity relationships, and word density. These scoring techniques are outlined below.

Word Weight

If a word in a document is found to match a word in the user's question, the word is assigned a weight, and this weight adds to the overall score of the document. The exact weight that a word receives depends on the emphasis given to the word by the author, and on where in the document the word was found. For example, a word is weighted normally if it appears only in the text body with no capitalization, higher if the word has all capital letters or if the first letter of the word is capitalized, and highest if it appears in the headline. The WAIS parser determines word weights as it reads through the original data set.

Term Weight

Each word used in a document is assigned a numerical value, called the term weight, based on the frequency of occurrence of that word over all documents in the data set. Words that occur frequently are not weighted as highly as those that appear less frequently. Very common words are either ignored or diminished in the scoring. For example, since the term "animal" may occur frequently in many of the documents in a zoology data set, its term weight is small compared to a term such as "hippopotamus", which may occur only a few times.
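The exact term weight formula is not given here. A widely used stand-in with the same behavior (rare words score high, ubiquitous words score near zero) is the logarithm of the total number of documents divided by the number of documents containing the word. The sketch below uses that formula as an assumption, on a made-up zoology data set.

```python
import math

def term_weights(corpus: list[list[str]]) -> dict[str, float]:
    """Weight each word by log(total documents / documents containing the word)."""
    n_docs = len(corpus)
    doc_freq: dict[str, int] = {}
    for doc in corpus:
        for word in set(doc):                # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / count) for word, count in doc_freq.items()}

zoology = [
    ["animal", "hippopotamus", "river"],
    ["animal", "lion", "savanna"],
    ["animal", "lion", "river"],
]
weights = term_weights(zoology)
print(weights["animal"])                             # prints 0.0
print(weights["hippopotamus"] > weights["lion"])     # prints True
```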

Proximity Relationships

Proximity relationship scoring specifies that if the words in a natural language question are located close together in a document, they are given a higher weight than those found further apart. The idea behind a proximity relationship is that if a document contains a phrase similar to one in the user's question, that document is more likely to be relevant.

Density

The ratio of the number of times a word appears in a document to the size of the document is called the word density. It is a measure of how important a word is to the overall content of the document. A higher word density results in a higher relevance ranking for that document with respect to that word.
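Word density as defined above is a simple ratio. The sketch below measures document size in words, which is an assumption; the text does not say whether size means words or characters.

```python
def word_density(word: str, text: str) -> float:
    """Occurrences of the word divided by the total number of words in the text."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(word.lower()) / len(words)

short_doc = "the zoo is a big zoo"
long_doc = "the zoo is one of many places to visit in a large city on a weekend"
print(word_density("zoo", short_doc) > word_density("zoo", long_doc))
# prints True
```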

Special Characters

The WAIS server was originally designed to be as general as possible and, in this spirit, it ignores all characters in a document that are not either an alphabetical letter or a number. In fact, non-alphanumeric characters usually separate words for the parser, for example, "F.Y.I." parses out to three words. This rule also applies to queries used to search a directory of servers.
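The word-splitting rule described above can be sketched with a regular expression that keeps only runs of letters and digits. The tokenize function is a hypothetical name, and the sketch covers ASCII letters only.

```python
import re

def tokenize(text: str) -> list[str]:
    """Split text into words, treating every non-alphanumeric character as a separator."""
    return re.findall(r"[A-Za-z0-9]+", text)

print(tokenize("F.Y.I."))
# prints ['F', 'Y', 'I']
print(tokenize("ben@wais.com"))
# prints ['ben', 'wais', 'com']
```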

Stemming

Stemming is a technique used to automatically derive variations of a queried word. These variations are then used as part of the search. If stemming is used, then when a data set is indexed, word stems are indexed where possible. For example, "dancing," "danced," and "dancer" would all be indexed as "dance." A question containing the word "dancer" would then turn up documents that may also include "dancing", "danced", and "dance".

Two types of stemming are supported: Plural and Porter stemming. Plural stemming attempts to reduce the plural form of a word to its singular form. Porter stemming attempts to find the real base, or stem, of a word and derive any possible alternate variations.

Since WAIS Inc. servers allow either form of stemming for a given database, be sure to ask the administrator of a database whether stemming is used before you search it. For example, you may search for "lens*" when curious about telescopes, but a plural stemmer will reduce your query to "len*" and return undesirable hits. A worst-case scenario is when you search on "s*" in a plural-stemming database; you will find no results, because the server stemmed your query to "*", the empty string.
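The two pitfalls above can be reproduced with a deliberately naive plural stemmer that drops one trailing "s". This is a sketch under that assumption, not the stemmer a WAIS server actually uses; the function names are invented.

```python
def plural_stem(word: str) -> str:
    """Naive plural stemmer: drop a single trailing 's' (real rules are richer)."""
    return word[:-1] if word.endswith("s") else word

def stem_query_term(term: str) -> str:
    """Stem the part of a query term before a trailing '*', then restore the '*'."""
    if term.endswith("*"):
        return plural_stem(term[:-1]) + "*"
    return plural_stem(term)

print(stem_query_term("telescopes"))   # prints telescope
print(stem_query_term("lens*"))        # prints len*
print(stem_query_term("s*"))           # prints *
```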

 

 
