#set TITLE = "Multiword text fields"
#include top

.SH Searching multiword text fields
.LP
Text fields such as titles, descriptions, and remarks are common in databases.
Users generally prefer to search multiword content by entering a simple, intuitive search 
request, as they might with Google or eBay.
.LP
For instance, when a user enters \fCchild asthma\fR into a search application, they expect 
to see a list of results where entries containing both \fCchild\fR and \fCasthma\fR 
are at the top.  Top priority goes to entries that contain \fCchild asthma\fR literally,
followed by entries that contain the words \fCchild\fR and \fCasthma\fR regardless of
position or order.
Entries containing only one or the other are shown further down the list, or
not at all.  When a user enters \fC"blue ribbon beer"\fR they expect to see only entries that
contain that phrase exactly.  
.LP
\fBshsql\fR provides this capability via the \fBCONTAINS\fR comparison operator, along with
\fBword indexes\fR or \fBcombinedword indexes\fR.  
.LP
\fBshsql\fR text fields are by default limited to 255 characters in length.  
\fBshsql\fR does not support "memo" fields or text fields of unlimited length.
A suggested strategy for accommodating larger blocks of text is to store the text in individual files, 
then store the filename in the database.  This approach is used by
#set FILE = "../../qman/html/quisp_home.html"
#set TAG = "quisp."
#include link

#include space

.SH CONTAINS
.ig >>
<a name=contains></a>
.>>
.LP
\fCWHERE authors CONTAINS 'smith'\fR
.LP
\fCWHERE authors CONTAINS 'smith jones'\fR
.LP
\fCWHERE title CONTAINS '"gene expression" tumor'\fR
.LP
\fBshsql\fR provides the \fCCONTAINS\fR 
#set FILE = "whereclause.html"
#set TAG = "WHERE clause"
#include link
operator for finding
a word (or several words) in a database field that has multiword content
(e.g. titles, descriptions, lists, etc.).
\fCCONTAINS\fR allows the user search request to be inserted directly into the query 
#set FILE = "whereclause.html"
#set TAG = "WHERE clause"
#include link
without any preprocessing.
It also ranks result rows by how well they match the user's request.
.LP
Rows that do not match at all are rejected.  For rows that match at least minimally,
a scoring metric is generated and placed into a field called \fC_matchscore\fR.
\fC_matchscore\fR ranges from 0 (best) to 19 (worst).
The scoring takes into account the number
of search words present, word order, and word position.
This field may be used to order the result rows, or may be used further to the right in the \fCWHERE\fR clause.
.LP
All matching is case insensitive.  Words are delimited on any combination of whitespace or 
punctuation characters (this applies to both requested words and words in the database field).
.LP
Words enclosed within double quotes (") are considered a phrase (see the third example above), and are 
treated as a single search term which must match exactly.  
Non-quoted words are each an individual search term to which open-ended matching is applied, e.g.
\fC'casey at the bat' CONTAINS 'case'\fR would be true (with a slightly worse score),
but \fC'casey at the bat' CONTAINS '"case"'\fR would be false, because of the double quotes.
Open-ended matching is not used for very short search terms (1 or 2 characters in length); these 
must match exactly.
A single word may be enclosed in double quotes to force it to match exactly.
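.LP
As an illustration of the quoting rules above, the two queries below differ only in the double quotes.
This is a sketch: it uses the \fCjournalcits\fR table that appears in the examples on this page, and the search word \fCgene\fR is made up.
Going by the \fCcasey\fR example, the first query would presumably accept titles containing \fCgene\fR, \fCgenes\fR or \fCgenetic\fR,
while the second would accept only titles containing the word \fCgene\fR itself.
.nf
	select title, _matchscore
	from journalcits
	where title contains 'gene'
	order by _matchscore

	select title, _matchscore
	from journalcits
	where title contains '"gene"'
	order by _matchscore
.fi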
.LP
If \fCCONTAINS\fR is used more than once in a \fCWHERE\fR clause expression, the \fC_matchscore\fR
values are summed, up to a maximum of 99.
.LP
\fBCONTAINS examples:\fR
.br
.nf
	select title, _matchscore 
	from journalcits 
	where title contains "retina"
	order by _matchscore

	select authorlist, _matchscore
	from journalcits
	where authorlist contains "smith, jones, greene"
	and _matchscore < 5
.fi
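.LP
The summing of scores can be sketched with a query that uses \fCCONTAINS\fR twice.
The table and field names come from the examples above; the search words are illustrative only.
Since each \fCCONTAINS\fR contributes its own score and lower is better, ordering by \fC_matchscore\fR
should put rows that match well on both fields first.
.nf
	select title, authorlist, _matchscore
	from journalcits
	where title contains "retina"
	and authorlist contains "smith"
	order by _matchscore
.fi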

#include space
.ig >>
<a name=wordindex></a>
.>>

.SH Word indexes
A \fBword index\fR is a type of 
#set FILE = "indexes.html"
#set TAG = "index"
#include link
where every unique word in a multiword field has its 
own index entry, for fast word-based searching on fields that contain titles, descriptions, 
or lists of values.  Words are delimited by any combination of whitespace and punctuation 
characters.
.LP
To create a word index use an SQL command like this: 
\fCcreate index type=word on auctionitems ( desc )\fR
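.LP
Once the index exists, \fCCONTAINS\fR queries on the indexed field can take advantage of it.
A minimal sketch, assuming the \fCauctionitems\fR table and \fCdesc\fR field from the command above
(the search words are made up):
.nf
	select desc, _matchscore
	from auctionitems
	where desc contains 'antique clock'
	order by _matchscore
.fi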
.LP
Very common words (such as \fCand\fR and \fCthe\fR in English) can be omitted for better 
search efficiency by setting up a "common words" file.
Common words should be inserted into a file, one word per line.
See the \fCsqlexampledb\fR for an example English common words file.
Then, set the
#set FILE = "config.html#dbcommonwordsfile"
#set TAG = "dbcommonwordsfile attribute in your project config file."
#include link
.LP
Note: unless CONTAINS is used, queries that attempt to compare a word-indexed field against a multiword
constant will not be successful.

#include space
.ig >>
<a name=combinedwordindex></a>
.>>
.SH Combinedword indexes
A \fCcombinedword\fR index is similar to a \fCword\fR index, except that words from
several fields are combined into one index.
For situations where a table contains \fBseveral\fR multiword fields that will
usually be searched simultaneously for the same search words (as is often the case with
search engine applications), a \fCcombinedword\fR index can be used instead of
several \fCword\fR indexes, for better search efficiency.
.LP
For example, suppose you
have a table holding information on journal articles with fields \fCtitle\fR,
\fCauthor\fR, and \fCkeywords\fR, and you have a search engine application
that allows a user to type in one or several words.  All three of the fields
need to be searched.  A query to search the table would be something like this:
.nf
select * from journalcits
where title contains \fIsearchwords\fC
   OR author contains \fIsearchwords\fC 
   OR keywords contains \fIsearchwords\fC
.fi
.LP
If each field had its own word index, three index lookups would 
be required, since there are three \fCOR\fR terms.  With a \fCcombinedword\fR index
(which contains words from all three fields)
only one index lookup is needed.
.LP
A \fCcombinedword\fR index is associated by name with one field, called the primary field.
Additional represented fields are called secondary fields.
In the above example, the combinedword index's primary field is \fCtitle\fR, and
its secondary fields are \fCauthor\fR and \fCkeywords\fR.
.LP
When issuing a query,
the primary fieldname must be specified in the leftmost comparison in the \fCWHERE\fR clause.  
The indexing mechanism detects the fact that the index is \fCcombinedword\fR type, and cancels
index lookups for the other \fCOR\fR terms.
For this reason, the following restriction applies: \fBWhen a combinedword lookup is OR'ed with
other terms, the other terms must involve secondary fields.\fR
Otherwise, retrievals may be
incomplete, since index lookups for rightward \fCOR\fR terms will not be done.
.LP
Searches involving only the primary field, or the primary field and a subset of the secondary fields,
will use the combined index and will still give correct results.
Searches involving only one or more secondary fields will not interact with the \fCcombinedword\fR index,
but may have their own separate indexes.
.LP
When creating a combinedword index, the first fieldname mentioned in the CREATE command
is the primary field; the remaining mentioned fields are the secondary fields.
Thus, only one combinedword index can be created by a CREATE INDEX command.
For example: 
.IP
\fCcreate index type=combinedword on journalcits ( title, author, keywords )\fR
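.LP
With that index in place, the rules described above can be sketched as follows (the search word is illustrative only).
The first query names the primary field \fCtitle\fR leftmost and ORs it only with secondary fields,
so a single combined lookup should serve all three terms:
.nf
	select * from journalcits
	where title contains 'asthma'
	   OR author contains 'asthma'
	   OR keywords contains 'asthma'
.fi
.LP
Searching the primary field together with a subset of the secondary fields should also use the combined index:
.nf
	select * from journalcits
	where title contains 'asthma'
	   OR keywords contains 'asthma'
.fi
.LP
By contrast, the next query ORs the primary field with \fCabstract\fR, a hypothetical field that is not
part of the index.  This does not satisfy the restriction, so its results may be incomplete:
.nf
	select * from journalcits
	where title contains 'asthma'
	   OR abstract contains 'asthma'
.fi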
.LP
Note: unless CONTAINS is used, queries that attempt to compare a combinedword-indexed field against a multiword
constant will not be successful.

#include space

.SH Notes
.LP
If a query is eligible for indexing,
\fCSELECT DISTINCT\fR is automatically done whenever \fCOR\fR or \fCCONTAINS\fR is used.
.LP
When a \fCword\fR or \fCcombinedword\fR index exists for a field, searches that use \fB=\fR or \fBlike\fR
will work with a single word or \fCNULL\fR, but not with multiple words.  Use
\fCCONTAINS\fR to find multiple words.  Searches that involve 
only "common words" will not find anything.
.LP
Numeric fields and numeric comparisons cannot be used with \fCword\fR or \fCcombinedword\fR indexes.
.LP
#set FILE = "indexes.html"
#set TAG = "General information on shsql indexes"
#include link


#include bottom
