Translating AltaVista® syntax for OracleText

This sample code provides Java and PL/SQL procedures to convert queries from "web syntax" (as used by AltaVista(R) for example) to OracleText boolean syntax.

Contents

AltaVista® Syntax

The AltaVista search engine uses an easy-to-remember search syntax, as follows:
   +word       -   Word must be present
   -word       -   Word must not be present
   word        -   Word is optional

   pref*       -   Word starting with string "pref"
   title:word  -   Word appears in title (various other fields are supported)

Emulating the Syntax in OracleText

It is non-trivial to reproduce this syntax in standard boolean terms, as used by OracleText. The problem is that boolean syntax works only with pairs of words or phrases, whereas AltaVista® (AV) syntax works at the individual word level. As a simple example, consider the AV search:
   +cat dog
In OracleText syntax, we must represent this as:
   cat OR ( cat AND dog)
Some more complicated examples (AV first, OracleText after)
   +cat dog rabbit
   cat OR ( cat AND (dog OR rabbit) )


   cat -dog
   cat NOT dog

   +cat +dog rabbit fox -fish
   (cat AND dog) OR ( (cat AND dog) AND (rabbit OR fox) ) NOT fish
Additionally, we must do some extra processing to ensure the scores come out sensibly. I will not go into full detail here, but remember the following rules:
  • OR scores the higher of the two operands
  • AND scores the lower of the two operands
  • ACCUMULATE scores the sum of the two operands, but otherwise acts like OR
So the last search above is actually represented as:
   ((cat & dog) | ((cat & dog)*10*10 & (cat , dog , rabbit , fox)) ) NOT fish
Notes:
  • "&" represents AND, "|" represents OR and "," represents ACCUMULATE.
  • The construct "*10*10" is used to push the score of the preceding ANDed section very high. This should ensure that the score is actually that from the ACCUMULATE, rather than the AND, and means that every word contributes to the score.

The Java translator

Using Java functions in OracleText queries

These are the steps needed to use a Java-based user-written function inside an OracleText query:
  • Write and test your Java class in the development tool of your choice (eg. JDeveloper, Visual J++ or command-line JDK).
  • Save your class to the database
  • Create a PL/SQL 'wrapper' declaration for the Java
  • Invoke your function inside the CONTAINS clause of a query
Here's a fully worked example:

Some notes on the "translate" procedure

  • Searches of the form "field:word" will be translated into an OracleText section search: "word within field".
  • The search "-dog" is not possible in ConText. Such a string will be converted into an empty search string.
  • It is not possible to bracket terms: "+cat +(dog fox)" (meaning "cat", and either "dog" or "fox") is not allowed. This is true of AltaVista as well.
  • Asterix ("*")is converted into the OracleText wild card - "%". This means it is not possible to allow weighting of terms. The developer may choose to implement this in some other way.
  • All terms in the search are weighted equally. It is arguable that required terms should be weighted higher than optional terms. This would be easy to change in the "accumulate" section of the Java source if required.

PL/SQL Version

There are some significant differences between the Java translator, and the PL/SQL version - aside from the language they are written in!

The PL/SQL version...

  • Allows for user weighting of search terms (eg. 'dog*3')
  • Does not support "*" as a wild-card - use "%" instead ('do%')
  • Does not support explicit section searching (eg. 'title:dog')
  • Optionally weights phrases appearing in the "homepage", "head" or user-defined sections
  • Includes a theme search ('about(dog)') - this should only be included when for languages where theme searches are supported
Here's a fully worked example:

Using the optional section searches

The full syntax for the PL/SQL transaltor is:
  FUNCTION pavtranslate RETURNS VARCHAR2
   Argument Name                  Type                    In/Out Default?
   ------------------------------ ----------------------- ------ --------
   QUERY                          VARCHAR2                IN     DEFAULT
   SECTION_FLAG                   BOOLEAN                 IN     DEFAULT
   SECTION1                       VARCHAR2                IN     DEFAULT
   SECTION2                       VARCHAR2                IN     DEFAULT
If the section_flag is false, then search terms/phrases will score very highly if they are found in SECTION1, and fairly highly if they are in SECTION2. These sections default to "HOMEPAGE" and "HEAD" respectively.
AltaVista® is a registered trademark of AltaVista Company.

E-mail this page
Printer View Printer View
Oracle Is The Information Company About Oracle | Oracle RSS Feeds | Careers | Contact Us | Site Maps | Legal Notices | Terms of Use | Privacy