The ZCatalog Users Guide

  First Draft: Feb 1st, 2000

  Author: Michel Pelletier (michel@digicool.com)

  ZCatalog is a component of Zope.  This Guide assumes that you have
  basic Zope skills.  If you encounter material in this Guide that you
  do not understand, you may find an explanation in one of the other
  Zope Guides: the Content Manager's Guide, the DTML Programmer's
  Guide, or the Zope Administrator's Guide.

  If you continue to be confused, send feedback to the author, Michel 
  Pelletier (michel@digicool.com).

  What ZCatalog Does

    ZCatalog is just like the card catalog in a library.  Think
    of all your Zope objects as books in a library.  If you
    wanted to find all of the books by the author 'Aldous Huxley' then
    you would walk up to the card catalog and look up 'Aldous Huxley'
    in the author index.  This would give you the location of all of
    the books by that author.

    The ZCatalog works the same way.  You can walk up to the
    ZCatalog (in DTML) and ask it for all of the objects whose
    'author' property is 'Aldous Huxley'.  Like a real library
    catalog, the ZCatalog must be built before it is searched.  This
    can be done either by brute force, where the ZCatalog catalogs
    everything it can find, or by more selective means.  For maximum
    cataloging flexibility, objects can also be taught how to index
    (and unindex) themselves.

    Zope allows you to build your web application in a very flexible
    way by letting you organize your objects into simple, clear
    structures.  Zope also gives you the ability to create and destroy
    objects programmatically at any time.  This means that your
    application could very well grow to thousands of objects or more.

    Consider an employee database that creates one new employee object
    for each employee.  If you were a small company, then a very
    simple DTML loop could look for the employee with the last name
    'Bob'::

      <dtml-in Employees>
        <dtml-if "last_name == 'Bob'">
          Found Bob
        </dtml-if>
      </dtml-in>

    However, if your company had thousands of employees, then this
    loop would take a long time to find just one object.

    Alternatively, you could create a ZCatalog to keep track of your
    employees.  The catalog allows you to create indexes for various
    properties of your Employee objects, for example 'last_name', and
    to search those indexes very quickly.  The above loop, which could
    take many minutes to execute if you had thousands of employees, can
    be replaced by this ZCatalog query, which returns in milliseconds::

      <dtml-in "ZCatalog({'last_name' : 'Bob'})">
        Hi Bob!
      </dtml-in>

    Here we walk up to the catalog and pass it a Python data structure
    called a dictionary.  This dictionary maps the name 'last_name' to
    the value 'Bob'.  This is how you tell the ZCatalog which index you
    want to query, and what value you are looking for in that index.
    You can pass multiple parameters to ZCatalog::

      <dtml-in "ZCatalog({'last_name' : 'Bob', 
                          'last_modified_usage' :'range:min',
                          'last_modified' : DateTime('Feb 1, 2000')
                         })">
        New Bobs!
      </dtml-in>

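    In this query, 'last_modified_usage' tells the catalog to treat
    the 'last_modified' value as the minimum of a range, so the
    results are the Bobs modified on or after Feb 1, 2000.

    The ZCatalog takes a very object-oriented view of cataloging
    objects.  ZCatalogs are very flexible, which can make them a
    little confusing at first.  The ZCatalog management interface
    controls how your catalog behaves.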
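
    Returning to the employee example, the speed difference between
    the loop and the catalog query can be illustrated with a minimal
    Python sketch.  This is not ZCatalog's actual implementation; the
    'employees' list and the helper functions are hypothetical and
    only show the idea of looking a value up in an index instead of
    visiting every object:

```python
from collections import defaultdict

# Hypothetical employee records; in Zope these would be real objects.
employees = [
    {"id": "0001", "last_name": "Magill"},
    {"id": "0002", "last_name": "Bob"},
    {"id": "0003", "last_name": "Nancy"},
]


def linear_search(records, last_name):
    """Visit every record, as the dtml-in loop does."""
    return [r["id"] for r in records if r["last_name"] == last_name]


def build_index(records, field):
    """Map each value of `field` to the ids of the records holding it."""
    index = defaultdict(list)
    for record in records:
        index[record[field]].append(record["id"])
    return index


def indexed_search(index, value):
    """One dictionary lookup, however many records were indexed."""
    return index.get(value, [])


last_name_index = build_index(employees, "last_name")
```

    Building the index costs one pass over the records, but every
    later call to 'indexed_search' avoids that pass entirely, which is
    the trade the catalog makes.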

  What ZCatalog Returns

    A query to ZCatalog returns a sequence of record objects.  Those
    record objects correspond to objects that are cataloged in the
    catalog.  Record objects are NOT the objects that they refer to.
    They are handy little objects that work like the index cards in a
    card catalog: they store only meta-information about the object,
    such as the time it was created and its title.  Exactly what a
    record object stores is described in the section 'The Meta-data
    Table View'.

    In addition to whatever meta-data the record object has, each
    record object has the following attributes:

      o data_record_id_
        
        'data_record_id_' is the document id that this record object
        refers to.  The Catalog can be queried for the path to this
        object with 'ZCatalog.getpath(data_record_id_)'.

      o data_record_score_

        'data_record_score_' is the score with which this result
        record matched the query.  For text indexes, this is the
        number of occurrences of a search term in the document.  For
        field indexes, this is 1 if the record matched the query.
        Keyword indexes currently behave the same as field indexes,
        although they *should* return the number of keywords within
        the sequence that matched the query.

      o data_record_normalized_score_

        'data_record_normalized_score_' is the score of the record
        normalized against the rest of the result set to lie between
        0 and 1.  For text indexes, this is 1 if the record matches
        the query term.

  The Management Interface

    In the Zope management interface, you can create a ZCatalog
    anywhere in Zope.  In order to play with ZCatalog, you will need
    some objects to actually index.  A fresh Zope install comes with a
    folder called 'QuickStart' that contains some introductory
    material about Zope.  This is a nice small document set for
    experimenting with the ZCatalog.

    In your QuickStart folder, select 'ZCatalog' from the Add menu and
    add a ZCatalog.  You will see this screen:

      snapshot

    Here you have a few options, which we will quickly examine.  Like
    all Zope objects, your new ZCatalog must be given an id.  For the
    purposes of this document, we'll call our catalog 'Catalog'.  Also
    like many other Zope objects, you can give your new ZCatalog a
    title.  This is optional; you may put anything you want here.  It
    is often useful, however, to title the Catalog in a way that
    meaningfully represents its contents, for example 'Employees' if
    the Catalog mainly catalogs Employee objects.

    After the title there is a checkbox that lets you choose whether
    or not you want to select a Vocabulary.  Vocabulary objects will
    be explained a little later.  For now, leave this box unchecked.
    Note that this screen tells you that if you leave the box
    unchecked, as we're going to, a Vocabulary object will be created
    for you.

    After you have filled in this screen properly, click 'Add
    ZCatalog'.

    You will be taken back to the QuickStart folder in the management
    interface.  Notice that there is now a new object in that folder
    called 'Catalog'.

      snapshot

    To enter the catalog, click on its name or icon.

  The Contents View

    When you click on a ZCatalog you are taken to its contents view.
    From this view, a ZCatalog looks and acts much like a Folder.  In
    fact, the ZCatalog is built partially from the same code that
    Folders come from.  Here you have the familiar Zope Add menu, and
    you can add any kind of Zope object you normally have access to.
    You can even create a ZCatalog within your ZCatalog.

    If you have been following along so far, you will see that your
    new Catalog contains only one object, named 'Vocabulary'.  This is
    the mysterious Vocabulary object, which will be described later.

    Note that the objects shown here are NOT the objects that are
    cataloged in this catalog.

  The Find Objects to Catalog view

    So that we can start having some fun right away, we're going to
    skip one view over to the right and jump straight to the 'Find
    Objects to Catalog' view.  This long-winded view presents you with
    the following screen:

      snapshot

    This screen lets you specify what kind of objects you want the
    Catalog to find.  This screen is identical to the standard Zope
    'Find' interface.  Like the find interface, this screen will make
    ZCatalog traverse through your Zope objects.  However, in the case 
    of the ZCatalog, the Catalog will index each object that it finds
    that matches the criteria you specify here.

    If you are in the QuickStart Catalog, just click the 'Find' button 
    and do a wide open search.  This will cause your Catalog to index
    all of the contents of the QuickStart folder.  If you are running
    on a slower machine, this may take up to a few minutes.

    You will be returned to the view we skipped over, 'The Cataloged
    Objects View', which is discussed next.

    Note that using the Find interface to catalog objects is very
    inefficient and should only be used if you know that you are not
    going to be traversing lots of objects.  The more objects you
    traverse, the longer the cataloging operation takes.  While
    the ZCatalog is quite capable of searching through thousands and
    thousands of objects very quickly, actually indexing those objects
    is a much slower operation.  If you attempt to index too many
    objects too quickly, your in-memory indexes soon get very large
    and Zope starts aggressively swapping objects between memory and
    the database.  Of course, if your Zope process gets larger than
    your available memory, your operating system will soon start
    swapping bits of Zope out on its own.  This can cause indexing to
    slow to a crawl.

    Thus, the most efficient way to index content in Zope is to have
    that content index and unindex itself when it is created and
    destroyed.  Since the time it takes to index a single object is
    negligible, this keeps your machine running quickly even on
    high-usage sites.  Of course, the same problem will occur if you
    try to create too many of these smart objects too fast.  A
    scenario like this is unlikely, however, in an editorial or
    publication-based system.  For more write-intensive operations,
    larger-scale solutions should be considered.
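
    The self-indexing pattern can be sketched in plain Python.  The
    'ToyCatalog' and 'Employee' classes below are hypothetical
    stand-ins, not Zope classes; the two catalog method names are
    borrowed from the 'ZCatalog Objects' section of this Guide, and
    the argument order follows the order given there:

```python
class ToyCatalog:
    """Hypothetical stand-in for a catalog; not the real ZCatalog."""

    def __init__(self):
        self.entries = {}  # unique id -> values copied from the object

    def catalog_object(self, unique_id, obj):
        self.entries[unique_id] = {"last_name": obj.last_name}

    def uncatalog_object(self, unique_id):
        self.entries.pop(unique_id, None)


class Employee:
    """An object that indexes itself on creation and unindexes on removal."""

    def __init__(self, catalog, path, last_name):
        self.catalog = catalog
        self.path = path
        self.last_name = last_name
        # Index exactly one object: cheap compared with a full Find.
        self.catalog.catalog_object(self.path, self)

    def destroy(self):
        # Unindex first so the catalog never holds a stale reference.
        self.catalog.uncatalog_object(self.path)


catalog = ToyCatalog()
bob = Employee(catalog, "/Employees/0002", "Bob")
```

    Each creation or removal touches a single catalog entry, so the
    cost of keeping the catalog current is spread over the life of the
    site instead of being paid in one large traversal.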

  The Cataloged Objects View

    The Cataloged Objects View shows you all of the objects in the
    catalog one screenful at a time.

    It is important to know that these catalog listings are NOT the
    Zope objects they refer to.  They are just references to objects
    in your Zope system.  If you delete an object that is cataloged
    in a catalog, the catalog is left holding a reference that is no
    longer valid.  Unless you uncatalog the object before you delete
    it, you may get search results that point to objects that no
    longer exist.

    At the top of the screen are two buttons that allow you to either
    update or clear the catalog.  Update will go through each object
    reference in the catalog and try to update the index information
    for that object.  For example, if an employee's last name changes,
    the ZCatalog can be updated to reflect that change.  If the object
    no longer exists, updating it will remove it from the Catalog.
    Note that updating the entire catalog could take a long, long time
    if you have many objects.

    The Clear Catalog button does just that: it clears all of the
    indexes and object references out of the Catalog.  It does NOT
    delete the objects it refers to.

    In addition to updating or clearing the entire catalog, you can
    update or remove individual objects by selecting the checkbox
    just to the left of the link to each object:

      screen shot

    and then clicking the 'Update' or 'Remove' button at either the
    top or bottom of the listing.

  The Indexes View

    In order for ZCatalog to keep track of information about objects
    for you, you must tell it what kind of information you are
    interested in.  You do this by creating indexes.  

    When you create an index, you give it a name and you specify a
    type.  The name of the index is the property, attribute, or method
    you want the Catalog to use when indexing an object.  For example, 
    if all of your Employee objects had the attribute 'last_name',
    then you would want to create a 'last_name' index to index that
    value for every cataloged object.

  The Meta-data Table View

    The meta-data table allows the catalog to maintain a table of
    information about cataloged objects.  For each object in the
    catalog, the meta-data table stores a sequence of values, one for
    each column in this table.  This is useful, for example, if you
    want your search reports to include additional information about
    your objects such as their ids, titles, and URLs without having to 
    wake the actual object up to get the information.

    The meta-data table directly affects the shape of the record
    objects that come back from Catalog queries.
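
    As a rough illustration of how the table shapes the records, the
    Python sketch below builds a record type from a list of columns.
    The 'COLUMNS' list, the 'Document' class and 'make_record' are
    hypothetical names invented for this example; only the
    'data_record_id_' attribute comes from this Guide:

```python
from collections import namedtuple

# Hypothetical meta-data columns chosen for this catalog.
COLUMNS = ["id", "title"]

# Each record has one field per column, plus the document id.
Record = namedtuple("Record", COLUMNS + ["data_record_id_"])


class Document:
    """Stand-in for a cataloged Zope object."""

    def __init__(self, id, title, body):
        self.id = id
        self.title = title
        self.body = body


def make_record(document_id, obj):
    """Copy only the configured columns out of `obj`."""
    values = [getattr(obj, name, None) for name in COLUMNS]
    return Record(*values, data_record_id_=document_id)
```

    A record made this way carries 'id' and 'title' but not 'body', so
    a search report can show titles without loading the document
    itself.  Adding a column changes every record the catalog hands
    back.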

  The Status View

    The Status view shows information about your Catalog.

    Setting your sub-transaction threshold

    The Status view provides you with the option to specify a
    sub-transaction threshold value.  Zope is a transactional system,
    meaning that all changes made to Zope happen within a transaction,
    including all changes made to the catalog.  While the Catalog is
    indexing lots of information, lots of index objects are being
    changed.  For speed, the changes made in a transaction are kept in
    memory.  This gets dangerous, because the Catalog is more than
    happy to eat up every bit of memory your computer has.  As memory
    runs out, your OS's virtual memory pager starts to thrash, and
    eventually Zope will raise a MemoryError and the transaction will
    be rolled back.

    In order to prevent this, the Catalog will commit a sub-transaction
    every so often to allow the Zope cache to remove some of the changed
    data from memory.  If an error occurs, the entire transaction,
    including its sub-transactions, will be rolled back.

    Basically, the threshold is a knob that allows you to tweak how
    much memory you're going to allow Zope to consume while the
    Catalog is cataloging lots of objects *in one transaction*.
    Setting this number higher will make ZCatalog commit a
    sub-transaction less frequently, and indexing will consume more
    memory.  Setting this number lower will make ZCatalog commit a
    sub-transaction more frequently, and indexing will consume less
    memory.

    Note that sub-transaction data is stored wherever the Python
    tempfile module wants to put it.  If you are indexing lots and
    lots of data in one transaction, it is possible to fill up the
    temporary partition on certain systems.  Make sure you have ample
    memory and tempfile space if you plan on indexing gigabytes of
    data.
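
    The trade-off behind the threshold can be sketched as a simple
    counter.  This is only a model, under the assumption that memory
    use grows with the number of objects indexed since the last
    sub-transaction; 'index_with_threshold' and its
    'commit_subtransaction' callback are hypothetical names, not Zope
    calls:

```python
def index_with_threshold(objects, threshold, commit_subtransaction):
    """Index objects, committing a sub-transaction every `threshold` objects.

    Returns the largest number of uncommitted objects held at once,
    used here as a rough proxy for peak memory use.
    """
    pending = 0
    peak_pending = 0
    for _obj in objects:
        pending += 1  # real indexing work would happen here
        peak_pending = max(peak_pending, pending)
        if pending >= threshold:
            commit_subtransaction()  # lets changed data leave memory
            pending = 0
    return peak_pending


commits = []
peak_low = index_with_threshold(range(1000), 100, lambda: commits.append(1))
commits_low = len(commits)

commits.clear()
peak_high = index_with_threshold(range(1000), 500, lambda: commits.append(1))
commits_high = len(commits)
```

    With the lower threshold the model commits more often and holds
    fewer changes at a time; with the higher threshold it commits less
    often and holds more, which mirrors the behaviour described above.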


  ZCatalog Objects

    ZCatalog objects provide a number of methods for DTML programmers
    to manipulate and query a catalog.  The first three such methods
    are identical in parameters and operation, and differ only by
    name.  This is done for historical reasons as well as convenience.

      ZCatalog.query(query_object)

        The 'query' method accepts one parameter, a query object, and
        returns a list of result objects.

      ZCatalog.searchResults(query_object)

        The 'searchResults' method is identical to the query method.
        It is a useful mnemonic method in DTML when all of your search
        parameters are in the DTML Namespace (see the DTML Namespace
        How-To)::

          <dtml-in searchResults>
            <dtml-var sequence-item>
          </dtml-in>

      ZCatalog(query_object)

        Just calling a catalog object with a query object is identical 
        to calling the query method.

      ZCatalog.getpath(document_id)

        'getpath' takes one integer as an argument, and returns the
        path to the object that corresponds to 'document_id'.  This
        path can be used by REQUEST.resolve_url to return the actual
        object the path refers to (this is what 'getobject' does).
        Note that an object returned this way will be wrapped in the
        Acquisition context of the ZCatalog.

      ZCatalog.getobject(document_id)

        Returns the object referred to by 'document_id'.  This object
        is wrapped in the Acquisition context of the ZCatalog.

      ZCatalog.uniqueValuesFor(index)

        Returns the unique values for the index 'index'.  Note that
        'index' must be the name of an existing FieldIndex or
        KeywordIndex object.  Future indexes may support this method.
        Text Indexes do not and will raise an error.

      ZCatalog.catalog_object(unique_id, object)

        This method tells the catalog to catalog 'object' with the
        unique id 'unique_id'.  The ZCatalog stores an internal
        mapping from unique ids to integers.  Each unique_id maps to
        exactly one unique integer::

          /path/to/object/one -> 42
          /path/to/object/two -> 68

        Zope assumes that 'unique_id' is the absolute path to
        'object'.  If you plan on passing anything other than
        'object.absolute_url()' as unique_id to this method, you had
        better know what you're doing.

      ZCatalog.uncatalog_object(unique_id)

        Removes all references to the object whose unique id is
        'unique_id'.  This is the inverse operation to
        'catalog_object'.

      ZCatalog.schema()

        Returns a list of meta-data table names.

      ZCatalog.indexes()

        Returns a list of index names.

      ZCatalog.index_objects()

        Returns a list of actual index objects.
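
    The mapping between unique ids and integers that these methods
    rely on can be pictured with the following Python sketch.  The
    'UidMap' class and its method names are hypothetical and are not
    part of ZCatalog; the sketch only shows one way to keep a
    one-to-one mapping in both directions:

```python
class UidMap:
    """Toy two-way mapping between unique ids (paths) and integers."""

    def __init__(self):
        self.uid_to_rid = {}
        self.rid_to_uid = {}
        self.next_rid = 0

    def add(self, unique_id):
        """Return the integer for `unique_id`, allocating one if needed."""
        if unique_id not in self.uid_to_rid:
            rid = self.next_rid
            self.next_rid += 1
            self.uid_to_rid[unique_id] = rid
            self.rid_to_uid[rid] = unique_id
        return self.uid_to_rid[unique_id]

    def remove(self, unique_id):
        """Forget `unique_id` and its integer; unknown ids are ignored."""
        rid = self.uid_to_rid.pop(unique_id, None)
        if rid is not None:
            del self.rid_to_uid[rid]

    def path_for(self, rid):
        """Translate an integer back into the unique id it stands for."""
        return self.rid_to_uid[rid]
```

    Cataloging the same unique id twice returns the same integer, so
    re-cataloging an object updates its entry instead of creating a
    duplicate.  This is also why the unique id should be something
    stable, such as the object's path.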

  Field Indexes

    Field Indexes treat the value of a property atomically, that is,
    as a single whole.  The value can be an object such as a string,
    integer, tuple, list, dictionary or other Python object.  When you
    search a Field Index for a particular value, that value will only
    match objects whose property value matches the search term exactly.

    For example, consider an employee catalog with three employees
    cataloged in it.  Each employee has the properties 'id' and
    'last_name'::

       id          last_name
       0001        magill
       0002        lil
       0003        nancy

    To index these attributes, create two new indexes named 'id' and
    'last_name' and make them both 'Field Indexes'.

    Example.

    Field indexes support range searching if the values support
    comparison operations.  This is done by specifying extra query
    terms which control the range in which the query is made.
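
    Exact matching and range searching can be sketched in Python as
    follows.  'ToyFieldIndex' is a hypothetical class written for this
    Guide, not the real Field Index; it assumes the indexed values can
    be compared and sorted:

```python
import bisect


class ToyFieldIndex:
    """Toy field index: exact-match lookups plus range queries."""

    def __init__(self):
        self.by_value = {}       # value -> set of document ids
        self.sorted_values = []  # distinct values, kept in sorted order

    def index(self, doc_id, value):
        if value not in self.by_value:
            self.by_value[value] = set()
            bisect.insort(self.sorted_values, value)
        self.by_value[value].add(doc_id)

    def search(self, value):
        """Return documents whose value equals `value` exactly."""
        return set(self.by_value.get(value, ()))

    def search_range(self, minimum=None, maximum=None):
        """Return documents whose value lies in [minimum, maximum]."""
        values = self.sorted_values
        lo = 0 if minimum is None else bisect.bisect_left(values, minimum)
        hi = len(values) if maximum is None else bisect.bisect_right(values, maximum)
        result = set()
        for value in values[lo:hi]:
            result |= self.by_value[value]
        return result


last_names = ToyFieldIndex()
last_names.index(1, "magill")
last_names.index(2, "lil")
last_names.index(3, "nancy")
```

    In this sketch a search for 'lil' finds document 2, a search for
    'li' finds nothing because the match must be exact, and a range
    search with only a minimum of 'm' finds the documents for
    'magill' and 'nancy'.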

    Field indexes can be queried to return a list of all unique values
    stored in the index.  This is useful, for example, when you know
    that all the objects contained in a certain index map to only a
    small collection of values.  In Zope, this is often used to index
    'meta_type'.  This allows you to write a snippet of DTML which
    returns a list of all known (cataloged) meta types::

      <dtml-in "Catalog.uniqueValuesFor('meta_type')">
        <dtml-var sequence-item>
      </dtml-in>

  Keyword Indexes

    Keyword indexes index a sequence of atomic values.  Querying the
    index for a value will match any object which has that value
    anywhere in its sequence.  For example, an object whose keywords
    are 'zope' and 'python' is matched by a query for either word.
    This makes it easy to create the popular 'keyword' approach to
    searchable databases.  Keyword indexes allow you to provide a
    hierarchical structure on top of your cataloged data.
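
    A minimal Python sketch of the idea, using a hypothetical
    'ToyKeywordIndex' class that is not the real Keyword Index: each
    value in an object's sequence gets its own entry pointing back to
    that object.

```python
from collections import defaultdict


class ToyKeywordIndex:
    """Toy keyword index: every keyword in a sequence points to the document."""

    def __init__(self):
        self.docs_by_keyword = defaultdict(set)

    def index(self, doc_id, keywords):
        for keyword in keywords:
            self.docs_by_keyword[keyword].add(doc_id)

    def search(self, keyword):
        """Return documents having `keyword` anywhere in their sequence."""
        return set(self.docs_by_keyword.get(keyword, ()))


subjects = ToyKeywordIndex()
subjects.index(1, ["zope", "python"])
subjects.index(2, ["python"])
```

    Here a search for 'python' matches both documents and a search for
    'zope' matches only the first, whereas a field index would have
    treated each whole sequence as a single value.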

  Text Indexes

    Text Indexes treat the value of an object's attribute as a string
    of text.  Following language-specific rules, this text is parsed
    into 'words', which are then stored in the index.  This makes
    searching for the occurrence of a word within a document easy.

    Text indexes are the most complex of all indexes.  Consider the
    following text::

      "Bob is your Uncle."

    For the purpose of text indexing, this is called the 'document'.

    The Text index will transform this string into a sequence of
    values in a language-specific way.  Currently, Zope supports only
    English and European languages (see Appendix A).

    The default English-specific action is to turn the string into::

      ['bob', 'uncle']

    Note that 'is' and 'your' were removed.  This is because they are
    'stopwords', which are very common English words that are removed
    for the sake of index efficiency.  It is not very useful, for
    example, to search for the word 'is' or 'the', since they will
    probably match almost 100% of all English documents.  A search for
    those common terms would yield the entire dataset.  The same goes
    for common words in other languages, such as the French 'les'.

    The text index then takes each value extracted from the document
    string and maintains a list of word-to-document mappings::

      'bob'   -> 1,2,3
      'uncle' -> 1,2
      'zope'  -> 2,3,4

    This is a classic 'index'.  Searching for the word 'bob' will
    return documents 1, 2 and 3.

    Further, the text index query language supports boolean
    expressions such as::

      Query                      Result
      "bob AND uncle"            1,2
      "bob OR zope"              1,2,3,4
      "zope AND NOT uncle"       3,4

    Expressions can also be nested in arbitrary levels of parentheses::

      "(bob AND uncle) OR zope"  1,2,3,4

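    The splitting step and the boolean operators can be modelled in a
    few lines of Python.  This is a sketch only: the 'STOPWORDS' set
    is a tiny illustrative list, the 'split' function is a guess at
    the simplest possible English splitter, and the real text index
    has its own query parser.  The 'postings' table is copied from the
    word-to-document example in this section:

```python
import re

STOPWORDS = {"is", "your", "the"}  # illustrative only, not Zope's list


def split(document):
    """Lowercase the text, pull out runs of letters, drop stopwords."""
    words = re.findall(r"[a-z]+", document.lower())
    return [word for word in words if word not in STOPWORDS]


# Word -> document ids, as in the example mapping in this section.
postings = {
    "bob": {1, 2, 3},
    "uncle": {1, 2},
    "zope": {2, 3, 4},
}

# Boolean operators correspond to set operations on the postings.
bob_and_uncle = postings["bob"] & postings["uncle"]          # AND
bob_or_zope = postings["bob"] | postings["zope"]             # OR
zope_and_not_uncle = postings["zope"] - postings["uncle"]    # AND NOT
nested = (postings["bob"] & postings["uncle"]) | postings["zope"]
```

    Seen this way, AND is a set intersection, OR is a union, AND NOT
    is a difference, and parentheses simply decide which set operation
    is evaluated first.
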
  Language Specifics: Vocabulary Objects

    In the section on text indexes, we said that text indexes were a
    mapping from words to the documents containing those words.  This
    is true in a sense, but ZCatalog goes to great lengths to make
    searching and storing this information efficient.  For this
    reason, the ZCatalog actually maps integer word ids to integer
    document ids.

    Because the word ids are integers, they have no language-specific
    meaning.  Another object, called a Vocabulary object, is used to
    turn the meaningless integer into a meaningful 'word'.

    Vocabulary objects map word ids to words.  When you search for the
    term 'bob', it is first looked up in the vocabulary to find out
    what integer word id it maps to.  Then the index is queried with
    that integer.  The integer ids of the documents that contain the
    search term are returned for the query.
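
    The two-step lookup can be sketched in Python like this.  The
    'ToyVocabulary' class and the helper functions are hypothetical
    and much simpler than a real Vocabulary object; they only show the
    word-to-integer step followed by the integer-to-documents step:

```python
class ToyVocabulary:
    """Toy vocabulary mapping words to integer word ids and back."""

    def __init__(self):
        self.id_by_word = {}
        self.word_by_id = {}

    def word_id(self, word, create=False):
        """Return the id for `word`, or None; optionally allocate one."""
        if word not in self.id_by_word and create:
            new_id = len(self.word_by_id)
            self.id_by_word[word] = new_id
            self.word_by_id[new_id] = word
        return self.id_by_word.get(word)


vocabulary = ToyVocabulary()
docs_by_word_id = {}  # integer word id -> set of integer document ids


def index_document(doc_id, words):
    for word in words:
        wid = vocabulary.word_id(word, create=True)
        docs_by_word_id.setdefault(wid, set()).add(doc_id)


def search(word):
    wid = vocabulary.word_id(word)  # step 1: word -> integer word id
    if wid is None:
        return set()                # a word never indexed matches nothing
    return set(docs_by_word_id.get(wid, ()))  # step 2: id -> documents


index_document(1, ["bob", "uncle"])
index_document(2, ["bob", "zope"])
```

    Note that the index itself only ever sees integers.  Everything
    that depends on language stays inside the vocabulary, which is
    what lets it be swapped or customized.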

    Vocabulary objects are also in charge of determining how documents
    are split up into words.  Currently, Zope's Vocabulary objects
    only support splitting English and some European languages.  The
    'Splitter' takes a document and turns it into a list of values to
    be indexed or used as search terms.

    Vocabulary objects let you:

      o Manage Synonyms

        Because words map to integers, many words can map to the same
        integer.  This allows you to assign synonyms to words.  For
        example, a search for 'car' would also return documents
        containing 'automobile'.  This is done by mapping automobile
        and car to the same integer.

      o Manage Stopwords

        Stopwords are common words that are removed from the document
        before indexing.  In English, such common words as 'is' and
        'the' are not useful in most general indexing cases.
        Vocabulary objects let you manage which words are removed from
        a document.

      o Control Splitting

        Vocabulary objects let you control which character values you
        consider part of a word and which character values you do
        not.  This lets you control how special or non-English
        characters get treated by the splitter.  For example, the
        euro-english splitter may consider your special accent
        character to be a space, and split important words into
        nonsense.  By customizing the splitter, you can control how
        words are parsed from your document.

      o Support wildcard searching

        Globbing Vocabulary objects support the notion of searching
        for words by specifying a pattern.  This allows searches of
        the form::

          "bob*"
          "*bob"
          "un?le"
          "un*e"
          "?n*e"
          
        Here '?' matches any single character and '*' matches any run
        of characters.  The wildcard characters '?' and '*' can be
        changed to any other characters to suit your needs from the
        Vocabulary object management screen:

          screenshot
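
    Two of these features, synonyms and wildcard searching, can be
    sketched together in Python.  The tables below are made-up sample
    data, and the sketch uses Python's standard 'fnmatch' module with
    the default '?' and '*' characters.  It is only a model of the
    behaviour described above, not the Globbing Vocabulary's code:

```python
import fnmatch

# Synonyms share one integer: 'car' and 'automobile' both map to 2.
word_ids = {"bob": 0, "uncle": 1, "car": 2, "automobile": 2}

# Integer word id -> integer document ids (sample data).
docs_by_word_id = {0: {1, 2, 3}, 1: {1, 2}, 2: {5}}


def search(word):
    """Look the word up in the vocabulary, then query by integer."""
    wid = word_ids.get(word)
    return set(docs_by_word_id.get(wid, ()))


def glob_search(pattern):
    """Expand the pattern against known words, then merge the results."""
    result = set()
    for word in word_ids:
        if fnmatch.fnmatchcase(word, pattern):
            result |= search(word)
    return result
```

    Because both synonyms resolve to the same integer, a search for
    either word returns the same documents.  A pattern search is just
    a search for every vocabulary word the pattern matches, so its
    cost grows with the size of the vocabulary.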

  Stopwords and Synonyms

    A 'stopword' is a word that is removed from a document before it
    is indexed.  In English, it is useful to remove certain very
    common words such as 'is' and 'the'.  A 'synonym' is a word that
    has a similar meaning to another word.  For example, the word
    'car' is a synonym of the word 'automobile'.

    Both stopwords and synonyms are very language specific.  A set of
    stopwords for the English language, for example, would have no
    beneficial effect when removed from a document written in Dutch.
    Similarly, synonyms depend on meaning, and meaning is integrally
    tied to language.  'car' and 'automobile' may not even be words
    in another language, much less synonyms.

    Because Vocabulary objects encapsulate language specific
    information for a catalog, they are the obvious place to manage
    your stopwords and synonyms.

    Vocabulary objects provide a view to manage stopwords and
    synonyms.  This screen looks like this:

      screenshot

    Here, you see a text area that contains a sequence of lines::

      the
      is
      he
      she
      automobile:car
      meal:food
      it
      your

    Each line contains either a stopword or a synonym.  If the line
    contains a colon, then the string to the right of the colon is the
    word, and the string to the left is a synonym that has the same
    meaning as that word.

    It is possible to use multiple synonyms together to create lists
    of related words::

      automobile:car
      vehicle:car
      transport:car

    After editing the text area to suit your needs, click 'Submit'.  If
    you have made an error in the text, you will get an error
    message to that effect.
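
    The line format described above is simple enough to parse by hand.
    The Python sketch below is a hypothetical parser, not the code
    behind the Submit button; in particular, the decision to reject a
    line with an empty side of the colon is an assumption about what
    counts as an error:

```python
def parse_vocabulary_lines(text):
    """Split the text area into a stopword set and a synonym mapping."""
    stopwords = set()
    synonyms = {}  # synonym -> the word it stands for
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if ":" in line:
            # Left of the colon is the synonym, right is the word.
            synonym, word = (part.strip() for part in line.split(":", 1))
            if not synonym or not word:
                raise ValueError(f"line {line_number}: malformed synonym {raw!r}")
            synonyms[synonym] = word
        else:
            stopwords.add(line)
    return stopwords, synonyms
```

    Several synonyms can point at the same word, which is how the
    lists of related words shown above are built up.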


Appendix A:  Setting your locale