The ZCatalog Users Guide

First Draft: Feb 1st, 2000

Author: Michel Pelletier ([email protected])

ZCatalog is a component of Zope. This Guide assumes that you have basic Zope skills. If you encounter material in this Guide that you do not understand, you may find an explanation in one of the other Zope Guides: The Content Manager's Guide, the DTML Programmer's Guide, or The Zope Administrator's Guide.

If you continue to be confused, send feedback to the author, Michel Pelletier ([email protected]).

What ZCatalog does.

ZCatalog is just like the card catalog in a library. Think of all your Zope objects as books in a library. If you wanted to find all of the books by the author Aldous Huxley, you would walk up to the card catalog and look up Aldous Huxley in the author index. This would give you the location of all of the books by that author.

The ZCatalog works exactly like that. You can walk up to the ZCatalog (in DTML) and ask it for all of the objects whose author property is Aldous Huxley. Like a real library catalog, the ZCatalog must be built before it is searched. This can be done either by brute force, where the ZCatalog catalogs everything it can find, or by more selective means. For maximum cataloging flexibility, objects can also be taught how to index themselves (and unindex themselves).

Zope allows you to build your web application in a very flexible way by letting you organize your objects into simple, clear structures. Zope gives you the ability to create and destroy objects programmatically at any time. This means that your application could very well scale to thousands of objects or more.

Consider an employee database that creates one new employee object for each employee. If you were a small company, then a very simple DTML loop could look for the employee with the last name 'Bob':

      <dtml-in Employees>
        <dtml-if "last_name == 'Bob'">
          Found Bob
        </dtml-if>
      </dtml-in>

However, if your company had thousands of employees, then this loop would take a long time to find just one object.

Alternatively, you could create a ZCatalog to keep track of your employees. The catalog allows you to create indexes for various properties, for example the last_name of your Employee objects, and to search those indexes very quickly. The above loop, which could take many minutes to execute if you had thousands of employees, is answered in milliseconds by ZCatalog with this query:

      <dtml-in "ZCatalog({'last_name' : 'Bob'})">
        Hi Bob!
      </dtml-in>

Here we walk up to the catalog and pass it a Python data structure called a dictionary. This dictionary maps the name last_name to the value Bob. This is how you tell the ZCatalog which index you want to query and what value you are looking for in that index. You can pass multiple parameters to ZCatalog:

      <dtml-in "ZCatalog({'last_name' : 'Bob',
                          'last_modified_usage' : 'range:min',
                          'last_modified' : DateTime('Feb 1, 2000')
                         })">
        New Bobs!
      </dtml-in>

The ZCatalog takes a very object-oriented view of cataloging objects. ZCatalogs are very flexible and can often be a little confusing. The ZCatalog management interface controls how your catalog behaves.
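
The multi-parameter query shown earlier can be pictured with a small Python sketch. This is not ZCatalog's code: the records, the search helper, and the use of plain integers in place of DateTime values are assumptions made for illustration, and the 'range:min' usage entry is assumed to turn its term into a lower bound.

```python
# Toy model only: a multi-key query answered by intersecting per-term
# matches. The records, search() and the handling of the "<name>_usage"
# key are assumptions for illustration, not ZCatalog code.
records = {
    1: {"last_name": "Bob", "last_modified": 20000215},
    2: {"last_name": "Bob", "last_modified": 19991201},
    3: {"last_name": "Alice", "last_modified": 20000220},
}

def search(records, query):
    """Return the sorted ids of records matching every term in the query."""
    result = set(records)
    for name, wanted in query.items():
        if name.endswith("_usage"):
            continue  # modifier for another term, not a term itself
        if query.get(name + "_usage") == "range:min":
            # Treat the value as a lower bound instead of an exact match.
            matches = {rid for rid, rec in records.items() if rec[name] >= wanted}
        else:
            matches = {rid for rid, rec in records.items() if rec[name] == wanted}
        result &= matches  # every term must match
    return sorted(result)

new_bobs = search(records, {
    "last_name": "Bob",
    "last_modified_usage": "range:min",
    "last_modified": 20000201,
})
```

Here new_bobs holds only record 1, the one Bob modified on or after the lower bound.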

What ZCatalog Returns

A query to ZCatalog returns a sequence of record objects. Those record objects correspond to objects that are cataloged in the catalog. Record objects are NOT the objects that they refer to. They are just handy little objects that work like the index cards in a card catalog: they store only meta-information about the object, such as the time it was created and its title (what the record object actually concerns itself with is described in the section The Meta-Data Table).

In addition to whatever meta-data the record object has, each record object has the following attributes:

data_record_id_
data_record_id_ is the document id that this record object refers to. The Catalog can be queried for the path to this object with ZCatalog.getpath(data_record_id_).
data_record_score_
data_record_score_ is the score with which this result record matched the query. For text indexes, this is the number of occurrences of a search term in the document. For field indexes, this is 1 if the record matched the query. For keyword indexes, the behavior is currently the same as for field indexes, although it should actually return the number of keywords within the sequence that matched the query.
data_record_normalized_score_
data_record_normalized_score_ is the score of the record normalized against the rest of the result set to lie between 0 and 1. For text indexes, this is 1 if the record matches the query term.
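
As a rough illustration of what normalizing against the rest of the result set can mean, the sketch below divides every raw score by the largest score in the set. This guide does not state the formula ZCatalog uses, so the division by the maximum and the normalize_scores helper are assumptions.

```python
# Illustration only: one plausible way to normalize scores into 0..1 by
# dividing by the largest score in the result set. The exact formula
# ZCatalog uses is not stated in this guide.
def normalize_scores(scores):
    """Map {record_id: raw_score} to {record_id: score between 0 and 1}."""
    if not scores:
        return {}
    top = max(scores.values())
    if top == 0:
        return {rid: 0.0 for rid in scores}
    return {rid: score / top for rid, score in scores.items()}

# Raw text-index scores: occurrences of the search term per document.
raw = {42: 4, 68: 1, 99: 2}
normalized = normalize_scores(raw)
```
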
The Management Interface
In the Zope management interface, you can create a ZCatalog anywhere in Zope. In order to play with ZCatalog, you will need some objects to actually index. A fresh Zope install comes with a folder called QuickStart that contains some introductory material about Zope. This is a nice small document set for playing with the ZCatalog.

In your QuickStart folder, select ZCatalog from the add Menu and add a ZCatalog. You will see this screen:
snapshot

Here you have a few options, which we will quickly examine. Like all Zope objects, your new ZCatalog must be given an id. For the purposes of this document, we'll call our catalog Catalog. Also like many other Zope objects, you can give your new ZCatalog a title. This is optional, and you may put anything you want here. It is often useful, however, to title the Catalog in a way that meaningfully represents its contents, for example Employees if the Catalog mainly catalogs Employee objects.

After title there is a checkbox that lets you select whether or not you want to select a Vocabulary. Vocabulary objects will be explained a little later. For now, leave this box unchecked. Note that this screen tells you that if you leave the box unchecked like we're going to, a Vocabulary object will be created for us.

After you have filled in this screen properly, click Add ZCatalog.

Your management interface will return to the QuickStart folder. Notice that there is now a new object in that folder called Catalog.
snapshot

To enter the catalog, click on its name or icon.
The Contents View
When you click on a ZCatalog you are taken to its contents view. From this view, a ZCatalog looks and acts much like a Folder. In fact, the ZCatalog is built partially from the same code as Folders. Here you have the familiar Zope Add menu, and you can add any kind of Zope object you normally have access to. You can even create a ZCatalog within your ZCatalog.

If you have been following along so far, you will see that your new Catalog contains only one object, named Vocabulary. This is the mysterious Vocabulary object, which will be described later.

Note that the objects shown here are NOT the objects that are cataloged in this catalog.
The Find Objects to Catalog view
So that we can start having some fun right away, we're going to skip one view over to the right and jump straight to the Find Objects to Catalog view. This long-winded view presents you with the following screen:
snapshot

This screen lets you specify what kind of objects you want the Catalog to find. This screen is identical to the standard Zope Find interface. Like the find interface, this screen will make ZCatalog traverse through your Zope objects. However, in the case of the ZCatalog, the Catalog will index each object that it finds that matches the criteria you specify here.

If you are in the QuickStart Catalog, just click the Find button and do a wide open search. This will cause your Catalog to index all of the contents of the QuickStart folder. If you are running on a slower machine, this may take up to a few minutes.

You will be returned to the view we skipped over, The Cataloged Objects View, which is discussed next.

Note that using the Find interface to catalog objects is very inefficient and should only be used if you know that you are not going to traverse lots of objects. The more objects you traverse, the longer the cataloging operation takes. While the ZCatalog is quite capable of searching through thousands and thousands of objects very quickly, actually indexing those objects is a much slower operation. If you attempt to index too many objects too quickly, your in-memory indexes soon get very large and Zope starts aggressively swapping objects between memory and the database. Of course, if your Zope process gets larger than your available memory, your operating system will soon start swapping bits of Zope out on its own. This can cause indexing to slow to a crawl.

Thus, the most efficient way to index content in Zope is to have that content index and unindex itself when it is created and destroyed. Since the time it takes to index a single object is negligible, this keeps your machine running quickly even on high-usage sites. Of course, the same problem will occur if you try to create too many of these smart objects too fast. A scenario like this is unlikely, however, in an editorial or publication-based system. For more write-intensive operations, larger-scale solutions should be considered.
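
The idea of content that indexes and unindexes itself can be sketched in plain Python. Everything here is hypothetical: TinyCatalog, Employee, and the after_add and before_delete hook names are invented for illustration and are not Zope APIs.

```python
# Toy sketch of "smart" objects that index and unindex themselves.
# TinyCatalog, Employee and the hook names are hypothetical.
class TinyCatalog:
    def __init__(self):
        self.entries = {}   # unique id -> indexed last_name

    def catalog_object(self, unique_id, obj):
        self.entries[unique_id] = obj.last_name

    def uncatalog_object(self, unique_id):
        self.entries.pop(unique_id, None)

class Employee:
    def __init__(self, path, last_name):
        self.path = path
        self.last_name = last_name

    def after_add(self, catalog):
        """Called once when the object is created: index just this object."""
        catalog.catalog_object(self.path, self)

    def before_delete(self, catalog):
        """Called once before deletion, so no stale reference is left."""
        catalog.uncatalog_object(self.path)

catalog = TinyCatalog()
bob = Employee("/Employees/bob", "Bob")
bob.after_add(catalog)
indexed = dict(catalog.entries)   # snapshot while Bob is cataloged
bob.before_delete(catalog)
```

Each object costs one small indexing step when it appears and one when it goes away, so no bulk traversal is ever needed.
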
The Cataloged Objects View
The Cataloged Objects View shows you all of the objects in the catalog one screenful at a time.

It is important to know that these catalog listings are NOT the Zope objects they refer to. They are just references to objects in your Zope system. If you delete an object that is cataloged, the catalog will still contain a reference that is no longer valid. Unless you uncatalog the object before you delete it, you may get search results that point to objects that no longer exist.

At the top of the screen are two buttons that allow you to either update or clear the catalog. Update goes through each object reference in the catalog and tries to update the index information for that object. For example, if an employee's last name changes, the ZCatalog can be updated to reflect that change. If the object no longer exists, updating it will remove it from the Catalog. Note that updating the entire catalog could take a long, long time if you have many objects.

The Clear Catalog button does just that: it clears all of the indexes and object references out of the Catalog. It does NOT delete the objects it refers to.

In addition to updating or clearing the entire catalog, you can update or remove individual objects by selecting the checkbox just to the left of the link to each object and clicking the Update or Remove button at either the top or bottom of the listing:
screen shot
The Indexes View
In order for ZCatalog to keep track of information about objects for you, you must tell it what kind of information you are interested in. You do this by creating indexes.

When you create an index, you give it a name and you specify a type. The name of the index is the property, attribute, or method you want the Catalog to use when indexing an object. For example, if all of your Employee objects had the attribute last_name, then you would want to create a last_name index to index that value for every cataloged object.
The Meta-data Table View
The meta-data table allows the catalog to maintain a table of information about cataloged objects. For each object in the catalog, the meta-data table stores a sequence of values, one for each column in this table. This is useful, for example, if you want your search reports to include additional information about your objects such as their ids, titles, and URLs without having to wake the actual object up to get the information.

The meta-data table directly affects the shape of the result objects that come from Catalog queries.
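
A minimal sketch of the idea, assuming three columns named id, title, and url: values are copied into a small record when an object is cataloged, so a search report can be built later without loading the object itself. The helper names and the dictionary standing in for a Zope object are hypothetical.

```python
# Toy meta-data table: one tuple of column values per cataloged object.
# The column names and helper functions are assumptions for illustration.
from collections import namedtuple

columns = ("id", "title", "url")
Record = namedtuple("Record", columns)

metadata = {}   # integer document id -> Record

def store_metadata(doc_id, obj):
    """Copy the chosen columns from the object at catalog time."""
    metadata[doc_id] = Record(*(obj[name] for name in columns))

def report(doc_ids):
    """Build a search report from stored values, without loading objects."""
    return [f"{metadata[d].title} <{metadata[d].url}>" for d in doc_ids]

# A plain dictionary stands in for a Zope object here.
store_metadata(42, {"id": "one", "title": "Object One",
                    "url": "/path/to/object/one", "body": "long text ..."})
lines = report([42])
```

Adding or removing a column changes which attributes every result record carries, which is the sense in which the table shapes query results.
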
The Status View
The Status view shows information about your Catalog.

Setting your sub-transaction threshold

The Status view provides you with the option to specify a sub-transaction threshold value. Zope is a transactional system, meaning that all changes made to Zope happen within a transaction, including all changes made to the catalog. While the Catalog is indexing lots of information, lots of index objects are being changed. For speed, the changes made in a transaction are kept in memory at all times. This gets a bit dangerous, because the Catalog is more than happy to eat up every bit of memory your computer has, sending your OS's virtual memory pager into fits. Eventually, Zope will raise a MemoryError and the transaction will be rolled back.

In order to prevent this, the Catalog commits a sub-transaction every so often to allow the Zope cache to remove some of the changed data from memory. If an error occurs, the entire transaction, including its sub-transactions, will be rolled back.

Basically, the threshold is a knob that lets you tweak how much memory you allow Zope to consume while the Catalog is cataloging lots of objects in one transaction. Setting this number higher makes ZCatalog commit a sub-transaction less frequently, so indexing consumes more memory. Setting this number lower makes ZCatalog commit a sub-transaction more frequently, so indexing consumes less memory.

Note that sub-transaction data is stored wherever the Python tempfile module wants to put it. If you are indexing lots and lots of data in one transaction, it is possible to fill up the temporary partition on certain systems. Make sure you have ample memory and tempfile space if you plan on indexing gigabytes of data.
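
The trade-off behind the threshold can be simulated in a few lines of Python. This is a toy model: index_all and the commit callback are hypothetical, one changed item per object is assumed, and no real Zope transaction calls are shown.

```python
# Toy simulation of a sub-transaction threshold. index_all and the commit
# callback are hypothetical; real Zope transaction calls are not shown.
def index_all(objects, threshold, commit_subtransaction):
    """Index objects, flushing pending changes every `threshold` objects.

    Returns the largest number of changes held in memory at once.
    """
    pending = 0
    peak = 0
    for _obj in objects:
        pending += 1              # indexing one object changes index data
        peak = max(peak, pending)
        if pending >= threshold:
            commit_subtransaction()
            pending = 0           # changed data can now leave memory
    return peak

commits = []
peak_low = index_all(range(1000), 100, lambda: commits.append(1))
low_commits = len(commits)

commits.clear()
peak_high = index_all(range(1000), 500, lambda: commits.append(1))
high_commits = len(commits)
```

With the lower threshold the simulation commits more often and holds fewer changes at once; with the higher threshold it commits less often and holds more.
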
ZCatalog Objects
ZCatalog objects provide a number of methods for DTML programmers to manipulate and query a catalog. The first three such methods are identical in parameters and operation, and differ only by name. This is done for historical as well as convenience reasons.

ZCatalog.query(query_object)
The query method accepts one parameter, a query object, and returns a list of result objects.
ZCatalog.searchResults(query_object)
The searchResults method is identical to the query method. It is a useful mnemonic in DTML when all of your search parameters are in the DTML Namespace (see the DTML Namespace How-To):
```
          <dtml-in searchResults>
            <dtml-var sequence-item>
          </dtml-in>
```
ZCatalog(query_object)
Just calling a catalog object with a query object is identical to calling the query method.
ZCatalog.get_path(document_id)
get_path takes one integer as an argument, and returns a path to the object that corresponds to document_id. This path can be used by REQUEST.resolve_url to return the actual object the path refers to (this is what get_object does). Note that an object returned this way will be wrapped in the Acquisition context of the ZCatalog.
ZCatalog.get_object(document_id)
Returns the object referred to by document_id. This object is wrapped in the Acquisition context of the ZCatalog.
ZCatalog.uniqueValuesFor(index)
Returns the unique values for the index index. Note that index must be the name of an existing FieldIndex or KeywordIndex object. Future indexes may support this method. Text Indexes do not and will raise an error.
ZCatalog.catalog_object(unique_id, object)
This method tells the catalog to catalog object with the unique id unique_id. The ZCatalog stores an internal mapping between unique ids and a set of integers. Each unique_id maps to exactly one unique integer:

      /path/to/object/one -> 42
      /path/to/object/two -> 68

Zope assumes that unique_id is the absolute path to object. If you plan on passing anything other than object.absolute_url() as unique_id to this method, you had better know what you're doing.
ZCatalog.uncatalog_object(unique_id)
Removes all references to the object whose unique id is unique_id. This is the inverse operation to catalog_object.
ZCatalog.schema()
Returns a list of meta-data table names.
ZCatalog.indexes()
Returns a list of index names.
ZCatalog.index_objects()
Returns a list of actual index objects.
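
The bookkeeping behind catalog_object, uncatalog_object, and get_path can be pictured with a toy class. ToyIdCatalog is hypothetical and models only the mapping between unique ids and integers; the object argument is accepted but ignored, and the integers it hands out are arbitrary.

```python
# Toy model of the unique id <-> integer mapping described above.
# ToyIdCatalog is hypothetical and mimics only the bookkeeping.
import itertools

class ToyIdCatalog:
    def __init__(self):
        self._next_id = itertools.count(1)
        self._uid_to_id = {}   # "/path/to/object" -> integer
        self._id_to_uid = {}   # integer -> "/path/to/object"

    def catalog_object(self, unique_id, obj=None):
        """Register unique_id, reusing its integer if already cataloged."""
        # A real catalog would also index obj here; the toy ignores it.
        if unique_id not in self._uid_to_id:
            doc_id = next(self._next_id)
            self._uid_to_id[unique_id] = doc_id
            self._id_to_uid[doc_id] = unique_id
        return self._uid_to_id[unique_id]

    def uncatalog_object(self, unique_id):
        """Forget unique_id and its integer; the inverse of catalog_object."""
        doc_id = self._uid_to_id.pop(unique_id)
        del self._id_to_uid[doc_id]

    def get_path(self, doc_id):
        """Return the unique id (path) stored for an integer document id."""
        return self._id_to_uid[doc_id]

catalog = ToyIdCatalog()
one = catalog.catalog_object("/path/to/object/one")
two = catalog.catalog_object("/path/to/object/two")
```
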
Field Indexes
Field Indexes treat the value of a property atomically. This means that the value can be an object such as a string, integer, tuple, list, dictionary or other Python object. When you search a Field Index for a particular value, that value will only match objects whose property value matches the search term exactly.

For example, consider an employee catalog with three employees cataloged in it. Each employee has two properties, id and last_name:
```
       id          last_name
       0001        magill
       0002        lil
       0003        nancy
```
To index these attributes, create two new indexes named id and last_name and make them both Field Indexes.

Example.

Field indexes support range searching if the values support comparison operations. This is done by specifying extra query terms that control the range over which the query is made.
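
Exact matching and range searching can be contrasted in a short sketch that uses the last_name values from the table above. ToyFieldIndex is a hypothetical stand-in, not the ZCatalog FieldIndex class, and treating both range ends as inclusive is an assumption.

```python
# Toy field index: exact-match lookup plus a range search over sorted values.
# ToyFieldIndex is hypothetical; it is not the ZCatalog FieldIndex class.
import bisect
from collections import defaultdict

class ToyFieldIndex:
    def __init__(self):
        self._by_value = defaultdict(set)   # value -> set of document ids

    def index(self, doc_id, value):
        self._by_value[value].add(doc_id)

    def exact(self, value):
        """Ids whose value equals `value` exactly (no partial matches)."""
        return sorted(self._by_value.get(value, ()))

    def in_range(self, low, high):
        """Ids whose value v satisfies low <= v <= high."""
        values = sorted(self._by_value)
        start = bisect.bisect_left(values, low)
        stop = bisect.bisect_right(values, high)
        found = set()
        for value in values[start:stop]:
            found |= self._by_value[value]
        return sorted(found)

last_name = ToyFieldIndex()
for doc_id, name in [(1, "magill"), (2, "lil"), (3, "nancy")]:
    last_name.index(doc_id, name)

exact_hit = last_name.exact("lil")          # whole value matches
partial_miss = last_name.exact("li")        # a prefix is not a match
l_to_n = last_name.in_range("l", "n")       # values between "l" and "n"
```

The range search works only because strings can be compared with one another, which is the condition stated above.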

Field indexes can be queried to return a list of all unique values stored in the index. This is useful, for example, when you know that all the objects contained in a certain index map to only a small collection of values. In Zope, this is often used to index meta_type. This allows you to write a snippet of DTML that returns a list of all known (cataloged) meta types:
```
      <dtml-in "Catalog.uniqueValuesFor('meta_type')">
        <dtml-var sequence-item>
      </dtml-in>
```

Keyword Indexes

Keyword indexes index a sequence of atomic values. Querying the index for a value will match any object that has that value anywhere in its sequence. This makes it easy to build the popular 'keyword' approach to searchable databases. Keyword indexes also allow you to layer a hierarchical structure on top of your cataloged data.

Text Indexes

Text indexes treat the value of an object's attribute as a string of text. This text is parsed into 'words' according to language-specific rules, and the words are then stored in the index. This makes it easy to search for the occurrence of a word within a document.

Text indexes are the most complex of all indexes. Consider the following text:

      "Bob is your Uncle."

For the purpose of text indexing, this is called the 'document'.

The text index transforms this string into a sequence of values in a language-specific way. Currently, Zope supports only English and European languages (see Appendix A).

The default English-specific action is to turn the string into:

      ['bob', 'uncle']

Note that 'is' and 'your' were removed. This is because they are 'stopwords', which are very common English words that are removed for the sake of index efficiency. It is not very useful, for example, to search for the word 'is' or 'the', since they will probably match almost 100% of all English documents. A search for those common terms would yield the entire dataset. The same goes for common words in other languages, such as the French 'les'.

The text index then takes each value extracted from the document string and maintains a list of word-to-document mappings:

      'bob'   -> 1,2,3
      'uncle' -> 1,2
      'zope'  -> 2,3,4

This is a classic 'index'. Searching for the word 'bob' will return documents 1, 2 and 3.

Further, the text index query language supports boolean expressions such as:

      Query                      Result
      "bob AND uncle"            1,2
      "bob OR zope"              1,2,3,4
      "zope AND NOT uncle"       3,4

Expressions can also be nested in arbitrary levels of parentheses:

      "(bob AND uncle) OR zope"  1,2,3,4

Language Specifics: Vocabulary Objects

In the section on text indexes, we said that text indexes are a mapping from words to the documents containing those words. This is true in a sense, but ZCatalog goes to great lengths to make searching and storing this information efficient. For this reason, the ZCatalog actually maps integer word ids to integer document ids.

Because the word ids are integers, they have no language-specific meaning. Another object, called a Vocabulary object, is used to turn the meaningless integer into a meaningful 'word'.

Vocabulary objects map word ids to words. When you search for the term 'bob', it is first looked up in the vocabulary to find out what integer word id it maps to. Then the index is queried with that integer. The integer ids of the documents that contain that search term are returned for the query.

Vocabulary objects are also in charge of determining how documents are split up into words. Currently, Zope's Vocabulary objects only support splitting English and some European languages. The 'Splitter' takes a document and turns it into a list of values to be indexed or used as search terms.

Vocabulary objects let you:

o Manage synonyms. Because words map to integers, many words can map to the same integer. This allows you to assign synonyms to words. For example, a search for 'car' would also return documents containing 'automobile'. This is done by mapping automobile and car to the same integer.

o Manage stopwords. Stopwords are common words that are removed from the document before indexing. In English, such common words as 'is' and 'the' are not useful in most general indexing cases. Vocabulary objects let you manage which words are removed from a document.

o Control splitting. Vocabulary objects let you control which character values you consider part of a word and which you do not. This lets you control how special or non-English characters get treated by the splitter. For example, the euro-english splitter may consider your special accent character to be a space, and split important words into nonsense. By customizing the splitter, you can control how words are parsed from your document.

o Support wildcard searching. Globbing Vocabulary objects support the notion of searching for words by specifying a pattern. This allows searches of the form:

      "bob*"
      "*bob"
      "un?le"
      "un*e"
      "?n*e"

The wildcard characters '?' and '*' can be changed to any characters that suit your needs from the Vocabulary object management screen:

      screenshot

Stopwords and Synonyms

A 'stopword' is a word that is removed from a document before it is indexed. In English, it is useful to remove certain very common words such as 'is' and 'the'. A 'synonym' is a word that has a similar meaning to another word. For example, the word 'car' is a synonym of the word 'automobile'.

Both stopwords and synonyms are very language specific. A set of stopwords for the English language, for example, would have no beneficial effect when removed from a document written in Dutch. Similarly, synonyms depend on meaning, and meaning is integrally tied to language. 'car' and 'automobile' may not even be words in another language, much less synonyms.

Because Vocabulary objects encapsulate language-specific information for a catalog, they are the obvious place to manage your stopwords and synonyms.

Vocabulary objects provide a view to manage stopwords and synonyms. This screen looks like this:

      screenshot

Here, you see a text area that contains a sequence of lines:

      the
      is
      he
      she
      automobile:car
      meal:food
      it
      your

Each line contains either a stopword or a synonym. If the line contains a colon, the string to the right of the colon is the word, and the string to the left is a synonym that has the same meaning as that word.

It is possible to use multiple synonyms together to create lists of related words:

      automobile:car
      vehicle:car
      transport:car

After editing the text area to suit your needs, click 'Submit'. If you have made an error in the text, you will get an error message to that effect.
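
The stopword and synonym list format described above can be read with a few lines of Python. This is a sketch under stated assumptions, not Vocabulary object code: parse_word_list and normalize are hypothetical helpers, blank lines are skipped, and no error reporting is modeled.

```python
# Sketch of parsing the stopword/synonym text format described above.
# parse_word_list and normalize are hypothetical helpers, not Zope APIs.
def parse_word_list(text):
    """Return (stopwords, synonyms) from lines like 'the' or 'automobile:car'."""
    stopwords = set()
    synonyms = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if ":" in line:
            # Left of the colon is the synonym, right of it is the word.
            synonym, word = (part.strip() for part in line.split(":", 1))
            synonyms[synonym] = word
        else:
            stopwords.add(line)
    return stopwords, synonyms

def normalize(words, stopwords, synonyms):
    """Drop stopwords and replace each synonym with its canonical word."""
    return [synonyms.get(w, w) for w in words if w not in stopwords]

stopwords, synonyms = parse_word_list("""
the
is
automobile:car
vehicle:car
your
""")
terms = normalize(["the", "automobile", "is", "your", "vehicle"],
                  stopwords, synonyms)
```

After normalization both 'automobile' and 'vehicle' have become 'car', so a search for any of the three would reach the same documents.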

Appendix A: Setting your locale
