The ZCatalog Users Guide
First Draft: Feb 1st, 2000
Author: Michel Pelletier (michel@digicool.com)
ZCatalog is a component of Zope. This Guide assumes that you have
basic Zope skills. If you encounter material in this Guide that you
do not understand, you may find an explanation in one of the other
Zope Guides: The Content Manager's Guide, the DTML Programmer's
Guide, or The Zope Administrator's Guide.
If you continue to be confused, send feedback to the author, Michel
Pelletier (michel@digicool.com).
What ZCatalog does.
ZCatalog is just like the card catalog in a library. Think
of all your Zope objects as books in a library. If you
wanted to find all of the books by the author 'Aldous Huxley' then
you would walk up to the card catalog and look up 'Aldous Huxley'
in the author index. This will give you the location of all of
the books by that author.
The ZCatalog works exactly like that. You can walk up to the
ZCatalog (in DTML) and ask it for all of the objects whose
'author' property is 'Aldous Huxley'. Like a real library
catalog, the ZCatalog must be built before it is searched. This
can be done either by a brute force method, where the ZCatalog
catalogs everything it can find, or by more selective means. For
maximum cataloging flexibility, objects can also be taught how to
index themselves (and unindex themselves).
Zope allows you to build your web application in a very flexible
way by allowing you to organize your objects into simple, clear
structures. Zope gives you the ability to create and destroy
objects programmatically at any time. This means that your
application could very well grow to thousands of objects or
more.
Consider an employee database that creates one new employee object
for each employee. If you are a small company, then a very
simple DTML loop could look for the employee with the name 'Bob'::
Found Bob
However, if your company had thousands of employees, then this
loop would take a long time to find just one object.
Alternatively, you could create a ZCatalog to keep track of your
employees. The catalog allows you to create indexes for various
properties, for example, 'last_name' of your Employee objects, and
to search those indexes very quickly. The search that the above
loop performs, which could take many minutes if you had thousands
of employees, is answered in milliseconds by ZCatalog with this
query::
Hi Bob!
Here we walk up to the catalog and pass it a Python data structure
called a dictionary. This dictionary maps the name 'last_name' to
the value 'Bob'. This is how you tell the ZCatalog which index you
want to query, and what value you are looking for in the index.
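To see why an index lookup beats a loop, here is a minimal Python
sketch. It is only a toy model and not how ZCatalog is actually
implemented: the 'Employee' class, the 'build_index' helper and the
sample names are all made up for illustration.

```python
# Illustrative only: a toy model of looping versus index lookup.
# Employee, build_index and the sample data are assumptions, not Zope code.

class Employee:
    def __init__(self, id, last_name):
        self.id = id
        self.last_name = last_name

employees = [Employee('0001', 'Smith'), Employee('0002', 'Bob'),
             Employee('0003', 'Jones')]

def find_by_loop(objects, last_name):
    # Examines every object: the cost grows with the number of employees.
    return [o.id for o in objects if o.last_name == last_name]

def build_index(objects, attribute):
    # Maps each attribute value to the ids of the objects that have it.
    index = {}
    for o in objects:
        index.setdefault(getattr(o, attribute), []).append(o.id)
    return index

last_name_index = build_index(employees, 'last_name')

def find_by_index(query):
    # 'query' is a dictionary such as {'last_name': 'Bob'}.
    return last_name_index.get(query['last_name'], [])

print(find_by_loop(employees, 'Bob'))
print(find_by_index({'last_name': 'Bob'}))
```

Both calls find the same employee, but the second one does a single
dictionary lookup instead of visiting every object.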
You can pass multiple parameters to ZCatalog::
New Bobs!
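A query with several parameters narrows the result: an object must
match every term. The following Python sketch models that as a set
intersection. The index contents and the 'status' index are made-up
sample data, and the 'search' function is a stand-in, not the
ZCatalog API.

```python
# Illustrative only: combining several query terms by intersection.
# The index contents and the 'status' index are made-up sample data.

indexes = {
    'last_name': {'Bob': {1, 2, 3}, 'Smith': {4}},
    'status': {'new': {2, 3, 4}, 'old': {1}},
}

def search(query):
    # Start with no restriction, then narrow with each query term.
    result = None
    for name, value in query.items():
        matches = indexes[name].get(value, set())
        result = matches if result is None else result & matches
    return sorted(result or [])

print(search({'last_name': 'Bob'}))
print(search({'last_name': 'Bob', 'status': 'new'}))
```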
The ZCatalog takes a very object-oriented view of cataloging
objects. ZCatalogs are very flexible, which can make them a
little confusing at first. The ZCatalog management interface
controls how your catalog behaves.
What ZCatalog Returns
A query to ZCatalog returns a sequence of record objects. Those
record objects correspond to objects that are cataloged in the
catalog. Record objects are NOT the objects that they refer to.
They are just handy little objects that work like the index cards
in a card catalog: they only store meta-information about the
object, such as the time it was created and its title (what the
record object actually concerns itself with is described in the
section 'The Meta-data Table View').
In addition to whatever meta-data the record object has, each
record object has the following attributes:
o data_record_id_
'data_record_id_' is the document id that this record object
refers to. The Catalog can be queried for the path to this
object with 'ZCatalog.getpath(data_record_id_)'.
o data_record_score_
'data_record_score_' is the score that this result record
matched against the query. For text indexes, this is the
number of occurrences of a search term in the document. For
field indexes, this is 1 if the record matched the query. For
keyword indexes, the behavior is currently the same as for field
indexes, but it *should* actually return the number of keywords
within the sequence that matched the query.
o data_record_normalized_score_
'data_record_normalized_score_' is the score of the record
normalized with the rest of the result set to be between 0 and
1. For text indexes, this is 1 if the record matches the
query term.
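As a rough picture of what normalizing a score against the rest of
the result set can mean, here is a small Python sketch. Dividing by
the highest score in the result set is an assumption made for
illustration; this Guide does not say which rule ZCatalog uses.

```python
# Illustrative only: one way scores could be normalized to the 0..1 range.
# Dividing by the highest score is an assumption, not ZCatalog's rule.

def normalize(scores):
    # 'scores' maps a record id to its raw score.
    top = max(scores.values(), default=0)
    if top == 0:
        return {rid: 0.0 for rid in scores}
    return {rid: score / top for rid, score in scores.items()}

raw = {10: 4, 11: 1, 12: 2}
print(normalize(raw))
```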
The Management Interface.
In the Zope management interface, you can create a ZCatalog
anywhere in Zope. In order to play with ZCatalog, you will need
some objects to actually index. A fresh Zope install comes with a
folder called 'QuickStart' that contains some introductory
material about Zope. This is a nice small document set with which
to try out the ZCatalog.
In your QuickStart folder, select 'ZCatalog' from the Add menu and
add a ZCatalog. You will see this screen:
snapshot
Here you have a few options we will quickly examine. Like all
Zope objects, your new ZCatalog must be given an id. For the
purposes of this document, we'll call our catalog 'Catalog'. Also
like many other Zope objects, you can give your new ZCatalog a
title. This is optional, and you may put anything here you want. It
is often useful, however, to title the Catalog in a way that
meaningfully represents its contents, for example: 'Employees' if
the Catalog mainly cataloged Employee objects.
After title there is a checkbox that lets you select whether or
not you want to select a Vocabulary. Vocabulary objects will be
explained a little later. For now, leave this box unchecked.
Note that this screen tells you that if you leave the box
unchecked, as we are going to, a Vocabulary object will be created
for us.
After you have filled in this screen properly, click 'Add
ZCatalog'.
The management interface will take you back to the QuickStart
folder. Notice that there is now a new object in that folder
called 'Catalog'.
snapshot
To enter the catalog, click on its name or icon.
The Contents View
When you click on a ZCatalog you are taken to its contents view.
From this view, a ZCatalog looks and acts much like a Folder. In
fact, the ZCatalog is built partially from the same code that
Folders come from. Here you have the familiar Zope Add menu, and
you can add any kind of Zope object you normally have access to.
You can even create a ZCatalog within your ZCatalog.
If you have been following along so far, you will see that your
new Catalog contains only one object, named 'Vocabulary'. This is
the mysterious Vocabulary object which will be described later.
Note that the objects shown here are NOT the objects that are
cataloged in this catalog.
The Find Objects to Catalog view
So that we can start having some fun right away, we're going to
skip one view over to the right and jump straight to the 'Find
Objects to Catalog' view. This long-windedly named view presents
you with the following screen:
snapshot
This screen lets you specify what kind of objects you want the
Catalog to find. This screen is identical to the standard Zope
'Find' interface. Like the find interface, this screen will make
ZCatalog traverse through your Zope objects. However, in the case
of the ZCatalog, the Catalog will index each object that it finds
that matches the criteria you specify here.
If you are in the QuickStart Catalog, just click the 'Find' button
and do a wide open search. This will cause your Catalog to index
all of the contents of the QuickStart folder. If you are running
on a slower machine, this may take up to a few minutes.
You will be returned to the view we skipped over, 'The Cataloged
Objects View', which is discussed next.
Note that using the Find interface to catalog objects is very
inefficient and should only be used if you know that you are not
going to be traversing over lots of objects. The more objects you
traverse over, the longer the cataloging operation takes. While
the ZCatalog is quite capable of searching through thousands and
thousands of objects very quickly, actually indexing those objects
is a much slower operation. If you attempt to index too many
objects too quickly, your in-memory indexes soon get very large and
Zope starts aggressively swapping objects between memory and the
database. Of course, if your Zope process gets larger than your
available memory, your operating system will soon start swapping
bits of Zope out on its own. This can cause indexing to slow to a
crawl.
Thus, the most efficient way to index content in Zope is to have
that content index and unindex itself when it is created and
destroyed. Since the time it takes to index only one object is
negligible, this keeps your machine running quickly even on
high-usage sites. Of course, the same problem will occur if you
try to create too many of these smart objects too fast. A
scenario like this is unlikely, however, in an editorial or
publication based system. For more write-intensive operations,
larger scale solutions should be considered.
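The idea of content that indexes and unindexes itself can be
sketched in a few lines of Python. Everything here is hypothetical:
'ToyCatalog', its 'add' and 'remove' methods, and the 'after_add'
and 'before_delete' hook names are stand-ins chosen for this
example, not Zope's real hooks or the ZCatalog API.

```python
# Illustrative only: an object that catalogs and uncatalogs itself.
# ToyCatalog, add/remove and the after_add/before_delete hook names are
# hypothetical stand-ins, not real Zope or ZCatalog names.

class ToyCatalog:
    def __init__(self):
        self.entries = {}

    def add(self, unique_id, obj):
        self.entries[unique_id] = obj.title

    def remove(self, unique_id):
        self.entries.pop(unique_id, None)

class SelfIndexingDocument:
    def __init__(self, path, title, catalog):
        self.path, self.title, self.catalog = path, title, catalog

    def after_add(self):
        # Called once when the object is created: index just this object.
        self.catalog.add(self.path, self)

    def before_delete(self):
        # Called once before the object is destroyed: remove its entry.
        self.catalog.remove(self.path)

catalog = ToyCatalog()
doc = SelfIndexingDocument('/news/item1', 'First item', catalog)
doc.after_add()
print(catalog.entries)
doc.before_delete()
print(catalog.entries)
```

Each create or destroy touches exactly one catalog entry, which is
why this approach stays cheap no matter how large the site is.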
The Cataloged Objects View
The Cataloged Objects View shows you all of the objects in the
catalog one screenful at a time.
It is important to know that these catalog listings are NOT the
Zope objects they refer to. They are just references to objects
in your Zope system. If you delete an object that is cataloged
in a catalog, then the catalog will contain a reference that is no
longer valid. Unless you also uncatalog the object before you
delete it, you may get search results that point to objects that
no longer exist.
At the top of the screen are two buttons that allow you to either
update or clear the catalog. Update will go through each object
reference in the catalog and try to update the index information
for that object. For example, if an employee's last name changes,
the ZCatalog can be updated to reflect that change. If the object
no longer exists, updating it will remove it from the Catalog.
Note that updating the entire catalog could take a long, long time
if you have many objects.
The Clear Catalog button does just that: it clears all of the
indexes and object references out of the Catalog. It does NOT
delete the objects it refers to.
In addition to updating or clearing the entire catalog, you can
choose individual objects to update or remove by selecting
the checkbox just to the left of the link to that object:
screen shot
and then clicking the 'Update' or 'Remove' button at either the
top or bottom of the listing.
The Indexes View
In order for ZCatalog to keep track of information about objects
for you, you must tell it what kind of information you are
interested in. You do this by creating indexes.
When you create an index, you give it a name and you specify a
type. The name of the index is the property, attribute, or method
you want the Catalog to use when indexing an object. For example,
if all of your Employee objects had the attribute 'last_name',
then you would want to create a 'last_name' index to index that
value for every cataloged object.
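Because the index name doubles as the name of the property,
attribute, or method to look up, the lookup can be pictured with
the Python sketch below. The exact rule used here (get the named
attribute, then call it if it is callable) is an assumption for
illustration, and 'value_for_index' is a made-up helper.

```python
# Illustrative only: how an index name could be resolved against an object.
# The lookup rule (getattr, then call if callable) is an assumption.

class Employee:
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name          # plain attribute

    def full_name(self):                    # method
        return self.first_name + ' ' + self.last_name

def value_for_index(obj, index_name):
    # The index name is also the attribute or method name to look up.
    value = getattr(obj, index_name, None)
    if callable(value):
        value = value()
    return value

bob = Employee('Bob', 'Smith')
print(value_for_index(bob, 'last_name'))
print(value_for_index(bob, 'full_name'))
```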
The Meta-data Table View
The meta-data table allows the catalog to maintain a table of
information about cataloged objects. For each object in the
catalog, the meta-data table stores a sequence of values, one for
each column in this table. This is useful, for example, if you
want your search reports to include additional information about
your objects such as their ids, titles, and URLs without having to
wake the actual object up to get the information.
The meta-data table directly affects the shape of the resulting
objects that come from Catalog queries.
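One way to picture the meta-data table is as one row of column
values per cataloged object, where the columns decide which
attributes a result record has. The Python sketch below assumes
three columns ('id', 'title', 'url') and a made-up 'make_record'
helper; it is not the real record class.

```python
# Illustrative only: a meta-data table as rows of column values.
# The column names and the make_record helper are assumptions.
from collections import namedtuple

columns = ('id', 'title', 'url')
Record = namedtuple('Record', columns)

def make_record(obj):
    # Copy one value per column so reports need not load 'obj' again.
    return Record(*(getattr(obj, name, None) for name in columns))

class Document:
    def __init__(self, id, title, url):
        self.id, self.title, self.url = id, title, url
        self.body = 'a long body that the report does not need'

table = {1: make_record(Document('doc1', 'Welcome', '/QuickStart/doc1'))}
print(table[1].title)
print(table[1]._fields)
```

Adding or removing a column changes which attributes every record
carries, which is the sense in which the table shapes the results.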
The Status View
The Status view shows information about your Catalog.
Setting your sub-transaction threshold
The Status view provides you with the option to specify a
sub-transaction threshold value. Zope is a transactional system,
meaning that all changes made to Zope happen within a transaction,
including all changes made to the catalog. While the Catalog is
indexing lots of information, lots of index objects are being
changed. For speed, the changes made in a transaction are kept in
memory. This gets a bit dangerous, because the Catalog is
more than happy to eat up every bit of memory your computer has.
Your OS's virtual memory pager will start thrashing, and
eventually Zope will raise a MemoryError and the transaction will
be rolled back.
In order to prevent this, the Catalog will commit a sub-transaction
every so often to allow the Zope cache to remove some of the changed
data from memory. If an error occurs, the entire transaction, and
its sub-transactions, will be rolled back.
Basically, the threshold is a knob that allows you to tweak how much
memory you're going to allow Zope to consume while the Catalog is
cataloging lots of objects *in one transaction*. Setting this
number higher will make ZCatalog commit a sub-transaction less
frequently and indexing will consume more memory. Setting this
number lower will make ZCatalog commit a sub-transaction more
frequently and indexing will consume less memory.
Note that sub-transaction data is stored wherever the Python tempfile
module wants to put it. If you are indexing lots and lots of data
in one transaction, it is possible to fill up the temporary
partition on certain systems. Make sure you have ample memory and
tempfile space if you plan on indexing gigabytes of data.
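The trade-off behind the threshold can be simulated with a simple
counter. In this Python sketch, 'ToyTransaction' and its methods
are hypothetical and only count things; they do not use or
resemble the real Zope transaction machinery.

```python
# Illustrative only: a counter that triggers a sub-transaction commit.
# ToyTransaction and its methods are hypothetical, not the Zope API.

class ToyTransaction:
    def __init__(self):
        self.in_memory = 0       # changed objects still held in memory
        self.subcommits = 0

    def note_change(self):
        self.in_memory += 1

    def commit_subtransaction(self):
        # Changed data moves to temporary storage, freeing memory.
        self.in_memory = 0
        self.subcommits += 1

def index_many(count, threshold):
    # Returns (number of sub-transaction commits, peak changes in memory).
    txn = ToyTransaction()
    peak = 0
    for _ in range(count):
        txn.note_change()
        peak = max(peak, txn.in_memory)
        if txn.in_memory >= threshold:
            txn.commit_subtransaction()
    return txn.subcommits, peak

print(index_many(1000, 100))   # lower threshold
print(index_many(1000, 500))   # higher threshold
```

The lower threshold commits more often and holds fewer changes in
memory at once; the higher threshold does the opposite.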
ZCatalog Objects
ZCatalog objects provide a number of methods for DTML programmers
to manipulate and query a catalog. The first three such methods
are identical in parameters and operation, and differ only by
name. This is done for historical as well as convenience reasons.
ZCatalog.query(query_object)
The 'query' method accepts one parameter, a query object, and
returns a list of result objects.
ZCatalog.searchResults(query_object)
The 'searchResults' method is identical to the query method.
It is a useful mnemonic method in DTML when all of your search
parameters are in the DTML Namespace (see the DTML Namespace
How-To).
ZCatalog(query_object)
Just calling a catalog object with a query object is identical
to calling the query method.
ZCatalog.getpath(document_id)
'getpath' takes one integer as an argument, and returns the
path to the object that corresponds to 'document_id'. This
path can be used by REQUEST.resolve_url to return the actual
object the path refers to (this is what 'getobject' does).
Note that an object returned this way will be wrapped in an
Acquisition context of the ZCatalog.
ZCatalog.getobject(document_id)
Returns the object referred to by 'document_id'. This object
is wrapped in the Acquisition context of the ZCatalog.
ZCatalog.uniqueValuesFor(index)
Returns the unique values for the index 'index'. Note that
'index' must be the name of an existing FieldIndex or
KeywordIndex object. Future indexes may support this method.
Text Indexes do not and will raise an error.
ZCatalog.catalog_object(object, unique_id)
This method tells the catalog to catalog 'object' with the
unique id 'unique_id'. The ZCatalog stores an internal
mapping between the unique id and a set of integers. Each
unique_id maps to exactly one unique integer:
/path/to/object/one -> 42
/path/to/object/two -> 68
Zope assumes that 'unique_id' is the absolute path to
'object'. If you plan on passing anything other than
'object.absolute_url()' as unique_id to this method, you
had better know what you're doing.
ZCatalog.uncatalog_object(unique_id)
Removes all references to the object whose unique id is
'unique_id'. This is the inverse operation to
'catalog_object'.
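The mapping between unique ids and integers, and the way
uncataloging reverses it, can be modelled with two dictionaries.
This Python sketch is only a toy: 'ToyCatalog', its 'add' and
'remove' methods, and the sequential numbering are assumptions
made for illustration.

```python
# Illustrative only: mapping unique ids (paths) to integer record ids.
# ToyCatalog, add/remove and sequential id allocation are assumptions.

class ToyCatalog:
    def __init__(self):
        self.uids = {}      # unique id -> integer
        self.paths = {}     # integer -> unique id
        self.next_id = 1

    def add(self, unique_id):
        # Re-cataloging the same unique id reuses its integer.
        if unique_id not in self.uids:
            self.uids[unique_id] = self.next_id
            self.paths[self.next_id] = unique_id
            self.next_id += 1
        return self.uids[unique_id]

    def remove(self, unique_id):
        # The inverse operation: forget every reference to the id.
        rid = self.uids.pop(unique_id, None)
        self.paths.pop(rid, None)

catalog = ToyCatalog()
one = catalog.add('/path/to/object/one')
two = catalog.add('/path/to/object/two')
print(one, two, catalog.paths[two])
catalog.remove('/path/to/object/one')
print(catalog.uids)
```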
ZCatalog.schema()
Returns a list of the column names in the meta-data table.
ZCatalog.indexes()
Returns a list of index names.
ZCatalog.index_objects()
Returns a list of actual index objects.
Field Indexes
Field Indexes treat the value of a property atomically. This
means that the value can be an object such as a string, integer,
tuple, list, dictionary or other Python object. When you search
a Field Index for a particular value, that value will only match
objects whose property value matches the search term exactly.
For example, consider an employee catalog with three employees
cataloged in it. Each employee has properties called 'last_name'
and 'id'::
id      last_name
0001    magill
0002    lil
0003    nancy
To index these attributes, create two new indexes named 'id' and
'last_name' and make them both 'Field Indexes'.
Example.
Field indexes support range searching if the values support
comparison operations. This is done by specifying extra query
terms which control the range in which the query is made.
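Exact matching and range searching over atomic values can be
sketched in Python as follows. The index layout and the 'exact'
and 'in_range' function names are assumptions for illustration,
and the sketch does not show the actual query terms ZCatalog uses
to request a range. The rows are the sample employees from the
table above.

```python
# Illustrative only: exact-match and range lookups over atomic values.
# The index layout and function names are assumptions.
from bisect import bisect_left, bisect_right

rows = {'0001': 'magill', '0002': 'lil', '0003': 'nancy'}

# value -> ids, plus a sorted list of values for range queries
index = {}
for emp_id, last_name in rows.items():
    index.setdefault(last_name, []).append(emp_id)
sorted_values = sorted(index)

def exact(value):
    # Only a value equal to the search term matches.
    return index.get(value, [])

def in_range(low, high):
    # Works for any values that support comparison.
    start = bisect_left(sorted_values, low)
    stop = bisect_right(sorted_values, high)
    ids = []
    for value in sorted_values[start:stop]:
        ids.extend(index[value])
    return ids

print(exact('lil'))
print(exact('li'))
print(in_range('m', 'n'))
```

Note that 'li' finds nothing: a field index never matches on part
of a value.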
Field indexes can be queried to return a list of all unique values
stored in the index. This is useful, for example, when you know
that all the objects contained in a certain index map to only a
small collection of values. In Zope, this is often used to index
'meta_type'. This allows you to write a snippet of DTML which
returns a list of all known (cataloged) meta types.
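Listing unique values is cheap because the distinct values are
exactly the keys of the index. Here is a minimal Python sketch of
that idea; the sample meta types and ids are made up, and
'unique_values' is a stand-in rather than the ZCatalog method.

```python
# Illustrative only: listing the unique values held in a field index.
# The sample meta types and ids are made up.

meta_type_index = {
    'DTML Document': [1, 2, 5],
    'Folder': [3],
    'Image': [4, 6],
}

def unique_values(index):
    # The keys of the index are exactly the distinct values it has seen.
    return sorted(index)

print(unique_values(meta_type_index))
```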
Keyword Indexes
Keyword indexes index a sequence of atomic values. Querying the
index for a value will match any object which has that value
anywhere in the sequence of atomic values. This makes it easy to
create the popular 'keyword' approach to searchable databases.
Keyword indexes allow you to provide a hierarchical structure on
top of your cataloged data.
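The difference from a field index is that each value in the
sequence gets its own index entry. A small Python sketch, using
made-up documents and keywords and a hypothetical 'search'
function:

```python
# Illustrative only: a keyword index over sequences of atomic values.
# The documents and keywords are made-up sample data.

documents = {
    1: ['zope', 'python'],
    2: ['python', 'catalog'],
    3: ['zope'],
}

keyword_index = {}
for doc_id, keywords in documents.items():
    for keyword in keywords:
        # Each keyword in the sequence gets its own index entry.
        keyword_index.setdefault(keyword, set()).add(doc_id)

def search(keyword):
    return sorted(keyword_index.get(keyword, set()))

print(search('zope'))
print(search('python'))
```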
Text Indexes
Text Indexes treat the value of an object's attribute as a string of
text. According to certain language specific actions, this text
is typically parsed into 'words' which are then stored in the
index. This makes searching for the occurrence of a word within a
document easy.
Text indexes are the most complex of all indexes. Consider the
following text::
"Bob is your Uncle."
For the purpose of text indexing, this is called the 'document'.
The Text index will transform this string into a sequence of
values in a language specific way. Currently, Zope supports only
English and European languages (see Appendix A).
The default English specific action is to turn the string into::
['bob', 'uncle']
Note that 'is' and 'your' were removed. This is because they are
'stopwords', which are very common English words that are removed
for the sake of index efficiency. It is not very useful, for
example, to search for the word 'is' or 'the', since they will
probably match almost 100% of all English documents. A search for
those common terms would yield the entire dataset. The same goes
for other language specific common words, such as the French
'les'.
The text index then takes each value extracted from the document
string and maintains a list of word to document mappings::
'bob' -> 1,2,3
'uncle' -> 1,2
'zope' -> 2,3,4
This is a classic 'index'. Searching for the word 'bob' will
return documents 1, 2 and 3.
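The split, filter and index steps can be sketched in Python like
this. The tiny stopword list and the regular-expression splitter
are assumptions made for the example; the real splitting and
stopword rules are language specific and are not shown here.

```python
# Illustrative only: splitting a document, dropping stopwords, indexing.
# The tiny stopword list and the regex splitter are assumptions.
import re

STOPWORDS = {'is', 'your', 'the'}

def split(document):
    # Lowercase the text, pull out runs of letters, drop stopwords.
    words = re.findall(r'[a-z]+', document.lower())
    return [w for w in words if w not in STOPWORDS]

def index_document(index, doc_id, document):
    # Record, for every remaining word, which document contains it.
    for word in split(document):
        index.setdefault(word, set()).add(doc_id)

index = {}
index_document(index, 1, 'Bob is your Uncle.')
index_document(index, 2, 'Bob is the uncle of Zope.')

print(split('Bob is your Uncle.'))
print(sorted(index['bob']))
```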
Further, the text index query language supports boolean
expressions such as::
Query                       Result
"bob AND uncle"             1,2
"bob OR zope"               1,2,3,4
"zope AND NOT uncle"        3,4
Expressions can also be nested in arbitrary levels of parentheses::
"(bob AND uncle) OR zope"   1,2,3,4
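Each boolean operator corresponds to a set operation on the
document ids: AND is intersection, OR is union, and AND NOT is
difference. The Python sketch below evaluates the example queries
by hand against the word to document mappings shown earlier; it is
not a parser for the query language.

```python
# Illustrative only: boolean text queries expressed as set operations.
# This evaluates the example mappings by hand; it is not a query parser.

index = {
    'bob': {1, 2, 3},
    'uncle': {1, 2},
    'zope': {2, 3, 4},
}

bob, uncle, zope = index['bob'], index['uncle'], index['zope']

print(sorted(bob & uncle))            # bob AND uncle
print(sorted(bob | zope))             # bob OR zope
print(sorted(zope - uncle))           # zope AND NOT uncle
print(sorted((bob & uncle) | zope))   # (bob AND uncle) OR zope
```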
Language Specifics: Vocabulary Objects
In the section on text indexes, we said that text indexes were a
mapping from words to the documents containing those words. This
is true in a sense, but ZCatalog goes to great lengths to make
searching and storing this information efficient. For this
reason, the ZCatalog actually maps integer word ids to integer
document ids.
Because the word ids are integers, they have no language specific
meaning. Another object, called a Vocabulary object, is used to
turn the meaningless integer into a meaningful 'word'.
Vocabulary objects map word ids to words. When you search for the
term 'bob', it is first looked up in the vocabulary to find out
what integer word id it maps to. Then the index is queried with
that integer. Whatever integer document ids contain that search
term are returned for the query.
Vocabulary objects are also in charge of determining how documents
are split up into words. Currently, Zope's Vocabulary objects only
support splitting English and some European languages. The
'Splitter' takes a document and turns it into a list of values to
be indexed or used as search terms.
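The two-step lookup (word to word id, then word id to document
ids) can be sketched in Python as follows. 'ToyVocabulary' and its
way of handing out integers are assumptions for illustration, not
the real Vocabulary object.

```python
# Illustrative only: a vocabulary mapping words to integer word ids.
# ToyVocabulary and the id allocation scheme are assumptions.

class ToyVocabulary:
    def __init__(self):
        self.word_ids = {}

    def word_id(self, word, create=False):
        # Returns the integer for 'word', or None if it is unknown.
        if word not in self.word_ids and create:
            self.word_ids[word] = len(self.word_ids) + 1
        return self.word_ids.get(word)

vocabulary = ToyVocabulary()
text_index = {}   # integer word id -> set of integer document ids

def index_words(doc_id, words):
    for word in words:
        wid = vocabulary.word_id(word, create=True)
        text_index.setdefault(wid, set()).add(doc_id)

def search(word):
    # Look the word up in the vocabulary first, then query by integer.
    wid = vocabulary.word_id(word)
    return sorted(text_index.get(wid, set()))

index_words(1, ['bob', 'uncle'])
index_words(2, ['bob', 'zope'])
print(search('bob'))
print(search('missing'))
```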
Vocabulary objects let you:
o Manage Synonyms
Because words map to integers, many words can map to the same
integer. This allows you to assign synonyms to words. For
example, a search for 'car' would also return documents
containing 'automobile'. This is done by mapping automobile
and car to the same integer.
o Manage Stopwords
Stopwords are common words that are removed from the document
before indexing. In English, such common words as 'is' and
'the' are not useful in most general indexing cases.
Vocabulary objects let you manage which words are removed from
a document.
o Control Splitting
Vocabulary objects let you control which character values you
consider part of a word and which character values you do
not. This lets you control how special or non-English
characters get treated by the splitter. For example, the
euro-English splitter may consider your special accent
character to be a space, and split important words into
nonsense. By customising the splitter, you can control how
words are parsed from your document.
o Support wildcard searching
Globbing Vocabulary objects support the notion of searching
for words by specifying a pattern. This allows searches of the
form::
"bob*"
"*bob"
"un?le"
"un*e"
"?n*e"
The wildcard characters '?' and '*' can be customized to be
any characters that suit your needs from the Vocabulary object
management screen:
screenshot
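Wildcard searching can be pictured as matching a pattern against
every word the vocabulary knows. The Python sketch below uses the
standard library's 'fnmatch' module as a stand-in for whatever a
globbing Vocabulary does internally. The word list, the
'glob_words' function and its 'multi' and 'single' parameters are
all made up for this example.

```python
# Illustrative only: pattern matching against the words in a vocabulary.
# fnmatch is Python's shell-style matcher, used here as a stand-in;
# the word list and glob_words are made up.
from fnmatch import fnmatchcase

words = ['bob', 'bobby', 'kabob', 'uncle', 'untie']

def glob_words(pattern, multi='*', single='?'):
    # Translate custom wildcard characters to the ones fnmatch expects.
    translated = pattern.replace(multi, '*').replace(single, '?')
    return [w for w in words if fnmatchcase(w, translated)]

print(glob_words('bob*'))
print(glob_words('*bob'))
print(glob_words('un?le'))
print(glob_words('un%e', multi='%'))
```

The last call shows a customized wildcard: '%' plays the role that
'*' plays by default.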
Stopwords and Synonyms
A 'stopword' is a word that is removed from a document before it
is indexed. In English, it is useful to remove certain very
common words such as 'is' and 'the'. A 'synonym' is a word that
has a similar meaning to another word. For example, the word
'car' is a synonym of the word 'automobile'.
Both stopwords and synonyms are very language specific. A set of
stopwords for the English language, for example, would have no
beneficial effect when removed from a document written in Dutch.
Similarly, synonyms depend on meaning, and meaning is integrally
tied to language. 'car' and 'automobile' may not even be
words in another language, much less synonyms.
Because Vocabulary objects encapsulate language specific
information for a catalog, they are the obvious place to manage
your stopwords and synonyms.
Vocabulary objects provide a view to manage stopwords and
synonyms. This screen looks like this:
screenshot
Here, you see a text area that contains a sequence of lines::
the
is
he
she
automobile:car
meal:food
it
your
Each line contains either a stopword or a synonym. If the line
contains a colon, then the string to the right of the colon is the
word, and the string to the left is the synonym that has the same
meaning as that word.
It is possible to use multiple synonyms together to create lists
of related words::
automobile:car
vehicle:car
transport:car
After editing the text area to suit your needs, click 'Submit'. If
you have made an error in the document, you will get an error
message to that effect.
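The line format described above is simple enough to parse in a few
lines of Python. This sketch follows the rules as stated in this
section; the 'parse' function is hypothetical, and the way it
rejects malformed lines is an assumption, since the Guide does not
say which errors are reported.

```python
# Illustrative only: parsing the stopword and synonym text area.
# The rules follow the description above; the error handling for
# malformed lines is an assumption.

def parse(text):
    stopwords, synonyms = set(), {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if ':' in line:
            synonym, word = [part.strip() for part in line.split(':', 1)]
            if not synonym or not word:
                raise ValueError('bad synonym line: %r' % line)
            synonyms[synonym] = word     # left maps to the word on the right
        else:
            stopwords.add(line)
    return stopwords, synonyms

text = """the
is
automobile:car
vehicle:car
meal:food
"""
stopwords, synonyms = parse(text)
print(sorted(stopwords))
print(synonyms['automobile'], synonyms['vehicle'])
```

Because 'automobile' and 'vehicle' both map to 'car', the three
words form one list of related words.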
Appendix A: Setting your locale