The ZCatalog Users Guide
First Draft: Feb 1st, 2000
Author: Michel Pelletier ([email protected])
ZCatalog is a component of Zope. This Guide assumes that you have
basic Zope skills. If you encounter material in this Guide that you
do not understand, you may find an explanation in one of the other
Zope Guides: The Content Manager's Guide, the DTML Programmer's
Guide, or The Zope Administrator's Guide.
If you continue to be confused, send feedback to the author, Michel
Pelletier ([email protected]).
What ZCatalog does.
ZCatalog is just like the card catalog in a library. Think
of all your Zope objects as books in a library. If you
wanted to find all of the books by the author 'Aldous Huxley',
you would walk up to the card catalog and look up 'Aldous Huxley'
in the author index. This would give you the location of all of
the books by that author.
The ZCatalog works exactly like that. You can walk up to the
ZCatalog (in DTML) and ask it for all of the objects whose
'author' property is 'Aldous Huxley'. Like a real library
catalog, the ZCatalog must be built before it is searched. This
can be done either by a brute force method, where the ZCatalog
catalogs everything it can find, or by more selective means. For
maximum cataloging flexibility, objects can also be taught how to
index themselves (and unindex themselves).
Zope allows you to build your web application in a very flexible
way by allowing you to organize your objects into simple, clear
structures. Zope gives you the ability to create and destroy objects
programmatically at any time. This means that your application
could very well scale to the order of thousands of objects or
more.
Consider an employee database that created one new employee object
for each employee. If you were a small company, then a very
simple DTML loop could look for the employee with the name 'Bob':
<dtml-in Employees>
<dtml-if "last_name == 'Bob'">
Found Bob
</dtml-if>
</dtml-in>
However, if your company had thousands of employees, then this
loop would take a long time to find just one object.
Alternatively, you could create a ZCatalog to keep track of your
employees. The catalog allows you to create indexes for various
properties of your Employee objects, for example 'last_name', and
to search those indexes very quickly. The above loop, which could
take many minutes to execute if you had thousands of employees,
becomes a query that ZCatalog answers in milliseconds:
<dtml-in "ZCatalog({'last_name' : 'Bob'})">
Hi Bob!
</dtml-in>
Here we walk up to the catalog and pass it a Python data structure
called a dictionary. This dictionary maps the name 'last_name' to
the value 'Bob'. This is how you tell the ZCatalog what index you
want to query, and what value you are looking for in the index.
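To see why an index lookup beats the loop, here is a minimal Python
sketch. It is not ZCatalog code: the 'employees' list and the helper
names 'scan', 'build_index' and 'query' are invented for illustration,
and a plain dictionary stands in for the catalog's index.

```python
# Toy illustration only: a plain dict stands in for a ZCatalog index.
employees = [
    {"id": 1, "last_name": "Bob"},
    {"id": 2, "last_name": "Smith"},
    {"id": 3, "last_name": "Bob"},
]


def scan(objects, name, value):
    """Visit every object, as the DTML loop does."""
    return [obj["id"] for obj in objects if obj.get(name) == value]


def build_index(objects, name):
    """Build the index once: value -> set of object ids."""
    index = {}
    for obj in objects:
        index.setdefault(obj[name], set()).add(obj["id"])
    return index


def query(indexes, criteria):
    """Answer a {'index name': value} query by direct lookup."""
    result = None
    for name, value in criteria.items():
        ids = indexes[name].get(value, set())
        result = ids if result is None else result & ids
    return sorted(result or [])


indexes = {"last_name": build_index(employees, "last_name")}
print(query(indexes, {"last_name": "Bob"}))  # [1, 3]
```

The scan touches every object on every search, while the query only
looks up one key in a structure that was built ahead of time.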
You can pass multiple parameters to ZCatalog:
<dtml-in "ZCatalog({'last_name' : 'Bob',
'last_modified_usage' :'range:min',
'last_modified' : DateTime('Feb 1, 2000')
})">
New Bobs!
</dtml-in>
The ZCatalog takes a very object-oriented view of cataloging
objects. ZCatalogs are very flexible and can often be a little
confusing. The ZCatalog management interface controls how your
catalog behaves.
What ZCatalog Returns
A query to ZCatalog returns a sequence of record objects. Those
record objects correspond to objects that are cataloged in the
catalog. Record objects are NOT the objects that they refer to.
They are just handy little objects that work like the index cards
in a card catalog: they store only meta-information about the
object, such as the time it was created and its title (what the
record object actually concerns itself with is described in the
section 'The Meta-data Table View').
In addition to whatever meta-data the record object has, each
record object has the following attributes:
- data_record_id_
'data_record_id_' is the document id that this record object
refers to. The Catalog can be queried for the path to this
object with 'ZCatalog.getpath(data_record_id_)'.
- data_record_score_
'data_record_score_' is the score with which this result record
matched the query. For text indexes, this is the
number of occurrences of a search term in the document. For
field indexes, this is 1 if the record matched the query. For
keyword indexes, the behavior is the same as for field indexes,
although it should really return the number of keywords within
the sequence that matched the query.
- data_record_normalized_score_
'data_record_normalized_score_' is the score of the record
normalized with the rest of the result set to be between 0 and
1. For text indexes, this is 1 if the record matches the
query term.
The Management Interface.
In the Zope management interface, you can create a ZCatalog
anywhere in Zope. In order to play with ZCatalog, you will need
some objects to actually index. A fresh Zope install comes with a
folder called 'QuickStart' that contains some introductory
material on Zope. This is a nice small document set for playing
with the ZCatalog.
In your QuickStart folder, select 'ZCatalog' from the Add menu and
add a ZCatalog. You will see this screen:
snapshot
Here you have a few options, which we will quickly examine. Like all
Zope objects, your new ZCatalog must be given an id. For the
purposes of this document, we'll call our catalog 'Catalog'. Also
like many other Zope objects, you can give your new ZCatalog a
title. This is optional; you may put anything you want here. It
is often useful, however, to title the Catalog in a way that
meaningfully represents its contents, for example 'Employees' if
the Catalog mainly catalogs Employee objects.
After the title there is a checkbox that lets you choose whether or
not to select a Vocabulary. Vocabulary objects will be
explained a little later. For now, leave this box unchecked.
Note that this screen tells you that if you leave the box
unchecked, as we are going to, a Vocabulary object will be created
for you.
After you have filled in this screen properly, click 'Add
ZCatalog'.
The management interface will take you back to the QuickStart
folder. Notice that there is now a new object in that folder
called 'Catalog'.
snapshot
To enter the catalog, click on its name or icon.
The Contents View
When you click on a ZCatalog you are taken to its contents view.
From this view, a ZCatalog looks and acts much like a Folder. In
fact, the ZCatalog is built partially from the same code that Folders
come from. Here you have the familiar Zope Add menu, and you can
add any kind of Zope object you normally have access to. You can
even create a ZCatalog within your ZCatalog.
If you have been following along so far, you will see that your
new Catalog contains only one object, named 'Vocabulary'. This is
the mysterious Vocabulary object, which will be described later.
Note that the objects shown here are NOT the objects that are
cataloged in this catalog.
The Find Objects to Catalog View
So that we can start having some fun right away, we're going to
skip one view over to the right and jump straight to the 'Find
Objects to Catalog' view. This long-windedly named view presents
you with the following screen:
snapshot
This screen lets you specify what kind of objects you want the
Catalog to find. This screen is identical to the standard Zope
'Find' interface. Like the Find interface, this screen will make
ZCatalog traverse through your Zope objects. However, in the case
of the ZCatalog, the Catalog will index each object that it finds
that matches the criteria you specify here.
If you are in the QuickStart Catalog, just click the 'Find' button
and do a wide open search. This will cause your Catalog to index
all of the contents of the QuickStart folder. If you are running
on a slower machine, this may take up to a few minutes.
You will be returned to the view we skipped over, 'The Cataloged
Objects View', which is discussed next.
Note that using the Find interface to catalog objects is very
inefficient and should only be used if you know that you are not
going to be traversing lots of objects. The more objects you
traverse, the longer the cataloging operation takes. While
the ZCatalog is quite capable of searching through thousands and
thousands of objects very quickly, actually indexing those objects
is a much slower operation. If you attempt to index too many
objects too quickly, your in-memory indexes soon get very large and
Zope starts aggressively swapping objects between memory and the
database. Of course, if your Zope process gets larger than your
available memory, your operating system will soon start swapping
bits of Zope out on its own. This can cause indexing to slow to a
crawl.
Thus, the most efficient way to index content in Zope is to have
that content index and unindex itself when it is created and
destroyed. Since the time it takes to index only one object is
negligible, this keeps your machine running quickly even on
high usage sites. Of course, the same problem
will occur if you try to create too many of these smart objects
too fast. A scenario like this is unlikely, however, in an
editorial or publication based system. For more write-intensive
operations, larger scale solutions should be considered.
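The self-indexing idea can be sketched in plain Python. Nothing here
is real Zope code: 'ToyCatalog' and 'Employee' are hypothetical
classes, and only the method names 'catalog_object' and
'uncatalog_object' (with the argument order this Guide describes) are
borrowed from ZCatalog.

```python
class ToyCatalog:
    """Stand-in for a ZCatalog; it only records what it was asked to do."""

    def __init__(self):
        self.entries = {}  # unique id -> cataloged object

    def catalog_object(self, unique_id, obj):
        self.entries[unique_id] = obj

    def uncatalog_object(self, unique_id):
        self.entries.pop(unique_id, None)


class Employee:
    """A 'smart' object that keeps its own catalog entry up to date."""

    def __init__(self, path, last_name, catalog):
        self.path = path
        self.last_name = last_name
        self.catalog = catalog
        self.index()  # index on creation

    def index(self):
        self.catalog.catalog_object(self.path, self)

    def unindex(self):
        self.catalog.uncatalog_object(self.path)

    def rename(self, last_name):
        self.last_name = last_name
        self.index()  # re-index only this one object

    def destroy(self):
        self.unindex()  # unindex before the object goes away


catalog = ToyCatalog()
bob = Employee("/Employees/bob", "Bob", catalog)
bob.destroy()
```

Each change costs one small catalog operation, instead of one large
traversal of everything in the site.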
The Cataloged Objects View
The Cataloged Objects View shows you all of the objects in the
catalog one screenful at a time.
It is important to know that these catalog listings are NOT the
Zope objects they refer to. They are just references to objects
in your Zope system. If you delete an object that is cataloged
in a catalog, then the catalog will contain a reference that is no
longer valid. Unless you uncatalog the object before you
delete it, you may get search results that point to objects that
no longer exist.
At the top of the screen are two buttons that allow you to either
update or clear the catalog. Update will go through each object
reference in the catalog and try to update the index information
for that object. For example, if an employee's last name changes,
the ZCatalog can be updated to reflect that change. If the object
no longer exists, updating it will remove it from the Catalog.
Note that updating the entire catalog could take a long, long time
if you have many objects.
The Clear Catalog button does just that: it clears all of the
indexes and object references out of the Catalog. It does NOT
delete the objects those references point to.
In addition to updating or clearing the entire catalog, you can
update or remove individual objects by selecting
the checkbox just to the left of the link to that object:
screen shot
and then clicking the 'Update' or 'Remove' button at either the top
or bottom of the listing.
The Indexes View
In order for ZCatalog to keep track of information about objects
for you, you must tell it what kind of information you are
interested in. You do this by creating indexes.
When you create an index, you give it a name and you specify a
type. The name of the index is the property, attribute, or method
you want the Catalog to use when indexing an object. For example,
if all of your Employee objects had the attribute 'last_name',
then you would want to create a 'last_name' index to index that
value for every cataloged object.
The Meta-data Table View
The meta-data table allows the catalog to maintain a table of
information about cataloged objects. For each object in the
catalog, the meta-data table stores a sequence of values, one for
each column in this table. This is useful, for example, if you
want your search reports to include additional information about
your objects such as their ids, titles, and URLs without having to
wake the actual object up to get the information.
The meta-data table directly affects the shape of the result
objects that come back from Catalog queries.
The Status View
The Status view shows information about your Catalog.
Setting your sub-transaction threshold
The Status view provides you with the option to specify a
sub-transaction threshold value. Zope is a transactional system,
meaning that all changes made to Zope happen within a transaction,
including all changes made to the catalog. While the Catalog is
indexing lots of information, lots of index objects are being
changed. For speed, the changes made in a transaction are kept in
memory at all times. This gets a bit dangerous, because the Catalog
is more than happy to eat up every bit of memory your computer has.
Eventually, having driven your OS's virtual memory pager into
fits, Zope will raise a MemoryError and the transaction will
be rolled back.
In order to prevent this, the Catalog will commit a sub-transaction
every so often to allow the Zope cache to remove some of the changed
data from memory. If an error occurs, the entire transaction,
including its sub-transactions, will be rolled back.
Basically, the threshold is a knob that allows you to tweak how much
memory you're going to allow Zope to consume while the Catalog is
cataloging lots of objects in one transaction. Setting this
number higher will make ZCatalog commit a sub-transaction less
frequently, and indexing will consume more memory. Setting this
number lower will make ZCatalog commit a sub-transaction more
frequently, and indexing will consume less memory.
Note that sub-transaction data is stored wherever the Python tempfile
module wants to put it. If you are indexing lots and lots of data
in one transaction, it is possible to fill up the temporary
partition on certain systems. Make sure you have ample memory and
temporary file space if you plan on indexing gigabytes of data.
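The trade-off behind the threshold can be illustrated with a small
simulation. This is only a sketch under stated assumptions: the
'ToyTransaction' class and 'index_many' function are invented, the
threshold is assumed to count pending changes, and a temporary file
from Python's tempfile module stands in for wherever Zope really puts
sub-transaction data.

```python
import tempfile


class ToyTransaction:
    """Counts pending changes and spills them once a threshold is hit."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.pending = []       # changes currently held in memory
        self.peak_pending = 0   # most changes ever held at once
        self.subcommits = 0
        self.spill = tempfile.TemporaryFile(mode="w+")

    def record(self, change):
        self.pending.append(change)
        self.peak_pending = max(self.peak_pending, len(self.pending))
        if len(self.pending) >= self.threshold:
            self.subcommit()

    def subcommit(self):
        # Move pending changes out of memory into temporary storage.
        for change in self.pending:
            self.spill.write(change + "\n")
        self.pending = []
        self.subcommits += 1


def index_many(count, threshold):
    """Return (number of sub-commits, peak changes held in memory)."""
    txn = ToyTransaction(threshold)
    for number in range(count):
        txn.record("index object %d" % number)
    txn.spill.close()
    return txn.subcommits, txn.peak_pending


print(index_many(1000, 100))  # (10, 100)
print(index_many(1000, 500))  # (2, 500)
```

A higher threshold means fewer sub-commits but more changes held in
memory at once; a lower threshold means the opposite.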
ZCatalog Objects
ZCatalog objects provide a number of methods for DTML programmers
to manipulate and query a catalog. The first three such methods
are identical in parameters and operation, and differ only by
name. This is done for historical as well as convenience reasons.
ZCatalog.query(query_object)
The 'query' method accepts one parameter, a query object, and
returns a list of result objects.
ZCatalog.searchResults(query_object)
The 'searchResults' method is identical to the query method.
It is a useful mnemonic method in DTML when all of your search
parameters are in the DTML Namespace (see the DTML Namespace
How-To):
<dtml-in searchResults>
<dtml-var sequence-item>
</dtml-in>
ZCatalog(query_object)
Just calling a catalog object with a query object is identical
to calling the query method.
ZCatalog.get_path(document_id)
'get_path' takes one integer as an argument, and returns a
path to the object that corresponds to 'document_id'. This
path can be used by REQUEST.resolve_url to return the actual
object the path refers to (this is what 'get_object' does).
Note that returning this object means that it will be wrapped
in an Acquisition context of the ZCatalog.
ZCatalog.get_object(document_id)
Returns the object referred to by 'document_id'. This object
is wrapped in the Acquisition context of the ZCatalog.
ZCatalog.uniqueValuesFor(index)
Returns the unique values for the index 'index'. Note that
'index' must be the name of an existing FieldIndex or
KeywordIndex object. Future indexes may support this method.
Text Indexes do not and will raise an error.
ZCatalog.catalog_object(unique_id, object)
This method tells the catalog to catalog 'object' with the
unique id 'unique_id'. The ZCatalog stores an internal
mapping between unique ids and integers. Each
unique_id maps to exactly one unique integer:
/path/to/object/one -> 42
/path/to/object/two -> 68
Zope assumes that 'unique_id' is the absolute path to
'object'. If you plan on passing anything other than
'object.absolute_url()' as unique_id to this method, you
had better know what you're doing.
ZCatalog.uncatalog_object(unique_id)
Removes all references to the object whose unique id is
'unique_id'. This is the inverse operation to
'catalog_object'.
ZCatalog.schema()
Returns a list of meta-data table names.
ZCatalog.indexes()
Returns a list of index names.
ZCatalog.index_objects()
Returns a list of actual index objects.
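How several of these methods fit together can be sketched with a
miniature stand-in. 'MiniCatalog' below is hypothetical and is not
the real ZCatalog class: it borrows the method names and argument
order described above, uses plain dictionaries as the cataloged
"objects", and assumes integer ids are handed out sequentially.

```python
class MiniCatalog:
    """Hypothetical miniature catalog; not the real ZCatalog class."""

    def __init__(self):
        self.uids = {}     # unique id (path) -> integer document id
        self.paths = {}    # integer document id -> unique id (path)
        self.values = {}   # index name -> {document id: value}
        self.next_id = 1

    def catalog_object(self, unique_id, obj):
        # Reuse the integer when the same unique id is cataloged again.
        doc_id = self.uids.get(unique_id)
        if doc_id is None:
            doc_id = self.next_id
            self.next_id += 1
            self.uids[unique_id] = doc_id
            self.paths[doc_id] = unique_id
        for name, value in obj.items():
            self.values.setdefault(name, {})[doc_id] = value
        return doc_id

    def uncatalog_object(self, unique_id):
        doc_id = self.uids.pop(unique_id, None)
        if doc_id is not None:
            del self.paths[doc_id]
            for entries in self.values.values():
                entries.pop(doc_id, None)

    def get_path(self, document_id):
        return self.paths[document_id]

    def uniqueValuesFor(self, index):
        return sorted(set(self.values[index].values()))


catalog = MiniCatalog()
one = catalog.catalog_object("/path/to/object/one", {"meta_type": "Folder"})
two = catalog.catalog_object("/path/to/object/two", {"meta_type": "DTML Document"})
print(catalog.get_path(two))                 # /path/to/object/two
print(catalog.uniqueValuesFor("meta_type"))  # ['DTML Document', 'Folder']
catalog.uncatalog_object("/path/to/object/one")
print(catalog.uniqueValuesFor("meta_type"))  # ['DTML Document']
```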
Field Indexes
Field Indexes treat the value of a property atomically. This
means that the value can be an object such as a string, integer,
tuple, list, dictionary or other Python object. When you search
a Field Index for a particular value, that value will only match
objects whose property value matches the search term exactly.
For example, consider an employee catalog with four employees
cataloged in it. Each employee has two properties, 'id' and
'last_name':
id      last_name
0001    magill
0002    lil
0003    nancy
To index these attributes, create two new indexes named 'id' and
'last_name' and make them both 'Field Indexes'.
Example.
Field indexes support range searching if the values support
comparison operations. This is done by specifying extra query
terms which control the range in which the query is made.
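Exact matching and range searching can be sketched as follows.
'ToyFieldIndex' is a hypothetical class, not Zope's implementation.
It assumes values are hashable and mutually comparable, and it
supports only the 'range:min' usage that appears in the earlier query
example, interpreted here as "this value or greater".

```python
import bisect


class ToyFieldIndex:
    """Hypothetical field index: whole values mapped to document ids."""

    def __init__(self):
        self.docs_by_value = {}   # value -> set of document ids
        self.sorted_values = []   # distinct values, kept in sorted order

    def index(self, doc_id, value):
        if value not in self.docs_by_value:
            bisect.insort(self.sorted_values, value)
            self.docs_by_value[value] = set()
        self.docs_by_value[value].add(doc_id)

    def search(self, value, usage=None):
        if usage is None:
            # Exact match only: the value is treated atomically.
            return set(self.docs_by_value.get(value, set()))
        if usage == "range:min":
            # Every indexed value greater than or equal to the one given.
            start = bisect.bisect_left(self.sorted_values, value)
            result = set()
            for candidate in self.sorted_values[start:]:
                result |= self.docs_by_value[candidate]
            return result
        raise ValueError("unsupported usage: %r" % usage)


last_name = ToyFieldIndex()
for doc_id, name in [(1, "magill"), (2, "lil"), (3, "nancy")]:
    last_name.index(doc_id, name)

print(sorted(last_name.search("lil")))             # [2]
print(sorted(last_name.search("li")))              # []
print(sorted(last_name.search("m", "range:min")))  # [1, 3]
```

Note that 'li' finds nothing: a field index never matches part of a
value.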
Field indexes can be queried to return a list of all unique values
stored in the index. This is useful, for example, when you know
that all the objects contained in a certain index map to only a
small collection of values. In Zope, this is often used to index
'meta_type'. This allows you to write a snippet of DTML which
returns a list of all known (cataloged) meta types:
<dtml-in "Catalog.uniqueValuesFor('meta_type')">
<dtml-var sequence-item>
</dtml-in>
Keyword Indexes
Keyword indexes index a sequence of atomic values. Querying the
index for a value will match any object which has that value
anywhere in the sequence of atomic values. This makes it easy to
create the popular 'keyword' approach to searchable databases.
Keyword indexes allow you to provide a hierarchical structure on
top of your cataloged data.
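The difference from a field index is that each item of the sequence
is indexed separately. A minimal sketch, using a hypothetical
'ToyKeywordIndex' class and made-up keywords:

```python
class ToyKeywordIndex:
    """Hypothetical keyword index: each item of a sequence is indexed."""

    def __init__(self):
        self.docs_by_keyword = {}  # keyword -> set of document ids

    def index(self, doc_id, keywords):
        for keyword in keywords:
            self.docs_by_keyword.setdefault(keyword, set()).add(doc_id)

    def search(self, keyword):
        return sorted(self.docs_by_keyword.get(keyword, set()))


subjects = ToyKeywordIndex()
subjects.index(1, ["zope", "python"])
subjects.index(2, ["python", "catalog"])
subjects.index(3, ["zope"])

print(subjects.search("python"))  # [1, 2]
print(subjects.search("zope"))    # [1, 3]
print(subjects.search("dtml"))    # []
```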
Text Indexes
Text Indexes treat the value of an object's attribute as a string of
text. This text is typically parsed, according to certain language
specific rules, into 'words' which are then stored in the
index. This makes searching for the occurrence of a word within a
document easy.
Text indexes are the most complex of all indexes. Consider the
following text::
"Bob is your Uncle."
For the purpose of text indexing, this is called the 'document'.
The Text index will transform this string into a sequence of
values in a language specific way. Currently, Zope supports only
English and European languages (see Appendix A).
The default English specific action is to turn the string into::
['bob', 'uncle']
Note that 'is' and 'your' were removed. This is because they are
'stopwords', which are very common English words that are removed
for the sake of index efficiency. It is not very useful, for
example, to search for the word 'is' or 'the', since they will
probably match almost 100% of all English documents. A search for
those common terms would yield the entire dataset. The same goes
for other language specific common words, such as the French
'les'.
The text index then takes each value extracted from the document
string and maintains a list of word to document mappings::
'bob' -> 1,2,3
'uncle' -> 1,2
'zope' -> 2,3,4
This is a classic 'index'. Searching for the word 'bob' will
return documents 1, 2 and 3.
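The whole pipeline, from document to word list to index, can be
sketched in a few lines of Python. This is an illustration rather
than Zope's splitter: the stopword list is a tiny made-up sample, the
regular expression only recognizes unaccented letters, and documents
2, 3 and 4 are invented so that the result reproduces the mappings
shown above.

```python
import re

STOPWORDS = {"is", "your", "the", "of"}  # tiny illustrative list


def split(document):
    """Lowercase the text, split it into words, drop stopwords."""
    words = re.findall(r"[a-z]+", document.lower())
    return [word for word in words if word not in STOPWORDS]


def build_index(documents):
    """Map each remaining word to the ids of documents containing it."""
    index = {}
    for doc_id, text in documents.items():
        for word in split(text):
            index.setdefault(word, set()).add(doc_id)
    return index


# Made-up documents, chosen to reproduce the mappings shown above.
documents = {
    1: "Bob is your Uncle.",
    2: "Bob is the uncle of Zope.",
    3: "Zope is your Bob.",
    4: "Zope",
}
index = build_index(documents)

print(split(documents[1]))     # ['bob', 'uncle']
print(sorted(index["bob"]))    # [1, 2, 3]
print(sorted(index["uncle"]))  # [1, 2]
print(sorted(index["zope"]))   # [2, 3, 4]
```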
Further, the text index query language supports boolean
expressions such as::
Query                  Result
"bob AND uncle"        1,2
"bob OR zope"          1,2,3,4
"zope AND NOT uncle"   3,4
Expressions can also be nested in arbitrary levels of parentheses::
"(bob AND uncle) OR zope"   1,2,3,4
Language Specifics: Vocabulary Objects
In the section on text indexes, we said that text indexes were a
mapping from words to the documents containing those words. This
is true in a sense, but ZCatalog goes to great lengths to make
searching and storing this information efficient. For this
reason, the ZCatalog actually maps integer word ids to integer
document ids.
Because the word ids are integers, they have no language specific
meaning. Another object, called a Vocabulary object, is used to
turn the meaningless integer into a meaningful 'word'.
Vocabulary objects map word ids to words. When you search for the
term 'bob', it is first looked up in the vocabulary to find out
what integer word id it maps to. Then the index is queried with
that integer. Whatever integer document ids contain that search
term are returned for the query.
Vocabulary objects are also in charge of determining how documents
are split up into words. Currently, Zope's Vocabulary objects only
support splitting English and some European languages. The
'Splitter' takes a document and turns it into a list of values to
be indexed or used as search terms.
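The division of labor between a vocabulary and a text index might
look roughly like this. Both classes are hypothetical sketches: the
splitter simply lowercases and breaks on whitespace, and word ids are
assumed to be handed out in the order words are first seen.

```python
class ToyVocabulary:
    """Hypothetical vocabulary: assigns one integer id per word."""

    def __init__(self):
        self.word_ids = {}  # word -> integer word id

    def split(self, document):
        # Simplistic splitter: lowercase and break on whitespace.
        return document.lower().split()

    def word_id(self, word, create=False):
        if create and word not in self.word_ids:
            self.word_ids[word] = len(self.word_ids) + 1
        return self.word_ids.get(word)


class ToyTextIndex:
    """Maps integer word ids to integer document ids."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary
        self.docs_by_word_id = {}  # word id -> set of document ids

    def index(self, doc_id, document):
        for word in self.vocabulary.split(document):
            wid = self.vocabulary.word_id(word, create=True)
            self.docs_by_word_id.setdefault(wid, set()).add(doc_id)

    def search(self, word):
        # Look the word up in the vocabulary first, then query by integer.
        wid = self.vocabulary.word_id(word.lower())
        return sorted(self.docs_by_word_id.get(wid, set()))


vocabulary = ToyVocabulary()
text = ToyTextIndex(vocabulary)
text.index(1, "bob uncle")
text.index(2, "bob uncle zope")
text.index(3, "bob zope")

print(vocabulary.word_ids)               # {'bob': 1, 'uncle': 2, 'zope': 3}
print(sorted(text.docs_by_word_id[1]))   # [1, 2, 3]
print(text.search("Bob"))                # [1, 2, 3]
print(text.search("python"))             # []
```

The index itself never sees a string; only the vocabulary knows which
word each integer stands for.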
Vocabulary objects let you:
o Manage Synonyms
Because words map to integers, many words can map to the same
integer. This allows you to assign synonyms to words. For
example, a search for 'car' would also return documents
containing 'automobile'. This is done by mapping automobile
and car to the same integer.
o Manage Stopwords
Stopwords are common words that are removed from the document
before indexing. In English, such common words as 'is' and
'the' are not useful in most general indexing cases.
Vocabulary objects let you manage which words are removed from
a document.
o Control Splitting
Vocabulary objects let you control which character values you
consider part of a word and which character values you do
not. This lets you control how special or non-English
characters get treated by the splitter. For example, the
euro-English splitter may consider your special accent
character to be a space, and split important words into
nonsense. By customizing the splitter, you can control how
words are parsed from your document.
o Support wildcard searching
Globbing Vocabulary objects support the notion of searching
for words by specifying a pattern. This allows searches of the
form::
"bob*"
"*bob"
"un?le"
"un*e"
"?n*e"
The wildcard characters '?' and '*' can be customized to be
any characters that suit your needs from the Vocabulary object
management screen:
screenshot
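Wildcard searching can be pictured as expanding the pattern against
the words the vocabulary knows, then combining the documents for
every match. The sketch below uses Python's standard fnmatch module
as a stand-in for a globbing vocabulary; the 'expand' and
'wildcard_search' helpers, the word 'bobby' and document 5 are
invented for illustration. Note that fnmatch also gives '[' a special
meaning, which this Guide does not describe.

```python
import fnmatch


def expand(pattern, vocabulary_words):
    """Return the known words matching a '?' / '*' pattern."""
    return sorted(
        word for word in vocabulary_words
        if fnmatch.fnmatchcase(word, pattern)
    )


def wildcard_search(pattern, index):
    """Union the documents of every word the pattern expands to."""
    result = set()
    for word in expand(pattern, index):
        result |= index[word]
    return sorted(result)


index = {
    "bob": {1, 2, 3},
    "bobby": {5},
    "uncle": {1, 2},
    "zope": {2, 3, 4},
}

print(expand("bob*", index))           # ['bob', 'bobby']
print(expand("un?le", index))          # ['uncle']
print(expand("?n*e", index))           # ['uncle']
print(wildcard_search("bob*", index))  # [1, 2, 3, 5]
```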
Stopwords and Synonyms
A 'stopword' is a word that is removed from a document before it
is indexed. In English, it is useful to remove certain very
common words such as 'is' and 'the'. A 'synonym' is a word that
has a similar meaning to another word. For example, the word
'car' is a synonym of the word 'automobile'.
Both stopwords and synonyms are very language specific. A set of
stopwords for the English language, for example, would have no
beneficial effect when removed from a document written in Dutch.
Similarly, synonyms depend on meaning, and meaning is integrally
tied to language. 'car' and 'automobile' may not even be
words in another language, much less synonyms.
Because Vocabulary objects encapsulate language specific
information for a catalog, they are the obvious place to manage
your stopwords and synonyms.
Vocabulary objects provide a view to manage stopwords and
synonyms. This screen looks like this:
screenshot
Here, you see a text area that contains a sequence of lines::
the
is
he
she
automobile:car
meal:food
it
your
Each line contains either a stopword or a synonym. If the line
contains a colon, then the string to the right of the colon is the
word, and the string to the left is a synonym that has the same
meaning as that word.
It is possible to use multiple synonyms together to create lists
of related words::
automobile:car
vehicle:car
transport:car
After editing the text area to suit your needs, click 'Submit'. If
you have made an error in the document, you will get an error
message to that effect.
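The line format described above is simple enough to parse by hand.
The following is a sketch of one possible reading, not the
Vocabulary object's actual parser: the function names are invented,
and the rule that a line with an empty side of the colon is an error
is an assumption.

```python
def parse_vocabulary_lines(text):
    """Split the text area into stopwords and a synonym -> word map."""
    stopwords = set()
    synonyms = {}
    for number, line in enumerate(text.splitlines(), 1):
        line = line.strip()
        if not line:
            continue
        if ":" in line:
            # Left of the colon is the synonym, right of it is the word.
            synonym, word = [part.strip() for part in line.split(":", 1)]
            if not synonym or not word:
                raise ValueError("line %d: expected synonym:word" % number)
            synonyms[synonym] = word
        else:
            stopwords.add(line)
    return stopwords, synonyms


def canonical(word, stopwords, synonyms):
    """Return the word to index, or None if it is a stopword."""
    if word in stopwords:
        return None
    return synonyms.get(word, word)


stopwords, synonyms = parse_vocabulary_lines(
    "the\nis\nautomobile:car\nvehicle:car\nmeal:food\nit\n"
)
print(sorted(stopwords))                             # ['is', 'it', 'the']
print(canonical("automobile", stopwords, synonyms))  # car
print(canonical("vehicle", stopwords, synonyms))     # car
print(canonical("the", stopwords, synonyms))         # None
```

Because 'automobile' and 'vehicle' both resolve to 'car', all three
words end up sharing one entry in the index.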
Appendix A: Setting your locale
File contents
The ZCatalog Users GuideFirst Draft: Feb 1st, 2000
Author: Michel Pelletier ([email protected])
ZCatalog is a component of Zope. This Guide assumes that you have basic Zope skills. If you encounter material in this Guide that you do not understand you may find an explantation in one of the other Zope Guides: The Content Manager's Guide, the DTML Programer's Guide, or the The Zope Administrator's Guide.
If you continue to be confused, send feedback to the author, Michel Pelletier ([email protected]).
What ZCatalog does. ZCatalog is just like the card catalog in a library. Think of your all your Zope objects as books in a library. If you wanted to find all of the books by the author
Aldous Huxley
then you would walk up to the card catalog and look upAldous Huxley
in the author index. This will give you the location of all of the books by that author.The ZCatalog works exactly like that. You can walk up to the ZCatalog (in DTML) and ask it for all of the objects whose
author
property wasAldous Huxley
. Like a real library catalog, the ZCatalog must be built before it is searched. This can be done either by a brute force method; where the ZCatalog catalogs everything it can find, or it can be done by more selective means. For maximum cataloging flexibility, objects can also be taught how to index themselves (and unindex themselves).Zope allows you to build your web application in a very flexibly way by allowing you to organize your objects into simple, clear structures. Zope gives you the ability to create destroy objects programatically at any time. This means that your application could very well scale on the order of thousands of objects or more.
Consider an employee database that created one new employee object for each employee. If you were a small company, then a very simple DTML loop could look for the employee with the name 'Bob':
<dtml-in Employees> <dtml-if "last_name == 'Bob'"> Found Bob </dtml-if> </dtml-in>However, if your company had thousands of employees, then this loop would take a long time to find just one object.
Alternativly, you could create a ZCatalog to keep track of your employees. The catalog allows you to create indexes for various properties, for example,
last_name
of your Employee objects, and to search those indexes very quicky. The above loop, which could take many minutes to execute if you had thousands of employees, is returned in milliseconds by ZCatalog with this query:<dtml-in "ZCatalog({'last_name' : 'Bob'})"> Hi Bob! </dtml-in>Here we walk up to the catalog and pass it a python data structure called a dictionary. This dictionary maps the name
last_name
to the valueBob
. This is how you tell the ZCatalog what index you want to query, and what value you are looking for in the index. You can pass mulitple parameters to ZCatalog:<dtml-in "ZCatalog({'last_name' : 'Bob', 'last_modified_usage' :'range:min', 'last_modified' : DateTime('Feb 1, 2000') })"> New Bobs! </dtml-in>The ZCatalog takes a very object oriented view to cataloging objects. ZCatalog's are very flexible and can often be a little confusing. The ZCatalog managment interface controls how your catalog behaves.
What ZCatalog Returns A query to ZCatalog returns a sequence of record objects. Those record objects corespond to objects that are cataloged in the catalog. Record objects are NOT the objects that they refer to, they are just handy little objects that work like the index cards in a card catalog, they just store meta-information about the object, such as the time it was created and its title (what the record object actually concerns itself with is described in the section
The Meta-Data Table
).In addition to whatever meta-data the record object has, each record object has the following attributes:
- data_recordid
data_record_id_
is the document id that this record object refers to. The Catalog can be queried for the path to this object withZCatalog.getpath(data_record_id_)
. - data_recordscore
data_record_score_
is the score that this result record matched against the query. For text indexes, this is the number of occourances of a search term in the document. For field indexes, this is 1 if the record matched the query. For keyword indexes, the behavior is the same as field indexes, but it should actually return the number of keywords within the sequence that matched the query. - data_record_normalizedscore
data_record_normalized_score_
is the score of the record normalized with the rest of the result set to be between 0 and 1. For text indexes, this is 1 of the record matches the query term.The Managment Interface. In the Zope managment interface, you can create a ZCatalog anywhere in Zope. In order to play with ZCatalog, you will need some objects to actualy index. A fresh Zope install comes with a folder called
QuickStart
that contains some introductory material to Zope. This is a nice small document set to play with the ZCatalog.In your QuickStart folder, select
ZCatalog
from the add Menu and add a ZCatalog. You will see this screen:snapshot
Here you have a few options we will quickly examine. Like all Zope objects, your new ZCatalog must be given an id. For the purposes of this document, we'll call our catalog
Catalog
. Also like many other Zope objects, you can give your new ZCatalog a title. This is optional, you may put anything here you want. It is often useful, however, to title the Catalog in a way that meaningfully represents its contents, for example:Employees
if the Catalog mainly cataloged Employee objects.After title there is a checkbox that lets you select whether or not you want to select a Vocabulary. Vocabulary objects will be explained a little later. For now, leave this box unchecked. Note that this screen tells you that if you leave the box unchecked like we're going to, a Vocabulary object will be created for us.
After you have filled in this screen properly, click
Add ZCatalog
.Your managment interface will be taken back to the QuickStart folder. Notice that there is now a new object in that folder called
Catalog
.snapshot
To enter the catalog, click on its name or icon.
The Contents View When you click on a ZCatalog you are taken to it's contents view. From this view, a ZCatalog looks and acts much like a Folder. In fact, the ZCatalog built partialy from the same code that Folders come from. Here you have the familiar Zope Add menu, and you can any kind of Zope object you normally have access to add. You can even create a ZCatalog within your ZCatalog.
If you have been following along so far, you will see that your new Catalog contains only one Object named
Vocabulary
. This is the mysterious Vocabulary object which will described later.Note that the objects show here are NOT the objects that are cataloged in this catalog.
The Find Objects to Catalog view So that we can start having some fun right away, we're going to skip one view over to the right and jump straight to the
Find Objects to Catalog
View. This long winded view presents you will the following screen:snapshot
This screen lets you specify what kind of objects you want the Catalog to find. This screen is identical to the standard Zope
Find
interface. Like the find interface, this screen will make ZCatalog traverse through your Zope objects. However, in the case of the ZCatalog, the Catalog will index each object that it finds that matches the criteria you specify here.If you are in the QuickStart Catalog, just click the
Find
button and do a wide open search. This will cause your Catalog to index all of the contents of the QuickStart folder. If you are running on a slower machine, this may take up to a few minutes.You will be returned to the view we skipped over,
The Cataloged Objects View
, which is discussed next.Note that using the Find interface to Catalog objects is very ineficient and should only be used if you know that you are not going to be traversing over lots of objects. The more objects you traverse over, the longer the cataloging operation takes. While the ZCatalog is quite capable of searching through thousands and thousands of objects very quickly, actually indexing those objects is a much slower operation. If you attempt to index too many objects too quickly your in memory indexes soon get very large and Zope start aggresivly swapping objects to and from memory to the database. Of course, if your Zope process gets larger than your available memory, your operating system will soon start swapping bits of Zope out on its own. This can cause indexing to slow to a crawl.
Thus, the most efficient way to index content in Zope is to have that content index and unindex itself when it is created and destroyed. Since the time it takes to index only one object is very negiligible, this turns out to keep your machine running quicky fast even in high usage sites. Of course, the same problem will occour if you try to create too many of these smart objects too fast. A scenario like this is unlikly, however, in an editorial or publication based system. For higher write-intesive operations, larger scale solutions should be considered.
The Cataloged Objects View
The Cataloged Objects View shows you all of the objects in the catalog one screenful at a time.
It is important to know that these catalog listings are NOT the Zope objects they refer to. They are just references to objects in your Zope system. If you delete an object that is cataloged, the catalog will still contain a reference to it that is no longer valid. Unless you uncatalog the object before you delete it, you may get search results that point to objects that no longer exist.
At the top of the screen are two buttons that allow you to either update or clear the catalog. Update goes through each object reference in the catalog and tries to update the index information for that object. For example, if an employee's last name changes, the ZCatalog can be updated to reflect that change. If the object no longer exists, updating it will remove it from the Catalog. Note that updating the entire catalog can take a very long time if you have many objects.
The Clear Catalog button does just that: it clears all of the indexes and object references out of the Catalog. It does NOT delete the objects they refer to.
In addition to updating or clearing the entire catalog, you can update or remove individual objects by selecting the checkbox just to the left of the link to each object and clicking the 'Update' or 'Remove' button at either the top or bottom of the listing:
screen shot
The Indexes View
In order for ZCatalog to keep track of information about objects for you, you must tell it what kind of information you are interested in. You do this by creating indexes.
When you create an index, you give it a name and you specify a type. The name of the index is the property, attribute, or method you want the Catalog to use when indexing an object. For example, if all of your Employee objects had the attribute 'last_name', then you would create a 'last_name' index to index that value for every cataloged object.
The Meta-data Table View
The meta-data table allows the catalog to maintain a table of information about cataloged objects. For each object in the catalog, the meta-data table stores a sequence of values, one for each column in this table. This is useful, for example, if you want your search reports to include additional information about your objects such as their ids, titles, and URLs without having to wake the actual object up to get the information.
The meta-data table directly affects the shape of the result objects that come back from Catalog queries.
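As a rough sketch of what "shape" means here, the following Python models the meta-data table as one row of column values per cataloged object and builds result records from those rows alone. The column names, the 'Record' type, and the sample rows are assumptions made up for this example; this is not how ZCatalog is actually implemented.

```python
from collections import namedtuple

# Hypothetical meta-data columns chosen for this example.
columns = ("id", "title", "url")
Record = namedtuple("Record", columns)

# One row of column values per cataloged document id (toy data).
metadata = {
    42: ("one", "Object One", "/path/to/object/one"),
    68: ("two", "Object Two", "/path/to/object/two"),
}


def results_for(document_ids):
    """Build result records from stored column values only.

    The real objects are never loaded; everything comes from the table.
    """
    return [Record(*metadata[doc_id]) for doc_id in document_ids]


for record in results_for([42, 68]):
    print(record.id, record.title, record.url)
```

Adding or removing a column changes the attributes available on every result record, which is the sense in which the table determines the shape of the results.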
The Status View
The Status view shows information about your Catalog.
Setting your sub-transaction threshold
The Status view provides you with the option to specify a sub-transaction threshold value. Zope is a transactional system, meaning that all changes made to Zope happen within a transaction, including all changes made to the catalog. While the Catalog is indexing lots of information, lots of index objects are being changed. For speed, the changes made in a transaction are kept in memory. This gets dangerous, because the Catalog is more than happy to eat up every bit of memory your computer has. Your operating system's virtual memory pager will start to thrash, and eventually Zope will raise a MemoryError and the transaction will be rolled back.
In order to prevent this, the Catalog will commit a sub-transaction every so often to allow the Zope cache to remove some of the changed data from memory. If an error occurs, the entire transaction, and its sub-transactions, will be rolled back.
Basically, the threshold is a knob that allows you to tweak how much memory you are going to allow Zope to consume while the Catalog is cataloging lots of objects in one transaction. Setting this number higher will make ZCatalog commit a sub-transaction less frequently, and indexing will consume more memory. Setting this number lower will make ZCatalog commit a sub-transaction more frequently, and indexing will consume less memory.
Note that sub-transaction data is stored wherever the Python 'tempfile' module wants to put it. If you are indexing lots and lots of data in one transaction, it is possible to fill up the temporary partition on certain systems. Make sure you have ample memory and temporary file space if you plan on indexing gigabytes of data.
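The trade-off controlled by the threshold can be illustrated with a small simulation. This is a sketch under stated assumptions, not Zope code: it assumes the threshold counts indexed objects (the Guide does not say what unit the number is in) and treats each indexed object as one pending in-memory change. The function name 'index_with_threshold' is hypothetical.

```python
def index_with_threshold(object_count, threshold):
    """Simulate bulk indexing.

    Returns (peak pending changes held in memory, sub-transaction commits).
    """
    pending = 0  # changes currently held in memory
    peak = 0
    commits = 0
    for _ in range(object_count):
        pending += 1  # indexing one object changes index data
        peak = max(peak, pending)
        if pending >= threshold:
            commits += 1  # a sub-transaction commit lets memory be freed
            pending = 0
    return peak, commits


for threshold in (100, 1000, 10000):
    peak, commits = index_with_threshold(25000, threshold)
    print(threshold, peak, commits)
```

A higher threshold means fewer commits but a higher peak of pending changes; a lower threshold means the reverse.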
ZCatalog Objects
ZCatalog objects provide a number of methods for DTML programmers to manipulate and query a catalog. The first three such methods are identical in parameters and operation, and differ only by name. This is done for historical as well as convenience reasons.
ZCatalog.query(query_object)
The 'query' method accepts one parameter, a query object, and returns a list of result objects.
ZCatalog.searchResults(query_object)
The 'searchResults' method is identical to the 'query' method. It is a useful mnemonic in DTML when all of your search parameters are in the DTML Namespace (see the DTML Namespace How-To)::

  <dtml-in searchResults>
    <dtml-var sequence-item>
  </dtml-in>

ZCatalog(query_object)
Just calling a catalog object with a query object is identical to calling the 'query' method.
ZCatalog.get_path(document_id)
'get_path' takes one integer as an argument and returns a path to the object that corresponds to 'document_id'. This path can be used by 'REQUEST.resolve_url' to return the actual object the path refers to (this is what 'get_object' does). Note that an object returned this way is wrapped in the Acquisition context of the ZCatalog.
ZCatalog.get_object(document_id)
Returns the object referred to by 'document_id'. This object is wrapped in the Acquisition context of the ZCatalog.
ZCatalog.uniqueValuesFor(index)
Returns the unique values for the index 'index'. Note that 'index' must be the name of an existing FieldIndex or KeywordIndex object. Future indexes may support this method. Text Indexes do not and will raise an error.
ZCatalog.catalog_object(unique_id, object)
This method tells the catalog to catalog 'object' with the unique id 'unique_id'. The ZCatalog stores an internal mapping between unique ids and integers. Each unique id maps to exactly one unique integer::

  /path/to/object/one -> 42
  /path/to/object/two -> 68

Zope assumes that 'unique_id' is the absolute path to 'object'. If you plan on passing anything other than 'object.absolute_url()' as 'unique_id' to this method, you had better know what you are doing.
ZCatalog.uncatalog_object(unique_id)
Removes all references to the object whose unique id is 'unique_id'. This is the inverse operation to 'catalog_object'.
ZCatalog.schema()
Returns a list of the column names in the meta-data table.
ZCatalog.indexes()
Returns a list of index names.
ZCatalog.index_objects()
Returns a list of actual index objects.
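The mapping between unique ids and integers described for 'catalog_object', together with the reverse lookup that 'get_path' performs, can be sketched as a pair of dictionaries. 'UidMap' is a hypothetical class written for this example and its sequential integer allocation is an assumption; the real ZCatalog internals may differ.

```python
class UidMap:
    """Toy two-way mapping between unique ids (paths) and integer document ids."""

    def __init__(self):
        self.uid_to_int = {}
        self.int_to_uid = {}
        self.next_id = 1  # hypothetical allocation scheme

    def catalog_object(self, unique_id, obj):
        # Re-cataloging the same unique id reuses its integer, so each
        # unique id maps to exactly one integer. 'obj' is not indexed here.
        if unique_id not in self.uid_to_int:
            self.uid_to_int[unique_id] = self.next_id
            self.int_to_uid[self.next_id] = unique_id
            self.next_id += 1
        return self.uid_to_int[unique_id]

    def uncatalog_object(self, unique_id):
        # Inverse of catalog_object: drop both directions of the mapping.
        document_id = self.uid_to_int.pop(unique_id, None)
        if document_id is not None:
            del self.int_to_uid[document_id]

    def get_path(self, document_id):
        return self.int_to_uid[document_id]


uids = UidMap()
first = uids.catalog_object("/path/to/object/one", object())
second = uids.catalog_object("/path/to/object/two", object())
print(first, second, uids.get_path(first))
```

Because search results carry only the integer, the path has to be looked up again before the real object can be fetched.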
Field Indexes
Field Indexes treat the value of a property atomically. This means that the value can be an object such as a string, integer, tuple, list, dictionary or other Python object. When you search a Field Index for a particular value, that value will only match objects whose property value matches the search term exactly.
For example, consider an employee catalog with three employees cataloged in it. Each employee has the properties 'id' and 'last_name'::

  id    last_name
  0001  magill
  0002  lil
  0003  nancy

To index these attributes, create two new indexes named 'id' and 'last_name' and make them both 'Field Indexes'.
Example.
Field indexes support range searching if the values support comparison operations. This is done by specifying extra query terms which control the range in which the query is made.
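Exact matching and range searching can be sketched with a toy index over the 'last_name' values from the employee example. 'ToyFieldIndex' and its method names are hypothetical; the sketch does not show the extra query terms ZCatalog itself uses to request a range.

```python
class ToyFieldIndex:
    """Toy field index: each document id maps to one atomic value."""

    def __init__(self):
        self.values = {}  # document id -> indexed value

    def index(self, document_id, value):
        self.values[document_id] = value

    def search(self, value):
        """Exact match only: 'mag' does not match 'magill'."""
        return sorted(d for d, v in self.values.items() if v == value)

    def search_range(self, low, high):
        """Inclusive range match; requires values that support comparison."""
        return sorted(d for d, v in self.values.items() if low <= v <= high)


last_name = ToyFieldIndex()
for document_id, name in [(1, "magill"), (2, "lil"), (3, "nancy")]:
    last_name.index(document_id, name)

print(last_name.search("magill"))        # exact value: one match
print(last_name.search("mag"))           # a prefix is not an exact match
print(last_name.search_range("l", "n"))  # names sorting between "l" and "n"
```

Note that "nancy" falls outside the range "l" to "n" because string comparison places it after "n".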
Field indexes can be queried to return a list of all unique values stored in the index. This is useful, for example, when you know that all the objects contained in a certain index map to only a small collection of values. In Zope, this is often used to index 'meta_type'. This allows you to write a snippet of DTML which returns a list of all known (cataloged) meta types::

  <dtml-in "Catalog.uniqueValuesFor('meta_type')">
    <dtml-var sequence-item>
  </dtml-in>

Keyword Indexes
Keyword indexes index a sequence of atomic values. Querying the index for a value will match any object which has that value anywhere in the sequence of atomic values. This makes it easy to create the popular 'keyword' approach to searchable databases. Keyword indexes allow you to provide a hierarchical structure on top of your cataloged data.
Text Indexes
Text Indexes treat the value of an object's attribute as a string of text. According to certain language-specific rules, this text is parsed into 'words' which are then stored in the index. This makes searching for the occurrence of a word within a document easy. Text indexes are the most complex of all indexes.
Consider the following text::

  "Bob is your Uncle."

For the purpose of text indexing, this is called the 'document'. The Text index will transform this string into a sequence of values in a language-specific way. Currently, Zope supports only English and European languages (see Appendix A). The default English-specific action is to turn the string into::

  ['bob', 'uncle']

Note that 'is' and 'your' were removed. This is because they are 'stopwords', which are very common English words that are removed for the sake of index efficiency. It is not very useful, for example, to search for the word 'is' or 'the', since they will match almost every English document. A search for those common terms would yield the entire dataset. The same goes for common words in other languages, such as the French 'les'.
The text index then takes each value extracted from the document string and maintains a list of word to document mappings::

  'bob'   -> 1,2,3
  'uncle' -> 1,2
  'zope'  -> 2,3,4

This is a classic 'index'. Searching for the word 'bob' will return documents 1, 2 and 3.
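The word to document mapping described above can be sketched in Python as a dictionary of sets. The splitter, the stopword list, and the four sample documents are assumptions chosen so that the resulting index matches the mapping shown in the text; Zope's real splitter is more involved.

```python
import re

STOPWORDS = {"is", "your", "the"}  # tiny illustrative stopword list


def split_words(document):
    """Lowercase the text, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z]+", document.lower())
    return [word for word in words if word not in STOPWORDS]


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = {}
    for document_id, text in documents.items():
        for word in split_words(text):
            index.setdefault(word, set()).add(document_id)
    return index


# Hypothetical documents, keyed by integer document id.
documents = {
    1: "Bob is your Uncle.",
    2: "Bob is your Zope uncle.",
    3: "Bob the Zope",
    4: "Zope",
}
index = build_index(documents)
print(split_words("Bob is your Uncle."))
print(sorted(index["bob"]))
```

Looking a word up is then a single dictionary access, which is why searching stays fast even when indexing is slow.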
Further, the text index query language supports boolean expressions such as::

  Query                   Result
  "bob AND uncle"         1,2
  "bob OR zope"           1,2,3,4
  "zope AND NOT uncle"    3,4

Expressions can also be nested in arbitrary levels of parentheses::

  "(bob AND uncle) OR zope"    1,2,3,4

Language Specifics: Vocabulary Objects
In the section on text indexes, we said that text indexes were a mapping from words to the documents containing those words. This is true in a sense, but ZCatalog goes to great lengths to make searching and storing this information efficient. For this reason, the ZCatalog actually maps integer word ids to integer document ids.
Because the word ids are integers, they have no language-specific meaning. Another object, called a Vocabulary object, is used to turn the meaningless integer into a meaningful 'word'. Vocabulary objects map word ids to words. When you search for the term 'bob', it is first looked up in the vocabulary to find out what integer word id it maps to. Then the index is queried with that integer. Whatever integer document ids contain that search term are returned for the query.
Vocabulary objects are also in charge of determining how documents are split up into words. Currently, Zope's Vocabulary objects only support splitting English and some European languages. The 'Splitter' takes a document and turns it into a list of values to be indexed or used as search terms.
Vocabulary objects let you:
o Manage Synonyms
  Because words map to integers, many words can map to the same integer. This allows you to assign synonyms to words. For example, a search for 'car' would also return documents containing 'automobile'. This is done by mapping 'automobile' and 'car' to the same integer.
o Manage Stopwords
  Stopwords are common words that are removed from the document before indexing. In English, such common words as 'is' and 'the' are not useful in most general indexing cases. Vocabulary objects let you manage which words are removed from a document.
o Control Splitting
  Vocabulary objects let you control which character values you consider part of a word and which character values you do not. This lets you control how special or non-English characters get treated by the splitter. For example, the euro-english splitter may consider your special accent character to be a space, and split important words into nonsense. By customising the splitter, you can control how words are parsed from your document.
o Support wildcard searching
  Globbing Vocabulary objects support the notion of searching for words by specifying a pattern. This allows searches of the form::

    "bob*"
    "*bob"
    "un?le"
    "un*e"
    "?n*e"

  The wildcard characters '?' and '*' can be customized to suit your needs from the Vocabulary object management screen:
  screenshot
Stopwords and Synonyms
A 'stopword' is a word that is removed from a document before it is indexed. In English, it is useful to remove certain very common words such as 'is' and 'the'. A 'synonym' is a word that has a similar meaning to another word. For example, the word 'car' is a synonym of the word 'automobile'.
Both stopwords and synonyms are very language specific. A set of stopwords for the English language, for example, would have no beneficial effect when removed from a document written in Dutch. Similarly, synonyms depend on meaning, and meaning is integrally tied to language. 'car' and 'automobile' may not even be words in other languages, much less synonyms.
Because Vocabulary objects encapsulate language-specific information for a catalog, they are the obvious place to manage your stopwords and synonyms. Vocabulary objects provide a view to manage stopwords and synonyms. This screen looks like this:
screenshot
Here, you see a text area that contains a sequence of lines::

  the
  is
  he
  she
  automobile:car
  meal:food
  it
  your

Each line contains either a stopword or a synonym. If the line contains a colon, then the string to the right of the colon is the word, and the string to the left is the synonym that has the same meaning as that word. It is possible to use multiple synonyms together to create lists of related words::

  automobile:car
  vehicle:car
  transport:car

After editing the text area to suit your needs, click 'Submit'. If you have made an error in the text area, you will get an error message to that effect.
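The line format described above is simple enough to parse in a few lines of Python. The sketch below is only an illustration of the format: the function names 'parse_vocabulary_lines' and 'normalize' are hypothetical, and the Vocabulary object's own parsing and error handling are not shown.

```python
def parse_vocabulary_lines(text):
    """Return (stopwords, synonyms) parsed from one entry per line."""
    stopwords = set()
    synonyms = {}  # synonym -> the word it stands for
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if ":" in line:
            # Left of the colon is the synonym, right of it is the word.
            synonym, word = line.split(":", 1)
            synonyms[synonym.strip()] = word.strip()
        else:
            stopwords.add(line)
    return stopwords, synonyms


def normalize(words, stopwords, synonyms):
    """Drop stopwords and replace each synonym with its word."""
    return [synonyms.get(w, w) for w in words if w not in stopwords]


text = """the
is
he
she
automobile:car
meal:food
it
your"""

stopwords, synonyms = parse_vocabulary_lines(text)
print(sorted(stopwords))
print(synonyms)
print(normalize(["the", "automobile", "is", "fast"], stopwords, synonyms))
```

Applying the same normalization to both documents and search terms is what makes a search for 'car' also find documents containing 'automobile'.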
Appendix A: Setting your locale