The ZCatalog Users Guide

  First Draft: Feb 1st, 2000

  Author: Michel Pelletier (michel@digicool.com)

  ZCatalog is a component of Zope. This Guide assumes that you have basic Zope skills. If you encounter material in this Guide that you do not understand, you may find an explanation in one of the other Zope Guides: The Content Manager's Guide, the DTML Programmer's Guide, or The Zope Administrator's Guide. If you continue to be confused, send feedback to the author, Michel Pelletier (michel@digicool.com).

  What ZCatalog does

    ZCatalog is just like the card catalog in a library. Think of all your Zope objects as books in a library. If you wanted to find all of the books by the author 'Aldous Huxley', you would walk up to the card catalog and look up 'Aldous Huxley' in the author index. This would give you the location of all of the books by that author. The ZCatalog works exactly like that. You can walk up to the ZCatalog (in DTML) and ask it for all of the objects whose 'author' property is 'Aldous Huxley'.

    Like a real library catalog, the ZCatalog must be built before it is searched. This can be done either by brute force, where the ZCatalog catalogs everything it can find, or by more selective means. For maximum cataloging flexibility, objects can also be taught how to index (and unindex) themselves.

    Zope allows you to build your web application in a very flexible way by letting you organize your objects into simple, clear structures. Zope gives you the ability to create and destroy objects programmatically at any time. This means that your application could very well scale to thousands of objects or more. Consider an employee database that creates one new employee object for each employee.
    If you were a small company, then a very simple DTML loop could look for the employee with the name 'Bob'::

      Found Bob

    However, if your company had thousands of employees, then this loop would take a long time to find just one object. Alternatively, you could create a ZCatalog to keep track of your employees. The catalog allows you to create indexes for various properties, for example the 'last_name' of your Employee objects, and to search those indexes very quickly. The above loop, which could take many minutes to execute if you had thousands of employees, is answered in milliseconds by ZCatalog with this query::

      Hi Bob!

    Here we walk up to the catalog and pass it a Python data structure called a dictionary. This dictionary maps the name 'last_name' to the value 'Bob'. This is how you tell the ZCatalog which index you want to query and what value you are looking for in that index. You can pass multiple parameters to ZCatalog::

      New Bobs!

    The ZCatalog takes a very object-oriented view of cataloging objects. ZCatalogs are very flexible and can often be a little confusing. The ZCatalog management interface controls how your catalog behaves.

  What ZCatalog Returns

    A query to ZCatalog returns a sequence of record objects. Those record objects correspond to objects that are cataloged in the catalog. Record objects are NOT the objects that they refer to; they are just handy little objects that work like the index cards in a card catalog. They store only meta-information about the object, such as the time it was created and its title (what the record object actually stores is described in the section 'The Meta-data Table View'). In addition to whatever meta-data the record object has, each record object has the following attributes:

    o data_record_id_

      'data_record_id_' is the document id that this record object refers to. The Catalog can be queried for the path to this object with 'ZCatalog.getpath(data_record_id_)'.

    o data_record_score_

      'data_record_score_' is the score with which this result record matched the query. For text indexes, this is the number of occurrences of a search term in the document. For field indexes, this is 1 if the record matched the query. For keyword indexes, the behavior is currently the same as for field indexes, but it *should* actually return the number of keywords within the sequence that matched the query.

    o data_record_normalized_score_

      'data_record_normalized_score_' is the score of the record, normalized against the rest of the result set to lie between 0 and 1. For text indexes, this is 1 if the record matches the query term.

  The Management Interface

    In the Zope management interface, you can create a ZCatalog anywhere in Zope. In order to play with ZCatalog, you will need some objects to actually index. A fresh Zope install comes with a folder called 'QuickStart' that contains some introductory material about Zope. This is a nice small document set with which to try out the ZCatalog.

    In your QuickStart folder, select 'ZCatalog' from the Add menu and add a ZCatalog. You will see this screen:

      snapshot

    Here you have a few options, which we will quickly examine. Like all Zope objects, your new ZCatalog must be given an id. For the purposes of this document, we'll call our catalog 'Catalog'.

    Also like many other Zope objects, your new ZCatalog can be given a title. This is optional, and you may put anything you want here. It is often useful, however, to title the Catalog in a way that meaningfully represents its contents, for example 'Employees' if the Catalog mainly catalogs Employee objects.

    After the title there is a checkbox that lets you choose whether or not you want to select a Vocabulary. Vocabulary objects will be explained a little later. For now, leave this box unchecked. Note that this screen tells you that if you leave the box unchecked, as we are going to, a Vocabulary object will be created for you.

    After you have filled in this screen properly, click 'Add ZCatalog'.
    You will be taken back to the QuickStart folder in the management interface. Notice that there is now a new object in that folder called 'Catalog'.

      snapshot

    To enter the catalog, click on its name or icon.

  The Contents View

    When you click on a ZCatalog you are taken to its Contents view. From this view, a ZCatalog looks and acts much like a Folder. In fact, the ZCatalog is built partially from the same code that Folders come from. Here you have the familiar Zope Add menu, and you can add any kind of Zope object you normally have access to. You can even create a ZCatalog within your ZCatalog.

    If you have been following along so far, you will see that your new Catalog contains only one object, named 'Vocabulary'. This is the mysterious Vocabulary object, which will be described later.

    Note that the objects shown here are NOT the objects that are cataloged in this catalog.

  The Find Objects to Catalog View

    So that we can start having some fun right away, we're going to skip one view over to the right and jump straight to the 'Find Objects to Catalog' view. This long-winded view presents you with the following screen:

      snapshot

    This screen lets you specify what kind of objects you want the Catalog to find. It is identical to the standard Zope 'Find' interface. Like the Find interface, this screen makes ZCatalog traverse your Zope objects. In the case of the ZCatalog, however, the Catalog will index each object it finds that matches the criteria you specify here.

    If you are in the QuickStart Catalog, just click the 'Find' button to do a wide open search. This will cause your Catalog to index all of the contents of the QuickStart folder. If you are running on a slower machine, this may take up to a few minutes. You will then be returned to the view we skipped over, the 'Cataloged Objects' view, which is discussed next.

    Note that using the Find interface to catalog objects is very inefficient and should only be used if you know that you will not be traversing lots of objects. The more objects you traverse, the longer the cataloging operation takes.

    While the ZCatalog is quite capable of searching through thousands and thousands of objects very quickly, actually indexing those objects is a much slower operation. If you attempt to index too many objects too quickly, your in-memory indexes soon get very large and Zope starts aggressively swapping objects between memory and the database. Of course, if your Zope process grows larger than your available memory, your operating system will soon start swapping bits of Zope out on its own. This can cause indexing to slow to a crawl.

    Thus, the most efficient way to index content in Zope is to have that content index and unindex itself when it is created and destroyed. Since the time it takes to index a single object is negligible, this keeps your machine running quickly even on high-usage sites.

    Of course, the same problem will occur if you try to create too many of these smart objects too fast. A scenario like this is unlikely, however, in an editorial or publication-based system. For more write-intensive operations, larger-scale solutions should be considered.

  The Cataloged Objects View

    The Cataloged Objects view shows you all of the objects in the catalog one screenful at a time. It is important to know that these catalog listings are NOT the Zope objects they refer to. They are just references to objects in your Zope system. If you delete an object that is cataloged in a catalog, the catalog will contain a reference that is no longer valid. Unless you uncatalog the object before you delete it, you may get search results that point to objects that no longer exist.

    At the top of the screen are two buttons that allow you to either update or clear the catalog.

    Update will go through each object reference in the catalog and try to update the index information for that object. For example, if an employee's last name changes, the ZCatalog can be updated to reflect that change. If the object no longer exists, updating it will remove it from the Catalog. Note that updating the entire catalog could take a very long time if you have many objects.

    The Clear Catalog button does just that: it clears all of the indexes and object references out of the Catalog. It does NOT delete the objects they refer to.

    In addition to updating or clearing the entire catalog, you can choose individual objects to update or remove by selecting the checkbox just to the left of the link to that object:

      screen shot

    and clicking the 'Update' or 'Remove' button at either the top or bottom of the listing.

  The Indexes View

    In order for ZCatalog to keep track of information about objects for you, you must tell it what kind of information you are interested in. You do this by creating indexes.

    When you create an index, you give it a name and specify a type. The name of the index is the property, attribute, or method you want the Catalog to use when indexing an object. For example, if all of your Employee objects had the attribute 'last_name', then you would create a 'last_name' index to index that value for every cataloged object.

  The Meta-data Table View

    The meta-data table allows the catalog to maintain a table of information about cataloged objects. For each object in the catalog, the meta-data table stores a sequence of values, one for each column in the table. This is useful, for example, if you want your search reports to include additional information about your objects, such as their ids, titles, and URLs, without having to wake the actual object up to get the information.

    The meta-data table directly affects the shape of the result objects that come from Catalog queries.

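
The difference between indexes (what you can search on) and the meta-data table (what a result record carries) can be sketched in plain Python. This is a toy model only, not Zope code: the names 'ToyCatalog', 'Record' and 'Employee' are hypothetical, and the real ZCatalog is considerably more involved. It assumes exact-match indexes and one meta-data row per cataloged object.

```python
# Toy model only: ToyCatalog, Record and Employee are hypothetical names,
# not part of Zope.

class Record:
    """Lightweight result object shaped by the meta-data columns."""

    def __init__(self, rid, columns, values):
        self.data_record_id_ = rid
        for name, value in zip(columns, values):
            setattr(self, name, value)


class ToyCatalog:
    def __init__(self, index_names, metadata_columns):
        # One mapping per index: attribute value -> set of record ids.
        self.indexes = {name: {} for name in index_names}
        self.columns = list(metadata_columns)
        self.metadata = {}  # record id -> tuple of column values
        self.next_rid = 0

    def catalog(self, obj):
        rid = self.next_rid
        self.next_rid += 1
        for name, index in self.indexes.items():
            value = getattr(obj, name, None)
            if value is not None:
                index.setdefault(value, set()).add(rid)
        # Copy the column values now, so a search never touches the object.
        self.metadata[rid] = tuple(getattr(obj, c, None) for c in self.columns)
        return rid

    def search(self, query):
        """query maps index names to the exact value wanted."""
        rids = None
        for name, wanted in query.items():
            hits = self.indexes[name].get(wanted, set())
            rids = hits if rids is None else rids & hits
        return [Record(rid, self.columns, self.metadata[rid])
                for rid in sorted(rids or ())]


class Employee:
    def __init__(self, emp_id, title, last_name):
        self.id, self.title, self.last_name = emp_id, title, last_name


catalog = ToyCatalog(index_names=["last_name"], metadata_columns=["id", "title"])
catalog.catalog(Employee("0001", "Engineer", "magill"))
catalog.catalog(Employee("0002", "Manager", "lil"))

results = catalog.search({"last_name": "lil"})
for record in results:
    print(record.data_record_id_, record.id, record.title)  # 1 0002 Manager
```

In this sketch 'last_name' is indexed but is not a meta-data column, so it can be searched on yet does not appear on the result record; 'id' and 'title' are the reverse.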
  The Status View

    The Status view shows information about your Catalog.

  Setting your sub-transaction threshold

    The Status view gives you the option to specify a sub-transaction threshold value. Zope is a transactional system, meaning that all changes made to Zope happen within a transaction, including all changes made to the catalog.

    While the Catalog is indexing lots of information, lots of index objects are being changed. For speed, the changes made in a transaction are kept in memory at all times. This gets a bit dangerous, because the Catalog is more than happy to eat up every bit of memory your computer has. Eventually, Zope will raise a MemoryError and the transaction will be rolled back, after driving your OS's virtual memory pager into fits.

    In order to prevent this, the Catalog commits a sub-transaction every so often, which allows the Zope cache to remove some of the changed data from memory. If an error occurs, the entire transaction, including its sub-transactions, will be rolled back.

    Basically, the threshold is a knob that lets you tune how much memory you allow Zope to consume while the Catalog is cataloging lots of objects *in one transaction*. Setting this number higher makes ZCatalog commit a sub-transaction less frequently, and indexing will consume more memory. Setting this number lower makes ZCatalog commit a sub-transaction more frequently, and indexing will consume less memory.

    Note that sub-transaction data is stored wherever the Python tempfile module wants to put it. If you are indexing lots and lots of data in one transaction, it is possible to fill up the temporary partition on certain systems. Make sure you have ample memory and tempfile space if you plan on indexing gigabytes of data.

  ZCatalog Objects

    ZCatalog objects provide a number of methods for DTML programmers to manipulate and query a catalog. The first three such methods are identical in parameters and operation, and differ only by name.
    This is done for historical as well as convenience reasons.

  ZCatalog.query(query_object)

    The 'query' method accepts one parameter, a query object, and returns a list of result objects.

  ZCatalog.searchResults(query_object)

    The 'searchResults' method is identical to the 'query' method. It is a useful mnemonic in DTML when all of your search parameters are in the DTML Namespace (see the DTML Namespace How-To).

  ZCatalog(query_object)

    Simply calling a catalog object with a query object is identical to calling the 'query' method.

  ZCatalog.get_path(document_id)

    'get_path' takes one integer as an argument and returns the path to the object that corresponds to 'document_id'. This path can be passed to REQUEST.resolve_url to return the actual object the path refers to (this is what 'get_object' does). Note that an object returned this way will be wrapped in the Acquisition context of the ZCatalog.

  ZCatalog.get_object(document_id)

    Returns the object referred to by 'document_id'. This object is wrapped in the Acquisition context of the ZCatalog.

  ZCatalog.uniqueValuesFor(index)

    Returns the unique values for the index 'index'. Note that 'index' must be the name of an existing FieldIndex or KeywordIndex object. Future indexes may support this method. Text Indexes do not, and will raise an error.

  ZCatalog.catalog_object(unique_id, object)

    This method tells the catalog to catalog 'object' with the unique id 'unique_id'. The ZCatalog stores an internal mapping between unique ids and a set of integers. Each unique_id maps to exactly one unique integer::

      /path/to/object/one -> 42
      /path/to/object/two -> 68

    Zope assumes that 'unique_id' is the absolute path to 'object'. If you plan on passing anything other than 'object.absolute_url()' as 'unique_id' to this method, you had better know what you're doing.

  ZCatalog.uncatalog_object(unique_id)

    Removes all references to the object whose unique id is 'unique_id'. This is the inverse operation to 'catalog_object'.

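
The bookkeeping behind cataloging and uncatalogging, one integer per unique id and a reverse lookup from integer to path, might look roughly like the following. This is a hedged sketch, not the ZCatalog implementation: the class 'ToyUidMap' and its methods 'add', 'remove' and 'path_for' are hypothetical names chosen for illustration, and the integers it hands out are simply sequential.

```python
class ToyUidMap:
    """Hypothetical sketch of a unique id <-> integer mapping; not Zope code."""

    def __init__(self):
        self.uids = {}    # unique id (path) -> integer
        self.paths = {}   # integer -> unique id (path)
        self.next_id = 0

    def add(self, unique_id):
        # Cataloging the same unique id again reuses its integer.
        if unique_id in self.uids:
            return self.uids[unique_id]
        rid = self.next_id
        self.next_id += 1
        self.uids[unique_id] = rid
        self.paths[rid] = unique_id
        return rid

    def remove(self, unique_id):
        # Inverse of add: drop both directions of the mapping.
        rid = self.uids.pop(unique_id)
        del self.paths[rid]

    def path_for(self, rid):
        return self.paths[rid]


uid_map = ToyUidMap()
one = uid_map.add("/path/to/object/one")
two = uid_map.add("/path/to/object/two")
again = uid_map.add("/path/to/object/one")   # same integer as before

print(uid_map.path_for(two))                 # /path/to/object/two
uid_map.remove("/path/to/object/one")
```

Because every index refers to objects only through these integers, passing an inconsistent unique id would leave the indexes pointing at the wrong object, which is why the absolute path is the safe choice.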
  ZCatalog.schema()

    Returns a list of meta-data table column names.

  ZCatalog.indexes()

    Returns a list of index names.

  ZCatalog.index_objects()

    Returns a list of actual index objects.

  Field Indexes

    Field Indexes treat the value of a property atomically. This means that the value can be an object such as a string, integer, tuple, list, dictionary or other Python object. When you search a Field Index for a particular value, that value will only match objects whose property value matches the search term exactly.

    For example, consider an employee catalog with three employees cataloged in it. Each employee has properties called 'id' and 'last_name'::

      id      last_name
      0001    magill
      0002    lil
      0003    nancy

    To index these attributes, create two new indexes named 'id' and 'last_name' and make them both Field Indexes.

    Field indexes support range searching if the values support comparison operations. This is done by specifying extra query terms which control the range over which the query is made.

    Field indexes can be queried to return a list of all unique values stored in the index. This is useful, for example, when you know that all the objects in a certain index map to only a small collection of values. In Zope, this is often used to index 'meta_type'. This allows you to write a snippet of DTML which returns a list of all known (cataloged) meta types.

  Keyword Indexes

    Keyword indexes index a sequence of atomic values. Querying the index for a value will match any object which has that value anywhere in its sequence. This makes it easy to implement the popular 'keyword' approach to searchable databases. Keyword indexes allow you to impose a hierarchical structure on top of your cataloged data.

  Text Indexes

    Text Indexes treat the value of an object's attribute as a string of text. This text is parsed into 'words', typically according to certain language specific rules, and the words are then stored in the index.
    This makes searching for the occurrence of a word within a document easy. Text indexes are the most complex of all indexes.

    Consider the following text::

      "Bob is your Uncle."

    For the purpose of text indexing, this is called the 'document'. The Text index transforms this string into a sequence of values in a language specific way. Currently, Zope supports only English and European languages (see Appendix A). The default English specific action is to turn the string into::

      ['bob', 'uncle']

    Note that 'is' and 'your' were removed. This is because they are 'stopwords': very common English words that are removed for the sake of index efficiency. It is not very useful, for example, to search for the word 'is' or 'the', since they will probably match almost 100% of all English documents. A search for those common terms would yield the entire dataset. The same goes for common words in other languages, such as the French 'les'.

    The text index then takes each value extracted from the document string and maintains a list of word to document mappings::

      'bob'   -> 1,2,3
      'uncle' -> 1,2
      'zope'  -> 2,3,4

    This is a classic 'index'. Searching for the word 'bob' will return documents 1, 2 and 3. Further, the text index query language supports boolean expressions such as::

      Query                   Result
      "bob AND uncle"         1,2
      "bob OR zope"           1,2,3,4
      "zope AND NOT uncle"    3,4

    Expressions can also be nested in arbitrary levels of parentheses::

      "(bob AND uncle) OR zope"    1,2,3,4

  Language Specifics: Vocabulary Objects

    In the section on text indexes, we said that text indexes are a mapping from words to the documents containing those words. This is true in a sense, but ZCatalog goes to great lengths to make searching and storing this information efficient. For this reason, the ZCatalog actually maps integer word ids to integer document ids. Because the word ids are integers, they have no language specific meaning.
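
The word-to-document mapping, the stopword removal and the boolean queries described above can be sketched in a few lines of Python, with a plain dictionary standing in for the word-to-integer lookup. This is an illustration under assumptions, not how ZCatalog is implemented: the stopword list, the four sample documents (only the first comes from this Guide) and the function names 'split', 'index_document' and 'docs' are all made up for the example, and boolean operators are expressed as Python set operations rather than parsed from a query string.

```python
import re

STOPWORDS = {"is", "your", "the"}   # assumed, tiny stop list

def split(text):
    """Lower-case the text, keep runs of letters, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

word_ids = {}   # word -> integer word id
index = {}      # integer word id -> set of integer document ids

def index_document(doc_id, text):
    for word in split(text):
        # Assign the next free integer the first time a word is seen.
        wid = word_ids.setdefault(word, len(word_ids))
        index.setdefault(wid, set()).add(doc_id)

def docs(word):
    """Look the word up by its integer id; unknown words match nothing."""
    wid = word_ids.get(word.lower())
    return set(index.get(wid, ()))

# Sample documents (2 to 4 are invented to reproduce the mappings above).
index_document(1, "Bob is your Uncle.")
index_document(2, "Uncle Bob uses Zope.")
index_document(3, "Bob uses Zope.")
index_document(4, "Zope is the server.")

print(split("Bob is your Uncle."))                         # ['bob', 'uncle']
print(sorted(docs("bob") & docs("uncle")))                 # [1, 2]
print(sorted(docs("bob") | docs("zope")))                  # [1, 2, 3, 4]
print(sorted(docs("zope") - docs("uncle")))                # [3, 4]
print(sorted((docs("bob") & docs("uncle")) | docs("zope")))  # [1, 2, 3, 4]
```

AND, OR and AND NOT correspond to set intersection, union and difference, which is why nesting expressions in parentheses comes for free.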
    Another object, called a Vocabulary object, is used to turn the meaningless integer into a meaningful 'word'. Vocabulary objects map word ids to words. When you search for the term 'bob', it is first looked up in the vocabulary to find out which integer word id it maps to. Then the index is queried with that integer. The integer ids of whatever documents contain that search term are returned for the query.

    Vocabulary objects are also in charge of determining how documents are split up into words. Currently, Zope's Vocabulary objects only support splitting English and some European languages. The 'Splitter' takes a document and turns it into a list of values to be indexed or used as search terms.

    Vocabulary objects let you:

    o Manage Synonyms

      Because words map to integers, many words can map to the same integer. This allows you to assign synonyms to words. For example, a search for 'car' would also return documents containing 'automobile'. This is done by mapping 'automobile' and 'car' to the same integer.

    o Manage Stopwords

      Stopwords are common words that are removed from the document before indexing. In English, such common words as 'is' and 'the' are not useful in most general indexing cases. Vocabulary objects let you manage which words are removed from a document.

    o Control Splitting

      Vocabulary objects let you control which character values you consider part of a word and which you do not. This lets you control how special or non-English characters are treated by the splitter. For example, the euro-english splitter may consider your special accented character to be a space, and split important words into nonsense. By customizing the splitter, you can control how words are parsed from your document.

    o Support wildcard searching

      Globbing Vocabulary objects support the notion of searching for words by specifying a pattern. This allows searches of the form::

        "bob*"
        "*bob"
        "un?le"
        "un*e"
        "?n*e"

      The wildcard characters '?'
      and '*' can be changed to any other characters to suit your needs from the Vocabulary object management screen:

        screenshot

  Stopwords and Synonyms

    A 'stopword' is a word that is removed from a document before it is indexed. In English, it is useful to remove certain very common words such as 'is' and 'the'. A 'synonym' is a word that has a similar meaning to another word. For example, the word 'car' is a synonym of the word 'automobile'.

    Both stopwords and synonyms are very language specific. A set of stopwords for the English language, for example, would have no beneficial effect when removed from a document written in Dutch. Similarly, synonyms depend on meaning, and meaning is integrally tied to language. 'car' and 'automobile' may not even be words in other languages, much less synonyms.

    Because Vocabulary objects encapsulate language specific information for a catalog, they are the obvious place to manage your stopwords and synonyms. Vocabulary objects provide a view to manage stopwords and synonyms. This screen looks like this:

      screenshot

    Here, you see a text area that contains a sequence of lines::

      the
      is
      he
      she
      automobile:car
      meal:food
      it
      your

    Each line contains either a stopword or a synonym. If the line contains a colon, then the string to the right of the colon is the word, and the string to the left is a synonym that has the same meaning as that word. It is possible to use multiple synonyms together to create lists of related words::

      automobile:car
      vehicle:car
      transport:car

    After editing the text area to suit your needs, click 'Submit'. If you have made an error in the text, you will get an error message to that effect.

  Appendix A: Setting your locale