README
FieldedTextIndex: A Zope plug-in index for ZCatalog
FieldedTextIndex is a derivative of ZCTextIndex, the built-in full-text indexer for Zope. As such, it has many of the same features such as relevance ranking, boolean queries, wildcards (globbing) and phrase matching.
Note: Indexes made with version 0.1 of FieldedTextIndex cannot be used with version 0.2. You must recreate your indexes after upgrading to 0.2.
What problems does it solve?
In Zope sites it is common to have many different types of content objects whose data is stored in various attributes (or fields) in the object. Schema-driven content types are also becoming more common, making it easier to create many content types with different data fields in a single site.
It is also common for Zope sites to offer a full-text search for their content. This is often achieved by creating a method such as SearchableText which aggregates the content fields into a single source which can be fed to a text index. Although this works well for a simple search across all textual fields of the objects in the site, you cannot narrow the search to specific fields; you can only search all fields using the SearchableText index.
The obvious solution to this is to create new text indexes for each field you want to search individually. This creates three big issues:
- Every text index adds considerable indexing overhead, which naturally limits the number you can have.
- You must decide which fields are worth searching individually when designing your application; changing this later is a software change and requires reindexing.
- Due to limitations in the ZCatalog query API, it is difficult to perform searches across multiple text indexes, such as "Casey Duncan" in [first_name, last_name].
FieldedTextIndex solves these problems by extending the standard ZCTextIndex so that it can receive and index the textual data of an object's field attributes as a mapping of field names to field text. The index itself performs the aggregation of the fielded data and allows queries to be performed across all fields (like a standard text index) or any subset of the fields which have been encountered in the objects indexed.
Additionally, FieldedTextIndex can weight individual fields so that search terms found in those fields affect the result score differently. This allows you to make certain fields influence the relevance of results more or less than others.
Creating a FieldedTextIndex
FieldedTextIndexes require three pieces of information to construct:
- An index id, which must be unique for the ZCatalog
- A source name, which is the name of an attribute, method or script which returns the index source mapping from the content objects.
- The id of the ZCTextIndex Lexicon which processes and stores the words that are indexed. This lexicon can be shared amongst several indexes (both FieldedTextIndexes and ZCTextIndexes) if desired.
Creating an index source
The source name of the index specifies the name of an attribute, method or script which returns a mapping of field name to field text. This mapping can be a dictionary or any dictionary-like object which supports the items() method. It can also be an iterable sequence of pairs, such as a list of two-tuples ("field name", "field text"). The order of the sequence is not important; however, each field name should occur only once.
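To make the accepted shapes concrete, here is a minimal sketch of how a source value could be normalized into a plain dictionary. The helper name normalize_source is hypothetical and is not part of FieldedTextIndex; the index's own handling may differ, for instance in how it treats a repeated field name.

```python
def normalize_source(source):
    """Return a dict of field name -> field text.

    Hypothetical helper for illustration only; not part of FieldedTextIndex.
    Accepts a dictionary-like object with items() or an iterable of
    (field name, field text) pairs.
    """
    if hasattr(source, "items"):
        pairs = source.items()
    else:
        pairs = source
    result = {}
    for name, text in pairs:
        if name in result:
            # Each field name should occur only once in the source.
            raise ValueError("duplicate field name: %r" % (name,))
        result[name] = text
    return result


# Both shapes describe the same source; the order of pairs does not matter.
as_dict = normalize_source({"Title": "Zope", "Creator": "Casey"})
as_pairs = normalize_source([("Creator", "Casey"), ("Title", "Zope")])
```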
An easy way to add an index source to an existing application or framework (such as the CMF) is to create a Python script with the same id as the source name of the index. When an object is indexed, it will be bound to the context variable in the Python script. The script can use the context to create a dictionary mapping the names to the values of each field. Here is a simple example for the default Dublin Core fields that basic CMF content objects implement:
    ## Script (Python) "dc_fields"
    ##title=Source for FieldedTextIndex for CMF DublinCore objects
    # Map each Dublin Core field name to the value of its accessor method
    source = {}
    for field in ("Title", "Creator", "Subject", "Description",
                  "Publisher", "Contributors", "Type"):
        source[field] = getattr(context, field)()
    return source
FieldedTextIndex is designed to work with whatever schema system you may be using. By creating a simple script that collects the desired fields and returns the requisite mapping, you can index those fields using the index.
The above script doesn't really take advantage of the full capabilities of the index, however, since every object has the same indexed fields. The real power of the index is in its ability to index an unlimited number of different fields of different objects which have arbitrary schemas. A more advanced script might introspect the schema to determine which fields should be indexed for an object, or allow the fields to be specified by the object directly. As new objects with different fields are encountered, these fields will automatically be added and become searchable. No changes to the catalog configuration are necessary.
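As a rough illustration of a source that lets each object specify its own fields, the sketch below is written as a plain Python function so it can be run outside Zope. The indexed_field_names attribute and the Recipe class are hypothetical names invented for this example; a real script would read whatever your schema system provides.

```python
def build_source(obj):
    """Collect field text from an object that lists its own fields.

    Assumption for illustration: the object exposes a hypothetical
    `indexed_field_names` attribute naming the fields to index.
    """
    source = {}
    for name in getattr(obj, "indexed_field_names", ()):
        value = getattr(obj, name, None)
        if callable(value):
            value = value()  # accessor method, as in the CMF example
        if value is None:
            continue  # skip fields the object does not provide
        if isinstance(value, (list, tuple)):
            # Flatten multi-valued fields (e.g. keywords) into one string.
            value = " ".join(str(item) for item in value)
        source[name] = str(value)
    return source


class Recipe:
    """Stand-in content object with its own schema (hypothetical)."""
    indexed_field_names = ("Title", "ingredients", "missing")
    ingredients = ["flour", "water"]

    def Title(self):
        return "Plain bread"


recipe_source = build_source(Recipe())
```

Inside a Python script, the same loop would operate on context instead of obj.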
Querying the index
To perform a search across all indexed fields, you can simply call the catalog, passing the search string as the value for a keyword argument which matches the source name of the index. For example, to search all the fields of the index whose source name is dc_fields you can use:
    result = catalog(dc_fields="Some search string")
This makes it possible to use a FieldedTextIndex as a drop-in replacement for a ZCTextIndex. The query above returns the same results for both indexes (assuming they index the same data of course).
To perform a search limited to specific fields, use a dictionary as the argument value instead of a string. The dictionary should contain two keys, query and fields. The query key holds the search string and the fields key holds a list of the field names to be searched:
    result = catalog(dc_fields={"query": "Some search string",
                                "fields": ["Title", "Description"]})
This would return only objects where the query terms occurred in the fields Title or Description.
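If you build such queries in several places, a small helper can keep the two forms consistent. This is only a sketch; fielded_query is a hypothetical name and not something the index provides.

```python
def fielded_query(text, fields=None):
    """Build the query value for a FieldedTextIndex (hypothetical helper).

    Returns the plain string when no fields are given, so the index
    searches all fields; otherwise returns the dictionary form.
    """
    if fields is None:
        return text
    # The fields value must be a list of field names.
    return {"query": text, "fields": list(fields)}


all_fields = fielded_query("Some search string")
some_fields = fielded_query("Some search string", ("Title", "Description"))

# With a ZCatalog bound to the name `catalog`, either value is passed
# as the keyword argument named after the index source:
#     result = catalog(dc_fields=some_fields)
```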
Specifying field weights in queries (New in 0.2)
It is also possible to weight individual fields differently in a query so that hits on certain fields affect the relevance score more than others. In practical terms, this allows you to make search hits on particular fields push the corresponding objects higher in search results.
The field_weights key in the query dictionary is used to specify the weights to apply to each field. The value of field_weights is a dictionary with field names as keys and their integer weights as values. The relevance scores of the intermediate query results for each field are multiplied by the weight before being combined with the results for other fields:
    result = catalog(dc_fields={"query": "Some search string",
                                "field_weights": {"Title": 3, "Subject": 2}})
This would return objects where the query is found in any field. Matches on Title have their score multiplied by 3. Subject matches are multiplied by 2.
You can specify field_weights independently of fields. The value of field_weights does not affect the fields searched. If fields is not specified, then all fields are searched regardless of the value of field_weights. Fields not assigned a weight by field_weights are assigned a weight of one by default. If you specify weights for fields that do not appear in the fields list or are not the names of fields known to the index, they are ignored:
    result = catalog(dc_fields={"query": "Some search string",
                                "fields": ["Title", "Description"],
                                "field_weights": {"Title": 3, "Subject": 2}})
In this case, Title is searched with a weight of 3 and Description with a weight of 1 (the default). Subject is not searched since it does not appear in fields.
You can also specify zero or negative weights if desired. Zero weighted fields will be used to filter the results, but will not affect the score. Negatively weighted fields will reduce the score of results where terms occur in them. This can be used as a way to tweak the order of results to common queries. If undesired content is appearing high in the results of a query, a negatively weighted field with anti-keywords matching the query could be used to move the content down.
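The arithmetic of the weighting rules can be sketched for a single object as follows. This is an illustration, not the index's implementation: the function name combine_scores and the scores are made up, and summing the weighted per-field scores is an assumption about how the results are combined.

```python
def combine_scores(field_scores, fields=None, field_weights=None):
    """Combine per-field scores for one object into a single score.

    Illustration only. `field_scores` maps field name -> raw relevance
    score for the fields where the query matched. Summing the weighted
    scores is an assumption; the index may combine them differently.
    Returns None when the object matches in none of the searched fields.
    """
    weights = field_weights or {}
    searched = [name for name in field_scores
                if fields is None or name in fields]
    if not searched:
        return None
    # Fields without an explicit weight default to a weight of one.
    return sum(field_scores[name] * weights.get(name, 1) for name in searched)


hits = {"Title": 10, "Subject": 4, "Description": 5}
weights = {"Title": 3, "Subject": 2}

unrestricted = combine_scores(hits, field_weights=weights)
restricted = combine_scores(hits, ["Title", "Description"], weights)
filtered_only = combine_scores({"Subject": 4}, field_weights={"Subject": 0})
demoted = combine_scores({"Title": 10, "Subject": 4},
                         field_weights={"Subject": -1})
```

With these made-up scores, the unrestricted query scores 43 (10*3 + 4*2 + 5*1) and the restricted one 35, because Subject is not searched. The zero-weighted match still counts as a hit but contributes a score of 0, and the negative weight lowers the last score from 14 to 6.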
Specifying default weights
You can also specify weights to apply by default to all queries that do not specify a value for field_weights. To do this, go to the Indexes tab of the ZCatalog and click on the FieldedTextIndex. Use the Default Field Weights tab to set the defaults for the index. Weights are applied at query time, so you do not need to reindex for the weights to take effect.
Queries that specify their own value for field_weights override any defaults. Queries can pass an empty dictionary for field_weights to reset all field weights to one.
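The override rule can be summarized in a few lines. The function names below are hypothetical and only model the behavior described above; they are not part of the index.

```python
def effective_weights(query, default_weights):
    """Return the weights a query would use (illustrative sketch).

    A query that carries its own field_weights, even an empty
    dictionary, replaces the index defaults entirely.
    """
    if "field_weights" in query:
        return dict(query["field_weights"])
    return dict(default_weights)


def weight_for(field, weights):
    """Fields without an assigned weight get a weight of one."""
    return weights.get(field, 1)


defaults = {"Title": 3}

uses_defaults = effective_weights({"query": "zope"}, defaults)
overridden = effective_weights(
    {"query": "zope", "field_weights": {"Subject": 2}}, defaults)
reset = effective_weights({"query": "zope", "field_weights": {}}, defaults)
```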
Creating a query form
As with other indexes, queries can also be generated directly from the web request. A query string or POST data can provide the query data structure by using Zope's record marshaling. Here is an example which lets you search any combination of Title, Description or Creator:
    <form action="search_results">
      <input name="SearchableFields.query:record" /><br />
      <div tal:repeat="name python:('Title', 'Description', 'Creator')">
        <input type="checkbox" name="SearchableFields.fields:record:list"
               tal:attributes="value name; id name;" />
        <label tal:attributes="for name" tal:content="name">Name</label>
      </div>
      <input type="submit" />
    </form>
Note that fields must always be a list, hence the :list at the end of the checkbox names. The search_results template can use a standard ZCatalog query, which simply calls the ZCatalog, passing it the web request, and formats the result set as desired.
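To show the data structure the form above is meant to produce, here is a simplified stand-in for the marshaling step. It covers only the :record and :list suffixes used in the form and returns plain dictionaries; Zope's own request machinery handles many more cases and produces record objects, so treat marshal_records as a hypothetical model rather than Zope's implementation.

```python
def marshal_records(pairs):
    """Group submitted (name, value) pairs the way the form above expects.

    Simplified stand-in for Zope's request marshaling, covering only the
    ":record" and ":list" suffixes used in the form.
    """
    form = {}
    for name, value in pairs:
        key, *flags = name.split(":")
        if "record" not in flags:
            form[key] = value
            continue
        # "SearchableFields.query" -> record "SearchableFields", key "query"
        record_name, attr = key.split(".", 1)
        record = form.setdefault(record_name, {})
        if "list" in flags:
            record.setdefault(attr, []).append(value)
        else:
            record[attr] = value
    return form


# What a browser would submit with two checkboxes ticked.
submitted = [
    ("SearchableFields.query:record", "zope"),
    ("SearchableFields.fields:record:list", "Title"),
    ("SearchableFields.fields:record:list", "Creator"),
]
form = marshal_records(submitted)
```

The resulting SearchableFields entry has the same query and fields keys as the dictionary queries shown earlier. The :list suffix ensures that a single ticked checkbox still yields a list.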
You can also determine the names of the fields that the index has encountered by using ZCatalog's uniqueValuesFor() method. Here is a variation of the form which creates a multi-select box populated with all of the searchable fields:
    <form action="search_results"
          tal:define="fields python:here.portal_catalog.uniqueValuesFor('SearchableFields')">
      <input name="SearchableFields.query:record" /><br />
      <select name="SearchableFields.fields:record:list" multiple="multiple">
        <option tal:repeat="name fields" tal:attributes="value name"
                tal:content="name">Name</option>
      </select><br />
      <input type="submit" />
    </form>
Conclusion
I hope you find this software useful. If you have a question, comment or feature request, or find a bug, please contact me at [email protected].
Copyright (c) 2003, Casey Duncan and Zope Corporation