Table of Contents

Module: TextIndex Zope-2.2.1-src/lib/python/SearchIndex/TextIndex.py

Text Index

Notes on a new text index design

The current inverted index algorithm works well enough for our needs. Speed of the algorithm does not seem to be a problem; data management, however, is a significant problem. In particular:

  • Process size grows unacceptably during mass indexing.

  • Data load and store seems to take too long. For example, clearing an inverted index and committing takes a significant amount of time.

  • The current trie data structure contributes significantly to the number of objects in the system.

  • Removal/update of documents is especially problematic. We have to either:

    • Unindex old version of an object before updating it. This is a real hassle for apps like sws.

    • Scan through the entire index looking for object references. This is totally impractical.

Some observations of competition:

  • Xerox system can index "5-million word document in 256k". What does this mean?

    • Does the system save word positions as we do?

    • What is the index indexing?

    • What was the vocabulary of the system?

    Let's see. Assume a 10,000 word vocabulary. Then 256k divided by 10,000 words gives about 25 bytes per entry. Hm.....

  • Verity has some sense of indexing in phases and packing index. Verity keeps the index in multiple chunks and a search may operate on multiple chunks. This means that we can add data without updating large records.

    This may be especially handy for mass updates, like we do in cv3. In a sense we do this in cv3 and sws. We index a large batch of documents to a temporary index and then merge changes in.

    If "temporary" index was integral to system, then maybe merger could be done as a background task....
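The chunking idea above can be sketched roughly as follows. This is a minimal illustration in modern Python, not the Zope implementation: the class `ChunkedIndex` and its method names are hypothetical, and plain dictionaries of sets stand in for persistent index structures.

```python
from collections import defaultdict


class ChunkedIndex:
    """Hypothetical index split into a main chunk and a temporary chunk."""

    def __init__(self):
        self.main = defaultdict(set)  # word -> set of document ids
        self.temp = defaultdict(set)  # receives all newly indexed data

    def index(self, doc_id, words):
        # New documents only touch the small temporary chunk,
        # so large records in the main chunk are not rewritten.
        for word in words:
            self.temp[word].add(doc_id)

    def search(self, word):
        # A search operates on every chunk and unions the results.
        return self.main.get(word, set()) | self.temp.get(word, set())

    def merge(self):
        # Fold the temporary chunk into the main chunk. In a real system
        # this step is the candidate for a background task.
        for word, ids in self.temp.items():
            self.main[word] |= ids
        self.temp.clear()


idx = ChunkedIndex()
idx.index(1, ["zope", "index"])
idx.index(2, ["index"])
print(sorted(idx.search("index")))  # found before any merge has run
idx.merge()
print(sorted(idx.search("index")))  # same answer after the merge
```

Because searches consult both chunks, results do not depend on whether the merge has happened yet, which is what would allow the merge to be deferred.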

Tree issues

Tree structures benefit small updates, because an update to an entry does not cause an update of the entire tree. However, each node in the tree introduces overhead.

Trie structure currently introduces an excessive number of nodes. Typically, a node per two or three words. Trie has potential to reduce storage because key storage is shared between words.
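To make the trade-off concrete, here is a small sketch of a character trie built from nested dictionaries. It is an assumption-laden illustration in modern Python, not the trie used by the current index: one node per character is assumed, and `build_trie` and `count_nodes` are hypothetical helper names.

```python
def build_trie(words):
    """Build a character trie as nested dicts; the key '$' marks a word end."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def count_nodes(node):
    """Count the dict nodes in the trie, including the root."""
    return 1 + sum(
        count_nodes(child) for key, child in node.items() if key != "$"
    )


words = ["index", "indexed", "indexes", "indexing"]
trie = build_trie(words)
print("characters stored without sharing:", sum(len(w) for w in words))
print("trie nodes with shared prefixes:  ", count_nodes(trie))
```

The shared prefix is stored once, which is the storage benefit, but every node is still a separate object, which is where the per-node overhead comes from.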

Maybe an alternative to a Trie is some sort of nested BTree. Or maybe a Trie with some kind of binary-search-based indexing.

Suppose that:

  • database objects were at leaves of tree

  • vocabulary was finite

  • we don't remove a leaf when it becomes empty

Then:

  • After some point, tree objects no longer change

If this is the case, then it doesn't make sense to optimize the tree for change.

Additional notes

Stemming reduces the number of words substantially.

Proposal

new TextIndex

TextIndex

word -> textSearchResult

Implemented with:

InvertedIndex

word -> idSet

ResultIndex

id -> docData

where:

word

is a token, typically a word, but could be a name or a number

textSearchResult

id -> (score, positions)

id

integer, say 4-byte.

positions

sequence of integers.

score

numeric measure of relevance, f(numberOfWords, positions)

numberOfWords

number of words in source document.

idSet

set of ids

docData

numberOfWords, word->positions
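The mappings defined above might fit together as in the following sketch. It is written in modern Python with plain dictionaries rather than persistent objects; the class name `TextIndexSketch`, the whitespace tokenizer, and the scoring formula are all assumptions made for illustration, since the notes only say that score is some f(numberOfWords, positions).

```python
from collections import defaultdict


class TextIndexSketch:
    """Hypothetical in-memory model of the proposed TextIndex."""

    def __init__(self):
        self.inverted = defaultdict(set)  # InvertedIndex: word -> idSet
        self.results = {}                 # ResultIndex: id -> docData

    def index(self, doc_id, text):
        self.unindex(doc_id)              # re-indexing replaces old data
        words = text.lower().split()      # assumed tokenizer
        positions = defaultdict(list)
        for pos, word in enumerate(words):
            positions[word].append(pos)
            self.inverted[word].add(doc_id)
        # docData: (numberOfWords, word -> positions)
        self.results[doc_id] = (len(words), dict(positions))

    def unindex(self, doc_id):
        # docData lists the document's words, so removal does not
        # require scanning the whole inverted index.
        doc = self.results.pop(doc_id, None)
        if doc is None:
            return
        for word in doc[1]:
            self.inverted[word].discard(doc_id)

    def search(self, word):
        """Return a textSearchResult: id -> (score, positions)."""
        result = {}
        for doc_id in self.inverted.get(word, ()):
            number_of_words, positions = self.results[doc_id]
            hits = positions[word]
            score = len(hits) / number_of_words  # placeholder for f(...)
            result[doc_id] = (score, hits)
        return result


ti = TextIndexSketch()
ti.index(1, "the quick brown fox")
ti.index(2, "quick quick slow")
print(ti.search("quick"))
```

Keeping word positions in the ResultIndex is what lets an update find and drop a document's old entries directly, which addresses the removal/update problem listed at the top of these notes.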

Note that ids and positions are ints. We will build C extensions for efficiently storing and pickling structures with lots of ints. This should significantly improve space overhead and storage/retrieval times, as well as storage space.
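As a rough illustration of the idea, the standard library's `array` module (used here as a stand-in, not the C extensions the notes propose) stores ints as contiguous raw C values instead of one Python object per int, and such arrays can be pickled directly. The sample data below is hypothetical.

```python
import pickle
import sys
from array import array

positions = list(range(0, 50000, 5))  # hypothetical word positions
packed = array("i", positions)        # signed C ints stored contiguously

# Container sizes only; the list additionally references one separate
# int object per entry, while the array holds the values inline.
print("list container bytes: ", sys.getsizeof(positions))
print("array container bytes:", sys.getsizeof(packed))

# Arrays pickle directly, so they can be stored like any other value.
restored = pickle.loads(pickle.dumps(packed))
print("round trip ok:", restored == packed)
```

The exact byte counts depend on the platform and interpreter, so they are printed rather than stated here.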

Imported modules   
import BTree
from Globals import Persistent
import IIBTree
from Lexicon import Lexicon, query, stop_word_dict
from ResultList import ResultList
from Splitter import Splitter
from intSet import intSet
import operator
import regex
from string import strip
import ts_regex
Classes   
TextIndex


This document was automatically generated on Mon Sep 4 07:33:06 2000 by HappyDoc version r0_6