CJKSplitter - Chinese, Japanese, Korean word splitter for ZCTextIndex
CJKSplitter is a ZCTextIndex splitter for CJK (Chinese, Japanese, Korean) text
stored as Unicode. It uses a simple but workable "hack" instead of trying
to do real word splitting from dictionaries. Compared to a dictionary-based
word splitter, this results in a bigger index and more matches than necessary,
but that is a small price to pay for the reduced complexity.
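The description above does not spell out the "hack". One common dictionary-free technique with the same trade-offs (larger index, extra matches) is to index every overlapping pair of adjacent CJK characters. The following is a minimal sketch of that idea, not the product's actual code; the names char_class and bigram_split are hypothetical, and the check against a single code-point range is a deliberate simplification.

```python
from itertools import groupby


def char_class(ch):
    """Classify one character as 'cjk', 'word', or 'sep'.

    Assumption: only the CJK Unified Ideographs block (U+4E00..U+9FFF)
    counts as CJK here; a real splitter would cover more ranges.
    """
    if "\u4e00" <= ch <= "\u9fff":
        return "cjk"
    if ch.isalnum():
        return "word"
    return "sep"


def bigram_split(text):
    """Split text into index terms without a dictionary.

    Runs of non-CJK letters and digits are kept as whole words.
    Runs of CJK characters become overlapping two-character terms.
    """
    tokens = []
    for kind, group in groupby(text, key=char_class):
        run = "".join(group)
        if kind == "word":
            tokens.append(run)
        elif kind == "cjk":
            if len(run) == 1:
                # A lone CJK character has no neighbour to pair with.
                tokens.append(run)
            else:
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        # Separators (whitespace, punctuation) are dropped.
    return tokens


print(bigram_split("中文分词 test"))
```

With this scheme, a four-character run yields three terms, and some of those pairs span a word boundary. That is one way an index could end up both bigger and more permissive than a dictionary-based one. A single character that occurred inside a longer run is never indexed on its own, which would be consistent with the single-character limitation listed under Known Problems.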
Changes Summary
- Version 0.2
[email protected]
Improves on the previous version in a number of ways:
uses Unicode internally (not UTF-8), replaces the
configuration file with lookups in the
unicodedata module to identify CJK
characters and symbols, adds unit tests, and
adds detailed English installation
instructions.
- Version 0.1
[email protected]
Original version.
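The Version 0.2 entry mentions using the unicodedata module to look up CJK characters and symbols. As a rough illustration of what such a lookup could look like, the sketch below classifies a character by the prefix of its Unicode name. The prefix list and the function name is_cjk are assumptions made for this example, not taken from the product's source.

```python
import unicodedata

# Assumption: these Unicode name prefixes are a reasonable proxy for
# "CJK character or symbol". The list is illustrative, not exhaustive.
CJK_NAME_PREFIXES = ("CJK ", "HIRAGANA", "KATAKANA", "HANGUL", "IDEOGRAPHIC")


def is_cjk(ch):
    """Return True if the character's Unicode name marks it as CJK."""
    # unicodedata.name() raises ValueError for unnamed characters unless
    # a default is supplied, so pass an empty string as the fallback.
    name = unicodedata.name(ch, "")
    return name.startswith(CJK_NAME_PREFIXES)


for ch in "中あ한。a":
    print(repr(ch), unicodedata.name(ch, "?"), is_cjk(ch))
```

A name-based lookup like this needs no separate configuration file, since the character data ships with Python. It can still miss characters whose names do not begin with one of the listed prefixes (halfwidth forms, for example), which fits the known problem that CJK identification could be better.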
Known Problems
- Text must (well, should) be stored as Unicode.
- Cannot search single characters.
- Could do a better job at identifying CJK characters.
- May match more than is strictly necessary due to the algorithm used.
(See the source code for details.)
Please join the zopeasia project on SourceForge to participate in the
development.
Latest Release: 0.2
Last Updated: 2003-03-09 21:15:21
Author: ZopeOrgSite
Categories: Internationalization, SoftwareProduct, ZCatalog, catalog, i18n
Maturity: Stable