Zope.org - ZCTextIndex splitter that works with Chinese, Japanese, and Korean text

CJKSplitter - Chinese, Japanese, Korean word splitter for ZCTextIndex

CJKSplitter is a ZCTextIndex splitter for CJK (Chinese, Japanese, Korean) text stored as Unicode. It uses a simple but workable "hack" instead of attempting real dictionary-based word splitting. Compared to a dictionary-based word splitter, this produces a bigger index and more matches than necessary, but that is a cheap price to pay for the reduced complexity.
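
The description above does not say what the "hack" is. One common dictionary-free technique that fits the stated trade-offs (a bigger index, extra matches, no single-character search) is to index overlapping two-character pairs (bigrams). The sketch below illustrates that general technique only. It is an assumption, not CJKSplitter's actual code: the function name bigram_split and the hard-coded character ranges are hypothetical.

```python
import re

# Assumption: a rough set of CJK ranges (Hiragana, Katakana, CJK Unified
# Ideographs, Hangul Syllables). Not taken from the CJKSplitter source.
_CJK = u"\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7a3"
_CJK_RUN = re.compile(u"[" + _CJK + u"]+")
_TOKEN = re.compile(u"[" + _CJK + u"]+|[A-Za-z0-9]+")


def bigram_split(text):
    """Split text into index terms without a dictionary (hypothetical helper).

    Runs of CJK characters become overlapping two-character terms;
    runs of ASCII letters and digits become lowercased words.
    """
    terms = []
    for match in _TOKEN.finditer(text):
        run = match.group(0)
        if _CJK_RUN.match(run):
            # Overlapping pairs: "ABCD" -> "AB", "BC", "CD".
            # A one-character run yields no terms at all.
            terms.extend(run[i:i + 2] for i in range(len(run) - 1))
        else:
            terms.append(run.lower())
    return terms
```

With this approach a four-character CJK run yields three index terms regardless of where the real word boundaries fall, which is why the index grows and why a query can match text that merely shares a character pair.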

Changes Summary

Version 0.2 [email protected] improves on the previous version in several ways: it uses Unicode internally (not UTF-8), replaces the configuration file with lookups in the unicodedata module to identify CJK characters and symbols, adds unit tests, and adds detailed English instructions for installation and use.
Version 0.1 [email protected] was the original version.
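
As a rough illustration of what a unicodedata-based lookup can look like, the sketch below classifies a character by its official Unicode name. It is a minimal example under stated assumptions, not the code shipped in version 0.2: the helper name is_cjk and the choice of name prefixes are hypothetical.

```python
import unicodedata

# Assumption: these Unicode name prefixes are a reasonable first cut at
# "CJK character". The real product may use different criteria.
_CJK_NAME_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA", "HANGUL")


def is_cjk(char):
    """Return True if the character's Unicode name marks it as CJK.

    Hypothetical helper. Unnamed code points get an empty name and
    are treated as non-CJK.
    """
    name = unicodedata.name(char, "")
    return name.startswith(_CJK_NAME_PREFIXES)
```

A name-prefix test like this avoids maintaining a table of code point ranges, but it misses characters whose names start differently, such as ideographic punctuation ("IDEOGRAPHIC FULL STOP") or halfwidth katakana, so a fuller version would need extra rules.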

Known Problems

Text must (well, should) be stored as Unicode.
Cannot search for single characters.
Could do a better job of identifying CJK characters.
May match more than is strictly necessary because of the algorithm used. (See the source code for details.)

Please join the zopeasia project on SourceForge to participate in the development.

Homepage:
http://sourceforge.net/projects/zopeasia/
Contact Email:
[email protected]

Latest Release:	0.2
Last Updated:	2003-03-09 21:15:21
Author:	ZopeOrgSite
Categories:	Internationalization, SoftwareProduct, ZCatalog, catalog, i18n
Maturity:	Stable

Available Releases

Version	Maturity	Platform	Released	File
0.2	Stable	All	2003-03-09 21:15:21	CJKSplitter-0.2.tgz (4 K)