change log
v0.6
- support single Chinese character search
v0.5.1
- add CREDIT.txt and licence information
v0.5
- use regular expressions to stay compatible with the default English whitespace splitter
- removed configuration file, much simpler code, easy to install, easy to use
- support multiple encodings: unicode/utf-8/gb18030/gbk/gb2312/mbcs/big5.
  provide 3 splitters:
  CJK splitter: supports unicode/utf-8 encodings. this encoding is compatible with version 0.1
  CJK GB splitter: supports unicode/gb18030/gbk/gb2312/mbcs encodings
  CJK BIG5 splitter: supports unicode/big5/mbcs encodings
- unicode encoding is detected automatically. this makes CJKSplitter compatible with Archetypes 1.2+ (strings stored as unicode)
- better encoding handling to avoid exceptions (replace)
- smaller index storage for CJK: index stored as unicode (2 bytes) rather than utf-8 (3 bytes)
- support English globbing
- precise CJK character recognition (\u4E00-\u9FFF)
- maybe better performance, not tested
- better documentation (thanks bjorn!)
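
example (illustrative only, not the shipped code): a minimal sketch of how the v0.5 items above could fit together, assuming each character in \u4E00-\u9FFF becomes its own term, other word characters are grouped the way a whitespace splitter would group them, and byte input is decoded with the "replace" error mode. the names TOKEN_RE, to_text and split are hypothetical.

```python
import re

# Assumption: one term per CJK character in \u4E00-\u9FFF, plus runs of
# non-CJK word characters (English words, digits).
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[^\W\u4e00-\u9fff]+")


def to_text(value, encoding="utf-8"):
    """Return text; bytes are decoded, undecodable bytes become U+FFFD."""
    if isinstance(value, bytes):
        # errors="replace" avoids UnicodeDecodeError on bad input
        return value.decode(encoding, errors="replace")
    return value  # already text, nothing to detect or decode


def split(value, encoding="utf-8"):
    """Split text or encoded bytes into a list of index terms."""
    return TOKEN_RE.findall(to_text(value, encoding))


if __name__ == "__main__":
    print(split("hello 中文 world"))
    print(split("中文".encode("gbk"), encoding="gbk"))
```

with these assumptions, English words stay whole while each Chinese character is indexed on its own. a character in this range takes 2 bytes in UTF-16 and 3 bytes in UTF-8.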
v0.2
contributed by bjorn ([email protected])
v0.1
initial release, support utf-8 encoding only.