Zope Unicode support version 0.6
This modification to Zope provides support for python 2.x Unicode strings in ZPublisher, property pages, and property sheets.
Copyright (c) 1999, 2000, 2001 Toby Dickenson
Permission to use this software in any way is granted without fee, provided that the copyright notice above appears in all copies. This software is provided "as is" without any warranty.
Send comments to Toby Dickenson, [email protected]
Installation
This patch was developed with Zope 2.4.1 and python 2.1. It does not work with earlier versions of either product.
Changes since 0.5
- Zope 2.4 support (the first release of Zope to officially support a version of python which includes the Unicode type needed by this patch)
- This patch has had much less testing than 0.4 or 0.5 because I am not doing any active Unicode development at the moment. However, I'm not aware of any problems.
- XML-RPC now supports Unicode values. This support is not included with this patch; get version 0.9.9 of xmlrpclib.py from http://www.pythonware.com/products/xmlrpc/index.htm
Changes for Content Managers
Property pages and property sheets now include the extra types ustring, utokens, utext, and ulines. These are unicode equivalents of string, tokens, text, and lines.
Unicode strings can be mixed freely with plain strings in DTML. DTML will return a unicode string if any of its constituents are unicode, otherwise it will return a plain string as before.
When unicode strings are mixed with plain strings, the plain string is converted to unicode assuming that it contains characters in Zope's Default Character Encoding, discussed below.
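The mixing rule described above can be sketched roughly as follows. This is an illustration only, not the patch's code: it is written for Python 3, where `bytes` stands in for a plain string and `str` for a unicode string, and the names `join_parts` and `DEFAULT_ENCODING` are hypothetical.

```python
# Assumed stand-in for Zope's Default Character Encoding (latin-1 per this document).
DEFAULT_ENCODING = "latin-1"


def join_parts(parts, default_encoding=DEFAULT_ENCODING):
    """Join rendered fragments: unicode if any part is unicode, else plain."""
    if not any(isinstance(p, str) for p in parts):
        # No unicode involved: the result stays a plain (byte) string.
        return b"".join(parts)
    # At least one unicode fragment: promote the plain ones, then join.
    return "".join(
        p.decode(default_encoding) if isinstance(p, bytes) else p
        for p in parts
    )
```

For example, `join_parts([b"a", b"b"])` stays a plain string, while adding a single unicode fragment makes the whole result unicode.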
ZPublisher has been changed to handle a unicode response. If the response is not unicode then it behaves exactly as before. However, if it is unicode then ZPublisher applies the character encoding specified by the charset property in the Content-Type header. (This applies to all text/* content types.)
If you expect that your pages might include Unicode data, change your standard_html_header so that it sets a Content-Type header with an explicit charset, for example:
text/html; charset=UTF-8
If the Content-Type header does not include a charset property (or if the header is blank, in which case ZPublisher guesses text/html) then the unicode string is encoded using Zope's Default Character Encoding.
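As a rough illustration of the charset selection described above, the sketch below picks the encoding from a Content-Type value and falls back to a default. It is not ZPublisher's implementation: `encode_body`, `DEFAULT_ENCODING`, and the regular expression are assumptions made for this example, and content types other than text/* are not modelled. Python 3 `str` stands in for a unicode string and `bytes` for a plain string.

```python
import re

# Assumed stand-in for Zope's Default Character Encoding.
DEFAULT_ENCODING = "latin-1"

# Hypothetical pattern for pulling the charset parameter out of a header value.
_CHARSET = re.compile(r"charset\s*=\s*\"?([\w.:-]+)\"?", re.IGNORECASE)


def encode_body(body, content_type=""):
    """Return the bytes to send for a response body."""
    if isinstance(body, bytes):
        # Plain-string responses pass through unchanged, as before.
        return body
    match = _CHARSET.search(content_type or "")
    encoding = match.group(1) if match else DEFAULT_ENCODING
    return body.encode(encoding)
```

With `"text/html; charset=UTF-8"` a euro sign is sent as three UTF-8 bytes; with a bare `"text/html"` or an empty header the default encoding is used instead.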
Changes for Forms
ZPublisher has special processing for field names of the form "name:type" (for example "age:int", or "address:string"). ZPublisher uses these extra tags to marshal the form values into the correct type.
This mechanism has been extended to include a specification of the character encoding used by the form response. You need to know which encoding will be used by the browser and include an appropriate tag, for example "age:utf8:int" or "address:utf8:string". The tag parser insists that tags use only alphanumeric characters or underscores, so you might need to use a short form of the encoding name (such as UTF8 rather than UTF-8).
Four extra type converters have been added, Unicode equivalents of the existing string types: ustring, utokens, utext, and ulines.
If the field name does not include a character encoding tag, then the
Default Character Encoding is assumed.
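To make the "name:encoding:type" convention concrete, here is a small sketch of how such a field name could be taken apart. It is an assumption-laden illustration, not the patch's tag parser: `parse_field_name` and `CONVERTERS` are hypothetical names, the converter set lists only the types mentioned in this document, and recognising encodings through `codecs.lookup` is this example's choice.

```python
import codecs

# Converter names mentioned in this document; real ZPublisher has more.
CONVERTERS = {"int", "string", "tokens", "text", "lines",
              "ustring", "utokens", "utext", "ulines"}


def parse_field_name(field_name, default_encoding="latin-1"):
    """Split 'name:tag:tag' into (name, encoding, converter)."""
    name, *tags = field_name.split(":")
    converter, encoding = None, default_encoding
    for tag in tags:
        if tag in CONVERTERS:
            converter = tag
            continue
        try:
            # Any other tag is treated as a codec name, e.g. 'utf8'.
            encoding = codecs.lookup(tag).name
        except LookupError:
            raise ValueError("unknown tag %r in %r" % (tag, field_name))
    return name, encoding, converter
```

A field named "age:utf8:int" yields the name "age", the UTF-8 codec, and the int converter; "address:string" has no encoding tag, so the default encoding applies.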
Character Encoding Used In Form Responses
As explained above, you need to know which character encoding will be used by the browser to submit responses to your forms, and include the name of that encoding in the name of your form controls.
The encoding used by a browser depends on the encoding used by the page containing the form, and the type of form.
- Forms submitted using GET, or using POST with "application/x-www-form-urlencoded" (the default)
  - Page uses an encoding of unicode
    Forms are submitted using UTF8, as required by RFC 2718 2.2.5
  - Page uses another regional 8 bit encoding
    Forms are often submitted using the same encoding as the page. If you choose to use such an encoding then you should also verify how browsers behave.
- Forms submitted using "multipart/form-data"
  According to HTML 4.01 (section 17.13.4) browsers should state which character encoding they are using for each field in a Content-Type header; however, I have never seen a browser actually do this.
  Current browsers (as of December 2000, when I last looked) appear to use the same encoding as the page containing the form.
You are right to think that this is harder than it really should be. A no-brainer policy is to use UTF8 for every page, in which case form responses are also always UTF8.
Zope's Default Encoding
Zope allows you to mix plain strings and unicode strings. This will automatically do the right thing if the plain strings are using a latin-1 character encoding (or a subset of latin-1, such as ascii).
This default encoding is used when:
- unicode strings are mixed with plain strings in DTML
- the response is a unicode string, but the content-type does not include a charset
- a browser submits a form in unicode, but the parameter is marshalled to string, lines, tokens, or text (or any other marshalling type converter that is not unicode-aware)
This is less strict than basic Python, which will raise an exception when combining unicode strings with plain strings that contain characters outside the ascii range.
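The difference can be seen with a single non-ascii byte. The snippet below is a Python 3 sketch in which the explicit `decode("ascii")` stands in for the implicit ascii coercion Python 2 performs when a plain string meets a unicode string; the variable names are illustrative only.

```python
raw = b"caf\xe9"  # "cafe" with an accented e in latin-1; last byte is outside ascii

# Strict behaviour: an ascii decode of non-ascii data fails.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Lenient behaviour described above: treat the plain string as latin-1,
# after which it combines with any unicode text.
as_latin1 = raw.decode("latin-1") + " \u20ac"
```

Every byte value is a valid latin-1 character, which is why the latin-1 decode can never fail, although it may produce the wrong characters if the data was really in another encoding.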
Extensions to the DTML namespace
The DTML namespace (named _ in DTML expressions) now contains the following extra symbols, which are Python's new builtin functions of the same name.
- unicode
- unichr
- ustr
Pages That Do Not Expect Unicode
There are many DTML pages that are not currently unicode aware, including most of Zope's management interface. These changes have been designed to allow these DTML pages to remain unchanged if they never see unicode data, and to degrade gracefully if they should encounter unicode data accidentally.
The following issues should not be a problem:
- If a unicode property containing characters outside the latin-1 range is used on a page that is not unicode-aware, those characters will be replaced by question marks. This currently allows standard zope properties (such as title) to be unicode, without updating all pages in the management interface that use them.
- There may be problems with using unicode properties on a page that does not contain latin-1 data, but which also does not set an appropriate content-type header.
- In some circumstances, Zope modifies the returned html to include a tag. This modification will only work with character encodings that are a superset of ascii (i.e. not UTF16).
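The question-mark substitution mentioned in the first item of the list above matches what Python's standard "replace" error handler does when encoding. The helper name `degrade_to_latin1` is hypothetical, and whether the patch uses exactly this call is an assumption; the sketch only shows the effect.

```python
def degrade_to_latin1(value):
    """Encode unicode text as latin-1, substituting '?' for anything that does not fit."""
    return value.encode("latin-1", "replace")
```

For instance, an accented e survives because it exists in latin-1, while a euro sign (U+20AC) does not and comes out as a question mark.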
The following problems remain unresolved:
- Python will raise an exception if non-ascii plain strings are compared to unicode strings. This will cause problems for ZCatalog if one index contains both non-ascii plain strings and unicode strings. A workaround for this problem is to provide an external method which returns that property in unicode, then index the external method. Note that I think ZCatalog is already on dangerous ground in this area: http://classic.zope.org:8080/Collector/1219/view
- Python code that uses DTML may be broken when it returns a unicode string.
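The ZCatalog workaround in the first item above might look something like the following. This is only a sketch under assumptions: `as_unicode` and `unicode_title` are hypothetical names, `title` is used as the example property, latin-1 is assumed for plain strings, and Python 3 `bytes`/`str` again stand in for plain and unicode strings.

```python
def as_unicode(value, encoding="latin-1"):
    """Return value as unicode, decoding plain strings with the assumed encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def unicode_title(self):
    """Hypothetical External Method: always hand the index a unicode title."""
    return as_unicode(self.title)
```

Indexing the method rather than the raw property means the index only ever sees one string type, so the mixed comparison never happens.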