Changes to Zope Developers Guide, Chapter 2, Object Publishing
* Add the following before the section 'HTTP Responses' under 'Stringifying
the published object'
Character Encodings for Responses
If the published method returns an object of type 'string', a plain
8-bit character string, the publisher will use it directly as the body of the
response.
Things are different if the published method returns a unicode string,
because the publisher has to apply some character encoding. The published
method can choose which character encoding it uses by setting a
'Content-Type' response header which includes a 'charset' property
(setting response headers is explained later in this chapter). A
common choice of character encoding is UTF-8. To cause the publisher
to send unicode results as UTF-8 you need to set a
'Content-Type' header with the value 'text/html; charset=UTF-8'.
If the 'Content-Type' header does not include a charset property (or if this
header has not been set by the published method) then the publisher will
choose a default character encoding. Currently this default is ISO-8859-1
(also known as Latin-1), for compatibility with old versions of Zope which
did not include Unicode support. At some time in the future this default
is likely to change to UTF-8.
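The rule above can be sketched in a few lines of Python 3, where 'bytes'
plays the role of the plain 8-bit string and 'str' the role of the unicode
string. This is only an illustration under those assumptions: the helper
name 'encode_body' and its regular expression are hypothetical and are not
the publisher's actual code.

```python
import re

# Default named in the text above; assumed here to be spelled this way.
DEFAULT_CHARSET = "iso-8859-1"


def encode_body(body, content_type=None):
    """Return the response body as bytes (hypothetical helper)."""
    if isinstance(body, bytes):
        # A plain 8-bit string is used directly as the response body.
        return body
    charset = DEFAULT_CHARSET
    if content_type:
        # Look for a 'charset' property in the Content-Type value.
        match = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type,
                          re.IGNORECASE)
        if match:
            charset = match.group(1)
    return body.encode(charset)


utf8_body = encode_body("caf\u00e9", "text/html; charset=UTF-8")
default_body = encode_body("caf\u00e9", "text/html")
```

Here 'utf8_body' holds the two-byte UTF-8 form of the accented character,
while 'default_body' holds the single Latin-1 byte.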
* Inside the section 'Argument Conversion' is a list of type conversion
marshalling tags. Insert the following definition of 'ustring' under 'string'
ustring
Converts a variable to a Python unicode string.
* and insert this definition at the bottom of the list
ulines, utokens, utext
Like lines, tokens, and text, but producing unicode strings instead of
plain strings.
* Insert this section before 'Method Arguments'
Character Encodings for Arguments
The publisher needs to know what character encoding was used by the browser
to encode form fields into the request. That depends on whether the form
was submitted using GET or POST (which the publisher can work out for itself)
and on the character encoding used by the page which contained the form
(for which the publisher needs your help).
In some cases you need to add a specification of the character encoding
to each field's type converter. The full details of how this works are
explained below. However, most users do not need to deal with the full
details:
1 If your pages all use the UTF-8 character encoding (or at least all the
pages that contain forms do) then browsers will always use UTF-8 for
arguments. You need to add ':utf8' to all argument type converters. For
example:
2 If your pages all use a character encoding which has ASCII as a subset
(such as Latin-1 or UTF-8) then you do not need to specify any
character encoding for boolean, int, long, float, and date types.
You can also omit the character encoding from the type converter for
string, tokens, lines, and text types if you only need to handle ASCII
characters in that form field.
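Both shortcuts can be illustrated with a small Python 3 sketch. The
function 'convert_field' and its tag handling are hypothetical, not the
publisher's actual code, and the field name 'name:utf8:ustring' is an
assumed spelling of a field that combines the 'ustring' converter with a
'utf8' encoding tag.

```python
def convert_field(field_name, raw_value, default_encoding="latin-1"):
    """Split 'name:tag:tag' and decode raw_value (bytes) accordingly.

    Hypothetical helper: only the 'int' converter and the 'utf8'
    encoding tag are interpreted; any other tag is ignored.
    """
    name, *tags = field_name.split(":")
    encoding = "utf-8" if "utf8" in tags else default_encoding
    text = raw_value.decode(encoding)
    if "int" in tags:
        return name, int(text)
    return name, text


# Case 1: a UTF-8 page, so the field carries the encoding tag.
print(convert_field("name:utf8:ustring", b"caf\xc3\xa9"))

# Case 2: ASCII-only data decodes the same way with or without the tag.
print(convert_field("age:int", b"42"))
print(convert_field("age:int:utf8", b"42"))
```

The last two calls return the same value, which is why the encoding can be
left off numeric fields when the page encoding has ASCII as a subset.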
Character Encodings for Arguments; The Full Story
If you are not in one of those two easy categories, you first need
to determine which character encoding will be used by the browser to
encode the arguments in submitted forms.
1. Forms submitted using GET, or using POST with
"application/x-www-form-urlencoded" (the default)
1. Page uses a Unicode encoding:
Forms are submitted using UTF-8, as required by RFC 2718, section 2.2.5
2. Page uses another regional 8-bit encoding:
Forms are often submitted using the same encoding as the
page. If you choose to use such an encoding then you should
also verify how browsers behave.
2. Forms submitted using "multipart/form-data":
According to HTML 4.01 (section 17.13.4) browsers should state which
character encoding they are using for each field in a Content-Type
header. However, this is poorly supported. The current crop of
browsers appears to use the same encoding as the page containing
the form.
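To see why the publisher cannot simply guess, compare how the same value
looks on the wire under two encodings. The sketch below uses Python 3's
standard 'urllib.parse' module to mimic what a browser does for a
urlencoded form; the sample value is made up for illustration.

```python
from urllib.parse import quote, unquote

value = "caf\u00e9"  # sample form value containing one non-ASCII character

# The same field value, percent-encoded as two different pages would send it.
as_utf8 = quote(value, encoding="utf-8")      # 'caf%C3%A9'
as_latin1 = quote(value, encoding="latin-1")  # 'caf%E9'

# Decoding with the matching encoding recovers the original value.
round_trip = unquote(as_utf8, encoding="utf-8")

# Decoding UTF-8 bytes as Latin-1 silently yields the wrong characters.
wrong_guess = unquote(as_utf8, encoding="latin-1")

print(as_utf8, as_latin1, round_trip, wrong_guess)
```

Nothing in the request itself distinguishes the two forms, so the
encoding has to be supplied by the page author.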
Every field needs that character encoding name appended to its converter.
The tag parser insists that tags must only use alphanumeric characters
or an underscore, so you might need to use a short form of the
encoding name from the Python 'encodings' library package (such
as utf8 rather than UTF-8).
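One way to check whether a candidate name will work is to test it against
both constraints: the tag character rule and Python's codec registry. The
helper 'usable_as_tag' below is a hypothetical Python 3 sketch, and its
regular expression is only an assumed rendering of the rule described
above.

```python
import codecs
import re

# Assumed form of the tag rule: alphanumeric characters or underscore only.
TAG_PATTERN = re.compile(r"^[A-Za-z0-9_]+$")


def usable_as_tag(encoding_name):
    """True if the name fits the tag rule and Python knows it as a codec."""
    if not TAG_PATTERN.match(encoding_name):
        return False
    try:
        codecs.lookup(encoding_name)
    except LookupError:
        return False
    return True


for candidate in ("UTF-8", "utf8", "ISO-8859-1", "latin1", "iso8859_1"):
    print(candidate, usable_as_tag(candidate))
```

The hyphenated names are rejected by the character rule even though Python
recognises them, while the short aliases pass both checks.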