Add markup, examples and a few typo fixes to new section in codecs docs
diff --git a/Doc/lib/libcodecs.tex b/Doc/lib/libcodecs.tex
index b306606..d5c0d9f 100644
--- a/Doc/lib/libcodecs.tex
+++ b/Doc/lib/libcodecs.tex
@@ -525,9 +525,10 @@
 \subsection{Encodings and Unicode\label{encodings-overview}}
 
 Unicode strings are stored internally as sequences of codepoints (to
-be precise as Py_UNICODE arrays). Depending on the way Python is
-compiled (either via --enable-unicode=ucs2 or --enable-unicode=ucs4,
-with the former being the default) Py_UNICODE is either a 16-bit or
+be precise as \ctype{Py_UNICODE} arrays). Depending on the way Python is
+compiled (either via \longprogramopt{enable-unicode=ucs2} or 
+\longprogramopt{enable-unicode=ucs4}, with the former being the default)
+\ctype{Py_UNICODE} is either a 16-bit or
 32-bit data type. Once a Unicode object is used outside of CPU and
 memory, CPU endianness and how these arrays are stored as bytes become
 an issue. Transforming a unicode object into a sequence of bytes is
@@ -535,20 +536,41 @@
-bytes is known as decoding. There are many different methods how this
+bytes is known as decoding. There are many different methods by which this
 transformation can be done (these methods are also called encodings).
 The simplest method is to map the codepoints 0-255 to the bytes
-0x0-0xff. This means that a unicode object that contains codepoints
-above U+00FF can't be encoded with this method (which is called
-'latin-1' or 'iso-8859-1'). unicode.encode() will raise a
-UnicodeEncodeError that looks like this: UnicodeEncodeError: 'latin-1'
-codec can't encode character u'\u1234' in position 3: ordinal not in
-range(256)
+\code{0x0}-\code{0xff}. This means that a unicode object that contains
+codepoints above \code{U+00FF} can't be encoded with this method (which
+is called \code{'latin-1'} or \code{'iso-8859-1'}).
+\code{unicode.encode()} will raise a \exception{UnicodeEncodeError}
+that looks like this: \samp{UnicodeEncodeError: 'latin-1' codec can't
+encode character u'\e u1234' in position 3: ordinal not in range(256)}.
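+
+For example (the traceback is abbreviated and the error message
+wrapped to fit the page):
+
+\begin{verbatim}
+>>> u"abc\u1234".encode('latin-1')
+Traceback (most recent call last):
+  ...
+UnicodeEncodeError: 'latin-1' codec can't encode character u'\u1234'
+in position 3: ordinal not in range(256)
+\end{verbatim}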
 
 There's another group of encodings (the so called charmap encodings)
 that choose a different subset of all unicode code points and how
-these codepoints are mapped to the bytes 0x0-0xff. To see how this is
-done simply open e.g.  encodings/cp1252.py (which is an encoding that
-is used primarily on Windows).  There's a string constant with 256
-characters that shows you which character is mapped to which byte
-value.
+these codepoints are mapped to the bytes \code{0x0}-\code{0xff}.
+To see how this is done, simply open e.g. \file{encodings/cp1252.py}
+(which is an encoding that is used primarily on Windows).
+There's a string constant with 256 characters that shows you which
+character is mapped to which byte value.
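+
+For example, with the \code{cp1252} codec (where the euro sign,
+\code{U+20AC}, is mapped to the byte \code{0x80}):
+
+\begin{verbatim}
+>>> u"\u20ac".encode('cp1252')
+'\x80'
+>>> '\x80'.decode('cp1252')
+u'\u20ac'
+\end{verbatim}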
 
-All of these encodings can only encode 256 of the 65536 (or 1114111)
+All of these encodings can only encode 256 of the 65536 (or 1114112)
 codepoints defined in unicode. A simple and straightforward way that
@@ -562,20 +584,28 @@
 by a CPU with a different endianness, then bytes have to be swapped
 though. To be able to detect the endianness of a UTF-16 byte sequence,
 there's the so called BOM (the "Byte Order Mark"). This is the Unicode
-character U+FEFF. This character will be prepended to every UTF-16
-byte sequence. The byte swapped version of this character (0xFFFE) is
+character \code{U+FEFF}. This character will be prepended to every UTF-16
+byte sequence. The byte swapped version of this character (\code{0xFFFE}) is
 an illegal character that may not appear in a Unicode text. So when
-the first character in an UTF-16 byte sequence appears to be a U+FFFE
+the first character in a UTF-16 byte sequence appears to be a \code{U+FFFE},
-the bytes have to be swapped on decoding. Unfortunately upto Unicode
+the bytes have to be swapped on decoding. Unfortunately up to Unicode
-4.0 the character U+FEFF had a second purpose as a "ZERO WIDTH
-NO-BREAK SPACE": A character that has no width and doesn't allow a
+4.0 the character \code{U+FEFF} had a second purpose as a \samp{ZERO WIDTH
+NO-BREAK SPACE}: A character that has no width and doesn't allow a
 word to be split. It can e.g. be used to give hints to a ligature
-algorithm. With Unicode 4.0 using U+FEFF as a ZERO WIDTH NO-BREAK
-SPACE has been deprecated (with U+2060 (WORD JOINER) assuming this
-role). Nevertheless Unicode software still must be able to handle
-U+FEFF in both roles: As a BOM it's a device to determine the storage
+algorithm. With Unicode 4.0 using \code{U+FEFF} as a \samp{ZERO WIDTH NO-BREAK
+SPACE} has been deprecated (with \code{U+2060} (\samp{WORD JOINER}) assuming
+this role). Nevertheless Unicode software still must be able to handle
+\code{U+FEFF} in both roles: As a BOM it's a device to determine the storage
 layout of the encoded bytes, and vanishes once the byte sequence has
-been decoded into a Unicode string; as a ZERO WIDTH NO-BREAK SPACE
+been decoded into a Unicode string; as a \samp{ZERO WIDTH NO-BREAK SPACE}
 it's a normal character that will be decoded like any other.
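+
+For example (the byte values shown are those produced on a little-endian
+machine; on a big-endian one the bytes of each pair are swapped):
+
+\begin{verbatim}
+>>> u"spam".encode('utf-16')
+'\xff\xfes\x00p\x00a\x00m\x00'
+\end{verbatim}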
 
-There's another encoding that is able to encoding the full range of
+There's another encoding that is able to encode the full range of
@@ -588,20 +618,29 @@
 character):
 
 \begin{tableii}{l|l}{textrm}{}{Range}{Encoding}
-\lineii{U-00000000 ... U-0000007F}{0xxxxxxx}
-\lineii{U-00000080 ... U-000007FF}{110xxxxx 10xxxxxx}
-\lineii{U-00000800 ... U-0000FFFF}{1110xxxx 10xxxxxx 10xxxxxx}
-\lineii{U-00010000 ... U-001FFFFF}{11110xxx 10xxxxxx 10xxxxxx 10xxxxxx}
-\lineii{U-00200000 ... U-03FFFFFF}{111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
-\lineii{U-04000000 ... U-7FFFFFFF}{1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
+\lineii{\code{U-00000000} ... \code{U-0000007F}}{0xxxxxxx}
+\lineii{\code{U-00000080} ... \code{U-000007FF}}{110xxxxx 10xxxxxx}
+\lineii{\code{U-00000800} ... \code{U-0000FFFF}}{1110xxxx 10xxxxxx 10xxxxxx}
+\lineii{\code{U-00010000} ... \code{U-001FFFFF}}{11110xxx 10xxxxxx 10xxxxxx 10xxxxxx}
+\lineii{\code{U-00200000} ... \code{U-03FFFFFF}}{111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
+\lineii{\code{U-04000000} ... \code{U-7FFFFFFF}}{1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
 \end{tableii}
 
 The least significant bit of the Unicode character is the rightmost x
 bit.
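+
+For example, a two-byte and a three-byte sequence:
+
+\begin{verbatim}
+>>> u"\u00fc".encode('utf-8')
+'\xc3\xbc'
+>>> u"\u1234".encode('utf-8')
+'\xe1\x88\xb4'
+\end{verbatim}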
 
-As UTF-8 is an 8bit encoding no BOM is required and any U+FEFF
+As UTF-8 is an 8-bit encoding, no BOM is required and any \code{U+FEFF}
 character in the decoded Unicode string (even if it's the first
-character) is treated as a ZERO WIDTH NO-BREAK SPACE.
+character) is treated as a \samp{ZERO WIDTH NO-BREAK SPACE}.
 
 Without external information it's impossible to reliably determine
 which encoding was used for encoding a Unicode string. Each charmap
@@ -609,14 +648,14 @@
 possible with UTF-8, as UTF-8 byte sequences have a structure that
-doesn't allow arbitrary byte sequence. To increase the reliability
+doesn't allow arbitrary byte sequences. To increase the reliability
 with which a UTF-8 encoding can be detected, Microsoft invented a
-variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad
+variant of UTF-8 (that Python 2.5 calls \code{"utf-8-sig"}) for its Notepad
 program: Before any of the Unicode characters is written to the file,
-a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef,
-0xbb, 0xbf) is written. As it's rather improbably that any charmap
-encoded file starts with these byte values (which would e.g. map to
+a UTF-8 encoded BOM (which looks like this as a byte sequence: \code{0xef},
+\code{0xbb}, \code{0xbf}) is written. As it's rather improbable that any
+charmap encoded file starts with these byte values (which would e.g. map to
 
-   LATIN SMALL LETTER I WITH DIAERESIS
-   RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
+   LATIN SMALL LETTER I WITH DIAERESIS \\
+   RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK \\
    INVERTED QUESTION MARK
 
 in iso-8859-1), this increases the probability that a utf-8-sig
@@ -624,9 +663,18 @@
 BOM is not used to be able to determine the byte order used for
 generating the byte sequence, but as a signature that helps in
 guessing the encoding. On encoding the utf-8-sig codec will write
-0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding
-utf-8-sig will skip those three bytes if they appear as the first
-three bytes in the file.
+\code{0xef}, \code{0xbb}, \code{0xbf} as the first three bytes to the file.
+On decoding, \code{utf-8-sig} will skip those three bytes if they appear as the
+first three bytes in the file.
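+
+For example (using the \code{utf-8-sig} codec described above):
+
+\begin{verbatim}
+>>> u"spam".encode('utf-8-sig')
+'\xef\xbb\xbfspam'
+>>> '\xef\xbb\xbfspam'.decode('utf-8-sig')
+u'spam'
+\end{verbatim}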
 
 
 \subsection{Standard Encodings\label{standard-encodings}}