:mod:`codecs` --- Codec registry and base classes
=================================================

.. module:: codecs
   :synopsis: Encode and decode data and streams.
.. moduleauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>

.. index::
   single: Unicode
   single: Codecs
   pair: Codecs; encode
   pair: Codecs; decode
   single: streams
   pair: stackable; streams

This module defines base classes for standard Python codecs (encoders and
decoders) and provides access to the internal Python codec registry, which
manages the codec and error handling lookup process.

It defines the following functions:


.. function:: register(search_function)

   Register a codec search function. Search functions are expected to take one
   argument, the encoding name in all lower case letters, and return a
   :class:`CodecInfo` object having the following attributes:

   * ``name`` The name of the encoding;

   * ``encode`` The stateless encoding function;

   * ``decode`` The stateless decoding function;

   * ``incrementalencoder`` An incremental encoder class or factory function;

   * ``incrementaldecoder`` An incremental decoder class or factory function;

   * ``streamwriter`` A stream writer class or factory function;

   * ``streamreader`` A stream reader class or factory function.

   The various functions or classes take the following arguments:

   *encode* and *decode*: These must be functions or methods which have the same
   interface as the :meth:`encode`/:meth:`decode` methods of Codec instances (see
   :ref:`codec-objects`). The functions/methods are expected to work in a
   stateless mode.

   *incrementalencoder* and *incrementaldecoder*: These have to be factory
   functions providing the following interface:

      ``factory(errors='strict')``

   The factory functions must return objects providing the interfaces defined by
   the base classes :class:`IncrementalEncoder` and :class:`IncrementalDecoder`,
   respectively. Incremental codecs can maintain state.

   *streamreader* and *streamwriter*: These have to be factory functions providing
   the following interface:

      ``factory(stream, errors='strict')``

   The factory functions must return objects providing the interfaces defined by
   the base classes :class:`StreamReader` and :class:`StreamWriter`, respectively.
   Stream codecs can maintain state.

   Possible values for *errors* are

   * ``'strict'``: raise an exception in case of an encoding error
   * ``'replace'``: replace malformed data with a suitable replacement marker,
     such as ``'?'`` or ``'\ufffd'``
   * ``'ignore'``: ignore malformed data and continue without further notice
   * ``'xmlcharrefreplace'``: replace with the appropriate XML character
     reference (for encoding only)
   * ``'backslashreplace'``: replace with backslashed escape sequences (for
     encoding only)
   * ``'surrogateescape'``: replace with surrogate U+DCxx, see :pep:`383`

   as well as any other error handling name defined via :func:`register_error`.

   In case a search function cannot find a given encoding, it should return
   ``None``.
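
The search-function protocol described above can be sketched roughly as
follows. The encoding name ``demoascii`` and the function ``demo_search`` are
made up for this illustration; the codec simply reuses the pieces of the
built-in ASCII codec rather than implementing anything new.

.. code-block:: python

   import codecs


   def demo_search(name):
       # "demoascii" is a hypothetical encoding name used only in this sketch.
       if name != "demoascii":
           return None  # unknown encoding: let other search functions try
       base = codecs.lookup("ascii")
       # Reuse the ASCII codec's functions and classes under the new name.
       return codecs.CodecInfo(
           name="demoascii",
           encode=base.encode,
           decode=base.decode,
           incrementalencoder=base.incrementalencoder,
           incrementaldecoder=base.incrementaldecoder,
           streamwriter=base.streamwriter,
           streamreader=base.streamreader,
       )


   codecs.register(demo_search)
   encoded = "abc".encode("demoascii")
   decoded = encoded.decode("demoascii")

Returning ``None`` for every other name keeps the function from shadowing
encodings it does not know about.
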

.. function:: lookup(encoding)

   Looks up the codec info in the Python codec registry and returns a
   :class:`CodecInfo` object as defined above.

   Encodings are first looked up in the registry's cache. If not found, the list of
   registered search functions is scanned. If no :class:`CodecInfo` object is
   found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
   is stored in the cache and returned to the caller.
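
As a small illustration of the lookup behaviour, the sketch below fetches the
UTF-8 codec, calls its stateless encoder, and shows the :exc:`LookupError`
raised for an unknown name. The bogus encoding name is an assumption chosen
for the example.

.. code-block:: python

   import codecs

   info = codecs.lookup("utf-8")
   # The stateless functions return a tuple (output, length consumed).
   data, consumed = info.encode("h\u00e9llo")

   try:
       codecs.lookup("no-such-encoding-for-this-demo")
   except LookupError:
       found = False
   else:
       found = True
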

To simplify access to the various codecs, the module provides these additional
functions which use :func:`lookup` for the codec lookup:


.. function:: getencoder(encoding)

   Look up the codec for the given encoding and return its encoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getdecoder(encoding)

   Look up the codec for the given encoding and return its decoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.
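
A brief sketch of how the two functions above might be used together; both
returned callables are stateless and return a tuple of the output and the
length of input consumed.

.. code-block:: python

   import codecs

   encode = codecs.getencoder("latin-1")
   decode = codecs.getdecoder("latin-1")

   data, chars_consumed = encode("caf\u00e9")   # str -> bytes
   text, bytes_consumed = decode(data)          # bytes -> str
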


.. function:: getincrementalencoder(encoding)

   Look up the codec for the given encoding and return its incremental encoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental encoder.


.. function:: getincrementaldecoder(encoding)

   Look up the codec for the given encoding and return its incremental decoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental decoder.
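
The following sketch shows one way the returned factory might be used: the
three bytes of a euro sign arrive split across two chunks, and the decoder
buffers the incomplete sequence until it can be completed.

.. code-block:: python

   import codecs

   decoder_factory = codecs.getincrementaldecoder("utf-8")
   decoder = decoder_factory(errors="strict")

   # The UTF-8 encoding of U+20AC is split across two chunks.
   chunks = [b"\xe2\x82", b"\xac"]
   parts = [decoder.decode(chunk) for chunk in chunks]
   # Signal the end of input so that any buffered bytes are checked.
   parts.append(decoder.decode(b"", final=True))
   text = "".join(parts)
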


.. function:: getreader(encoding)

   Look up the codec for the given encoding and return its StreamReader class or
   factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getwriter(encoding)

   Look up the codec for the given encoding and return its StreamWriter class or
   factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.
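
As an illustration, the classes returned by these two functions can wrap any
binary file-like object. The sketch below uses in-memory :class:`io.BytesIO`
streams, which is an assumption made to keep the example self-contained.

.. code-block:: python

   import codecs
   import io

   raw = io.BytesIO()
   writer = codecs.getwriter("utf-8")(raw)
   writer.write("na\u00efve\n")          # str in, encoded bytes out

   reader = codecs.getreader("utf-8")(io.BytesIO(raw.getvalue()))
   text = reader.read()                  # bytes in, decoded str out
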


.. function:: register_error(name, error_handler)

   Register the error handling function *error_handler* under the name *name*.
   *error_handler* will be called during encoding and decoding in case of an error,
   when *name* is specified as the errors parameter.

   For encoding, *error_handler* will be called with a :exc:`UnicodeEncodeError`
   instance, which contains information about the location of the error. The error
   handler must either raise this or a different exception or return a tuple with a
   replacement for the unencodable part of the input and a position where encoding
   should continue. The encoder will encode the replacement and continue encoding
   the original input at the specified position. Negative position values will be
   treated as being relative to the end of the input string. If the resulting
   position is out of bounds, an :exc:`IndexError` will be raised.

   Decoding and translating work similarly, except that :exc:`UnicodeDecodeError`
   or :exc:`UnicodeTranslateError` will be passed to the handler and the
   replacement from the error handler will be put into the output directly.
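
A minimal sketch of a custom handler, assuming the hypothetical handler name
``demo_underscore``: it replaces each unencodable character with an underscore
and resumes encoding right after the offending span.

.. code-block:: python

   import codecs


   def underscore_errors(exc):
       # Hypothetical handler: one "_" per unencodable character.
       if isinstance(exc, UnicodeEncodeError):
           replacement = "_" * (exc.end - exc.start)
           return (replacement, exc.end)  # resume right after the bad span
       raise exc  # decoding and translating are not handled in this sketch


   codecs.register_error("demo_underscore", underscore_errors)
   result = "caf\u00e9 \u2603".encode("ascii", errors="demo_underscore")
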


.. function:: lookup_error(name)

   Return the error handler previously registered under the name *name*.

   Raises a :exc:`LookupError` in case the handler cannot be found.
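
For illustration, a looked-up handler can also be called directly with an
exception instance. The exception below is constructed by hand purely for the
example, and the unknown handler name is likewise made up.

.. code-block:: python

   import codecs

   handler = codecs.lookup_error("replace")
   # Describe one unencodable character at position 0 of the input.
   exc = UnicodeEncodeError("ascii", "\u00e9", 0, 1, "demo error")
   replacement, position = handler(exc)

   try:
       codecs.lookup_error("no-such-handler-for-this-demo")
   except LookupError:
       known = False
   else:
       known = True
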


.. function:: strict_errors(exception)

   Implements the ``strict`` error handling: each encoding or decoding error
   raises a :exc:`UnicodeError`.


.. function:: replace_errors(exception)

   Implements the ``replace`` error handling: malformed data is replaced with a
   suitable replacement character such as ``'?'`` in bytestrings and
   ``'\ufffd'`` in Unicode strings.


.. function:: ignore_errors(exception)

   Implements the ``ignore`` error handling: malformed data is ignored and
   encoding or decoding is continued without further notice.


.. function:: xmlcharrefreplace_errors(exception)

   Implements the ``xmlcharrefreplace`` error handling (for encoding only): the
   unencodable character is replaced by an appropriate XML character reference.


.. function:: backslashreplace_errors(exception)

   Implements the ``backslashreplace`` error handling (for encoding only): the
   unencodable character is replaced by a backslashed escape sequence.
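
The effect of the predefined handlers is easiest to see through
:meth:`str.encode` and :meth:`bytes.decode`, which select them by name. The
sketch below also calls one handler function directly with a hand-built
exception, which is done here only for demonstration.

.. code-block:: python

   import codecs

   text = "\u00e9"  # not representable in ASCII

   replaced = text.encode("ascii", errors="replace")
   ignored = text.encode("ascii", errors="ignore")
   xml_ref = text.encode("ascii", errors="xmlcharrefreplace")
   escaped = text.encode("ascii", errors="backslashreplace")
   decoded = b"\xff".decode("ascii", errors="replace")

   try:
       text.encode("ascii", errors="strict")
   except UnicodeError as exc:
       strict_error = type(exc).__name__

   # Calling a handler function directly returns (replacement, position).
   demo_exc = UnicodeEncodeError("ascii", text, 0, 1, "demo error")
   direct = codecs.backslashreplace_errors(demo_exc)
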


To simplify working with encoded files or streams, the module also defines these
utility functions:


.. function:: open(filename, mode[, encoding[, errors[, buffering]]])

   Open an encoded file using the given *mode* and return a wrapped version
   providing transparent encoding/decoding. The default file mode is ``'r'``,
   meaning to open the file in read mode.

   .. note::

      The wrapped version's methods will accept and return strings only. Bytes
      arguments will be rejected.

   .. note::

      Files are always opened in binary mode, even if no binary mode was
      specified. This is done to avoid data loss due to encodings using 8-bit
      values. This means that no automatic conversion of ``b'\n'`` is done
      on reading and writing.

   *encoding* specifies the encoding which is to be used for the file.

   *errors* may be given to define the error handling. It defaults to ``'strict'``
   which causes a :exc:`ValueError` to be raised in case an encoding error occurs.

   *buffering* has the same meaning as for the built-in :func:`open` function. It
   defaults to line buffered.
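
A hedged sketch of a write/read round trip through :func:`codecs.open`. The
temporary directory and the file name ``demo.txt`` are assumptions for the
example, and the warning filter is there only because some newer Python
versions may report this function as deprecated.

.. code-block:: python

   import codecs
   import os
   import tempfile
   import warnings

   with tempfile.TemporaryDirectory() as tmpdir:
       path = os.path.join(tmpdir, "demo.txt")  # hypothetical file name
       with warnings.catch_warnings():
           # Some newer Python versions may warn that codecs.open() is deprecated.
           warnings.simplefilter("ignore", DeprecationWarning)
           with codecs.open(path, "w", encoding="utf-8") as f:
               f.write("gr\u00fc\u00dfe\n")      # strings in
           with codecs.open(path, "r", encoding="utf-8") as f:
               text = f.read()                   # strings out
       # The underlying file holds the encoded bytes, with no newline conversion.
       with open(path, "rb") as f:
           raw = f.read()
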


.. function:: EncodedFile(file, data_encoding, file_encoding=None, errors='strict')

   Return a wrapped version of *file* which provides transparent encoding
   translation.

   Bytes written to the wrapped file are interpreted according to the given
   *data_encoding* and then written to the original file as bytes using the
   *file_encoding*.

   If *file_encoding* is not given, it defaults to *data_encoding*.

   *errors* may be given to define the error handling. It defaults to
   ``'strict'``, which causes :exc:`ValueError` to be raised in case an encoding
   error occurs.
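
The translation described above might look like this in practice: the caller
works with UTF-8 bytes while the underlying stream stores Latin-1. The
in-memory streams are an assumption made to keep the sketch self-contained.

.. code-block:: python

   import codecs
   import io

   raw = io.BytesIO()
   wrapped = codecs.EncodedFile(raw, data_encoding="utf-8",
                                file_encoding="latin-1")
   wrapped.write("\u00e9".encode("utf-8"))   # caller supplies UTF-8 bytes
   stored = raw.getvalue()                    # stream receives Latin-1 bytes

   # Reading goes the other way: Latin-1 on the stream, UTF-8 to the caller.
   readback = codecs.EncodedFile(io.BytesIO(stored), "utf-8", "latin-1").read()
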


.. function:: iterencode(iterator, encoding, errors='strict', **kwargs)

   Uses an incremental encoder to iteratively encode the input provided by
   *iterator*. This function is a :term:`generator`. *errors* (as well as any
   other keyword argument) is passed through to the incremental encoder.


.. function:: iterdecode(iterator, encoding, errors='strict', **kwargs)

   Uses an incremental decoder to iteratively decode the input provided by
   *iterator*. This function is a :term:`generator`. *errors* (as well as any
   other keyword argument) is passed through to the incremental decoder.
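
A short sketch of both generators. Note how :func:`iterdecode` copes with a
multi-byte sequence split across two input chunks, because the incremental
decoder carries the partial sequence over to the next chunk.

.. code-block:: python

   import codecs

   encoded_chunks = list(codecs.iterencode(["a", "\u00e9", "\u20ac"], "utf-8"))

   # The euro sign's three bytes are split across two input chunks.
   decoded_chunks = list(codecs.iterdecode([b"\xe2\x82", b"\xac"], "utf-8"))
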


The module also provides the following constants which are useful for reading
and writing to platform dependent files:


.. data:: BOM
          BOM_BE
          BOM_LE
          BOM_UTF8
          BOM_UTF16
          BOM_UTF16_BE
          BOM_UTF16_LE
          BOM_UTF32
          BOM_UTF32_BE
          BOM_UTF32_LE

   These constants define various encodings of the Unicode byte order mark (BOM)
   used in UTF-16 and UTF-32 data streams to indicate the byte order used in the
   stream or file and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either
   :const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's
   native byte order, :const:`BOM` is an alias for :const:`BOM_UTF16`,
   :const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for
   :const:`BOM_UTF16_BE`. The others represent the BOM in UTF-8 and UTF-32
   encodings.
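
The relationships between the constants, and one typical use (stripping a
UTF-8 signature before decoding), can be sketched as follows. The sample data
is made up for the example.

.. code-block:: python

   import codecs
   import sys

   # BOM_UTF16 follows the platform's native byte order.
   if sys.byteorder == "little":
       native_utf16 = codecs.BOM_UTF16_LE
   else:
       native_utf16 = codecs.BOM_UTF16_BE

   data = codecs.BOM_UTF8 + "hi".encode("utf-8")
   # Drop the signature, if present, before decoding.
   if data.startswith(codecs.BOM_UTF8):
       payload = data[len(codecs.BOM_UTF8):]
   else:
       payload = data
   text = payload.decode("utf-8")
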

.. _codec-base-classes:

Codec Base Classes
------------------

The :mod:`codecs` module defines a set of base classes which define the
interface and can also be used to easily write your own codecs for use in
Python.

Each codec has to define four interfaces to make it usable as a codec in Python:
stateless encoder, stateless decoder, stream reader and stream writer. The
stream reader and writers typically reuse the stateless encoder/decoder to
implement the file protocols.

The :class:`Codec` class defines the interface for stateless encoders/decoders.

To simplify and standardize error handling, the :meth:`encode` and
:meth:`decode` methods may implement different error handling schemes by
providing the *errors* string argument. The following string values are defined
and implemented by all standard Python codecs:

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'strict'``            | Raise :exc:`UnicodeError` (or a subclass);    |
|                         | this is the default.                          |
+-------------------------+-----------------------------------------------+
| ``'ignore'``            | Ignore the character and continue with the    |
|                         | next.                                         |
+-------------------------+-----------------------------------------------+
| ``'replace'``           | Replace with a suitable replacement           |
|                         | character; Python will use the official       |
|                         | U+FFFD REPLACEMENT CHARACTER for the built-in |
|                         | Unicode codecs on decoding and '?' on         |
|                         | encoding.                                     |
+-------------------------+-----------------------------------------------+
| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character    |
|                         | reference (only for encoding).                |
+-------------------------+-----------------------------------------------+
| ``'backslashreplace'``  | Replace with backslashed escape sequences     |
|                         | (only for encoding).                          |
+-------------------------+-----------------------------------------------+
| ``'surrogateescape'``   | Replace byte with surrogate U+DCxx, as defined|
|                         | in :pep:`383`.                                |
+-------------------------+-----------------------------------------------+

In addition, the following error handlers are specific to a single codec:

+-------------------+---------+-------------------------------------------+
| Value             | Codec   | Meaning                                   |
+===================+=========+===========================================+
|``'surrogatepass'``| utf-8   | Allow encoding and decoding of surrogate  |
|                   |         | codes in UTF-8.                           |
+-------------------+---------+-------------------------------------------+

.. versionadded:: 3.1
   The ``'surrogateescape'`` and ``'surrogatepass'`` error handlers.

The set of allowed values can be extended via :func:`register_error`.
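
The two surrogate-related handlers from the tables above can be illustrated
with a short round trip. The sample bytes are arbitrary and chosen only so
that they are not valid UTF-8.

.. code-block:: python

   raw = b"abc\xff"  # 0xFF can never appear in valid UTF-8

   # 'surrogateescape' smuggles the undecodable byte through as U+DCFF ...
   text = raw.decode("utf-8", errors="surrogateescape")
   # ... and turns it back into the original byte when encoding.
   roundtrip = text.encode("utf-8", errors="surrogateescape")

   # 'surrogatepass' lets the UTF-8 codec encode and decode a lone surrogate.
   lone = "\ud800".encode("utf-8", errors="surrogatepass")
   back = lone.decode("utf-8", errors="surrogatepass")
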

.. _codec-objects:

Codec Objects
^^^^^^^^^^^^^

The :class:`Codec` class defines these methods which also define the function
interfaces of the stateless encoder and decoder:


.. method:: Codec.encode(input[, errors])

   Encodes the object *input* and returns a tuple (output object, length consumed).
   Encoding converts a string object to a bytes object using a particular
   character set encoding (e.g., ``cp1252`` or ``iso-8859-1``).

   *errors* defines the error handling to apply. It defaults to ``'strict'``
   handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamWriter` for codecs which have to keep state in order to make
   encoding efficient.

   The encoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.


.. method:: Codec.decode(input[, errors])

   Decodes the object *input* and returns a tuple (output object, length
   consumed). Decoding converts a bytes object encoded using a particular
   character set encoding to a string object.

   *input* must be a bytes object or one which provides the read-only character
   buffer interface -- for example, buffer objects and memory mapped files.

   *errors* defines the error handling to apply. It defaults to ``'strict'``
   handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamReader` for codecs which have to keep state in order to make
   decoding efficient.

   The decoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.
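
A minimal sketch of a stateless codec that follows this interface. The class
name ``DemoLatin1Codec`` is hypothetical; it only delegates to the built-in
Latin-1 functions, which is enough to show the return values and the
zero-length case.

.. code-block:: python

   import codecs


   class DemoLatin1Codec(codecs.Codec):
       # Hypothetical stateless codec delegating to the Latin-1 functions.

       def encode(self, input, errors="strict"):
           return codecs.latin_1_encode(input, errors)

       def decode(self, input, errors="strict"):
           return codecs.latin_1_decode(input, errors)


   codec = DemoLatin1Codec()
   encoded = codec.encode("caf\u00e9")     # (bytes, characters consumed)
   decoded = codec.decode(b"caf\xe9")      # (str, bytes consumed)
   empty_encoded = codec.encode("")        # zero-length input is allowed
   empty_decoded = codec.decode(b"")
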


The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide
the basic interface for incremental encoding and decoding. Encoding/decoding the
input isn't done with one call to the stateless encoder/decoder function, but
with multiple calls to the :meth:`encode`/:meth:`decode` method of the
incremental encoder/decoder. The incremental encoder/decoder keeps track of the
encoding/decoding process during method calls.

The joined output of calls to the :meth:`encode`/:meth:`decode` method is the
same as if all the single inputs were joined into one, and this input was
encoded/decoded with the stateless encoder/decoder.
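
This equivalence can be checked with a small sketch. UTF-16 is used for the
encoder because it needs state (the byte order mark is written only once),
and the decoder is fed one byte at a time; the sample strings are arbitrary.

.. code-block:: python

   import codecs

   pieces = ["ab", "cd", "ef"]
   encoder = codecs.getincrementalencoder("utf-16")()
   incremental = b"".join(encoder.encode(piece) for piece in pieces)
   incremental += encoder.encode("", final=True)
   stateless = "".join(pieces).encode("utf-16")

   data = "\u20ac10".encode("utf-8")
   decoder = codecs.getincrementaldecoder("utf-8")()
   # Feed the decoder a single byte per call.
   text = "".join(decoder.decode(data[i:i + 1]) for i in range(len(data)))
   text += decoder.decode(b"", final=True)
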


.. _incremental-encoder-objects:

IncrementalEncoder Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`IncrementalEncoder` class is used for encoding an input in multiple
steps. It defines the following methods which every incremental encoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalEncoder([errors])

   Constructor for an :class:`IncrementalEncoder` instance.

   All incremental encoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalEncoder` may implement different error handling schemes
   by providing the *errors* keyword argument. These parameters are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference.

   * ``'backslashreplace'`` Replace with backslashed escape sequences.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalEncoder`
   object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: encode(object[, final])

      Encodes *object* (taking the current state of the encoder into account)
      and returns the resulting encoded object. If this is the last call to
      :meth:`encode`, *final* must be true (the default is false).


   .. method:: reset()

      Reset the encoder to the initial state.


.. method:: IncrementalEncoder.getstate()

   Return the current state of the encoder which must be an integer. The
   implementation should make sure that ``0`` is the most common state. (States
   that are more complicated than integers can be converted into an integer by
   marshaling/pickling the state and encoding the bytes of the resulting string
   into an integer).


.. method:: IncrementalEncoder.setstate(state)

   Set the state of the encoder to *state*. *state* must be an encoder state
   returned by :meth:`getstate`.
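
The sketch below illustrates two points from this section: switching the
*errors* attribute on a live encoder, and saving and restoring encoder state.
It relies on the behaviour of the built-in ASCII and UTF-16 incremental
encoders; other codecs may expose different state values.

.. code-block:: python

   import codecs

   encoder = codecs.getincrementalencoder("ascii")(errors="replace")
   first = encoder.encode("\u00e9")
   encoder.errors = "ignore"          # switch strategy on the live encoder
   second = encoder.encode("\u00e9")

   utf16 = codecs.getincrementalencoder("utf-16")()
   with_bom = utf16.encode("a")       # the first call also emits the BOM
   state = utf16.getstate()           # an integer, per the interface above
   utf16.reset()                      # back to the initial state
   utf16.setstate(state)              # restore the saved state
   without_bom = utf16.encode("b", final=True)
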


.. _incremental-decoder-objects:

IncrementalDecoder Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`IncrementalDecoder` class is used for decoding an input in multiple
steps. It defines the following methods which every incremental decoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalDecoder([errors])

   Constructor for an :class:`IncrementalDecoder` instance.

   All incremental decoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalDecoder` may implement different error handling schemes
   by providing the *errors* keyword argument. These parameters are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalDecoder`
   object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: decode(object[, final])

      Decodes *object* (taking the current state of the decoder into account)
      and returns the resulting decoded object. If this is the last call to
      :meth:`decode`, *final* must be true (the default is false). If *final* is
      true the decoder must decode the input completely and must flush all
      buffers. If this isn't possible (e.g. because of incomplete byte sequences
      at the end of the input) it must initiate error handling just like in the
      stateless case (which might raise an exception).


   .. method:: reset()

      Reset the decoder to the initial state.


   .. method:: getstate()

      Return the current state of the decoder. This must be a tuple with two
      items, the first must be the buffer containing the still undecoded
      input. The second must be an integer and can be additional state
      info. (The implementation should make sure that ``0`` is the most common
      additional state info.) If this additional state info is ``0`` it must be
      possible to set the decoder to the state which has no input buffered and
      ``0`` as the additional state info, so that feeding the previously
      buffered input to the decoder returns it to the previous state without
      producing any output. (Additional state info that is more complicated than
      integers can be converted into an integer by marshaling/pickling the info
      and encoding the bytes of the resulting string into an integer.)


   .. method:: setstate(state)

      Set the state of the decoder to *state*. *state* must be a decoder state
      returned by :meth:`getstate`.
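
A sketch of the decoder state protocol, using the built-in UTF-8 incremental
decoder as an example: the buffered, still undecoded bytes are captured with
:meth:`getstate` and handed to a second decoder, which then finishes the
sequence. It also shows error handling being triggered when *final* is true
and the input ends mid-sequence.

.. code-block:: python

   import codecs

   decoder = codecs.getincrementaldecoder("utf-8")()
   partial = decoder.decode(b"\xe2\x82")   # incomplete: nothing is output yet
   buffered, extra = decoder.getstate()    # (undecoded input, additional info)

   resumed = codecs.getincrementaldecoder("utf-8")()
   resumed.setstate((buffered, extra))
   text = resumed.decode(b"\xac", final=True)

   try:
       decoder.decode(b"", final=True)     # input ends mid-sequence
   except UnicodeDecodeError:
       incomplete_rejected = True
   else:
       incomplete_rejected = False
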

550

Georg Brandl

2007-08-15 14:28:22 +0000

[diff] [blame]

551

Georg Brandl

2007-08-15 14:28:22 +0000

[diff] [blame]

552

The :class:`StreamWriter` and :class:`StreamReader` classes provide generic

553

working interfaces which can be used to implement new encoding submodules very

554

easily. See :mod:`encodings.utf_8` for an example of how this is done.

555

556

557

.. _stream-writer-objects:

StreamWriter Objects

^^^^^^^^^^^^^^^^^^^^

The :class:`StreamWriter` class is a subclass of :class:`Codec` and defines the

563

following methods which every stream writer must define in order to be

564

compatible with the Python codec registry.

.. class:: StreamWriter(stream[, errors])

   Constructor for a :class:`StreamWriter` instance.

   All stream writers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   *stream* must be a file-like object open for writing binary data.

   The :class:`StreamWriter` may implement different error handling schemes by
   providing the *errors* keyword argument. These values are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference.

   * ``'backslashreplace'`` Replace with backslashed escape sequences.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamWriter` object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: write(object)

      Writes the object's contents encoded to the stream.


   .. method:: writelines(list)

      Writes the concatenated list of strings to the stream (possibly by reusing
      the :meth:`write` method).


   .. method:: reset()

      Flushes and resets the codec buffers used for keeping state.

      Calling this method should ensure that the data on the output is put into
      a clean state that allows appending of new fresh data without having to
      rescan the whole stream to recover state.


In addition to the above methods, the :class:`StreamWriter` must also inherit
all other methods and attributes from the underlying stream.
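
As a minimal sketch of the writer interface described above, the following
example wraps an in-memory binary stream and switches the *errors* attribute
between writes. The use of :class:`io.BytesIO` and the ``ascii`` codec is an
illustrative assumption, not something the interface requires.

```python
import codecs
import io

# An in-memory binary stream stands in for a real file (illustrative choice).
raw = io.BytesIO()

# codecs.getwriter() returns the StreamWriter class registered for the codec.
writer = codecs.getwriter("ascii")(raw, errors="strict")

writer.write("abc ")
writer.writelines(["def", " "])

# Switch the error handling strategy during the lifetime of the writer.
writer.errors = "xmlcharrefreplace"
writer.write("\u20ac ")          # EURO SIGN cannot be encoded as ASCII

writer.errors = "backslashreplace"
writer.write("\u20ac ")

writer.errors = "replace"
writer.write("\u20ac ")

writer.errors = "ignore"
writer.write("\u20ac")

result = raw.getvalue()

# With 'strict', the same character raises a ValueError subclass instead.
strict_writer = codecs.getwriter("ascii")(io.BytesIO())
try:
    strict_writer.write("\u20ac")
except ValueError as exc:
    strict_error = exc
```

Each call to :meth:`write` consults the current value of ``writer.errors``,
which is why the four writes of the same character produce different bytes.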


.. _stream-reader-objects:

StreamReader Objects
^^^^^^^^^^^^^^^^^^^^

The :class:`StreamReader` class is a subclass of :class:`Codec` and defines the
following methods which every stream reader must define in order to be
compatible with the Python codec registry.


.. class:: StreamReader(stream[, errors])

   Constructor for a :class:`StreamReader` instance.

   All stream readers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   *stream* must be a file-like object open for reading (binary) data.

   The :class:`StreamReader` may implement different error handling schemes by
   providing the *errors* keyword argument. These values are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamReader` object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: read([size[, chars[, firstline]]])

      Decodes data from the stream and returns the resulting object.

      *chars* indicates the number of characters to read from the
      stream. :meth:`read` will never return more than *chars* characters, but
      it might return fewer, if there are not enough characters available.

      *size* indicates the approximate maximum number of bytes to read from the
      stream for decoding purposes. The decoder can modify this setting as
      appropriate. The default value -1 indicates to read and decode as much as
      possible. *size* is intended to prevent having to decode huge files in
      one step.

      *firstline* indicates that it would be sufficient to only return the first
      line, if there are decoding errors on later lines.

      The method should use a greedy read strategy meaning that it should read
      as much data as is allowed within the definition of the encoding and the
      given size, e.g. if optional encoding endings or state markers are
      available on the stream, these should be read too.


   .. method:: readline([size[, keepends]])

      Read one line from the input stream and return the decoded data.

      *size*, if given, is passed as the *size* argument to the stream's
      :meth:`readline` method.

      If *keepends* is false, line-endings will be stripped from the lines
      returned.


   .. method:: readlines([sizehint[, keepends]])

      Read all lines available on the input stream and return them as a list of
      lines.

      Line-endings are implemented using the codec's decoder method and are
      included in the list entries if *keepends* is true.

      *sizehint*, if given, is passed as the *size* argument to the stream's
      :meth:`read` method.


   .. method:: reset()

      Resets the codec buffers used for keeping state.

      Note that no stream repositioning should take place. This method is
      primarily intended to be able to recover from decoding errors.


In addition to the above methods, the :class:`StreamReader` must also inherit
all other methods and attributes from the underlying stream.
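
The reading methods can be exercised in the same way. The sketch below assumes
an in-memory :class:`io.BytesIO` stream holding UTF-8 data. The sample text and
variable names are made up for illustration.

```python
import codecs
import io

# Sample UTF-8 data; "è" occupies two bytes but counts as one character.
data = "première ligne\nzweite Zeile\nthird line\n".encode("utf-8")

# codecs.getreader() returns the StreamReader class registered for the codec.
reader = codecs.getreader("utf-8")(io.BytesIO(data))

# read() counts *chars* in decoded characters, not in bytes.
first = reader.read(chars=8)

# readline() returns the rest of the current line, line ending included.
rest_of_line = reader.readline()

# With keepends=False the line ending is stripped.
second = reader.readline(keepends=False)

# readlines() returns whatever lines remain on the stream.
remaining = reader.readlines()
```

Note that the reader buffers decoded characters internally, so mixing calls on
``reader`` with direct reads from the underlying stream is best avoided.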

The next two base classes are included for convenience. They are not needed by
the codec registry, but may prove useful in practice.


.. _stream-reader-writer:

StreamReaderWriter Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`StreamReaderWriter` allows wrapping streams which work in both read
and write modes.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.


.. class:: StreamReaderWriter(stream, Reader, Writer, errors)

   Creates a :class:`StreamReaderWriter` instance. *stream* must be a file-like
   object. *Reader* and *Writer* must be factory functions or classes providing the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively. Error
   handling is done in the same way as defined for the stream readers and writers.

:class:`StreamReaderWriter` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.
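
A short sketch of constructing such an object from the factories returned by
:func:`lookup` follows. The ``utf-16-le`` codec and the :class:`io.BytesIO`
stream are arbitrary choices made for the example.

```python
import codecs
import io

# lookup() returns a CodecInfo whose attributes include the stream factories.
info = codecs.lookup("utf-16-le")

raw = io.BytesIO()
rw = codecs.StreamReaderWriter(raw, info.streamreader, info.streamwriter, "strict")

rw.write("héllo")   # encoded by the writer half
rw.seek(0)          # rewind the underlying stream before reading
text = rw.read()    # decoded by the reader half
```

The bytes stored in ``raw`` are the UTF-16-LE encoding of the text, while the
caller only ever handles strings.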


.. _stream-recoder-objects:

StreamRecoder Objects
^^^^^^^^^^^^^^^^^^^^^

The :class:`StreamRecoder` provides a frontend-backend view of encoding data
which is sometimes useful when dealing with different encoding environments.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.


.. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors)

   Creates a :class:`StreamRecoder` instance which implements a two-way conversion:
   *encode* and *decode* work on the frontend (the data returned by :meth:`read`
   and passed to :meth:`write`) while *Reader* and *Writer* work on the backend
   (reading and writing to the stream).

   You can use these objects to do transparent direct recodings from e.g. Latin-1
   to UTF-8 and back.

   *stream* must be a file-like object.

   *encode*, *decode* must adhere to the :class:`Codec` interface. *Reader*,
   *Writer* must be factory functions or classes providing objects of the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively.

   *encode* and *decode* are needed for the frontend translation, *Reader* and
   *Writer* for the backend translation.

   Error handling is done in the same way as defined for the stream readers and
   writers.


:class:`StreamRecoder` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.
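
The Latin-1 to UTF-8 recoding mentioned above might look like the following
sketch. It assumes the backend stream holds Latin-1 bytes and the calling code
wants to see UTF-8 bytes. The variable names are illustrative.

```python
import codecs
import io

utf8 = codecs.lookup("utf-8")       # frontend: what the caller sees
latin1 = codecs.lookup("latin-1")   # backend: what the stream stores

# Reading: Latin-1 bytes on the stream come back as UTF-8 bytes.
source = io.BytesIO("café".encode("latin-1"))
recoder = codecs.StreamRecoder(
    source, utf8.encode, utf8.decode,
    latin1.streamreader, latin1.streamwriter, "strict")
utf8_view = recoder.read()

# Writing: UTF-8 bytes passed to write() are stored as Latin-1 bytes.
sink = io.BytesIO()
recoder_out = codecs.StreamRecoder(
    sink, utf8.encode, utf8.decode,
    latin1.streamreader, latin1.streamwriter, "strict")
recoder_out.write("é".encode("utf-8"))
```

Here *Reader* decodes the Latin-1 backend data and *encode* turns it into the
UTF-8 frontend form. On writing, *decode* and *Writer* perform the reverse.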


.. _encodings-overview:

Encodings and Unicode
---------------------

Strings are stored internally as sequences of codepoints (to be precise
as :c:type:`Py_UNICODE` arrays). Depending on the way Python is compiled (either
via ``--without-wide-unicode`` or ``--with-wide-unicode``, with the
former being the default) :c:type:`Py_UNICODE` is either a 16-bit or 32-bit data
type. Once a string object is used outside of CPU and memory, CPU endianness
and how these arrays are stored as bytes become an issue. Transforming a
string object into a sequence of bytes is called encoding and recreating the
string object from the sequence of bytes is known as decoding. There are many
different methods for how this transformation can be done (these methods are
also called encodings). The simplest method is to map the codepoints 0-255 to
the bytes ``0x0``-``0xff``. This means that a string object that contains
codepoints above ``U+00FF`` can't be encoded with this method (which is called
``'latin-1'`` or ``'iso-8859-1'``). :func:`str.encode` will raise a
:exc:`UnicodeEncodeError` that looks like this: ``UnicodeEncodeError: 'latin-1'
codec can't encode character '\u1234' in position 3: ordinal not in
range(256)``.
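
The failure described above can be reproduced with a few lines. This is only
an illustrative sketch, and the sample string is made up.

```python
# Code points up to U+00FF map directly to single bytes in Latin-1.
encoded_ok = "abc\u00ff".encode("latin-1")

# A code point above U+00FF cannot be represented and raises an error.
text = "abc\u1234"
try:
    text.encode("latin-1")
except UnicodeEncodeError as exc:
    error = exc

# The exception records where the problem occurred.
position = error.start
offending = error.object[error.start]
```

The ``start`` attribute of the exception corresponds to the position reported
in the error message.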

There's another group of encodings (the so-called charmap encodings) that choose
a different subset of all Unicode code points and how these codepoints are
mapped to the bytes ``0x0``-``0xff``. To see how this is done simply open
e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
Windows). There's a string constant with 256 characters that shows you which
character is mapped to which byte value.
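
The same information can be inspected from Python rather than by opening the
file. The sketch below assumes the string constant is exposed as
``decoding_table`` in the :mod:`encodings.cp1252` module, as it is in current
CPython releases.

```python
import encodings.cp1252 as cp1252

# One entry per byte value 0x00-0xff.
table = cp1252.decoding_table
table_size = len(table)

# Byte 0x80 is the EURO SIGN in cp1252 but a control character in Latin-1.
as_cp1252 = b"\x80".decode("cp1252")
as_latin1 = b"\x80".decode("latin-1")

# The table entry agrees with what the codec does.
from_table = table[0x80]
round_trip = "\u20ac".encode("cp1252")
```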

All of these encodings can only encode 256 of the 65536 (or 1114112) codepoints
defined in Unicode. A simple and straightforward way to store each Unicode
code point is to store each codepoint as two consecutive bytes. There are two
possibilities: Store the bytes in big endian or in little endian order. These
two encodings are called UTF-16-BE and UTF-16-LE respectively. Their
disadvantage is that if e.g. you use UTF-16-BE on a little endian machine you
will always have to swap bytes on encoding and decoding. UTF-16 avoids this
problem: Bytes will always be in natural endianness. When these bytes are read
by a CPU with a different endianness, then bytes have to be swapped though. To
be able to detect the endianness of a UTF-16 byte sequence, there's the
so-called BOM (the "Byte Order Mark"). This is the Unicode character ``U+FEFF``.
This character will be prepended to every UTF-16 byte sequence. The byte swapped
version of this character (``0xFFFE``) is an illegal character that may not
appear in a Unicode text. So when the first character in a UTF-16 byte sequence
appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
Unfortunately up to Unicode 4.0 the character ``U+FEFF`` had a second purpose as
a ``ZERO WIDTH NO-BREAK SPACE``: A character that has no width and doesn't allow
a word to be split. It can e.g. be used to give hints to a ligature algorithm.
With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
Unicode software still must be able to handle ``U+FEFF`` in both roles: As a BOM
it's a device to determine the storage layout of the encoded bytes, and vanishes
once the byte sequence has been decoded into a string; as a ``ZERO WIDTH
NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
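
The two roles of ``U+FEFF`` can be observed directly. The following sketch
uses the BOM constants defined in the :mod:`codecs` module. The one-character
sample string is arbitrary.

```python
import codecs

# Explicit byte orders: no BOM is written.
big = "a".encode("utf-16-be")
little = "a".encode("utf-16-le")

# Plain UTF-16 uses the machine's natural byte order and prepends a BOM.
with_bom = "a".encode("utf-16")
native_bom = with_bom[:2]

# On decoding, the BOM selects the byte order and then vanishes.
from_be = (codecs.BOM_UTF16_BE + big).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + little).decode("utf-16")

# With an explicit byte order, U+FEFF is decoded like any other character.
kept = (codecs.BOM_UTF16_BE + big).decode("utf-16-be")
```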

There's another encoding that is able to encode the full range of Unicode
characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
parts: Marker bits (the most significant bits) and payload bits. The marker bits
are a sequence of zero to six 1 bits followed by a 0 bit. Unicode characters are
encoded like this (with x being payload bits, which when concatenated give the
Unicode character):

+-----------------------------------+----------------------------------------------+
| Range                             | Encoding                                     |
+===================================+==============================================+
| ``U-00000000`` ... ``U-0000007F`` | 0xxxxxxx                                     |
+-----------------------------------+----------------------------------------------+
| ``U-00000080`` ... ``U-000007FF`` | 110xxxxx 10xxxxxx                            |
+-----------------------------------+----------------------------------------------+
| ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
+-----------------------------------+----------------------------------------------+
| ``U-00010000`` ... ``U-001FFFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
+-----------------------------------+----------------------------------------------+
| ``U-00200000`` ... ``U-03FFFFFF`` | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
+-----------------------------------+----------------------------------------------+
| ``U-04000000`` ... ``U-7FFFFFFF`` | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
|                                   | 10xxxxxx                                     |
+-----------------------------------+----------------------------------------------+

The least significant bit of the Unicode character is the rightmost x bit.
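
To make the bit patterns concrete, the sketch below fills them in by hand for
the first four rows of the table and compares the result with the built-in
codec. The helper name ``utf8_encode_code_point`` is hypothetical. The sketch
stops at ``U+10FFFF``, the highest code point Python strings can hold, and does
not reject surrogates as the real codec does.

```python
def utf8_encode_code_point(cp):
    """Encode one code point using the marker/payload layout from the table."""
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:                   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range for this sketch")


# One sample code point per row: 1, 2, 3 and 4 bytes.
samples = [0x24, 0xA2, 0x20AC, 0x10348]
encoded = [utf8_encode_code_point(cp) for cp in samples]
reference = [chr(cp).encode("utf-8") for cp in samples]
```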

As UTF-8 is an 8-bit encoding no BOM is required and any ``U+FEFF`` character in
the decoded string (even if it's the first character) is treated as a ``ZERO
WIDTH NO-BREAK SPACE``.

Without external information it's impossible to reliably determine which
encoding was used for encoding a string. Each charmap encoding can
decode any random byte sequence. However that's not possible with UTF-8, as
UTF-8 byte sequences have a structure that doesn't allow arbitrary byte
sequences. To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls
``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters
is written to the file, a UTF-8 encoded BOM (which looks like this as a byte
sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable
that any charmap encoded file starts with these byte values (which would e.g.
map to

   | LATIN SMALL LETTER I WITH DIAERESIS
   | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
   | INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a utf-8-sig encoding can be
correctly guessed from the byte sequence. So here the BOM is not used to be able
to determine the byte order used for generating the byte sequence, but as a
signature that helps in guessing the encoding. On encoding the utf-8-sig codec
will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
decoding utf-8-sig will skip those three bytes if they appear as the first three
bytes in the file.
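
The behaviour of the two codecs can be compared side by side. This is an
illustrative sketch with a made-up sample string.

```python
import codecs

# Encoding with utf-8-sig prepends the three signature bytes.
signed = "abc".encode("utf-8-sig")
has_signature = signed.startswith(codecs.BOM_UTF8)

# Decoding with utf-8-sig skips the signature, whether or not it is present.
with_sig = signed.decode("utf-8-sig")
without_sig = b"abc".decode("utf-8-sig")

# Plain utf-8 keeps U+FEFF as an ordinary first character.
plain = signed.decode("utf-8")

# Misread as iso-8859-1, the signature shows up as three unlikely characters.
misread = signed.decode("iso-8859-1")
```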

.. _standard-encodings:

Standard Encodings
------------------

Python comes with a number of codecs built-in, either implemented as C functions
or with dictionaries as mapping tables. The following table lists the codecs by
name, together with a few common aliases, and the languages for which the
encoding is likely used. Neither the list of aliases nor the list of languages
is meant to be exhaustive. Notice that spelling alternatives that only differ in
case or use a hyphen instead of an underscore are also valid aliases; therefore,
e.g. ``'utf-8'`` is a valid alias for the ``'utf_8'`` codec.
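
Alias resolution can be checked with :func:`lookup`, as in the sketch below.
It compares only the ``name`` attribute of the returned :class:`CodecInfo`
objects, and the unknown codec name is deliberately made up.

```python
import codecs

# Spellings that differ only in case or hyphen/underscore find the same codec.
utf8_names = {codecs.lookup(alias).name
              for alias in ("utf_8", "utf-8", "UTF-8", "Utf_8")}

# Aliases from the table resolve to the same codec as well.
latin1_names = {codecs.lookup(alias).name
                for alias in ("latin_1", "latin1", "iso-8859-1", "L1")}

# An unknown name raises LookupError.
try:
    codecs.lookup("no-such-codec-xyz")
except LookupError:
    unknown_rejected = True
else:
    unknown_rejected = False
```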

Many of the character sets support the same languages. They vary in individual
characters (e.g. whether the EURO SIGN is supported or not), and in the
assignment of characters to code positions. For the European languages in
particular, the following variants typically exist:

* an ISO 8859 codeset

* a Microsoft Windows code page, which is typically derived from an 8859 codeset,
  but replaces control characters with additional graphic characters

* an IBM EBCDIC code page

* an IBM PC code page, which is ASCII compatible

+-----------------+--------------------------------+--------------------------------+
| Codec           | Aliases                        | Languages                      |
+=================+================================+================================+
| ascii           | 646, us-ascii                  | English                        |
+-----------------+--------------------------------+--------------------------------+
| big5            | big5-tw, csbig5                | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| big5hkscs       | big5-hkscs, hkscs              | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp037           | IBM037, IBM039                 | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp424           | EBCDIC-CP-HE, IBM424           | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp437           | 437, IBM437                    | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp500           | EBCDIC-CP-BE, EBCDIC-CP-CH,    | Western Europe                 |
|                 | IBM500                         |                                |
+-----------------+--------------------------------+--------------------------------+
| cp720           |                                | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp737           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp775           | IBM775                         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp850           | 850, IBM850                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp852           | 852, IBM852                    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp855           | 855, IBM855                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp856           |                                | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp857           | 857, IBM857                    | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp858           | 858, IBM858                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp860           | 860, IBM860                    | Portuguese                     |
+-----------------+--------------------------------+--------------------------------+
| cp861           | 861, CP-IS, IBM861             | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| cp862           | 862, IBM862                    | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp863           | 863, IBM863                    | Canadian                       |
+-----------------+--------------------------------+--------------------------------+
| cp864           | IBM864                         | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp865           | 865, IBM865                    | Danish, Norwegian              |
+-----------------+--------------------------------+--------------------------------+
| cp866           | 866, IBM866                    | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| cp869           | 869, CP-GR, IBM869             | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp874           |                                | Thai                           |
+-----------------+--------------------------------+--------------------------------+
| cp875           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp932           | 932, ms932, mskanji, ms-kanji  | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| cp949           | 949, ms949, uhc                | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| cp950           | 950, ms950                     | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp1006          |                                | Urdu                           |
+-----------------+--------------------------------+--------------------------------+
| cp1026          | ibm1026                        | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1140          | ibm1140                        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1250          | windows-1250                   | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp1251          | windows-1251                   | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp1252          | windows-1252                   | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1253          | windows-1253                   | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp1254          | windows-1254                   | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1255          | windows-1255                   | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp1256          | windows-1256                   | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp1257          | windows-1257                   | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp1258          | windows-1258                   | Vietnamese                     |
+-----------------+--------------------------------+--------------------------------+
| euc_jp          | eucjp, ujis, u-jis             | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jis_2004    | jisx0213, eucjis2004           | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jisx0213    | eucjisx0213                    | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_kr          | euckr, korean, ksc5601,        | Korean                         |
|                 | ks_c-5601, ks_c-5601-1987,     |                                |
|                 | ksx1001, ks_x-1001             |                                |
+-----------------+--------------------------------+--------------------------------+
| gb2312          | chinese, csiso58gb231280, euc- | Simplified Chinese             |
|                 | cn, euccn, eucgb2312-cn,       |                                |
|                 | gb2312-1980, gb2312-80, iso-   |                                |
|                 | ir-58                          |                                |
+-----------------+--------------------------------+--------------------------------+
| gbk             | 936, cp936, ms936              | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| gb18030         | gb18030-2000                   | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| hz              | hzgb, hz-gb, hz-gb-2312        | Simplified Chinese             |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp      | csiso2022jp, iso2022jp,        | Japanese                       |
|                 | iso-2022-jp                    |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_1    | iso2022jp-1, iso-2022-jp-1     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2    | iso2022jp-2, iso-2022-jp-2     | Japanese, Korean, Simplified   |
|                 |                                | Chinese, Western Europe, Greek |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2004 | iso2022jp-2004,                | Japanese                       |
|                 | iso-2022-jp-2004               |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_3    | iso2022jp-3, iso-2022-jp-3     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_ext  | iso2022jp-ext, iso-2022-jp-ext | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_kr      | csiso2022kr, iso2022kr,        | Korean                         |
|                 | iso-2022-kr                    |                                |
+-----------------+--------------------------------+--------------------------------+
| latin_1         | iso-8859-1, iso8859-1, 8859,   | West Europe                    |
|                 | cp819, latin, latin1, L1       |                                |
+-----------------+--------------------------------+--------------------------------+
| iso8859_2       | iso-8859-2, latin2, L2         | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| iso8859_3       | iso-8859-3, latin3, L3         | Esperanto, Maltese             |
+-----------------+--------------------------------+--------------------------------+
| iso8859_4       | iso-8859-4, latin4, L4         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_5       | iso-8859-5, cyrillic           | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| iso8859_6       | iso-8859-6, arabic             | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_7       | iso-8859-7, greek, greek8      | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| iso8859_8       | iso-8859-8, hebrew             | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_9       | iso-8859-9, latin5, L5         | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| iso8859_10      | iso-8859-10, latin6, L6        | Nordic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_13      | iso-8859-13, latin7, L7        | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_14      | iso-8859-14, latin8, L8        | Celtic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_15      | iso-8859-15, latin9, L9        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| iso8859_16      | iso-8859-16, latin10, L10      | South-Eastern Europe           |
+-----------------+--------------------------------+--------------------------------+
| johab           | cp1361, ms1361                 | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| koi8_r          |                                | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| koi8_u          |                                | Ukrainian                      |
+-----------------+--------------------------------+--------------------------------+
| mac_cyrillic    | maccyrillic                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| mac_greek       | macgreek                       | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| mac_iceland     | maciceland                     | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| mac_latin2      | maclatin2, maccentraleurope    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| mac_roman       | macroman, macintosh            | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| mac_turkish     | macturkish                     | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| ptcp154         | csptcp154, pt154, cp154,       | Kazakh                         |
|                 | cyrillic-asian                 |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis       | csshiftjis, shiftjis, sjis,    | Japanese                       |
|                 | s_jis                          |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis_2004  | shiftjis2004, sjis_2004,       | Japanese                       |
|                 | sjis2004                       |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jisx0213  | shiftjisx0213, sjisx0213,      | Japanese                       |
|                 | s_jisx0213                     |                                |
+-----------------+--------------------------------+--------------------------------+
| utf_32          | U32, utf32                     | all languages                  |

+-----------------+--------------------------------+--------------------------------+

1111

| utf_32_be | UTF-32BE | all languages |

1112

+-----------------+--------------------------------+--------------------------------+

1113

| utf_32_le | UTF-32LE | all languages |

1114

+-----------------+--------------------------------+--------------------------------+

Georg Brandl

2007-08-15 14:28:22 +0000

[diff] [blame]

1115

| utf_16 | U16, utf16 | all languages |

1116

+-----------------+--------------------------------+--------------------------------+

1117

| utf_16_be | UTF-16BE | all languages (BMP only) |

1118

+-----------------+--------------------------------+--------------------------------+

1119

| utf_16_le | UTF-16LE | all languages (BMP only) |

1120

+-----------------+--------------------------------+--------------------------------+

1121

| utf_7 | U7, unicode-1-1-utf-7 | all languages |

1122

+-----------------+--------------------------------+--------------------------------+

1123

| utf_8 | U8, UTF, utf8 | all languages |

1124

+-----------------+--------------------------------+--------------------------------+

1125

| utf_8_sig | | all languages |

1126

+-----------------+--------------------------------+--------------------------------+

.. XXX fix here, should be in above table

+--------------------+---------+---------------------------+
| Codec              | Aliases | Purpose                   |
+====================+=========+===========================+
| idna               |         | Implements :rfc:`3490`,   |
|                    |         | see also                  |
|                    |         | :mod:`encodings.idna`     |
+--------------------+---------+---------------------------+
| mbcs               | dbcs    | Windows only: Encode      |
|                    |         | operand according to the  |
|                    |         | ANSI codepage (CP_ACP)    |
+--------------------+---------+---------------------------+
| palmos             |         | Encoding of PalmOS 3.5    |
+--------------------+---------+---------------------------+
| punycode           |         | Implements :rfc:`3492`    |
+--------------------+---------+---------------------------+
| raw_unicode_escape |         | Produce a string that is  |
|                    |         | suitable as raw Unicode   |
|                    |         | literal in Python source  |
|                    |         | code                      |
+--------------------+---------+---------------------------+
| undefined          |         | Raise an exception for    |
|                    |         | all conversions. Can be   |
|                    |         | used as the system        |
|                    |         | encoding if no automatic  |
|                    |         | coercion between byte and |
|                    |         | Unicode strings is        |
|                    |         | desired.                  |
+--------------------+---------+---------------------------+
| unicode_escape     |         | Produce a string that is  |
|                    |         | suitable as Unicode       |
|                    |         | literal in Python source  |
|                    |         | code                      |
+--------------------+---------+---------------------------+
| unicode_internal   |         | Return the internal       |
|                    |         | representation of the     |
|                    |         | operand                   |
+--------------------+---------+---------------------------+
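
As a rough illustration of how the codec names and aliases listed in these
tables can be used, the sketch below resolves an alias with
:func:`codecs.lookup` and encodes a few strings with :meth:`str.encode`. The
variable names and sample strings are illustrative only.

```python
import codecs

# An alias from the table resolves to the same codec as the canonical name.
latin9 = codecs.lookup("latin9")
iso_15 = codecs.lookup("iso8859_15")
same_codec = latin9.name == iso_15.name

# ISO-8859-15 places the euro sign at byte 0xA4.
euro_bytes = "\u20ac".encode("latin9")

# utf_8_sig writes a UTF-8 encoded BOM when encoding and strips it on decoding.
signed = "abc".encode("utf_8_sig")
roundtrip = signed.decode("utf_8_sig")

# unicode_escape yields ASCII-only bytes that use Python escape sequences.
escaped = "caf\u00e9".encode("unicode_escape")

print(same_codec, euro_bytes, signed, roundtrip, escaped)
```

Codec names are normalized during lookup, so spellings such as ``latin9`` and
``iso8859_15`` select the same codec.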

The following codecs provide bytes-to-bytes mappings. They can be used with
:meth:`bytes.transform` and :meth:`bytes.untransform`.

+--------------------+---------------------------+---------------------------+
| Codec              | Aliases                   | Purpose                   |
+====================+===========================+===========================+
| base64_codec       | base64, base-64           | Convert operand to MIME   |
|                    |                           | base64                    |
+--------------------+---------------------------+---------------------------+
| bz2_codec          | bz2                       | Compress the operand      |
|                    |                           | using bz2                 |
+--------------------+---------------------------+---------------------------+
| hex_codec          | hex                       | Convert operand to        |
|                    |                           | hexadecimal               |
|                    |                           | representation, with two  |
|                    |                           | digits per byte           |
+--------------------+---------------------------+---------------------------+
| quopri_codec       | quopri, quoted-printable, | Convert operand to MIME   |
|                    | quotedprintable           | quoted printable          |
+--------------------+---------------------------+---------------------------+
| uu_codec           | uu                        | Convert the operand using |
|                    |                           | uuencode                  |
+--------------------+---------------------------+---------------------------+
| zlib_codec         | zip, zlib                 | Compress the operand      |
|                    |                           | using gzip                |
+--------------------+---------------------------+---------------------------+
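
A minimal sketch of the bytes-to-bytes codecs in use. It assumes the
module-level :func:`codecs.encode` and :func:`codecs.decode` functions rather
than the ``transform`` methods named above, because those functions accept the
canonical codec names from the table directly. The payload is arbitrary.

```python
import codecs

payload = b"hello world"

# Two hexadecimal digits per input byte.
hexed = codecs.encode(payload, "hex_codec")

# MIME base64; the codec appends a trailing newline to its output.
b64 = codecs.encode(payload, "base64_codec")
unb64 = codecs.decode(b64, "base64_codec")

# Compression pays off on repetitive input.
repeated = payload * 100
packed = codecs.encode(repeated, "zlib_codec")
restored = codecs.decode(packed, "zlib_codec")

print(hexed, b64.strip(), len(repeated), len(packed))
```

Both the input and the output of these codecs are :class:`bytes` objects, so
they can be chained freely.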

The following codecs provide string-to-string mappings. They can be used with
:meth:`str.transform` and :meth:`str.untransform`.

+--------------------+---------------------------+---------------------------+
| Codec              | Aliases                   | Purpose                   |
+====================+===========================+===========================+
| rot_13             | rot13                     | Returns the Caesar-cypher |
|                    |                           | encryption of the operand |
+--------------------+---------------------------+---------------------------+

.. versionadded:: 3.2
   bytes-to-bytes and string-to-string codecs.

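
The string-to-string codec can be exercised in the same way. This sketch again
assumes :func:`codecs.encode` and :func:`codecs.decode` instead of the
``transform`` methods; the sample text is arbitrary.

```python
import codecs

text = "Hello, World!"

# rot_13 shifts each ASCII letter by 13 places and leaves other characters alone.
scrambled = codecs.encode(text, "rot_13")

# Applying the codec in the other direction restores the original text.
recovered = codecs.decode(scrambled, "rot_13")

print(scrambled, recovered)
```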

:mod:`encodings.idna` --- Internationalized Domain Names in Applications
------------------------------------------------------------------------

.. module:: encodings.idna
   :synopsis: Internationalized Domain Names implementation
.. moduleauthor:: Martin v. Löwis

This module implements :rfc:`3490` (Internationalized Domain Names in
Applications) and :rfc:`3491` (Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)). It builds upon the ``punycode`` encoding
and :mod:`stringprep`.

These RFCs together define a protocol to support non-ASCII characters in domain
names. A domain name containing non-ASCII characters (such as
``www.Alliancefrançaise.nu``) is converted into an ASCII-compatible encoding
(ACE, such as ``www.xn--alliancefranaise-npb.nu``). The ACE form of the domain
name is then used in all places where arbitrary characters are not allowed by
the protocol, such as DNS queries, HTTP :mailheader:`Host` fields, and so
on. This conversion is carried out in the application and, if possible, is
invisible to the user: the application should transparently convert Unicode
domain labels to IDNA on the wire, and convert ACE labels back to Unicode
before presenting them to the user.

Python supports this conversion in several ways. The ``idna`` codec converts
between Unicode and ACE. Furthermore, the :mod:`socket` module
transparently converts Unicode host names to ACE, so that applications need not
be concerned about converting host names themselves when they pass them to the
socket module. On top of that, modules that have host names as function
parameters, such as :mod:`http.client` and :mod:`ftplib`, accept Unicode host
names (:mod:`http.client` then also transparently sends an IDNA hostname in the
:mailheader:`Host` field if it sends that field at all).
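
A short sketch of the ``idna`` codec, using the domain name from the example
above. The variable names are illustrative.

```python
# Unicode host name as a user might type it.
name = "www.Alliancefrançaise.nu"

# Encoding applies nameprep and Punycode to each label and returns ASCII bytes.
ace = name.encode("idna")

# Decoding restores Unicode; nameprep has already lower-cased the labels.
shown = ace.decode("idna")

print(ace, shown)
```

Because nameprep folds case during encoding, the decoded name is the
lower-cased form rather than the exact string that was typed.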

When receiving host names from the wire (such as in reverse name lookup), no
automatic conversion to Unicode is performed: applications wishing to present
such host names to the user should decode them to Unicode.

The module :mod:`encodings.idna` also implements the nameprep procedure, which
performs certain normalizations on host names, to achieve case-insensitivity of
international domain names, and to unify similar characters. The nameprep
functions can be used directly if desired.


.. function:: nameprep(label)

   Return the nameprepped version of *label*. The implementation currently assumes
   query strings, so ``AllowUnassigned`` is true.


.. function:: ToASCII(label)

   Convert a label to ASCII, as specified in :rfc:`3490`. ``UseSTD3ASCIIRules`` is
   assumed to be false.


.. function:: ToUnicode(label)

   Convert a label to Unicode, as specified in :rfc:`3490`.

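
The three functions operate on a single label, not on a full dotted host name.
The following sketch shows one way they might be called directly; the sample
labels are illustrative.

```python
from encodings.idna import nameprep, ToASCII, ToUnicode

# nameprep normalizes a label; among other things it folds case.
prepped = nameprep("Example")

# ToASCII returns the ACE form of one label as bytes.
ace_label = ToASCII("bücher")

# ToUnicode reverses the conversion for an ACE label.
unicode_label = ToUnicode(b"xn--bcher-kva")

print(prepped, ace_label, unicode_label)
```

For complete host names the ``idna`` codec is usually more convenient, since it
splits the name into labels before converting each one.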

:mod:`encodings.mbcs` --- Windows ANSI codepage
-----------------------------------------------

.. module:: encodings.mbcs
   :synopsis: Windows ANSI codepage

Encode operand according to the ANSI codepage (CP_ACP). This codec only
supports ``'strict'`` and ``'replace'`` error handlers to encode, and
``'strict'`` and ``'ignore'`` error handlers to decode.

Availability: Windows only.

.. versionchanged:: 3.2
   Before 3.2, the *errors* argument was ignored; ``'replace'`` was always used
   to encode, and ``'ignore'`` to decode.


:mod:`encodings.utf_8_sig` --- UTF-8 codec with BOM signature
-------------------------------------------------------------

.. module:: encodings.utf_8_sig
   :synopsis: UTF-8 codec with BOM signature
.. moduleauthor:: Walter Dörwald