:mod:`codecs` --- Codec registry and base classes
=================================================

.. module:: codecs
   :synopsis: Encode and decode data and streams.

.. moduleauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>

**Source code:** :source:`Lib/codecs.py`

.. index::
   single: Unicode
   single: Codecs
   pair: Codecs; encode
   pair: Codecs; decode
   single: streams
   pair: stackable; streams

--------------

This module defines base classes for standard Python codecs (encoders and
decoders) and provides access to the internal Python codec registry, which
manages the codec and error handling lookup process. Most standard codecs
are :term:`text encodings <text encoding>`, which encode text to bytes,
but there are also codecs provided that encode text to text, and bytes to
bytes. Custom codecs may encode and decode between arbitrary types, but some
module features are restricted to use specifically with
:term:`text encodings <text encoding>`, or with codecs that encode to
:class:`bytes`.

The module defines the following functions for encoding and decoding with
any codec:

.. function:: encode(obj, encoding='utf-8', errors='strict')

   Encodes *obj* using the codec registered for *encoding*.

   *errors* may be given to set the desired error handling scheme. The
   default error handler is ``'strict'``, meaning that encoding errors raise
   :exc:`ValueError` (or a more codec-specific subclass, such as
   :exc:`UnicodeEncodeError`). Refer to :ref:`codec-base-classes` for more
   information on codec error handling.

.. function:: decode(obj, encoding='utf-8', errors='strict')

   Decodes *obj* using the codec registered for *encoding*.

   *errors* may be given to set the desired error handling scheme. The
   default error handler is ``'strict'``, meaning that decoding errors raise
   :exc:`ValueError` (or a more codec-specific subclass, such as
   :exc:`UnicodeDecodeError`). Refer to :ref:`codec-base-classes` for more
   information on codec error handling.
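
As a rough illustration of :func:`encode` and :func:`decode`, the sketch below
runs a text encoding, a text-to-text codec and a bytes-to-bytes codec through
the same two functions. The sample strings are arbitrary choices made for this
example.

```python
import codecs

# Text encoding: str -> bytes and back again.
data = codecs.encode("café", "utf-8")
text = codecs.decode(data, "utf-8")

# Text-to-text and bytes-to-bytes codecs go through the same functions.
rot = codecs.encode("Hello", "rot_13")
hexed = codecs.encode(b"\x00\xff", "hex_codec")

# With the default 'strict' handler an unencodable character raises
# UnicodeEncodeError, which is a subclass of ValueError.
try:
    codecs.encode("café", "ascii")
except ValueError as exc:
    strict_error = type(exc).__name__

# A different error handler changes the outcome instead of raising.
lenient = codecs.encode("café", "ascii", errors="replace")
```

Catching :exc:`ValueError` rather than :exc:`UnicodeEncodeError` keeps the
handler valid for codecs that raise a different subclass.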

The full details for each codec can also be looked up directly:

.. function:: lookup(encoding)

   Looks up the codec info in the Python codec registry and returns a
   :class:`CodecInfo` object as defined below.

   Encodings are first looked up in the registry's cache. If not found, the list of
   registered search functions is scanned. If no :class:`CodecInfo` object is
   found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
   is stored in the cache and returned to the caller.

.. class:: CodecInfo(encode, decode, streamreader=None, streamwriter=None, incrementalencoder=None, incrementaldecoder=None, name=None)

   Codec details returned when looking up the codec registry. The constructor
   arguments are stored in attributes of the same name:


   .. attribute:: name

      The name of the encoding.


   .. attribute:: encode
                  decode

      The stateless encoding and decoding functions. These must be
      functions or methods which have the same interface as
      the :meth:`~Codec.encode` and :meth:`~Codec.decode` methods of Codec
      instances (see :ref:`Codec Interface <codec-objects>`).
      The functions or methods are expected to work in a stateless mode.


   .. attribute:: incrementalencoder
                  incrementaldecoder

      Incremental encoder and decoder classes or factory functions.
      These have to provide the interface defined by the base classes
      :class:`IncrementalEncoder` and :class:`IncrementalDecoder`,
      respectively. Incremental codecs can maintain state.


   .. attribute:: streamwriter
                  streamreader

      Stream writer and reader classes or factory functions. These have to
      provide the interface defined by the base classes
      :class:`StreamWriter` and :class:`StreamReader`, respectively.
      Stream codecs can maintain state.
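
The following sketch shows one way to inspect a :class:`CodecInfo` object
returned by :func:`lookup`. The encoding names used here are only examples,
and ``"no-such-encoding"`` is assumed not to be registered.

```python
import codecs

info = codecs.lookup("UTF8")    # the name is normalized during lookup
name = info.name                # canonical name stored on the CodecInfo

# The stateless functions return a tuple (output object, length consumed).
encoded, chars_consumed = info.encode("héllo")
decoded, bytes_consumed = info.decode(encoded)

# An unknown encoding raises LookupError.
try:
    codecs.lookup("no-such-encoding")
except LookupError:
    missing = True
else:
    missing = False
```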

To simplify access to the various codec components, the module provides
these additional functions which use :func:`lookup` for the codec lookup:

.. function:: getencoder(encoding)

   Look up the codec for the given encoding and return its encoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getdecoder(encoding)

   Look up the codec for the given encoding and return its decoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getincrementalencoder(encoding)

   Look up the codec for the given encoding and return its incremental encoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental encoder.


.. function:: getincrementaldecoder(encoding)

   Look up the codec for the given encoding and return its incremental decoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental decoder.


.. function:: getreader(encoding)

   Look up the codec for the given encoding and return its :class:`StreamReader`
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getwriter(encoding)

   Look up the codec for the given encoding and return its :class:`StreamWriter`
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.
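
A minimal sketch of three of these accessors follows. The encodings and the
in-memory :class:`io.BytesIO` stream are arbitrary choices for the example.

```python
import codecs
import io

# Stateless encoder function: returns (bytes, characters consumed).
encode_latin1 = codecs.getencoder("latin-1")
payload, consumed = encode_latin1("ñ")

# Incremental decoder class: an instance buffers incomplete byte sequences.
decoder = codecs.getincrementaldecoder("utf-8")()
euro = "€".encode("utf-8")                    # three bytes
first = decoder.decode(euro[:1])              # nothing can be produced yet
rest = decoder.decode(euro[1:], final=True)   # the character is now complete

# StreamWriter class: wraps a binary stream and encodes on write.
buffer = io.BytesIO()
writer = codecs.getwriter("utf-16-le")(buffer)
writer.write("hi")
written = buffer.getvalue()
```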

Custom codecs are made available by registering a suitable codec search
function:

.. function:: register(search_function)

   Register a codec search function. Search functions are expected to take one
   argument, being the encoding name in all lower case letters, and return a
   :class:`CodecInfo` object. In case a search function cannot find
   a given encoding, it should return ``None``.

   .. note::

      Search function registration is not currently reversible,
      which may cause problems in some cases, such as unit testing or
      module reloading.
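
As a sketch of what a search function can look like, the example below
registers a hypothetical encoding name, ``demoutf8``, that simply reuses the
components of the standard UTF-8 codec. Both the name and the function
``demo_search`` are invented for this illustration.

```python
import codecs

def demo_search(name):
    """Hypothetical search function that only knows the name 'demoutf8'."""
    if name != "demoutf8":
        return None          # let other search functions handle the name
    utf8 = codecs.lookup("utf-8")
    return codecs.CodecInfo(
        encode=utf8.encode,
        decode=utf8.decode,
        streamreader=utf8.streamreader,
        streamwriter=utf8.streamwriter,
        incrementalencoder=utf8.incrementalencoder,
        incrementaldecoder=utf8.incrementaldecoder,
        name="demoutf8",
    )

codecs.register(demo_search)

# The requested name is lowercased before it reaches the search function.
encoded = "snowman ☃".encode("DemoUTF8")
registered_name = codecs.lookup("demoutf8").name
```

Returning ``None`` for every other name matters, because each registered
search function is asked in turn until one returns a :class:`CodecInfo`.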

While the builtin :func:`open` and the associated :mod:`io` module are the
recommended approach for working with encoded text files, this module
provides additional utility functions and classes that allow the use of a
wider range of codecs when working with binary files:

.. function:: open(filename, mode='r', encoding=None, errors='strict', buffering=-1)

   Open an encoded file using the given *mode* and return an instance of
   :class:`StreamReaderWriter`, providing transparent encoding/decoding.
   The default file mode is ``'r'``, meaning to open the file in read mode.

   .. note::

      Underlying encoded files are always opened in binary mode.
      No automatic conversion of ``'\n'`` is done on reading and writing.
      The *mode* argument may be any binary mode acceptable to the built-in
      :func:`open` function; the ``'b'`` is automatically added.

   *encoding* specifies the encoding which is to be used for the file.
   Any encoding that encodes to and decodes from bytes is allowed, and
   the data types supported by the file methods depend on the codec used.

   *errors* may be given to define the error handling. It defaults to ``'strict'``,
   which causes a :exc:`ValueError` to be raised in case an encoding error occurs.

   *buffering* has the same meaning as for the built-in :func:`open` function.
   It defaults to -1, which means that the default buffer size will be used.
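
A small round trip through :func:`codecs.open` might look like the sketch
below. The temporary directory and the file name ``demo.txt`` are assumptions
made for the example.

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# The file is opened in binary mode underneath, so '\n' is written unchanged.
with codecs.open(path, "w", encoding="utf-16") as f:
    f.write("line one\nline two\n")

with codecs.open(path, "r", encoding="utf-16") as f:
    text = f.read()

# Reading the raw bytes shows the byte order mark written by the codec.
with open(path, "rb") as raw:
    head = raw.read(2)
```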

.. function:: EncodedFile(file, data_encoding, file_encoding=None, errors='strict')

   Return a :class:`StreamRecoder` instance, a wrapped version of *file*
   which provides transparent transcoding. The original file is closed
   when the wrapped version is closed.

   Data written to the wrapped file is decoded according to the given
   *data_encoding* and then written to the original file as bytes using
   *file_encoding*. Bytes read from the original file are decoded
   according to *file_encoding*, and the result is encoded
   using *data_encoding*.

   If *file_encoding* is not given, it defaults to *data_encoding*.

   *errors* may be given to define the error handling. It defaults to
   ``'strict'``, which causes :exc:`ValueError` to be raised in case an encoding
   error occurs.
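
The direction of the two encodings is easy to mix up, so here is a sketch
using an in-memory :class:`io.BytesIO` object in place of a real file. The
choice of Latin-1 for the data and UTF-8 for the file is arbitrary.

```python
import codecs
import io

raw = io.BytesIO()    # stands in for a binary file that stores UTF-8
wrapped = codecs.EncodedFile(raw, data_encoding="latin-1",
                             file_encoding="utf-8")

# The caller hands over Latin-1 bytes; the file receives UTF-8 bytes.
wrapped.write("é".encode("latin-1"))
stored = raw.getvalue()

# Reading goes the other way: UTF-8 in the file, Latin-1 for the caller.
raw.seek(0)
read_back = wrapped.read()
```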

.. function:: iterencode(iterator, encoding, errors='strict', **kwargs)

   Uses an incremental encoder to iteratively encode the input provided by
   *iterator*. This function is a :term:`generator`.
   The *errors* argument (as well as any
   other keyword argument) is passed through to the incremental encoder.

   This function requires that the codec accept text :class:`str` objects
   to encode. Therefore it does not support bytes-to-bytes encoders such as
   ``base64_codec``.
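
For illustration, the sketch below feeds a list of string chunks through
:func:`iterencode`. The chunk contents are arbitrary.

```python
import codecs

chunks = ["Grüße", ", ", "Welt"]

# Each input chunk is encoded as it is consumed from the iterator.
joined = b"".join(codecs.iterencode(chunks, "utf-8"))

# The errors argument is forwarded to the incremental encoder.
lenient = b"".join(codecs.iterencode(["a€"], "ascii", errors="replace"))
```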

.. function:: iterdecode(iterator, encoding, errors='strict', **kwargs)

   Uses an incremental decoder to iteratively decode the input provided by
   *iterator*. This function is a :term:`generator`.
   The *errors* argument (as well as any
   other keyword argument) is passed through to the incremental decoder.

   This function requires that the codec accept :class:`bytes` objects
   to decode. Therefore it does not support text-to-text encoders such as
   ``rot_13``, although ``rot_13`` may be used equivalently with
   :func:`iterencode`.
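
The sketch below decodes UTF-8 data that has been cut into two-byte chunks,
so that some chunk boundaries fall inside a multi-byte character. The chunk
size is an arbitrary choice for the example.

```python
import codecs

data = "naïve €".encode("utf-8")

# Two-byte chunks split some multi-byte characters across chunk boundaries.
chunks = [data[i:i + 2] for i in range(0, len(data), 2)]
text = "".join(codecs.iterdecode(chunks, "utf-8"))

# rot_13 is text-to-text, so it goes through iterencode() instead.
rotated = "".join(codecs.iterencode(["Hello"], "rot_13"))
```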

The module also provides the following constants which are useful for reading
and writing to platform-dependent files:


.. data:: BOM
          BOM_BE
          BOM_LE
          BOM_UTF8
          BOM_UTF16
          BOM_UTF16_BE
          BOM_UTF16_LE
          BOM_UTF32
          BOM_UTF32_BE
          BOM_UTF32_LE

   These constants define various byte sequences,
   being Unicode byte order marks (BOMs) for several encodings. They are
   used in UTF-16 and UTF-32 data streams to indicate the byte order used,
   and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either
   :const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's
   native byte order. :const:`BOM` is an alias for :const:`BOM_UTF16`,
   :const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for
   :const:`BOM_UTF16_BE`. The others represent the BOM in UTF-8 and UTF-32
   encodings.
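
One possible use of these constants is sniffing the encoding of raw bytes.
The helper ``sniff_bom`` below is hypothetical and not part of the module. It
checks the UTF-32 marks first because :const:`BOM_UTF32_LE` begins with the
same two bytes as :const:`BOM_UTF16_LE`.

```python
import codecs
import sys

# Longer marks come first so that UTF-32 is not mistaken for UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return (encoding, payload without BOM), or (None, data) if no BOM."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return None, data

encoding, payload = sniff_bom(codecs.BOM_UTF16_BE + "hi".encode("utf-16-be"))

# BOM_UTF16 follows the native byte order of the platform.
native = codecs.BOM_UTF16_LE if sys.byteorder == "little" else codecs.BOM_UTF16_BE
```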

.. _codec-base-classes:

Codec Base Classes
------------------

The :mod:`codecs` module defines a set of base classes which define the
interfaces for working with codec objects, and can also be used as the basis
for custom codec implementations.

Each codec has to define four interfaces to make it usable as a codec in Python:
stateless encoder, stateless decoder, stream reader and stream writer. The
stream reader and writer typically reuse the stateless encoder/decoder to
implement the file protocols. Codec authors also need to define how the
codec will handle encoding and decoding errors.

.. _surrogateescape:
.. _error-handlers:

Error Handlers
^^^^^^^^^^^^^^

To simplify and standardize error handling,
codecs may implement different error handling schemes by
accepting the *errors* string argument. The following string values are
defined and implemented by all standard Python codecs:

.. tabularcolumns:: |l|L|

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'strict'``            | Raise :exc:`UnicodeError` (or a subclass);    |
|                         | this is the default. Implemented in           |
|                         | :func:`strict_errors`.                        |
+-------------------------+-----------------------------------------------+
| ``'ignore'``            | Ignore the malformed data and continue        |
|                         | without further notice. Implemented in        |
|                         | :func:`ignore_errors`.                        |
+-------------------------+-----------------------------------------------+

The following error handlers are only applicable to
:term:`text encodings <text encoding>`:

.. index::
   single: ? (question mark); replacement character
   single: \ (backslash); escape sequence
   single: \x; escape sequence
   single: \u; escape sequence
   single: \U; escape sequence
   single: \N; escape sequence

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'replace'``           | Replace with a suitable replacement           |
|                         | marker; Python will use the official          |
|                         | ``U+FFFD`` REPLACEMENT CHARACTER for the      |
|                         | built-in codecs on decoding, and '?' on       |
|                         | encoding. Implemented in                      |
|                         | :func:`replace_errors`.                       |
+-------------------------+-----------------------------------------------+
| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character    |
|                         | reference (only for encoding). Implemented    |
|                         | in :func:`xmlcharrefreplace_errors`.          |
+-------------------------+-----------------------------------------------+
| ``'backslashreplace'``  | Replace with backslashed escape sequences.    |
|                         | Implemented in                                |
|                         | :func:`backslashreplace_errors`.              |
+-------------------------+-----------------------------------------------+
| ``'namereplace'``       | Replace with ``\N{...}`` escape sequences     |
|                         | (only for encoding). Implemented in           |
|                         | :func:`namereplace_errors`.                   |
+-------------------------+-----------------------------------------------+
| ``'surrogateescape'``   | On decoding, replace byte with individual     |
|                         | surrogate code ranging from ``U+DC80`` to     |
|                         | ``U+DCFF``. This code will then be turned     |
|                         | back into the same byte when the              |
|                         | ``'surrogateescape'`` error handler is used   |
|                         | when encoding the data. (See :pep:`383` for   |
|                         | more.)                                        |
+-------------------------+-----------------------------------------------+

In addition, the following error handler is specific to the given codecs:

+-------------------+------------------------+-------------------------------------------+
| Value             | Codecs                 | Meaning                                   |
+===================+========================+===========================================+
|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding of surrogate  |
|                   | utf-16-be, utf-16-le,  | codes. These codecs normally treat the    |
|                   | utf-32-be, utf-32-le   | presence of surrogates as an error.       |
+-------------------+------------------------+-------------------------------------------+

.. versionadded:: 3.1
   The ``'surrogateescape'`` and ``'surrogatepass'`` error handlers.

.. versionchanged:: 3.4
   The ``'surrogatepass'`` error handler now works with utf-16\* and utf-32\* codecs.

.. versionadded:: 3.5
   The ``'namereplace'`` error handler.

.. versionchanged:: 3.5
   The ``'backslashreplace'`` error handler now works with decoding and
   translating.
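
To make the differences between the handlers concrete, the sketch below
applies several of them to the same input. The sample string and the invalid
byte sequence are arbitrary choices for the example.

```python
import codecs

text = "naïve €"

# Encoding to ASCII with each of the text-encoding error handlers.
results = {
    handler: text.encode("ascii", errors=handler)
    for handler in ("ignore", "replace", "xmlcharrefreplace",
                    "backslashreplace", "namereplace")
}

# Decoding: 0xFF is never valid in UTF-8.
bad = b"abc\xff"
replaced = bad.decode("utf-8", errors="replace")
escaped = bad.decode("utf-8", errors="surrogateescape")

# 'surrogateescape' restores the original byte when encoding again.
round_trip = escaped.encode("utf-8", errors="surrogateescape")

# 'surrogatepass' lets a lone surrogate through the UTF-8 codec.
lone_surrogate = "\ud800".encode("utf-8", errors="surrogatepass")
```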

The set of allowed values can be extended by registering a new named error
handler:

.. function:: register_error(name, error_handler)

   Register the error handling function *error_handler* under the name *name*.
   The *error_handler* argument will be called during encoding and decoding
   in case of an error, when *name* is specified as the errors parameter.

   For encoding, *error_handler* will be called with a :exc:`UnicodeEncodeError`
   instance, which contains information about the location of the error. The
   error handler must either raise this or a different exception, or return a
   tuple with a replacement for the unencodable part of the input and a position
   where encoding should continue. The replacement may be either :class:`str` or
   :class:`bytes`. If the replacement is bytes, the encoder will simply copy
   them into the output buffer. If the replacement is a string, the encoder will
   encode the replacement. Encoding continues on the original input at the
   specified position. Negative position values will be treated as being
   relative to the end of the input string. If the resulting position is out of
   bounds, an :exc:`IndexError` will be raised.

   Decoding and translating work similarly, except that :exc:`UnicodeDecodeError` or
   :exc:`UnicodeTranslateError` will be passed to the handler and the
   replacement from the error handler will be put into the output directly.
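
A minimal sketch of a custom handler follows. The handler name
``demo-underscore`` and the function ``underscore_errors`` are invented for
this example, and the handler only deals with encoding errors.

```python
import codecs

def underscore_errors(exc):
    """Hypothetical handler: replace each unencodable character with '_'."""
    if isinstance(exc, UnicodeEncodeError):
        # Return the replacement and the position at which to resume.
        return "_" * (exc.end - exc.start), exc.end
    raise exc    # decoding and translating errors are not handled here

codecs.register_error("demo-underscore", underscore_errors)

encoded = "naïve €".encode("ascii", errors="demo-underscore")
```

Because the replacement is a :class:`str`, the codec encodes it in turn, so it
has to consist of characters the target encoding can represent.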

Previously registered error handlers (including the standard error handlers)
can be looked up by name:

.. function:: lookup_error(name)

   Return the error handler previously registered under the name *name*.

   Raises a :exc:`LookupError` in case the handler cannot be found.
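
A handler obtained this way can also be called directly, as in the sketch
below. The exception instance is constructed by hand purely for illustration,
and ``"no-such-handler"`` is assumed not to be registered.

```python
import codecs

handler = codecs.lookup_error("replace")

# Build an encoding error by hand: encoding, object, start, end, reason.
exc = UnicodeEncodeError("ascii", "é", 0, 1, "ordinal not in range(128)")
replacement, resume_at = handler(exc)

try:
    codecs.lookup_error("no-such-handler")
except LookupError:
    unknown = True
else:
    unknown = False
```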

The following standard error handlers are also made available as module level
functions:

.. function:: strict_errors(exception)

   Implements the ``'strict'`` error handling: each encoding or
   decoding error raises a :exc:`UnicodeError`.


.. function:: replace_errors(exception)

   Implements the ``'replace'`` error handling (for :term:`text encodings
   <text encoding>` only): substitutes ``'?'`` for encoding errors
   (to be encoded by the codec), and ``'\ufffd'`` (the Unicode replacement
   character) for decoding errors.


.. function:: ignore_errors(exception)

   Implements the ``'ignore'`` error handling: malformed data is ignored and
   encoding or decoding is continued without further notice.


.. function:: xmlcharrefreplace_errors(exception)

   Implements the ``'xmlcharrefreplace'`` error handling (for encoding with
   :term:`text encodings <text encoding>` only): the
   unencodable character is replaced by an appropriate XML character reference.


.. function:: backslashreplace_errors(exception)

   Implements the ``'backslashreplace'`` error handling (for
   :term:`text encodings <text encoding>` only): malformed data is
   replaced by a backslashed escape sequence.

.. function:: namereplace_errors(exception)

   Implements the ``'namereplace'`` error handling (for encoding with
   :term:`text encodings <text encoding>` only): the
   unencodable character is replaced by a ``\N{...}`` escape sequence.

   .. versionadded:: 3.5

.. _codec-objects:

Stateless Encoding and Decoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The base :class:`Codec` class defines these methods which also define the
function interfaces of the stateless encoder and decoder:


.. method:: Codec.encode(input[, errors])

   Encodes the object *input* and returns a tuple (output object, length consumed).
   For instance, :term:`text encoding` converts
   a string object to a bytes object using a particular
   character set encoding (e.g., ``cp1252`` or ``iso-8859-1``).

   The *errors* argument defines the error handling to apply.
   It defaults to ``'strict'`` handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamWriter` for codecs which have to keep state in order to make
   encoding efficient.

   The encoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.


.. method:: Codec.decode(input[, errors])

   Decodes the object *input* and returns a tuple (output object, length
   consumed). For instance, for a :term:`text encoding`, decoding converts
   a bytes object encoded using a particular
   character set encoding to a string object.

   For text encodings and bytes-to-bytes codecs,
   *input* must be a bytes object or one which provides the read-only
   buffer interface -- for example, buffer objects and memory mapped files.

   The *errors* argument defines the error handling to apply.
   It defaults to ``'strict'`` handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamReader` for codecs which have to keep state in order to make
   decoding efficient.

   The decoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.
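
The interface can be sketched with a toy subclass. ``ReverseCodec`` below is a
hypothetical text-to-text codec invented for this example. It ignores the
*errors* argument because reversing a string cannot fail.

```python
import codecs

class ReverseCodec(codecs.Codec):
    """Hypothetical stateless codec that reverses its input string."""

    def encode(self, input, errors="strict"):
        # Return (output object, length consumed).
        return input[::-1], len(input)

    def decode(self, input, errors="strict"):
        return input[::-1], len(input)

codec = ReverseCodec()
output, consumed = codec.encode("abc")
restored, _ = codec.decode(output)

# Zero length input yields an empty object of the output type.
empty = codec.encode("")
```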

504

Nick Coghlan

2015-01-07 00:22:00 +1000

[diff] [blame]

505

506

Incremental Encoding and Decoding

507

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide
the basic interface for incremental encoding and decoding. Encoding/decoding the
input isn't done with one call to the stateless encoder/decoder function, but
with multiple calls to the
:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method of
the incremental encoder/decoder. The incremental encoder/decoder keeps track of
the encoding/decoding process between method calls.

The joined output of calls to the
:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method is
the same as if all the single inputs were joined into one, and this input was
encoded/decoded with the stateless encoder/decoder.
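
The equivalence between joined incremental output and stateless output can be
checked with a short sketch such as the one below. The sample text and the way
it is split into pieces are arbitrary choices for illustration.

.. code-block:: python

   import codecs

   text = "Gr\u00fc\u00dfe, \u4e16\u754c"  # sample text, chosen arbitrarily

   # Encode the text piece by piece with an incremental encoder.
   encoder = codecs.getincrementalencoder("utf-16")()
   pieces = [text[:3], text[3:7], text[7:]]
   chunks = [encoder.encode(piece) for piece in pieces]
   chunks.append(encoder.encode("", final=True))
   incremental = b"".join(chunks)

   # Encode the same text in one call with the stateless encoder.
   stateless, _ = codecs.getencoder("utf-16")(text)

   # Both approaches are expected to produce identical bytes.
   same = incremental == stateless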


.. _incremental-encoder-objects:

IncrementalEncoder Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~

The :class:`IncrementalEncoder` class is used for encoding an input in multiple
steps. It defines the following methods which every incremental encoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalEncoder(errors='strict')

   Constructor for an :class:`IncrementalEncoder` instance.

   All incremental encoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalEncoder` may implement different error handling schemes
   by providing the *errors* keyword argument. See :ref:`error-handlers` for
   possible values.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalEncoder`
   object.


   .. method:: encode(object[, final])

      Encodes *object* (taking the current state of the encoder into account)
      and returns the resulting encoded object. If this is the last call to
      :meth:`encode`, *final* must be true (the default is false).


   .. method:: reset()

      Reset the encoder to the initial state. The output is discarded: call
      ``.encode(object, final=True)``, passing an empty byte or text string
      if necessary, to reset the encoder and to get the output.


   .. method:: getstate()

      Return the current state of the encoder which must be an integer. The
      implementation should make sure that ``0`` is the most common
      state. (States that are more complicated than integers can be converted
      into an integer by marshaling/pickling the state and encoding the bytes
      of the resulting string into an integer.)


   .. method:: setstate(state)

      Set the state of the encoder to *state*. *state* must be an encoder state
      returned by :meth:`getstate`.
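
The following sketch exercises these methods with the ``utf-16`` incremental
encoder, which is convenient here because it only writes a BOM on its first
call. The concrete integer values returned by ``getstate()`` are treated as
opaque; only their round trip through ``setstate()`` is relied upon.

.. code-block:: python

   import codecs

   encoder = codecs.getincrementalencoder("utf-16")()

   initial_state = encoder.getstate()   # an integer
   first = encoder.encode("a")          # BOM followed by the encoded "a"
   second = encoder.encode("b")         # no BOM this time

   # reset() returns the encoder to its initial state, so the next call
   # emits a BOM again.
   encoder.reset()
   after_reset = encoder.encode("a")

   # setstate() accepts a value previously returned by getstate().
   encoder.setstate(initial_state)
   after_setstate = encoder.encode("a")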


.. _incremental-decoder-objects:

IncrementalDecoder Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~

The :class:`IncrementalDecoder` class is used for decoding an input in multiple
steps. It defines the following methods which every incremental decoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalDecoder(errors='strict')

   Constructor for an :class:`IncrementalDecoder` instance.

   All incremental decoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalDecoder` may implement different error handling schemes
   by providing the *errors* keyword argument. See :ref:`error-handlers` for
   possible values.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalDecoder`
   object.


   .. method:: decode(object[, final])

      Decodes *object* (taking the current state of the decoder into account)
      and returns the resulting decoded object. If this is the last call to
      :meth:`decode`, *final* must be true (the default is false). If *final* is
      true the decoder must decode the input completely and must flush all
      buffers. If this isn't possible (e.g. because of incomplete byte sequences
      at the end of the input) it must initiate error handling just like in the
      stateless case (which might raise an exception).


   .. method:: reset()

      Reset the decoder to the initial state.


   .. method:: getstate()

      Return the current state of the decoder. This must be a tuple with two
      items: the first must be the buffer containing the still undecoded
      input, and the second must be an integer that can carry additional state
      info. (The implementation should make sure that ``0`` is the most common
      additional state info.) If this additional state info is ``0`` it must be
      possible to set the decoder to the state which has no input buffered and
      ``0`` as the additional state info, so that feeding the previously
      buffered input to the decoder returns it to the previous state without
      producing any output. (Additional state info that is more complicated than
      integers can be converted into an integer by marshaling/pickling the info
      and encoding the bytes of the resulting string into an integer.)


   .. method:: setstate(state)

      Set the state of the decoder to *state*. *state* must be a decoder state
      returned by :meth:`getstate`.
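
A minimal sketch of the behaviour described above, using the ``utf-8``
incremental decoder and a three-byte character split across two calls. The
choice of character and split point is arbitrary.

.. code-block:: python

   import codecs

   decoder = codecs.getincrementaldecoder("utf-8")()

   euro = "\u20ac".encode("utf-8")       # three bytes: e2 82 ac

   # Feed only the first two bytes: nothing can be decoded yet.
   partial = decoder.decode(euro[:2])
   buffered, extra = decoder.getstate()  # (undecoded input, additional state)

   # Supplying the final byte completes the character.
   completed = decoder.decode(euro[2:], final=True)

   # With final=True, an incomplete sequence triggers error handling.
   decoder.reset()
   try:
       decoder.decode(euro[:2], final=True)
       failed = False
   except UnicodeDecodeError:
       failed = True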


Stream Encoding and Decoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`StreamWriter` and :class:`StreamReader` classes provide generic
working interfaces which can be used to implement new encoding submodules very
easily. See :mod:`encodings.utf_8` for an example of how this is done.


.. _stream-writer-objects:

StreamWriter Objects
~~~~~~~~~~~~~~~~~~~~

The :class:`StreamWriter` class is a subclass of :class:`Codec` and defines the
following methods which every stream writer must define in order to be
compatible with the Python codec registry.


.. class:: StreamWriter(stream, errors='strict')

   Constructor for a :class:`StreamWriter` instance.

   All stream writers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   The *stream* argument must be a file-like object open for writing
   text or binary data, as appropriate for the specific codec.

   The :class:`StreamWriter` may implement different error handling schemes by
   providing the *errors* keyword argument. See :ref:`error-handlers` for
   the standard error handlers the underlying stream codec may support.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamWriter` object.


   .. method:: write(object)

      Writes the object's contents encoded to the stream.


   .. method:: writelines(list)

      Writes the concatenated list of strings to the stream (possibly by reusing
      the :meth:`write` method). The standard bytes-to-bytes codecs
      do not support this method.


   .. method:: reset()

      Flushes and resets the codec buffers used for keeping state.

      Calling this method should ensure that the data on the output is put into
      a clean state that allows appending of new fresh data without having to
      rescan the whole stream to recover state.


In addition to the above methods, the :class:`StreamWriter` must also inherit
all other methods and attributes from the underlying stream.
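
The sketch below wraps an in-memory binary stream in a UTF-8
:class:`StreamWriter` obtained from :func:`getwriter`, and then shows the
*errors* attribute being changed on a second, ASCII writer. The use of
:class:`io.BytesIO` and the sample strings are assumptions made for the
example.

.. code-block:: python

   import codecs
   import io

   raw = io.BytesIO()                        # underlying binary stream
   writer = codecs.getwriter("utf-8")(raw)   # a StreamWriter instance

   writer.write("na\u00efve ")
   writer.writelines(["caf\u00e9", "\n"])

   # Attributes not defined by the writer are looked up on the stream,
   # so getvalue() here is BytesIO.getvalue().
   written = writer.getvalue()

   # The error handling strategy can be switched during the writer's lifetime.
   ascii_writer = codecs.getwriter("ascii")(io.BytesIO(), errors="replace")
   ascii_writer.write("caf\u00e9")    # the unencodable character becomes "?"
   ascii_writer.errors = "ignore"
   ascii_writer.write("caf\u00e9")    # the unencodable character is dropped
   ascii_output = ascii_writer.getvalue()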


.. _stream-reader-objects:

StreamReader Objects
~~~~~~~~~~~~~~~~~~~~

The :class:`StreamReader` class is a subclass of :class:`Codec` and defines the
following methods which every stream reader must define in order to be
compatible with the Python codec registry.


.. class:: StreamReader(stream, errors='strict')

   Constructor for a :class:`StreamReader` instance.

   All stream readers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   The *stream* argument must be a file-like object open for reading
   text or binary data, as appropriate for the specific codec.

   The :class:`StreamReader` may implement different error handling schemes by
   providing the *errors* keyword argument. See :ref:`error-handlers` for
   the standard error handlers the underlying stream codec may support.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamReader` object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.



   .. method:: read([size[, chars[, firstline]]])

      Decodes data from the stream and returns the resulting object.

      The *chars* argument indicates the number of decoded
      code points or bytes to return. The :meth:`read` method will
      never return more data than requested, but it might return less,
      if there is not enough available.

      The *size* argument indicates the approximate maximum
      number of encoded bytes or code points to read
      for decoding. The decoder can modify this setting as
      appropriate. The default value -1 indicates to read and decode as much as
      possible. This parameter is intended to
      prevent having to decode huge files in one step.

      The *firstline* flag indicates that
      it would be sufficient to only return the first
      line, if there are decoding errors on later lines.

      The method should use a greedy read strategy meaning that it should read
      as much data as is allowed within the definition of the encoding and the
      given size, e.g. if optional encoding endings or state markers are
      available on the stream, these should be read too.


   .. method:: readline([size[, keepends]])

      Read one line from the input stream and return the decoded data.

      *size*, if given, is passed as the *size* argument to the stream's
      :meth:`read` method.

      If *keepends* is false, line-endings will be stripped from the lines
      returned.


   .. method:: readlines([sizehint[, keepends]])

      Read all lines available on the input stream and return them as a list of
      lines.

      Line-endings are implemented using the codec's :meth:`decode` method and
      are included in the list entries if *keepends* is true.

      *sizehint*, if given, is passed as the *size* argument to the stream's
      :meth:`read` method.

Georg Brandl

2007-08-15 14:28:22 +0000

[diff] [blame]

788

789

Benjamin Peterson

2008-04-25 01:59:09 +0000

[diff] [blame]

790

.. method:: reset()

Georg Brandl

2007-08-15 14:28:22 +0000

[diff] [blame]

791

Benjamin Peterson

2008-04-25 01:59:09 +0000

[diff] [blame]

792

Resets the codec buffers used for keeping state.

Georg Brandl

2007-08-15 14:28:22 +0000

[diff] [blame]

793

Miss Islington (bot)

2019-10-01 14:02:29 -0700

[diff] [blame]

794

Note that no stream repositioning should take place. This method is

Benjamin Peterson

2008-04-25 01:59:09 +0000

[diff] [blame]

795

primarily intended to be able to recover from decoding errors.


In addition to the above methods, the :class:`StreamReader` must also inherit
all other methods and attributes from the underlying stream.
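
As an illustration, the sketch below reads UTF-8 data line by line through a
:class:`StreamReader` obtained from :func:`getreader`. The in-memory stream
and its contents are assumptions made for the example.

.. code-block:: python

   import codecs
   import io

   data = "first line\nsecond line\nthird \u2603\n".encode("utf-8")
   reader = codecs.getreader("utf-8")(io.BytesIO(data))

   first = reader.readline()                 # keeps the line ending
   second = reader.readline(keepends=False)  # strips the line ending
   rest = reader.readlines()                 # remaining lines as a list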


.. _stream-reader-writer:

StreamReaderWriter Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~

The :class:`StreamReaderWriter` is a convenience class that allows wrapping
streams which work in both read and write modes.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.


.. class:: StreamReaderWriter(stream, Reader, Writer, errors='strict')

   Creates a :class:`StreamReaderWriter` instance. *stream* must be a file-like
   object. *Reader* and *Writer* must be factory functions or classes providing the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively. Error
   handling is done in the same way as defined for the stream readers and writers.

:class:`StreamReaderWriter` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.
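
A minimal sketch of constructing such an object from the factories that
:func:`lookup` returns. The in-memory stream and the sample text are
assumptions made for the example.

.. code-block:: python

   import codecs
   import io

   info = codecs.lookup("utf-8")   # CodecInfo with factory attributes
   raw = io.BytesIO()

   stream = codecs.StreamReaderWriter(raw, info.streamreader, info.streamwriter)

   stream.write("sm\u00f8rrebr\u00f8d\n")   # handled by the StreamWriter
   stream.seek(0)                            # rewind the underlying stream
   line = stream.readline()                  # handled by the StreamReader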


.. _stream-recoder-objects:

StreamRecoder Objects
~~~~~~~~~~~~~~~~~~~~~

The :class:`StreamRecoder` translates data from one encoding to another,
which is sometimes useful when dealing with different encoding environments.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.


.. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors='strict')

   Creates a :class:`StreamRecoder` instance which implements a two-way conversion:
   *encode* and *decode* work on the frontend -- the data visible to
   code calling :meth:`read` and :meth:`write`, while *Reader* and *Writer*
   work on the backend -- the data in *stream*.

   You can use these objects to do transparent transcodings, e.g., from Latin-1
   to UTF-8 and back.

   The *stream* argument must be a file-like object.

   The *encode* and *decode* arguments must
   adhere to the :class:`Codec` interface. *Reader* and
   *Writer* must be factory functions or classes providing objects of the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively.

   Error handling is done in the same way as defined for the stream readers and
   writers.

:class:`StreamRecoder` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.
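
The Latin-1/UTF-8 transcoding mentioned above might be set up as in the
following sketch, where the calling code exchanges Latin-1 bytes and the
underlying stream holds UTF-8. The names ``frontend`` and ``backend`` and the
in-memory stream are illustrative only.

.. code-block:: python

   import codecs
   import io

   frontend = codecs.lookup("latin-1")   # encoding seen by the calling code
   backend = codecs.lookup("utf-8")      # encoding of the data in the stream

   raw = io.BytesIO()
   recoder = codecs.StreamRecoder(
       raw,
       frontend.encode, frontend.decode,
       backend.streamreader, backend.streamwriter,
   )

   recoder.write("caf\u00e9".encode("latin-1"))   # Latin-1 bytes go in
   stored = raw.getvalue()                         # UTF-8 bytes are stored

   raw.seek(0)
   read_back = recoder.read()                      # Latin-1 bytes come back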


.. _encodings-overview:

Encodings and Unicode
---------------------

Strings are stored internally as sequences of code points in
range ``0x0``--``0x10FFFF``. (See :pep:`393` for
more details about the implementation.)
Once a string object is used outside of CPU and memory, endianness
and how these arrays are stored as bytes become an issue. As with other
codecs, serialising a string into a sequence of bytes is known as *encoding*,
and recreating the string from the sequence of bytes is known as *decoding*.
There are a variety of different text serialisation codecs, which are
collectively referred to as :term:`text encodings <text encoding>`.

The simplest text encoding (called ``'latin-1'`` or ``'iso-8859-1'``) maps
the code points 0--255 to the bytes ``0x0``--``0xff``, which means that a string
object that contains code points above ``U+00FF`` can't be encoded with this
codec. Doing so will raise a :exc:`UnicodeEncodeError` that looks
like the following (although the details of the error message may differ):
``UnicodeEncodeError: 'latin-1' codec can't encode character '\u1234' in
position 3: ordinal not in range(256)``.
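
A short sketch of both cases, using arbitrary sample strings. The exact
wording of the exception message is deliberately not relied upon; only the
reported position is inspected.

.. code-block:: python

   # Code points up to 255 map directly to the byte of the same value.
   encoded = "\u00e9\u00ff".encode("latin-1")

   message = None
   bad_span = None
   try:
       "abc\u1234".encode("latin-1")
   except UnicodeEncodeError as exc:
       message = str(exc)
       bad_span = (exc.start, exc.end)   # slice of the input that failed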


There's another group of encodings (the so-called charmap encodings) that choose
a different subset of all Unicode code points and how these code points are
mapped to the bytes ``0x0``--``0xff``. To see how this is done simply open
e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
Windows). There's a string constant with 256 characters that shows you which
character is mapped to which byte value.
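
The mapping can also be inspected from Python rather than by opening the file.
In current CPython the string constant is exposed as ``decoding_table`` in the
:mod:`encodings.cp1252` module; that attribute name is an implementation
detail and is assumed here purely for illustration.

.. code-block:: python

   import encodings.cp1252 as cp1252

   table = cp1252.decoding_table   # string constant, one character per byte
   euro = table[0x80]              # character assigned to byte 0x80

   # The same byte decodes differently under two charmap-style encodings.
   same_byte_latin1 = bytes([0x80]).decode("latin-1")
   same_byte_cp1252 = bytes([0x80]).decode("cp1252")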


All of these encodings can only encode 256 of the 1114112 code points
defined in Unicode. A simple and straightforward way that can store each Unicode
code point is to store each code point as four consecutive bytes. There are two
possibilities: store the bytes in big endian or in little endian order. These
two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
problem: bytes will always be in natural endianness. When these bytes are read
by a CPU with a different endianness, then bytes have to be swapped though.

To be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
there's the so-called BOM ("Byte Order Mark"). This is the Unicode character
``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
byte sequence. The byte swapped version of this character (``0xFFFE``) is an
illegal character that may not appear in a Unicode text. So when the
first character in a ``UTF-16`` or ``UTF-32`` byte sequence
appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
Unfortunately the character ``U+FEFF`` had a second purpose as
a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
a word to be split. It can e.g. be used to give hints to a ligature algorithm.
With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
Unicode software still must be able to handle ``U+FEFF`` in both roles: as a BOM
it's a device to determine the storage layout of the encoded bytes, and vanishes
once the byte sequence has been decoded into a string; as a ``ZERO WIDTH
NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
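
The two roles of ``U+FEFF`` can be observed with a sketch like the one below,
which uses the BOM constants defined by this module. The single-character
sample string is arbitrary.

.. code-block:: python

   import codecs
   import sys

   be = "A".encode("utf-32-be")    # four bytes, big endian, no BOM
   le = "A".encode("utf-32-le")    # four bytes, little endian, no BOM

   # The generic codec writes a BOM followed by native-order data.
   native = "A".encode("utf-32")
   if sys.byteorder == "little":
       expected_bom = codecs.BOM_UTF32_LE
   else:
       expected_bom = codecs.BOM_UTF32_BE

   # On decoding, the BOM selects the byte order and then vanishes.
   from_be = (codecs.BOM_UTF32_BE + be).decode("utf-32")
   from_le = (codecs.BOM_UTF32_LE + le).decode("utf-32")

   # A codec with fixed endianness keeps U+FEFF as a normal character.
   kept = (codecs.BOM_UTF16_BE + "A".encode("utf-16-be")).decode("utf-16-be")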


There's another encoding that is able to encode the full range of Unicode
characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
parts: marker bits (the most significant bits) and payload bits. The marker bits
are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode characters are
encoded like this (with x being payload bits, which when concatenated give the
Unicode character):


+-----------------------------------+----------------------------------------------+
| Range                             | Encoding                                     |
+===================================+==============================================+
| ``U-00000000`` ... ``U-0000007F`` | 0xxxxxxx                                     |
+-----------------------------------+----------------------------------------------+
| ``U-00000080`` ... ``U-000007FF`` | 110xxxxx 10xxxxxx                            |
+-----------------------------------+----------------------------------------------+
| ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
+-----------------------------------+----------------------------------------------+
| ``U-00010000`` ... ``U-0010FFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
+-----------------------------------+----------------------------------------------+

The least significant bit of the Unicode character is the rightmost x bit.
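
The bit patterns in the table can be reproduced with a small helper such as
the one below. ``utf8_bits`` is a hypothetical function written for this
example, and the four sample characters were picked so that each falls into a
different range.

.. code-block:: python

   def utf8_bits(char):
       """Return the UTF-8 encoding of *char* as space-separated bit strings."""
       return " ".join(f"{byte:08b}" for byte in char.encode("utf-8"))

   one = utf8_bits("A")             # U+0041, one byte
   two = utf8_bits("\u00e9")        # U+00E9, two bytes
   three = utf8_bits("\u20ac")      # U+20AC, three bytes
   four = utf8_bits("\U0001f40d")   # U+1F40D, four bytes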


As UTF-8 is an 8-bit encoding no BOM is required and any ``U+FEFF`` character in
the decoded string (even if it's the first character) is treated as a ``ZERO
WIDTH NO-BREAK SPACE``.

Georg Brandl

2007-08-15 14:28:22 +0000

[diff] [blame]

944

945

Without external information it's impossible to reliably determine which

Georg Brandl

2008-05-11 14:52:00 +0000

[diff] [blame]

946

encoding was used for encoding a string. Each charmap encoding can

Georg Brandl

2007-08-15 14:28:22 +0000

[diff] [blame]

947

decode any random byte sequence. However that's not possible with UTF-8, as

948

UTF-8 byte sequences have a structure that doesn't allow arbitrary byte

Thomas Wouters

89d996e

2007-09-08 17:39:28 +0000

[diff] [blame]

949

sequences. To increase the reliability with which a UTF-8 encoding can be

Georg Brandl

2007-08-15 14:28:22 +0000

[diff] [blame]

950

detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls

951

``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters

952

is written to the file, a UTF-8 encoded BOM (which looks like this as a byte

953

sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable

954

that any charmap encoded file starts with these byte values (which would e.g.

955

map to

956

957

| LATIN SMALL LETTER I WITH DIAERESIS

958

| RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK

959

| INVERTED QUESTION MARK

960

Ezio Melotti

2011-10-25 10:40:38 +0300

[diff] [blame]

961

in iso-8859-1), this increases the probability that a ``utf-8-sig`` encoding can be

Georg Brandl

2007-08-15 14:28:22 +0000

[diff] [blame]

962

correctly guessed from the byte sequence. So here the BOM is not used to be able

963

to determine the byte order used for generating the byte sequence, but as a

964

signature that helps in guessing the encoding. On encoding the utf-8-sig codec

965

will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On

Ezio Melotti

2011-10-25 10:40:38 +0300

[diff] [blame]

966

decoding ``utf-8-sig`` will skip those three bytes if they appear as the first

Miss Islington (bot)

2019-10-01 14:02:29 -0700

[diff] [blame]

967

three bytes in the file. In UTF-8, the use of the BOM is discouraged and

Ezio Melotti

2011-10-25 10:40:38 +0300

[diff] [blame]

968

should generally be avoided.
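The difference between the two codecs can be seen on a short string. This is a minimal sketch that works on in-memory bytes rather than on a file; the variable names are chosen for illustration only.

```python
import codecs

text = "spam"

# Encoding with utf-8-sig prepends the three signature bytes.
with_sig = text.encode("utf-8-sig")
print(with_sig == codecs.BOM_UTF8 + b"spam")

# The plain utf-8 decoder keeps the signature as a U+FEFF character ...
print(ascii(with_sig.decode("utf-8")))

# ... while utf-8-sig strips it, and also accepts input without a signature.
print(with_sig.decode("utf-8-sig"))
print(b"spam".decode("utf-8-sig"))
```

Decoding with ``utf-8-sig`` is therefore a reasonable choice when input may or may not carry the signature.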

.. _standard-encodings:

Standard Encodings
------------------

Python comes with a number of codecs built-in, either implemented as C functions
or with dictionaries as mapping tables. The following table lists the codecs by
name, together with a few common aliases, and the languages for which the
encoding is likely used. Neither the list of aliases nor the list of languages
is meant to be exhaustive. Notice that spelling alternatives that only differ in
case or use a hyphen instead of an underscore are also valid aliases; therefore,
e.g. ``'utf-8'`` is a valid alias for the ``'utf_8'`` codec.
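Alias handling can be observed with :func:`codecs.lookup`, which returns the same codec for every accepted spelling. A minimal sketch; the list of spellings is only an example.

```python
import codecs

# Spellings that differ only in case, hyphen/underscore, or use a listed alias.
spellings = ("utf_8", "utf-8", "UTF-8", "Utf_8", "U8", "utf8")

# CodecInfo.name is the canonical name the registry reports for the codec.
names = {codecs.lookup(spelling).name for spelling in spellings}
print(names)
```

All spellings resolve to a single codec, so the set contains exactly one name.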

.. impl-detail::

   Some common encodings can bypass the codecs lookup machinery to
   improve performance. These optimization opportunities are only
   recognized by CPython for a limited set of (case insensitive)
   aliases: utf-8, utf8, latin-1, latin1, iso-8859-1, iso8859-1, mbcs
   (Windows only), ascii, us-ascii, utf-16, utf16, utf-32, utf32, and
   the same using underscores instead of dashes. Using alternative
   aliases for these encodings may result in slower execution.

   .. versionchanged:: 3.6
      Optimization opportunity recognized for us-ascii.

Many of the character sets support the same languages. They vary in individual
characters (e.g. whether the EURO SIGN is supported or not), and in the
assignment of characters to code positions. For the European languages in
particular, the following variants typically exist:

* an ISO 8859 codeset

* a Microsoft Windows code page, which is typically derived from an 8859 codeset,
  but replaces control characters with additional graphic characters

* an IBM EBCDIC code page

* an IBM PC code page, which is ASCII compatible
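The differences between these variants show up when the same character is encoded with several Western European codecs. The sketch below is illustrative only; ``try_encode`` is a hypothetical helper and the codecs were picked as examples from the table that follows.

```python
EURO = "\u20ac"  # EURO SIGN


def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


# Same language coverage, different support and code positions for one character.
for codec in ("latin_1", "iso8859_15", "cp1252", "cp437", "cp858"):
    print(f"{codec:12} EURO SIGN -> {try_encode(EURO, codec)!r}")

# An IBM PC code page keeps the ASCII byte values; an EBCDIC code page does not.
print("A".encode("cp437"), "A".encode("cp500"))
```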

.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|

+-----------------+--------------------------------+--------------------------------+
| Codec           | Aliases                        | Languages                      |
+=================+================================+================================+
| ascii           | 646, us-ascii                  | English                        |
+-----------------+--------------------------------+--------------------------------+
| big5            | big5-tw, csbig5                | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| big5hkscs       | big5-hkscs, hkscs              | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp037           | IBM037, IBM039                 | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp273           | 273, IBM273, csIBM273          | German                         |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.4          |
+-----------------+--------------------------------+--------------------------------+
| cp424           | EBCDIC-CP-HE, IBM424           | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp437           | 437, IBM437                    | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp500           | EBCDIC-CP-BE, EBCDIC-CP-CH,    | Western Europe                 |
|                 | IBM500                         |                                |
+-----------------+--------------------------------+--------------------------------+
| cp720           |                                | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp737           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp775           | IBM775                         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp850           | 850, IBM850                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp852           | 852, IBM852                    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp855           | 855, IBM855                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp856           |                                | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp857           | 857, IBM857                    | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp858           | 858, IBM858                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp860           | 860, IBM860                    | Portuguese                     |
+-----------------+--------------------------------+--------------------------------+
| cp861           | 861, CP-IS, IBM861             | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| cp862           | 862, IBM862                    | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp863           | 863, IBM863                    | Canadian                       |
+-----------------+--------------------------------+--------------------------------+
| cp864           | IBM864                         | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp865           | 865, IBM865                    | Danish, Norwegian              |
+-----------------+--------------------------------+--------------------------------+
| cp866           | 866, IBM866                    | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| cp869           | 869, CP-GR, IBM869             | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp874           |                                | Thai                           |
+-----------------+--------------------------------+--------------------------------+
| cp875           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp932           | 932, ms932, mskanji, ms-kanji  | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| cp949           | 949, ms949, uhc                | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| cp950           | 950, ms950                     | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp1006          |                                | Urdu                           |
+-----------------+--------------------------------+--------------------------------+
| cp1026          | ibm1026                        | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1125          | 1125, ibm1125, cp866u, ruscii  | Ukrainian                      |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.4          |
+-----------------+--------------------------------+--------------------------------+
| cp1140          | ibm1140                        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1250          | windows-1250                   | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp1251          | windows-1251                   | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp1252          | windows-1252                   | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1253          | windows-1253                   | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp1254          | windows-1254                   | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1255          | windows-1255                   | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp1256          | windows-1256                   | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp1257          | windows-1257                   | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp1258          | windows-1258                   | Vietnamese                     |
+-----------------+--------------------------------+--------------------------------+
| euc_jp          | eucjp, ujis, u-jis             | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jis_2004    | jisx0213, eucjis2004           | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jisx0213    | eucjisx0213                    | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_kr          | euckr, korean, ksc5601,        | Korean                         |
|                 | ks_c-5601, ks_c-5601-1987,     |                                |
|                 | ksx1001, ks_x-1001             |                                |
+-----------------+--------------------------------+--------------------------------+
| gb2312          | chinese, csiso58gb231280,      | Simplified Chinese             |
|                 | euc-cn, euccn, eucgb2312-cn,   |                                |
|                 | gb2312-1980, gb2312-80,        |                                |
|                 | iso-ir-58                      |                                |
+-----------------+--------------------------------+--------------------------------+
| gbk             | 936, cp936, ms936              | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| gb18030         | gb18030-2000                   | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| hz              | hzgb, hz-gb, hz-gb-2312        | Simplified Chinese             |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp      | csiso2022jp, iso2022jp,        | Japanese                       |
|                 | iso-2022-jp                    |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_1    | iso2022jp-1, iso-2022-jp-1     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2    | iso2022jp-2, iso-2022-jp-2     | Japanese, Korean, Simplified   |
|                 |                                | Chinese, Western Europe, Greek |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2004 | iso2022jp-2004,                | Japanese                       |
|                 | iso-2022-jp-2004               |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_3    | iso2022jp-3, iso-2022-jp-3     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_ext  | iso2022jp-ext, iso-2022-jp-ext | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_kr      | csiso2022kr, iso2022kr,        | Korean                         |
|                 | iso-2022-kr                    |                                |
+-----------------+--------------------------------+--------------------------------+
| latin_1         | iso-8859-1, iso8859-1, 8859,   | Western Europe                 |
|                 | cp819, latin, latin1, L1       |                                |
+-----------------+--------------------------------+--------------------------------+
| iso8859_2       | iso-8859-2, latin2, L2         | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| iso8859_3       | iso-8859-3, latin3, L3         | Esperanto, Maltese             |
+-----------------+--------------------------------+--------------------------------+
| iso8859_4       | iso-8859-4, latin4, L4         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_5       | iso-8859-5, cyrillic           | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| iso8859_6       | iso-8859-6, arabic             | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_7       | iso-8859-7, greek, greek8      | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| iso8859_8       | iso-8859-8, hebrew             | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_9       | iso-8859-9, latin5, L5         | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| iso8859_10      | iso-8859-10, latin6, L6        | Nordic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_11      | iso-8859-11, thai              | Thai languages                 |
+-----------------+--------------------------------+--------------------------------+
| iso8859_13      | iso-8859-13, latin7, L7        | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_14      | iso-8859-14, latin8, L8        | Celtic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_15      | iso-8859-15, latin9, L9        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| iso8859_16      | iso-8859-16, latin10, L10      | South-Eastern Europe           |
+-----------------+--------------------------------+--------------------------------+
| johab           | cp1361, ms1361                 | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| koi8_r          |                                | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| koi8_t          |                                | Tajik                          |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.5          |
+-----------------+--------------------------------+--------------------------------+
| koi8_u          |                                | Ukrainian                      |
+-----------------+--------------------------------+--------------------------------+
| kz1048          | kz_1048, strk1048_2002, rk1048 | Kazakh                         |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.5          |
+-----------------+--------------------------------+--------------------------------+
| mac_cyrillic    | maccyrillic                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| mac_greek       | macgreek                       | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| mac_iceland     | maciceland                     | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| mac_latin2      | maclatin2, maccentraleurope    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| mac_roman       | macroman, macintosh            | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| mac_turkish     | macturkish                     | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| ptcp154         | csptcp154, pt154, cp154,       | Kazakh                         |
|                 | cyrillic-asian                 |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis       | csshiftjis, shiftjis, sjis,    | Japanese                       |
|                 | s_jis                          |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis_2004  | shiftjis2004, sjis_2004,       | Japanese                       |
|                 | sjis2004                       |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jisx0213  | shiftjisx0213, sjisx0213,      | Japanese                       |
|                 | s_jisx0213                     |                                |
+-----------------+--------------------------------+--------------------------------+
| utf_32          | U32, utf32                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_be       | UTF-32BE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_le       | UTF-32LE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16          | U16, utf16                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16_be       | UTF-16BE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16_le       | UTF-16LE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_7           | U7, unicode-1-1-utf-7          | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8           | U8, UTF, utf8, cp65001         | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8_sig       |                                | all languages                  |
+-----------------+--------------------------------+--------------------------------+

.. versionchanged:: 3.4
   The utf-16\* and utf-32\* encoders no longer allow surrogate code points
   (``U+D800``--``U+DFFF``) to be encoded.
   The utf-32\* decoders no longer decode
   byte sequences that correspond to surrogate code points.

.. versionchanged:: 3.8
   ``cp65001`` is now an alias to ``utf_8``.
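The surrogate restriction can be demonstrated with a lone surrogate code point. This is a minimal sketch for Python 3.4 or later; ``is_rejected`` is a hypothetical helper, and the ``surrogatepass`` error handler is used only to build test input for the decoder.

```python
LONE_SURROGATE = "\ud800"


def is_rejected(func, *args):
    """Return True if calling func(*args) raises a UnicodeError."""
    try:
        func(*args)
    except UnicodeError:
        return True
    return False


# The encoders refuse surrogate code points under the default 'strict' handler.
for codec in ("utf-16", "utf-16-le", "utf-32", "utf-32-be"):
    print(codec, "encoder rejects it:", is_rejected(LONE_SURROGATE.encode, codec))

# 'surrogatepass' lets the code point through, producing bytes that the
# utf-32 decoder in turn refuses to decode.
raw = LONE_SURROGATE.encode("utf-32-le", "surrogatepass")
print("utf-32-le decoder rejects it:", is_rejected(raw.decode, "utf-32-le"))
```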

Python Specific Encodings
-------------------------

A number of predefined codecs are specific to Python, so their codec names have
no meaning outside Python. These are listed in the tables below based on the
expected input and output types (note that while text encodings are the most
common use case for codecs, the underlying codec infrastructure supports
arbitrary data transforms rather than just text encodings). For asymmetric
codecs, the stated meaning describes the encoding direction.

Text Encodings
^^^^^^^^^^^^^^

The following codecs provide :class:`str` to :class:`bytes` encoding and
:term:`bytes-like object` to :class:`str` decoding, similar to the Unicode text
encodings.

.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|

+--------------------+---------+---------------------------+
| Codec              | Aliases | Meaning                   |
+====================+=========+===========================+
| idna               |         | Implement :rfc:`3490`,    |
|                    |         | see also                  |
|                    |         | :mod:`encodings.idna`.    |
|                    |         | Only ``errors='strict'``  |
|                    |         | is supported.             |
+--------------------+---------+---------------------------+
| mbcs               | ansi,   | Windows only: Encode the  |
|                    | dbcs    | operand according to the  |
|                    |         | ANSI codepage (CP_ACP).   |
+--------------------+---------+---------------------------+
| oem                |         | Windows only: Encode the  |
|                    |         | operand according to the  |
|                    |         | OEM codepage (CP_OEMCP).  |
|                    |         |                           |
|                    |         | .. versionadded:: 3.6     |
+--------------------+---------+---------------------------+
| palmos             |         | Encoding of PalmOS 3.5.   |
+--------------------+---------+---------------------------+
| punycode           |         | Implement :rfc:`3492`.    |
|                    |         | Stateful codecs are not   |
|                    |         | supported.                |
+--------------------+---------+---------------------------+
| raw_unicode_escape |         | Latin-1 encoding with     |
|                    |         | ``\uXXXX`` and            |
|                    |         | ``\UXXXXXXXX`` for other  |
|                    |         | code points. Existing     |
|                    |         | backslashes are not       |
|                    |         | escaped in any way.       |
|                    |         | It is used in the Python  |
|                    |         | pickle protocol.          |
+--------------------+---------+---------------------------+
| undefined          |         | Raise an exception for    |
|                    |         | all conversions, even     |
|                    |         | empty strings. The error  |
|                    |         | handler is ignored.       |
+--------------------+---------+---------------------------+
| unicode_escape     |         | Encoding suitable as the  |
|                    |         | contents of a Unicode     |
|                    |         | literal in ASCII-encoded  |
|                    |         | Python source code,       |
|                    |         | except that quotes are    |
|                    |         | not escaped. Decode       |
|                    |         | from Latin-1 source code. |
|                    |         | Beware that Python source |
|                    |         | code actually uses UTF-8  |
|                    |         | by default.               |
+--------------------+---------+---------------------------+
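The difference between the two escape codecs, and the behaviour of ``undefined``, can be seen on short inputs. A minimal sketch; the sample strings are arbitrary.

```python
sample = "caf\u00e9 \u20ac\n"

# unicode_escape produces pure ASCII: every non-ASCII character and the
# newline are written as backslash escapes.
escaped = sample.encode("unicode_escape")
print(escaped)

# raw_unicode_escape keeps Latin-1 characters and the newline as single bytes
# and only escapes code points above U+00FF.
raw_escaped = sample.encode("raw_unicode_escape")
print(raw_escaped)

# Existing backslashes are doubled by unicode_escape but left alone by
# raw_unicode_escape.
print("a\\b".encode("unicode_escape"), "a\\b".encode("raw_unicode_escape"))

# The undefined codec raises for every conversion, even an empty string.
try:
    "".encode("undefined")
except UnicodeError as exc:
    print("undefined:", exc)
```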

.. versionchanged:: 3.8
   "unicode_internal" codec is removed.

.. _binary-transforms:

Binary Transforms
^^^^^^^^^^^^^^^^^

The following codecs provide binary transforms: :term:`bytes-like object`
to :class:`bytes` mappings. They are not supported by :meth:`bytes.decode`
(which only produces :class:`str` output).

Nick Coghlan

2013-05-23 20:24:02 +1000

[diff] [blame]

1330

Georg Brandl

2010-12-02 18:06:51 +0000

[diff] [blame]

1331

Nick Coghlan

2013-11-23 11:13:36 +1000

[diff] [blame]

1332

.. tabularcolumns:: |l|L|L|L|

Georg Brandl

2013-03-28 13:28:44 +0100

[diff] [blame]

1333

Nick Coghlan

2013-11-23 11:13:36 +1000

[diff] [blame]

1334

+----------------------+------------------+------------------------------+------------------------------+

Miss Islington (bot)

2019-10-01 14:02:29 -0700

[diff] [blame]

1335

Nick Coghlan

2013-11-23 11:13:36 +1000

[diff] [blame]

1336

+======================+==================+==============================+==============================+

Miss Islington (bot)

2019-10-01 14:02:29 -0700

[diff] [blame]

Nick Coghlan

2013-11-23 11:13:36 +1000

[diff] [blame]

| | | | |

+----------------------+------------------+------------------------------+------------------------------+

Miss Islington (bot)

2019-10-01 14:02:29 -0700

[diff] [blame]

1348

1349

| | | bz2. | :meth:`bz2.decompress` |

Nick Coghlan

2013-11-23 11:13:36 +1000

[diff] [blame]

1350

+----------------------+------------------+------------------------------+------------------------------+

Miss Islington (bot)

2019-10-01 14:02:29 -0700

[diff] [blame]

1351

Martin Panter

06171bd

2015-09-12 00:34:28 +0000

[diff] [blame]

1352

Nick Coghlan

2013-11-23 11:13:36 +1000

[diff] [blame]

1353

Miss Islington (bot)

2019-10-01 14:02:29 -0700

[diff] [blame]

1354

Nick Coghlan

2013-11-23 11:13:36 +1000

[diff] [blame]

1355

+----------------------+------------------+------------------------------+------------------------------+

Miss Islington (bot)

2019-10-01 14:02:29 -0700

[diff] [blame]

1356

1357

Martin Panter

06171bd

2015-09-12 00:34:28 +0000

[diff] [blame]

1358

Nick Coghlan

2013-11-23 11:13:36 +1000

[diff] [blame]

1359

+----------------------+------------------+------------------------------+------------------------------+

1360

Miss Islington (bot)

2019-10-01 14:02:29 -0700

[diff] [blame]

1361

Nick Coghlan

2013-11-23 11:13:36 +1000

[diff] [blame]

1362

+----------------------+------------------+------------------------------+------------------------------+

Miss Islington (bot)

2019-10-01 14:02:29 -0700

[diff] [blame]

1363

1364

Nick Coghlan

2013-11-23 11:13:36 +1000

[diff] [blame]

1365

+----------------------+------------------+------------------------------+------------------------------+

.. [#b64] In addition to :term:`bytes-like objects <bytes-like object>`,
   ``'base64_codec'`` also accepts ASCII-only instances of :class:`str` for
   decoding.

.. versionadded:: 3.2
   Restoration of the binary transforms.

.. versionchanged:: 3.4
   Restoration of the aliases for the binary transforms.
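Because :meth:`bytes.decode` rejects these codecs, binary transforms are applied through :func:`codecs.encode` and :func:`codecs.decode`. The sketch below uses ``base64_codec`` and ``bz2_codec`` as examples; the payloads are arbitrary.

```python
import codecs

# bytes -> bytes in both directions.
b64 = codecs.encode(b"hello", "base64_codec")
print(b64)
print(codecs.decode(b64, "base64_codec"))

payload = b"binary transforms " * 20
packed = codecs.encode(payload, "bz2_codec")
print(len(payload), "->", len(packed), "bytes")
print(codecs.decode(packed, "bz2_codec") == payload)

# The bytes.decode() method only handles text encodings and refuses the codec.
try:
    b64.decode("base64_codec")
except LookupError as exc:
    print("bytes.decode refused:", exc)
```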

.. _text-transforms:

Text Transforms
^^^^^^^^^^^^^^^

The following codec provides a text transform: a :class:`str` to :class:`str`
mapping. It is not supported by :meth:`str.encode` (which only produces
:class:`bytes` output).

.. tabularcolumns:: |l|l|L|

+--------------------+---------+---------------------------+
| Codec              | Aliases | Meaning                   |
+====================+=========+===========================+
| rot_13             | rot13   | Return the Caesar-cypher  |
|                    |         | encryption of the         |
|                    |         | operand.                  |
+--------------------+---------+---------------------------+

.. versionadded:: 3.2
   Restoration of the ``rot_13`` text transform.

.. versionchanged:: 3.4
   Restoration of the ``rot13`` alias.
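
As a minimal sketch of how this transform can be reached, the example below
goes through :func:`codecs.encode` and :func:`codecs.decode`, because
:meth:`str.encode` rejects codecs that are not text encodings. The sample
string and the variable names are illustrative assumptions.

```python
import codecs

# Illustrative input; any str works.
plain = "Hello, World!"

# rot_13 maps str to str, so it has to go through the codecs module.
scrambled = codecs.encode(plain, "rot_13")

# Applying the transform again (here via the rot13 alias) restores the input.
restored = codecs.decode(scrambled, "rot13")

# str.encode() only accepts text encodings and refuses rot_13.
try:
    plain.encode("rot_13")
except LookupError as exc:
    rejection = str(exc)
else:
    rejection = None

print(scrambled)
print(restored)
```

Because the shift is 13 places in a 26-letter alphabet, encoding and decoding
are the same operation; characters outside ``A-Z`` and ``a-z`` pass through
unchanged.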


:mod:`encodings.idna` --- Internationalized Domain Names in Applications
------------------------------------------------------------------------

.. module:: encodings.idna
   :synopsis: Internationalized Domain Names implementation
.. moduleauthor:: Martin v. Löwis

This module implements :rfc:`3490` (Internationalized Domain Names in
Applications) and :rfc:`3491` (Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)). It builds upon the ``punycode`` encoding
and :mod:`stringprep`.

These RFCs together define a protocol to support non-ASCII characters in domain
names. A domain name containing non-ASCII characters (such as
``www.Alliancefrançaise.nu``) is converted into an ASCII-compatible encoding
(ACE, such as ``www.xn--alliancefranaise-npb.nu``). The ACE form of the domain
name is then used in all places where arbitrary characters are not allowed by
the protocol, such as DNS queries, HTTP :mailheader:`Host` fields, and so
on. This conversion is carried out in the application, if possible invisibly to
the user: the application should transparently convert Unicode domain labels to
their ACE form on the wire, and convert ACE labels back to Unicode before
presenting them to the user.

Python supports this conversion in several ways. The ``idna`` codec performs
conversion between Unicode and ACE: it separates an input string into labels
based on the separator characters defined in
:rfc:`section 3.1 of RFC 3490 <3490#section-3.1>` and converts each label to
ACE as required; conversely, it separates an input byte string into labels
based on the ``.`` separator and converts any ACE labels found into Unicode.
Furthermore, the :mod:`socket` module transparently converts Unicode host names
to ACE, so that applications need not be concerned about converting host names
themselves when they pass them to the socket module. On top of that, modules
that have host names as function parameters, such as :mod:`http.client` and
:mod:`ftplib`, accept Unicode host names (:mod:`http.client` then also
transparently sends an IDNA hostname in the :mailheader:`Host` field if it
sends that field at all).
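
The codec's label handling can be illustrated with a short sketch. It reuses
the host name from the text above; the variable names are assumptions made for
the example, and the second input only demonstrates that the ideographic full
stop (U+3002), one of the RFC 3490 separators, is accepted in place of ``.``.

```python
# Host name taken from the text above.
unicode_name = "www.Alliancefrançaise.nu"

# Encoding splits the name into labels and converts each one to ACE as needed.
ace_name = unicode_name.encode("idna")

# U+3002 is also treated as a label separator when encoding;
# the ACE result always uses a plain "." between labels.
ace_from_ideographic_dot = "www\u3002Alliancefrançaise\u3002nu".encode("idna")

print(ace_name)
print(ace_from_ideographic_dot == ace_name)
```

Labels that are already ASCII, such as ``www`` and ``nu``, pass through
unchanged; only the label containing ``ç`` receives the ``xn--`` prefix.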


When receiving host names from the wire (such as in reverse name lookup), no
automatic conversion to Unicode is performed: applications wishing to present
such host names to the user should decode them to Unicode.
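
A sketch of that manual decoding step follows. The ``wire_name`` value is a
hypothetical stand-in for whatever a lookup returns; no network call is made
here.

```python
# Hypothetical ACE host name, standing in for the result of a reverse lookup.
wire_name = "www.xn--alliancefranaise-npb.nu"

# The idna codec decodes bytes, so convert the ASCII-only str first.
display_name = wire_name.encode("ascii").decode("idna")

print(display_name)
```

Note that the decoded name is lower case: nameprep case-folds labels during
encoding, so the original capitalization is not recoverable from the ACE form.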


The module :mod:`encodings.idna` also implements the nameprep procedure, which
performs certain normalizations on host names, to achieve case-insensitivity of
international domain names, and to unify similar characters. The nameprep
functions can be used directly if desired.


.. function:: nameprep(label)

   Return the nameprepped version of *label*. The implementation currently assumes
   query strings, so ``AllowUnassigned`` is true.


.. function:: ToASCII(label)

   Convert a label to ASCII, as specified in :rfc:`3490`. ``UseSTD3ASCIIRules`` is
   assumed to be false.


.. function:: ToUnicode(label)

   Convert a label to Unicode, as specified in :rfc:`3490`.
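
The three functions operate on a single label rather than a dotted host name.
The following sketch applies them to one label from the earlier example; the
variable names are illustrative assumptions.

```python
from encodings.idna import ToASCII, ToUnicode, nameprep

# A single label (no dots), taken from the host name used earlier.
label = "Alliancefrançaise"

# nameprep returns a normalized, case-folded str.
prepped = nameprep(label)

# ToASCII returns bytes carrying the "xn--" ACE prefix for non-ASCII labels.
ace_label = ToASCII(label)

# ToUnicode reverses the conversion and returns a str.
round_trip = ToUnicode(ace_label)

print(prepped)
print(ace_label)
print(round_trip)
```

To process a full host name with these functions, an application would have to
split it into labels itself, which is what the ``idna`` codec does on its
behalf.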


:mod:`encodings.mbcs` --- Windows ANSI codepage
-----------------------------------------------

.. module:: encodings.mbcs
   :synopsis: Windows ANSI codepage

This module implements the ANSI codepage (CP_ACP).

.. availability:: Windows only.

.. versionchanged:: 3.3
   Support any error handler.

.. versionchanged:: 3.2
   Before 3.2, the *errors* argument was ignored; ``'replace'`` was always used
   to encode, and ``'ignore'`` to decode.


:mod:`encodings.utf_8_sig` --- UTF-8 codec with BOM signature
-------------------------------------------------------------

.. module:: encodings.utf_8_sig
   :synopsis: UTF-8 codec with BOM signature
.. moduleauthor:: Walter Dörwald

This module implements a variant of the UTF-8 codec. On encoding, a UTF-8 encoded
BOM will be prepended to the UTF-8 encoded bytes. For the stateful encoder this
is only done once (on the first write to the byte stream). On decoding, an
