
:mod:`codecs` --- Codec registry and base classes
=================================================

.. module:: codecs
   :synopsis: Encode and decode data and streams.

.. moduleauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>

**Source code:** :source:`Lib/codecs.py`

.. index::
   single: Unicode
   single: Codecs
   pair: Codecs; encode
   pair: Codecs; decode
   single: streams
   pair: stackable; streams

--------------

This module defines base classes for standard Python codecs (encoders and
decoders) and provides access to the internal Python codec registry, which
manages the codec and error handling lookup process. Most standard codecs
are :term:`text encodings <text encoding>`, which encode text to bytes,
but there are also codecs provided that encode text to text, and bytes to
bytes. Custom codecs may encode and decode between arbitrary types, but some
module features are restricted to use specifically with
:term:`text encodings <text encoding>`, or with codecs that encode to
:class:`bytes`.

The module defines the following functions for encoding and decoding with
any codec:


.. function:: encode(obj, encoding='utf-8', errors='strict')

   Encodes *obj* using the codec registered for *encoding*.

   *errors* may be given to set the desired error handling scheme. The
   default error handler is ``'strict'``, meaning that encoding errors raise
   :exc:`ValueError` (or a more codec-specific subclass, such as
   :exc:`UnicodeEncodeError`). Refer to :ref:`codec-base-classes` for more
   information on codec error handling.

.. function:: decode(obj, encoding='utf-8', errors='strict')

   Decodes *obj* using the codec registered for *encoding*.

   *errors* may be given to set the desired error handling scheme. The
   default error handler is ``'strict'``, meaning that decoding errors raise
   :exc:`ValueError` (or a more codec-specific subclass, such as
   :exc:`UnicodeDecodeError`). Refer to :ref:`codec-base-classes` for more
   information on codec error handling.
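
As a minimal sketch of :func:`encode` and :func:`decode` (the sample strings
are illustrative assumptions), the same two functions cover both text
encodings and bytes-to-bytes codecs, and the default ``'strict'`` handler
surfaces failures as a :exc:`ValueError` subclass:

```python
import codecs

# Text encoding: str -> bytes and back (sample text is illustrative).
data = codecs.encode("café", "utf-8")
text = codecs.decode(data, "utf-8")

# A bytes-to-bytes codec goes through the same functions.
hexed = codecs.encode(b"abc", "hex_codec")

# With the default 'strict' handler, an unencodable character raises
# UnicodeEncodeError, which is a subclass of ValueError.
try:
    codecs.encode("café", "ascii")
except ValueError as exc:
    error_type = type(exc).__name__
```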

The full details for each codec can also be looked up directly:

.. function:: lookup(encoding)

   Looks up the codec info in the Python codec registry and returns a
   :class:`CodecInfo` object as defined below.

   Encodings are first looked up in the registry's cache. If not found, the list of
   registered search functions is scanned. If no :class:`CodecInfo` object is
   found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
   is stored in the cache and returned to the caller.
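
The lookup behaviour described above can be sketched as follows. The name
``no-such-codec-example`` is a made-up encoding name, chosen only to trigger
the :exc:`LookupError` path:

```python
import codecs

# Different spellings of the same encoding resolve to the same codec name.
info = codecs.lookup("UTF-8")
alias = codecs.lookup("utf8")

# An unknown encoding name (hypothetical here) raises LookupError.
try:
    codecs.lookup("no-such-codec-example")
except LookupError:
    found = False
else:
    found = True
```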

.. class:: CodecInfo(encode, decode, streamreader=None, streamwriter=None, incrementalencoder=None, incrementaldecoder=None, name=None)

   Codec details when looking up the codec registry. The constructor
   arguments are stored in attributes of the same name:


   .. attribute:: name

      The name of the encoding.


   .. attribute:: encode
                  decode

      The stateless encoding and decoding functions. These must be
      functions or methods which have the same interface as
      the :meth:`~Codec.encode` and :meth:`~Codec.decode` methods of Codec
      instances (see :ref:`Codec Interface <codec-objects>`).
      The functions or methods are expected to work in a stateless mode.


   .. attribute:: incrementalencoder
                  incrementaldecoder

      Incremental encoder and decoder classes or factory functions.
      These have to provide the interface defined by the base classes
      :class:`IncrementalEncoder` and :class:`IncrementalDecoder`,
      respectively. Incremental codecs can maintain state.


   .. attribute:: streamwriter
                  streamreader

      Stream writer and reader classes or factory functions. These have to
      provide the interface defined by the base classes
      :class:`StreamWriter` and :class:`StreamReader`, respectively.
      Stream codecs can maintain state.
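
A :class:`CodecInfo` object can also be built by hand. The sketch below reuses
the stateless functions of the UTF-8 codec under the hypothetical name
``example-utf8`` (an assumption made for illustration) and shows that the
constructor arguments end up in attributes of the same name:

```python
import codecs

utf8 = codecs.lookup("utf-8")

# Reuse UTF-8's stateless functions under a hypothetical codec name; the
# optional stream and incremental components keep their default of None.
info = codecs.CodecInfo(
    encode=utf8.encode,
    decode=utf8.decode,
    name="example-utf8",
)

# Stateless functions return a tuple (output object, length consumed).
encoded, consumed = info.encode("hi")
```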

To simplify access to the various codec components, the module provides
these additional functions which use :func:`lookup` for the codec lookup:

.. function:: getencoder(encoding)

   Look up the codec for the given encoding and return its encoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getdecoder(encoding)

   Look up the codec for the given encoding and return its decoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.
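
A short sketch of :func:`getencoder` and :func:`getdecoder`, assuming Latin-1
as the sample encoding. Both return the stateless functions, which yield a
tuple of the output and the length consumed:

```python
import codecs

encode = codecs.getencoder("latin-1")
decode = codecs.getdecoder("latin-1")

# Each call returns (output object, length consumed).
data, consumed = encode("café")
text, used = decode(data)
```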

.. function:: getincrementalencoder(encoding)

   Look up the codec for the given encoding and return its incremental encoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental encoder.


.. function:: getincrementaldecoder(encoding)

   Look up the codec for the given encoding and return its incremental decoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental decoder.
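
The following sketch shows why an incremental decoder is useful: the input is
split in the middle of a multi-byte UTF-8 sequence (the chunk boundaries are
an illustrative assumption), and the decoder buffers the incomplete bytes
until the rest arrives:

```python
import codecs

# getincrementaldecoder() returns a class (or factory); call it to get
# a decoder instance that keeps state between decode() calls.
decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")

# "é" is encoded as b"\xc3\xa9"; here it is split across two chunks.
chunks = [b"caf", b"\xc3", b"\xa9"]
parts = [decoder.decode(chunk) for chunk in chunks]

# Signal the end of input so that leftover bytes are reported as errors.
parts.append(decoder.decode(b"", final=True))
text = "".join(parts)
```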

.. function:: getreader(encoding)

   Look up the codec for the given encoding and return its :class:`StreamReader`
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getwriter(encoding)

   Look up the codec for the given encoding and return its :class:`StreamWriter`
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.
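
As a sketch of :func:`getwriter` and :func:`getreader`, the classes they
return can wrap any binary file-like object. An in-memory
:class:`io.BytesIO` buffer stands in for a real file here, which is an
assumption made to keep the example self-contained:

```python
import codecs
import io

# Wrap a binary stream with the UTF-16-LE stream writer.
raw = io.BytesIO()
writer = codecs.getwriter("utf-16-le")(raw)
writer.write("hi")
written = raw.getvalue()

# Read the same bytes back through the matching stream reader.
reader = codecs.getreader("utf-16-le")(io.BytesIO(written))
text = reader.read()
```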

Custom codecs are made available by registering a suitable codec search
function:

.. function:: register(search_function)

   Register a codec search function. Search functions are expected to take one
   argument, being the encoding name in all lower case letters, and return a
   :class:`CodecInfo` object. In case a search function cannot find
   a given encoding, it should return ``None``.


.. function:: unregister(search_function)

   Unregister a codec search function and clear the registry's cache.
   If the search function is not registered, do nothing.

   .. versionadded:: 3.10
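
A minimal sketch of a search function, assuming a hypothetical codec name
``examplelatin`` that simply reuses the Latin-1 codec. The function name
``example_search`` is likewise invented for illustration, and
:func:`unregister` is only called where it exists:

```python
import codecs

def example_search(name):
    """Hypothetical search function: serve 'examplelatin' via Latin-1."""
    if name == "examplelatin":
        base = codecs.lookup("latin-1")
        return codecs.CodecInfo(base.encode, base.decode, name="examplelatin")
    # Returning None tells the registry to try the next search function.
    return None

codecs.register(example_search)
encoded = codecs.encode("é", "examplelatin")
decoded = codecs.decode(encoded, "examplelatin")

# unregister() was added in Python 3.10, so look it up defensively.
unregister = getattr(codecs, "unregister", None)
if unregister is not None:
    unregister(example_search)
```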

While the built-in :func:`open` and the associated :mod:`io` module are the
recommended approach for working with encoded text files, this module
provides additional utility functions and classes that allow the use of a
wider range of codecs when working with binary files:

.. function:: open(filename, mode='r', encoding=None, errors='strict', buffering=-1)

   Open an encoded file using the given *mode* and return an instance of
   :class:`StreamReaderWriter`, providing transparent encoding/decoding.
   The default file mode is ``'r'``, meaning to open the file in read mode.

   .. note::

      Underlying encoded files are always opened in binary mode.
      No automatic conversion of ``'\n'`` is done on reading and writing.
      The *mode* argument may be any binary mode acceptable to the built-in
      :func:`open` function; the ``'b'`` is automatically added.

   *encoding* specifies the encoding which is to be used for the file.
   Any encoding that encodes to and decodes from bytes is allowed, and
   the data types supported by the file methods depend on the codec used.

   *errors* may be given to define the error handling. It defaults to ``'strict'``,
   which causes a :exc:`ValueError` to be raised in case an encoding error occurs.

   *buffering* has the same meaning as for the built-in :func:`open` function.
   It defaults to ``-1``, which means that the default buffer size will be used.
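
A small sketch of :func:`codecs.open`, assuming a temporary file path created
only for this example. It writes text through the UTF-16 codec and then
inspects the raw bytes to show that the file starts with a byte order mark
and that ``'\n'`` was not converted:

```python
import codecs
import os
import tempfile

# Temporary location used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# The file is opened in binary mode underneath; text is encoded on write.
with codecs.open(path, "w", encoding="utf-16") as f:
    f.write("line one\n")

# Reading decodes transparently and performs no newline conversion.
with codecs.open(path, "r", encoding="utf-16") as f:
    text = f.read()

# Look at the bytes actually stored on disk.
with open(path, "rb") as f:
    raw = f.read()
```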

.. function:: EncodedFile(file, data_encoding, file_encoding=None, errors='strict')

   Return a :class:`StreamRecoder` instance, a wrapped version of *file*
   which provides transparent transcoding. The original file is closed
   when the wrapped version is closed.

   Data written to the wrapped file is decoded according to the given
   *data_encoding* and then written to the original file as bytes using
   *file_encoding*. Bytes read from the original file are decoded
   according to *file_encoding*, and the result is encoded
   using *data_encoding*.

   If *file_encoding* is not given, it defaults to *data_encoding*.

   *errors* may be given to define the error handling. It defaults to
   ``'strict'``, which causes :exc:`ValueError` to be raised in case an encoding
   error occurs.
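
The transcoding direction of :func:`EncodedFile` is easy to mix up, so here is
a sketch under the assumption that the caller works with UTF-8 bytes while the
underlying file stores Latin-1. In-memory :class:`io.BytesIO` buffers stand in
for real files:

```python
import codecs
import io

# Writing: UTF-8 bytes go in, Latin-1 bytes land in the underlying file.
stored = io.BytesIO()
wrapped = codecs.EncodedFile(stored, data_encoding="utf-8",
                             file_encoding="latin-1")
wrapped.write("café".encode("utf-8"))
on_disk = stored.getvalue()

# Reading: Latin-1 bytes in the file come back as UTF-8 bytes.
source = io.BytesIO(b"caf\xe9")
recoded = codecs.EncodedFile(source, data_encoding="utf-8",
                             file_encoding="latin-1").read()
```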

.. function:: iterencode(iterator, encoding, errors='strict', **kwargs)

   Uses an incremental encoder to iteratively encode the input provided by
   *iterator*. This function is a :term:`generator`.
   The *errors* argument (as well as any
   other keyword argument) is passed through to the incremental encoder.

   This function requires that the codec accept text :class:`str` objects
   to encode. Therefore it does not support bytes-to-bytes encoders such as
   ``base64_codec``.

.. function:: iterdecode(iterator, encoding, errors='strict', **kwargs)

   Uses an incremental decoder to iteratively decode the input provided by
   *iterator*. This function is a :term:`generator`.
   The *errors* argument (as well as any
   other keyword argument) is passed through to the incremental decoder.

   This function requires that the codec accept :class:`bytes` objects
   to decode. Therefore it does not support text-to-text encoders such as
   ``rot_13``, although ``rot_13`` may be used equivalently with
   :func:`iterencode`.
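
A brief sketch of both generators, with chunk boundaries chosen as an
illustrative assumption. The decoding input is deliberately split inside a
multi-byte sequence to show that the pieces are reassembled correctly:

```python
import codecs

# Encode an iterable of str chunks; the generator yields bytes chunks.
encoded = b"".join(codecs.iterencode(["ca", "fé"], "utf-8"))

# Decode an iterable of bytes chunks, split in the middle of "é".
decoded = "".join(codecs.iterdecode([b"caf", b"\xc3", b"\xa9"], "utf-8"))
```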

The module also provides the following constants which are useful for reading
and writing to platform-dependent files:


.. data:: BOM
          BOM_BE
          BOM_LE
          BOM_UTF8
          BOM_UTF16
          BOM_UTF16_BE
          BOM_UTF16_LE
          BOM_UTF32
          BOM_UTF32_BE
          BOM_UTF32_LE

   These constants define various byte sequences,
   being Unicode byte order marks (BOMs) for several encodings. They are
   used in UTF-16 and UTF-32 data streams to indicate the byte order used,
   and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either
   :const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's
   native byte order, :const:`BOM` is an alias for :const:`BOM_UTF16`,
   :const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for
   :const:`BOM_UTF16_BE`. The others represent the BOM in UTF-8 and UTF-32
   encodings.
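
The relationships between these constants can be checked directly. The sketch
below also shows one common use, assuming input that starts with the UTF-8
signature: the ``utf-8-sig`` codec strips it, while plain ``utf-8`` keeps it
as a ``U+FEFF`` character:

```python
import codecs
import sys

# BOM_UTF16 follows the platform's native byte order.
if sys.byteorder == "little":
    native = codecs.BOM_UTF16_LE
else:
    native = codecs.BOM_UTF16_BE

# Sample data that begins with the UTF-8 signature.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")
kept = data.decode("utf-8")          # signature survives as U+FEFF
stripped = data.decode("utf-8-sig")  # signature is removed
```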

.. _codec-base-classes:

Codec Base Classes
------------------

The :mod:`codecs` module defines a set of base classes which define the
interfaces for working with codec objects, and can also be used as the basis
for custom codec implementations.

Each codec has to define four interfaces to make it usable as a codec in Python:
stateless encoder, stateless decoder, stream reader and stream writer. The
stream readers and writers typically reuse the stateless encoder/decoder to
implement the file protocols. Codec authors also need to define how the
codec will handle encoding and decoding errors.


.. _surrogateescape:
.. _error-handlers:

Error Handlers
^^^^^^^^^^^^^^

To simplify and standardize error handling,
codecs may implement different error handling schemes by
accepting the *errors* string argument. The following string values are
defined and implemented by all standard Python codecs:

.. tabularcolumns:: |l|L|

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'strict'``            | Raise :exc:`UnicodeError` (or a subclass);    |
|                         | this is the default. Implemented in           |
|                         | :func:`strict_errors`.                        |
+-------------------------+-----------------------------------------------+
| ``'ignore'``            | Ignore the malformed data and continue        |
|                         | without further notice. Implemented in        |
|                         | :func:`ignore_errors`.                        |
+-------------------------+-----------------------------------------------+

The following error handlers are only applicable to
:term:`text encodings <text encoding>`:

.. index::
   single: ? (question mark); replacement character
   single: \ (backslash); escape sequence
   single: \x; escape sequence
   single: \u; escape sequence
   single: \U; escape sequence
   single: \N; escape sequence

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'replace'``           | Replace with a suitable replacement           |
|                         | marker; Python will use the official          |
|                         | ``U+FFFD`` REPLACEMENT CHARACTER for the      |
|                         | built-in codecs on decoding, and '?' on       |
|                         | encoding. Implemented in                      |
|                         | :func:`replace_errors`.                       |
+-------------------------+-----------------------------------------------+
| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character    |
|                         | reference (only for encoding). Implemented    |
|                         | in :func:`xmlcharrefreplace_errors`.          |
+-------------------------+-----------------------------------------------+
| ``'backslashreplace'``  | Replace with backslashed escape sequences.    |
|                         | Implemented in                                |
|                         | :func:`backslashreplace_errors`.              |
+-------------------------+-----------------------------------------------+
| ``'namereplace'``       | Replace with ``\N{...}`` escape sequences     |
|                         | (only for encoding). Implemented in           |
|                         | :func:`namereplace_errors`.                   |
+-------------------------+-----------------------------------------------+
| ``'surrogateescape'``   | On decoding, replace byte with individual     |
|                         | surrogate code ranging from ``U+DC80`` to     |
|                         | ``U+DCFF``. This code will then be turned     |
|                         | back into the same byte when the              |
|                         | ``'surrogateescape'`` error handler is used   |
|                         | when encoding the data. (See :pep:`383` for   |
|                         | more.)                                        |
+-------------------------+-----------------------------------------------+
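
To make the handlers in the tables above concrete, the following sketch
applies each of them to the same sample input. The text ``"café"`` and the
malformed byte string are illustrative assumptions:

```python
text = "café"

# Encoding to ASCII: "é" cannot be represented.
encoded = {
    name: text.encode("ascii", errors=name)
    for name in ("ignore", "replace", "xmlcharrefreplace",
                 "backslashreplace", "namereplace")
}

# Decoding as UTF-8: the trailing byte 0xE9 is malformed.
raw = b"caf\xe9"
decoded = {
    name: raw.decode("utf-8", errors=name)
    for name in ("ignore", "replace", "backslashreplace", "surrogateescape")
}

# 'surrogateescape' turns the surrogate back into the original byte.
restored = decoded["surrogateescape"].encode("utf-8", errors="surrogateescape")
```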

In addition, the following error handler is specific to the given codecs:

+-------------------+------------------------+-------------------------------------------+
| Value             | Codecs                 | Meaning                                   |
+===================+========================+===========================================+
|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding of surrogate  |
|                   | utf-16-be, utf-16-le,  | codes. These codecs normally treat the    |
|                   | utf-32-be, utf-32-le   | presence of surrogates as an error.       |
+-------------------+------------------------+-------------------------------------------+
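
A short sketch of ``'surrogatepass'``, using a lone surrogate as an
illustrative input. UTF-8 rejects it under ``'strict'`` but lets it through
in both directions with this handler:

```python
lone = "\ud800"  # a lone surrogate code point

# The default 'strict' handler treats the surrogate as an error.
try:
    lone.encode("utf-8")
except UnicodeEncodeError:
    strict_failed = True
else:
    strict_failed = False

# 'surrogatepass' encodes and decodes it anyway.
passed = lone.encode("utf-8", errors="surrogatepass")
back = passed.decode("utf-8", errors="surrogatepass")
```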

.. versionadded:: 3.1
   The ``'surrogateescape'`` and ``'surrogatepass'`` error handlers.

.. versionchanged:: 3.4
   The ``'surrogatepass'`` error handler now works with utf-16\* and utf-32\* codecs.

.. versionadded:: 3.5
   The ``'namereplace'`` error handler.

.. versionchanged:: 3.5
   The ``'backslashreplace'`` error handler now works with decoding and
   translating.

The set of allowed values can be extended by registering a new named error
handler:

.. function:: register_error(name, error_handler)

   Register the error handling function *error_handler* under the name *name*.
   The *error_handler* argument will be called during encoding and decoding
   in case of an error, when *name* is specified as the errors parameter.

   For encoding, *error_handler* will be called with a :exc:`UnicodeEncodeError`
   instance, which contains information about the location of the error. The
   error handler must either raise this or a different exception, or return a
   tuple with a replacement for the unencodable part of the input and a position
   where encoding should continue. The replacement may be either :class:`str` or
   :class:`bytes`. If the replacement is bytes, the encoder will simply copy
   them into the output buffer. If the replacement is a string, the encoder will
   encode the replacement. Encoding continues on the original input at the
   specified position. Negative position values will be treated as being
   relative to the end of the input string. If the resulting position is out of
   bounds, an :exc:`IndexError` will be raised.

   Decoding and translating work similarly, except that :exc:`UnicodeDecodeError`
   or :exc:`UnicodeTranslateError` will be passed to the handler and the
   replacement from the error handler will be put into the output directly.
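
A minimal sketch of a custom handler, registered under the hypothetical name
``example_underscore``. The handler function and its replacement policy are
assumptions made for illustration. It returns the tuple described above: a
replacement string and the position at which encoding should resume:

```python
import codecs

def underscore_errors(exc):
    """Hypothetical handler: replace each unencodable character with '_'."""
    if isinstance(exc, UnicodeEncodeError):
        # exc.start and exc.end delimit the unencodable part of exc.object.
        return ("_" * (exc.end - exc.start), exc.end)
    # Decoding and translating are not handled by this sketch.
    raise exc

codecs.register_error("example_underscore", underscore_errors)

encoded = "naïve café".encode("ascii", errors="example_underscore")
```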

Previously registered error handlers (including the standard error handlers)
can be looked up by name:

.. function:: lookup_error(name)

   Return the error handler previously registered under the name *name*.

   Raises a :exc:`LookupError` in case the handler cannot be found.
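
As a quick sketch, a standard handler can be fetched by name and is an
ordinary callable. The name ``no-such-handler-example`` is made up to show
the failure case:

```python
import codecs

# Standard handlers are registered under their documented names.
handler = codecs.lookup_error("ignore")

# An unregistered name (hypothetical here) raises LookupError.
try:
    codecs.lookup_error("no-such-handler-example")
except LookupError:
    missing = True
else:
    missing = False
```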

The following standard error handlers are also made available as module-level
functions:

.. function:: strict_errors(exception)

   Implements the ``'strict'`` error handling: each encoding or
   decoding error raises a :exc:`UnicodeError`.


.. function:: replace_errors(exception)

   Implements the ``'replace'`` error handling (for :term:`text encodings
   <text encoding>` only): substitutes ``'?'`` for encoding errors
   (to be encoded by the codec), and ``'\ufffd'`` (the Unicode replacement
   character) for decoding errors.


.. function:: ignore_errors(exception)

   Implements the ``'ignore'`` error handling: malformed data is ignored and
   encoding or decoding is continued without further notice.


.. function:: xmlcharrefreplace_errors(exception)

   Implements the ``'xmlcharrefreplace'`` error handling (for encoding with
   :term:`text encodings <text encoding>` only): the
   unencodable character is replaced by an appropriate XML character reference.


.. function:: backslashreplace_errors(exception)

   Implements the ``'backslashreplace'`` error handling (for
   :term:`text encodings <text encoding>` only): malformed data is
   replaced by a backslashed escape sequence.

.. function:: namereplace_errors(exception)

   Implements the ``'namereplace'`` error handling (for encoding with
   :term:`text encodings <text encoding>` only): the
   unencodable character is replaced by a ``\N{...}`` escape sequence.

   .. versionadded:: 3.5
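
These functions can also be called directly with an exception instance, which
makes the returned tuple of replacement and resume position visible. The
hand-built :exc:`UnicodeEncodeError` below is an assumption made for
illustration; normally the codec creates it:

```python
import codecs

# Describe a failure to encode "é" (index 0 to 1) with the ASCII codec.
exc = UnicodeEncodeError("ascii", "é", 0, 1, "ordinal not in range(128)")

results = {
    "replace": codecs.replace_errors(exc),
    "ignore": codecs.ignore_errors(exc),
    "xmlcharrefreplace": codecs.xmlcharrefreplace_errors(exc),
    "backslashreplace": codecs.backslashreplace_errors(exc),
    "namereplace": codecs.namereplace_errors(exc),
}

# strict_errors() re-raises the exception it is given.
try:
    codecs.strict_errors(exc)
except UnicodeError as raised:
    same_exception = raised is exc
```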


.. _codec-objects:

Stateless Encoding and Decoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The base :class:`Codec` class defines these methods which also define the
function interfaces of the stateless encoder and decoder:


.. method:: Codec.encode(input[, errors])

   Encodes the object *input* and returns a tuple (output object, length consumed).
   For instance, :term:`text encoding` converts
   a string object to a bytes object using a particular
   character set encoding (e.g., ``cp1252`` or ``iso-8859-1``).

   The *errors* argument defines the error handling to apply.
   It defaults to ``'strict'`` handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamWriter` for codecs which have to keep state in order to make
   encoding efficient.

   The encoder must be able to handle zero-length input and return an empty object
   of the output object type in this situation.


.. method:: Codec.decode(input[, errors])

   Decodes the object *input* and returns a tuple (output object, length
   consumed). For instance, for a :term:`text encoding`, decoding converts
   a bytes object encoded using a particular
   character set encoding to a string object.

   For text encodings and bytes-to-bytes codecs,
   *input* must be a bytes object or one which provides the read-only
   buffer interface, for example buffer objects and memory-mapped files.

   The *errors* argument defines the error handling to apply.
   It defaults to ``'strict'`` handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamReader` for codecs which have to keep state in order to make
   decoding efficient.

   The decoder must be able to handle zero-length input and return an empty object
   of the output object type in this situation.
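
The contract of the two stateless methods can be sketched with a toy
subclass. ``HexTextCodec`` is a hypothetical codec invented for this example
(it is not part of the standard library): it maps text to the ASCII hex
digits of its UTF-8 bytes, returns the required tuple, and handles
zero-length input:

```python
import binascii
import codecs

class HexTextCodec(codecs.Codec):
    """Hypothetical codec: text <-> ASCII hex digits of its UTF-8 bytes."""

    def encode(self, input, errors="strict"):
        raw = input.encode("utf-8", errors)
        # Return (output object, length of input consumed).
        return binascii.hexlify(raw), len(input)

    def decode(self, input, errors="strict"):
        data = bytes(input)  # accept any object with the buffer interface
        text = binascii.unhexlify(data).decode("utf-8", errors)
        return text, len(data)

codec = HexTextCodec()
encoded, consumed = codec.encode("hé")
decoded, used = codec.decode(memoryview(encoded))
empty = codec.encode("")
```

No state is kept on the instance between calls, which is what makes the
methods safe to use as stateless encoder and decoder functions.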

Incremental Encoding and Decoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide
the basic interface for incremental encoding and decoding. Encoding/decoding the
input isn't done with one call to the stateless encoder/decoder function, but
with multiple calls to the
:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method of
the incremental encoder/decoder. The incremental encoder/decoder keeps track of
the encoding/decoding process during method calls.

The joined output of calls to the
:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method is
the same as if all the single inputs were joined into one, and this input was
encoded/decoded with the stateless encoder/decoder.
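
The equivalence between joined incremental output and a single stateless call can be checked with a short sketch. The sample text and the one-byte chunking are arbitrary choices for illustration.

```python
import codecs

text = "Grüße, 世界"
data = text.encode("utf-8")

# Feed the bytes one at a time, so multi-byte sequences are split
# across calls and the decoder has to buffer incomplete input.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(data[i:i + 1]) for i in range(len(data))]
pieces.append(decoder.decode(b"", final=True))  # signal end of input

incremental_result = "".join(pieces)

# The stateless decoder handles the whole input in one call.
stateless_result, consumed = codecs.getdecoder("utf-8")(data)
```

Both ``incremental_result`` and ``stateless_result`` equal the original ``text``.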

.. _incremental-encoder-objects:

IncrementalEncoder Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~

The :class:`IncrementalEncoder` class is used for encoding an input in multiple
steps. It defines the following methods which every incremental encoder must
define in order to be compatible with the Python codec registry.

.. class:: IncrementalEncoder(errors='strict')

Constructor for an :class:`IncrementalEncoder` instance.

All incremental encoders must provide this constructor interface. They are free
to add additional keyword arguments, but only the ones defined here are used by
the Python codec registry.

The :class:`IncrementalEncoder` may implement different error handling schemes
by providing the *errors* keyword argument. See :ref:`error-handlers` for
possible values.

The *errors* argument will be assigned to an attribute of the same name.
Assigning to this attribute makes it possible to switch between different error
handling strategies during the lifetime of the :class:`IncrementalEncoder`
object.

.. method:: encode(object[, final])

Encodes *object* (taking the current state of the encoder into account)
and returns the resulting encoded object. If this is the last call to
:meth:`encode`, *final* must be true (the default is false).

.. method:: reset()

Reset the encoder to the initial state. The output is discarded: call
``.encode(object, final=True)``, passing an empty byte or text string
if necessary, to reset the encoder and to get the output.
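
A sketch of encoder state, assuming CPython's ``utf-16`` incremental encoder: it writes a BOM on the first call only, and :meth:`reset` returns it to the state in which the BOM is written again. The single-character inputs are illustrative.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-16")()

# The first call emits the BOM followed by the encoded character.
first = encoder.encode("a")

# Later calls continue in the same byte order without a new BOM.
second = encoder.encode("b", final=True)

# After reset() the encoder behaves like a fresh instance.
encoder.reset()
after_reset = encoder.encode("c", final=True)
```

``first`` and ``after_reset`` both start with ``codecs.BOM_UTF16``, while ``second`` holds only the two bytes of ``"b"``.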

.. method:: getstate()

Return the current state of the encoder, which must be an integer. The
implementation should make sure that ``0`` is the most common
state. (States that are more complicated than integers can be converted
into an integer by marshaling/pickling the state and encoding the bytes
of the resulting string into an integer.)

.. method:: setstate(state)

Set the state of the encoder to *state*. *state* must be an encoder state
returned by :meth:`getstate`.

.. _incremental-decoder-objects:

IncrementalDecoder Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~

The :class:`IncrementalDecoder` class is used for decoding an input in multiple
steps. It defines the following methods which every incremental decoder must
define in order to be compatible with the Python codec registry.

.. class:: IncrementalDecoder(errors='strict')

Constructor for an :class:`IncrementalDecoder` instance.

All incremental decoders must provide this constructor interface. They are free
to add additional keyword arguments, but only the ones defined here are used by
the Python codec registry.

The :class:`IncrementalDecoder` may implement different error handling schemes
by providing the *errors* keyword argument. See :ref:`error-handlers` for
possible values.

The *errors* argument will be assigned to an attribute of the same name.
Assigning to this attribute makes it possible to switch between different error
handling strategies during the lifetime of the :class:`IncrementalDecoder`
object.

.. method:: decode(object[, final])

Decodes *object* (taking the current state of the decoder into account)
and returns the resulting decoded object. If this is the last call to
:meth:`decode`, *final* must be true (the default is false). If *final* is
true, the decoder must decode the input completely and must flush all
buffers. If this isn't possible (e.g. because of incomplete byte sequences
at the end of the input), it must initiate error handling just like in the
stateless case (which might raise an exception).
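
The effect of *final* on incomplete input can be sketched as follows, using the UTF-8 incremental decoder and a deliberately truncated byte sequence. The variable names are illustrative.

```python
import codecs

# First two bytes of the three-byte UTF-8 sequence for the euro sign.
truncated = "€".encode("utf-8")[:2]

# With strict error handling the incomplete bytes are buffered first ...
strict = codecs.getincrementaldecoder("utf-8")(errors="strict")
buffered_output = strict.decode(truncated)

# ... and only the final call turns them into an error.
try:
    strict.decode(b"", final=True)
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# With errors="replace" the final call emits a replacement character.
lenient = codecs.getincrementaldecoder("utf-8")(errors="replace")
lenient.decode(truncated)
lenient_output = lenient.decode(b"", final=True)
```

Until *final* is true, the decoder cannot know whether more bytes will follow, so it returns an empty string and keeps the bytes buffered.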

.. method:: reset()

Reset the decoder to the initial state.

.. method:: getstate()

Return the current state of the decoder. This must be a tuple with two
items: the first must be the buffer containing the still undecoded
input, and the second must be an integer that can carry additional state
info. (The implementation should make sure that ``0`` is the most common
additional state info.) If this additional state info is ``0``, it must be
possible to set the decoder to the state which has no input buffered and
``0`` as the additional state info, so that feeding the previously
buffered input to the decoder returns it to the previous state without
producing any output. (Additional state info that is more complicated than
integers can be converted into an integer by marshaling/pickling the info
and encoding the bytes of the resulting string into an integer.)

.. method:: setstate(state)

Set the state of the decoder to *state*. *state* must be a decoder state
returned by :meth:`getstate`.
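
A small sketch of saving and restoring decoder state, assuming the UTF-8 incremental decoder. The state is captured while part of a multi-byte sequence is still buffered.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# Feed an incomplete sequence: the bytes stay in the decoder's buffer.
decoder.decode(b"\xe2\x82")
state = decoder.getstate()  # (undecoded input, additional state info)

# Discard the buffered bytes, then restore them from the saved state.
decoder.reset()
state_after_reset = decoder.getstate()
decoder.setstate(state)

# The final byte now completes the euro sign.
result = decoder.decode(b"\xac", final=True)
```

For this decoder the saved state is ``(b"\xe2\x82", 0)`` and ``result`` is ``"€"``.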

Stream Encoding and Decoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`StreamWriter` and :class:`StreamReader` classes provide generic
working interfaces which can be used to implement new encoding submodules very
easily. See :mod:`encodings.utf_8` for an example of how this is done.

.. _stream-writer-objects:

StreamWriter Objects
~~~~~~~~~~~~~~~~~~~~

The :class:`StreamWriter` class is a subclass of :class:`Codec` and defines the
following methods which every stream writer must define in order to be
compatible with the Python codec registry.

.. class:: StreamWriter(stream, errors='strict')

Constructor for a :class:`StreamWriter` instance.

All stream writers must provide this constructor interface. They are free to add
additional keyword arguments, but only the ones defined here are used by the
Python codec registry.

The *stream* argument must be a file-like object open for writing
text or binary data, as appropriate for the specific codec.

The :class:`StreamWriter` may implement different error handling schemes by
providing the *errors* keyword argument. See :ref:`error-handlers` for
the standard error handlers the underlying stream codec may support.

The *errors* argument will be assigned to an attribute of the same name.
Assigning to this attribute makes it possible to switch between different error
handling strategies during the lifetime of the :class:`StreamWriter` object.

.. method:: write(object)

Writes the object's contents encoded to the stream.

.. method:: writelines(list)

Writes the concatenated list of strings to the stream (possibly by reusing
the :meth:`write` method). The standard bytes-to-bytes codecs
do not support this method.

.. method:: reset()

Flushes and resets the codec buffers used for keeping state.

Calling this method should ensure that the data on the output is put into
a clean state that allows appending of new fresh data without having to
rescan the whole stream to recover state.

In addition to the above methods, the :class:`StreamWriter` must also inherit
all other methods and attributes from the underlying stream.
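
The following sketch wraps an in-memory binary stream with the ASCII stream writer obtained from :func:`codecs.getwriter`, and switches the error handling strategy by assigning to the ``errors`` attribute. The stream and the sample strings are assumptions for illustration.

```python
import codecs
import io

raw = io.BytesIO()
writer = codecs.getwriter("ascii")(raw)  # errors defaults to 'strict'

writer.write("abc")

# Under 'strict' handling a non-ASCII character raises an error.
try:
    writer.write("é")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Switch strategy during the lifetime of the writer.
writer.errors = "replace"
writer.write("é")
writer.writelines(["x", "y"])

# Other attributes, such as tell(), come from the underlying stream.
position = writer.tell()
written = raw.getvalue()
```

After these calls ``written`` is ``b"abc?xy"``, since the failed strict write put nothing on the stream.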

.. _stream-reader-objects:

StreamReader Objects
~~~~~~~~~~~~~~~~~~~~

The :class:`StreamReader` class is a subclass of :class:`Codec` and defines the
following methods which every stream reader must define in order to be
compatible with the Python codec registry.

.. class:: StreamReader(stream, errors='strict')

Constructor for a :class:`StreamReader` instance.

All stream readers must provide this constructor interface. They are free to add
additional keyword arguments, but only the ones defined here are used by the
Python codec registry.

The *stream* argument must be a file-like object open for reading
text or binary data, as appropriate for the specific codec.

The :class:`StreamReader` may implement different error handling schemes by
providing the *errors* keyword argument. See :ref:`error-handlers` for
the standard error handlers the underlying stream codec may support.

The *errors* argument will be assigned to an attribute of the same name.
Assigning to this attribute makes it possible to switch between different error
handling strategies during the lifetime of the :class:`StreamReader` object.

The set of allowed values for the *errors* argument can be extended with
:func:`register_error`.

.. method:: read([size[, chars[, firstline]]])

Decodes data from the stream and returns the resulting object.

The *chars* argument indicates the number of decoded
code points or bytes to return. The :meth:`read` method will
never return more data than requested, but it might return less,
if there is not enough available.

The *size* argument indicates the approximate maximum
number of encoded bytes or code points to read
for decoding. The decoder can modify this setting as
appropriate. The default value of -1 means that as much as possible is
read and decoded. This parameter is intended to
prevent having to decode huge files in one step.

The *firstline* flag indicates that
it would be sufficient to only return the first
line, if there are decoding errors on later lines.

The method should use a greedy read strategy, meaning that it should read
as much data as is allowed within the definition of the encoding and the
given size, e.g. if optional encoding endings or state markers are
available on the stream, these should be read too.

.. method:: readline([size[, keepends]])

Read one line from the input stream and return the decoded data.

*size*, if given, is passed as the *size* argument to the stream's
:meth:`read` method.

If *keepends* is false, line-endings will be stripped from the lines
returned.

.. method:: readlines([sizehint[, keepends]])

Read all lines available on the input stream and return them as a list of
lines.

Line-endings are implemented using the codec's :meth:`decode` method and
are included in the list entries if *keepends* is true.

*sizehint*, if given, is passed as the *size* argument to the stream's
:meth:`read` method.

.. method:: reset()

Resets the codec buffers used for keeping state.

Note that no stream repositioning should take place. This method is
primarily intended to be able to recover from decoding errors.

In addition to the above methods, the :class:`StreamReader` must also inherit
all other methods and attributes from the underlying stream.
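
As an illustrative sketch, a UTF-8 stream reader from :func:`codecs.getreader` can be wrapped around an in-memory byte stream. The sample text is an assumption.

```python
import codecs
import io

raw = io.BytesIO("één\ntwee\ndrie\n".encode("utf-8"))
reader = codecs.getreader("utf-8")(raw)

# readline() keeps the line ending by default.
first_line = reader.readline()

# readlines() returns the remaining lines; keepends=False strips endings.
remaining = reader.readlines(keepends=False)
```

Here ``first_line`` is ``"één\n"`` and ``remaining`` is ``["twee", "drie"]``.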

.. _stream-reader-writer:

StreamReaderWriter Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~

The :class:`StreamReaderWriter` is a convenience class that allows wrapping
streams which work in both read and write modes.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.

.. class:: StreamReaderWriter(stream, Reader, Writer, errors='strict')

Creates a :class:`StreamReaderWriter` instance. *stream* must be a file-like
object. *Reader* and *Writer* must be factory functions or classes providing the
:class:`StreamReader` and :class:`StreamWriter` interfaces, respectively. Error
handling is done in the same way as defined for the stream readers and writers.

:class:`StreamReaderWriter` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.
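
A minimal sketch of constructing a :class:`StreamReaderWriter` from the factories returned by :func:`lookup`, using an in-memory stream chosen for illustration:

```python
import codecs
import io

info = codecs.lookup("utf-8")
raw = io.BytesIO()

# The CodecInfo object supplies the Reader and Writer factories.
stream = codecs.StreamReaderWriter(
    raw, info.streamreader, info.streamwriter, "strict"
)

stream.write("zażółć\n")  # encoded to UTF-8 on the way in
stream.seek(0)            # rewind the underlying stream
line = stream.readline()  # decoded from UTF-8 on the way out

stored = raw.getvalue()
```

The underlying stream holds the UTF-8 bytes, while callers of ``stream`` only ever see text.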

.. _stream-recoder-objects:

StreamRecoder Objects
~~~~~~~~~~~~~~~~~~~~~

The :class:`StreamRecoder` translates data from one encoding to another,
which is sometimes useful when dealing with different encoding environments.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.

.. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors='strict')

Creates a :class:`StreamRecoder` instance which implements a two-way conversion:
*encode* and *decode* work on the frontend (the data visible to
code calling :meth:`read` and :meth:`write`), while *Reader* and *Writer*
work on the backend (the data in *stream*).

You can use these objects to do transparent transcodings, e.g., from Latin-1
to UTF-8 and back.

The *stream* argument must be a file-like object.

The *encode* and *decode* arguments must
adhere to the :class:`Codec` interface. *Reader* and
*Writer* must be factory functions or classes providing objects of the
:class:`StreamReader` and :class:`StreamWriter` interface respectively.

Error handling is done in the same way as defined for the stream readers and
writers.

:class:`StreamRecoder` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.
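
The frontend/backend split can be sketched with a Latin-1 frontend over a UTF-8 backend. The in-memory stream and the single-character payload are illustrative assumptions.

```python
import codecs
import io

frontend = codecs.lookup("latin-1")  # what callers of read()/write() see
backend = codecs.lookup("utf-8")     # how the data is stored in the stream

raw = io.BytesIO()
recoder = codecs.StreamRecoder(
    raw,
    frontend.encode, frontend.decode,
    backend.streamreader, backend.streamwriter,
)

# Callers write Latin-1 bytes; the stream receives UTF-8 bytes.
recoder.write("é".encode("latin-1"))
stored = raw.getvalue()

# Reading converts the stored UTF-8 back to Latin-1 bytes.
raw.seek(0)
read_back = recoder.read()
```

``stored`` is ``b"\xc3\xa9"`` (UTF-8) and ``read_back`` is ``b"\xe9"`` (Latin-1).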

.. _encodings-overview:

Encodings and Unicode
---------------------

Strings are stored internally as sequences of code points in the
range ``0x0``--``0x10FFFF``. (See :pep:`393` for
more details about the implementation.)
Once a string object is used outside of CPU and memory, endianness
and how these arrays are stored as bytes become an issue. As with other
codecs, serialising a string into a sequence of bytes is known as *encoding*,
and recreating the string from the sequence of bytes is known as *decoding*.
There are a variety of different text serialisation codecs, which are
collectively referred to as :term:`text encodings <text encoding>`.


The simplest text encoding (called ``'latin-1'`` or ``'iso-8859-1'``) maps
the code points 0--255 to the bytes ``0x0``--``0xff``, which means that a string
object that contains code points above ``U+00FF`` can't be encoded with this
codec. Doing so will raise a :exc:`UnicodeEncodeError` that looks
like the following (although the details of the error message may differ):
``UnicodeEncodeError: 'latin-1' codec can't encode character '\u1234' in
position 3: ordinal not in range(256)``.
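
A short sketch that triggers this error and inspects the exception attributes. The sample string is chosen so that the offending character sits at position 3.

```python
text = "abc\u1234"

try:
    text.encode("latin-1")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# start and end delimit the slice of the input that could not be encoded.
failed_slice = text[error.start:error.end]

# A non-strict error handler substitutes the character instead.
fallback = text.encode("latin-1", errors="replace")
```

With ``errors="replace"`` the unencodable character becomes ``?``, giving ``b"abc?"``.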

There's another group of encodings (the so-called charmap encodings) that choose
a different subset of all Unicode code points and how these code points are
mapped to the bytes ``0x0``--``0xff``. To see how this is done, simply open
e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
Windows). There's a string constant with 256 characters that shows you which
character is mapped to which byte value.

All of these encodings can only encode 256 of the 1114112 code points
defined in Unicode. A simple and straightforward way that can store each Unicode
code point is to store each code point as four consecutive bytes. There are two
possibilities: store the bytes in big endian or in little endian order. These
two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
disadvantage is that if, for example, you use ``UTF-32-BE`` on a little endian
machine you
will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
problem: bytes will always be in natural endianness. When these bytes are read
by a CPU with a different endianness, then bytes have to be swapped though. To
be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
there's the so-called BOM ("Byte Order Mark"). This is the Unicode character
``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
byte sequence. The byte-swapped version of this character (``0xFFFE``) is an
illegal character that may not appear in a Unicode text. So when the
first character in a ``UTF-16`` or ``UTF-32`` byte sequence
appears to be a ``U+FFFE``, the bytes have to be swapped on decoding.
Unfortunately the character ``U+FEFF`` had a second purpose as
a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
a word to be split. It can e.g. be used to give hints to a ligature algorithm.
With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
Unicode software still must be able to handle ``U+FEFF`` in both roles: as a BOM
it's a device to determine the storage layout of the encoded bytes, and vanishes
once the byte sequence has been decoded into a string; as a ``ZERO WIDTH
NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
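
Both roles of ``U+FEFF`` can be observed with the ``utf-16`` family of codecs. This sketch uses the BOM constants from the :mod:`codecs` module and an arbitrary two-character string.

```python
import codecs

text = "hi"

# The generic codec prepends a BOM in the machine's native byte order.
with_bom = text.encode("utf-16")

# The explicit-endianness codecs write no BOM.
little = text.encode("utf-16-le")
big = text.encode("utf-16-be")

# On decoding, the generic codec uses the BOM to pick the byte order
# and then drops it from the result.
from_little = (codecs.BOM_UTF16_LE + little).decode("utf-16")
from_big = (codecs.BOM_UTF16_BE + big).decode("utf-16")

# An explicit-endianness codec keeps U+FEFF as an ordinary character.
kept = (codecs.BOM_UTF16_LE + little).decode("utf-16-le")
```

``from_little`` and ``from_big`` are both ``"hi"``, whereas ``kept`` is ``"\ufeffhi"``.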

There's another encoding that is able to encode the full range of Unicode
characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
parts: marker bits (the most significant bits) and payload bits. The marker bits
are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode characters are
encoded like this (with x being payload bits, which when concatenated give the
Unicode character):

+-----------------------------------+----------------------------------------------+
| Range                             | Encoding                                     |
+===================================+==============================================+
| ``U-00000000`` ... ``U-0000007F`` | 0xxxxxxx                                     |
+-----------------------------------+----------------------------------------------+
| ``U-00000080`` ... ``U-000007FF`` | 110xxxxx 10xxxxxx                            |
+-----------------------------------+----------------------------------------------+
| ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
+-----------------------------------+----------------------------------------------+
| ``U-00010000`` ... ``U-0010FFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
+-----------------------------------+----------------------------------------------+

The least significant bit of the Unicode character is the rightmost x bit.
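
The bit layout in the table can be reproduced by hand. The helper below is a hypothetical illustration, not part of the :mod:`codecs` module, and unlike the built-in codec it does not reject surrogate code points.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one code point following the marker/payload layout above."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# One sample code point from each row of the table.
samples = [0x41, 0xE9, 0x20AC, 0x1F600]
manual = [utf8_encode_code_point(cp) for cp in samples]
builtin = [chr(cp).encode("utf-8") for cp in samples]
```

For these samples the hand-built bytes match what the built-in ``utf-8`` codec produces.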

943

944

As UTF-8 is an 8-bit encoding no BOM is required and any ``U+FEFF`` character in

Georg Brandl

2008-05-11 14:52:00 +0000

[diff] [blame]

945

the decoded string (even if it's the first character) is treated as a ``ZERO

946

WIDTH NO-BREAK SPACE``.

Without external information it's impossible to reliably determine which
encoding was used for encoding a string. Each charmap encoding can
decode any random byte sequence. However, that's not possible with UTF-8, as
UTF-8 byte sequences have a structure that doesn't allow arbitrary byte
sequences. To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python calls
``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters
is written to the file, a UTF-8 encoded BOM (which looks like this as a byte
sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable
that any charmap encoded file starts with these byte values (which would e.g.
map to

   | LATIN SMALL LETTER I WITH DIAERESIS
   | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
   | INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a ``utf-8-sig`` encoding can be
correctly guessed from the byte sequence. So here the BOM is not used to be able
to determine the byte order used for generating the byte sequence, but as a
signature that helps in guessing the encoding. On encoding, the ``utf-8-sig`` codec
will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
decoding, ``utf-8-sig`` will skip those three bytes if they appear as the first
three bytes in the file. In UTF-8, the use of the BOM is discouraged and
should generally be avoided.
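
The difference between the two codecs can be sketched in a few lines. The
sample text is arbitrary; :data:`codecs.BOM_UTF8` is the constant for the
three signature bytes.

```python
import codecs

text = "spam"

# Encoding with utf-8-sig prepends the three-byte signature.
with_sig = text.encode("utf-8-sig")
assert with_sig == codecs.BOM_UTF8 + b"spam"

# Decoding with utf-8-sig strips a leading signature if one is present ...
assert with_sig.decode("utf-8-sig") == "spam"
# ... and leaves data without a signature unchanged.
assert b"spam".decode("utf-8-sig") == "spam"

# The plain utf-8 codec keeps the signature as a U+FEFF character.
assert with_sig.decode("utf-8") == "\ufeffspam"
```

Reading with ``utf-8-sig`` is therefore a reasonable choice when the input may
or may not carry a signature, while writing with plain ``utf-8`` avoids
emitting one.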

.. _standard-encodings:

Standard Encodings
------------------

Python comes with a number of codecs built-in, either implemented as C functions
or with dictionaries as mapping tables. The following table lists the codecs by
name, together with a few common aliases, and the languages for which the
encoding is likely used. Neither the list of aliases nor the list of languages
is meant to be exhaustive. Notice that spelling alternatives that only differ in
case or use a hyphen instead of an underscore are also valid aliases; therefore,
e.g. ``'utf-8'`` is a valid alias for the ``'utf_8'`` codec.
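
A minimal sketch of this normalization, using :func:`codecs.lookup` to resolve
several spellings of the same codec. The particular spellings are only
examples.

```python
import codecs

spellings = ["utf-8", "UTF-8", "utf_8", "Utf_8"]

# Every spelling resolves to the same codec, so only one name remains.
canonical = {codecs.lookup(name).name for name in spellings}
print(canonical)

# The spelling has no effect on the encoded result.
assert "abc".encode("UTF-8") == "abc".encode("utf_8")
```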

.. impl-detail::

   Some common encodings can bypass the codecs lookup machinery to
   improve performance. These optimization opportunities are only
   recognized by CPython for a limited set of (case insensitive)
   aliases: utf-8, utf8, latin-1, latin1, iso-8859-1, iso8859-1, mbcs
   (Windows only), ascii, us-ascii, utf-16, utf16, utf-32, utf32, and
   the same using underscores instead of dashes. Using alternative
   aliases for these encodings may result in slower execution.

   .. versionchanged:: 3.6
      Optimization opportunity recognized for us-ascii.

Many of the character sets support the same languages. They vary in individual
characters (e.g. whether the EURO SIGN is supported or not), and in the
assignment of characters to code positions. For the European languages in
particular, the following variants typically exist:

* an ISO 8859 codeset

* a Microsoft Windows code page, which is typically derived from an 8859 codeset,
  but replaces control characters with additional graphic characters

* an IBM EBCDIC code page

* an IBM PC code page, which is ASCII compatible
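
These differences can be observed directly. The following sketch encodes the
EURO SIGN and the letter ``A`` with one or two codecs of each kind: ``latin_1``
and ``iso8859_15`` (ISO 8859), ``cp1252`` (Windows), ``cp437`` (IBM PC) and
``cp037`` (EBCDIC). The helper ``try_encode`` is a hypothetical name used only
for this example.

```python
euro = "\u20ac"  # EURO SIGN

def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent *text*."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

for encoding in ("latin_1", "iso8859_15", "cp1252", "cp437", "cp037"):
    print(encoding, try_encode(euro, encoding), try_encode("A", encoding))
```

The codecs that support the EURO SIGN assign it to different code positions,
and the EBCDIC code page is the only one of the five that does not encode ``A``
as the ASCII byte ``0x41``.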

.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|

+-----------------+--------------------------------+--------------------------------+
| Codec           | Aliases                        | Languages                      |
+=================+================================+================================+
| ascii           | 646, us-ascii                  | English                        |
+-----------------+--------------------------------+--------------------------------+
| big5            | big5-tw, csbig5                | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| big5hkscs       | big5-hkscs, hkscs              | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp037           | IBM037, IBM039                 | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp273           | 273, IBM273, csIBM273          | German                         |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.4          |
+-----------------+--------------------------------+--------------------------------+
| cp424           | EBCDIC-CP-HE, IBM424           | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp437           | 437, IBM437                    | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp500           | EBCDIC-CP-BE, EBCDIC-CP-CH,    | Western Europe                 |
|                 | IBM500                         |                                |
+-----------------+--------------------------------+--------------------------------+
| cp720           |                                | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp737           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp775           | IBM775                         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp850           | 850, IBM850                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp852           | 852, IBM852                    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp855           | 855, IBM855                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp856           |                                | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp857           | 857, IBM857                    | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp858           | 858, IBM858                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp860           | 860, IBM860                    | Portuguese                     |
+-----------------+--------------------------------+--------------------------------+
| cp861           | 861, CP-IS, IBM861             | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| cp862           | 862, IBM862                    | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp863           | 863, IBM863                    | Canadian                       |
+-----------------+--------------------------------+--------------------------------+
| cp864           | IBM864                         | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp865           | 865, IBM865                    | Danish, Norwegian              |
+-----------------+--------------------------------+--------------------------------+
| cp866           | 866, IBM866                    | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| cp869           | 869, CP-GR, IBM869             | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp874           |                                | Thai                           |
+-----------------+--------------------------------+--------------------------------+
| cp875           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp932           | 932, ms932, mskanji, ms-kanji  | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| cp949           | 949, ms949, uhc                | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| cp950           | 950, ms950                     | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp1006          |                                | Urdu                           |
+-----------------+--------------------------------+--------------------------------+
| cp1026          | ibm1026                        | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1125          | 1125, ibm1125, cp866u, ruscii  | Ukrainian                      |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.4          |
+-----------------+--------------------------------+--------------------------------+
| cp1140          | ibm1140                        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1250          | windows-1250                   | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp1251          | windows-1251                   | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp1252          | windows-1252                   | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1253          | windows-1253                   | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp1254          | windows-1254                   | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1255          | windows-1255                   | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp1256          | windows-1256                   | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp1257          | windows-1257                   | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp1258          | windows-1258                   | Vietnamese                     |
+-----------------+--------------------------------+--------------------------------+
| euc_jp          | eucjp, ujis, u-jis             | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jis_2004    | jisx0213, eucjis2004           | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jisx0213    | eucjisx0213                    | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_kr          | euckr, korean, ksc5601,        | Korean                         |
|                 | ks_c-5601, ks_c-5601-1987,     |                                |
|                 | ksx1001, ks_x-1001             |                                |
+-----------------+--------------------------------+--------------------------------+
| gb2312          | chinese, csiso58gb231280,      | Simplified Chinese             |
|                 | euc-cn, euccn, eucgb2312-cn,   |                                |
|                 | gb2312-1980, gb2312-80,        |                                |
|                 | iso-ir-58                      |                                |
+-----------------+--------------------------------+--------------------------------+
| gbk             | 936, cp936, ms936              | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| gb18030         | gb18030-2000                   | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| hz              | hzgb, hz-gb, hz-gb-2312        | Simplified Chinese             |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp      | csiso2022jp, iso2022jp,        | Japanese                       |
|                 | iso-2022-jp                    |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_1    | iso2022jp-1, iso-2022-jp-1     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2    | iso2022jp-2, iso-2022-jp-2     | Japanese, Korean, Simplified   |
|                 |                                | Chinese, Western Europe, Greek |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2004 | iso2022jp-2004,                | Japanese                       |
|                 | iso-2022-jp-2004               |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_3    | iso2022jp-3, iso-2022-jp-3     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_ext  | iso2022jp-ext, iso-2022-jp-ext | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_kr      | csiso2022kr, iso2022kr,        | Korean                         |
|                 | iso-2022-kr                    |                                |
+-----------------+--------------------------------+--------------------------------+
| latin_1         | iso-8859-1, iso8859-1, 8859,   | Western Europe                 |
|                 | cp819, latin, latin1, L1       |                                |
+-----------------+--------------------------------+--------------------------------+
| iso8859_2       | iso-8859-2, latin2, L2         | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| iso8859_3       | iso-8859-3, latin3, L3         | Esperanto, Maltese             |
+-----------------+--------------------------------+--------------------------------+
| iso8859_4       | iso-8859-4, latin4, L4         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_5       | iso-8859-5, cyrillic           | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| iso8859_6       | iso-8859-6, arabic             | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_7       | iso-8859-7, greek, greek8      | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| iso8859_8       | iso-8859-8, hebrew             | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_9       | iso-8859-9, latin5, L5         | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| iso8859_10      | iso-8859-10, latin6, L6        | Nordic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_11      | iso-8859-11, thai              | Thai languages                 |
+-----------------+--------------------------------+--------------------------------+
| iso8859_13      | iso-8859-13, latin7, L7        | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_14      | iso-8859-14, latin8, L8        | Celtic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_15      | iso-8859-15, latin9, L9        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| iso8859_16      | iso-8859-16, latin10, L10      | South-Eastern Europe           |
+-----------------+--------------------------------+--------------------------------+
| johab           | cp1361, ms1361                 | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| koi8_r          |                                | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| koi8_t          |                                | Tajik                          |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.5          |
+-----------------+--------------------------------+--------------------------------+
| koi8_u          |                                | Ukrainian                      |
+-----------------+--------------------------------+--------------------------------+
| kz1048          | kz_1048, strk1048_2002, rk1048 | Kazakh                         |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.5          |
+-----------------+--------------------------------+--------------------------------+
| mac_cyrillic    | maccyrillic                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| mac_greek       | macgreek                       | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| mac_iceland     | maciceland                     | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| mac_latin2      | maclatin2, maccentraleurope,   | Central and Eastern Europe     |
|                 | mac_centeuro                   |                                |
+-----------------+--------------------------------+--------------------------------+
| mac_roman       | macroman, macintosh            | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| mac_turkish     | macturkish                     | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| ptcp154         | csptcp154, pt154, cp154,       | Kazakh                         |
|                 | cyrillic-asian                 |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis       | csshiftjis, shiftjis, sjis,    | Japanese                       |
|                 | s_jis                          |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis_2004  | shiftjis2004, sjis_2004,       | Japanese                       |
|                 | sjis2004                       |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jisx0213  | shiftjisx0213, sjisx0213,      | Japanese                       |
|                 | s_jisx0213                     |                                |
+-----------------+--------------------------------+--------------------------------+
| utf_32          | U32, utf32                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_be       | UTF-32BE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_le       | UTF-32LE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16          | U16, utf16                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16_be       | UTF-16BE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16_le       | UTF-16LE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_7           | U7, unicode-1-1-utf-7          | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8           | U8, UTF, utf8, cp65001         | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8_sig       |                                | all languages                  |
+-----------------+--------------------------------+--------------------------------+

.. versionchanged:: 3.4
   The utf-16\* and utf-32\* encoders no longer allow surrogate code points
   (``U+D800``--``U+DFFF``) to be encoded.
   The utf-32\* decoders no longer decode
   byte sequences that correspond to surrogate code points.

.. versionchanged:: 3.8
   ``cp65001`` is now an alias to ``utf_8``.
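
The surrogate handling described in the 3.4 note can be checked with a short
sketch. It assumes Python 3.4 or later; the helper name ``encodes_strictly`` is
invented for this example.

```python
lone_surrogate = "\ud800"

def encodes_strictly(text, encoding):
    """Return True if *text* can be encoded with the default 'strict' handler."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

for encoding in ("utf-16", "utf-16-le", "utf-32", "utf-32-be"):
    print(encoding, encodes_strictly(lone_surrogate, encoding))

# The 'surrogatepass' error handler opts back in to encoding surrogates.
raw = lone_surrogate.encode("utf-16-le", "surrogatepass")

# A UTF-32 byte sequence that corresponds to a surrogate is rejected on decoding.
try:
    b"\x00\xd8\x00\x00".decode("utf-32-le")
except UnicodeDecodeError:
    decoded = False
else:
    decoded = True
```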

Python Specific Encodings
-------------------------

A number of predefined codecs are specific to Python, so their codec names have
no meaning outside Python. These are listed in the tables below based on the
expected input and output types (note that while text encodings are the most
common use case for codecs, the underlying codec infrastructure supports
arbitrary data transforms rather than just text encodings). For asymmetric
codecs, the stated meaning describes the encoding direction.

Text Encodings
^^^^^^^^^^^^^^

The following codecs provide :class:`str` to :class:`bytes` encoding and
:term:`bytes-like object` to :class:`str` decoding, similar to the Unicode text
encodings.

.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|

+--------------------+---------+---------------------------+
| Codec              | Aliases | Meaning                   |
+====================+=========+===========================+
| idna               |         | Implement :rfc:`3490`,    |
|                    |         | see also                  |
|                    |         | :mod:`encodings.idna`.    |
|                    |         | Only ``errors='strict'``  |
|                    |         | is supported.             |
+--------------------+---------+---------------------------+
| mbcs               | ansi,   | Windows only: Encode the  |
|                    | dbcs    | operand according to the  |
|                    |         | ANSI codepage (CP_ACP).   |
+--------------------+---------+---------------------------+
| oem                |         | Windows only: Encode the  |
|                    |         | operand according to the  |
|                    |         | OEM codepage (CP_OEMCP).  |
|                    |         |                           |
|                    |         | .. versionadded:: 3.6     |
+--------------------+---------+---------------------------+
| palmos             |         | Encoding of PalmOS 3.5.   |
+--------------------+---------+---------------------------+
| punycode           |         | Implement :rfc:`3492`.    |
|                    |         | Stateful codecs are not   |
|                    |         | supported.                |
+--------------------+---------+---------------------------+
| raw_unicode_escape |         | Latin-1 encoding with     |
|                    |         | ``\uXXXX`` and            |
|                    |         | ``\UXXXXXXXX`` for other  |
|                    |         | code points. Existing     |
|                    |         | backslashes are not       |
|                    |         | escaped in any way.       |
|                    |         | It is used in the Python  |
|                    |         | pickle protocol.          |
+--------------------+---------+---------------------------+
| undefined          |         | Raise an exception for    |
|                    |         | all conversions, even     |
|                    |         | empty strings. The error  |
|                    |         | handler is ignored.       |
+--------------------+---------+---------------------------+
| unicode_escape     |         | Encoding suitable as the  |
|                    |         | contents of a Unicode     |
|                    |         | literal in ASCII-encoded  |
|                    |         | Python source code,       |
|                    |         | except that quotes are    |
|                    |         | not escaped. Decode       |
|                    |         | from Latin-1 source code. |
|                    |         | Beware that Python source |
|                    |         | code actually uses UTF-8  |
|                    |         | by default.               |
+--------------------+---------+---------------------------+
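
A few of these codecs can be compared in a short sketch. The sample strings are
arbitrary, and the example only touches the platform-independent codecs from
the table.

```python
text = "caf\u00e9 \u20ac"

# unicode_escape yields ASCII-only bytes made of backslash escapes.
escaped = text.encode("unicode_escape")
# raw_unicode_escape keeps Latin-1 characters as single bytes and only
# escapes code points above U+00FF.
raw = text.encode("raw_unicode_escape")

print(escaped)
print(raw)

# Both codecs round-trip through the matching decoder.
assert escaped.decode("unicode_escape") == text
assert raw.decode("raw_unicode_escape") == text

# punycode encodes a single label without the IDNA "xn--" prefix.
label = "b\u00fccher".encode("punycode")

# The undefined codec refuses every conversion, even an empty string.
try:
    "".encode("undefined")
except UnicodeError:
    refused = True
else:
    refused = False
```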

.. versionchanged:: 3.8
   "unicode_internal" codec is removed.


.. _binary-transforms:

Binary Transforms
^^^^^^^^^^^^^^^^^

The following codecs provide binary transforms: :term:`bytes-like object`
to :class:`bytes` mappings. They are not supported by :meth:`bytes.decode`
(which only produces :class:`str` output).

.. tabularcolumns:: |l|L|L|L|

+----------------------+------------------+------------------------------+------------------------------+
+======================+==================+==============================+==============================+
|                      |                  |                              |                              |
+----------------------+------------------+------------------------------+------------------------------+
|                      |                  | bz2.                         | :meth:`bz2.decompress`       |
+----------------------+------------------+------------------------------+------------------------------+
+----------------------+------------------+------------------------------+------------------------------+
+----------------------+------------------+------------------------------+------------------------------+
+----------------------+------------------+------------------------------+------------------------------+
+----------------------+------------------+------------------------------+------------------------------+

Nick Coghlan

fdf239a

2013-10-03 00:43:22 +1000

[diff] [blame]

1371

.. [#b64] In addition to :term:`bytes-like objects <bytes-like object>`,

1372

``'base64_codec'`` also accepts ASCII-only instances of :class:`str` for

1373

decoding

Nick Coghlan

2013-05-23 20:24:02 +1000

[diff] [blame]

1374

Nick Coghlan

2013-11-23 11:13:36 +1000

[diff] [blame]

1375

.. versionadded:: 3.2

1376

Restoration of the binary transforms.

Nick Coghlan

2013-05-23 20:24:02 +1000

[diff] [blame]

1377

Nick Coghlan

2013-11-23 11:13:36 +1000

[diff] [blame]

1378

.. versionchanged:: 3.4

1379

Restoration of the aliases for the binary transforms.
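
Because these codecs map bytes to bytes, they are used through
:func:`codecs.encode` and :func:`codecs.decode` rather than through the
:class:`bytes` methods. The sketch below assumes the codec names ``hex_codec``,
``base64_codec`` and ``zlib_codec`` as shipped with CPython; the payload is
arbitrary.

```python
import codecs

payload = b"hello world"

hex_bytes = codecs.encode(payload, "hex_codec")
b64_bytes = codecs.encode(payload, "base64_codec")
packed = codecs.encode(payload * 100, "zlib_codec")

# Each transform is reversed with codecs.decode().
assert codecs.decode(hex_bytes, "hex_codec") == payload
assert codecs.decode(b64_bytes, "base64_codec") == payload
assert codecs.decode(packed, "zlib_codec") == payload * 100

# bytes.decode() only accepts text encodings, so it rejects these codecs.
try:
    hex_bytes.decode("hex_codec")
except LookupError:
    rejected = True
else:
    rejected = False
```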

.. _text-transforms:

Text Transforms
^^^^^^^^^^^^^^^

The following codec provides a text transform: a :class:`str` to :class:`str`
mapping. It is not supported by :meth:`str.encode` (which only produces
:class:`bytes` output).

.. tabularcolumns:: |l|l|L|

+--------------------+---------+---------------------------+
| Codec              | Aliases | Meaning                   |
+====================+=========+===========================+
| rot_13             | rot13   | Return the Caesar-cypher  |
|                    |         | encryption of the         |
|                    |         | operand.                  |
+--------------------+---------+---------------------------+

.. versionadded:: 3.2
   Restoration of the ``rot_13`` text transform.

.. versionchanged:: 3.4
   Restoration of the ``rot13`` alias.
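
As a minimal sketch of how this transform might be used (the sample string and
variable names are illustrative and do not come from this section), the codec
can be reached through :func:`codecs.encode` and :func:`codecs.decode`, while
:meth:`str.encode` refuses it:

.. code-block:: python

   import codecs

   plain = "Hello, World!"

   # str in, str out: the transform goes through the codecs module.
   scrambled = codecs.encode(plain, "rot_13")
   # The ``rot13`` alias names the same codec.
   restored = codecs.decode(scrambled, "rot13")

   print(scrambled)  # Uryyb, Jbeyq!
   print(restored)   # Hello, World!

   # str.encode() only accepts text encodings, so this lookup fails.
   try:
       plain.encode("rot_13")
   except LookupError as exc:
       print(exc)

Because rotating by 13 places twice restores the input, encoding and decoding
give the same result for this codec.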


:mod:`encodings.idna` --- Internationalized Domain Names in Applications
------------------------------------------------------------------------

.. module:: encodings.idna
   :synopsis: Internationalized Domain Names implementation

.. moduleauthor:: Martin v. Löwis

This module implements :rfc:`3490` (Internationalized Domain Names in
Applications) and :rfc:`3491` (Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)). It builds upon the ``punycode`` encoding
and :mod:`stringprep`.

These RFCs together define a protocol to support non-ASCII characters in domain
names. A domain name containing non-ASCII characters (such as
``www.Alliancefrançaise.nu``) is converted into an ASCII-compatible encoding
(ACE, such as ``www.xn--alliancefranaise-npb.nu``). The ACE form of the domain
name is then used in all places where arbitrary characters are not allowed by
the protocol, such as DNS queries, HTTP :mailheader:`Host` fields, and so
on. This conversion is carried out in the application, if possible invisibly to
the user: the application should transparently convert Unicode domain labels to
IDNA on the wire, and convert ACE labels back to Unicode before presenting them
to the user.

Python supports this conversion in several ways. The ``idna`` codec performs
conversion between Unicode and ACE, separating an input string into labels
based on the separator characters defined in :rfc:`section 3.1 of RFC 3490 <3490#section-3.1>`
and converting each label to ACE as required, and conversely separating an input
byte string into labels based on the ``.`` separator and converting any ACE
labels found into Unicode. Furthermore, the :mod:`socket` module
transparently converts Unicode host names to ACE, so that applications need not
be concerned about converting host names themselves when they pass them to the
socket module. On top of that, modules that have host names as function
parameters, such as :mod:`http.client` and :mod:`ftplib`, accept Unicode host
names (:mod:`http.client` then also transparently sends an IDNA hostname in the
:mailheader:`Host` field if it sends that field at all).

When receiving host names from the wire (such as in reverse name lookup), no
automatic conversion to Unicode is performed: applications wishing to present
such host names to the user should decode them to Unicode.
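
The two directions described above can be sketched with the ``idna`` codec as
follows. This is only an illustration that reuses the example domain from this
section; the variable names are arbitrary:

.. code-block:: python

   # Unicode host name -> ACE bytes, the form used on the wire.
   host = "www.Alliancefrançaise.nu"
   ace = host.encode("idna")
   print(ace)    # b'www.xn--alliancefranaise-npb.nu'

   # ACE bytes received from the wire -> Unicode for display.
   # Nameprep lowercased the non-ASCII label during encoding, so the
   # original capitalization is not recovered.
   shown = ace.decode("idna")
   print(shown)  # www.alliancefrançaise.nu

Labels that are already pure ASCII, such as ``www`` and ``nu`` here, pass
through unchanged in both directions.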

The module :mod:`encodings.idna` also implements the nameprep procedure, which
performs certain normalizations on host names, to achieve case-insensitivity of
international domain names, and to unify similar characters. The nameprep
functions can be used directly if desired.


.. function:: nameprep(label)

   Return the nameprepped version of *label*. The implementation currently assumes
   query strings, so ``AllowUnassigned`` is true.


.. function:: ToASCII(label)

   Convert a label to ASCII, as specified in :rfc:`3490`. ``UseSTD3ASCIIRules`` is
   assumed to be false.


.. function:: ToUnicode(label)

   Convert a label to Unicode, as specified in :rfc:`3490`.
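
A short sketch of calling these functions directly is shown below. It assumes a
single label (not a full dotted host name) taken from the example domain
earlier in this section; the variable names are illustrative:

.. code-block:: python

   from encodings.idna import ToASCII, ToUnicode, nameprep

   label = "Alliancefrançaise"

   # Case mapping and normalization only; the result is still a str.
   prepped = nameprep(label)

   # Nameprep plus Punycode; the result is bytes with the "xn--" prefix.
   ace_label = ToASCII(label)

   # Back to Unicode; the nameprepped (lowercase) form is returned.
   round_trip = ToUnicode(ace_label)

   print(prepped)     # alliancefrançaise
   print(ace_label)   # b'xn--alliancefranaise-npb'
   print(round_trip)  # alliancefrançaise

Splitting a host name into labels and rejoining them is left to the caller
here; the ``idna`` codec performs that step itself.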


:mod:`encodings.mbcs` --- Windows ANSI codepage
-----------------------------------------------

.. module:: encodings.mbcs
   :synopsis: Windows ANSI codepage

This module implements the ANSI codepage (CP_ACP).

.. availability:: Windows only.

.. versionchanged:: 3.3
   Support any error handler.

.. versionchanged:: 3.2
   Before 3.2, the *errors* argument was ignored; ``'replace'`` was always used
   to encode, and ``'ignore'`` to decode.


:mod:`encodings.utf_8_sig` --- UTF-8 codec with BOM signature
-------------------------------------------------------------

.. module:: encodings.utf_8_sig
   :synopsis: UTF-8 codec with BOM signature
.. moduleauthor:: Walter Dörwald

This module implements a variant of the UTF-8 codec. On encoding, a UTF-8 encoded
BOM will be prepended to the UTF-8 encoded bytes. For the stateful encoder this
is only done once (on the first write to the byte stream). On decoding, an
