:mod:`codecs` --- Codec registry and base classes
=================================================

.. module:: codecs
   :synopsis: Encode and decode data and streams.

.. moduleauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>

**Source code:** :source:`Lib/codecs.py`

.. index::
   single: Unicode
   single: Codecs
   pair: Codecs; encode
   pair: Codecs; decode
   single: streams
   pair: stackable; streams

--------------

This module defines base classes for standard Python codecs (encoders and
decoders) and provides access to the internal Python codec registry, which
manages the codec and error handling lookup process. Most standard codecs
are :term:`text encodings <text encoding>`, which encode text to bytes,
but there are also codecs provided that encode text to text, and bytes to
bytes. Custom codecs may encode and decode between arbitrary types, but some
module features are restricted to use specifically with
:term:`text encodings <text encoding>`, or with codecs that encode to
:class:`bytes`.

The module defines the following functions for encoding and decoding with
any codec:

.. function:: encode(obj, encoding='utf-8', errors='strict')

   Encodes *obj* using the codec registered for *encoding*.

   *errors* may be given to set the desired error handling scheme. The
   default error handler is ``'strict'``, meaning that encoding errors raise
   :exc:`ValueError` (or a more codec-specific subclass, such as
   :exc:`UnicodeEncodeError`). Refer to :ref:`codec-base-classes` for more
   information on codec error handling.

.. function:: decode(obj, encoding='utf-8', errors='strict')

   Decodes *obj* using the codec registered for *encoding*.

   *errors* may be given to set the desired error handling scheme. The
   default error handler is ``'strict'``, meaning that decoding errors raise
   :exc:`ValueError` (or a more codec-specific subclass, such as
   :exc:`UnicodeDecodeError`). Refer to :ref:`codec-base-classes` for more
   information on codec error handling.
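
As a rough illustration of :func:`encode` and :func:`decode`, the sketch below
round-trips a string, shows what the ``'strict'`` default does with an
unencodable character, and uses a bytes-to-bytes codec. The sample text and
variable names are chosen only for this example.

```python
import codecs

text = "naïve café"

# Text encoding: str -> bytes, and back again.
data = codecs.encode(text, "utf-8")
round_trip = codecs.decode(data, "utf-8")

# With the default 'strict' handler an unencodable character raises
# UnicodeEncodeError, which is a subclass of ValueError.
strict_error = None
try:
    codecs.encode(text, "ascii")
except ValueError as exc:
    strict_error = type(exc).__name__

# A different *errors* value changes the result instead of raising.
replaced = codecs.encode(text, "ascii", errors="replace")

# Bytes-to-bytes codecs are reachable through the same functions.
hexed = codecs.encode(b"\x00\xff", "hex")
unhexed = codecs.decode(hexed, "hex")
```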

The full details for each codec can also be looked up directly:

.. function:: lookup(encoding)

   Looks up the codec info in the Python codec registry and returns a
   :class:`CodecInfo` object as defined below.

   Encodings are first looked up in the registry's cache. If not found, the list of
   registered search functions is scanned. If no :class:`CodecInfo` object is
   found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
   is stored in the cache and returned to the caller.
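
A minimal sketch of :func:`lookup` in use follows. The misspelled codec name is
deliberately made up so that the :exc:`LookupError` path is exercised.

```python
import codecs

# Aliases and letter case are normalized before the registry is searched.
info = codecs.lookup("UTF8")
canonical_name = info.name

# The stateless encoder returns (output object, length consumed).
payload, consumed = info.encode("héllo")

# An unknown name raises LookupError once every search function has declined.
missing = False
try:
    codecs.lookup("no-such-codec-for-this-sketch")
except LookupError:
    missing = True
```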

.. class:: CodecInfo(encode, decode, streamreader=None, streamwriter=None, incrementalencoder=None, incrementaldecoder=None, name=None)

   Codec details when looking up the codec registry. The constructor
   arguments are stored in attributes of the same name:

   .. attribute:: name

      The name of the encoding.

   .. attribute:: encode
                  decode

      The stateless encoding and decoding functions. These must be
      functions or methods which have the same interface as
      the :meth:`~Codec.encode` and :meth:`~Codec.decode` methods of Codec
      instances (see :ref:`Codec Interface <codec-objects>`).
      The functions or methods are expected to work in a stateless mode.

   .. attribute:: incrementalencoder
                  incrementaldecoder

      Incremental encoder and decoder classes or factory functions.
      These have to provide the interface defined by the base classes
      :class:`IncrementalEncoder` and :class:`IncrementalDecoder`,
      respectively. Incremental codecs can maintain state.

   .. attribute:: streamwriter
                  streamreader

      Stream writer and reader classes or factory functions. These have to
      provide the interface defined by the base classes
      :class:`StreamWriter` and :class:`StreamReader`, respectively.
      Stream codecs can maintain state.

To simplify access to the various codec components, the module provides
these additional functions which use :func:`lookup` for the codec lookup:

.. function:: getencoder(encoding)

   Look up the codec for the given encoding and return its encoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.

.. function:: getdecoder(encoding)

   Look up the codec for the given encoding and return its decoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.

.. function:: getincrementalencoder(encoding)

   Look up the codec for the given encoding and return its incremental encoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental encoder.

.. function:: getincrementaldecoder(encoding)

   Look up the codec for the given encoding and return its incremental decoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental decoder.

.. function:: getreader(encoding)

   Look up the codec for the given encoding and return its :class:`StreamReader`
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.

.. function:: getwriter(encoding)

   Look up the codec for the given encoding and return its :class:`StreamWriter`
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.
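
The accessor functions above can be sketched as follows. The in-memory
:class:`io.BytesIO` streams stand in for real binary files and are an
assumption of this example only.

```python
import codecs
import io

# Stateless functions: each call returns (output object, length consumed).
encoder = codecs.getencoder("latin-1")
decoder = codecs.getdecoder("latin-1")
encoded, enc_len = encoder("é")
decoded, dec_len = decoder(encoded)

# Stream classes wrap a binary file-like object.
reader = codecs.getreader("utf-8")(io.BytesIO("grüß\n".encode("utf-8")))
line = reader.readline()

sink = io.BytesIO()
writer = codecs.getwriter("utf-8")(sink)
writer.write("ß")
written = sink.getvalue()
```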

Custom codecs are made available by registering a suitable codec search
function:

.. function:: register(search_function)

   Register a codec search function. Search functions are expected to take one
   argument, being the encoding name in all lower case letters, and return a
   :class:`CodecInfo` object. In case a search function cannot find
   a given encoding, it should return ``None``.

   .. note::

      Search function registration is not currently reversible,
      which may cause problems in some cases, such as unit testing or
      module reloading.
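
A minimal sketch of a search function follows. The codec name ``demo_upper``
and the function ``demo_search`` are hypothetical names invented for this
example; the codec simply upper-cases text before delegating to UTF-8. Because
of the note above, the registration stays in effect for the rest of the process.

```python
import codecs

def demo_search(name):
    """Hypothetical search function that knows a single codec, "demo_upper"."""
    if name != "demo_upper":
        return None  # decline, so other search functions get a chance

    utf8 = codecs.lookup("utf-8")

    def encode(text, errors="strict"):
        # Upper-case the text, then delegate to the UTF-8 encoder, which
        # returns the usual (output object, length consumed) tuple.
        return utf8.encode(text.upper(), errors)

    return codecs.CodecInfo(encode=encode, decode=utf8.decode, name="demo_upper")

codecs.register(demo_search)

encoded = codecs.encode("abc", "demo_upper")
info = codecs.lookup("DEMO_UPPER")  # the name is lower-cased before the search
```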

While the builtin :func:`open` and the associated :mod:`io` module are the
recommended approach for working with encoded text files, this module
provides additional utility functions and classes that allow the use of a
wider range of codecs when working with binary files:

.. function:: open(filename, mode='r', encoding=None, errors='strict', buffering=1)

   Open an encoded file using the given *mode* and return an instance of
   :class:`StreamReaderWriter`, providing transparent encoding/decoding.
   The default file mode is ``'r'``, meaning to open the file in read mode.

   .. note::

      Underlying encoded files are always opened in binary mode.
      No automatic conversion of ``'\n'`` is done on reading and writing.
      The *mode* argument may be any binary mode acceptable to the built-in
      :func:`open` function; the ``'b'`` is automatically added.

   *encoding* specifies the encoding which is to be used for the file.
   Any encoding that encodes to and decodes from bytes is allowed, and
   the data types supported by the file methods depend on the codec used.

   *errors* may be given to define the error handling. It defaults to ``'strict'``,
   which causes a :exc:`ValueError` to be raised in case an encoding error occurs.

   *buffering* has the same meaning as for the built-in :func:`open` function. It
   defaults to line buffering.
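
As a small sketch of :func:`codecs.open`, the example below writes and re-reads
a UTF-16 file in a temporary directory. The file name ``sample.txt`` is made up
for this illustration; for ordinary text files the builtin :func:`open` remains
the recommended choice, as stated above.

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name

    # The file is opened in binary mode underneath; the codec does the rest.
    with codecs.open(path, "w", encoding="utf-16") as f:
        f.write("line one\nline two\n")

    with codecs.open(path, "r", encoding="utf-16") as f:
        text = f.read()

    # Inspect the raw bytes: the utf-16 codec wrote a byte order mark first.
    with open(path, "rb") as raw:
        head = raw.read(2)
```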

.. function:: EncodedFile(file, data_encoding, file_encoding=None, errors='strict')

   Return a :class:`StreamRecoder` instance, a wrapped version of *file*
   which provides transparent transcoding. The original file is closed
   when the wrapped version is closed.

   Data written to the wrapped file is decoded according to the given
   *data_encoding* and then written to the original file as bytes using
   *file_encoding*. Bytes read from the original file are decoded
   according to *file_encoding*, and the result is encoded
   using *data_encoding*.

   If *file_encoding* is not given, it defaults to *data_encoding*.

   *errors* may be given to define the error handling. It defaults to
   ``'strict'``, which causes :exc:`ValueError` to be raised in case an encoding
   error occurs.
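
The transcoding direction of :func:`EncodedFile` is easy to get backwards, so
here is a hedged sketch using in-memory :class:`io.BytesIO` objects in place of
real files (an assumption of this example). The caller works with UTF-8 bytes
while the underlying "file" holds Latin-1 bytes.

```python
import codecs
import io

# Writing: UTF-8 bytes go in, Latin-1 bytes land in the underlying file.
backing = io.BytesIO()
wrapped = codecs.EncodedFile(backing, data_encoding="utf-8", file_encoding="latin-1")
wrapped.write("café".encode("utf-8"))
stored = backing.getvalue()

# Reading: Latin-1 bytes in the file come back as UTF-8 bytes.
source = io.BytesIO(b"caf\xe9")
recoder = codecs.EncodedFile(source, data_encoding="utf-8", file_encoding="latin-1")
received = recoder.read()
```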

.. function:: iterencode(iterator, encoding, errors='strict', **kwargs)

   Uses an incremental encoder to iteratively encode the input provided by
   *iterator*. This function is a :term:`generator`.
   The *errors* argument (as well as any
   other keyword argument) is passed through to the incremental encoder.

   This function requires that the codec accept text :class:`str` objects
   to encode. Therefore it does not support bytes-to-bytes encoders such as
   ``base64_codec``.

.. function:: iterdecode(iterator, encoding, errors='strict', **kwargs)

   Uses an incremental decoder to iteratively decode the input provided by
   *iterator*. This function is a :term:`generator`.
   The *errors* argument (as well as any
   other keyword argument) is passed through to the incremental decoder.

   This function requires that the codec accept :class:`bytes` objects
   to decode. Therefore it does not support text-to-text codecs such as
   ``rot_13``, although ``rot_13`` may be used equivalently with
   :func:`iterencode`.
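
The generators above can be sketched as follows. The chunk boundaries are
arbitrary; the second half deliberately splits a multi-byte UTF-8 sequence
across two chunks to show that the incremental decoder carries state between
them.

```python
import codecs

# Encode an iterable of text chunks lazily.
text_chunks = ["Grü", "ße", "!"]
encoded = b"".join(codecs.iterencode(text_chunks, "utf-8"))

# Decode byte chunks whose boundary falls inside the two-byte sequence for "ü".
data = "Grüße!".encode("utf-8")
byte_chunks = [data[:3], data[3:]]
decoded = "".join(codecs.iterdecode(byte_chunks, "utf-8"))

# rot_13 is text-to-text, so it works with iterencode (not iterdecode).
rotated = "".join(codecs.iterencode(["Hello"], "rot_13"))
```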

The module also provides the following constants which are useful for reading
and writing to platform dependent files:

.. data:: BOM
          BOM_BE
          BOM_LE
          BOM_UTF8
          BOM_UTF16
          BOM_UTF16_BE
          BOM_UTF16_LE
          BOM_UTF32
          BOM_UTF32_BE
          BOM_UTF32_LE

   These constants define various byte sequences,
   being Unicode byte order marks (BOMs) for several encodings. They are
   used in UTF-16 and UTF-32 data streams to indicate the byte order used,
   and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either
   :const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's
   native byte order, :const:`BOM` is an alias for :const:`BOM_UTF16`,
   :const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for
   :const:`BOM_UTF16_BE`. The others represent the BOM in UTF-8 and UTF-32
   encodings.
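
One way the BOM constants might be used is sketched below. The helper
``strip_utf8_signature`` is a hypothetical function written for this example,
not part of the module.

```python
import codecs
import sys

# BOM_UTF16 follows the platform's native byte order.
native_bom = codecs.BOM_UTF16_LE if sys.byteorder == "little" else codecs.BOM_UTF16_BE

def strip_utf8_signature(data):
    """Hypothetical helper: drop a leading UTF-8 signature if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

payload = strip_utf8_signature(codecs.BOM_UTF8 + b"hello")
untouched = strip_utf8_signature(b"hello")

# The generic utf-16 codec prepends the native BOM when encoding.
starts_with_bom = "hi".encode("utf-16").startswith(codecs.BOM_UTF16)
```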

.. _codec-base-classes:

Codec Base Classes
------------------

The :mod:`codecs` module defines a set of base classes which define the
interfaces for working with codec objects, and can also be used as the basis
for custom codec implementations.

Each codec has to define four interfaces to make it usable as a codec in Python:
stateless encoder, stateless decoder, stream reader and stream writer. The
stream reader and writer typically reuse the stateless encoder/decoder to
implement the file protocols. Codec authors also need to define how the
codec will handle encoding and decoding errors.

.. _surrogateescape:
.. _error-handlers:

Error Handlers
^^^^^^^^^^^^^^

To simplify and standardize error handling,
codecs may implement different error handling schemes by
accepting the *errors* string argument. The following string values are
defined and implemented by all standard Python codecs:

.. tabularcolumns:: |l|L|

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'strict'``            | Raise :exc:`UnicodeError` (or a subclass);    |
|                         | this is the default. Implemented in           |
|                         | :func:`strict_errors`.                        |
+-------------------------+-----------------------------------------------+
| ``'ignore'``            | Ignore the malformed data and continue        |
|                         | without further notice. Implemented in        |
|                         | :func:`ignore_errors`.                        |
+-------------------------+-----------------------------------------------+

The following error handlers are only applicable to
:term:`text encodings <text encoding>`:

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'replace'``           | Replace with a suitable replacement           |
|                         | marker; Python will use the official          |
|                         | ``U+FFFD`` REPLACEMENT CHARACTER for the      |
|                         | built-in codecs on decoding, and '?' on       |
|                         | encoding. Implemented in                      |
|                         | :func:`replace_errors`.                       |
+-------------------------+-----------------------------------------------+
| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character    |
|                         | reference (only for encoding). Implemented    |
|                         | in :func:`xmlcharrefreplace_errors`.          |
+-------------------------+-----------------------------------------------+
| ``'backslashreplace'``  | Replace with backslashed escape sequences.    |
|                         | Implemented in                                |
|                         | :func:`backslashreplace_errors`.              |
+-------------------------+-----------------------------------------------+
| ``'namereplace'``       | Replace with ``\N{...}`` escape sequences     |
|                         | (only for encoding). Implemented in           |
|                         | :func:`namereplace_errors`.                   |
+-------------------------+-----------------------------------------------+
| ``'surrogateescape'``   | On decoding, replace byte with individual     |
|                         | surrogate code ranging from ``U+DC80`` to     |
|                         | ``U+DCFF``. This code will then be turned     |
|                         | back into the same byte when the              |
|                         | ``'surrogateescape'`` error handler is used   |
|                         | when encoding the data. (See :pep:`383` for   |
|                         | more.)                                        |
+-------------------------+-----------------------------------------------+

In addition, the following error handler is specific to the given codecs:

+-------------------+------------------------+-------------------------------------------+
| Value             | Codecs                 | Meaning                                   |
+===================+========================+===========================================+
|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding of surrogate  |
|                   | utf-16-be, utf-16-le,  | codes. These codecs normally treat the    |
|                   | utf-32-be, utf-32-le   | presence of surrogates as an error.       |
+-------------------+------------------------+-------------------------------------------+

.. versionadded:: 3.1
   The ``'surrogateescape'`` and ``'surrogatepass'`` error handlers.

.. versionchanged:: 3.4
   The ``'surrogatepass'`` error handler now works with utf-16\* and utf-32\* codecs.

.. versionadded:: 3.5
   The ``'namereplace'`` error handler.

.. versionchanged:: 3.5
   The ``'backslashreplace'`` error handler now works with decoding and
   translating.
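
To make the tables above concrete, the following sketch applies each standard
handler to the same input. The sample string and byte sequence are arbitrary
choices for this illustration.

```python
text = "€5 ü"

# Encoding to ASCII with each handler that avoids raising.
results = {
    name: text.encode("ascii", errors=name)
    for name in ("ignore", "replace", "xmlcharrefreplace",
                 "backslashreplace", "namereplace")
}

# 'replace' on decoding substitutes U+FFFD for the malformed byte.
decoded_replace = b"abc\xff".decode("utf-8", errors="replace")

# 'surrogateescape' smuggles the undecodable byte through as U+DCFF and
# restores it when the same handler is used for encoding.
raw = b"abc\xff"
escaped = raw.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")
```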

The set of allowed values can be extended by registering a new named error
handler:

.. function:: register_error(name, error_handler)

   Register the error handling function *error_handler* under the name *name*.
   The *error_handler* argument will be called during encoding and decoding
   in case of an error, when *name* is specified as the errors parameter.

   For encoding, *error_handler* will be called with a :exc:`UnicodeEncodeError`
   instance, which contains information about the location of the error. The
   error handler must either raise this or a different exception, or return a
   tuple with a replacement for the unencodable part of the input and a position
   where encoding should continue. The replacement may be either :class:`str` or
   :class:`bytes`. If the replacement is bytes, the encoder will simply copy
   them into the output buffer. If the replacement is a string, the encoder will
   encode the replacement. Encoding continues on original input at the
   specified position. Negative position values will be treated as being
   relative to the end of the input string. If the resulting position is out of
   bounds, an :exc:`IndexError` will be raised.

   Decoding and translating work similarly, except that :exc:`UnicodeDecodeError` or
   :exc:`UnicodeTranslateError` will be passed to the handler and the
   replacement from the error handler will be put into the output directly.
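
A minimal sketch of a custom handler follows. The handler name
``demo_underscore`` and the function ``underscore_errors`` are hypothetical and
exist only for this example; the handler replaces each offending character or
byte with an underscore and resumes right after the failing span.

```python
import codecs

def underscore_errors(exc):
    """Hypothetical handler: substitute '_' for every offending item."""
    if isinstance(exc, (UnicodeEncodeError, UnicodeDecodeError)):
        # exc.start and exc.end delimit the part that could not be processed.
        replacement = "_" * (exc.end - exc.start)
        return replacement, exc.end  # (replacement, position to resume at)
    raise exc  # anything else is re-raised unchanged

codecs.register_error("demo_underscore", underscore_errors)

encoded = "a€b".encode("ascii", errors="demo_underscore")
decoded = b"a\xffb".decode("ascii", errors="demo_underscore")
```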

Previously registered error handlers (including the standard error handlers)
can be looked up by name:

.. function:: lookup_error(name)

   Return the error handler previously registered under the name *name*.

   Raises a :exc:`LookupError` in case the handler cannot be found.

The following standard error handlers are also made available as module level
functions:

.. function:: strict_errors(exception)

   Implements the ``'strict'`` error handling: each encoding or
   decoding error raises a :exc:`UnicodeError`.

.. function:: replace_errors(exception)

   Implements the ``'replace'`` error handling (for :term:`text encodings
   <text encoding>` only): substitutes ``'?'`` for encoding errors
   (to be encoded by the codec), and ``'\ufffd'`` (the Unicode replacement
   character) for decoding errors.

.. function:: ignore_errors(exception)

   Implements the ``'ignore'`` error handling: malformed data is ignored and
   encoding or decoding is continued without further notice.

.. function:: xmlcharrefreplace_errors(exception)

   Implements the ``'xmlcharrefreplace'`` error handling (for encoding with
   :term:`text encodings <text encoding>` only): the
   unencodable character is replaced by an appropriate XML character reference.

.. function:: backslashreplace_errors(exception)

   Implements the ``'backslashreplace'`` error handling (for
   :term:`text encodings <text encoding>` only): malformed data is
   replaced by a backslashed escape sequence.

.. function:: namereplace_errors(exception)

   Implements the ``'namereplace'`` error handling (for encoding with
   :term:`text encodings <text encoding>` only): the
   unencodable character is replaced by a ``\N{...}`` escape sequence.

   .. versionadded:: 3.5

.. _codec-objects:

Stateless Encoding and Decoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The base :class:`Codec` class defines these methods which also define the
function interfaces of the stateless encoder and decoder:

.. method:: Codec.encode(input[, errors])

   Encodes the object *input* and returns a tuple (output object, length consumed).
   For instance, :term:`text encoding` converts
   a string object to a bytes object using a particular
   character set encoding (e.g., ``cp1252`` or ``iso-8859-1``).

   The *errors* argument defines the error handling to apply.
   It defaults to ``'strict'`` handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamWriter` for codecs which have to keep state in order to make
   encoding efficient.

   The encoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.

.. method:: Codec.decode(input[, errors])

   Decodes the object *input* and returns a tuple (output object, length
   consumed). For instance, for a :term:`text encoding`, decoding converts
   a bytes object encoded using a particular
   character set encoding to a string object.

   For text encodings and bytes-to-bytes codecs,
   *input* must be a bytes object or one which provides the read-only
   buffer interface -- for example, buffer objects and memory mapped files.

   The *errors* argument defines the error handling to apply.
   It defaults to ``'strict'`` handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamReader` for codecs which have to keep state in order to make
   decoding efficient.

   The decoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.
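
The stateless interface can be sketched with a small subclass. The class
``UpperAsciiCodec`` is hypothetical and written only for this example; it
upper-cases text while encoding to ASCII and keeps no state between calls.

```python
import codecs

class UpperAsciiCodec(codecs.Codec):
    """Hypothetical stateless codec used only for illustration."""

    def encode(self, input, errors="strict"):
        # Return (output object, length consumed), as the interface requires.
        return input.upper().encode("ascii", errors), len(input)

    def decode(self, input, errors="strict"):
        data = bytes(input)  # accept any object exposing the buffer interface
        return data.decode("ascii", errors), len(data)

codec = UpperAsciiCodec()
encoded, consumed = codec.encode("abc")
decoded, used = codec.decode(memoryview(b"ABC"))

# Zero length input must yield an empty object of the output type.
empty_encoded = codec.encode("")
empty_decoded = codec.decode(b"")
```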

Incremental Encoding and Decoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide
the basic interface for incremental encoding and decoding. Encoding/decoding the
input isn't done with one call to the stateless encoder/decoder function, but
with multiple calls to the
:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method of
the incremental encoder/decoder. The incremental encoder/decoder keeps track of
the encoding/decoding process during method calls.

The joined output of calls to the
:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method is
the same as if all the single inputs were joined into one, and this input was
encoded/decoded with the stateless encoder/decoder.
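
The equivalence described above can be checked with a short sketch. The inputs
are arbitrary: a three-byte UTF-8 character fed to a decoder one byte at a
time, and two text pieces fed to a UTF-16 encoder.

```python
import codecs

# Decoding: the decoder buffers incomplete sequences between calls.
data = "€".encode("utf-8")  # three bytes
decoder = codecs.getincrementaldecoder("utf-8")()
parts = [
    decoder.decode(data[:1]),
    decoder.decode(data[1:2]),
    decoder.decode(data[2:], final=True),
]

# Encoding: the joined output equals a single stateless encode, so the
# byte order mark is emitted only once.
encoder = codecs.getincrementalencoder("utf-16")()
first = encoder.encode("a")
second = encoder.encode("b", final=True)
joined = first + second
```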

513

514

515

.. _incremental-encoder-objects:

516

517

IncrementalEncoder Objects

Nick Coghlan

2015-01-07 00:22:00 +1000

[diff] [blame]

518

~~~~~~~~~~~~~~~~~~~~~~~~~~

Georg Brandl

2007-08-15 14:28:22 +0000

[diff] [blame]

519

Georg Brandl

2007-08-15 14:28:22 +0000

[diff] [blame]

520

The :class:`IncrementalEncoder` class is used for encoding an input in multiple

521

steps. It defines the following methods which every incremental encoder must

522

define in order to be compatible with the Python codec registry.

.. class:: IncrementalEncoder(errors='strict')

   Constructor for an :class:`IncrementalEncoder` instance.

   All incremental encoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalEncoder` may implement different error handling schemes
   by providing the *errors* keyword argument. See :ref:`error-handlers` for
   possible values.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalEncoder`
   object.


   .. method:: encode(object[, final])

      Encodes *object* (taking the current state of the encoder into account)
      and returns the resulting encoded object. If this is the last call to
      :meth:`encode`, *final* must be true (the default is false).


   .. method:: reset()

      Reset the encoder to the initial state. The output is discarded: call
      ``.encode(object, final=True)``, passing an empty byte or text string
      if necessary, to reset the encoder and to get the output.


   .. method:: getstate()

      Return the current state of the encoder, which must be an integer. The
      implementation should make sure that ``0`` is the most common
      state. (States that are more complicated than integers can be converted
      into an integer by marshaling/pickling the state and encoding the bytes
      of the resulting string into an integer.)


   .. method:: setstate(state)

      Set the state of the encoder to *state*. *state* must be an encoder state
      returned by :meth:`getstate`.
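The following sketch exercises these methods. UTF-16 is chosen here only
because its encoder has easily observable state (whether the byte order mark
has been written yet); the variable names are illustrative:

```python
import codecs

encoder = codecs.getincrementalencoder("utf-16")()
first = encoder.encode("a")        # byte order mark followed by "a"
state = encoder.getstate()         # an integer
second = encoder.encode("b")       # no byte order mark this time

encoder.setstate(state)            # go back to the saved state
replayed = encoder.encode("b")     # same output as `second`

encoder.reset()                    # back to the initial state
after_reset = encoder.encode("a", final=True)   # the mark is written again

# The errors attribute can be reassigned between calls.
ascii_encoder = codecs.getincrementalencoder("ascii")(errors="strict")
ascii_encoder.errors = "replace"
replaced = ascii_encoder.encode("é", final=True)
```

With the ``'replace'`` handler the unencodable character is expected to come
out as ``b'?'`` instead of raising :exc:`UnicodeEncodeError`.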

.. _incremental-decoder-objects:

IncrementalDecoder Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~

The :class:`IncrementalDecoder` class is used for decoding an input in multiple
steps. It defines the following methods which every incremental decoder must
define in order to be compatible with the Python codec registry.

.. class:: IncrementalDecoder(errors='strict')

   Constructor for an :class:`IncrementalDecoder` instance.

   All incremental decoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalDecoder` may implement different error handling schemes
   by providing the *errors* keyword argument. See :ref:`error-handlers` for
   possible values.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalDecoder`
   object.


   .. method:: decode(object[, final])

      Decodes *object* (taking the current state of the decoder into account)
      and returns the resulting decoded object. If this is the last call to
      :meth:`decode`, *final* must be true (the default is false). If *final* is
      true, the decoder must decode the input completely and must flush all
      buffers. If this isn't possible (e.g. because of incomplete byte sequences
      at the end of the input), it must initiate error handling just like in the
      stateless case (which might raise an exception).


   .. method:: reset()

      Reset the decoder to the initial state.


   .. method:: getstate()

      Return the current state of the decoder. This must be a tuple with two
      items. The first must be the buffer containing the still undecoded
      input. The second must be an integer and can be additional state
      info. (The implementation should make sure that ``0`` is the most common
      additional state info.) If this additional state info is ``0``, it must be
      possible to set the decoder to the state which has no input buffered and
      ``0`` as the additional state info, so that feeding the previously
      buffered input to the decoder returns it to the previous state without
      producing any output. (Additional state info that is more complicated than
      integers can be converted into an integer by marshaling/pickling the info
      and encoding the bytes of the resulting string into an integer.)


   .. method:: setstate(state)

      Set the state of the decoder to *state*. *state* must be a decoder state
      returned by :meth:`getstate`.
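A short sketch of the decoder state, assuming the UTF-8 codec and a single
three-byte character split across two calls (both are illustrative choices):

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()
euro = "€".encode("utf-8")               # a three-byte sequence

partial = decoder.decode(euro[:2])       # incomplete: nothing is returned yet
saved = decoder.getstate()               # (undecoded input, integer)

completed = decoder.decode(euro[2:])     # the last byte completes the character

decoder.setstate(saved)                  # restore the two buffered bytes
replayed = decoder.decode(euro[2:])

decoder.reset()
try:
    decoder.decode(euro[:2], final=True)  # truncated input on the last call
    truncated_error = None
except UnicodeDecodeError as exc:
    truncated_error = exc
```

The final call shows the behaviour described for *final*: with the default
``'strict'`` handler, an incomplete sequence at the end of the input raises
:exc:`UnicodeDecodeError` rather than being silently buffered.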

Stream Encoding and Decoding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^


The :class:`StreamWriter` and :class:`StreamReader` classes provide generic
working interfaces which can be used to implement new encoding submodules very
easily. See :mod:`encodings.utf_8` for an example of how this is done.


.. _stream-writer-objects:

StreamWriter Objects
~~~~~~~~~~~~~~~~~~~~

The :class:`StreamWriter` class is a subclass of :class:`Codec` and defines the
following methods which every stream writer must define in order to be
compatible with the Python codec registry.

.. class:: StreamWriter(stream, errors='strict')

   Constructor for a :class:`StreamWriter` instance.

   All stream writers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   The *stream* argument must be a file-like object open for writing
   text or binary data, as appropriate for the specific codec.

   The :class:`StreamWriter` may implement different error handling schemes by
   providing the *errors* keyword argument. See :ref:`error-handlers` for
   the standard error handlers the underlying stream codec may support.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamWriter` object.

   .. method:: write(object)

      Writes the object's contents encoded to the stream.


   .. method:: writelines(list)

      Writes the concatenated list of strings to the stream (possibly by reusing
      the :meth:`write` method). The standard bytes-to-bytes codecs
      do not support this method.


   .. method:: reset()

      Flushes and resets the codec buffers used for keeping state.

      Calling this method should ensure that the data on the output is put into
      a clean state that allows appending of new fresh data without having to
      rescan the whole stream to recover state.


   In addition to the above methods, the :class:`StreamWriter` must also inherit
   all other methods and attributes from the underlying stream.
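As a rough illustration, the sketch below wraps an in-memory
:class:`io.BytesIO` object (an assumption made to keep the example
self-contained; any binary file-like object open for writing would do) with
the stream writers obtained from :func:`getwriter`:

```python
import codecs
import io

raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw)
writer.write("héllo\n")
writer.writelines(["a\n", "b\n"])
# getvalue() is not a StreamWriter method; the call is passed on to the
# underlying BytesIO object.
written = writer.getvalue()

# Switching the error handler during the lifetime of the writer.
lossy_raw = io.BytesIO()
lossy = codecs.getwriter("ascii")(lossy_raw, errors="strict")
lossy.errors = "replace"
lossy.write("é")
```

After the second part, ``lossy_raw`` should contain the replacement byte
``b'?'`` instead of the unencodable character.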

.. _stream-reader-objects:

StreamReader Objects
~~~~~~~~~~~~~~~~~~~~

The :class:`StreamReader` class is a subclass of :class:`Codec` and defines the
following methods which every stream reader must define in order to be
compatible with the Python codec registry.

.. class:: StreamReader(stream, errors='strict')

   Constructor for a :class:`StreamReader` instance.

   All stream readers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   The *stream* argument must be a file-like object open for reading
   text or binary data, as appropriate for the specific codec.

   The :class:`StreamReader` may implement different error handling schemes by
   providing the *errors* keyword argument. See :ref:`error-handlers` for
   the standard error handlers the underlying stream codec may support.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamReader` object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: read([size[, chars[, firstline]]])

      Decodes data from the stream and returns the resulting object.

      The *chars* argument indicates the number of decoded
      code points or bytes to return. The :meth:`read` method will
      never return more data than requested, but it might return less,
      if there is not enough available.

      The *size* argument indicates the approximate maximum
      number of encoded bytes or code points to read
      for decoding. The decoder can modify this setting as
      appropriate. The default value -1 indicates to read and decode as much as
      possible. This parameter is intended to
      prevent having to decode huge files in one step.

      The *firstline* flag indicates that
      it would be sufficient to only return the first
      line, if there are decoding errors on later lines.

      The method should use a greedy read strategy, meaning that it should read
      as much data as is allowed within the definition of the encoding and the
      given size, e.g. if optional encoding endings or state markers are
      available on the stream, these should be read too.


   .. method:: readline([size[, keepends]])

      Read one line from the input stream and return the decoded data.

      *size*, if given, is passed as the *size* argument to the stream's
      :meth:`read` method.

      If *keepends* is false, line-endings will be stripped from the lines
      returned.


   .. method:: readlines([sizehint[, keepends]])

      Read all lines available on the input stream and return them as a list of
      lines.

      Line-endings are implemented using the codec's decoder method and are
      included in the list entries if *keepends* is true.

      *sizehint*, if given, is passed as the *size* argument to the stream's
      :meth:`read` method.


   .. method:: reset()

      Resets the codec buffers used for keeping state.

      Note that no stream repositioning should take place. This method is
      primarily intended to be able to recover from decoding errors.


   In addition to the above methods, the :class:`StreamReader` must also inherit
   all other methods and attributes from the underlying stream.
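The reading methods can be tried out on an in-memory stream. This is only a
sketch: the sample data, the use of :class:`io.BytesIO` and the UTF-8 codec
are assumptions made for illustration:

```python
import codecs
import io

data = "first\nsecond\nthird\n".encode("utf-8")

reader = codecs.getreader("utf-8")(io.BytesIO(data))
line_one = reader.readline()                 # line ending kept by default
line_two = reader.readline(keepends=False)   # line ending stripped
remaining = reader.readlines()               # list of the lines that are left

reader = codecs.getreader("utf-8")(io.BytesIO(data))
head = reader.read(chars=5)   # at most five decoded characters
tail = reader.read()          # everything that is left
```

Note that *chars* limits what is returned, not what is read from the
underlying stream: the reader may buffer decoded text internally and hand it
out on later calls.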

.. _stream-reader-writer:

StreamReaderWriter Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~

The :class:`StreamReaderWriter` is a convenience class that allows wrapping
streams which work in both read and write modes.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.

.. class:: StreamReaderWriter(stream, Reader, Writer, errors='strict')

   Creates a :class:`StreamReaderWriter` instance. *stream* must be a file-like
   object. *Reader* and *Writer* must be factory functions or classes providing the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively. Error
   handling is done in the same way as defined for the stream readers and writers.

:class:`StreamReaderWriter` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.
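A minimal sketch of constructing such an object from the factories returned by
:func:`lookup`; the in-memory stream and the UTF-8 codec are illustrative
assumptions:

```python
import codecs
import io

info = codecs.lookup("utf-8")
raw = io.BytesIO()
stream = codecs.StreamReaderWriter(raw, info.streamreader, info.streamwriter)

stream.write("größe\n")   # encoded by the writer before reaching `raw`
stream.seek(0)            # rewind before reading back
text = stream.read()      # decoded by the reader
```

Here ``raw`` holds the UTF-8 bytes, while code using ``stream`` only ever sees
text.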

.. _stream-recoder-objects:

StreamRecoder Objects
~~~~~~~~~~~~~~~~~~~~~

The :class:`StreamRecoder` translates data from one encoding to another,
which is sometimes useful when dealing with different encoding environments.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.

.. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors='strict')

   Creates a :class:`StreamRecoder` instance which implements a two-way conversion:
   *encode* and *decode* work on the frontend (the data visible to
   code calling :meth:`read` and :meth:`write`), while *Reader* and *Writer*
   work on the backend (the data in *stream*).

   You can use these objects to do transparent transcodings from e.g. Latin-1
   to UTF-8 and back.

   The *stream* argument must be a file-like object.

   The *encode* and *decode* arguments must
   adhere to the :class:`Codec` interface. *Reader* and
   *Writer* must be factory functions or classes providing objects of the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively.

   Error handling is done in the same way as defined for the stream readers and
   writers.


:class:`StreamRecoder` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.
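The frontend/backend split can be sketched as follows, assuming a UTF-8
frontend and a Latin-1 backend stored in an :class:`io.BytesIO` object (both
choices are only for illustration):

```python
import codecs
import io

utf8 = codecs.lookup("utf-8")       # frontend: what callers read and write
latin1 = codecs.lookup("latin-1")   # backend: what is stored in the stream

backend = io.BytesIO()
recoder = codecs.StreamRecoder(backend, utf8.encode, utf8.decode,
                               latin1.streamreader, latin1.streamwriter)

recoder.write("é".encode("utf-8"))  # caller hands over UTF-8 bytes
stored = backend.getvalue()         # the stream holds Latin-1 bytes

backend.seek(0)
read_back = recoder.read()          # caller gets UTF-8 bytes again
```

On writing, the data is first decoded with the frontend *decode* function and
then encoded by the backend *Writer*; reading runs the same steps in reverse.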

.. _encodings-overview:

Encodings and Unicode
---------------------

Strings are stored internally as sequences of code points in the
range ``0x0``--``0x10FFFF``. (See :pep:`393` for
more details about the implementation.)
Once a string object is used outside of CPU and memory, endianness
and how these arrays are stored as bytes become an issue. As with other
codecs, serialising a string into a sequence of bytes is known as *encoding*,
and recreating the string from the sequence of bytes is known as *decoding*.
There are a variety of different text serialisation codecs, which are
collectively referred to as :term:`text encodings <text encoding>`.

The simplest text encoding (called ``'latin-1'`` or ``'iso-8859-1'``) maps
the code points 0--255 to the bytes ``0x0``--``0xff``, which means that a string
object that contains code points above ``U+00FF`` can't be encoded with this
codec. Doing so will raise a :exc:`UnicodeEncodeError` that looks
like the following (although the details of the error message may differ):
``UnicodeEncodeError: 'latin-1' codec can't encode character '\u1234' in
position 3: ordinal not in range(256)``.
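A small sketch reproducing this situation; the sample strings are arbitrary:

```python
encoded = "café".encode("latin-1")      # every code point is below U+0100

try:
    "abc\u1234".encode("latin-1")       # U+1234 has no Latin-1 byte
    error = None
except UnicodeEncodeError as exc:
    error = exc
    print(exc)
```

The exception object also carries the offending position in its ``start`` and
``end`` attributes, which is what custom error handlers work with.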

Georg Brandl

2007-08-15 14:28:22 +0000

[diff] [blame]

877

878

There's another group of encodings (the so called charmap encodings) that choose

Georg Brandl

2015-01-14 08:26:30 +0100

[diff] [blame]

879

a different subset of all Unicode code points and how these code points are

Serhiy Storchaka

c7b1a0b

2016-11-26 13:43:28 +0200

[diff] [blame]

880

mapped to the bytes ``0x0``--``0xff``. To see how this is done simply open

Georg Brandl

2007-08-15 14:28:22 +0000

[diff] [blame]

881

e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on

882

Windows). There's a string constant with 256 characters that shows you which

883

character is mapped to which byte value.
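The mapping can also be inspected from Python. The sketch below assumes the
CPython layout in which the module exposes that string constant as
``decoding_table``:

```python
import encodings.cp1252

table = encodings.cp1252.decoding_table   # one character per byte value

euro_byte = "€".encode("cp1252")          # cp1252 assigns a byte to the euro sign
e_acute = b"\xe9".decode("cp1252")        # this byte agrees with Latin-1

try:
    "€".encode("latin-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False
```

The same character can therefore be encodable in one charmap encoding and not
in another, even though both use single bytes.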

All of these encodings can only encode 256 of the 1114112 code points
defined in Unicode. A simple and straightforward way to store each Unicode
code point is to store each code point as four consecutive bytes. There are two
possibilities: store the bytes in big endian or in little endian order. These
two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
problem: bytes will always be in natural endianness. When these bytes are read
by a CPU with a different endianness, then bytes have to be swapped though.

To be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
there's the so-called BOM ("Byte Order Mark"). This is the Unicode character
``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
byte sequence. The byte swapped version of this character (``0xFFFE``) is an
illegal character that may not appear in a Unicode text. So when the
first character in a ``UTF-16`` or ``UTF-32`` byte sequence
appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.

Unfortunately the character ``U+FEFF`` had a second purpose as
a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
a word to be split. It can e.g. be used to give hints to a ligature algorithm.
With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
Unicode software still must be able to handle ``U+FEFF`` in both roles: as a BOM
it's a device to determine the storage layout of the encoded bytes, and vanishes
once the byte sequence has been decoded into a string; as a
``ZERO WIDTH NO-BREAK SPACE`` it's a normal character that will be decoded like
any other.
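The byte layouts and the two roles of ``U+FEFF`` can be observed directly. The
following is a sketch with an arbitrary one-character sample:

```python
import codecs
import sys

big = "A".encode("utf-32-be")      # most significant byte first
little = "A".encode("utf-32-le")   # least significant byte first

# The generic codec writes a BOM in the machine's native byte order.
native_bom = codecs.BOM_UTF16_LE if sys.byteorder == "little" else codecs.BOM_UTF16_BE
with_bom = "A".encode("utf-16")

# A BOM lets the generic decoder handle either byte order...
foreign = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")
via_generic = foreign.decode("utf-16")

# ...while a codec with explicit byte order keeps U+FEFF as a character.
via_explicit = foreign.decode("utf-16-be")
```

Only the generic ``utf-16`` and ``utf-32`` codecs interpret a leading
``U+FEFF`` as a BOM and drop it from the decoded string.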

There's another encoding that is able to encode the full range of Unicode
characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
parts: marker bits (the most significant bits) and payload bits. The marker bits
are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode characters are
encoded like this (with x being payload bits, which when concatenated give the
Unicode character):

+-----------------------------------+----------------------------------------------+
| Range                             | Encoding                                     |
+===================================+==============================================+
| ``U-00000000`` ... ``U-0000007F`` | 0xxxxxxx                                     |
+-----------------------------------+----------------------------------------------+
| ``U-00000080`` ... ``U-000007FF`` | 110xxxxx 10xxxxxx                            |
+-----------------------------------+----------------------------------------------+
| ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
+-----------------------------------+----------------------------------------------+
| ``U-00010000`` ... ``U-0010FFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
+-----------------------------------+----------------------------------------------+

The least significant bit of the Unicode character is the rightmost x bit.
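The table can be turned into a few lines of bit arithmetic. The helper below
is a hypothetical illustration, not how the built-in codec is implemented; it
takes a code point as an integer and does not reject surrogates, which the
real codec refuses to encode:

```python
def utf8_encode_code_point(cp):
    """Encode one code point following the marker/payload layout above."""
    if cp < 0x80:
        return bytes([cp])                         # 0xxxxxxx
    if cp < 0x800:
        return bytes([0b11000000 | (cp >> 6),      # 110xxxxx
                      0b10000000 | (cp & 0x3F)])   # 10xxxxxx
    if cp < 0x10000:
        return bytes([0b11100000 | (cp >> 12),           # 1110xxxx
                      0b10000000 | ((cp >> 6) & 0x3F),   # 10xxxxxx
                      0b10000000 | (cp & 0x3F)])         # 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),               # 11110xxx
                  0b10000000 | ((cp >> 12) & 0x3F),      # 10xxxxxx
                  0b10000000 | ((cp >> 6) & 0x3F),       # 10xxxxxx
                  0b10000000 | (cp & 0x3F)])             # 10xxxxxx


# One sample character per row of the table.
samples = ["A", "é", "€", "😀"]
matches = [utf8_encode_code_point(ord(ch)) == ch.encode("utf-8") for ch in samples]
print(matches)
```

Each sample falls into a different row of the table, so the comparison covers
all four sequence lengths.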

As UTF-8 is an 8-bit encoding no BOM is required and any ``U+FEFF`` character in
the decoded string (even if it's the first character) is treated as a
``ZERO WIDTH NO-BREAK SPACE``.

Without external information it's impossible to reliably determine which
encoding was used for encoding a string. Each charmap encoding can
decode any random byte sequence. However that's not possible with UTF-8, as
UTF-8 byte sequences have a structure that doesn't allow arbitrary byte
sequences. To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python calls
``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters
is written to the file, a UTF-8 encoded BOM (which looks like this as a byte
sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable
that any charmap encoded file starts with these byte values (which would e.g.
map to

   | LATIN SMALL LETTER I WITH DIAERESIS
   | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
   | INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a ``utf-8-sig`` encoding can be
correctly guessed from the byte sequence. So here the BOM is not used to be able
to determine the byte order used for generating the byte sequence, but as a
signature that helps in guessing the encoding. On encoding, the ``utf-8-sig`` codec
will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
decoding, ``utf-8-sig`` will skip those three bytes if they appear as the first
three bytes in the file. In UTF-8, the use of the BOM is discouraged and
should generally be avoided.
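The difference between the two codecs can be sketched in a few lines; the
sample text is arbitrary:

```python
import codecs

signed = "hi".encode("utf-8-sig")           # signature followed by the text

skipped = signed.decode("utf-8-sig")        # signature removed on decoding
kept = signed.decode("utf-8")               # plain UTF-8 keeps U+FEFF
unsigned = b"hi".decode("utf-8-sig")        # a missing signature is not an error

as_latin1 = codecs.BOM_UTF8.decode("latin-1")   # the three characters listed above
```

Because ``utf-8-sig`` accepts input with or without the signature, it is a
reasonable choice for reading files of uncertain origin, while plain ``utf-8``
remains preferable for writing.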

.. _standard-encodings:

Standard Encodings
------------------

Python comes with a number of codecs built-in, either implemented as C functions
or with dictionaries as mapping tables. The following table lists the codecs by
name, together with a few common aliases, and the languages for which the
encoding is likely used. Neither the list of aliases nor the list of languages
is meant to be exhaustive. Notice that spelling alternatives that only differ in
case or use a hyphen instead of an underscore are also valid aliases; therefore,
e.g. ``'utf-8'`` is a valid alias for the ``'utf_8'`` codec.
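A quick way to check that such spellings resolve to the same codec is to
compare the results of :func:`lookup`. The list of spellings below is only a
sample:

```python
import codecs

spellings = ["utf_8", "utf-8", "UTF-8", "Utf_8"]

# All spellings should resolve to one and the same codec name...
canonical_names = {codecs.lookup(name).name for name in spellings}

# ...and therefore produce identical bytes.
same_bytes = {"é".encode(name) for name in spellings}
```

Both sets are expected to contain exactly one element.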

.. impl-detail::

   Some common encodings can bypass the codecs lookup machinery to
   improve performance. These optimization opportunities are only
   recognized by CPython for a limited set of (case insensitive)
   aliases: utf-8, utf8, latin-1, latin1, iso-8859-1, iso8859-1, mbcs
   (Windows only), ascii, us-ascii, utf-16, utf16, utf-32, utf32, and
   the same using underscores instead of dashes. Using alternative
   aliases for these encodings may result in slower execution.

   .. versionchanged:: 3.6
      Optimization opportunity recognized for us-ascii.


Many of the character sets support the same languages. They vary in individual
characters (e.g. whether the EURO SIGN is supported or not), and in the
assignment of characters to code positions. For the European languages in
particular, the following variants typically exist:

* an ISO 8859 codeset

* a Microsoft Windows code page, which is typically derived from an 8859 codeset,
  but replaces control characters with additional graphic characters

* an IBM EBCDIC code page

* an IBM PC code page, which is ASCII compatible
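
The differences between these variants can be seen by encoding the same
characters with several codecs from the table below. This is only a sketch; the
particular codecs compared here were picked for illustration:

```python
euro = "\u20ac"  # EURO SIGN

# A Windows code page and a later ISO 8859 codeset both support the
# character, but assign it to different code positions.
in_cp1252 = euro.encode("cp1252")
in_iso8859_15 = euro.encode("iso8859_15")

# latin_1 (ISO 8859-1) has no EURO SIGN at all.
try:
    euro.encode("latin_1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

# An IBM PC code page is ASCII compatible, an EBCDIC code page is not.
a_ibm_pc = "A".encode("cp437")
a_ebcdic = "A".encode("cp037")
```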

.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|

+-----------------+--------------------------------+--------------------------------+
| Codec           | Aliases                        | Languages                      |
+=================+================================+================================+
| ascii           | 646, us-ascii                  | English                        |
+-----------------+--------------------------------+--------------------------------+
| big5            | big5-tw, csbig5                | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| big5hkscs       | big5-hkscs, hkscs              | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp037           | IBM037, IBM039                 | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp273           | 273, IBM273, csIBM273          | German                         |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.4          |
+-----------------+--------------------------------+--------------------------------+
| cp424           | EBCDIC-CP-HE, IBM424           | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp437           | 437, IBM437                    | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp500           | EBCDIC-CP-BE, EBCDIC-CP-CH,    | Western Europe                 |
|                 | IBM500                         |                                |
+-----------------+--------------------------------+--------------------------------+
| cp720           |                                | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp737           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp775           | IBM775                         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp850           | 850, IBM850                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp852           | 852, IBM852                    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp855           | 855, IBM855                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp856           |                                | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp857           | 857, IBM857                    | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp858           | 858, IBM858                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp860           | 860, IBM860                    | Portuguese                     |
+-----------------+--------------------------------+--------------------------------+
| cp861           | 861, CP-IS, IBM861             | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| cp862           | 862, IBM862                    | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp863           | 863, IBM863                    | Canadian                       |
+-----------------+--------------------------------+--------------------------------+
| cp864           | IBM864                         | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp865           | 865, IBM865                    | Danish, Norwegian              |
+-----------------+--------------------------------+--------------------------------+
| cp866           | 866, IBM866                    | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| cp869           | 869, CP-GR, IBM869             | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp874           |                                | Thai                           |
+-----------------+--------------------------------+--------------------------------+
| cp875           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp932           | 932, ms932, mskanji, ms-kanji  | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| cp949           | 949, ms949, uhc                | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| cp950           | 950, ms950                     | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp1006          |                                | Urdu                           |
+-----------------+--------------------------------+--------------------------------+
| cp1026          | ibm1026                        | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1125          | 1125, ibm1125, cp866u, ruscii  | Ukrainian                      |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.4          |
+-----------------+--------------------------------+--------------------------------+
| cp1140          | ibm1140                        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1250          | windows-1250                   | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp1251          | windows-1251                   | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp1252          | windows-1252                   | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1253          | windows-1253                   | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp1254          | windows-1254                   | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1255          | windows-1255                   | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp1256          | windows-1256                   | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp1257          | windows-1257                   | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp1258          | windows-1258                   | Vietnamese                     |
+-----------------+--------------------------------+--------------------------------+
| cp65001         |                                | Windows only: Windows UTF-8    |
|                 |                                | (``CP_UTF8``)                  |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.3          |
+-----------------+--------------------------------+--------------------------------+
| euc_jp          | eucjp, ujis, u-jis             | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jis_2004    | jisx0213, eucjis2004           | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jisx0213    | eucjisx0213                    | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_kr          | euckr, korean, ksc5601,        | Korean                         |
|                 | ks_c-5601, ks_c-5601-1987,     |                                |
|                 | ksx1001, ks_x-1001             |                                |
+-----------------+--------------------------------+--------------------------------+
| gb2312          | chinese, csiso58gb231280,      | Simplified Chinese             |
|                 | euc-cn, euccn, eucgb2312-cn,   |                                |
|                 | gb2312-1980, gb2312-80,        |                                |
|                 | iso-ir-58                      |                                |
+-----------------+--------------------------------+--------------------------------+
| gbk             | 936, cp936, ms936              | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| gb18030         | gb18030-2000                   | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| hz              | hzgb, hz-gb, hz-gb-2312        | Simplified Chinese             |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp      | csiso2022jp, iso2022jp,        | Japanese                       |
|                 | iso-2022-jp                    |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_1    | iso2022jp-1, iso-2022-jp-1     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2    | iso2022jp-2, iso-2022-jp-2     | Japanese, Korean, Simplified   |
|                 |                                | Chinese, Western Europe, Greek |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2004 | iso2022jp-2004,                | Japanese                       |
|                 | iso-2022-jp-2004               |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_3    | iso2022jp-3, iso-2022-jp-3     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_ext  | iso2022jp-ext, iso-2022-jp-ext | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_kr      | csiso2022kr, iso2022kr,        | Korean                         |
|                 | iso-2022-kr                    |                                |
+-----------------+--------------------------------+--------------------------------+
| latin_1         | iso-8859-1, iso8859-1, 8859,   | West Europe                    |
|                 | cp819, latin, latin1, L1       |                                |
+-----------------+--------------------------------+--------------------------------+
| iso8859_2       | iso-8859-2, latin2, L2         | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| iso8859_3       | iso-8859-3, latin3, L3         | Esperanto, Maltese             |
+-----------------+--------------------------------+--------------------------------+
| iso8859_4       | iso-8859-4, latin4, L4         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_5       | iso-8859-5, cyrillic           | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| iso8859_6       | iso-8859-6, arabic             | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_7       | iso-8859-7, greek, greek8      | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| iso8859_8       | iso-8859-8, hebrew             | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_9       | iso-8859-9, latin5, L5         | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| iso8859_10      | iso-8859-10, latin6, L6        | Nordic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_11      | iso-8859-11, thai              | Thai languages                 |
+-----------------+--------------------------------+--------------------------------+
| iso8859_13      | iso-8859-13, latin7, L7        | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_14      | iso-8859-14, latin8, L8        | Celtic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_15      | iso-8859-15, latin9, L9        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| iso8859_16      | iso-8859-16, latin10, L10      | South-Eastern Europe           |
+-----------------+--------------------------------+--------------------------------+
| johab           | cp1361, ms1361                 | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| koi8_r          |                                | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| koi8_t          |                                | Tajik                          |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.5          |
+-----------------+--------------------------------+--------------------------------+
| koi8_u          |                                | Ukrainian                      |
+-----------------+--------------------------------+--------------------------------+
| kz1048          | kz_1048, strk1048_2002, rk1048 | Kazakh                         |
|                 |                                |                                |
|                 |                                | .. versionadded:: 3.5          |
+-----------------+--------------------------------+--------------------------------+
| mac_cyrillic    | maccyrillic                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| mac_greek       | macgreek                       | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| mac_iceland     | maciceland                     | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| mac_latin2      | maclatin2, maccentraleurope    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| mac_roman       | macroman, macintosh            | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| mac_turkish     | macturkish                     | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| ptcp154         | csptcp154, pt154, cp154,       | Kazakh                         |
|                 | cyrillic-asian                 |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis       | csshiftjis, shiftjis, sjis,    | Japanese                       |
|                 | s_jis                          |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis_2004  | shiftjis2004, sjis_2004,       | Japanese                       |
|                 | sjis2004                       |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jisx0213  | shiftjisx0213, sjisx0213,      | Japanese                       |
|                 | s_jisx0213                     |                                |
+-----------------+--------------------------------+--------------------------------+
| utf_32          | U32, utf32                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_be       | UTF-32BE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_le       | UTF-32LE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16          | U16, utf16                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16_be       | UTF-16BE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16_le       | UTF-16LE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_7           | U7, unicode-1-1-utf-7          | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8           | U8, UTF, utf8                  | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8_sig       |                                | all languages                  |
+-----------------+--------------------------------+--------------------------------+

.. versionchanged:: 3.4
   The utf-16\* and utf-32\* encoders no longer allow surrogate code points
   (``U+D800``--``U+DFFF``) to be encoded.
   The utf-32\* decoders no longer decode
   byte sequences that correspond to surrogate code points.
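
The effect of this change can be sketched as follows. The lone surrogate and
the byte string are sample inputs chosen for this example, and the use of the
``'surrogatepass'`` error handler at the end is shown only as one way to opt
back into the old behaviour:

```python
lone_surrogate = "\ud800"

# Encoders refuse surrogate code points under the default 'strict' handler.
try:
    lone_surrogate.encode("utf-16")
    encoder_rejects = False
except UnicodeEncodeError:
    encoder_rejects = True

# The utf-32 decoders refuse byte sequences that correspond to surrogates.
try:
    b"\x00\xd8\x00\x00".decode("utf-32-le")
    decoder_rejects = False
except UnicodeDecodeError:
    decoder_rejects = True

# Explicitly requesting 'surrogatepass' lets the code point through.
passed = lone_surrogate.encode("utf-16-le", errors="surrogatepass")
```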


Python Specific Encodings
-------------------------

A number of predefined codecs are specific to Python, so their codec names have
no meaning outside Python. These are listed in the tables below based on the
expected input and output types (note that while text encodings are the most
common use case for codecs, the underlying codec infrastructure supports
arbitrary data transforms rather than just text encodings). For asymmetric
codecs, the stated purpose describes the encoding direction.

Text Encodings
^^^^^^^^^^^^^^

The following codecs provide :class:`str` to :class:`bytes` encoding and
:term:`bytes-like object` to :class:`str` decoding, similar to the Unicode text
encodings.

.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|

+--------------------+---------+---------------------------+
| Codec              | Aliases | Purpose                   |
+====================+=========+===========================+
| idna               |         | Implements :rfc:`3490`,   |
|                    |         | see also                  |
|                    |         | :mod:`encodings.idna`.    |
|                    |         | Only ``errors='strict'``  |
|                    |         | is supported.             |
+--------------------+---------+---------------------------+
| mbcs               | ansi,   | Windows only: Encode      |
|                    | dbcs    | operand according to the  |
|                    |         | ANSI codepage (CP_ACP)    |
+--------------------+---------+---------------------------+
| oem                |         | Windows only: Encode      |
|                    |         | operand according to the  |
|                    |         | OEM codepage (CP_OEMCP)   |
|                    |         |                           |
|                    |         | .. versionadded:: 3.6     |
+--------------------+---------+---------------------------+
| palmos             |         | Encoding of PalmOS 3.5    |
+--------------------+---------+---------------------------+
| punycode           |         | Implements :rfc:`3492`.   |
|                    |         | Stateful codecs are not   |
|                    |         | supported.                |
+--------------------+---------+---------------------------+
| raw_unicode_escape |         | Latin-1 encoding with     |
|                    |         | ``\uXXXX`` and            |
|                    |         | ``\UXXXXXXXX`` for other  |
|                    |         | code points. Existing     |
|                    |         | backslashes are not       |
|                    |         | escaped in any way.       |
|                    |         | It is used in the Python  |
|                    |         | pickle protocol.          |
+--------------------+---------+---------------------------+
| undefined          |         | Raise an exception for    |
|                    |         | all conversions, even     |
|                    |         | empty strings. The error  |
|                    |         | handler is ignored.       |
+--------------------+---------+---------------------------+
| unicode_escape     |         | Encoding suitable as the  |
|                    |         | contents of a Unicode     |
|                    |         | literal in ASCII-encoded  |
|                    |         | Python source code,       |
|                    |         | except that quotes are    |
|                    |         | not escaped. Decodes from |
|                    |         | Latin-1 source code.      |
|                    |         | Beware that Python source |
|                    |         | code actually uses UTF-8  |
|                    |         | by default.               |
+--------------------+---------+---------------------------+
| unicode_internal   |         | Return the internal       |
|                    |         | representation of the     |
|                    |         | operand. Stateful codecs  |
|                    |         | are not supported.        |
|                    |         |                           |
|                    |         | .. deprecated:: 3.3       |
|                    |         |    This representation is |
|                    |         |    obsoleted by           |
|                    |         |    :pep:`393`.            |
+--------------------+---------+---------------------------+
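
The difference between ``unicode_escape`` and ``raw_unicode_escape`` is easiest
to see side by side. The following sketch uses sample strings invented for this
example:

```python
sample = "\u00e9\u20ac"  # LATIN SMALL LETTER E WITH ACUTE, EURO SIGN

# unicode_escape produces pure ASCII: every non-ASCII character is escaped.
escaped = sample.encode("unicode_escape")

# raw_unicode_escape keeps Latin-1 characters as single bytes and only
# escapes code points outside Latin-1.
raw = sample.encode("raw_unicode_escape")

# Existing backslashes are escaped by unicode_escape, but left alone by
# raw_unicode_escape.
backslash_escaped = "a\\b".encode("unicode_escape")
backslash_raw = "a\\b".encode("raw_unicode_escape")

# Decoding reverses the escaping.
round_trip = escaped.decode("unicode_escape")
```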

.. _binary-transforms:

Binary Transforms
^^^^^^^^^^^^^^^^^

The following codecs provide binary transforms: :term:`bytes-like object`
to :class:`bytes` mappings. They are not supported by :meth:`bytes.decode`
(which only produces :class:`str` output).
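
Because :meth:`bytes.decode` rejects these codecs, they are reached through
:func:`codecs.encode` and :func:`codecs.decode` instead. The sketch below
assumes the ``hex_codec`` and ``base64_codec`` names registered by CPython; the
payload is an arbitrary sample:

```python
import codecs

payload = b"hello"

# bytes-to-bytes transforms go through the codecs module functions.
hexed = codecs.encode(payload, "hex_codec")
b64 = codecs.encode(payload, "base64_codec")  # result ends with a newline
restored = codecs.decode(hexed, "hex_codec")

# bytes.decode() only accepts text encodings and refuses the transform.
try:
    hexed.decode("hex_codec")
    rejected = False
except LookupError:
    rejected = True
```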

Nick Coghlan

2013-05-23 20:24:02 +1000

[diff] [blame]

1330

Georg Brandl

2010-12-02 18:06:51 +0000

[diff] [blame]

1331

Nick Coghlan

2013-11-23 11:13:36 +1000

[diff] [blame]

1332

.. tabularcolumns:: |l|L|L|L|

Georg Brandl

2013-03-28 13:28:44 +0100

[diff] [blame]

1333

Nick Coghlan

2013-11-23 11:13:36 +1000

[diff] [blame]

1334

+----------------------+------------------+------------------------------+------------------------------+

1335

1336

+======================+==================+==============================+==============================+

Martin Panter

06171bd

2015-09-12 00:34:28 +0000

[diff] [blame]

Nick Coghlan

2013-11-23 11:13:36 +1000

[diff] [blame]

| | | ``'\n'``) | |

| | | | |

+----------------------+------------------+------------------------------+------------------------------+

1348

1349

1350

+----------------------+------------------+------------------------------+------------------------------+

Martin Panter

06171bd

2015-09-12 00:34:28 +0000

[diff] [blame]

1351

1352

Nick Coghlan

2013-11-23 11:13:36 +1000

[diff] [blame]

1353

1354

1355

+----------------------+------------------+------------------------------+------------------------------+

Martin Panter

06171bd

2015-09-12 00:34:28 +0000

[diff] [blame]

Nick Coghlan

2013-11-23 11:13:36 +1000

[diff] [blame]

1359

+----------------------+------------------+------------------------------+------------------------------+

1360

1361

1362

+----------------------+------------------+------------------------------+------------------------------+

1363

1364

1365

+----------------------+------------------+------------------------------+------------------------------+

Georg Brandl

2010-12-02 18:06:51 +0000

[diff] [blame]

1366

.. [#b64] In addition to :term:`bytes-like objects <bytes-like object>`,
   ``'base64_codec'`` also accepts ASCII-only instances of :class:`str` for
   decoding.

.. versionadded:: 3.2
   Restoration of the binary transforms.

.. versionchanged:: 3.4
   Restoration of the aliases for the binary transforms.


.. _text-transforms:

Text Transforms
^^^^^^^^^^^^^^^

The following codec provides a text transform: a :class:`str` to :class:`str`
mapping. It is not supported by :meth:`str.encode` (which only produces
:class:`bytes` output).

.. tabularcolumns:: |l|l|L|

+--------------------+---------+---------------------------+
| Codec              | Aliases | Purpose                   |
+====================+=========+===========================+
| rot_13             | rot13   | Returns the Caesar-cypher |
|                    |         | encryption of the operand |
+--------------------+---------+---------------------------+

.. versionadded:: 3.2
   Restoration of the ``rot_13`` text transform.

.. versionchanged:: 3.4
   Restoration of the ``rot13`` alias.
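
As a short sketch of how the text transform is reached (the message is a
sample string), :func:`codecs.encode` and :func:`codecs.decode` are used in
place of :meth:`str.encode`:

```python
import codecs

message = "Hello, World!"

# str-to-str transforms go through the codecs module functions.
scrambled = codecs.encode(message, "rot_13")
unscrambled = codecs.decode(scrambled, "rot_13")

# str.encode() only accepts text encodings and refuses the transform.
try:
    message.encode("rot_13")
    rejected = False
except LookupError:
    rejected = True
```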

:mod:`encodings.idna` --- Internationalized Domain Names in Applications
------------------------------------------------------------------------

.. module:: encodings.idna
   :synopsis: Internationalized Domain Names implementation
.. moduleauthor:: Martin v. Löwis

This module implements :rfc:`3490` (Internationalized Domain Names in
Applications) and :rfc:`3491` (Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)). It builds upon the ``punycode`` encoding
and :mod:`stringprep`.


These RFCs together define a protocol to support non-ASCII characters in domain
names. A domain name containing non-ASCII characters (such as
``www.Alliancefrançaise.nu``) is converted into an ASCII-compatible encoding
(ACE, such as ``www.xn--alliancefranaise-npb.nu``). The ACE form of the domain
name is then used in all places where arbitrary characters are not allowed by
the protocol, such as DNS queries, HTTP :mailheader:`Host` fields, and so
on. This conversion is carried out in the application, if possible invisibly to
the user: the application should transparently convert Unicode domain labels to
IDNA on the wire, and convert ACE labels back to Unicode before presenting them
to the user.


Python supports this conversion in several ways: the ``idna`` codec performs
conversion between Unicode and ACE, separating an input string into labels
based on the separator characters defined in
:rfc:`section 3.1 of RFC 3490 <3490#section-3.1>`
and converting each label to ACE as required, and conversely separating an input
byte string into labels based on the ``.`` separator and converting any ACE
labels found into Unicode. Furthermore, the :mod:`socket` module
transparently converts Unicode host names to ACE, so that applications need not
be concerned about converting host names themselves when they pass them to the
socket module. On top of that, modules that have host names as function
parameters, such as :mod:`http.client` and :mod:`ftplib`, accept Unicode host
names (:mod:`http.client` then also transparently sends an IDNA hostname in the
:mailheader:`Host` field if it sends that field at all).
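
The encoding direction of the ``idna`` codec can be sketched as follows, using
the example domain name from above. The second name, which uses an ideographic
full stop as the label separator, is a sample made up for this illustration:

```python
name = "www.Alliancefran\u00e7aise.nu"

# Each label is converted to ACE only if it needs it; ASCII labels pass through.
ace = name.encode("idna")
labels = ace.split(b".")

# Other separator characters from RFC 3490 section 3.1, such as
# U+3002 IDEOGRAPHIC FULL STOP, are also recognized on encoding.
with_ideographic_stop = "example\u3002com".encode("idna")
```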


When receiving host names from the wire (such as in reverse name lookup), no
automatic conversion to Unicode is performed: applications wishing to present
such host names to the user should decode them to Unicode.
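
A minimal sketch of that decoding step, assuming the host name arrived as an
ASCII byte string (the value here is the ACE example used earlier, not the
result of a real lookup):

```python
# Host name as it might be received from the wire.
received = b"www.xn--alliancefranaise-npb.nu"

# Decode ACE labels back to Unicode before showing the name to the user.
display_name = received.decode("idna")
```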


The module :mod:`encodings.idna` also implements the nameprep procedure, which
performs certain normalizations on host names, to achieve case-insensitivity of
international domain names, and to unify similar characters. The nameprep
functions can be used directly if desired.


.. function:: nameprep(label)

   Return the nameprepped version of *label*. The implementation currently assumes
   query strings, so ``AllowUnassigned`` is true.


.. function:: ToASCII(label)

   Convert a label to ASCII, as specified in :rfc:`3490`. ``UseSTD3ASCIIRules`` is
   assumed to be false.


.. function:: ToUnicode(label)

   Convert a label to Unicode, as specified in :rfc:`3490`.
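
The three functions operate on a single label rather than on a full domain
name. The following sketch applies them to one label of the example name used
earlier; the variable names are arbitrary:

```python
from encodings import idna

label = "Alliancefran\u00e7aise"

# nameprep normalizes the label; here that amounts to case folding.
prepped = idna.nameprep(label)

# ToASCII applies nameprep and then produces the ACE form as bytes.
ace_label = idna.ToASCII(label)

# ToUnicode reverses the conversion and returns a str.
unicode_label = idna.ToUnicode(ace_label)
```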



:mod:`encodings.mbcs` --- Windows ANSI codepage
-----------------------------------------------

.. module:: encodings.mbcs
   :synopsis: Windows ANSI codepage

Encode operand according to the ANSI codepage (CP_ACP).

Availability: Windows only.

.. versionchanged:: 3.3
   Support any error handler.

.. versionchanged:: 3.2
   Before 3.2, the *errors* argument was ignored; ``'replace'`` was always used
   to encode, and ``'ignore'`` to decode.



:mod:`encodings.utf_8_sig` --- UTF-8 codec with BOM signature
-------------------------------------------------------------

.. module:: encodings.utf_8_sig
   :synopsis: UTF-8 codec with BOM signature
.. moduleauthor:: Walter Dörwald