Blame - Doc/library/codecs.rst - platform/external/python/cpython3

Georg Brandl

116aa62

2007-08-15 14:28:22 +0000

:mod:`codecs` --- Codec registry and base classes
=================================================

.. module:: codecs
   :synopsis: Encode and decode data and streams.
.. moduleauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>

.. index::
   single: Unicode
   single: Codecs
   pair: Codecs; encode
   pair: Codecs; decode
   single: streams
   pair: stackable; streams

This module defines base classes for standard Python codecs (encoders and
decoders) and provides access to the internal Python codec registry, which
manages the codec and error handling lookup process.

It defines the following functions:

.. function:: register(search_function)

   Register a codec search function. Search functions are expected to take one
   argument, the encoding name in all lower case letters, and return a
   :class:`CodecInfo` object having the following attributes:

   * ``name`` The name of the encoding;

   * ``encode`` The stateless encoding function;

   * ``decode`` The stateless decoding function;

   * ``incrementalencoder`` An incremental encoder class or factory function;

   * ``incrementaldecoder`` An incremental decoder class or factory function;

   * ``streamwriter`` A stream writer class or factory function;

   * ``streamreader`` A stream reader class or factory function.

   The various functions or classes take the following arguments:

   *encode* and *decode*: These must be functions or methods which have the same
   interface as the :meth:`encode`/:meth:`decode` methods of Codec instances (see
   :ref:`codec-objects`). The functions/methods are expected to work in a
   stateless mode.

   *incrementalencoder* and *incrementaldecoder*: These have to be factory
   functions providing the following interface:

   ``factory(errors='strict')``

   The factory functions must return objects providing the interfaces defined by
   the base classes :class:`IncrementalEncoder` and :class:`IncrementalDecoder`,
   respectively. Incremental codecs can maintain state.

   *streamreader* and *streamwriter*: These have to be factory functions providing
   the following interface:

   ``factory(stream, errors='strict')``

   The factory functions must return objects providing the interfaces defined by
   the base classes :class:`StreamReader` and :class:`StreamWriter`, respectively.
   Stream codecs can maintain state.

   Possible values for errors are ``'strict'`` (raise an exception in case of an
   encoding error), ``'replace'`` (replace malformed data with a suitable
   replacement marker, such as ``'?'``), ``'ignore'`` (ignore malformed data and
   continue without further notice), ``'xmlcharrefreplace'`` (replace with the
   appropriate XML character reference (for encoding only)) and
   ``'backslashreplace'`` (replace with backslashed escape sequences (for encoding
   only)) as well as any other error handling name defined via
   :func:`register_error`.

   In case a search function cannot find a given encoding, it should return
   ``None``.
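As a rough illustration of the search-function contract described above, the sketch below registers a function that answers to one made-up encoding name, ``myutf8``, and otherwise returns ``None``. The name ``myutf8`` and the function ``search_myutf8`` are hypothetical; the sketch simply reuses the pieces of the built-in UTF-8 codec and assumes the ``encode``/``decode`` keyword names accepted by :class:`CodecInfo` in current CPython.

```python
import codecs


def search_myutf8(name):
    """Hypothetical search function: serve the made-up name 'myutf8'."""
    if name != "myutf8":
        # Encodings this function does not know about are answered with None.
        return None
    utf8 = codecs.lookup("utf-8")
    # Reuse the UTF-8 implementation under the new name.
    return codecs.CodecInfo(
        name="myutf8",
        encode=utf8.encode,
        decode=utf8.decode,
        incrementalencoder=utf8.incrementalencoder,
        incrementaldecoder=utf8.incrementaldecoder,
        streamreader=utf8.streamreader,
        streamwriter=utf8.streamwriter,
    )


codecs.register(search_myutf8)

encoded = "é".encode("myutf8")
decoded = encoded.decode("myutf8")
```

A name without hyphens or spaces is used on purpose, so the example does not depend on how the registry normalizes names before calling the search function.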

.. function:: lookup(encoding)

   Looks up the codec info in the Python codec registry and returns a
   :class:`CodecInfo` object as defined above.

   Encodings are first looked up in the registry's cache. If not found, the list of
   registered search functions is scanned. If no :class:`CodecInfo` object is
   found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
   is stored in the cache and returned to the caller.
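A small sketch of both outcomes of :func:`lookup`, a successful lookup and a failed one. The encoding name ``no-such-encoding`` is a placeholder chosen for illustration and is assumed not to be registered.

```python
import codecs

# A known alias resolves to a CodecInfo object with a canonical name.
info = codecs.lookup("UTF8")
canonical_name = info.name

# An unknown name makes the registry raise LookupError.
try:
    codecs.lookup("no-such-encoding")  # placeholder name, assumed unregistered
except LookupError:
    lookup_failed = True
else:
    lookup_failed = False
```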

To simplify access to the various codecs, the module provides these additional
functions which use :func:`lookup` for the codec lookup:


.. function:: getencoder(encoding)

   Look up the codec for the given encoding and return its encoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getdecoder(encoding)

   Look up the codec for the given encoding and return its decoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.
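The functions returned by :func:`getencoder` and :func:`getdecoder` are the stateless ones, so each call returns a tuple of the output and the length consumed. A minimal sketch using Latin-1 (the variable names are illustrative only):

```python
import codecs

encode_latin1 = codecs.getencoder("latin-1")
decode_latin1 = codecs.getdecoder("latin-1")

# Each call returns (output object, length consumed).
data, consumed = encode_latin1("café")
text, used = decode_latin1(data)
```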

.. function:: getincrementalencoder(encoding)

   Look up the codec for the given encoding and return its incremental encoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental encoder.


.. function:: getincrementaldecoder(encoding)

   Look up the codec for the given encoding and return its incremental decoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental decoder.

.. function:: getreader(encoding)

   Look up the codec for the given encoding and return its StreamReader class or
   factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getwriter(encoding)

   Look up the codec for the given encoding and return its StreamWriter class or
   factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.
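Since :func:`getreader` and :func:`getwriter` return factories rather than instances, the result has to be called with a binary stream. The following sketch, which uses in-memory :class:`io.BytesIO` objects as stand-ins for real files, shows a round trip:

```python
import codecs
import io

# The factory wraps a binary stream; text written to it is encoded on the way.
raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw)
writer.write("grüß")

# A reader built the same way decodes the bytes again.
reader = codecs.getreader("utf-8")(io.BytesIO(raw.getvalue()))
round_trip = reader.read()
```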

.. function:: register_error(name, error_handler)

   Register the error handling function *error_handler* under the name *name*.
   *error_handler* will be called during encoding and decoding in case of an error,
   when *name* is specified as the errors parameter.

   For encoding, *error_handler* will be called with a :exc:`UnicodeEncodeError`
   instance, which contains information about the location of the error. The error
   handler must either raise this or a different exception, or return a tuple with a
   replacement for the unencodable part of the input and a position where encoding
   should continue. The encoder will encode the replacement and continue encoding
   the original input at the specified position. Negative position values will be
   treated as being relative to the end of the input string. If the resulting
   position is out of bounds, an :exc:`IndexError` will be raised.

   Decoding and translating work similarly, except that :exc:`UnicodeDecodeError` or
   :exc:`UnicodeTranslateError` will be passed to the handler and that the
   replacement from the error handler will be put into the output directly.
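The handler protocol for the encoding case can be sketched as follows. The handler name ``underscore`` and the function ``underscore_errors`` are invented for this example; the handler returns one ``'_'`` per unencodable character together with the position at which encoding should resume.

```python
import codecs


def underscore_errors(exc):
    """Hypothetical handler: substitute '_' for each unencodable character."""
    if isinstance(exc, UnicodeEncodeError):
        # (replacement, position where encoding should continue)
        return "_" * (exc.end - exc.start), exc.end
    # This sketch only handles encoding; anything else is re-raised.
    raise exc


codecs.register_error("underscore", underscore_errors)

result = "naïve café".encode("ascii", "underscore")
```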

.. function:: lookup_error(name)

   Return the error handler previously registered under the name *name*.

   Raises a :exc:`LookupError` in case the handler cannot be found.


.. function:: strict_errors(exception)

   Implements the ``strict`` error handling.


.. function:: replace_errors(exception)

   Implements the ``replace`` error handling.


.. function:: ignore_errors(exception)

   Implements the ``ignore`` error handling.


.. function:: xmlcharrefreplace_errors(exception)

   Implements the ``xmlcharrefreplace`` error handling.


.. function:: backslashreplace_errors(exception)

   Implements the ``backslashreplace`` error handling.
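The built-in handlers are ordinary callables, so they can be invoked directly with an exception object to see what they return. The sketch below builds a :exc:`UnicodeEncodeError` by hand for the single character ``é`` in ``"héllo"``; the reason string and the handler name ``no-such-handler`` are arbitrary placeholders.

```python
import codecs

# Describe a failure at input[1:2] of "héllo" for the ASCII codec.
exc = UnicodeEncodeError("ascii", "héllo", 1, 2, "ordinal not in range(128)")

# Each handler returns (replacement, position to resume at).
replace_result = codecs.lookup_error("replace")(exc)
ignore_result = codecs.ignore_errors(exc)
xml_result = codecs.xmlcharrefreplace_errors(exc)
backslash_result = codecs.backslashreplace_errors(exc)

# The strict handler re-raises the exception it is given.
try:
    codecs.strict_errors(exc)
except UnicodeEncodeError:
    strict_raised = True
else:
    strict_raised = False

# Unknown handler names are reported with LookupError.
try:
    codecs.lookup_error("no-such-handler")  # placeholder, assumed unregistered
except LookupError:
    unknown_handler = True
else:
    unknown_handler = False
```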

To simplify working with encoded files or streams, the module also defines these
utility functions:


.. function:: open(filename[, mode[, encoding[, errors[, buffering]]]])

   Open an encoded file using the given *mode* and return a wrapped version
   providing transparent encoding/decoding. The default file mode is ``'r'``,
   meaning to open the file in read mode.

   .. note::

      The wrapped version will only accept the object format defined by the codecs,
      i.e. Unicode objects for most built-in codecs. Output is also codec-dependent
      and will usually be Unicode as well.

   .. note::

      Files are always opened in binary mode, even if no binary mode was
      specified. This is done to avoid data loss due to encodings using 8-bit
      values. This means that no automatic conversion of ``'\n'`` is done
      on reading and writing.

   *encoding* specifies the encoding which is to be used for the file.

   *errors* may be given to define the error handling. It defaults to ``'strict'``,
   which causes a :exc:`ValueError` to be raised in case an encoding error occurs.

   *buffering* has the same meaning as for the built-in :func:`open` function. It
   defaults to line buffered.
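A minimal sketch of writing and re-reading a file through :func:`codecs.open`, using a temporary directory so that nothing is left behind. The file name ``sample.txt`` is arbitrary. The warning filter is there only because some newer Python versions may emit a :exc:`DeprecationWarning` for this function; treat that as an assumption about the interpreter in use.

```python
import codecs
import os
import tempfile
import warnings

with tempfile.TemporaryDirectory() as tmp, warnings.catch_warnings():
    # Assumption: newer interpreters may warn about codecs.open().
    warnings.simplefilter("ignore", DeprecationWarning)

    path = os.path.join(tmp, "sample.txt")

    # Text goes in, UTF-8 bytes end up on disk.
    with codecs.open(path, "w", encoding="utf-8") as f:
        f.write("héllo\n")

    # Reading through the wrapper decodes the bytes again.
    with codecs.open(path, "r", encoding="utf-8") as f:
        text = f.read()

    # Reading the raw bytes shows what was actually stored.
    with open(path, "rb") as f:
        raw = f.read()
```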

.. function:: EncodedFile(file, input[, output[, errors]])

   Return a wrapped version of *file* which provides transparent encoding
   translation.

   Strings written to the wrapped file are interpreted according to the given
   *input* encoding and then written to the original file as strings using the
   *output* encoding. The intermediate encoding will usually be Unicode but depends
   on the specified codecs.

   If *output* is not given, it defaults to *input*.

   *errors* may be given to define the error handling. It defaults to ``'strict'``,
   which causes :exc:`ValueError` to be raised in case an encoding error occurs.
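The translation performed by :func:`EncodedFile` can be sketched with in-memory streams. Here the caller works in Latin-1 while the underlying file holds UTF-8; both encodings are passed positionally, and the choice of encodings is only an example.

```python
import codecs
import io

# Writing: the caller supplies Latin-1 bytes, the file receives UTF-8 bytes.
raw = io.BytesIO()
wrapped = codecs.EncodedFile(raw, "latin-1", "utf-8")
wrapped.write("é".encode("latin-1"))
stored = raw.getvalue()

# Reading: UTF-8 bytes in the file come back to the caller as Latin-1 bytes.
source = io.BytesIO("é".encode("utf-8"))
read_back = codecs.EncodedFile(source, "latin-1", "utf-8").read()
```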

.. function:: iterencode(iterable, encoding[, errors])

   Uses an incremental encoder to iteratively encode the input provided by
   *iterable*. This function is a :term:`generator`. *errors* (as well as any
   other keyword argument) is passed through to the incremental encoder.


.. function:: iterdecode(iterable, encoding[, errors])

   Uses an incremental decoder to iteratively decode the input provided by
   *iterable*. This function is a :term:`generator`. *errors* (as well as any
   other keyword argument) is passed through to the incremental decoder.
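Because both generators use an incremental codec internally, a multi-byte sequence may be split across items of the iterable. A short sketch with arbitrary sample data:

```python
import codecs

# The two bytes of "é" arrive in separate chunks; the decoder joins them.
chunks = [b"caf", b"\xc3", b"\xa9"]
text = "".join(codecs.iterdecode(chunks, "utf-8"))

# The encoding direction works the same way on an iterable of strings.
data = b"".join(codecs.iterencode(["caf", "é"], "utf-8"))
```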

The module also provides the following constants which are useful for reading
and writing to platform dependent files:


.. data:: BOM
          BOM_BE
          BOM_LE
          BOM_UTF8
          BOM_UTF16
          BOM_UTF16_BE
          BOM_UTF16_LE
          BOM_UTF32
          BOM_UTF32_BE
          BOM_UTF32_LE

   These constants define various encodings of the Unicode byte order mark (BOM)
   used in UTF-16 and UTF-32 data streams to indicate the byte order used in the
   stream or file and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either
   :const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's
   native byte order, :const:`BOM` is an alias for :const:`BOM_UTF16`,
   :const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for
   :const:`BOM_UTF16_BE`. The others represent the BOM in UTF-8 and UTF-32
   encodings.
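The relationships between the constants can be checked directly, and the constants are handy for small helpers. In the sketch below, ``strip_utf8_bom`` is a hypothetical helper written for this example, not a function of the module.

```python
import codecs
import sys

# BOM_UTF16 follows the platform's native byte order.
if sys.byteorder == "little":
    native_bom = codecs.BOM_UTF16_LE
else:
    native_bom = codecs.BOM_UTF16_BE


def strip_utf8_bom(data):
    """Hypothetical helper: drop a leading UTF-8 signature if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


payload = strip_utf8_bom(codecs.BOM_UTF8 + b"abc")
untouched = strip_utf8_bom(b"abc")
```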

.. _codec-base-classes:

Codec Base Classes
------------------

The :mod:`codecs` module defines a set of base classes which define the
interface and can also be used to easily write your own codecs for use in Python.

Each codec has to define four interfaces to make it usable as a codec in Python:
stateless encoder, stateless decoder, stream reader and stream writer. The
stream readers and writers typically reuse the stateless encoder/decoder to
implement the file protocols.

The :class:`Codec` class defines the interface for stateless encoders/decoders.

To simplify and standardize error handling, the :meth:`encode` and
:meth:`decode` methods may implement different error handling schemes by
providing the *errors* string argument. The following string values are defined
and implemented by all standard Python codecs:

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'strict'``            | Raise :exc:`UnicodeError` (or a subclass);    |
|                         | this is the default.                          |
+-------------------------+-----------------------------------------------+
| ``'ignore'``            | Ignore the character and continue with the    |
|                         | next.                                         |
+-------------------------+-----------------------------------------------+
| ``'replace'``           | Replace with a suitable replacement           |
|                         | character; Python will use the official       |
|                         | U+FFFD REPLACEMENT CHARACTER for the built-in |
|                         | Unicode codecs on decoding and '?' on         |
|                         | encoding.                                     |
+-------------------------+-----------------------------------------------+
| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character    |
|                         | reference (only for encoding).                |
+-------------------------+-----------------------------------------------+
| ``'backslashreplace'``  | Replace with backslashed escape sequences     |
|                         | (only for encoding).                          |
+-------------------------+-----------------------------------------------+

The set of allowed values can be extended via :func:`register_error`.

.. _codec-objects:

Codec Objects
^^^^^^^^^^^^^

The :class:`Codec` class defines these methods which also define the function
interfaces of the stateless encoder and decoder:


.. method:: Codec.encode(input[, errors])

   Encodes the object *input* and returns a tuple (output object, length consumed).
   While codecs are not restricted to use with Unicode, in a Unicode context,
   encoding converts a Unicode object to a plain string using a particular
   character set encoding (e.g., ``cp1252`` or ``iso-8859-1``).

   *errors* defines the error handling to apply. It defaults to ``'strict'``
   handling.

   The method must not store state in the :class:`Codec` instance. Use
   :class:`StreamWriter` for codecs which have to keep state in order to make
   encoding efficient.

   The encoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.


.. method:: Codec.decode(input[, errors])

   Decodes the object *input* and returns a tuple (output object, length consumed).
   In a Unicode context, decoding converts a plain string encoded using a
   particular character set encoding to a Unicode object.

   *input* must be an object which provides the ``bf_getreadbuf`` buffer slot.
   Python strings, buffer objects and memory mapped files are examples of objects
   providing this slot.

   *errors* defines the error handling to apply. It defaults to ``'strict'``
   handling.

   The method must not store state in the :class:`Codec` instance. Use
   :class:`StreamReader` for codecs which have to keep state in order to make
   decoding efficient.

   The decoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.
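To make the stateless interface concrete, here is a toy subclass of :class:`Codec`. The class ``ReverseAsciiCodec`` is purely hypothetical: it encodes to ASCII with the character order reversed, keeps no state on the instance, and returns the ``(output, length consumed)`` tuple described above, including for empty input.

```python
import codecs


class ReverseAsciiCodec(codecs.Codec):
    """Hypothetical toy codec: ASCII with the character order reversed."""

    def encode(self, input, errors="strict"):
        # Return (output object, number of input characters consumed).
        return input[::-1].encode("ascii", errors), len(input)

    def decode(self, input, errors="strict"):
        data = bytes(input)
        # Return (output object, number of input bytes consumed).
        return data.decode("ascii", errors)[::-1], len(data)


codec = ReverseAsciiCodec()
encoded, consumed = codec.encode("abc")
decoded, used = codec.decode(encoded)
empty = codec.encode("")
```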

The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide
the basic interface for incremental encoding and decoding. Encoding/decoding the
input isn't done with one call to the stateless encoder/decoder function, but
with multiple calls to the :meth:`encode`/:meth:`decode` method of the
incremental encoder/decoder. The incremental encoder/decoder keeps track of the
encoding/decoding process during method calls.

The joined output of calls to the :meth:`encode`/:meth:`decode` method is the
same as if all the single inputs were joined into one, and this input was
encoded/decoded with the stateless encoder/decoder.


.. _incremental-encoder-objects:

IncrementalEncoder Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`IncrementalEncoder` class is used for encoding an input in multiple
steps. It defines the following methods which every incremental encoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalEncoder([errors])

   Constructor for an :class:`IncrementalEncoder` instance.

   All incremental encoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalEncoder` may implement different error handling schemes
   by providing the *errors* keyword argument. These parameters are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference.

   * ``'backslashreplace'`` Replace with backslashed escape sequences.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalEncoder`
   object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.

   .. method:: encode(object[, final])

      Encodes *object* (taking the current state of the encoder into account)
      and returns the resulting encoded object. If this is the last call to
      :meth:`encode`, *final* must be true (the default is false).


   .. method:: reset()

      Reset the encoder to the initial state.


.. method:: IncrementalEncoder.getstate()

   Return the current state of the encoder, which must be an integer. The
   implementation should make sure that ``0`` is the most common state. (States
   that are more complicated than integers can be converted into an integer by
   marshaling/pickling the state and encoding the bytes of the resulting string
   into an integer.)


.. method:: IncrementalEncoder.setstate(state)

   Set the state of the encoder to *state*. *state* must be an encoder state
   returned by :meth:`getstate`.
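The built-in UTF-16 incremental encoder is a convenient way to observe encoder state, since it emits a byte order mark only at the start of the output. The following sketch relies on that behaviour of the standard UTF-16 codec:

```python
import codecs

encoder = codecs.getincrementalencoder("utf-16")()

# The first call emits the BOM; later calls continue without one.
first = encoder.encode("a")
second = encoder.encode("b", final=True)
joined = first + second

# After reset() the encoder is back in its initial state.
encoder.reset()
after_reset = encoder.encode("a", final=True)
```

The joined chunks are the same bytes a single stateless call would produce for ``"ab"``, which is the property stated for incremental codecs in general.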

.. _incremental-decoder-objects:

IncrementalDecoder Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`IncrementalDecoder` class is used for decoding an input in multiple
steps. It defines the following methods which every incremental decoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalDecoder([errors])

   Constructor for an :class:`IncrementalDecoder` instance.

   All incremental decoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalDecoder` may implement different error handling schemes
   by providing the *errors* keyword argument. These parameters are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalDecoder`
   object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.

   .. method:: decode(object[, final])

      Decodes *object* (taking the current state of the decoder into account)
      and returns the resulting decoded object. If this is the last call to
      :meth:`decode`, *final* must be true (the default is false). If *final* is
      true, the decoder must decode the input completely and must flush all
      buffers. If this isn't possible (e.g. because of incomplete byte sequences
      at the end of the input), it must initiate error handling just like in the
      stateless case (which might raise an exception).


   .. method:: reset()

      Reset the decoder to the initial state.


   .. method:: getstate()

      Return the current state of the decoder. This must be a tuple with two
      items: the first must be the buffer containing the still undecoded
      input. The second must be an integer and can be additional state
      info. (The implementation should make sure that ``0`` is the most common
      additional state info.) If this additional state info is ``0``, it must be
      possible to set the decoder to the state which has no input buffered and
      ``0`` as the additional state info, so that feeding the previously
      buffered input to the decoder returns it to the previous state without
      producing any output. (Additional state info that is more complicated than
      integers can be converted into an integer by marshaling/pickling the info
      and encoding the bytes of the resulting string into an integer.)


   .. method:: setstate(state)

      Set the state of the decoder to *state*. *state* must be a decoder state
      returned by :meth:`getstate`.
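The state tuple and the role of *final* can be sketched with the built-in UTF-8 incremental decoder, feeding it the three bytes of the euro sign in two pieces. The variable names are illustrative only.

```python
import codecs

make_decoder = codecs.getincrementaldecoder("utf-8")

# Two of the three bytes of U+20AC: nothing can be decoded yet.
decoder = make_decoder()
partial = decoder.decode(b"\xe2\x82")
state = decoder.getstate()  # (still undecoded input, additional state info)

# A second decoder can pick up exactly where the first one stopped.
resumed = make_decoder()
resumed.setstate(state)
completed = resumed.decode(b"\xac", final=True)

# With final=True an incomplete sequence triggers error handling.
truncated = make_decoder()
try:
    truncated.decode(b"\xe2\x82", final=True)
except UnicodeDecodeError:
    final_raised = True
else:
    final_raised = False
```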

The :class:`StreamWriter` and :class:`StreamReader` classes provide generic
working interfaces which can be used to implement new encoding submodules very
easily. See :mod:`encodings.utf_8` for an example of how this is done.


.. _stream-writer-objects:

StreamWriter Objects
^^^^^^^^^^^^^^^^^^^^

The :class:`StreamWriter` class is a subclass of :class:`Codec` and defines the
following methods which every stream writer must define in order to be
compatible with the Python codec registry.


.. class:: StreamWriter(stream[, errors])

   Constructor for a :class:`StreamWriter` instance.

   All stream writers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   *stream* must be a file-like object open for writing binary data.

   The :class:`StreamWriter` may implement different error handling schemes by
   providing the *errors* keyword argument. These parameters are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference.

   * ``'backslashreplace'`` Replace with backslashed escape sequences.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamWriter` object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.

   .. method:: write(object)

      Writes the object's contents encoded to the stream.


   .. method:: writelines(list)

      Writes the concatenated list of strings to the stream (possibly by reusing
      the :meth:`write` method).


   .. method:: reset()

      Flushes and resets the codec buffers used for keeping state.

      Calling this method should ensure that the data on the output is put into
      a clean state that allows appending of new fresh data without having to
      rescan the whole stream to recover state.


In addition to the above methods, the :class:`StreamWriter` must also inherit
all other methods and attributes from the underlying stream.
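Two of the points above, switching the *errors* attribute on a live writer and reaching methods of the underlying stream through the writer, can be sketched with the ASCII stream writer wrapped around an in-memory buffer:

```python
import codecs
import io

writer = codecs.getwriter("ascii")(io.BytesIO(), errors="strict")
writer.write("abc ")

# Under 'strict', an unencodable character raises and nothing is written.
try:
    writer.write("é")
except UnicodeEncodeError:
    strict_raised = True
else:
    strict_raised = False

# Assigning to the attribute changes the strategy for later writes.
writer.errors = "replace"
writer.writelines(["é", "!"])

# getvalue() is not a writer method; it is taken from the wrapped BytesIO.
written = writer.getvalue()
```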

.. _stream-reader-objects:

StreamReader Objects
^^^^^^^^^^^^^^^^^^^^

The :class:`StreamReader` class is a subclass of :class:`Codec` and defines the
following methods which every stream reader must define in order to be
compatible with the Python codec registry.


.. class:: StreamReader(stream[, errors])

   Constructor for a :class:`StreamReader` instance.

   All stream readers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   *stream* must be a file-like object open for reading (binary) data.

   The :class:`StreamReader` may implement different error handling schemes by
   providing the *errors* keyword argument. These parameters are defined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamReader` object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.

633

634

Benjamin Peterson

e41251e

2008-04-25 01:59:09 +0000

[diff] [blame]

635

.. method:: read([size[, chars, [firstline]]])

Georg Brandl

116aa62

2007-08-15 14:28:22 +0000

[diff] [blame]

636

Benjamin Peterson

e41251e

2008-04-25 01:59:09 +0000

[diff] [blame]

637

Decodes data from the stream and returns the resulting object.

Georg Brandl

116aa62

2007-08-15 14:28:22 +0000

[diff] [blame]

638

Benjamin Peterson

e41251e

2008-04-25 01:59:09 +0000

[diff] [blame]

639

*chars* indicates the number of characters to read from the

640

stream. :func:`read` will never return more than *chars* characters, but

641

it might return less, if there are not enough characters available.

Georg Brandl

116aa62

2007-08-15 14:28:22 +0000

[diff] [blame]

642

Benjamin Peterson

e41251e

2008-04-25 01:59:09 +0000

[diff] [blame]

643

*size* indicates the approximate maximum number of bytes to read from the

644

stream for decoding purposes. The decoder can modify this setting as

645

appropriate. The default value -1 indicates to read and decode as much as

646

possible. *size* is intended to prevent having to decode huge files in

647

one step.

Georg Brandl

116aa62

2007-08-15 14:28:22 +0000

[diff] [blame]

648

Benjamin Peterson

e41251e

2008-04-25 01:59:09 +0000

[diff] [blame]

649

*firstline* indicates that it would be sufficient to only return the first

650

line, if there are decoding errors on later lines.

Georg Brandl

116aa62

2007-08-15 14:28:22 +0000

[diff] [blame]

651

Benjamin Peterson

e41251e

2008-04-25 01:59:09 +0000

[diff] [blame]

652

The method should use a greedy read strategy meaning that it should read

653

as much data as is allowed within the definition of the encoding and the

654

given size, e.g. if optional encoding endings or state markers are

655

available on the stream, these should be read too.

Georg Brandl

116aa62

2007-08-15 14:28:22 +0000

[diff] [blame]

656

Georg Brandl

116aa62

2007-08-15 14:28:22 +0000

[diff] [blame]

657

Benjamin Peterson

e41251e

2008-04-25 01:59:09 +0000

[diff] [blame]

658

.. method:: readline([size[, keepends]])

Read one line from the input stream and return the decoded data.

*size*, if given, is passed as size argument to the stream's
:meth:`readline` method.

If *keepends* is false, line endings will be stripped from the lines
returned.

.. method:: readlines([sizehint[, keepends]])

Read all lines available on the input stream and return them as a list of
lines.

Line endings are implemented using the codec's decoder method and are
included in the list entries if *keepends* is true.

*sizehint*, if given, is passed as the *size* argument to the stream's
:meth:`read` method.
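As a rough illustration of how *keepends* affects :meth:`readline` and :meth:`readlines`, the following sketch (assuming Python 3 and an in-memory stream; the sample text is made up) reads one line without its line ending and the remaining lines with theirs.

```python
import codecs
import io

# Three lines with mixed line endings, stored as UTF-8 bytes.
raw = io.BytesIO("one\ntwo\r\nthree".encode("utf-8"))
reader = codecs.getreader("utf-8")(raw)

# keepends=False strips the line ending from the returned line.
first = reader.readline(keepends=False)

# keepends=True leaves the line endings in the list entries.
remaining = reader.readlines(keepends=True)

print(repr(first), remaining)
```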

.. method:: reset()

Resets the codec buffers used for keeping state.

Note that no stream repositioning should take place. This method is
primarily intended to be able to recover from decoding errors.

In addition to the above methods, the :class:`StreamReader` must also inherit
all other methods and attributes from the underlying stream.

The next two base classes are included for convenience. They are not needed by
the codec registry, but may prove useful in practice.


.. _stream-reader-writer:

StreamReaderWriter Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`StreamReaderWriter` allows wrapping streams which work in both read
and write modes.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.


.. class:: StreamReaderWriter(stream, Reader, Writer, errors)

Creates a :class:`StreamReaderWriter` instance. *stream* must be a file-like
object. *Reader* and *Writer* must be factory functions or classes providing the
:class:`StreamReader` and :class:`StreamWriter` interface respectively. Error
handling is done in the same way as defined for the stream readers and writers.

:class:`StreamReaderWriter` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.
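A minimal sketch of building such an instance from the factories returned by :func:`lookup` might look as follows. It assumes Python 3, where the lookup result exposes ``streamreader`` and ``streamwriter`` attributes, and uses an ``io.BytesIO`` object as the underlying stream.

```python
import codecs
import io

info = codecs.lookup("utf-8")
backing = io.BytesIO()

# Reader and Writer are the stream factories from the lookup result.
rw = codecs.StreamReaderWriter(
    backing, info.streamreader, info.streamwriter, "strict"
)

rw.write("grüß dich\n")   # encoded to UTF-8 and written to the stream
rw.seek(0)                # rewind the underlying stream before reading
text = rw.read()          # decoded back to a string

print(repr(text), backing.getvalue())
```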



.. _stream-recoder-objects:

StreamRecoder Objects
^^^^^^^^^^^^^^^^^^^^^

The :class:`StreamRecoder` provides a frontend-backend view of encoding data,
which is sometimes useful when dealing with different encoding environments.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.


.. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors)

Creates a :class:`StreamRecoder` instance which implements a two-way conversion:
*encode* and *decode* work on the frontend (the data seen by code calling
:meth:`read` and :meth:`write`) while *Reader* and *Writer* work on the backend
(reading and writing to the stream).

You can use these objects to do transparent direct recodings from e.g. Latin-1
to UTF-8 and back.

*stream* must be a file-like object.

*encode*, *decode* must adhere to the :class:`Codec` interface. *Reader*,
*Writer* must be factory functions or classes providing objects of the
:class:`StreamReader` and :class:`StreamWriter` interface respectively.

*encode* and *decode* are needed for the frontend translation, *Reader* and
*Writer* for the backend translation. The intermediate format used is
determined by the two sets of codecs, e.g. the Unicode codecs will use Unicode
as the intermediate encoding.

Error handling is done in the same way as defined for the stream readers and
writers.

:class:`StreamRecoder` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.
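The Latin-1 to UTF-8 recoding mentioned above could be sketched like this. The example assumes Python 3, takes all four codec callables from :func:`lookup` results, and uses an in-memory stream as the backend; the names ``backend`` and ``recoder`` are illustrative.

```python
import codecs
import io

latin1 = codecs.lookup("latin-1")   # frontend: what the caller sees
utf8 = codecs.lookup("utf-8")       # backend: what the stream stores

backend = io.BytesIO()
recoder = codecs.StreamRecoder(
    backend,
    latin1.encode, latin1.decode,          # frontend translation
    utf8.streamreader, utf8.streamwriter,  # backend translation
    "strict",
)

# The caller writes Latin-1 bytes; the stream receives UTF-8 bytes.
recoder.write("café".encode("latin-1"))
stored = backend.getvalue()

# Reading goes the other way: UTF-8 on the stream, Latin-1 to the caller.
backend.seek(0)
roundtrip = recoder.read()

print(stored, roundtrip)
```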



.. _encodings-overview:

Encodings and Unicode
---------------------

Unicode strings are stored internally as sequences of codepoints (to be precise
as :ctype:`Py_UNICODE` arrays). Depending on the way Python is compiled (either
via :option:`--without-wide-unicode` or :option:`--with-wide-unicode`, with the
former being the default) :ctype:`Py_UNICODE` is either a 16-bit or 32-bit data
type. Once a Unicode object is used outside of CPU and memory, CPU endianness
and how these arrays are stored as bytes become an issue. Transforming a
unicode object into a sequence of bytes is called encoding and recreating the
unicode object from the sequence of bytes is known as decoding. There are many
different methods for how this transformation can be done (these methods are
also called encodings). The simplest method is to map the codepoints 0-255 to
the bytes ``0x0``-``0xff``. This means that a unicode object that contains
codepoints above ``U+00FF`` can't be encoded with this method (which is called
``'latin-1'`` or ``'iso-8859-1'``). :func:`unicode.encode` will raise a
:exc:`UnicodeEncodeError` that looks like this: ``UnicodeEncodeError: 'latin-1'
codec can't encode character u'\u1234' in position 3: ordinal not in
range(256)``.
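The failure described above can be reproduced with a short sketch. It assumes Python 3, where the string method is ``str.encode`` and the character is shown without the ``u`` prefix.

```python
text = "abc\u1234"

# Codepoints 0-255 map directly to the bytes 0x00-0xff.
low = "\u00ff".encode("latin-1")

error = None
try:
    text.encode("latin-1")
except UnicodeEncodeError as exc:
    # Keep a reference so the exception can be inspected afterwards.
    error = exc

print(low, error.encoding, error.start, error.reason)
```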


There's another group of encodings (the so-called charmap encodings) that choose
a different subset of all unicode code points and how these codepoints are
mapped to the bytes ``0x0``-``0xff``. To see how this is done simply open
e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
Windows). There's a string constant with 256 characters that shows you which
character is mapped to which byte value.
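A quick way to look at such a mapping without opening the file is sketched below. It assumes Python 3 and that the string constant in ``encodings/cp1252.py`` is the module attribute ``decoding_table``.

```python
import encodings.cp1252

# The 256-character string constant: index = byte value, item = character.
table = encodings.cp1252.decoding_table

# cp1252 places the EURO SIGN at byte 0x80, which latin-1 cannot encode.
euro_byte = "\u20ac".encode("cp1252")

try:
    "\u20ac".encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(len(table), repr(table[0x80]), euro_byte, latin1_ok)
```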


All of these encodings can only encode 256 of the 65536 (or 1114112) codepoints
defined in unicode. A simple and straightforward way that can store each Unicode
code point is to store each codepoint as two consecutive bytes. There are two
possibilities: Store the bytes in big endian or in little endian order. These
two encodings are called UTF-16-BE and UTF-16-LE respectively. Their
disadvantage is that if e.g. you use UTF-16-BE on a little endian machine you
will always have to swap bytes on encoding and decoding. UTF-16 avoids this
problem: Bytes will always be in natural endianness. When these bytes are read
by a CPU with a different endianness, then bytes have to be swapped though. To
be able to detect the endianness of a UTF-16 byte sequence, there's the
so-called BOM (the "Byte Order Mark"). This is the Unicode character ``U+FEFF``.
This character will be prepended to every UTF-16 byte sequence. The byte swapped
version of this character (``0xFFFE``) is an illegal character that may not
appear in a Unicode text. So when the first character in a UTF-16 byte sequence
appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
Unfortunately up to Unicode 4.0 the character ``U+FEFF`` had a second purpose as
a ``ZERO WIDTH NO-BREAK SPACE``: A character that has no width and doesn't allow
a word to be split. It can e.g. be used to give hints to a ligature algorithm.
With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
Unicode software still must be able to handle ``U+FEFF`` in both roles: As a BOM
it's a device to determine the storage layout of the encoded bytes, and vanishes
once the byte sequence has been decoded into a Unicode string; as a ``ZERO WIDTH
NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
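The effect of byte order and of the BOM can be observed directly. The following sketch assumes Python 3 and uses the ``BOM_UTF16_BE`` and ``BOM_UTF16_LE`` constants from the :mod:`codecs` module.

```python
import codecs

text = "hi"

# Fixed byte order, no BOM is written.
be = text.encode("utf-16-be")
le = text.encode("utf-16-le")

# Plain UTF-16: native byte order, preceded by a BOM.
with_bom = text.encode("utf-16")
bom = with_bom[:2]

# On decoding, the BOM tells the codec which byte order to use and is
# not part of the resulting string.
from_be = (codecs.BOM_UTF16_BE + be).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + le).decode("utf-16")

print(be, le, bom, from_be, from_le)
```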


There's another encoding that is able to encode the full range of Unicode
characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
parts: Marker bits (the most significant bits) and payload bits. The marker bits
are a sequence of zero to six 1 bits followed by a 0 bit. Unicode characters are
encoded like this (with x being payload bits, which when concatenated give the
Unicode character):

+-----------------------------------+----------------------------------------------+
| Range                             | Encoding                                     |
+===================================+==============================================+
| ``U-00000000`` ... ``U-0000007F`` | 0xxxxxxx                                     |
+-----------------------------------+----------------------------------------------+
| ``U-00000080`` ... ``U-000007FF`` | 110xxxxx 10xxxxxx                            |
+-----------------------------------+----------------------------------------------+
| ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
+-----------------------------------+----------------------------------------------+
| ``U-00010000`` ... ``U-001FFFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
+-----------------------------------+----------------------------------------------+
| ``U-00200000`` ... ``U-03FFFFFF`` | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
+-----------------------------------+----------------------------------------------+
| ``U-04000000`` ... ``U-7FFFFFFF`` | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
|                                   | 10xxxxxx                                     |
+-----------------------------------+----------------------------------------------+

The least significant bit of the Unicode character is the rightmost x bit.
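The first four rows of the table can be turned into code almost literally. The function below is only an illustrative sketch (``utf8_encode_codepoint`` is a made-up name, not part of the :mod:`codecs` module); it assumes Python 3, covers codepoints up to ``U+10FFFF``, and does not reject surrogates the way the built-in codec does.

```python
def utf8_encode_codepoint(cp):
    """Encode one codepoint following the marker/payload bit layout."""
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    if cp < 0x110000:
        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    raise ValueError("codepoint out of range: %#x" % cp)


# Compare against the built-in codec for one sample per row.
for sample in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(hex(sample), utf8_encode_codepoint(sample),
          chr(sample).encode("utf-8"))
```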


As UTF-8 is an 8-bit encoding no BOM is required and any ``U+FEFF`` character in
the decoded Unicode string (even if it's the first character) is treated as a
``ZERO WIDTH NO-BREAK SPACE``.


Without external information it's impossible to reliably determine which
encoding was used for encoding a Unicode string. Each charmap encoding can
decode any random byte sequence. However that's not possible with UTF-8, as
UTF-8 byte sequences have a structure that doesn't allow arbitrary byte
sequences. To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls
``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters
is written to the file, a UTF-8 encoded BOM (which looks like this as a byte
sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable
that any charmap encoded file starts with these byte values (which would e.g.
map to

| LATIN SMALL LETTER I WITH DIAERESIS
| RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
| INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a utf-8-sig encoding can be
correctly guessed from the byte sequence. So here the BOM is not used to be able
to determine the byte order used for generating the byte sequence, but as a
signature that helps in guessing the encoding. On encoding the utf-8-sig codec
will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
decoding utf-8-sig will skip those three bytes if they appear as the first three
bytes in the file.
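The difference between ``utf-8`` and ``utf-8-sig`` shows up in a few lines. This is a small sketch assuming Python 3; the sample string is arbitrary.

```python
data = "x".encode("utf-8-sig")

# The signature is the UTF-8 encoding of U+FEFF.
signature = data[:3]

# utf-8-sig skips the signature, plain utf-8 keeps it as U+FEFF.
without_sig = data.decode("utf-8-sig")
with_sig = data.decode("utf-8")

# The same three bytes read as iso-8859-1 give the three characters
# named above.
as_latin1 = signature.decode("iso-8859-1")

print(signature, repr(without_sig), repr(with_sig), as_latin1)
```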

.. _standard-encodings:

Standard Encodings
------------------

Python comes with a number of codecs built-in, either implemented as C functions
or with dictionaries as mapping tables. The following table lists the codecs by
name, together with a few common aliases, and the languages for which the
encoding is likely used. Neither the list of aliases nor the list of languages
is meant to be exhaustive. Notice that spelling alternatives that only differ in
case or use a hyphen instead of an underscore are also valid aliases.


Many of the character sets support the same languages. They vary in individual
characters (e.g. whether the EURO SIGN is supported or not), and in the
assignment of characters to code positions. For the European languages in
particular, the following variants typically exist:

* an ISO 8859 codeset

* a Microsoft Windows code page, which is typically derived from an 8859 codeset,
  but replaces control characters with additional graphic characters

* an IBM EBCDIC code page

* an IBM PC code page, which is ASCII compatible

+-----------------+--------------------------------+--------------------------------+
| Codec           | Aliases                        | Languages                      |
+=================+================================+================================+
| ascii           | 646, us-ascii                  | English                        |
+-----------------+--------------------------------+--------------------------------+
| big5            | big5-tw, csbig5                | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| big5hkscs       | big5-hkscs, hkscs              | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp037           | IBM037, IBM039                 | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp424           | EBCDIC-CP-HE, IBM424           | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp437           | 437, IBM437                    | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp500           | EBCDIC-CP-BE, EBCDIC-CP-CH,    | Western Europe                 |
|                 | IBM500                         |                                |
+-----------------+--------------------------------+--------------------------------+
| cp737           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp775           | IBM775                         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp850           | 850, IBM850                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp852           | 852, IBM852                    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp855           | 855, IBM855                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp856           |                                | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp857           | 857, IBM857                    | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp860           | 860, IBM860                    | Portuguese                     |
+-----------------+--------------------------------+--------------------------------+
| cp861           | 861, CP-IS, IBM861             | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| cp862           | 862, IBM862                    | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp863           | 863, IBM863                    | Canadian                       |
+-----------------+--------------------------------+--------------------------------+
| cp864           | IBM864                         | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp865           | 865, IBM865                    | Danish, Norwegian              |
+-----------------+--------------------------------+--------------------------------+
| cp866           | 866, IBM866                    | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| cp869           | 869, CP-GR, IBM869             | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp874           |                                | Thai                           |
+-----------------+--------------------------------+--------------------------------+
| cp875           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp932           | 932, ms932, mskanji, ms-kanji  | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| cp949           | 949, ms949, uhc                | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| cp950           | 950, ms950                     | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp1006          |                                | Urdu                           |
+-----------------+--------------------------------+--------------------------------+
| cp1026          | ibm1026                        | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1140          | ibm1140                        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1250          | windows-1250                   | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp1251          | windows-1251                   | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp1252          | windows-1252                   | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1253          | windows-1253                   | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp1254          | windows-1254                   | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1255          | windows-1255                   | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp1256          | windows-1256                   | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp1257          | windows-1257                   | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp1258          | windows-1258                   | Vietnamese                     |
+-----------------+--------------------------------+--------------------------------+
| euc_jp          | eucjp, ujis, u-jis             | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jis_2004    | jisx0213, eucjis2004           | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jisx0213    | eucjisx0213                    | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_kr          | euckr, korean, ksc5601,        | Korean                         |
|                 | ks_c-5601, ks_c-5601-1987,     |                                |
|                 | ksx1001, ks_x-1001             |                                |
+-----------------+--------------------------------+--------------------------------+
| gb2312          | chinese, csiso58gb231280, euc- | Simplified Chinese             |
|                 | cn, euccn, eucgb2312-cn,       |                                |
|                 | gb2312-1980, gb2312-80, iso-   |                                |
|                 | ir-58                          |                                |
+-----------------+--------------------------------+--------------------------------+
| gbk             | 936, cp936, ms936              | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| gb18030         | gb18030-2000                   | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| hz              | hzgb, hz-gb, hz-gb-2312        | Simplified Chinese             |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp      | csiso2022jp, iso2022jp,        | Japanese                       |
|                 | iso-2022-jp                    |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_1    | iso2022jp-1, iso-2022-jp-1     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2    | iso2022jp-2, iso-2022-jp-2     | Japanese, Korean, Simplified   |
|                 |                                | Chinese, Western Europe, Greek |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2004 | iso2022jp-2004,                | Japanese                       |
|                 | iso-2022-jp-2004               |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_3    | iso2022jp-3, iso-2022-jp-3     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_ext  | iso2022jp-ext, iso-2022-jp-ext | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_kr      | csiso2022kr, iso2022kr,        | Korean                         |
|                 | iso-2022-kr                    |                                |
+-----------------+--------------------------------+--------------------------------+
| latin_1         | iso-8859-1, iso8859-1, 8859,   | Western Europe                 |
|                 | cp819, latin, latin1, L1       |                                |
+-----------------+--------------------------------+--------------------------------+
| iso8859_2       | iso-8859-2, latin2, L2         | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| iso8859_3       | iso-8859-3, latin3, L3         | Esperanto, Maltese             |
+-----------------+--------------------------------+--------------------------------+
| iso8859_4       | iso-8859-4, latin4, L4         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_5       | iso-8859-5, cyrillic           | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| iso8859_6       | iso-8859-6, arabic             | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_7       | iso-8859-7, greek, greek8      | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| iso8859_8       | iso-8859-8, hebrew             | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_9       | iso-8859-9, latin5, L5         | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| iso8859_10      | iso-8859-10, latin6, L6        | Nordic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_13      | iso-8859-13                    | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_14      | iso-8859-14, latin8, L8        | Celtic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_15      | iso-8859-15                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| johab           | cp1361, ms1361                 | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| koi8_r          |                                | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| koi8_u          |                                | Ukrainian                      |
+-----------------+--------------------------------+--------------------------------+
| mac_cyrillic    | maccyrillic                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| mac_greek       | macgreek                       | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| mac_iceland     | maciceland                     | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| mac_latin2      | maclatin2, maccentraleurope    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| mac_roman       | macroman                       | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| mac_turkish     | macturkish                     | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| ptcp154         | csptcp154, pt154, cp154,       | Kazakh                         |
|                 | cyrillic-asian                 |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis       | csshiftjis, shiftjis, sjis,    | Japanese                       |
|                 | s_jis                          |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis_2004  | shiftjis2004, sjis_2004,       | Japanese                       |
|                 | sjis2004                       |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jisx0213  | shiftjisx0213, sjisx0213,      | Japanese                       |
|                 | s_jisx0213                     |                                |
+-----------------+--------------------------------+--------------------------------+
| utf_32          | U32, utf32                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_be       | UTF-32BE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_le       | UTF-32LE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16          | U16, utf16                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16_be       | UTF-16BE                       | all languages (BMP only)       |
+-----------------+--------------------------------+--------------------------------+
| utf_16_le       | UTF-16LE                       | all languages (BMP only)       |
+-----------------+--------------------------------+--------------------------------+
| utf_7           | U7, unicode-1-1-utf-7          | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8           | U8, UTF, utf8                  | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8_sig       |                                | all languages                  |
+-----------------+--------------------------------+--------------------------------+
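Whether two spellings from the table refer to the same codec can be checked with :func:`lookup`. The sketch below assumes Python 3 and only compares the ``name`` attribute of the lookup results, since the canonical name reported by a codec may differ from the spelling used in the table.

```python
import codecs

# Aliases and spelling variants (case, hyphen vs. underscore).
latin_names = {
    alias: codecs.lookup(alias).name
    for alias in ("latin_1", "iso-8859-1", "L1", "LATIN1")
}
utf8_names = {
    alias: codecs.lookup(alias).name
    for alias in ("utf_8", "UTF-8", "U8", "utf8")
}

print(latin_names)
print(utf8_names)
```

An unknown name makes :func:`lookup` raise :exc:`LookupError`.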


A number of codecs are specific to Python, so their codec names have no meaning
outside Python. Some of them don't convert from Unicode strings to byte strings,
but instead use the property of the Python codecs machinery that any bijective
function with one argument can be considered as an encoding.

For the codecs listed below, the result in the "encoding" direction is always a
byte string. The result of the "decoding" direction is listed as operand type in
the table.

.. XXX fix here, should be in above table

+--------------------+---------+----------------+---------------------------+

1111

1112

+====================+=========+================+===========================+

+--------------------+---------+----------------+---------------------------+

+--------------------+---------+----------------+---------------------------+

1121

1122

+--------------------+---------+----------------+---------------------------+

1123

1124

+--------------------+---------+----------------+---------------------------+

+--------------------+---------+----------------+---------------------------+

+--------------------+---------+----------------+---------------------------+

+--------------------+---------+----------------+---------------------------+

+--------------------+---------+----------------+---------------------------+


:mod:`encodings.idna` --- Internationalized Domain Names in Applications
------------------------------------------------------------------------

.. module:: encodings.idna
   :synopsis: Internationalized Domain Names implementation
.. moduleauthor:: Martin v. Löwis

This module implements :rfc:`3490` (Internationalized Domain Names in
Applications) and :rfc:`3491` (Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)). It builds upon the ``punycode`` encoding
and :mod:`stringprep`.


These RFCs together define a protocol to support non-ASCII characters in domain
names. A domain name containing non-ASCII characters (such as
``www.Alliancefrançaise.nu``) is converted into an ASCII-compatible encoding
(ACE, such as ``www.xn--alliancefranaise-npb.nu``). The ACE form of the domain
name is then used in all places where arbitrary characters are not allowed by
the protocol, such as DNS queries, HTTP :mailheader:`Host` fields, and so
on. This conversion is carried out in the application; if possible invisible to
the user: The application should transparently convert Unicode domain labels to
IDNA on the wire, and convert back ACE labels to Unicode before presenting them
to the user.
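The conversion in both directions can be tried with the ``idna`` codec. This sketch assumes Python 3 and reuses the domain name from the paragraph above; note that decoding returns the case-folded form rather than the original spelling.

```python
name = "www.Alliancefrançaise.nu"

# Unicode domain name -> ASCII-compatible encoding (ACE), label by label.
ace = name.encode("idna")

# ACE -> Unicode, e.g. before presenting the name to the user.
back = ace.decode("idna")

print(ace, back)
```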


Python supports this conversion in several ways: The ``idna`` codec converts
between Unicode and the ACE. Furthermore, the :mod:`socket` module
transparently converts Unicode host names to ACE, so that applications need not
be concerned about converting host names themselves when they pass them to the
socket module. On top of that, modules that have host names as function
parameters, such as :mod:`httplib` and :mod:`ftplib`, accept Unicode host names
(:mod:`httplib` then also transparently sends an IDNA hostname in the
:mailheader:`Host` field if it sends that field at all).


When receiving host names from the wire (such as in reverse name lookup), no
automatic conversion to Unicode is performed: Applications wishing to present
such host names to the user should decode them to Unicode.

The module :mod:`encodings.idna` also implements the nameprep procedure, which
performs certain normalizations on host names, to achieve case-insensitivity of
international domain names, and to unify similar characters. The nameprep
functions can be used directly if desired.



.. function:: nameprep(label)

Return the nameprepped version of *label*. The implementation currently assumes
query strings, so ``AllowUnassigned`` is true.


.. function:: ToASCII(label)

Convert a label to ASCII, as specified in :rfc:`3490`. ``UseSTD3ASCIIRules`` is
assumed to be false.


.. function:: ToUnicode(label)

Convert a label to Unicode, as specified in :rfc:`3490`.
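The three functions operate on a single label rather than on a whole domain name. A brief sketch, assuming Python 3 (where :func:`ToASCII` returns a bytes object) and a label chosen for illustration:

```python
from encodings import idna

label = "Alliancefrançaise"

# nameprep normalizes the label; among other things it folds case.
prepped = idna.nameprep(label)

# ToASCII adds the ACE prefix and the punycode form of the label.
ascii_label = idna.ToASCII(label)

# ToUnicode reverses the conversion for an ACE label.
unicode_label = idna.ToUnicode(ascii_label)

print(prepped, ascii_label, unicode_label)
```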



:mod:`encodings.utf_8_sig` --- UTF-8 codec with BOM signature
-------------------------------------------------------------

.. module:: encodings.utf_8_sig
   :synopsis: UTF-8 codec with BOM signature
.. moduleauthor:: Walter Dörwald


This module implements a variant of the UTF-8 codec: On encoding a UTF-8 encoded
BOM will be prepended to the UTF-8 encoded bytes. For the stateful encoder this
is only done once (on the first write to the byte stream). For decoding an
optional UTF-8 encoded BOM at the start of the data will be skipped.
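The "only once" behaviour of the stateful encoder can be observed with an incremental encoder. The sketch assumes Python 3 and obtains the encoder through ``codecs.getincrementalencoder``.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-8-sig")()

# The BOM is emitted with the first chunk only.
first_chunk = encoder.encode("a")
second_chunk = encoder.encode("b")

# On decoding the BOM is optional: data without it decodes unchanged.
plain = b"ab".decode("utf-8-sig")
signed = (first_chunk + second_chunk).decode("utf-8-sig")

print(first_chunk, second_chunk, plain, signed)
```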
