blob: 5e1bd1e1980df786de9b581f4de47947d708058f [file] [log] [blame]
Georg Brandl116aa622007-08-15 14:28:22 +00001
2:mod:`csv` --- CSV File Reading and Writing
3===========================================
4
5.. module:: csv
6 :synopsis: Write and read tabular data to and from delimited files.
7.. sectionauthor:: Skip Montanaro <skip@pobox.com>
8
9
Georg Brandl116aa622007-08-15 14:28:22 +000010.. index::
11 single: csv
12 pair: data; tabular
13
14The so-called CSV (Comma Separated Values) format is the most common import and
15export format for spreadsheets and databases. There is no "CSV standard", so
16the format is operationally defined by the many applications which read and
17write it. The lack of a standard means that subtle differences often exist in
18the data produced and consumed by different applications. These differences can
19make it annoying to process CSV files from multiple sources. Still, while the
20delimiters and quoting characters vary, the overall format is similar enough
21that it is possible to write a single module which can efficiently manipulate
22such data, hiding the details of reading and writing the data from the
23programmer.
24
25The :mod:`csv` module implements classes to read and write tabular data in CSV
26format. It allows programmers to say, "write this data in the format preferred
27by Excel," or "read data from this file which was generated by Excel," without
28knowing the precise details of the CSV format used by Excel. Programmers can
29also describe the CSV formats understood by other applications or define their
30own special-purpose CSV formats.
31
32The :mod:`csv` module's :class:`reader` and :class:`writer` objects read and
33write sequences. Programmers can also read and write data in dictionary form
34using the :class:`DictReader` and :class:`DictWriter` classes.
35
36.. note::
37
38 This version of the :mod:`csv` module doesn't support Unicode input. Also,
39 there are currently some issues regarding ASCII NUL characters. Accordingly,
40 all input should be UTF-8 or printable ASCII to be safe; see the examples in
41 section :ref:`csv-examples`. These restrictions will be removed in the future.
42
43
44.. seealso::
45
46 .. % \seemodule{array}{Arrays of uniformly types numeric values.}
47
48 :pep:`305` - CSV File API
49 The Python Enhancement Proposal which proposed this addition to Python.
50
51
52.. _csv-contents:
53
54Module Contents
55---------------
56
57The :mod:`csv` module defines the following functions:
58
59
60.. function:: reader(csvfile[, dialect='excel'][, fmtparam])
61
62 Return a reader object which will iterate over lines in the given *csvfile*.
Georg Brandl9afde1c2007-11-01 20:32:30 +000063 *csvfile* can be any object which supports the :term:`iterator` protocol and returns a
Georg Brandl116aa622007-08-15 14:28:22 +000064 string each time its :meth:`next` method is called --- file objects and list
65 objects are both suitable. If *csvfile* is a file object, it must be opened
66 with the 'b' flag on platforms where that makes a difference. An optional
67 *dialect* parameter can be given which is used to define a set of parameters
68 specific to a particular CSV dialect. It may be an instance of a subclass of
69 the :class:`Dialect` class or one of the strings returned by the
70 :func:`list_dialects` function. The other optional *fmtparam* keyword arguments
71 can be given to override individual formatting parameters in the current
72 dialect. For full details about the dialect and formatting parameters, see
73 section :ref:`csv-fmt-params`.
74
75 All data read are returned as strings. No automatic data type conversion is
76 performed.
77
Georg Brandl55ac8f02007-09-01 13:51:09 +000078 The parser is quite strict with respect to multi-line quoted fields. Previously,
79 if a line ended within a quoted field without a terminating newline character, a
80 newline would be inserted into the returned field. This behavior caused problems
81 when reading files which contained carriage return characters within fields.
82 The behavior was changed to return the field without inserting newlines. As a
83 consequence, if newlines embedded within fields are important, the input should
84 be split into lines in a manner which preserves the newline characters.
Georg Brandl116aa622007-08-15 14:28:22 +000085
86
87.. function:: writer(csvfile[, dialect='excel'][, fmtparam])
88
89 Return a writer object responsible for converting the user's data into delimited
90 strings on the given file-like object. *csvfile* can be any object with a
91 :func:`write` method. If *csvfile* is a file object, it must be opened with the
92 'b' flag on platforms where that makes a difference. An optional *dialect*
93 parameter can be given which is used to define a set of parameters specific to a
94 particular CSV dialect. It may be an instance of a subclass of the
95 :class:`Dialect` class or one of the strings returned by the
96 :func:`list_dialects` function. The other optional *fmtparam* keyword arguments
97 can be given to override individual formatting parameters in the current
98 dialect. For full details about the dialect and formatting parameters, see
99 section :ref:`csv-fmt-params`. To make it
100 as easy as possible to interface with modules which implement the DB API, the
101 value :const:`None` is written as the empty string. While this isn't a
102 reversible transformation, it makes it easier to dump SQL NULL data values to
103 CSV files without preprocessing the data returned from a ``cursor.fetch*`` call.
104 All other non-string data are stringified with :func:`str` before being written.
105
106
107.. function:: register_dialect(name[, dialect][, fmtparam])
108
109 Associate *dialect* with *name*. *name* must be a string or Unicode object. The
110 dialect can be specified either by passing a sub-class of :class:`Dialect`, or
111 by *fmtparam* keyword arguments, or both, with keyword arguments overriding
112 parameters of the dialect. For full details about the dialect and formatting
113 parameters, see section :ref:`csv-fmt-params`.
114
115
116.. function:: unregister_dialect(name)
117
118 Delete the dialect associated with *name* from the dialect registry. An
119 :exc:`Error` is raised if *name* is not a registered dialect name.
120
121
122.. function:: get_dialect(name)
123
124 Return the dialect associated with *name*. An :exc:`Error` is raised if *name*
125 is not a registered dialect name.
126
Guido van Rossum77677112007-11-05 19:43:04 +0000127 .. versionchanged:: 2.5
128 This function now returns an immutable :class:`Dialect`. Previously an
129 instance of the requested dialect was returned. Users could modify the
130 underlying class, changing the behavior of active readers and writers.
Georg Brandl116aa622007-08-15 14:28:22 +0000131
132.. function:: list_dialects()
133
134 Return the names of all registered dialects.
135
136
137.. function:: field_size_limit([new_limit])
138
139 Returns the current maximum field size allowed by the parser. If *new_limit* is
140 given, this becomes the new limit.
141
Georg Brandl116aa622007-08-15 14:28:22 +0000142
143The :mod:`csv` module defines the following classes:
144
Georg Brandl9afde1c2007-11-01 20:32:30 +0000145.. class:: DictReader(csvfile[, fieldnames=None[, restkey=None[, restval=None[, dialect='excel'[, *args, **kwds]]]]])
Georg Brandl116aa622007-08-15 14:28:22 +0000146
147 Create an object which operates like a regular reader but maps the information
148 read into a dict whose keys are given by the optional *fieldnames* parameter.
149 If the *fieldnames* parameter is omitted, the values in the first row of the
150 *csvfile* will be used as the fieldnames. If the row read has fewer fields than
151 the fieldnames sequence, the value of *restval* will be used as the default
152 value. If the row read has more fields than the fieldnames sequence, the
153 remaining data is added as a sequence keyed by the value of *restkey*. If the
154 row read has fewer fields than the fieldnames sequence, the remaining keys take
155 the value of the optional *restval* parameter. Any other optional or keyword
156 arguments are passed to the underlying :class:`reader` instance.
157
158
159.. class:: DictWriter(csvfile, fieldnames[, restval=''[, extrasaction='raise'[, dialect='excel'[, *args, **kwds]]]])
160
161 Create an object which operates like a regular writer but maps dictionaries onto
162 output rows. The *fieldnames* parameter identifies the order in which values in
163 the dictionary passed to the :meth:`writerow` method are written to the
164 *csvfile*. The optional *restval* parameter specifies the value to be written
165 if the dictionary is missing a key in *fieldnames*. If the dictionary passed to
166 the :meth:`writerow` method contains a key not found in *fieldnames*, the
167 optional *extrasaction* parameter indicates what action to take. If it is set
168 to ``'raise'`` a :exc:`ValueError` is raised. If it is set to ``'ignore'``,
169 extra values in the dictionary are ignored. Any other optional or keyword
170 arguments are passed to the underlying :class:`writer` instance.
171
172 Note that unlike the :class:`DictReader` class, the *fieldnames* parameter of
173 the :class:`DictWriter` is not optional. Since Python's :class:`dict` objects
174 are not ordered, there is not enough information available to deduce the order
175 in which the row should be written to the *csvfile*.
176
177
178.. class:: Dialect
179
180 The :class:`Dialect` class is a container class relied on primarily for its
181 attributes, which are used to define the parameters for a specific
182 :class:`reader` or :class:`writer` instance.
183
184
185.. class:: excel()
186
187 The :class:`excel` class defines the usual properties of an Excel-generated CSV
188 file. It is registered with the dialect name ``'excel'``.
189
190
191.. class:: excel_tab()
192
193 The :class:`excel_tab` class defines the usual properties of an Excel-generated
194 TAB-delimited file. It is registered with the dialect name ``'excel-tab'``.
195
196
197.. class:: Sniffer()
198
199 The :class:`Sniffer` class is used to deduce the format of a CSV file.
200
201The :class:`Sniffer` class provides two methods:
202
203
204.. method:: Sniffer.sniff(sample[, delimiters=None])
205
206 Analyze the given *sample* and return a :class:`Dialect` subclass reflecting the
207 parameters found. If the optional *delimiters* parameter is given, it is
208 interpreted as a string containing possible valid delimiter characters.
209
210
211.. method:: Sniffer.has_header(sample)
212
213 Analyze the sample text (presumed to be in CSV format) and return :const:`True`
214 if the first row appears to be a series of column headers.
215
216The :mod:`csv` module defines the following constants:
217
218
219.. data:: QUOTE_ALL
220
221 Instructs :class:`writer` objects to quote all fields.
222
223
224.. data:: QUOTE_MINIMAL
225
226 Instructs :class:`writer` objects to only quote those fields which contain
227 special characters such as *delimiter*, *quotechar* or any of the characters in
228 *lineterminator*.
229
230
231.. data:: QUOTE_NONNUMERIC
232
233 Instructs :class:`writer` objects to quote all non-numeric fields.
234
235 Instructs the reader to convert all non-quoted fields to type *float*.
236
237
238.. data:: QUOTE_NONE
239
240 Instructs :class:`writer` objects to never quote fields. When the current
241 *delimiter* occurs in output data it is preceded by the current *escapechar*
242 character. If *escapechar* is not set, the writer will raise :exc:`Error` if
243 any characters that require escaping are encountered.
244
245 Instructs :class:`reader` to perform no special processing of quote characters.
246
247The :mod:`csv` module defines the following exception:
248
249
250.. exception:: Error
251
252 Raised by any of the functions when an error is detected.
253
254
255.. _csv-fmt-params:
256
257Dialects and Formatting Parameters
258----------------------------------
259
260To make it easier to specify the format of input and output records, specific
261formatting parameters are grouped together into dialects. A dialect is a
262subclass of the :class:`Dialect` class having a set of specific methods and a
263single :meth:`validate` method. When creating :class:`reader` or
264:class:`writer` objects, the programmer can specify a string or a subclass of
265the :class:`Dialect` class as the dialect parameter. In addition to, or instead
266of, the *dialect* parameter, the programmer can also specify individual
267formatting parameters, which have the same names as the attributes defined below
268for the :class:`Dialect` class.
269
270Dialects support the following attributes:
271
272
273.. attribute:: Dialect.delimiter
274
275 A one-character string used to separate fields. It defaults to ``','``.
276
277
278.. attribute:: Dialect.doublequote
279
280 Controls how instances of *quotechar* appearing inside a field should be
281 themselves be quoted. When :const:`True`, the character is doubled. When
282 :const:`False`, the *escapechar* is used as a prefix to the *quotechar*. It
283 defaults to :const:`True`.
284
285 On output, if *doublequote* is :const:`False` and no *escapechar* is set,
286 :exc:`Error` is raised if a *quotechar* is found in a field.
287
288
289.. attribute:: Dialect.escapechar
290
291 A one-character string used by the writer to escape the *delimiter* if *quoting*
292 is set to :const:`QUOTE_NONE` and the *quotechar* if *doublequote* is
293 :const:`False`. On reading, the *escapechar* removes any special meaning from
294 the following character. It defaults to :const:`None`, which disables escaping.
295
296
297.. attribute:: Dialect.lineterminator
298
299 The string used to terminate lines produced by the :class:`writer`. It defaults
300 to ``'\r\n'``.
301
302 .. note::
303
304 The :class:`reader` is hard-coded to recognise either ``'\r'`` or ``'\n'`` as
305 end-of-line, and ignores *lineterminator*. This behavior may change in the
306 future.
307
308
309.. attribute:: Dialect.quotechar
310
311 A one-character string used to quote fields containing special characters, such
312 as the *delimiter* or *quotechar*, or which contain new-line characters. It
313 defaults to ``'"'``.
314
315
316.. attribute:: Dialect.quoting
317
318 Controls when quotes should be generated by the writer and recognised by the
319 reader. It can take on any of the :const:`QUOTE_\*` constants (see section
320 :ref:`csv-contents`) and defaults to :const:`QUOTE_MINIMAL`.
321
322
323.. attribute:: Dialect.skipinitialspace
324
325 When :const:`True`, whitespace immediately following the *delimiter* is ignored.
326 The default is :const:`False`.
327
328
329Reader Objects
330--------------
331
332Reader objects (:class:`DictReader` instances and objects returned by the
333:func:`reader` function) have the following public methods:
334
335
336.. method:: csvreader.next()
337
338 Return the next row of the reader's iterable object as a list, parsed according
339 to the current dialect.
340
341Reader objects have the following public attributes:
342
343
344.. attribute:: csvreader.dialect
345
346 A read-only description of the dialect in use by the parser.
347
348
349.. attribute:: csvreader.line_num
350
351 The number of lines read from the source iterator. This is not the same as the
352 number of records returned, as records can span multiple lines.
353
Georg Brandl116aa622007-08-15 14:28:22 +0000354
355
356Writer Objects
357--------------
358
359:class:`Writer` objects (:class:`DictWriter` instances and objects returned by
360the :func:`writer` function) have the following public methods. A *row* must be
361a sequence of strings or numbers for :class:`Writer` objects and a dictionary
362mapping fieldnames to strings or numbers (by passing them through :func:`str`
363first) for :class:`DictWriter` objects. Note that complex numbers are written
364out surrounded by parens. This may cause some problems for other programs which
365read CSV files (assuming they support complex numbers at all).
366
367
368.. method:: csvwriter.writerow(row)
369
370 Write the *row* parameter to the writer's file object, formatted according to
371 the current dialect.
372
373
374.. method:: csvwriter.writerows(rows)
375
376 Write all the *rows* parameters (a list of *row* objects as described above) to
377 the writer's file object, formatted according to the current dialect.
378
379Writer objects have the following public attribute:
380
381
382.. attribute:: csvwriter.dialect
383
384 A read-only description of the dialect in use by the writer.
385
386
387.. _csv-examples:
388
389Examples
390--------
391
392The simplest example of reading a CSV file::
393
394 import csv
395 reader = csv.reader(open("some.csv", "rb"))
396 for row in reader:
Georg Brandl6911e3c2007-09-04 07:15:32 +0000397 print(row)
Georg Brandl116aa622007-08-15 14:28:22 +0000398
399Reading a file with an alternate format::
400
401 import csv
402 reader = csv.reader(open("passwd", "rb"), delimiter=':', quoting=csv.QUOTE_NONE)
403 for row in reader:
Georg Brandl6911e3c2007-09-04 07:15:32 +0000404 print(row)
Georg Brandl116aa622007-08-15 14:28:22 +0000405
406The corresponding simplest possible writing example is::
407
408 import csv
409 writer = csv.writer(open("some.csv", "wb"))
410 writer.writerows(someiterable)
411
412Registering a new dialect::
413
414 import csv
415
416 csv.register_dialect('unixpwd', delimiter=':', quoting=csv.QUOTE_NONE)
417
418 reader = csv.reader(open("passwd", "rb"), 'unixpwd')
419
420A slightly more advanced use of the reader --- catching and reporting errors::
421
422 import csv, sys
423 filename = "some.csv"
424 reader = csv.reader(open(filename, "rb"))
425 try:
426 for row in reader:
Georg Brandl6911e3c2007-09-04 07:15:32 +0000427 print(row)
Georg Brandl116aa622007-08-15 14:28:22 +0000428 except csv.Error as e:
429 sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
430
431And while the module doesn't directly support parsing strings, it can easily be
432done::
433
434 import csv
435 for row in csv.reader(['one,two,three']):
Georg Brandl6911e3c2007-09-04 07:15:32 +0000436 print(row)
Georg Brandl116aa622007-08-15 14:28:22 +0000437
438The :mod:`csv` module doesn't directly support reading and writing Unicode, but
439it is 8-bit-clean save for some problems with ASCII NUL characters. So you can
440write functions or classes that handle the encoding and decoding for you as long
441as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.
442
Georg Brandl9afde1c2007-11-01 20:32:30 +0000443:func:`unicode_csv_reader` below is a :term:`generator` that wraps :class:`csv.reader`
Georg Brandl116aa622007-08-15 14:28:22 +0000444to handle Unicode CSV data (a list of Unicode strings). :func:`utf_8_encoder`
Georg Brandl9afde1c2007-11-01 20:32:30 +0000445is a :term:`generator` that encodes the Unicode strings as UTF-8, one string (or row) at
Georg Brandl116aa622007-08-15 14:28:22 +0000446a time. The encoded strings are parsed by the CSV reader, and
447:func:`unicode_csv_reader` decodes the UTF-8-encoded cells back into Unicode::
448
449 import csv
450
451 def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
452 # csv.py doesn't do Unicode; encode temporarily as UTF-8:
453 csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
454 dialect=dialect, **kwargs)
455 for row in csv_reader:
456 # decode UTF-8 back to Unicode, cell by cell:
457 yield [unicode(cell, 'utf-8') for cell in row]
458
459 def utf_8_encoder(unicode_csv_data):
460 for line in unicode_csv_data:
461 yield line.encode('utf-8')
462
463For all other encodings the following :class:`UnicodeReader` and
464:class:`UnicodeWriter` classes can be used. They take an additional *encoding*
465parameter in their constructor and make sure that the data passes the real
466reader or writer encoded as UTF-8::
467
468 import csv, codecs, cStringIO
469
470 class UTF8Recoder:
471 """
472 Iterator that reads an encoded stream and reencodes the input to UTF-8
473 """
474 def __init__(self, f, encoding):
475 self.reader = codecs.getreader(encoding)(f)
476
477 def __iter__(self):
478 return self
479
480 def __next__(self):
481 return next(self.reader).encode("utf-8")
482
483 class UnicodeReader:
484 """
485 A CSV reader which will iterate over lines in the CSV file "f",
486 which is encoded in the given encoding.
487 """
488
489 def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
490 f = UTF8Recoder(f, encoding)
491 self.reader = csv.reader(f, dialect=dialect, **kwds)
492
493 def __next__(self):
494 row = next(self.reader)
495 return [unicode(s, "utf-8") for s in row]
496
497 def __iter__(self):
498 return self
499
500 class UnicodeWriter:
501 """
502 A CSV writer which will write rows to CSV file "f",
503 which is encoded in the given encoding.
504 """
505
506 def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
507 # Redirect output to a queue
508 self.queue = cStringIO.StringIO()
509 self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
510 self.stream = f
511 self.encoder = codecs.getincrementalencoder(encoding)()
512
513 def writerow(self, row):
514 self.writer.writerow([s.encode("utf-8") for s in row])
515 # Fetch UTF-8 output from the queue ...
516 data = self.queue.getvalue()
517 data = data.decode("utf-8")
518 # ... and reencode it into the target encoding
519 data = self.encoder.encode(data)
520 # write to the target stream
521 self.stream.write(data)
522 # empty queue
523 self.queue.truncate(0)
524
525 def writerows(self, rows):
526 for row in rows:
527 self.writerow(row)
528