:mod:`urllib` --- Open arbitrary resources by URL
=================================================

.. module:: urllib
   :synopsis: Open an arbitrary network resource by URL (requires sockets).


.. index::
   single: WWW
   single: World Wide Web
   single: URL

This module provides a high-level interface for fetching data across the World
Wide Web. In particular, the :func:`urlopen` function is similar to the
built-in function :func:`open`, but accepts Uniform Resource Locators (URLs)
instead of filenames. Some restrictions apply --- it can only open URLs for
reading, and no seek operations are available.

High-level interface
--------------------

.. function:: urlopen(url[, data[, proxies]])

   Open a network object denoted by a URL for reading. If the URL does not have a
   scheme identifier, or if it has :file:`file:` as its scheme identifier, this
   opens a local file (without universal newlines); otherwise it opens a socket to
   a server somewhere on the network. If the connection cannot be made, the
   :exc:`IOError` exception is raised. If all went well, a file-like object is
   returned. This supports the following methods: :meth:`read`, :meth:`readline`,
   :meth:`readlines`, :meth:`fileno`, :meth:`close`, :meth:`info`, :meth:`getcode` and
   :meth:`geturl`. It also has proper support for the :term:`iterator` protocol. One
   caveat: the :meth:`read` method, if the size argument is omitted or negative,
   may not read until the end of the data stream; there is no good way to determine
   that the entire stream from a socket has been read in the general case.

   Except for the :meth:`info`, :meth:`getcode` and :meth:`geturl` methods,
   these methods have the same interface as for file objects --- see section
   :ref:`bltin-file-objects` in this manual. (It is not a built-in file object,
   however, so it can't be used at those few places where a true built-in file
   object is required.)

   .. index:: module: mimetools

   The :meth:`info` method returns an instance of the class
   :class:`mimetools.Message` containing meta-information associated with the
   URL. When the method is HTTP, these headers are those returned by the server
   at the head of the retrieved HTML page (including Content-Length and
   Content-Type). When the method is FTP, a Content-Length header will be
   present if (as is now usual) the server passed back a file length in response
   to the FTP retrieval request. A Content-Type header will be present if the
   MIME type can be guessed. When the method is local-file, returned headers
   will include a Last-modified header giving the file's last-modified time, a
   Content-Length giving file size, and a Content-Type containing a guess at the
   file's type. See also the description of the :mod:`mimetools` module.

   The :meth:`geturl` method returns the real URL of the page. In some cases, the
   HTTP server redirects a client to another URL. The :func:`urlopen` function
   handles this transparently, but in some cases the caller needs to know which URL
   the client was redirected to. The :meth:`geturl` method can be used to get at
   this redirected URL.

   The :meth:`getcode` method returns the HTTP status code that was sent with the
   response, or ``None`` if the URL is not an HTTP URL.

   If the *url* uses the :file:`http:` scheme identifier, the optional *data*
   argument may be given to specify a ``POST`` request (normally the request type
   is ``GET``). The *data* argument must be in standard
   :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
   function below.

   The :func:`urlopen` function works transparently with proxies which do not
   require authentication. In a Unix or Windows environment, set the
   :envvar:`http_proxy` or :envvar:`ftp_proxy` environment variable to a URL that
   identifies the proxy server before starting the Python interpreter. For example
   (the ``'%'`` is the command prompt)::

      % http_proxy="http://www.someproxy.com:3128"
      % export http_proxy
      % python
      ...

   The :envvar:`no_proxy` environment variable can be used to specify hosts which
   shouldn't be reached via proxy; if set, it should be a comma-separated list
   of hostname suffixes, optionally with ``:port`` appended, for example
   ``cern.ch,ncsa.uiuc.edu,some.host:8080``.

   In a Windows environment, if no proxy environment variables are set, proxy
   settings are obtained from the registry's Internet Settings section.

   .. index:: single: Internet Config

   In a Macintosh environment, :func:`urlopen` will retrieve proxy information from
   Internet Config.

   Alternatively, the optional *proxies* argument may be used to explicitly specify
   proxies. It must be a dictionary mapping scheme names to proxy URLs, where an
   empty dictionary causes no proxies to be used, and ``None`` (the default value)
   causes environmental proxy settings to be used as discussed above. For
   example::

      # Use http://www.someproxy.com:3128 for http proxying
      proxies = {'http': 'http://www.someproxy.com:3128'}
      filehandle = urllib.urlopen(some_url, proxies=proxies)
      # Don't use any proxies
      filehandle = urllib.urlopen(some_url, proxies={})
      # Use proxies from environment - both versions are equivalent
      filehandle = urllib.urlopen(some_url, proxies=None)
      filehandle = urllib.urlopen(some_url)

   Environmental proxy settings can also be overridden by using
   :class:`URLopener`, or a subclass such as :class:`FancyURLopener`, directly.

   Proxies which require authentication for use are not currently supported; this
   is considered an implementation limitation.

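
As a minimal sketch of the response methods described above, the following
opens a temporary local file through a ``file:`` URL, so no network access is
needed. The fallback import is an assumption for interpreters where these
functions live in ``urllib.request``; the temporary file and its contents are
purely illustrative.

```python
import os
import tempfile

try:  # module layout documented on this page
    from urllib import urlopen, pathname2url
except ImportError:  # assumption: functions relocated in newer interpreters
    from urllib.request import urlopen, pathname2url

# Create a small local file to stand in for a remote resource.
fd, path = tempfile.mkstemp(suffix=".txt")
os.write(fd, b"hello\n")
os.close(fd)

f = urlopen("file:" + pathname2url(path))
try:
    body = f.read()                       # raw bytes of the resource
    length = f.info()["Content-Length"]   # header lookup is case-insensitive
    code = f.getcode()                    # None, since this is not an HTTP URL
finally:
    f.close()
    os.remove(path)
```

For an HTTP URL the same calls apply, with ``getcode()`` returning the numeric
status code instead of ``None``.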

.. function:: urlretrieve(url[, filename[, reporthook[, data]]])

   Copy a network object denoted by a URL to a local file, if necessary. If the URL
   points to a local file, or a valid cached copy of the object exists, the object
   is not copied. Return a tuple ``(filename, headers)`` where *filename* is the
   local file name under which the object can be found, and *headers* is whatever
   the :meth:`info` method of the object returned by :func:`urlopen` returned (for
   a remote object, possibly cached). Exceptions are the same as for
   :func:`urlopen`.

   The second argument, if present, specifies the file location to copy to (if
   absent, the location will be a tempfile with a generated name). The third
   argument, if present, is a hook function that will be called once on
   establishment of the network connection and once after each block read
   thereafter. The hook will be passed three arguments: a count of blocks
   transferred so far, a block size in bytes, and the total size of the file. The
   total size may be ``-1`` on older FTP servers which do not return a file
   size in response to a retrieval request.

   If the *url* uses the :file:`http:` scheme identifier, the optional *data*
   argument may be given to specify a ``POST`` request (normally the request type
   is ``GET``). The *data* argument must be in standard
   :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
   function below.

   :func:`urlretrieve` will raise :exc:`ContentTooShortError` when it detects that
   the amount of data available was less than the expected amount (which is the
   size reported by a *Content-Length* header). This can occur, for example, when
   the download is interrupted.

   The *Content-Length* is treated as a lower bound: if there's more data to read,
   :func:`urlretrieve` reads more data, but if less data is available, it raises
   the exception.

   You can still retrieve the downloaded data in this case; it is stored in the
   :attr:`content` attribute of the exception instance.

   If no *Content-Length* header was supplied, :func:`urlretrieve` cannot check
   the size of the data it has downloaded, and just returns it. In this case you
   just have to assume that the download was successful.

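
The report hook and the :exc:`ContentTooShortError` handling described above
can be sketched as follows. The example copies a temporary local file so that
it runs without a network; ``make_progress_hook`` and the file names are
hypothetical helpers invented for illustration, and the fallback import is an
assumption for interpreters where these names live in ``urllib.request`` and
``urllib.error``.

```python
import os
import shutil
import tempfile

try:  # module layout documented on this page
    from urllib import urlretrieve, pathname2url, ContentTooShortError
except ImportError:  # assumption: names relocated in newer interpreters
    from urllib.request import urlretrieve, pathname2url
    from urllib.error import ContentTooShortError


def make_progress_hook(log):
    """Return a reporthook that appends (bytes_so_far, total_size) to *log*."""
    def hook(block_count, block_size, total_size):
        received = block_count * block_size
        if total_size >= 0:
            # The final block is usually partial, so clamp to the total.
            received = min(received, total_size)
        log.append((received, total_size))
    return hook


workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "source.bin")
target = os.path.join(workdir, "copy.bin")
with open(source, "wb") as handle:
    handle.write(b"x" * 6)

progress = []
partial = None
copied = None
try:
    filename, headers = urlretrieve("file:" + pathname2url(source), target,
                                    make_progress_hook(progress))
    with open(filename, "rb") as handle:
        copied = handle.read()
except ContentTooShortError as exc:
    # The truncated payload is still available on the exception instance.
    partial = exc.content
finally:
    shutil.rmtree(workdir)
```

A total size of ``-1`` is passed through to the log unchanged, so a caller can
tell "size unknown" apart from real progress values.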

.. data:: _urlopener

   The public functions :func:`urlopen` and :func:`urlretrieve` create an instance
   of the :class:`FancyURLopener` class and use it to perform their requested
   actions. To override this functionality, programmers can create a subclass of
   :class:`URLopener` or :class:`FancyURLopener`, then assign an instance of that
   class to the ``urllib._urlopener`` variable before calling the desired function.
   For example, applications may want to specify a different
   :mailheader:`User-Agent` header than :class:`URLopener` defines. This can be
   accomplished with the following code::

      import urllib

      class AppURLopener(urllib.FancyURLopener):
          version = "App/1.7"

      urllib._urlopener = AppURLopener()


.. function:: urlcleanup()

   Clear the cache that may have been built up by previous calls to
   :func:`urlretrieve`.


Utility functions
-----------------

.. function:: quote(string[, safe])

   Replace special characters in *string* using the ``%xx`` escape. Letters,
   digits, and the characters ``'_.-'`` are never quoted. The optional *safe*
   parameter specifies additional characters that should not be quoted --- its
   default value is ``'/'``.

   Example: ``quote('/~connolly/')`` yields ``'/%7Econnolly/'``.


.. function:: quote_plus(string[, safe])

   Like :func:`quote`, but also replaces spaces by plus signs, as required for
   quoting HTML form values. Plus signs in the original string are escaped unless
   they are included in *safe*. Unlike :func:`quote`, *safe* does not default to
   ``'/'``.

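
The difference between the two quoting functions, and the effect of *safe*,
can be seen in a short sketch. The input strings are arbitrary, and the
fallback import is an assumption for interpreters where these functions live
in ``urllib.parse``.

```python
try:  # module layout documented on this page
    from urllib import quote, quote_plus
except ImportError:  # assumption: functions relocated in newer interpreters
    from urllib.parse import quote, quote_plus

# '/' is safe by default, so path separators survive.
path_default = quote("a b/c&d")          # 'a%20b/c%26d'

# With an empty safe set the slash is escaped too.
path_strict = quote("a b/c&d", safe="")  # 'a%20b%2Fc%26d'

# quote_plus turns spaces into '+' and escapes literal plus signs and slashes.
form_value = quote_plus("a b/c+d")       # 'a+b%2Fc%2Bd'
```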

.. function:: unquote(string)

   Replace ``%xx`` escapes by their single-character equivalent.

   Example: ``unquote('/%7Econnolly/')`` yields ``'/~connolly/'``.


.. function:: unquote_plus(string)

   Like :func:`unquote`, but also replaces plus signs by spaces, as required for
   unquoting HTML form values.

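
A brief sketch of how the two unquoting functions treat a plus sign
differently. The sample strings are arbitrary, and the fallback import is an
assumption for interpreters where these functions live in ``urllib.parse``.

```python
try:  # module layout documented on this page
    from urllib import unquote, unquote_plus, quote_plus
except ImportError:  # assumption: functions relocated in newer interpreters
    from urllib.parse import unquote, unquote_plus, quote_plus

plain = unquote("a%20b%2Fc")           # 'a b/c'
literal_plus = unquote("a+b%2Bc")      # 'a+b+c': '+' is left untouched
form_value = unquote_plus("a+b%2Bc")   # 'a b+c': '+' decodes to a space

# unquote_plus reverses quote_plus for form values.
original = "1 + 1 = 2"
round_trip = unquote_plus(quote_plus(original))
```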

.. function:: urlencode(query[, doseq])

   Convert a mapping object or a sequence of two-element tuples to a "url-encoded"
   string, suitable to pass to :func:`urlopen` above as the optional *data*
   argument. This is useful to pass a dictionary of form fields to a ``POST``
   request. The resulting string is a series of ``key=value`` pairs separated by
   ``'&'`` characters, where both *key* and *value* are quoted using
   :func:`quote_plus` above. If the optional parameter *doseq* is present and
   evaluates to true, individual ``key=value`` pairs are generated for each element
   of the sequence. When a sequence of two-element tuples is used as the *query*
   argument, the first element of each tuple is a key and the second is a value.
   The order of parameters in the encoded string will match the order of parameter
   tuples in the sequence. The :mod:`cgi` module provides the functions
   :func:`parse_qs` and :func:`parse_qsl` which are used to parse query strings
   into Python data structures.

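
The following sketch illustrates the ordering guarantee for a sequence of
tuples and the effect of *doseq*. The field names and values are made up, and
the fallback import is an assumption for interpreters where the function lives
in ``urllib.parse``.

```python
try:  # module layout documented on this page
    from urllib import urlencode
except ImportError:  # assumption: function relocated in newer interpreters
    from urllib.parse import urlencode

# A sequence of pairs keeps its order; values are quoted with quote_plus.
pairs = urlencode([("spam", 1), ("eggs", "a b")])       # 'spam=1&eggs=a+b'

# With doseq true, each element of a sequence value gets its own pair.
multi = urlencode([("tag", ["x", "y z"])], doseq=True)  # 'tag=x&tag=y+z'
```

Passing a sequence of tuples instead of a dictionary is the simplest way to
get a predictable parameter order.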

.. function:: pathname2url(path)

   Convert the pathname *path* from the local syntax for a path to the form used in
   the path component of a URL. This does not produce a complete URL. The return
   value will already be quoted using the :func:`quote` function.


.. function:: url2pathname(path)

   Convert the path component *path* from an encoded URL to the local syntax for a
   path. This does not accept a complete URL. This function uses :func:`unquote`
   to decode *path*.

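
A small round-trip sketch for the two path conversion functions. The path is
built with ``os.path.join`` so the example is not tied to one platform; the
file name is arbitrary and the file does not need to exist. The fallback
import is an assumption for interpreters where these functions live in
``urllib.request``.

```python
import os

try:  # module layout documented on this page
    from urllib import pathname2url, url2pathname
except ImportError:  # assumption: functions relocated in newer interpreters
    from urllib.request import pathname2url, url2pathname

local_path = os.path.join(os.sep, "tmp", "my file.txt")

# The result is only the (quoted) path component, not a complete URL.
url_path = pathname2url(local_path)

# Converting back yields the original local path.
round_trip = url2pathname(url_path)
```

The exact form of ``url_path`` depends on the platform and interpreter, so
only the round trip and the quoting of the space are relied on here.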

URL Opener objects
------------------

.. class:: URLopener([proxies[, **x509]])

   Base class for opening and reading URLs. Unless you need to support opening
   objects using schemes other than :file:`http:`, :file:`ftp:`, or :file:`file:`,
   you probably want to use :class:`FancyURLopener`.

   By default, the :class:`URLopener` class sends a :mailheader:`User-Agent` header
   of ``urllib/VVV``, where *VVV* is the :mod:`urllib` version number.
   Applications can define their own :mailheader:`User-Agent` header by subclassing
   :class:`URLopener` or :class:`FancyURLopener` and setting the class attribute
   :attr:`version` to an appropriate string value in the subclass definition.

   The optional *proxies* parameter should be a dictionary mapping scheme names to
   proxy URLs, where an empty dictionary turns proxies off completely. Its default
   value is ``None``, in which case environmental proxy settings will be used if
   present, as discussed in the definition of :func:`urlopen`, above.

   Additional keyword parameters, collected in *x509*, may be used for
   authentication of the client when using the :file:`https:` scheme. The keywords
   *key_file* and *cert_file* are supported to provide an SSL key and certificate;
   both are needed to support client authentication.

   :class:`URLopener` objects will raise an :exc:`IOError` exception if the server
   returns an error code.

   .. method:: open(fullurl[, data])

      Open *fullurl* using the appropriate protocol. This method sets up cache and
      proxy information, then calls the appropriate open method with its input
      arguments. If the scheme is not recognized, :meth:`open_unknown` is called.
      The *data* argument has the same meaning as the *data* argument of
      :func:`urlopen`.


   .. method:: open_unknown(fullurl[, data])

      Overridable interface to open unknown URL types.


   .. method:: retrieve(url[, filename[, reporthook[, data]]])

      Retrieves the contents of *url* and places them in *filename*. The return
      value is a tuple consisting of a local filename and either a
      :class:`mimetools.Message` object containing the response headers (for remote
      URLs) or ``None`` (for local URLs). The caller must then open and read the
      contents of *filename*. If *filename* is not given and the URL refers to a
      local file, the input filename is returned. If the URL is non-local and
      *filename* is not given, the filename is the output of :func:`tempfile.mktemp`
      with a suffix that matches the suffix of the last path component of the input
      URL. If *reporthook* is given, it must be a function accepting three numeric
      parameters. It will be called after each chunk of data is read from the
      network. *reporthook* is ignored for local URLs.

      If the *url* uses the :file:`http:` scheme identifier, the optional *data*
      argument may be given to specify a ``POST`` request (normally the request type
      is ``GET``). The *data* argument must be in standard
      :mimetype:`application/x-www-form-urlencoded` format; see the
      :func:`urlencode` function.


   .. attribute:: version

      Variable that specifies the user agent of the opener object. To get
      :mod:`urllib` to tell servers that it is a particular user agent, set this in a
      subclass as a class variable or in the constructor before calling the base
      constructor.


.. class:: FancyURLopener(...)

   :class:`FancyURLopener` subclasses :class:`URLopener` providing default handling
   for the following HTTP response codes: 301, 302, 303, 307 and 401. For the 30x
   response codes listed above, the :mailheader:`Location` header is used to fetch
   the actual URL. For 401 response codes (authentication required), basic HTTP
   authentication is performed. For the 30x response codes, recursion is bounded
   by the value of the *maxtries* attribute, which defaults to 10.

   For all other response codes, the method :meth:`http_error_default` is called,
   which you can override in subclasses to handle the error appropriately.

   .. note::

      According to the letter of :rfc:`2616`, 301 and 302 responses to POST requests
      must not be automatically redirected without confirmation by the user. In
      reality, browsers do allow automatic redirection of these responses, changing
      the POST to a GET, and :mod:`urllib` reproduces this behaviour.

   The parameters to the constructor are the same as those for :class:`URLopener`.

   .. note::

      When performing basic authentication, a :class:`FancyURLopener` instance calls
      its :meth:`prompt_user_passwd` method. The default implementation asks the
      user for the required information on the controlling terminal. A subclass may
      override this method to support more appropriate behavior if needed.

   The :class:`FancyURLopener` class offers one additional method that should be
   overridden to provide the appropriate behavior:

   .. method:: prompt_user_passwd(host, realm)

      Return information needed to authenticate the user at the given host in the
      specified security realm. The return value should be a tuple, ``(user,
      password)``, which can be used for basic authentication.

      The implementation prompts for this information on the terminal; an application
      should override this method to use an appropriate interaction model in the local
      environment.

.. exception:: ContentTooShortError(msg[, content])

   This exception is raised when the :func:`urlretrieve` function detects that the
   amount of the downloaded data is less than the expected amount (given by the
   *Content-Length* header). The :attr:`content` attribute stores the downloaded
   (and supposedly truncated) data.


:mod:`urllib` Restrictions
--------------------------

  .. index::
     pair: HTTP; protocol
     pair: FTP; protocol

* Currently, only the following protocols are supported: HTTP (versions 0.9 and
  1.0), FTP, and local files.

* The caching feature of :func:`urlretrieve` has been disabled until proper
  processing of expiration-time headers is implemented.

* There should be a function to query whether a particular URL is in the cache.

* For backward compatibility, if a URL appears to point to a local file but the
  file can't be opened, the URL is re-interpreted using the FTP protocol. This
  can sometimes cause confusing error messages.

* The :func:`urlopen` and :func:`urlretrieve` functions can cause arbitrarily
  long delays while waiting for a network connection to be set up. This means
  that it is difficult to build an interactive Web client using these functions
  without using threads.

  .. index::
     single: HTML
     pair: HTTP; protocol
     module: htmllib

* The data returned by :func:`urlopen` or :func:`urlretrieve` is the raw data
  returned by the server. This may be binary data (such as an image), plain text
  or (for example) HTML. The HTTP protocol provides type information in the reply
  header, which can be inspected by looking at the :mailheader:`Content-Type`
  header. If the returned data is HTML, you can use the module :mod:`htmllib` to
  parse it.

  .. index:: single: FTP

* The code handling the FTP protocol cannot differentiate between a file and a
  directory. This can lead to unexpected behavior when attempting to read a URL
  that points to a file that is not accessible. If the URL ends in a ``/``, it is
  assumed to refer to a directory and will be handled accordingly. But if an
  attempt to read a file leads to a 550 error (meaning the URL cannot be found or
  is not accessible, often for permission reasons), then the path is treated as a
  directory in order to handle the case when a directory is specified by a URL but
  the trailing ``/`` has been left off. This can cause misleading results when
  you try to fetch a file whose read permissions make it inaccessible; the FTP
  code will try to read it, fail with a 550 error, and then perform a directory
  listing for the unreadable file. If fine-grained control is needed, consider
  using the :mod:`ftplib` module, subclassing :class:`FancyURLopener`, or changing
  *_urlopener* to meet your needs.

* This module does not support the use of proxies which require authentication.
  This may be implemented in the future.

  .. index:: module: urlparse

* Although the :mod:`urllib` module contains (undocumented) routines to parse
  and unparse URL strings, the recommended interface for URL manipulation is in
  module :mod:`urlparse`.

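
The restriction above about arbitrarily long connection delays is usually
worked around with a worker thread. The following is only a sketch of that
idea: ``fetch`` is a hypothetical helper, the five-second wait is an arbitrary
choice, and a temporary local file stands in for a slow remote resource so the
example runs without a network. The fallback import is an assumption for
interpreters where these functions live in ``urllib.request``.

```python
import os
import tempfile
import threading

try:  # module layout documented on this page
    from urllib import urlopen, pathname2url
except ImportError:  # assumption: functions relocated in newer interpreters
    from urllib.request import urlopen, pathname2url


def fetch(url, results):
    """Read *url* completely and record either the body or the error."""
    try:
        f = urlopen(url)
        try:
            results[url] = f.read()
        finally:
            f.close()
    except IOError as exc:
        results[url] = exc


fd, path = tempfile.mkstemp()
os.write(fd, b"payload")
os.close(fd)
url = "file:" + pathname2url(path)

results = {}
worker = threading.Thread(target=fetch, args=(url, results))
worker.daemon = True   # a stuck fetch will not keep the interpreter alive
worker.start()
worker.join(5.0)       # the caller regains control after at most 5 seconds
finished = not worker.is_alive()
os.remove(path)
```

If ``finished`` is false, the caller can report a timeout and carry on while
the daemon thread is left to complete or be discarded at exit.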

.. _urllib-examples:

Examples
--------

Here is an example session that uses the ``GET`` method to retrieve a URL
containing parameters::

   >>> import urllib
   >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
   >>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
   >>> print(f.read())

The following example uses the ``POST`` method instead::

   >>> import urllib
   >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
   >>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
   >>> print(f.read())

The following example uses an explicitly specified HTTP proxy, overriding
environment settings::

   >>> import urllib
   >>> proxies = {'http': 'http://proxy.example.com:8080/'}
   >>> opener = urllib.FancyURLopener(proxies)
   >>> f = opener.open("http://www.python.org")
   >>> f.read()

The following example uses no proxies at all, overriding environment settings::

   >>> import urllib
   >>> opener = urllib.FancyURLopener({})
   >>> f = opener.open("http://www.python.org/")
   >>> f.read()