:mod:`urllib` --- Open arbitrary resources by URL
=================================================

.. module:: urllib
   :synopsis: Open an arbitrary network resource by URL (requires sockets).

.. note::
    The :mod:`urllib` module has been split into parts and renamed in
    Python 3.0 to :mod:`urllib.request`, :mod:`urllib.parse`,
    and :mod:`urllib.error`. The :term:`2to3` tool will automatically adapt
    imports when converting your sources to 3.0.
    Also note that the :func:`urllib.urlopen` function has been removed in
    Python 3.0 in favor of :func:`urllib2.urlopen`.

.. index::
   single: WWW
   single: World Wide Web
   single: URL

This module provides a high-level interface for fetching data across the World
Wide Web.  In particular, the :func:`urlopen` function is similar to the
built-in function :func:`open`, but accepts Uniform Resource Locators (URLs)
instead of filenames.  Some restrictions apply --- it can only open URLs for
reading, and no seek operations are available.

.. warning:: When opening HTTPS URLs, this module does not attempt to validate
   the server certificate.  Use at your own risk!


High-level interface
--------------------

.. function:: urlopen(url[, data[, proxies]])

   Open a network object denoted by a URL for reading.  If the URL does not have a
   scheme identifier, or if it has :file:`file:` as its scheme identifier, this
   opens a local file (without universal newlines); otherwise it opens a socket to
   a server somewhere on the network.  If the connection cannot be made, the
   :exc:`IOError` exception is raised.  If all went well, a file-like object is
   returned.  This supports the following methods: :meth:`read`, :meth:`readline`,
   :meth:`readlines`, :meth:`fileno`, :meth:`close`, :meth:`info`, :meth:`getcode` and
   :meth:`geturl`.  It also has proper support for the :term:`iterator` protocol. One
   caveat: the :meth:`read` method, if the size argument is omitted or negative,
   may not read until the end of the data stream; there is no good way to determine
   that the entire stream from a socket has been read in the general case.

   Except for the :meth:`info`, :meth:`getcode` and :meth:`geturl` methods,
   these methods have the same interface as for file objects --- see section
   :ref:`bltin-file-objects` in this manual.  (It is not a built-in file object,
   however, so it can't be used at those few places where a true built-in file
   object is required.)

   .. index:: module: mimetools

   The :meth:`info` method returns an instance of the class
   :class:`mimetools.Message` containing meta-information associated with the
   URL.  When the scheme is HTTP, these headers are those returned by the server
   at the head of the retrieved HTML page (including Content-Length and
   Content-Type).  When the scheme is FTP, a Content-Length header will be
   present if (as is now usual) the server passed back a file length in response
   to the FTP retrieval request. A Content-Type header will be present if the
   MIME type can be guessed.  For local files, returned headers
   will include a Date representing the file's last-modified time, a
   Content-Length giving file size, and a Content-Type containing a guess at the
   file's type. See also the description of the :mod:`mimetools` module.

   The :meth:`geturl` method returns the real URL of the page.  In some cases, the
   HTTP server redirects a client to another URL.  The :func:`urlopen` function
   handles this transparently, but in some cases the caller needs to know which URL
   the client was redirected to.  The :meth:`geturl` method can be used to get at
   this redirected URL.

   The :meth:`getcode` method returns the HTTP status code that was sent with the
   response, or ``None`` if the URL is not an HTTP URL.

   If the *url* uses the :file:`http:` scheme identifier, the optional *data*
   argument may be given to specify a ``POST`` request (normally the request type
   is ``GET``).  The *data* argument must be in standard
   :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
   function below.

   The :func:`urlopen` function works transparently with proxies which do not
   require authentication.  In a Unix or Windows environment, set the
   :envvar:`http_proxy` or :envvar:`ftp_proxy` environment variable to a URL that
   identifies the proxy server before starting the Python interpreter.  For example
   (the ``'%'`` is the command prompt)::

      % http_proxy="http://www.someproxy.com:3128"
      % export http_proxy
      % python
      ...

   The :envvar:`no_proxy` environment variable can be used to specify hosts which
   shouldn't be reached via proxy; if set, it should be a comma-separated list
   of hostname suffixes, optionally with ``:port`` appended, for example
   ``cern.ch,ncsa.uiuc.edu,some.host:8080``.

   In a Windows environment, if no proxy environment variables are set, proxy
   settings are obtained from the registry's Internet Settings section.

   .. index:: single: Internet Config

   In a Mac OS X environment, :func:`urlopen` will retrieve proxy information
   from the OS X System Configuration Framework, which can be managed with
   the Network System Preferences panel.

   Alternatively, the optional *proxies* argument may be used to explicitly specify
   proxies.  It must be a dictionary mapping scheme names to proxy URLs, where an
   empty dictionary causes no proxies to be used, and ``None`` (the default value)
   causes environmental proxy settings to be used as discussed above.  For
   example::

      # Use http://www.someproxy.com:3128 for http proxying
      proxies = {'http': 'http://www.someproxy.com:3128'}
      filehandle = urllib.urlopen(some_url, proxies=proxies)
      # Don't use any proxies
      filehandle = urllib.urlopen(some_url, proxies={})
      # Use proxies from environment - both versions are equivalent
      filehandle = urllib.urlopen(some_url, proxies=None)
      filehandle = urllib.urlopen(some_url)

   Proxies which require authentication for use are not currently supported; this
   is considered an implementation limitation.

   .. versionchanged:: 2.3
      Added the *proxies* support.

   .. versionchanged:: 2.6
      Added :meth:`getcode` to returned object and support for the
      :envvar:`no_proxy` environment variable.

   .. deprecated:: 2.6
      The :func:`urlopen` function has been removed in Python 3.0 in favor
      of :func:`urllib2.urlopen`.

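As a minimal sketch of the file-like object that :func:`urlopen` returns, the
following opens a ``file:`` URL so that no network access is needed.  The
temporary file and its contents are made up for illustration, and the import
fallback is an assumption added so the snippet also runs on Python 3, where the
function lives in :mod:`urllib.request` and :meth:`read` returns bytes:

```python
import os
import tempfile

try:
    from urllib import urlopen, pathname2url          # Python 2 (this module)
except ImportError:
    # Assumption: fall back to the Python 3 module layout
    from urllib.request import urlopen, pathname2url

# Create a small local file to stand in for a remote resource
fd, path = tempfile.mkstemp(suffix=".txt")
os.write(fd, b"hello from urllib\n")
os.close(fd)

f = urlopen("file:" + pathname2url(path))
try:
    body = f.read()        # raw data of the resource
    headers = f.info()     # meta-information, e.g. Content-Length
    final_url = f.geturl() # URL that was actually opened
finally:
    f.close()
os.remove(path)
```

For a local file, ``headers["Content-Length"]`` holds the file size as a string.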

.. function:: urlretrieve(url[, filename[, reporthook[, data]]])

   Copy a network object denoted by a URL to a local file, if necessary. If the URL
   points to a local file, or a valid cached copy of the object exists, the object
   is not copied.  Return a tuple ``(filename, headers)`` where *filename* is the
   local file name under which the object can be found, and *headers* is whatever
   the :meth:`info` method of the object returned by :func:`urlopen` returned (for
   a remote object, possibly cached). Exceptions are the same as for
   :func:`urlopen`.

   The second argument, if present, specifies the file location to copy to (if
   absent, the location will be a tempfile with a generated name). The third
   argument, if present, is a hook function that will be called once on
   establishment of the network connection and once after each block read
   thereafter.  The hook will be passed three arguments: a count of blocks
   transferred so far, a block size in bytes, and the total size of the file.  The
   total size passed to the hook may be ``-1`` on older FTP servers which do not
   return a file size in response to a retrieval request.

   If the *url* uses the :file:`http:` scheme identifier, the optional *data*
   argument may be given to specify a ``POST`` request (normally the request type
   is ``GET``).  The *data* argument must be in standard
   :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
   function below.

   .. versionchanged:: 2.5
      :func:`urlretrieve` will raise :exc:`ContentTooShortError` when it detects that
      the amount of data available was less than the expected amount (which is the
      size reported by a *Content-Length* header). This can occur, for example, when
      the download is interrupted.

      The *Content-Length* is treated as a lower bound: if there's more data to read,
      :func:`urlretrieve` reads more data, but if less data is available, it raises
      the exception.

      You can still retrieve the downloaded data in this case; it is stored in the
      :attr:`content` attribute of the exception instance.

      If no *Content-Length* header was supplied, :func:`urlretrieve` cannot check
      the size of the data it has downloaded, and just returns it.  In this case you
      just have to assume that the download was successful.

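The hook protocol can be illustrated with a small sketch.  The ``report``
function, the file names and the use of a ``file:`` URL are all choices made
for this example; it assumes that the hook is also invoked for a ``file:`` URL
when an explicit target filename is given, and the import fallback is an
assumption for running under Python 3:

```python
import os
import tempfile

try:
    from urllib import urlretrieve, pathname2url          # Python 2 (this module)
except ImportError:
    # Assumption: fall back to the Python 3 module layout
    from urllib.request import urlretrieve, pathname2url

calls = []

def report(block_count, block_size, total_size):
    """Record each progress callback; total_size may be -1 if unknown."""
    calls.append((block_count, block_size, total_size))

# Hypothetical source and target files in a scratch directory
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "source.bin")
target = os.path.join(workdir, "copy.bin")
with open(source, "wb") as fh:
    fh.write(b"x" * 1000)

filename, headers = urlretrieve("file:" + pathname2url(source), target, report)

with open(filename, "rb") as fh:
    copied = fh.read()
```

A progress display would typically compute ``block_count * block_size`` and
compare it with ``total_size`` when the latter is not ``-1``.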

.. data:: _urlopener

   The public functions :func:`urlopen` and :func:`urlretrieve` create an instance
   of the :class:`FancyURLopener` class and use it to perform their requested
   actions.  To override this functionality, programmers can create a subclass of
   :class:`URLopener` or :class:`FancyURLopener`, then assign an instance of that
   class to the ``urllib._urlopener`` variable before calling the desired function.
   For example, applications may want to specify a different
   :mailheader:`User-Agent` header than :class:`URLopener` defines.  This can be
   accomplished with the following code::

      import urllib

      class AppURLopener(urllib.FancyURLopener):
          version = "App/1.7"

      urllib._urlopener = AppURLopener()


.. function:: urlcleanup()

   Clear the cache that may have been built up by previous calls to
   :func:`urlretrieve`.


Utility functions
-----------------

.. function:: quote(string[, safe])

   Replace special characters in *string* using the ``%xx`` escape. Letters,
   digits, and the characters ``'_.-'`` are never quoted. By default, this
   function is intended for quoting the path section of the URL. The optional
   *safe* parameter specifies additional characters that should not be quoted
   --- its default value is ``'/'``.

   Example: ``quote('/~connolly/')`` yields ``'/%7Econnolly/'``.

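A short sketch of the effect of the *safe* parameter follows.  The sample
strings are made up, and the import fallback is an assumption so that the
snippet also runs on Python 3, where the function lives in
:mod:`urllib.parse`:

```python
try:
    from urllib import quote        # Python 2 (this module)
except ImportError:
    # Assumption: fall back to the Python 3 module layout
    from urllib.parse import quote

# Default safe='/': slashes are kept, other special characters are escaped
path = quote("/my docs/a&b.txt")

# safe='': the slash is escaped too, e.g. when quoting a single path segment
segment = quote("a/b", safe="")
```

Here ``path`` is ``'/my%20docs/a%26b.txt'`` and ``segment`` is ``'a%2Fb'``.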

.. function:: quote_plus(string[, safe])

   Like :func:`quote`, but also replaces spaces by plus signs, as required for
   quoting HTML form values when building up a query string to go into a URL.
   Plus signs in the original string are escaped unless they are included in
   *safe*.  Unlike :func:`quote`, *safe* does not default to ``'/'``.

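The differences from :func:`quote` can be seen side by side in a small sketch.
The sample values are invented, and the import fallback is an assumption for
running under Python 3:

```python
try:
    from urllib import quote, quote_plus        # Python 2 (this module)
except ImportError:
    # Assumption: fall back to the Python 3 module layout
    from urllib.parse import quote, quote_plus

value = "fish & chips+salt"

plus_quoted = quote_plus(value)   # spaces become '+', literal '+' is escaped
path_quoted = quote(value)        # spaces become '%20'

# '/' is escaped by quote_plus() but kept by quote()
slash_plus = quote_plus("a/b")
slash_quote = quote("a/b")
```

The results are ``'fish+%26+chips%2Bsalt'`` and ``'fish%20%26%20chips%2Bsalt'``
for the first pair, and ``'a%2Fb'`` and ``'a/b'`` for the second.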

.. function:: unquote(string)

   Replace ``%xx`` escapes by their single-character equivalent.

   Example: ``unquote('/%7Econnolly/')`` yields ``'/~connolly/'``.


.. function:: unquote_plus(string)

   Like :func:`unquote`, but also replaces plus signs by spaces, as required for
   unquoting HTML form values.

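As a brief illustration of how the two unquoting functions treat plus signs,
consider the sketch below.  The encoded string is an invented example, and the
import fallback is an assumption for running under Python 3:

```python
try:
    from urllib import unquote, unquote_plus        # Python 2 (this module)
except ImportError:
    # Assumption: fall back to the Python 3 module layout
    from urllib.parse import unquote, unquote_plus

encoded = "fish+%26+chips%2Bsalt"

form_value = unquote_plus(encoded)  # '+' is turned into a space first
path_value = unquote(encoded)       # '+' is left untouched
```

``form_value`` is ``'fish & chips+salt'``, whereas ``path_value`` is
``'fish+&+chips+salt'``.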

.. function:: urlencode(query[, doseq])

   Convert a mapping object or a sequence of two-element tuples to a
   "percent-encoded" string, suitable to pass to :func:`urlopen` above as the
   optional *data* argument.  This is useful to pass a dictionary of form
   fields to a ``POST`` request.  The resulting string is a series of
   ``key=value`` pairs separated by ``'&'`` characters, where both *key* and
   *value* are quoted using :func:`quote_plus` above.  When a sequence of
   two-element tuples is used as the *query* argument, the first element of
   each tuple is a key and the second is a value. The value element itself
   can be a sequence; in that case, if the optional parameter *doseq*
   evaluates to ``True``, individual ``key=value`` pairs separated by ``'&'`` are
   generated for each element of the value sequence for the key.  The order of
   parameters in the encoded string will match the order of parameter tuples in
   the sequence. The :mod:`urlparse` module provides the functions
   :func:`parse_qs` and :func:`parse_qsl` which are used to parse query strings
   into Python data structures.

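The following sketch shows a sequence of tuples being encoded, with and without
a sequence value expanded through *doseq*, and then parsed back.  The keys and
values are invented, and the import fallback is an assumption for running under
Python 3, where both functions live in :mod:`urllib.parse`:

```python
try:
    from urllib import urlencode        # Python 2 (this module)
    from urlparse import parse_qs
except ImportError:
    # Assumption: fall back to the Python 3 module layout
    from urllib.parse import urlencode, parse_qs

# Order of the tuples is preserved; values are quoted with quote_plus()
simple = urlencode([("spam", 1), ("eggs", 2), ("q", "fish & chips")])

# With doseq=True each element of a sequence value gets its own key=value pair
multi = urlencode([("tag", ["a", "b"]), ("page", 1)], doseq=True)

# parse_qs() turns the query string back into a dictionary of lists
decoded = parse_qs(multi)
```

``simple`` is ``'spam=1&eggs=2&q=fish+%26+chips'`` and ``multi`` is
``'tag=a&tag=b&page=1'``.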

.. function:: pathname2url(path)

   Convert the pathname *path* from the local syntax for a path to the form used in
   the path component of a URL.  This does not produce a complete URL.  The return
   value will already be quoted using the :func:`quote` function.


.. function:: url2pathname(path)

   Convert the path component *path* from a percent-encoded URL to the local
   syntax for a path.  This does not accept a complete URL.  This function uses
   :func:`unquote` to decode *path*.

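A round trip through both functions can be sketched as follows.  The relative
path is invented for the example, and the import fallback is an assumption for
running under Python 3, where the functions live in :mod:`urllib.request`:

```python
import os

try:
    from urllib import pathname2url, url2pathname          # Python 2 (this module)
except ImportError:
    # Assumption: fall back to the Python 3 module layout
    from urllib.request import pathname2url, url2pathname

# Build the path with os.path.join so it uses the local separator
local_path = os.path.join("my docs", "a file.txt")

url_path = pathname2url(local_path)    # quoted URL path component, not a full URL
round_trip = url2pathname(url_path)    # back to the local syntax
```

The spaces appear as ``%20`` in ``url_path``, and ``round_trip`` equals the
original ``local_path``.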

.. function:: getproxies()

   This helper function returns a dictionary of scheme to proxy server URL
   mappings. On all operating systems it first scans the environment, in a
   case-insensitive way, for variables named ``<scheme>_proxy``.  If it finds
   none, it looks up proxy information in the Mac OS X System Configuration on
   Mac OS X and in the Windows Systems Registry on Windows.

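As a small sketch of the environment lookup, the snippet below sets
:envvar:`http_proxy` for the current process and reads it back.  The proxy URL
is a placeholder, and the import fallback is an assumption for running under
Python 3, where the function lives in :mod:`urllib.request`:

```python
import os

try:
    from urllib import getproxies          # Python 2 (this module)
except ImportError:
    # Assumption: fall back to the Python 3 module layout
    from urllib.request import getproxies

# Placeholder proxy URL; this changes the environment of the running process
os.environ["http_proxy"] = "http://proxy.example.com:3128"

proxies = getproxies()            # maps scheme names to proxy URLs
http_proxy = proxies.get("http")
```

The resulting dictionary has the same shape as the *proxies* argument accepted
by :func:`urlopen`.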

URL Opener objects
------------------

.. class:: URLopener([proxies[, **x509]])

   Base class for opening and reading URLs.  Unless you need to support opening
   objects using schemes other than :file:`http:`, :file:`ftp:`, or :file:`file:`,
   you probably want to use :class:`FancyURLopener`.

   By default, the :class:`URLopener` class sends a :mailheader:`User-Agent` header
   of ``urllib/VVV``, where *VVV* is the :mod:`urllib` version number.
   Applications can define their own :mailheader:`User-Agent` header by subclassing
   :class:`URLopener` or :class:`FancyURLopener` and setting the class attribute
   :attr:`version` to an appropriate string value in the subclass definition.

   The optional *proxies* parameter should be a dictionary mapping scheme names to
   proxy URLs, where an empty dictionary turns proxies off completely.  Its default
   value is ``None``, in which case environmental proxy settings will be used if
   present, as discussed in the definition of :func:`urlopen`, above.

   Additional keyword parameters, collected in *x509*, may be used for
   authentication of the client when using the :file:`https:` scheme.  The keywords
   *key_file* and *cert_file* are supported to provide an SSL key and certificate;
   both are needed to support client authentication.

   :class:`URLopener` objects will raise an :exc:`IOError` exception if the server
   returns an error code.

    .. method:: open(fullurl[, data])

       Open *fullurl* using the appropriate protocol.  This method sets up cache and
       proxy information, then calls the appropriate open method with its input
       arguments.  If the scheme is not recognized, :meth:`open_unknown` is called.
       The *data* argument has the same meaning as the *data* argument of
       :func:`urlopen`.


    .. method:: open_unknown(fullurl[, data])

       Overridable interface to open unknown URL types.


    .. method:: retrieve(url[, filename[, reporthook[, data]]])

       Retrieves the contents of *url* and places it in *filename*.  The return value
       is a tuple consisting of a local filename and either a
       :class:`mimetools.Message` object containing the response headers (for remote
       URLs) or ``None`` (for local URLs).  The caller must then open and read the
       contents of *filename*.  If *filename* is not given and the URL refers to a
       local file, the input filename is returned.  If the URL is non-local and
       *filename* is not given, the filename is the output of :func:`tempfile.mktemp`
       with a suffix that matches the suffix of the last path component of the input
       URL.  If *reporthook* is given, it must be a function accepting three numeric
       parameters.  It will be called after each chunk of data is read from the
       network.  *reporthook* is ignored for local URLs.

       If the *url* uses the :file:`http:` scheme identifier, the optional *data*
       argument may be given to specify a ``POST`` request (normally the request type
       is ``GET``).  The *data* argument must be in standard
       :mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
       function.


    .. attribute:: version

       Variable that specifies the user agent of the opener object.  To get
       :mod:`urllib` to tell servers that it is a particular user agent, set this in a
       subclass as a class variable or in the constructor before calling the base
       constructor.


.. class:: FancyURLopener(...)

   :class:`FancyURLopener` subclasses :class:`URLopener` providing default handling
   for the following HTTP response codes: 301, 302, 303, 307 and 401.  For the 30x
   response codes listed above, the :mailheader:`Location` header is used to fetch
   the actual URL.  For 401 response codes (authentication required), basic HTTP
   authentication is performed.  For the 30x response codes, recursion is bounded
   by the value of the *maxtries* attribute, which defaults to 10.

   For all other response codes, the method :meth:`http_error_default` is called,
   which you can override in subclasses to handle the error appropriately.

   .. note::

      According to the letter of :rfc:`2616`, 301 and 302 responses to POST requests
      must not be automatically redirected without confirmation by the user.  In
      reality, browsers do allow automatic redirection of these responses, changing
      the POST to a GET, and :mod:`urllib` reproduces this behaviour.

   The parameters to the constructor are the same as those for :class:`URLopener`.

   .. note::

      When performing basic authentication, a :class:`FancyURLopener` instance calls
      its :meth:`prompt_user_passwd` method.  The default implementation asks the
      user for the required information on the controlling terminal.  A subclass may
      override this method to support more appropriate behavior if needed.

    The :class:`FancyURLopener` class offers one additional method that should be
    overridden to provide the appropriate behavior:

    .. method:: prompt_user_passwd(host, realm)

       Return information needed to authenticate the user at the given host in the
       specified security realm.  The return value should be a tuple, ``(user,
       password)``, which can be used for basic authentication.

       The implementation prompts for this information on the terminal; an application
       should override this method to use an appropriate interaction model in the local
       environment.

.. exception:: ContentTooShortError(msg[, content])

   This exception is raised when the :func:`urlretrieve` function detects that the
   amount of the downloaded data is less than the expected amount (given by the
   *Content-Length* header). The :attr:`content` attribute stores the downloaded
   (and presumably truncated) data.

   .. versionadded:: 2.5

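One way to handle a truncated download is sketched below.  Both
``retrieve_or_partial`` and ``fake_retrieve`` are hypothetical helpers written
for this example; the fake stands in for :func:`urlretrieve` so that the
snippet runs without a network, and the value it attaches as *content* is
invented.  The import fallback is an assumption for running under Python 3,
where the exception lives in :mod:`urllib.error`:

```python
try:
    from urllib import ContentTooShortError        # Python 2 (this module)
except ImportError:
    # Assumption: fall back to the Python 3 module layout
    from urllib.error import ContentTooShortError

def retrieve_or_partial(retrieve, url):
    """Call retrieve(url); on a short download return what was received.

    Returns a (result, complete) pair, where complete is False if the
    download was truncated.
    """
    try:
        return retrieve(url), True
    except ContentTooShortError as exc:
        return exc.content, False

def fake_retrieve(url):
    # Hypothetical stand-in for urlretrieve() that always reports truncation
    raise ContentTooShortError("retrieval incomplete", b"partial da")

result, complete = retrieve_or_partial(fake_retrieve, "http://www.example.com/big")
```

Returning a flag alongside the data lets the caller decide whether to retry or
to work with the partial result.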

:mod:`urllib` Restrictions
--------------------------

  .. index::
     pair: HTTP; protocol
     pair: FTP; protocol

* Currently, only the following protocols are supported: HTTP (versions 0.9 and
  1.0), FTP, and local files.

* The caching feature of :func:`urlretrieve` has been disabled until proper
  processing of expiration time headers is implemented.

* There should be a function to query whether a particular URL is in the cache.

* For backward compatibility, if a URL appears to point to a local file but the
  file can't be opened, the URL is re-interpreted using the FTP protocol.  This
  can sometimes cause confusing error messages.

* The :func:`urlopen` and :func:`urlretrieve` functions can cause arbitrarily
  long delays while waiting for a network connection to be set up.  This means
  that it is difficult to build an interactive Web client using these functions
  without using threads.

  .. index::
     single: HTML
     pair: HTTP; protocol
     module: htmllib

* The data returned by :func:`urlopen` or :func:`urlretrieve` is the raw data
  returned by the server.  This may be binary data (such as an image), plain text
  or (for example) HTML.  The HTTP protocol provides type information in the reply
  header, which can be inspected by looking at the :mailheader:`Content-Type`
  header.  If the returned data is HTML, you can use the module :mod:`htmllib` to
  parse it.

  .. index:: single: FTP

* The code handling the FTP protocol cannot differentiate between a file and a
  directory.  This can lead to unexpected behavior when attempting to read a URL
  that points to a file that is not accessible.  If the URL ends in a ``/``, it is
  assumed to refer to a directory and will be handled accordingly.  But if an
  attempt to read a file leads to a 550 error (meaning the URL cannot be found or
  is not accessible, often for permission reasons), then the path is treated as a
  directory in order to handle the case when a directory is specified by a URL but
  the trailing ``/`` has been left off.  This can cause misleading results when
  you try to fetch a file whose read permissions make it inaccessible; the FTP
  code will try to read it, fail with a 550 error, and then perform a directory
  listing for the unreadable file. If fine-grained control is needed, consider
  using the :mod:`ftplib` module, subclassing :class:`FancyURLopener`, or changing
  *_urlopener* to meet your needs.

* This module does not support the use of proxies which require authentication.
  This may be implemented in the future.

  .. index:: module: urlparse

* Although the :mod:`urllib` module contains (undocumented) routines to parse
  and unparse URL strings, the recommended interface for URL manipulation is in
  module :mod:`urlparse`.

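The :mailheader:`Content-Type` check mentioned in the list above can be sketched
as follows.  A temporary ``.html`` file opened through a ``file:`` URL stands in
for a server response, so the type shown is the module's guess from the file
name rather than a header sent by a server; the file contents are invented, and
the import fallback is an assumption for running under Python 3:

```python
import os
import tempfile

try:
    from urllib import urlopen, pathname2url          # Python 2 (this module)
except ImportError:
    # Assumption: fall back to the Python 3 module layout
    from urllib.request import urlopen, pathname2url

# Hypothetical HTML document standing in for a fetched page
fd, path = tempfile.mkstemp(suffix=".html")
os.write(fd, b"<html><body>hi</body></html>")
os.close(fd)

f = urlopen("file:" + pathname2url(path))
try:
    content_type = f.info()["Content-Type"]
    raw = f.read()   # raw data; no decoding or parsing has been applied
finally:
    f.close()
os.remove(path)

# Drop any parameters such as '; charset=...' before comparing
is_html = content_type.split(";")[0].strip() == "text/html"
```

Only when ``is_html`` is true would it make sense to hand ``raw`` to an HTML
parser.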

.. _urllib-examples:

Examples
--------

Here is an example session that uses the ``GET`` method to retrieve a URL
containing parameters::

   >>> import urllib
   >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
   >>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
   >>> print f.read()

The following example uses the ``POST`` method instead::

   >>> import urllib
   >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
   >>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
   >>> print f.read()

The following example uses an explicitly specified HTTP proxy, overriding
environment settings::

   >>> import urllib
   >>> proxies = {'http': 'http://proxy.example.com:8080/'}
   >>> opener = urllib.FancyURLopener(proxies)
   >>> f = opener.open("http://www.python.org")
   >>> f.read()

The following example uses no proxies at all, overriding environment settings::

   >>> import urllib
   >>> opener = urllib.FancyURLopener({})
   >>> f = opener.open("http://www.python.org/")
   >>> f.read()
