==============================================
HOWTO Fetch Internet Resources Using urllib2
==============================================

----------------------------
Fetching URLs With Python
----------------------------


.. note::

    There is a French translation of an earlier revision of this
    HOWTO, available at `urllib2 - Le Manuel manquant
    <http://www.voidspace.org.uk/python/articles/urllib2_francais.shtml>`_.

.. contents:: urllib2 Tutorial


Introduction
============

.. sidebar:: Related Articles

    You may also find the following article on fetching web
    resources with Python useful:

    * `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_

      A tutorial on *Basic Authentication*, with examples in Python.

    This HOWTO is written by `Michael Foord
    <http://www.voidspace.org.uk/python/index.shtml>`_.

**urllib2** is a `Python <http://www.python.org>`_ module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the *urlopen* function. This is capable of fetching URLs using a variety
of different protocols. It also offers a slightly more complex
interface for handling common situations - like basic authentication,
cookies, proxies and so on. These are provided by objects called
handlers and openers.

urllib2 supports fetching URLs for many "URL schemes" (identified by the string
before the ":" in the URL - for example "ftp" is the URL scheme of
"ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.

For straightforward situations *urlopen* is very easy to use. But as
soon as you encounter errors or non-trivial cases when opening HTTP
URLs, you will need some understanding of the HyperText Transfer
Protocol. The most comprehensive and authoritative reference to HTTP
is :RFC:`2616`. This is a technical document and not intended to be
easy to read. This HOWTO aims to illustrate using *urllib2*, with
enough detail about HTTP to help you through. It is not intended to
replace the `urllib2 docs <http://docs.python.org/lib/module-urllib2.html>`_,
but is supplementary to them.


Fetching URLs
=============

The simplest way to use urllib2 is as follows::

    import urllib2
    response = urllib2.urlopen('http://python.org/')
    html = response.read()

Many uses of urllib2 will be that simple (note that instead of an
'http:' URL we could have used a URL starting with 'ftp:', 'file:',
etc.). However, it's the purpose of this tutorial to explain the more
complicated cases, concentrating on HTTP.

HTTP is based on requests and responses - the client makes requests
and servers send responses. urllib2 mirrors this with a ``Request``
object which represents the HTTP request you are making. In its
simplest form you create a Request object that specifies the URL you
want to fetch. Calling ``urlopen`` with this Request object returns a
response object for the URL requested. This response is a file-like
object, which means you can, for example, call ``.read()`` on the
response::

    import urllib2

    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)
    the_page = response.read()

Note that urllib2 makes use of the same Request interface to handle
all URL schemes. For example, you can make an FTP request like so::

    req = urllib2.Request('ftp://example.com/')

In the case of HTTP, there are two extra things that Request objects
allow you to do: First, you can pass data to be sent to the server.
Second, you can pass extra information ("metadata") *about* the data
or about the request itself, to the server - this information is sent
as HTTP "headers". Let's look at each of these in turn.

Data
----

Sometimes you want to send data to a URL (often the URL will refer to
a CGI (Common Gateway Interface) script [#]_ or other web
application). With HTTP, this is often done using what's known as a
**POST** request. This is often what your browser does when you submit
an HTML form that you filled in on the web. Not all POSTs have to come
from forms: you can use a POST to transmit arbitrary data to your own
application. In the common case of HTML forms, the data needs to be
encoded in a standard way, and then passed to the Request object as
the ``data`` argument. The encoding is done using a function from the
``urllib`` library, *not* from ``urllib2``. ::

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
    the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload
from HTML forms - see
`HTML Specification, Form Submission <http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_
for more details).

If you do not pass the ``data`` argument, urllib2 uses a **GET**
request. One way in which GET and POST requests differ is that POST
requests often have "side-effects": they change the state of the
system in some way (for example by placing an order with the website
for a hundredweight of tinned spam to be delivered to your door).
Though the HTTP standard makes it clear that POSTs are intended to
*always* cause side-effects, and GET requests *never* to cause
side-effects, nothing prevents a GET request from having side-effects,
nor a POST request from having no side-effects. Data can also be
passed in an HTTP GET request by encoding it in the URL itself.

This is done as follows::

    >>> import urllib2
    >>> import urllib
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.urlencode(data)
    >>> print url_values
    name=Somebody+Here&language=Python&location=Northampton
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> data = urllib2.urlopen(full_url)

Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.

Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to
add headers to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send
different versions to different browsers [#]_. By default urllib2
identifies itself as ``Python-urllib/x.y`` (where ``x`` and ``y`` are
the major and minor version numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
not work. The way a browser identifies itself is through the
``User-Agent`` header [#]_. When you create a Request object you can
pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }
    headers = { 'User-Agent' : user_agent }

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data, headers)
    response = urllib2.urlopen(req)
    the_page = response.read()

The response also has two useful methods. See the section on `info and
geturl`_ which comes after we have a look at what happens when things
go wrong.


Handling Exceptions
===================

*urlopen* raises ``URLError`` when it cannot handle a response (though
as usual with Python APIs, builtin exceptions such as ValueError,
TypeError etc. may also be raised).

``HTTPError`` is the subclass of ``URLError`` raised in the specific
case of HTTP URLs.

URLError
--------

Often, URLError is raised because there is no network connection (no
route to the specified server), or the specified server doesn't exist.
In this case, the exception raised will have a 'reason' attribute,
which is a tuple containing an error code and a text error message.

e.g. ::

    >>> req = urllib2.Request('http://www.pretend_server.org')
    >>> try:
    ...     urllib2.urlopen(req)
    ... except URLError, e:
    ...     print e.reason
    ...
    (4, 'getaddrinfo failed')


HTTPError
---------

Every HTTP response from the server contains a numeric "status
code". Sometimes the status code indicates that the server is unable
to fulfil the request. The default handlers will handle some of these
responses for you (for example, if the response is a "redirection"
that requests the client fetch the document from a different URL,
urllib2 will handle that for you). For those it can't handle, urlopen
will raise an ``HTTPError``. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication
required).

See section 10 of :RFC:`2616` for a reference on all the HTTP error
codes.

The ``HTTPError`` instance raised will have an integer 'code'
attribute, which corresponds to the error sent by the server.

Error Codes
~~~~~~~~~~~

Because the default handlers handle redirects (codes in the 300
range), and codes in the 100-299 range indicate success, you will
usually only see error codes in the 400-599 range.

``BaseHTTPServer.BaseHTTPRequestHandler.responses`` is a useful
dictionary of response codes that shows all the response codes used
by :RFC:`2616`. The dictionary is reproduced here for convenience::

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
        }

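Since the table is importable from the standard library, you can also look a
status code up at run time rather than memorising it. A minimal interactive
sketch::

    >>> import BaseHTTPServer
    >>> BaseHTTPServer.BaseHTTPRequestHandler.responses[404]
    ('Not Found', 'Nothing matches the given URI')
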
When an error is raised the server responds by returning an HTTP error
code *and* an error page. You can use the ``HTTPError`` instance as a
response on the page returned. This means that as well as the code
attribute, it also has read, geturl, and info methods. ::

    >>> req = urllib2.Request('http://www.python.org/fish.html')
    >>> try:
    ...     urllib2.urlopen(req)
    ... except URLError, e:
    ...     print e.code
    ...     print e.read()
    ...
    404
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
    <?xml-stylesheet href="./css/ht2html.css"
        type="text/css"?>
    <html><head><title>Error 404: File Not Found</title>
    ...... etc...

Wrapping it Up
--------------

So if you want to be prepared for ``HTTPError`` *or* ``URLError``
there are two basic approaches. I prefer the second approach.

Number 1
~~~~~~~~

::

    from urllib2 import Request, urlopen, URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError, e:
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    except URLError, e:
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    else:
        # everything is fine


.. note::

    The ``except HTTPError`` *must* come first, otherwise ``except URLError``
    will *also* catch an ``HTTPError``.

Number 2
~~~~~~~~

::

    from urllib2 import Request, urlopen, URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError, e:
        if hasattr(e, 'reason'):
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
        elif hasattr(e, 'code'):
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
    else:
        # everything is fine


info and geturl
===============

The response returned by urlopen (or the ``HTTPError`` instance) has
two useful methods, ``info`` and ``geturl``.

**geturl** - this returns the real URL of the page fetched. This is
useful because ``urlopen`` (or the opener object used) may have
followed a redirect. The URL of the page fetched may not be the same
as the URL requested.

**info** - this returns a dictionary-like object that describes the
page fetched, particularly the headers sent by the server. It is
currently an ``httplib.HTTPMessage`` instance.

Typical headers include 'Content-length', 'Content-type', and so
on. See the
`Quick Reference to HTTP Headers <http://www.cs.tut.fi/~jkorpela/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.
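
For example, a quick way to see both in action is to print them after a
fetch. This is only a sketch - the URL is an arbitrary choice, and the headers
you get back will depend entirely on the server::

    import urllib2

    response = urllib2.urlopen('http://www.python.org/')

    # The URL that was actually fetched (it may differ from the one
    # requested if a redirect was followed).
    print response.geturl()

    # The httplib.HTTPMessage instance holding the response headers ...
    print response.info()

    # ... which can also be indexed like a dictionary for single headers.
    print response.info()['Content-Type']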


Openers and Handlers
====================

When you fetch a URL you use an opener (an instance of the perhaps
confusingly-named ``urllib2.OpenerDirector``). So far we have been using
the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL
scheme (http, ftp, etc.), or how to handle an aspect of URL opening,
for example HTTP redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with
specific handlers installed, for example to get an opener that handles
cookies, or to get an opener that does not handle redirections.

To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.

Alternatively, you can use ``build_opener``, which is a convenience
function for creating opener objects with a single function call.
``build_opener`` adds several handlers by default, but provides a
quick way to add more and/or override the default handlers.

Other sorts of handlers you might want can handle proxies,
authentication, and other common but slightly specialised
situations.

``install_opener`` can be used to make an ``opener`` object the
(global) default opener. This means that calls to ``urlopen`` will use
the opener you have installed.

Opener objects have an ``open`` method, which can be called directly
to fetch urls in the same way as the ``urlopen`` function: there's no
need to call ``install_opener``, except as a convenience. A short
sketch of both approaches follows below.
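
Here is a minimal sketch that ties these pieces together. It assumes you want
cookie handling - via the standard ``HTTPCookieProcessor`` - and the URL used
is just a placeholder::

    import cookielib
    import urllib2

    # build_opener keeps the default handlers and adds our cookie handler.
    cookie_jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

    # Use the opener directly, without installing it ...
    response = opener.open('http://www.example.com/')

    # ... or install it, so that urllib2.urlopen uses it from now on.
    urllib2.install_opener(opener)
    response = urllib2.urlopen('http://www.example.com/')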


Basic Authentication
====================

To illustrate creating and installing a handler we will use the
``HTTPBasicAuthHandler``. For a more detailed discussion of this
subject - including an explanation of how Basic Authentication works -
see the `Basic Authentication Tutorial <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_.

When authentication is required, the server sends a header (as well as
the 401 error code) requesting authentication. This specifies the
authentication scheme and a 'realm'. The header looks like:
``Www-authenticate: SCHEME realm="REALM"``.

e.g. ::

    Www-authenticate: Basic realm="cPanel Users"


The client should then retry the request with the appropriate name and
password for the realm included as a header in the request. This is
'basic authentication'. In order to simplify this process we can
create an instance of ``HTTPBasicAuthHandler`` and an opener to use
this handler.

The ``HTTPBasicAuthHandler`` uses an object called a password manager
to handle the mapping of URLs and realms to passwords and
usernames. If you know what the realm is (from the authentication
header sent by the server), then you can use an
``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In
that case, it is convenient to use
``HTTPPasswordMgrWithDefaultRealm``. This allows you to specify a
default username and password for a URL. This will be supplied in the
absence of you providing an alternative combination for a specific
realm. We indicate this by providing ``None`` as the realm argument to
the ``add_password`` method.

The top-level URL is the first URL that requires authentication. URLs
"deeper" than the URL you pass to ``add_password()`` will also match. ::

    # create a password manager
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password.
    # If we knew the realm, we could use it instead of ``None``.
    top_level_url = "http://example.com/foo/"
    password_mgr.add_password(None, top_level_url, username, password)

    handler = urllib2.HTTPBasicAuthHandler(password_mgr)

    # create "opener" (OpenerDirector instance)
    opener = urllib2.build_opener(handler)

    # use the opener to fetch a URL
    opener.open(a_url)

    # Install the opener.
    # Now all calls to urllib2.urlopen use our opener.
    urllib2.install_opener(opener)

.. note::

    In the above example we only supplied our ``HTTPBasicAuthHandler``
    to ``build_opener``. By default openers have the handlers for
    normal situations - ``ProxyHandler``, ``UnknownHandler``,
    ``HTTPHandler``, ``HTTPDefaultErrorHandler``,
    ``HTTPRedirectHandler``, ``FTPHandler``, ``FileHandler``,
    ``HTTPErrorProcessor``.

``top_level_url`` is in fact *either* a full URL (including the 'http:'
scheme component and the hostname and optionally the port number)
e.g. "http://example.com/" *or* an "authority" (i.e. the hostname,
optionally including the port number) e.g. "example.com" or
"example.com:8080" (the latter example includes a port number). The
authority, if present, must NOT contain the "userinfo" component - for
example "joe:password@example.com" is not correct.
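
For example, either of the following calls would be acceptable for the
password manager created above (the username and password are purely
illustrative)::

    # A full URL, including the scheme:
    password_mgr.add_password(None, "http://example.com/foo/", "joe", "secret")

    # Just an authority, with an explicit port:
    password_mgr.add_password(None, "example.com:8080", "joe", "secret")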


Proxies
=======

**urllib2** will auto-detect your proxy settings and use those. This
is through the ``ProxyHandler`` which is part of the normal handler
chain. Normally that's a good thing, but there are occasions when it
may not be helpful [#]_. One way to prevent urllib2 using the proxy is
to set up our own ``ProxyHandler``, with no proxies defined. This is
done using similar steps to setting up a `Basic Authentication`_
handler::

    >>> proxy_support = urllib2.ProxyHandler({})
    >>> opener = urllib2.build_opener(proxy_support)
    >>> urllib2.install_opener(opener)

.. note::

    Currently ``urllib2`` *does not* support fetching of ``https``
    locations through a proxy. This can be a problem.
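
Conversely, if you want to force requests through a particular proxy rather
than the auto-detected one, pass ``ProxyHandler`` an explicit mapping of
scheme to proxy URL. The proxy address below is just a placeholder::

    >>> proxy_support = urllib2.ProxyHandler({'http': 'http://proxy.example.com:3128'})
    >>> opener = urllib2.build_opener(proxy_support)
    >>> urllib2.install_opener(opener)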

Sockets and Layers
==================

The Python support for fetching resources from the web is
layered. urllib2 uses the httplib library, which in turn uses the
socket library.

As of Python 2.3 you can specify how long a socket should wait for a
response before timing out. This can be useful in applications which
have to fetch web pages. By default the socket module has *no timeout*
and can hang. Currently, the socket timeout is not exposed at the
httplib or urllib2 levels. However, you can set the default timeout
globally for all sockets using::

    import socket
    import urllib2

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib2.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)


-------


Footnotes
=========

This document was reviewed and revised by John Lee.

.. [#] For an introduction to the CGI protocol see
       `Writing Web Applications in Python <http://www.pyzine.com/Issue008/Section_Articles/article_CGIOne.html>`_.
.. [#] Like Google for example. The *proper* way to use Google from a program
       is to use `PyGoogle <http://pygoogle.sourceforge.net>`_ of course. See
       `Voidspace Google <http://www.voidspace.org.uk/python/recipebook.shtml#google>`_
       for some examples of using the Google API.
.. [#] Browser sniffing is a very bad practice for website design - building
       sites using web standards is much more sensible. Unfortunately a lot of
       sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is
       *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see
       `Quick Reference to HTTP Headers`_.
.. [#] In my case I have to use a proxy to access the internet at work. If you
       attempt to fetch *localhost* URLs through this proxy it blocks them. IE
       is set to use the proxy, which urllib2 picks up on. In order to test
       scripts with a localhost server, I have to prevent urllib2 from using
       the proxy.