.. _urllib-howto:

***********************************************************
  HOWTO Fetch Internet Resources Using The urllib Package
***********************************************************

:Author: `Michael Foord <http://www.voidspace.org.uk/python/index.shtml>`_

.. note::

    There is a French translation of an earlier revision of this
    HOWTO, available at `urllib2 - Le Manuel manquant
    <http://www.voidspace.org.uk/python/articles/urllib2_francais.shtml>`_.


Introduction
============

.. sidebar:: Related Articles

    You may also find useful the following article on fetching web resources
    with Python:

    * `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_

        A tutorial on *Basic Authentication*, with examples in Python.

**urllib.request** is a `Python <http://www.python.org>`_ module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the *urlopen* function. This is capable of fetching URLs using a variety of
different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and so
on. These are provided by objects called handlers and openers.

urllib.request supports fetching URLs for many "URL schemes" (identified by the
string before the ":" in the URL - for example "ftp" is the URL scheme of
"ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.

For straightforward situations *urlopen* is very easy to use. But as soon as you
encounter errors or non-trivial cases when opening HTTP URLs, you will need some
understanding of the HyperText Transfer Protocol. The most comprehensive and
authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
not intended to be easy to read. This HOWTO aims to illustrate using *urllib*,
with enough detail about HTTP to help you through. It is not intended to replace
the :mod:`urllib.request` docs, but is supplementary to them.


Fetching URLs
=============

The simplest way to use urllib.request is as follows::

    import urllib.request
    response = urllib.request.urlopen('http://python.org/')
    html = response.read()

If you wish to retrieve a resource via URL and store it in a temporary location,
you can do so via the :func:`~urllib.request.urlretrieve` function::

    import urllib.request
    local_filename, headers = urllib.request.urlretrieve('http://python.org/')
    html = open(local_filename)

Many uses of urllib will be that simple (note that instead of an 'http:' URL we
could have used a URL starting with 'ftp:', 'file:', etc.).  However, it's the
purpose of this tutorial to explain the more complicated cases, concentrating on
HTTP.

HTTP is based on requests and responses - the client makes requests and servers
send responses. urllib.request mirrors this with a ``Request`` object which
represents the HTTP request you are making. In its simplest form you create a
Request object that specifies the URL you want to fetch. Calling ``urlopen``
with this Request object returns a response object for the URL requested. This
response is a file-like object, which means you can for example call ``.read()``
on the response::

    import urllib.request

    req = urllib.request.Request('http://www.voidspace.org.uk')
    response = urllib.request.urlopen(req)
    the_page = response.read()

Note that urllib.request makes use of the same Request interface to handle all
URL schemes.  For example, you can make an FTP request like so::

    req = urllib.request.Request('ftp://example.com/')

In the case of HTTP, there are two extra things that Request objects allow you
to do: First, you can pass data to be sent to the server.  Second, you can pass
extra information ("metadata") *about* the data or about the request itself, to
the server - this information is sent as HTTP "headers".  Let's look at each of
these in turn.

Data
----

Sometimes you want to send data to a URL (often the URL will refer to a CGI
(Common Gateway Interface) script [#]_ or other web application). With HTTP,
this is often done using what's known as a **POST** request. This is often what
your browser does when you submit an HTML form that you filled in on the web.
Not all POSTs have to come from forms: you can use a POST to transmit arbitrary
data to your own application. In the common case of HTML forms, the data needs
to be encoded in a standard way, and then passed to the Request object as the
``data`` argument. The encoding is done using a function from the
:mod:`urllib.parse` library. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }

    data = urllib.parse.urlencode(values)
    data = data.encode('utf-8') # data should be bytes
    req = urllib.request.Request(url, data)
    response = urllib.request.urlopen(req)
    the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from HTML
forms - see `HTML Specification, Form Submission
<http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
details).
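
To see what the encoding step produces on its own (the server URL in the
example above is hypothetical, so no request is made here), note that
``urlencode`` turns a dictionary into a percent-encoded query string, which
must then be converted to bytes before use as the ``data`` argument:

```python
import urllib.parse

# urlencode produces an application/x-www-form-urlencoded string;
# spaces become '+' and reserved characters are percent-encoded.
values = {'name': 'Michael Foord', 'language': 'Python'}
data = urllib.parse.urlencode(values)
print(data)  # name=Michael+Foord&language=Python

# The data argument of Request must be bytes, not str, so encode it:
data_bytes = data.encode('utf-8')
print(type(data_bytes))  # <class 'bytes'>
```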

If you do not pass the ``data`` argument, urllib uses a **GET** request. One
way in which GET and POST requests differ is that POST requests often have
"side-effects": they change the state of the system in some way (for example by
placing an order with the website for a hundredweight of tinned spam to be
delivered to your door).  Though the HTTP standard makes it clear that POSTs are
intended to *always* cause side-effects, and GET requests *never* to cause
side-effects, nothing prevents a GET request from having side-effects, nor a
POST request from having no side-effects. Data can also be passed in an HTTP
GET request by encoding it in the URL itself.

This is done as follows::

    >>> import urllib.request
    >>> import urllib.parse
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.parse.urlencode(data)
    >>> print(url_values)  # The order may differ from below.  #doctest: +SKIP
    name=Somebody+Here&language=Python&location=Northampton
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> data = urllib.request.urlopen(full_url)

Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.

Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to add headers
to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send different versions
to different browsers [#]_. By default urllib identifies itself as
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
not work. The way a browser identifies itself is through the
``User-Agent`` header [#]_. When you create a Request object you can
pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }
    headers = { 'User-Agent' : user_agent }

    data = urllib.parse.urlencode(values)
    data = data.encode('utf-8')
    req = urllib.request.Request(url, data, headers)
    response = urllib.request.urlopen(req)
    the_page = response.read()

The response also has two useful methods. See the section on `info and geturl`_
which comes after we have a look at what happens when things go wrong.


Handling Exceptions
===================

*urlopen* raises :exc:`URLError` when it cannot handle a response (though as
usual with Python APIs, built-in exceptions such as :exc:`ValueError`,
:exc:`TypeError` etc. may also be raised).

:exc:`HTTPError` is the subclass of :exc:`URLError` raised in the specific case
of HTTP URLs.

The exception classes are exported from the :mod:`urllib.error` module.

URLError
--------

Often, URLError is raised because there is no network connection (no route to
the specified server), or the specified server doesn't exist.  In this case, the
exception raised will have a 'reason' attribute, which is a tuple containing an
error code and a text error message.

e.g. ::

    >>> import urllib.request
    >>> import urllib.error
    >>> req = urllib.request.Request('http://www.pretend_server.org')
    >>> try: urllib.request.urlopen(req)
    ... except urllib.error.URLError as e:
    ...    print(e.reason)      #doctest: +SKIP
    ...
    (4, 'getaddrinfo failed')
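
You can also see a :exc:`URLError` without any network connection, because the
``file:`` scheme goes through the same machinery. A minimal sketch, using a
deliberately bogus local path:

```python
import urllib.request
import urllib.error

# Opening a nonexistent local file raises URLError, just as an
# unreachable server would (the path here is deliberately bogus):
try:
    urllib.request.urlopen('file:///this/path/should/not/exist')
except urllib.error.URLError as e:
    caught = e
    print('Failed:', e.reason)
```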


HTTPError
---------

Every HTTP response from the server contains a numeric "status code". Sometimes
the status code indicates that the server is unable to fulfil the request. The
default handlers will handle some of these responses for you (for example, if
the response is a "redirection" that requests the client fetch the document from
a different URL, urllib will handle that for you). For those it can't handle,
urlopen will raise an :exc:`HTTPError`. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication required).

See section 10 of :rfc:`2616` for a reference on all the HTTP error codes.

The :exc:`HTTPError` instance raised will have an integer 'code' attribute,
which corresponds to the error sent by the server.

Error Codes
~~~~~~~~~~~

Because the default handlers handle redirects (codes in the 300 range), and
codes in the 100-299 range indicate success, you will usually only see error
codes in the 400-599 range.

:attr:`http.server.BaseHTTPRequestHandler.responses` is a useful dictionary of
response codes that shows all the response codes used by :rfc:`2616`. The
dictionary is reproduced here for convenience ::

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
        }
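
Rather than consulting the copy above, you can look entries up in the live
dictionary directly:

```python
from http.server import BaseHTTPRequestHandler

# Each entry maps a status code to a (shortmessage, longmessage) pair:
short_msg, long_msg = BaseHTTPRequestHandler.responses[404]
print(short_msg)  # Not Found
print(long_msg)
```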

When an error is raised the server responds by returning an HTTP error code
*and* an error page. You can use the :exc:`HTTPError` instance as a response on
the page returned. This means that as well as the code attribute, it also has
read, geturl, and info methods as returned by the ``urllib.response`` module::

    >>> req = urllib.request.Request('http://www.python.org/fish.html')
    >>> try:
    ...     urllib.request.urlopen(req)
    ... except urllib.error.HTTPError as e:
    ...     print(e.code)
    ...     print(e.read())  #doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ...
    404
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
      ...
      <title>Page Not Found</title>\n
      ...

Wrapping it Up
--------------

So if you want to be prepared for :exc:`HTTPError` *or* :exc:`URLError` there
are two basic approaches. I prefer the second approach.

Number 1
~~~~~~~~

::

    from urllib.request import Request, urlopen
    from urllib.error import URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError as e:
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    except URLError as e:
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
    else:
        pass  # everything is fine


.. note::

    The ``except HTTPError`` *must* come first, otherwise ``except URLError``
    will *also* catch an :exc:`HTTPError`.
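
The reason this ordering matters is visible in the class hierarchy itself:

```python
from urllib.error import URLError, HTTPError

# Because HTTPError subclasses URLError, an "except URLError" clause
# placed first would swallow HTTPError instances as well:
print(issubclass(HTTPError, URLError))  # True
```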

Number 2
~~~~~~~~

::

    from urllib.request import Request, urlopen
    from urllib.error import URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError as e:
        if hasattr(e, 'reason'):
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
        elif hasattr(e, 'code'):
            print('The server couldn\'t fulfill the request.')
            print('Error code: ', e.code)
    else:
        pass  # everything is fine


info and geturl
===============

The response returned by urlopen (or the :exc:`HTTPError` instance) has two
useful methods, :meth:`info` and :meth:`geturl`, and is defined in the module
:mod:`urllib.response`.

**geturl** - this returns the real URL of the page fetched. This is useful
because ``urlopen`` (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL requested.

**info** - this returns a dictionary-like object that describes the page
fetched, particularly the headers sent by the server. It is currently an
:class:`http.client.HTTPMessage` instance.

Typical headers include 'Content-length', 'Content-type', and so on. See the
`Quick Reference to HTTP Headers <http://www.cs.tut.fi/~jkorpela/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.
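
To show the dictionary-like interface without making a network request, we can
construct an :class:`http.client.HTTPMessage` by hand (a real one would come
from ``response.info()``):

```python
from http.client import HTTPMessage

# Build an HTTPMessage locally purely for illustration; normally
# response.info() returns one populated from the server's headers.
msg = HTTPMessage()
msg['Content-Type'] = 'text/html; charset=utf-8'
msg['Content-Length'] = '1234'

# Headers are accessed case-insensitively, like a dictionary:
print(msg['content-type'])     # text/html; charset=utf-8
print(msg.get_content_type())  # text/html
```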


Openers and Handlers
====================

When you fetch a URL you use an opener (an instance of the perhaps
confusingly-named :class:`urllib.request.OpenerDirector`). Normally we have
been using the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL scheme (http,
ftp, etc.), or how to handle an aspect of URL opening, for example HTTP
redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers
installed, for example to get an opener that handles cookies, or to get an
opener that does not handle redirections.

To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.

Alternatively, you can use ``build_opener``, which is a convenience function for
creating opener objects with a single function call.  ``build_opener`` adds
several handlers by default, but provides a quick way to add more and/or
override the default handlers.

Other sorts of handlers can handle proxies, authentication, and other common
but slightly specialised situations.

``install_opener`` can be used to make an ``opener`` object the (global) default
opener. This means that calls to ``urlopen`` will use the opener you have
installed.

Opener objects have an ``open`` method, which can be called directly to fetch
URLs in the same way as the ``urlopen`` function: there's no need to call
``install_opener``, except as a convenience.
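
As a sketch of the pieces described above, here is an opener built with
``build_opener`` that refuses to follow redirections (``NoRedirectHandler`` is
our own hypothetical name for this example, not part of the standard library):

```python
import urllib.request

# A handler that disables redirection handling: returning None from
# redirect_request makes a redirect raise HTTPError instead of being
# followed.  NoRedirectHandler is our own (hypothetical) name.
class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib.request.build_opener(NoRedirectHandler())

# The opener can be used directly:
#     response = opener.open('http://www.example.com/')
# or installed globally, after which plain urlopen() uses it too:
urllib.request.install_opener(opener)
```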
 | 444 |  | 
 | 445 |  | 
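As a sketch of the "opener that does not handle redirections" idea mentioned
above (the subclass name is our own; ``build_opener`` replaces a default
handler when it is given an instance of a subclass of that handler):

```python
import urllib.request

class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects: returning None from redirect_request
    makes a 3xx response surface as an HTTPError instead."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

# Our instance displaces the default HTTPRedirectHandler.
opener = urllib.request.build_opener(NoRedirectHandler())

# opener.open(some_url) would now raise HTTPError on any redirect;
# install_opener is only needed if we want this behaviour globally.
```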
Basic Authentication
====================

To illustrate creating and installing a handler we will use the
``HTTPBasicAuthHandler``. For a more detailed discussion of this subject --
including an explanation of how Basic Authentication works -- see the `Basic
Authentication Tutorial
<http://www.voidspace.org.uk/python/articles/authentication.shtml>`_.

When authentication is required, the server sends a header (as well as the 401
error code) requesting authentication.  This specifies the authentication scheme
and a 'realm'. The header looks like: ``WWW-Authenticate: SCHEME
realm="REALM"``.

e.g. ::

    WWW-Authenticate: Basic realm="cPanel Users"


The client should then retry the request with the appropriate name and password
for the realm included as a header in the request. This is 'basic
authentication'. In order to simplify this process we can create an instance of
``HTTPBasicAuthHandler`` and an opener to use this handler.

The ``HTTPBasicAuthHandler`` uses an object called a password manager to handle
the mapping of URLs and realms to passwords and usernames. If you know what the
realm is (from the authentication header sent by the server), then you can use
an ``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In that
case, it is convenient to use ``HTTPPasswordMgrWithDefaultRealm``. This allows
you to specify a default username and password for a URL, which will be used
unless you provide an alternative combination for a specific realm. We indicate
this by providing ``None`` as the realm argument to the ``add_password`` method.

The top-level URL is the first URL that requires authentication. URLs "deeper"
than the URL you pass to ``add_password()`` will also match. ::

    # create a password manager
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password.
    # If we knew the realm, we could use it instead of None.
    top_level_url = "http://example.com/foo/"
    password_mgr.add_password(None, top_level_url, username, password)

    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

    # create "opener" (OpenerDirector instance)
    opener = urllib.request.build_opener(handler)

    # use the opener to fetch a URL
    opener.open(a_url)

    # Install the opener.
    # Now all calls to urllib.request.urlopen use our opener.
    urllib.request.install_opener(opener)

.. note::

    In the above example we only supplied our ``HTTPBasicAuthHandler`` to
    ``build_opener``. By default openers have the handlers for normal situations
    -- ``ProxyHandler`` (if a proxy setting such as an :envvar:`http_proxy`
    environment variable is set), ``UnknownHandler``, ``HTTPHandler``,
    ``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``,
    ``FileHandler``, ``DataHandler``, ``HTTPErrorProcessor``.

``top_level_url`` is in fact *either* a full URL (including the 'http:' scheme
component and the hostname and optionally the port number),
e.g. ``"http://example.com/"``, *or* an "authority" (i.e. the hostname,
optionally including the port number), e.g. ``"example.com"`` or
``"example.com:8080"``.  The authority, if present, must NOT contain the
"userinfo" component - for example ``"joe:password@example.com"`` is
not correct.

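The matching behaviour can be seen without any network traffic (the URL,
username, and password here are placeholders):

```python
import urllib.request

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, "http://example.com/foo/", "joe", "secret")

# A URL "deeper" than the registered top-level URL matches...
print(password_mgr.find_user_password(None, "http://example.com/foo/bar/"))

# ...but an unrelated path on the same host does not.
print(password_mgr.find_user_password(None, "http://example.com/other/"))
```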

Proxies
=======

**urllib** will auto-detect your proxy settings and use those. This is through
the ``ProxyHandler``, which is part of the normal handler chain when a proxy
setting is detected.  Normally that's a good thing, but there are occasions
when it may not be helpful [#]_. One way to prevent automatic proxy use is to
set up our own ``ProxyHandler`` with no proxies defined. This is done using
similar steps to setting up a `Basic Authentication`_ handler::

    >>> proxy_support = urllib.request.ProxyHandler({})
    >>> opener = urllib.request.build_opener(proxy_support)
    >>> urllib.request.install_opener(opener)

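Conversely, the same mechanism can force requests through a specific proxy
rather than disable one (the proxy host and port below are hypothetical):

```python
import urllib.request

# Route all http requests through a (hypothetical) proxy; an empty
# dict, as above, disables proxies entirely.
proxy_support = urllib.request.ProxyHandler(
    {'http': 'http://proxy.example.com:3128/'})
opener = urllib.request.build_opener(proxy_support)
# urllib.request.install_opener(opener)  # optional: make it the global default
```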
.. note::

    Currently ``urllib.request`` *does not* support fetching of ``https`` locations
    through a proxy.  However, this can be enabled by extending urllib.request as
    shown in the recipe [#]_.


Sockets and Layers
==================

The Python support for fetching resources from the web is layered.  urllib uses
the :mod:`http.client` library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response
before timing out. This can be useful in applications which have to fetch web
pages. By default the socket module has *no timeout* and can hang.
``urllib.request.urlopen`` accepts a *timeout* argument for individual
requests; alternatively, you can set the default timeout globally for all
sockets using ::

    import socket
    import urllib.request

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib.request.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib.request.Request('http://www.voidspace.org.uk')
    response = urllib.request.urlopen(req)

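Since the global default affects every socket in the process, a lighter-weight
alternative is the *timeout* argument to ``urlopen`` itself. A sketch (no
request is actually made; the URL is a placeholder):

```python
import socket
import urllib.request

# Per-request timeout: overrides any global default for this call only.
# response = urllib.request.urlopen('http://example.com/', timeout=10)

# The global default can be inspected, changed, and restored:
previous = socket.getdefaulttimeout()
socket.setdefaulttimeout(10)
socket.setdefaulttimeout(previous)
```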

-------


Footnotes
=========

This document was reviewed and revised by John Lee.

.. [#] For an introduction to the CGI protocol see
       `Writing Web Applications in Python <http://www.pyzine.com/Issue008/Section_Articles/article_CGIOne.html>`_.
.. [#] Like Google for example. The *proper* way to use Google from a program
       is to use `PyGoogle <http://pygoogle.sourceforge.net>`_ of course. See
       `Voidspace Google <http://www.voidspace.org.uk/python/recipebook.shtml#google>`_
       for some examples of using the Google API.
.. [#] Browser sniffing is a very bad practice for website design - building
       sites using web standards is much more sensible. Unfortunately a lot of
       sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is
       *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see
       `Quick Reference to HTTP Headers`_.
.. [#] In my case I have to use a proxy to access the internet at work. If you
       attempt to fetch *localhost* URLs through this proxy it blocks them. IE
       is set to use the proxy, which urllib picks up on. In order to test
       scripts with a localhost server, I have to prevent urllib from using
       the proxy.
.. [#] urllib opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
       <http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/456195>`_.
