==============================================
 HOWTO Fetch Internet Resources Using urllib2
==============================================
----------------------------
 Fetching URLs With Python
----------------------------


.. note::

    There is a French translation of an earlier revision of this
    HOWTO, available at `urllib2 - Le Manuel manquant
    <http://www.voidspace.org.uk/python/articles/urllib2_francais.shtml>`_.

.. contents:: urllib2 Tutorial


Introduction
============

.. sidebar:: Related Articles

    You may also find the following article on fetching web resources
    with Python useful:

    * `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_

      A tutorial on *Basic Authentication*, with examples in Python.

    This HOWTO was written by `Michael Foord
    <http://www.voidspace.org.uk/python/index.shtml>`_.

**urllib2** is a `Python <http://www.python.org>`_ module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form
of the *urlopen* function, which is capable of fetching URLs using a variety
of different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and
so on. These are provided by objects called handlers and openers.

urllib2 supports fetching URLs for many "URL schemes" (identified by the
string before the ":" in the URL - for example "ftp" is the URL scheme of
"ftp://python.org/") using their associated network protocols (e.g. FTP,
HTTP). This tutorial focuses on the most common case, HTTP.
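
As an aside, the scheme of a URL can be inspected with the standard
``urlparse`` module - a quick illustrative check, not part of urllib2
itself::

    >>> import urlparse
    >>> urlparse.urlparse('ftp://python.org/')[0]
    'ftp'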

For straightforward situations *urlopen* is very easy to use. But as
soon as you encounter errors or non-trivial cases when opening HTTP
URLs, you will need some understanding of the HyperText Transfer
Protocol. The most comprehensive and authoritative reference to HTTP
is :RFC:`2616`. This is a technical document and not intended to be
easy to read. This HOWTO aims to illustrate using *urllib2*, with
enough detail about HTTP to help you through. It is not intended to
replace the `urllib2 docs <http://docs.python.org/lib/module-urllib2.html>`_,
but is supplementary to them.


Fetching URLs
=============

The simplest way to use urllib2 is as follows::

    import urllib2
    response = urllib2.urlopen('http://python.org/')
    html = response.read()

Many uses of urllib2 will be that simple (note that instead of an
'http:' URL we could have used a URL starting with 'ftp:', 'file:',
etc.). However, it's the purpose of this tutorial to explain the more
complicated cases, concentrating on HTTP.

HTTP is based on requests and responses - the client makes requests
and servers send responses. urllib2 mirrors this with a ``Request``
object which represents the HTTP request you are making. In its
simplest form you create a Request object that specifies the URL you
want to fetch. Calling ``urlopen`` with this Request object returns a
response object for the URL requested. This response is a file-like
object, which means you can, for example, call ``.read()`` on the
response::

    import urllib2

    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)
    the_page = response.read()

Note that urllib2 makes use of the same Request interface to handle
all URL schemes. For example, you can make an FTP request like so::

    req = urllib2.Request('ftp://example.com/')

In the case of HTTP, there are two extra things that Request objects
allow you to do: First, you can pass data to be sent to the server.
Second, you can pass extra information ("metadata") *about* the data
or about the request itself, to the server - this information is sent
as HTTP "headers". Let's look at each of these in turn.

Data
----

Sometimes you want to send data to a URL (often the URL will refer to
a CGI (Common Gateway Interface) script [#]_ or other web
application). With HTTP, this is often done using what's known as a
**POST** request. This is often what your browser does when you submit
an HTML form that you filled in on the web. Not all POSTs have to come
from forms: you can use a POST to transmit arbitrary data to your own
application. In the common case of HTML forms, the data needs to be
encoded in a standard way, and then passed to the Request object as
the ``data`` argument. The encoding is done using a function from the
``urllib`` library, *not* from ``urllib2``. ::

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
    the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload
from HTML forms - see
`HTML Specification, Form Submission <http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_
for more details).
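
For instance, file upload uses the ``multipart/form-data`` encoding,
for which urllib2 provides no helper. A multipart body can, however,
be assembled by hand and sent with an explicit ``Content-Type``
header. The following is only a minimal sketch - the URL, boundary
string and field names are invented for illustration::

    import urllib2

    # any string that never occurs in the payload will do as a boundary
    boundary = '---------------------boundary1234'
    parts = [
        '--' + boundary,
        'Content-Disposition: form-data; name="name"',
        '',
        'Michael Foord',
        '--' + boundary,
        'Content-Disposition: form-data; name="upload"; filename="hello.txt"',
        'Content-Type: text/plain',
        '',
        'Hello, world!',
        '--' + boundary + '--',
        '',
    ]
    body = '\r\n'.join(parts)

    req = urllib2.Request('http://www.example.com/upload.cgi', body)
    req.add_header('Content-Type', 'multipart/form-data; boundary=' + boundary)
    response = urllib2.urlopen(req)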

If you do not pass the ``data`` argument, urllib2 uses a **GET**
request. One way in which GET and POST requests differ is that POST
requests often have "side-effects": they change the state of the
system in some way (for example by placing an order with the website
for a hundredweight of tinned spam to be delivered to your door).
Though the HTTP standard makes it clear that POSTs are intended to
*always* cause side-effects, and GET requests *never* to cause
side-effects, nothing prevents a GET request from having side-effects,
nor a POST request from having no side-effects. Data can also be
passed in an HTTP GET request by encoding it in the URL itself.

This is done as follows::

    >>> import urllib2
    >>> import urllib
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.urlencode(data)
    >>> print url_values
    name=Somebody+Here&language=Python&location=Northampton
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> data = urllib2.urlopen(full_url)

Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.

Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to
add headers to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send
different versions to different browsers [#]_. By default urllib2
identifies itself as ``Python-urllib/x.y`` (where ``x`` and ``y`` are
the major and minor version numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
not work. The way a browser identifies itself is through the
``User-Agent`` header [#]_. When you create a Request object you can
pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }
    headers = { 'User-Agent' : user_agent }

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data, headers)
    response = urllib2.urlopen(req)
    the_page = response.read()

The response also has two useful methods. See the section on `info and
geturl`_ which comes after we have a look at what happens when things
go wrong.


Handling Exceptions
===================

*urlopen* raises ``URLError`` when it cannot handle a response (though
as usual with Python APIs, built-in exceptions such as ValueError,
TypeError etc. may also be raised).

``HTTPError`` is the subclass of ``URLError`` raised in the specific
case of HTTP URLs.

URLError
--------

Often, URLError is raised because there is no network connection (no
route to the specified server), or the specified server doesn't exist.
In this case, the exception raised will have a 'reason' attribute,
which is a tuple containing an error code and a text error message.

e.g. ::

    >>> req = urllib2.Request('http://www.pretend_server.org')
    >>> try: urllib2.urlopen(req)
    ... except URLError as e:
    ...     print e.reason
    ...
    (4, 'getaddrinfo failed')


HTTPError
---------

Every HTTP response from the server contains a numeric "status
code". Sometimes the status code indicates that the server is unable
to fulfil the request. The default handlers will handle some of these
responses for you (for example, if the response is a "redirection"
that requests the client fetch the document from a different URL,
urllib2 will handle that for you). For those it can't handle, urlopen
will raise an ``HTTPError``. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication
required).

See section 10 of RFC 2616 for a reference on all the HTTP error
codes.

The ``HTTPError`` instance raised will have an integer 'code'
attribute, which corresponds to the error sent by the server.

Error Codes
~~~~~~~~~~~

Because the default handlers handle redirects (codes in the 300
range), and codes in the 100-299 range indicate success, you will
usually only see error codes in the 400-599 range.

``BaseHTTPServer.BaseHTTPRequestHandler.responses`` is a useful
dictionary of response codes that shows all the response codes used
by RFC 2616. The dictionary is reproduced here for convenience::

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
        }

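Since this is an ordinary dictionary, you can also look codes up at the
interactive prompt::

    >>> import BaseHTTPServer
    >>> BaseHTTPServer.BaseHTTPRequestHandler.responses[404]
    ('Not Found', 'Nothing matches the given URI')
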
When an error is raised the server responds by returning an HTTP error
code *and* an error page. You can use the ``HTTPError`` instance as a
response object for the page returned. This means that as well as the
code attribute, it also has ``read``, ``geturl``, and ``info`` methods. ::

    >>> req = urllib2.Request('http://www.python.org/fish.html')
    >>> try:
    ...     urllib2.urlopen(req)
    ... except URLError as e:
    ...     print e.code
    ...     print e.read()
    ...
    404
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
    <?xml-stylesheet href="./css/ht2html.css"
        type="text/css"?>
    <html><head><title>Error 404: File Not Found</title>
    ...... etc...

Wrapping it Up
--------------

So if you want to be prepared for ``HTTPError`` *or* ``URLError``
there are two basic approaches. I prefer the second approach.

Number 1
~~~~~~~~

::

    from urllib2 import Request, urlopen, URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError as e:
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    except URLError as e:
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    else:
        # everything is fine
        the_page = response.read()


.. note::

    The ``except HTTPError`` *must* come first, otherwise ``except URLError``
    will *also* catch an ``HTTPError``.

Number 2
~~~~~~~~

::

    from urllib2 import Request, urlopen, URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError as e:
        if hasattr(e, 'reason'):
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
        elif hasattr(e, 'code'):
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
    else:
        # everything is fine
        the_page = response.read()


info and geturl
===============

The response returned by urlopen (or the ``HTTPError`` instance) has
two useful methods, ``info`` and ``geturl``.

**geturl** - this returns the real URL of the page fetched. This is
useful because ``urlopen`` (or the opener object used) may have
followed a redirect. The URL of the page fetched may not be the same
as the URL requested.

**info** - this returns a dictionary-like object that describes the
page fetched, particularly the headers sent by the server. It is
currently an ``httplib.HTTPMessage`` instance.

Typical headers include 'Content-length', 'Content-type', and so
on. See the
`Quick Reference to HTTP Headers <http://www.cs.tut.fi/~jkorpela/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.
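
Here is a hypothetical interactive session showing both methods - the
redirect target and header value below are invented, and will depend on
the server you actually hit::

    >>> import urllib2
    >>> response = urllib2.urlopen('http://python.org/')
    >>> response.geturl()       # may differ from the URL we asked for
    'http://www.python.org/'
    >>> response.info()['Content-Type']
    'text/html'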


Openers and Handlers
====================

When you fetch a URL you use an opener (an instance of the perhaps
confusingly-named ``urllib2.OpenerDirector``). Normally we have been using
the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL
scheme (http, ftp, etc.), or how to handle an aspect of URL opening,
for example HTTP redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with
specific handlers installed, for example to get an opener that handles
cookies, or to get an opener that does not handle redirections.

To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.

Alternatively, you can use ``build_opener``, which is a convenience
function for creating opener objects with a single function call.
``build_opener`` adds several handlers by default, but provides a
quick way to add more and/or override the default handlers.

Other sorts of handlers you might want can handle proxies,
authentication, and other common but slightly specialised
situations.

``install_opener`` can be used to make an ``opener`` object the
(global) default opener. This means that calls to ``urlopen`` will use
the opener you have installed.

Opener objects have an ``open`` method, which can be called directly
to fetch URLs in the same way as the ``urlopen`` function: there's no
need to call ``install_opener``, except as a convenience.
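
For example, here is a minimal sketch of an opener that handles
cookies, built with the standard ``cookielib`` module and urllib2's
``HTTPCookieProcessor`` handler (the URL is just a placeholder)::

    import cookielib
    import urllib2

    # a CookieJar collects cookies set by servers and sends them back
    cookie_jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

    # either use the opener directly...
    response = opener.open('http://www.example.com/')

    # ...or install it, so that plain urllib2.urlopen uses it too
    urllib2.install_opener(opener)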


Basic Authentication
====================

To illustrate creating and installing a handler we will use the
``HTTPBasicAuthHandler``. For a more detailed discussion of this
subject - including an explanation of how Basic Authentication works -
see the `Basic Authentication Tutorial <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_.

When authentication is required, the server sends a header (as well as
the 401 error code) requesting authentication. This specifies the
authentication scheme and a 'realm'. The header looks like:
``Www-authenticate: SCHEME realm="REALM"``.

e.g. ::

    Www-authenticate: Basic realm="cPanel Users"


The client should then retry the request with the appropriate name and
password for the realm included as a header in the request. This is
'basic authentication'. In order to simplify this process we can
create an instance of ``HTTPBasicAuthHandler`` and an opener to use
this handler.

The ``HTTPBasicAuthHandler`` uses an object called a password manager
to handle the mapping of URLs and realms to passwords and
usernames. If you know what the realm is (from the authentication
header sent by the server), then you can use an
``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In
that case, it is convenient to use
``HTTPPasswordMgrWithDefaultRealm``. This allows you to specify a
default username and password for a URL, which will be supplied unless
you provide an alternative combination for a specific realm. We
indicate this by providing ``None`` as the realm argument to the
``add_password`` method.

The top-level URL is the first URL that requires authentication. URLs
"deeper" than the URL you pass to .add_password() will also match. ::

    # create a password manager
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password.
    # If we knew the realm, we could use it instead of ``None``.
    top_level_url = "http://example.com/foo/"
    password_mgr.add_password(None, top_level_url, username, password)

    handler = urllib2.HTTPBasicAuthHandler(password_mgr)

    # create "opener" (OpenerDirector instance)
    opener = urllib2.build_opener(handler)

    # use the opener to fetch a URL
    opener.open(a_url)

    # Install the opener.
    # Now all calls to urllib2.urlopen use our opener.
    urllib2.install_opener(opener)

.. note::

    In the above example we only supplied our ``HTTPBasicAuthHandler``
    to ``build_opener``. By default openers have the handlers for
    normal situations - ``ProxyHandler``, ``UnknownHandler``,
    ``HTTPHandler``, ``HTTPDefaultErrorHandler``,
    ``HTTPRedirectHandler``, ``FTPHandler``, ``FileHandler``,
    ``HTTPErrorProcessor``.

top_level_url is in fact *either* a full URL (including the 'http:'
scheme component and the hostname and optionally the port number)
e.g. "http://example.com/" *or* an "authority" (i.e. the hostname,
optionally including the port number) e.g. "example.com" or
"example.com:8080" (the latter example includes a port number). The
authority, if present, must NOT contain the "userinfo" component - for
example "joe:password@example.com" is not correct.


Proxies
=======

**urllib2** will auto-detect your proxy settings and use those. This
is through the ``ProxyHandler`` which is part of the normal handler
chain. Normally that's a good thing, but there are occasions when it
may not be helpful [#]_. One way to disable automatic proxy handling
is to set up our own ``ProxyHandler``, with no proxies defined. This
is done using similar steps to setting up a `Basic Authentication`_
handler::

    >>> proxy_support = urllib2.ProxyHandler({})
    >>> opener = urllib2.build_opener(proxy_support)
    >>> urllib2.install_opener(opener)

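The same handler can also route requests through an explicit proxy, by
passing a dictionary mapping URL scheme to proxy URL - the proxy
address below is made up for illustration::

    >>> proxy_support = urllib2.ProxyHandler({'http': 'http://proxy.example.com:3128/'})
    >>> opener = urllib2.build_opener(proxy_support)
    >>> urllib2.install_opener(opener)
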
.. note::

    Currently ``urllib2`` *does not* support fetching of ``https``
    locations through a proxy. This can be a problem.

Sockets and Layers
==================

The Python support for fetching resources from the web is
layered. urllib2 uses the httplib library, which in turn uses the
socket library.

As of Python 2.3 you can specify how long a socket should wait for a
response before timing out. This can be useful in applications which
have to fetch web pages. By default the socket module has *no timeout*
and can hang. Currently, the socket timeout is not exposed at the
httplib or urllib2 levels. However, you can set the default timeout
globally for all sockets using::

    import socket
    import urllib2

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib2.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)


-------


Footnotes
=========

This document was reviewed and revised by John Lee.

.. [#] For an introduction to the CGI protocol see
       `Writing Web Applications in Python <http://www.pyzine.com/Issue008/Section_Articles/article_CGIOne.html>`_.
.. [#] Like Google for example. The *proper* way to use Google from a program
       is to use `PyGoogle <http://pygoogle.sourceforge.net>`_ of course. See
       `Voidspace Google <http://www.voidspace.org.uk/python/recipebook.shtml#google>`_
       for some examples of using the Google API.
.. [#] Browser sniffing is a very bad practice for website design - building
       sites using web standards is much more sensible. Unfortunately a lot of
       sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is
       *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see
       `Quick Reference to HTTP Headers`_.
.. [#] In my case I have to use a proxy to access the internet at work. If you
       attempt to fetch *localhost* URLs through this proxy it blocks them. IE
       is set to use the proxy, which urllib2 picks up on. In order to test
       scripts with a localhost server, I have to prevent urllib2 from using
       the proxy.