==============================================
 HOWTO Fetch Internet Resources Using urllib2
==============================================
----------------------------
 Fetching URLs With Python
----------------------------


.. note::

    There is a French translation of an earlier revision of this
    HOWTO, available at `urllib2 - Le Manuel manquant
    <http://www.voidspace.org.uk/python/articles/urllib2_francais.shtml>`_.

.. contents:: urllib2 Tutorial


Introduction
============

.. sidebar:: Related Articles

    You may also find the following article on fetching web
    resources with Python useful:

    * `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_

      A tutorial on *Basic Authentication*, with examples in Python.

    This HOWTO is written by `Michael Foord
    <http://www.voidspace.org.uk/python/index.shtml>`_.

**urllib2** is a `Python <http://www.python.org>`_ module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the *urlopen* function. This is capable of fetching URLs using a variety of
different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and
so on. These are provided by objects called handlers and openers.

urllib2 supports fetching URLs for many "URL schemes" (identified by the string
before the ":" in the URL - for example "ftp" is the URL scheme of
"ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.

For straightforward situations *urlopen* is very easy to use. But as soon as
you encounter errors or non-trivial cases when opening HTTP URLs, you will
need some understanding of the HyperText Transfer Protocol. The most
comprehensive and authoritative reference to HTTP is :RFC:`2616`. This is a
technical document and not intended to be easy to read. This HOWTO aims to
illustrate using *urllib2*, with enough detail about HTTP to help you
through. It is not intended to replace the
`urllib2 docs <http://docs.python.org/lib/module-urllib2.html>`_,
but is supplementary to them.


Fetching URLs
=============

The simplest way to use urllib2 is as follows::

    import urllib2
    response = urllib2.urlopen('http://python.org/')
    html = response.read()

Many uses of urllib2 will be that simple (note that instead of an 'http:' URL
we could have used a URL starting with 'ftp:', 'file:', etc.). However, it's
the purpose of this tutorial to explain the more complicated cases,
concentrating on HTTP.

HTTP is based on requests and responses - the client makes requests and
servers send responses. urllib2 mirrors this with a ``Request`` object which
represents the HTTP request you are making. In its simplest form you create a
Request object that specifies the URL you want to fetch. Calling ``urlopen``
with this Request object returns a response object for the URL requested. This
response is a file-like object, which means you can for example call
``.read()`` on the response::

    import urllib2

    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)
    the_page = response.read()

Note that urllib2 makes use of the same Request interface to handle all URL
schemes. For example, you can make an FTP request like so::

    req = urllib2.Request('ftp://example.com/')

In the case of HTTP, there are two extra things that Request objects allow you
to do: First, you can pass data to be sent to the server. Second, you can pass
extra information ("metadata") *about* the data or about the request itself,
to the server - this information is sent as HTTP "headers". Let's look at each
of these in turn.

Data
----

Sometimes you want to send data to a URL (often the URL will refer to a CGI
(Common Gateway Interface) script [#]_ or other web application). With HTTP,
this is often done using what's known as a **POST** request. This is often
what your browser does when you submit an HTML form that you filled in on the
web. Not all POSTs have to come from forms: you can use a POST to transmit
arbitrary data to your own application. In the common case of HTML forms, the
data needs to be encoded in a standard way, and then passed to the Request
object as the ``data`` argument. The encoding is done using a function from
the ``urllib`` library, *not* from ``urllib2``. ::

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
    the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from
HTML forms - see
`HTML Specification, Form Submission <http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_
for more details).

If you do not pass the ``data`` argument, urllib2 uses a **GET** request. One
way in which GET and POST requests differ is that POST requests often have
"side-effects": they change the state of the system in some way (for example
by placing an order with the website for a hundredweight of tinned spam to be
delivered to your door). Though the HTTP standard makes it clear that POSTs
are intended to *always* cause side-effects, and GET requests *never* to cause
side-effects, nothing prevents a GET request from having side-effects, nor a
POST request from having no side-effects. Data can also be passed in an HTTP
GET request by encoding it in the URL itself.

This is done as follows::

    >>> import urllib2
    >>> import urllib
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.urlencode(data)
    >>> print url_values
    name=Somebody+Here&language=Python&location=Northampton
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> data = urllib2.urlopen(full_url)

Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.

Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to add
headers to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send different
versions to different browsers [#]_. By default urllib2 identifies itself as
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
numbers of the Python release, e.g. ``Python-urllib/2.5``), which may confuse
the site, or just plain not work. The way a browser identifies itself is
through the ``User-Agent`` header [#]_. When you create a Request object you
can pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }
    headers = { 'User-Agent' : user_agent }

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data, headers)
    response = urllib2.urlopen(req)
    the_page = response.read()

The response also has two useful methods. See the section on `info and
geturl`_ which comes after we have a look at what happens when things go
wrong.


Handling Exceptions
===================

*urlopen* raises ``URLError`` when it cannot handle a response (though as
usual with Python APIs, builtin exceptions such as ValueError, TypeError
etc. may also be raised).

``HTTPError`` is the subclass of ``URLError`` raised in the specific case of
HTTP URLs.

URLError
--------

Often, URLError is raised because there is no network connection (no route to
the specified server), or the specified server doesn't exist. In this case,
the exception raised will have a 'reason' attribute, which is a tuple
containing an error code and a text error message.

e.g. ::

    >>> req = urllib2.Request('http://www.pretend_server.org')
    >>> try: urllib2.urlopen(req)
    >>> except URLError, e:
    >>>     print e.reason
    >>>
    (4, 'getaddrinfo failed')


HTTPError
---------

Every HTTP response from the server contains a numeric "status
code". Sometimes the status code indicates that the server is unable to fulfil
the request. The default handlers will handle some of these responses for you
(for example, if the response is a "redirection" that requests the client
fetch the document from a different URL, urllib2 will handle that for
you). For those it can't handle, urlopen will raise an
``HTTPError``. Typical errors include '404' (page not found), '403' (request
forbidden), and '401' (authentication required).

See section 10 of RFC 2616 for a reference on all the HTTP error codes.

The ``HTTPError`` instance raised will have an integer 'code' attribute,
which corresponds to the error sent by the server.

Error Codes
~~~~~~~~~~~

Because the default handlers handle redirects (codes in the 300 range), and
codes in the 100-299 range indicate success, you will usually only see error
codes in the 400-599 range.

``BaseHTTPServer.BaseHTTPRequestHandler.responses`` is a useful dictionary of
response codes that shows all the response codes used by RFC 2616. The
dictionary is reproduced here for convenience ::

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
        }

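
As a sketch of how this table can be used, the hypothetical ``describe``
helper below (not part of urllib2) turns a numeric status code such as the
'code' attribute of an ``HTTPError`` into a readable message. The try/except
import is only there so the snippet also runs on Python 3, where the module
is named ``http.server``:

```python
# Sketch: look up a numeric status code in the responses table shown above.
# The int(k) normalisation makes the lookup work on Python 3 as well, where
# the table keys are HTTPStatus enum members rather than plain integers.
try:
    from BaseHTTPServer import BaseHTTPRequestHandler   # Python 2
except ImportError:
    from http.server import BaseHTTPRequestHandler      # Python 3

responses = dict((int(k), v) for k, v in
                 BaseHTTPRequestHandler.responses.items())

def describe(code):
    # Fall back to a generic message for codes missing from the table
    short, long_ = responses.get(code, ('Unknown', 'Unknown status code'))
    return '%d: %s -- %s' % (code, short, long_)

print(describe(404))   # 404: Not Found -- Nothing matches the given URI
```
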
When an error is raised the server responds by returning an HTTP error code
*and* an error page. You can use the ``HTTPError`` instance as a response on
the page returned. This means that as well as the code attribute, it also has
read, geturl, and info methods. ::

    >>> req = urllib2.Request('http://www.python.org/fish.html')
    >>> try:
    >>>     urllib2.urlopen(req)
    >>> except URLError, e:
    >>>     print e.code
    >>>     print e.read()
    >>>
    404
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
    <?xml-stylesheet href="./css/ht2html.css"
        type="text/css"?>
    <html><head><title>Error 404: File Not Found</title>
    ...... etc...

Wrapping it Up
--------------

So if you want to be prepared for ``HTTPError`` *or* ``URLError`` there are
two basic approaches. I prefer the second approach.

Number 1
~~~~~~~~

::

    from urllib2 import Request, urlopen, URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError, e:
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    except URLError, e:
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    else:
        # everything is fine
        the_page = response.read()


.. note::

    The ``except HTTPError`` *must* come first, otherwise ``except URLError``
    will *also* catch an ``HTTPError``.

Number 2
~~~~~~~~

::

    from urllib2 import Request, urlopen, URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError, e:
        if hasattr(e, 'reason'):
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
        elif hasattr(e, 'code'):
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
    else:
        # everything is fine
        the_page = response.read()


info and geturl
===============

The response returned by urlopen (or the ``HTTPError`` instance) has two
useful methods ``info`` and ``geturl``.

**geturl** - this returns the real URL of the page fetched. This is useful
because ``urlopen`` (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL
requested.

**info** - this returns a dictionary-like object that describes the page
fetched, particularly the headers sent by the server. It is currently an
``httplib.HTTPMessage`` instance.

Typical headers include 'Content-length', 'Content-type', and so on. See the
`Quick Reference to HTTP Headers <http://www.cs.tut.fi/~jkorpela/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.
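
The two methods can be tried without a network connection by fetching a local
file through the ``file:`` scheme. The snippet below is a sketch, not from the
original documentation; the try/except import only makes it run on Python 3 as
well, where urllib2 became ``urllib.request``:

```python
# Sketch: demonstrate geturl() and info() offline against a 'file:' URL.
import os
import tempfile

try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # Python 3 home of the same API

# Create a small local file to fetch
fd, path = tempfile.mkstemp(suffix='.html')
os.write(fd, b'<html>hello</html>')
os.close(fd)

url = 'file://' + path
response = urllib2.urlopen(url)

print(response.geturl())            # the URL that was actually fetched
headers = response.info()           # dictionary-like object of headers
print(headers['Content-length'])    # length of the fetched document

os.remove(path)
```
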
Andrew M. Kuchling4b5caae2006-04-30 21:19:31 +0000412
413
414Openers and Handlers
415====================
416
Andrew M. Kuchlingfb108582006-05-07 17:12:12 +0000417When you fetch a URL you use an opener (an instance of the perhaps
Georg Brandld419a932006-05-17 14:11:36 +0000418confusingly-named ``urllib2.OpenerDirector``). Normally we have been using
Andrew M. Kuchlingfb108582006-05-07 17:12:12 +0000419the default opener - via ``urlopen`` - but you can create custom
420openers. Openers use handlers. All the "heavy lifting" is done by the
421handlers. Each handler knows how to open URLs for a particular URL
422scheme (http, ftp, etc.), or how to handle an aspect of URL opening,
423for example HTTP redirections or HTTP cookies.
Andrew M. Kuchling4b5caae2006-04-30 21:19:31 +0000424
Andrew M. Kuchlingfb108582006-05-07 17:12:12 +0000425You will want to create openers if you want to fetch URLs with
426specific handlers installed, for example to get an opener that handles
427cookies, or to get an opener that does not handle redirections.
Andrew M. Kuchling4b5caae2006-04-30 21:19:31 +0000428
Andrew M. Kuchlingfb108582006-05-07 17:12:12 +0000429To create an opener, instantiate an OpenerDirector, and then call
430.add_handler(some_handler_instance) repeatedly.
431
432Alternatively, you can use ``build_opener``, which is a convenience
433function for creating opener objects with a single function call.
434``build_opener`` adds several handlers by default, but provides a
435quick way to add more and/or override the default handlers.
436
437Other sorts of handlers you might want to can handle proxies,
438authentication, and other common but slightly specialised
439situations.
440
441``install_opener`` can be used to make an ``opener`` object the
442(global) default opener. This means that calls to ``urlopen`` will use
443the opener you have installed.
444
445Opener objects have an ``open`` method, which can be called directly
446to fetch urls in the same way as the ``urlopen`` function: there's no
447need to call ``install_opener``, except as a convenience.
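
For example, a cookie-aware opener can be built by passing an
``HTTPCookieProcessor`` to ``build_opener``. This is a sketch rather than part
of the original text; the try/except imports are only there so it also runs on
Python 3, where the modules were renamed:

```python
# Sketch: build an opener that keeps cookies across requests and install
# it as the global default used by urllib2.urlopen.
try:
    import urllib2
    import cookielib                     # Python 2 names
except ImportError:
    import urllib.request as urllib2     # Python 3 renames
    import http.cookiejar as cookielib

cj = cookielib.CookieJar()
cookie_handler = urllib2.HTTPCookieProcessor(cj)

# build_opener adds the default handlers, plus any we pass in
opener = urllib2.build_opener(cookie_handler)

# The opener can be used directly ...
# response = opener.open('http://www.example.com/')

# ... or installed so that plain urllib2.urlopen() uses it from now on
urllib2.install_opener(opener)
```
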


Basic Authentication
====================

To illustrate creating and installing a handler we will use the
``HTTPBasicAuthHandler``. For a more detailed discussion of this subject -
including an explanation of how Basic Authentication works - see the
`Basic Authentication Tutorial <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_.

When authentication is required, the server sends a header (as well as the
401 error code) requesting authentication. This specifies the authentication
scheme and a 'realm'. The header looks like:
``Www-authenticate: SCHEME realm="REALM"``.

e.g. ::

    Www-authenticate: Basic realm="cPanel Users"


The client should then retry the request with the appropriate name and
password for the realm included as a header in the request. This is 'basic
authentication'. In order to simplify this process we can create an instance
of ``HTTPBasicAuthHandler`` and an opener to use this handler.

The ``HTTPBasicAuthHandler`` uses an object called a password manager to
handle the mapping of URLs and realms to passwords and usernames. If you know
what the realm is (from the authentication header sent by the server), then
you can use an ``HTTPPasswordMgr``. Frequently one doesn't care what the
realm is. In that case, it is convenient to use
``HTTPPasswordMgrWithDefaultRealm``. This allows you to specify a default
username and password for a URL. This will be supplied in the absence of you
providing an alternative combination for a specific realm. We indicate this
by providing ``None`` as the realm argument to the ``add_password`` method.

The top-level URL is the first URL that requires authentication. URLs
"deeper" than the URL you pass to ``.add_password()`` will also match. ::

    # create a password manager
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password.
    # If we knew the realm, we could use it instead of ``None``.
    top_level_url = "http://example.com/foo/"
    password_mgr.add_password(None, top_level_url, username, password)

    handler = urllib2.HTTPBasicAuthHandler(password_mgr)

    # create "opener" (OpenerDirector instance)
    opener = urllib2.build_opener(handler)

    # use the opener to fetch a URL
    opener.open(a_url)

    # Install the opener.
    # Now all calls to urllib2.urlopen use our opener.
    urllib2.install_opener(opener)

.. note::

    In the above example we only supplied our ``HTTPBasicAuthHandler``
    to ``build_opener``. By default openers have the handlers for
    normal situations - ``ProxyHandler``, ``UnknownHandler``,
    ``HTTPHandler``, ``HTTPDefaultErrorHandler``,
    ``HTTPRedirectHandler``, ``FTPHandler``, ``FileHandler``,
    ``HTTPErrorProcessor``.

top_level_url is in fact *either* a full URL (including the 'http:' scheme
component and the hostname and optionally the port number)
e.g. "http://example.com/" *or* an "authority" (i.e. the hostname,
optionally including the port number) e.g. "example.com" or
"example.com:8080" (the latter example includes a port number). The
authority, if present, must NOT contain the "userinfo" component - for
example "joe:password@example.com" is not correct.
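
The matching rules can be seen without contacting any server, since the
password manager can be queried directly. This sketch is not from the original
HOWTO, and the try/except import is only there so it also runs on Python 3:

```python
# Sketch: URLs "deeper" than the one given to add_password() match, and
# with HTTPPasswordMgrWithDefaultRealm the None entry matches any realm.
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # Python 3 home of the same classes

password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, "http://example.com/foo/", "klem", "secret")

# A deeper URL under the top-level URL finds the stored credentials,
# whatever realm the server happens to report
user, pwd = password_mgr.find_user_password(
    "Some Realm", "http://example.com/foo/bar/page.html")
print(user + ':' + pwd)

# A URL outside the top-level URL does not match
print(password_mgr.find_user_password("Some Realm", "http://example.com/other/"))
```
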


Proxies
=======

**urllib2** will auto-detect your proxy settings and use those. This is
through the ``ProxyHandler`` which is part of the normal handler
chain. Normally that's a good thing, but there are occasions when it may not
be helpful [#]_. In that case, one way to disable automatic proxy detection
is to set up our own ``ProxyHandler``, with no proxies defined. This is done
using similar steps to setting up a `Basic Authentication`_ handler::

    >>> proxy_support = urllib2.ProxyHandler({})
    >>> opener = urllib2.build_opener(proxy_support)
    >>> urllib2.install_opener(opener)

.. note::

    Currently ``urllib2`` *does not* support fetching of ``https``
    locations through a proxy. This can be a problem.

Sockets and Layers
==================

The Python support for fetching resources from the web is layered. urllib2
uses the httplib library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response
before timing out. This can be useful in applications which have to fetch web
pages. By default the socket module has *no timeout* and can hang. Currently,
the socket timeout is not exposed at the httplib or urllib2 levels. However,
you can set the default timeout globally for all sockets using::

    import socket
    import urllib2

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib2.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)

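
The effect of ``setdefaulttimeout`` can be checked without touching the
network at all; this short sketch just sets and then restores the
module-level default:

```python
import socket

# Record whatever default is currently in force (normally None: no timeout)
previous = socket.getdefaulttimeout()

# Every socket created from now on will time out after 10 seconds
socket.setdefaulttimeout(10)
print(socket.getdefaulttimeout())

# Restore the earlier value so other code is unaffected
socket.setdefaulttimeout(previous)
```
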

-------


Footnotes
=========

This document was reviewed and revised by John Lee.

.. [#] For an introduction to the CGI protocol see
       `Writing Web Applications in Python <http://www.pyzine.com/Issue008/Section_Articles/article_CGIOne.html>`_.
.. [#] Like Google for example. The *proper* way to use Google from a program
       is to use `PyGoogle <http://pygoogle.sourceforge.net>`_ of course. See
       `Voidspace Google <http://www.voidspace.org.uk/python/recipebook.shtml#google>`_
       for some examples of using the Google API.
.. [#] Browser sniffing is a very bad practise for website design - building
       sites using web standards is much more sensible. Unfortunately a lot of
       sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is
       *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see
       `Quick Reference to HTTP Headers`_.
.. [#] In my case I have to use a proxy to access the internet at work. If you
       attempt to fetch *localhost* URLs through this proxy it blocks them. IE
       is set to use the proxy, which urllib2 picks up on. In order to test
       scripts with a localhost server, I have to prevent urllib2 from using
       the proxy.