==============================================
 HOWTO Fetch Internet Resources Using urllib2
==============================================
------------------------------------------
 Fetching URLs With Python
------------------------------------------
7
8
9.. note::
10
Andrew M. Kuchlingfb108582006-05-07 17:12:12 +000011 There is an French translation of an earlier revision of this
12 HOWTO, available at `urllib2 - Le Manuel manquant
13 <http://www.voidspace/python/urllib2_francais.shtml>`_.
Andrew M. Kuchling4b5caae2006-04-30 21:19:31 +000014
15.. contents:: urllib2 Tutorial
16

Introduction
============

.. sidebar:: Related Articles

    You may also find the following article on fetching web
    resources with Python useful:

    * `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_

      A tutorial on *Basic Authentication*, with examples in Python.

    This HOWTO is written by `Michael Foord
    <http://www.voidspace.org.uk/python/index.shtml>`_.

**urllib2** is a Python_ module for fetching URLs (Uniform Resource
Locators). It offers a very simple interface, in the form of the
*urlopen* function, which is capable of fetching URLs using a variety
of different protocols. It also offers a slightly more complex
interface for handling common situations - like basic authentication,
cookies, proxies, and so on. These are provided by objects called
handlers and openers.

While urllib2 supports fetching URLs for many "URL schemes"
(identified by the string before the ":" in the URL - e.g. "ftp" is
the URL scheme of "ftp://python.org/") using their associated network
protocols (e.g. FTP, HTTP), this tutorial focuses on the most common
case, HTTP.

For straightforward situations *urlopen* is very easy to use. But as
soon as you encounter errors or non-trivial cases when opening HTTP
URLs, you will need some understanding of the HyperText Transfer
Protocol. The most comprehensive and authoritative reference to HTTP
is :RFC:`2616`. This is a technical document and not intended to be
easy to read. This HOWTO aims to illustrate using *urllib2*, with
enough detail about HTTP to help you through. It is not intended to
replace the `urllib2 docs`_, but is supplementary to them.


Fetching URLs
=============

The simplest way to use urllib2 is as follows::

    import urllib2
    response = urllib2.urlopen('http://python.org/')
    html = response.read()

Many uses of urllib2 will be that simple (note that instead of an
'http:' URL we could have used a URL starting with 'ftp:', 'file:',
etc.). However, it's the purpose of this tutorial to explain the more
complicated cases, concentrating on HTTP.

HTTP is based on requests and responses - the client makes requests
and servers send responses. urllib2 mirrors this with a ``Request``
object which represents the HTTP request you are making. In its
simplest form you create a Request object that specifies the URL you
want to fetch. Calling ``urlopen`` with this Request object returns a
response object for the URL requested. This response is a file-like
object, which means you can, for example, call ``.read()`` on the
response::

    import urllib2

    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)
    the_page = response.read()

Note that urllib2 makes use of the same Request interface to handle
all URL schemes. For example, you can make an FTP request like so::

    req = urllib2.Request('ftp://example.com/')

In the case of HTTP, there are two extra things that Request objects
allow you to do: First, you can pass data to be sent to the server.
Second, you can pass extra information ("metadata") *about* the data
or about the request itself, to the server - this information is sent
as HTTP "headers". Let's look at each of these in turn.

Data
----

Sometimes you want to send data to a URL (often the URL will refer to
a CGI (Common Gateway Interface) script [#]_ or other web
application). With HTTP, this is often done using what's known as a
**POST** request. This is often what your browser does when you submit
an HTML form that you have filled in on the web. Not all POSTs have to
come from forms: you can use a POST to transmit arbitrary data to your
own application. In the common case of HTML forms, the data needs to
be encoded in a standard way, and then passed to the Request object as
the ``data`` argument. The encoding is done using a function from the
``urllib`` library, *not* from ``urllib2``. ::

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
    the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload
from HTML forms - see `HTML Specification, Form Submission`_ for more
details).

If you do not pass the ``data`` argument, urllib2 uses a **GET**
request. One way in which GET and POST requests differ is that POST
requests often have "side-effects": they change the state of the
system in some way (for example by placing an order with the website
for a hundredweight of tinned spam to be delivered to your door).
Though the HTTP standard makes it clear that POSTs are intended to
*always* cause side-effects, and GET requests *never* to cause
side-effects, nothing prevents a GET request from having side-effects,
nor a POST request from having no side-effects. Data can also be
passed in an HTTP request by encoding it in the URL itself.

Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to
add headers to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send
different versions to different browsers [#]_. By default urllib2
identifies itself as ``Python-urllib/x.y`` (where ``x`` and ``y`` are
the major and minor version numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
not work. The way a browser identifies itself is through the
``User-Agent`` header [#]_. When you create a Request object you can
pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }
    headers = { 'User-Agent' : user_agent }

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data, headers)
    response = urllib2.urlopen(req)
    the_page = response.read()

The response also has two useful methods. See the section on `info and
geturl`_, which comes after we have a look at what happens when things
go wrong.


Handling Exceptions
===================

*urlopen* raises ``URLError`` when it cannot handle a response (though
as usual with Python APIs, built-in exceptions such as ``ValueError``,
``TypeError`` etc. may also be raised).

``HTTPError`` is the subclass of ``URLError`` raised in the specific
case of HTTP URLs.

URLError
--------

Often, URLError is raised because there is no network connection (no
route to the specified server), or the specified server doesn't exist.
In this case, the exception raised will have a 'reason' attribute,
which is a tuple containing an error code and a text error message.

e.g. ::

    >>> req = urllib2.Request('http://www.pretend_server.org')
    >>> try: urllib2.urlopen(req)
    >>> except URLError, e:
    >>>    print e.reason
    >>>
    (4, 'getaddrinfo failed')


HTTPError
---------

Every HTTP response from the server contains a numeric "status
code". Sometimes the status code indicates that the server is unable
to fulfil the request. The default handlers will handle some of these
responses for you (for example, if the response is a "redirection"
that requests the client fetch the document from a different URL,
urllib2 will handle that for you). For those it can't handle, urlopen
will raise an ``HTTPError``. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication
required).

See section 10 of RFC 2616 for a reference on all the HTTP error
codes.

The ``HTTPError`` instance raised will have an integer 'code'
attribute, which corresponds to the error sent by the server.

Error Codes
~~~~~~~~~~~

Because the default handlers handle redirects (codes in the 300
range), and codes in the 100-299 range indicate success, you will
usually only see error codes in the 400-599 range.

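The ranges above can be summarised in a few lines of code (an
illustrative helper of our own, not part of urllib2):

```python
def status_category(code):
    # Categorise a numeric HTTP status code using the ranges
    # described above.
    if 100 <= code <= 299:
        return 'informational or success'
    elif 300 <= code <= 399:
        return 'redirect - usually handled by the default handlers'
    elif 400 <= code <= 599:
        return 'error - urlopen raises HTTPError'
    return 'unknown'
```

So a 404 falls in the error range and will surface as an
``HTTPError``, while a 301 is normally followed transparently.
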
``BaseHTTPServer.BaseHTTPRequestHandler.responses`` is a useful
dictionary of response codes that shows all the response codes used
by RFC 2616. The dictionary is reproduced here for convenience::

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
        }

When an error is raised, the server responds by returning an HTTP
error code *and* an error page. You can use the ``HTTPError`` instance
as a response object for the page returned. This means that as well as
the code attribute, it also has read, geturl, and info methods. ::

    >>> req = urllib2.Request('http://www.python.org/fish.html')
    >>> try:
    >>>     urllib2.urlopen(req)
    >>> except URLError, e:
    >>>     print e.code
    >>>     print e.read()
    >>>
    404
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
     "http://www.w3.org/TR/html4/loose.dtd">
    <?xml-stylesheet href="./css/ht2html.css"
     type="text/css"?>
    <html><head><title>Error 404: File Not Found</title>
    ...... etc...

Wrapping it Up
--------------

So if you want to be prepared for ``HTTPError`` *or* ``URLError``,
there are two basic approaches. I prefer the second approach.

Number 1
~~~~~~~~

::

    from urllib2 import Request, urlopen, URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError, e:
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    except URLError, e:
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    else:
        # everything is fine
        pass


.. note::

    The ``except HTTPError`` *must* come first, otherwise ``except URLError``
    will *also* catch an ``HTTPError``.

Number 2
~~~~~~~~

::

    from urllib2 import Request, urlopen, URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError, e:
        if hasattr(e, 'reason'):
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
        elif hasattr(e, 'code'):
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
    else:
        # everything is fine
        pass


info and geturl
===============

The response returned by urlopen (or the ``HTTPError`` instance) has
two useful methods, ``info`` and ``geturl``.

**geturl** - this returns the real URL of the page fetched. This is
useful because ``urlopen`` (or the opener object used) may have
followed a redirect. The URL of the page fetched may not be the same
as the URL requested.

**info** - this returns a dictionary-like object that describes the
page fetched, particularly the headers sent by the server. It is
currently an ``httplib.HTTPMessage`` instance.

Typical headers include 'Content-length', 'Content-type', and so
on. See the `Quick Reference to HTTP Headers`_ for a useful listing of
HTTP headers with brief explanations of their meaning and use.


Openers and Handlers
====================

When you fetch a URL you use an opener (an instance of the perhaps
confusingly-named ``urllib2.OpenerDirector``). Normally we have been
using the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL
scheme (http, ftp, etc.), or how to handle an aspect of URL opening,
for example HTTP redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with
specific handlers installed, for example to get an opener that handles
cookies, or to get an opener that does not handle redirections.

To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.

Alternatively, you can use ``build_opener``, which is a convenience
function for creating opener objects with a single function call.
``build_opener`` adds several handlers by default, but provides a
quick way to add more and/or override the default handlers.

Other sorts of handlers you might want can handle proxies,
authentication, and other common but slightly specialised
situations.

``install_opener`` can be used to make an ``opener`` object the
(global) default opener. This means that calls to ``urlopen`` will use
the opener you have installed.

Opener objects have an ``open`` method, which can be called directly
to fetch URLs in the same way as the ``urlopen`` function: there's no
need to call ``install_opener``, except as a convenience.


Basic Authentication
====================

To illustrate creating and installing a handler we will use the
``HTTPBasicAuthHandler``. For a more detailed discussion of this
subject - including an explanation of how Basic Authentication works -
see the `Basic Authentication Tutorial`_.

When authentication is required, the server sends a header (as well as
the 401 error code) requesting authentication. This specifies the
authentication scheme and a 'realm'. The header looks like:
``Www-authenticate: SCHEME realm="REALM"``.

e.g. ::

    Www-authenticate: Basic realm="cPanel Users"


The client should then retry the request with the appropriate name and
password for the realm included as a header in the request. This is
'basic authentication'. In order to simplify this process we can
create an instance of ``HTTPBasicAuthHandler`` and an opener to use
this handler.

The ``HTTPBasicAuthHandler`` uses an object called a password manager
to handle the mapping of URLs and realms to passwords and
usernames. If you know what the realm is (from the authentication
header sent by the server), then you can use a
``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In
that case, it is convenient to use
``HTTPPasswordMgrWithDefaultRealm``. This allows you to specify a
default username and password for a URL, which will be used unless you
provide an alternative combination for a specific realm. We indicate
this by providing ``None`` as the realm argument to the
``add_password`` method.

The top-level URL is the first URL that requires authentication. URLs
"deeper" than the URL you pass to ``.add_password()`` will also match. ::

    # create a password manager
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password.
    # If we knew the realm, we could use it instead of ``None``.
    top_level_url = "http://example.com/foo/"
    password_mgr.add_password(None, top_level_url, username, password)

    handler = urllib2.HTTPBasicAuthHandler(password_mgr)

    # create "opener" (OpenerDirector instance)
    opener = urllib2.build_opener(handler)

    # use the opener to fetch a URL
    opener.open(a_url)

    # Install the opener.
    # Now all calls to urllib2.urlopen use our opener.
    urllib2.install_opener(opener)

.. note::

    In the above example we only supplied our ``HTTPBasicAuthHandler``
    to ``build_opener``. By default openers have the handlers for
    normal situations - ``ProxyHandler``, ``UnknownHandler``,
    ``HTTPHandler``, ``HTTPDefaultErrorHandler``,
    ``HTTPRedirectHandler``, ``FTPHandler``, ``FileHandler``,
    ``HTTPErrorProcessor``.

top_level_url is in fact *either* a full URL (including the 'http:'
scheme component and the hostname and optionally the port number)
e.g. "http://example.com/" *or* an "authority" (i.e. the hostname,
optionally including the port number) e.g. "example.com" or
"example.com:8080" (the latter example includes a port number). The
authority, if present, must NOT contain the "userinfo" component - for
example "joe:password@example.com" is not correct.


Proxies
=======

**urllib2** will auto-detect your proxy settings and use those. This
is through the ``ProxyHandler``, which is part of the normal handler
chain. Normally that's a good thing, but there are occasions when it
may not be helpful [#]_. One way to disable the use of proxies is to
set up our own ``ProxyHandler``, with no proxies defined. This is done
using similar steps to setting up a `Basic Authentication`_ handler::

    >>> proxy_support = urllib2.ProxyHandler({})
    >>> opener = urllib2.build_opener(proxy_support)
    >>> urllib2.install_opener(opener)

.. note::

    Currently ``urllib2`` *does not* support fetching of ``https``
    locations through a proxy. This can be a problem.

Sockets and Layers
==================

The Python support for fetching resources from the web is
layered. urllib2 uses the httplib library, which in turn uses the
socket library.

As of Python 2.3 you can specify how long a socket should wait for a
response before timing out. This can be useful in applications which
have to fetch web pages. By default the socket module has *no timeout*
and can hang. Currently, the socket timeout is not exposed at the
httplib or urllib2 levels. However, you can set the default timeout
globally for all sockets using::

    import socket
    import urllib2

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib2.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)


-------


Footnotes
=========

This document was reviewed and revised by John Lee.

.. [#] For an introduction to the CGI protocol see `Writing Web Applications in Python`_.
.. [#] Like Google for example. The *proper* way to use Google from a program is to use PyGoogle_ of course. See `Voidspace Google`_ for some examples of using the Google API.
.. [#] Browser sniffing is a very bad practice for website design - building sites using web standards is much more sensible. Unfortunately a lot of sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see `Quick Reference to HTTP Headers`_.

.. [#] In my case I have to use a proxy to access the internet at work. If you attempt to fetch *localhost* URLs through this proxy it blocks them. IE is set to use the proxy, which urllib2 picks up on. In order to test scripts with a localhost server, I have to prevent urllib2 from using the proxy.

.. _Python: http://www.python.org
.. _urllib2 docs: http://docs.python.org/lib/module-urllib2.html
.. _HTML Specification, Form Submission: http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13
.. _Quick Reference to HTTP Headers: http://www.cs.tut.fi/~jkorpela/http.html
.. _PyGoogle: http://pygoogle.sourceforge.net
.. _Voidspace Google: http://www.voidspace.org.uk/python/recipebook.shtml#google
.. _Writing Web Applications in Python: http://www.pyzine.com/Issue008/Section_Articles/article_CGIOne.html
.. _Basic Authentication Tutorial: http://www.voidspace.org.uk/python/articles/authentication.shtml