.. _urllib-howto:

***********************************************************
  HOWTO Fetch Internet Resources Using The urllib Package
***********************************************************

:Author: `Michael Foord <http://www.voidspace.org.uk/python/index.shtml>`_

.. note::

    There is a French translation of an earlier revision of this
    HOWTO, available at `urllib2 - Le Manuel manquant
    <http://www.voidspace.org.uk/python/articles/urllib2_francais.shtml>`_.


Introduction
============

.. sidebar:: Related Articles

    You may also find useful the following article on fetching web resources
    with Python:

    * `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_

      A tutorial on *Basic Authentication*, with examples in Python.

**urllib.request** is a Python module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the *urlopen* function. This is capable of fetching URLs using a variety of
different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and so
on. These are provided by objects called handlers and openers.

urllib.request supports fetching URLs for many "URL schemes" (identified by the string
before the ``":"`` in the URL - for example ``"ftp"`` is the URL scheme of
``"ftp://python.org/"``) using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.

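The scheme can be picked out programmatically with :func:`urllib.parse.urlparse`;
a quick sketch::

    from urllib.parse import urlparse

    # the scheme is everything before the ":" in the URL
    scheme = urlparse('ftp://python.org/').scheme  # 'ftp'
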
For straightforward situations *urlopen* is very easy to use. But as soon as you
encounter errors or non-trivial cases when opening HTTP URLs, you will need some
understanding of the HyperText Transfer Protocol. The most comprehensive and
authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
not intended to be easy to read. This HOWTO aims to illustrate using *urllib*,
with enough detail about HTTP to help you through. It is not intended to replace
the :mod:`urllib.request` docs, but is supplementary to them.


Fetching URLs
=============

The simplest way to use urllib.request is as follows::

    import urllib.request
    with urllib.request.urlopen('http://python.org/') as response:
        html = response.read()

If you wish to retrieve a resource via URL and store it in a temporary
location, you can do so via the :func:`shutil.copyfileobj` and
:func:`tempfile.NamedTemporaryFile` functions::

    import shutil
    import tempfile
    import urllib.request

    with urllib.request.urlopen('http://python.org/') as response:
        with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
            shutil.copyfileobj(response, tmp_file)

    with open(tmp_file.name) as html:
        pass

Many uses of urllib will be that simple (note that instead of an 'http:' URL we
could have used a URL starting with 'ftp:', 'file:', etc.). However, it's the
purpose of this tutorial to explain the more complicated cases, concentrating on
HTTP.

HTTP is based on requests and responses - the client makes requests and servers
send responses. urllib.request mirrors this with a ``Request`` object which represents
the HTTP request you are making. In its simplest form you create a Request
object that specifies the URL you want to fetch. Calling ``urlopen`` with this
Request object returns a response object for the URL requested. This response is
a file-like object, which means you can for example call ``.read()`` on the
response::

    import urllib.request

    req = urllib.request.Request('http://www.voidspace.org.uk')
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

Note that urllib.request makes use of the same Request interface to handle all URL
schemes. For example, you can make an FTP request like so::

    req = urllib.request.Request('ftp://example.com/')

In the case of HTTP, there are two extra things that Request objects allow you
to do: First, you can pass data to be sent to the server. Second, you can pass
extra information ("metadata") *about* the data or about the request itself, to
the server - this information is sent as HTTP "headers". Let's look at each of
these in turn.

Data
----

Sometimes you want to send data to a URL (often the URL will refer to a CGI
(Common Gateway Interface) script or other web application). With HTTP,
this is often done using what's known as a **POST** request. This is often what
your browser does when you submit an HTML form that you filled in on the web. Not
all POSTs have to come from forms: you can use a POST to transmit arbitrary data
to your own application. In the common case of HTML forms, the data needs to be
encoded in a standard way, and then passed to the Request object as the ``data``
argument. The encoding is done using a function from the :mod:`urllib.parse`
library. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name': 'Michael Foord',
              'location': 'Northampton',
              'language': 'Python'}

    data = urllib.parse.urlencode(values)
    data = data.encode('ascii')  # data should be bytes
    req = urllib.request.Request(url, data)
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from HTML
forms - see `HTML Specification, Form Submission
<https://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
details).

If you do not pass the ``data`` argument, urllib uses a **GET** request. One
way in which GET and POST requests differ is that POST requests often have
"side-effects": they change the state of the system in some way (for example by
placing an order with the website for a hundredweight of tinned spam to be
delivered to your door). Though the HTTP standard makes it clear that POSTs are
intended to *always* cause side-effects, and GET requests *never* to cause
side-effects, nothing prevents a GET request from having side-effects, nor a
POST request from having no side-effects. Data can also be passed in an HTTP
GET request by encoding it in the URL itself.

This is done as follows::

    >>> import urllib.request
    >>> import urllib.parse
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.parse.urlencode(data)
    >>> print(url_values)  # The order may differ from below.  #doctest: +SKIP
    name=Somebody+Here&language=Python&location=Northampton
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> data = urllib.request.urlopen(full_url)

Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.

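The encoding can be reversed with :func:`urllib.parse.parse_qs`, which is handy
when checking what a query string actually contains (note that it returns a
*list* of values per key)::

    from urllib.parse import urlencode, parse_qs

    query = urlencode({'name': 'Somebody Here', 'language': 'Python'})
    # parse_qs reverses the encoding, giving a list of values per key
    decoded = parse_qs(query)  # {'name': ['Somebody Here'], 'language': ['Python']}
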
Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to add headers
to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send different versions
to different browsers [#]_. By default urllib identifies itself as
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
not work. The way a browser identifies itself is through the
``User-Agent`` header [#]_. When you create a Request object you can
pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
    values = {'name': 'Michael Foord',
              'location': 'Northampton',
              'language': 'Python'}
    headers = {'User-Agent': user_agent}

    data = urllib.parse.urlencode(values)
    data = data.encode('ascii')
    req = urllib.request.Request(url, data, headers)
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

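As an aside, the default ``Python-urllib`` agent string mentioned above is
attached to every opener, so you can inspect what would be sent (the exact
version number depends on your Python)::

    import urllib.request

    # every opener carries the default agent string in its addheaders list
    opener = urllib.request.build_opener()
    print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.6')]
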
The response also has two useful methods. See the section on `info and geturl`_
which comes after we have a look at what happens when things go wrong.


Handling Exceptions
===================

*urlopen* raises :exc:`URLError` when it cannot handle a response (though as
usual with Python APIs, built-in exceptions such as :exc:`ValueError`,
:exc:`TypeError` etc. may also be raised).

:exc:`HTTPError` is the subclass of :exc:`URLError` raised in the specific case of
HTTP URLs.

The exception classes are exported from the :mod:`urllib.error` module.

URLError
--------

Often, URLError is raised because there is no network connection (no route to
the specified server), or the specified server doesn't exist. In this case, the
exception raised will have a 'reason' attribute, which is a tuple containing an
error code and a text error message.

e.g. ::

    >>> req = urllib.request.Request('http://www.pretend_server.org')
    >>> try: urllib.request.urlopen(req)
    ... except urllib.error.URLError as e:
    ...     print(e.reason)  #doctest: +SKIP
    ...
    (4, 'getaddrinfo failed')


HTTPError
---------

Every HTTP response from the server contains a numeric "status code". Sometimes
the status code indicates that the server is unable to fulfil the request. The
default handlers will handle some of these responses for you (for example, if
the response is a "redirection" that requests the client fetch the document from
a different URL, urllib will handle that for you). For those it can't handle,
urlopen will raise an :exc:`HTTPError`. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication required).

See section 10 of :rfc:`2616` for a reference on all the HTTP error codes.

The :exc:`HTTPError` instance raised will have an integer 'code' attribute, which
corresponds to the error sent by the server.

Error Codes
~~~~~~~~~~~

Because the default handlers handle redirects (codes in the 300 range), and
codes in the 100--299 range indicate success, you will usually only see error
codes in the 400--599 range.

:attr:`http.server.BaseHTTPRequestHandler.responses` is a useful dictionary of
response codes that shows all the response codes used by :rfc:`2616`. The
dictionary is reproduced here for convenience ::

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
    }

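A related mapping (the short messages only) is also available at runtime as
``http.client.responses``, so you can turn a numeric code into its standard
reason phrase without reproducing the table yourself::

    import http.client

    # look up the standard reason phrase for a status code
    print(http.client.responses[404])  # Not Found
    print(http.client.responses[503])  # Service Unavailable
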
When an error is raised the server responds by returning an HTTP error code
*and* an error page. You can use the :exc:`HTTPError` instance as a response on the
page returned. This means that as well as the code attribute, it also has read,
geturl, and info methods as returned by the ``urllib.response`` module::

    >>> req = urllib.request.Request('http://www.python.org/fish.html')
    >>> try:
    ...     urllib.request.urlopen(req)
    ... except urllib.error.HTTPError as e:
    ...     print(e.code)
    ...     print(e.read())  #doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ...
    404
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
      ...
      <title>Page Not Found</title>\n
      ...

Wrapping it Up
--------------

So if you want to be prepared for :exc:`HTTPError` *or* :exc:`URLError` there are two
basic approaches. I prefer the second approach.

Number 1
~~~~~~~~

::

    from urllib.request import Request, urlopen
    from urllib.error import URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError as e:
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    except URLError as e:
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
    else:
        # everything is fine


.. note::

    The ``except HTTPError`` *must* come first, otherwise ``except URLError``
    will *also* catch an :exc:`HTTPError`.

Number 2
~~~~~~~~

::

    from urllib.request import Request, urlopen
    from urllib.error import URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError as e:
        if hasattr(e, 'reason'):
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
        elif hasattr(e, 'code'):
            print('The server couldn\'t fulfill the request.')
            print('Error code: ', e.code)
    else:
        # everything is fine


info and geturl
===============

The response returned by urlopen (or the :exc:`HTTPError` instance) has two
useful methods, :meth:`info` and :meth:`geturl`, and is defined in the module
:mod:`urllib.response`.

**geturl** - this returns the real URL of the page fetched. This is useful
because ``urlopen`` (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL requested.

**info** - this returns a dictionary-like object that describes the page
fetched, particularly the headers sent by the server. It is currently an
:class:`http.client.HTTPMessage` instance.

Typical headers include 'Content-length', 'Content-type', and so on. See the
`Quick Reference to HTTP Headers <http://jkorpela.fi/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.

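Both methods can be tried without touching the network by fetching a ``file:``
URL - a sketch (this assumes a POSIX-style path; the temporary filename is
whatever your system assigns)::

    import tempfile
    import urllib.request

    # write a small local file to fetch
    with tempfile.NamedTemporaryFile(suffix='.html', delete=False) as f:
        f.write(b'<html></html>')

    with urllib.request.urlopen('file://' + f.name) as response:
        final_url = response.geturl()   # the URL actually opened
        headers = response.info()       # the response "headers"
        print(headers['Content-type'])  # guessed from the suffix: text/html
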

Openers and Handlers
====================

When you fetch a URL you use an opener (an instance of the perhaps
confusingly-named :class:`urllib.request.OpenerDirector`). Normally we have been using
the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL scheme (http,
ftp, etc.), or how to handle an aspect of URL opening, for example HTTP
redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers
installed, for example to get an opener that handles cookies, or to get an
opener that does not handle redirections.

To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.

Alternatively, you can use ``build_opener``, which is a convenience function for
creating opener objects with a single function call. ``build_opener`` adds
several handlers by default, but provides a quick way to add more and/or
override the default handlers.

Other sorts of handlers you might want can handle proxies, authentication,
and other common but slightly specialised situations.

``install_opener`` can be used to make an ``opener`` object the (global) default
opener. This means that calls to ``urlopen`` will use the opener you have
installed.

Opener objects have an ``open`` method, which can be called directly to fetch
urls in the same way as the ``urlopen`` function: there's no need to call
``install_opener``, except as a convenience.

453
454Basic Authentication
455====================
456
457To illustrate creating and installing a handler we will use the
458``HTTPBasicAuthHandler``. For a more detailed discussion of this subject --
459including an explanation of how Basic Authentication works - see the `Basic
460Authentication Tutorial
461<http://www.voidspace.org.uk/python/articles/authentication.shtml>`_.
462
463When authentication is required, the server sends a header (as well as the 401
464error code) requesting authentication. This specifies the authentication scheme
Serhiy Storchakaf47036c2013-12-24 11:04:36 +0200465and a 'realm'. The header looks like: ``WWW-Authenticate: SCHEME
Georg Brandl116aa622007-08-15 14:28:22 +0000466realm="REALM"``.
467
Serhiy Storchaka46936d52018-04-08 19:18:04 +0300468e.g.
469
470.. code-block:: none
Georg Brandl116aa622007-08-15 14:28:22 +0000471
Sandro Tosi08ccbf42012-04-24 17:36:41 +0200472 WWW-Authenticate: Basic realm="cPanel Users"
Georg Brandl116aa622007-08-15 14:28:22 +0000473
474
475The client should then retry the request with the appropriate name and password
476for the realm included as a header in the request. This is 'basic
477authentication'. In order to simplify this process we can create an instance of
478``HTTPBasicAuthHandler`` and an opener to use this handler.
479
480The ``HTTPBasicAuthHandler`` uses an object called a password manager to handle
481the mapping of URLs and realms to passwords and usernames. If you know what the
482realm is (from the authentication header sent by the server), then you can use a
483``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In that
484case, it is convenient to use ``HTTPPasswordMgrWithDefaultRealm``. This allows
485you to specify a default username and password for a URL. This will be supplied
486in the absence of you providing an alternative combination for a specific
487realm. We indicate this by providing ``None`` as the realm argument to the
488``add_password`` method.
489
490The top-level URL is the first URL that requires authentication. URLs "deeper"
491than the URL you pass to .add_password() will also match. ::
492
493 # create a password manager
Georg Brandl48310cd2009-01-03 21:18:54 +0000494 password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
Georg Brandl116aa622007-08-15 14:28:22 +0000495
496 # Add the username and password.
Georg Brandl1f01deb2009-01-03 22:47:39 +0000497 # If we knew the realm, we could use it instead of None.
Georg Brandl116aa622007-08-15 14:28:22 +0000498 top_level_url = "http://example.com/foo/"
499 password_mgr.add_password(None, top_level_url, username, password)
500
Georg Brandl48310cd2009-01-03 21:18:54 +0000501 handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
Georg Brandl116aa622007-08-15 14:28:22 +0000502
503 # create "opener" (OpenerDirector instance)
Georg Brandl48310cd2009-01-03 21:18:54 +0000504 opener = urllib.request.build_opener(handler)
Georg Brandl116aa622007-08-15 14:28:22 +0000505
506 # use the opener to fetch a URL
Georg Brandl48310cd2009-01-03 21:18:54 +0000507 opener.open(a_url)
Georg Brandl116aa622007-08-15 14:28:22 +0000508
509 # Install the opener.
Senthil Kumaranaca8fd72008-06-23 04:41:59 +0000510 # Now all calls to urllib.request.urlopen use our opener.
Georg Brandl48310cd2009-01-03 21:18:54 +0000511 urllib.request.install_opener(opener)
Georg Brandl116aa622007-08-15 14:28:22 +0000512
513.. note::
514
Ezio Melotti8e87fec2009-07-21 20:37:52 +0000515 In the above example we only supplied our ``HTTPBasicAuthHandler`` to
Georg Brandl116aa622007-08-15 14:28:22 +0000516 ``build_opener``. By default openers have the handlers for normal situations
R David Murray5aea37a2013-04-28 11:07:16 -0400517 -- ``ProxyHandler`` (if a proxy setting such as an :envvar:`http_proxy`
518 environment variable is set), ``UnknownHandler``, ``HTTPHandler``,
Georg Brandl116aa622007-08-15 14:28:22 +0000519 ``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``,
R David Murray5aea37a2013-04-28 11:07:16 -0400520 ``FileHandler``, ``DataHandler``, ``HTTPErrorProcessor``.
Georg Brandl116aa622007-08-15 14:28:22 +0000521
522``top_level_url`` is in fact *either* a full URL (including the 'http:' scheme
523component and the hostname and optionally the port number)
Serhiy Storchakad97b7dc2017-05-16 23:18:09 +0300524e.g. ``"http://example.com/"`` *or* an "authority" (i.e. the hostname,
525optionally including the port number) e.g. ``"example.com"`` or ``"example.com:8080"``
Georg Brandl116aa622007-08-15 14:28:22 +0000526(the latter example includes a port number). The authority, if present, must
Serhiy Storchakad97b7dc2017-05-16 23:18:09 +0300527NOT contain the "userinfo" component - for example ``"joe:password@example.com"`` is
Georg Brandl116aa622007-08-15 14:28:22 +0000528not correct.
529
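The "deeper URL" matching described above can be checked directly on a password
manager; the credentials here are purely illustrative::

    import urllib.request

    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, 'http://example.com/foo/', 'joe', 'secret')

    # a URL below the registered prefix matches, whatever the realm...
    match = mgr.find_user_password('Some Realm', 'http://example.com/foo/bar.html')
    # ...but an unrelated path does not
    no_match = mgr.find_user_password('Some Realm', 'http://example.com/other/')
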

Proxies
=======

**urllib** will auto-detect your proxy settings and use those. This is through
the ``ProxyHandler``, which is part of the normal handler chain when a proxy
setting is detected. Normally that's a good thing, but there are occasions
when it may not be helpful [#]_. One way to disable automatic proxy handling is
to set up our own ``ProxyHandler``, with no proxies defined. This is done using
similar steps to setting up a `Basic Authentication`_ handler: ::

    >>> proxy_support = urllib.request.ProxyHandler({})
    >>> opener = urllib.request.build_opener(proxy_support)
    >>> urllib.request.install_opener(opener)

.. note::

    Currently ``urllib.request`` *does not* support fetching of ``https`` locations
    through a proxy. However, this can be enabled by extending urllib.request as
    shown in the recipe [#]_.

.. note::

    ``HTTP_PROXY`` will be ignored if a variable ``REQUEST_METHOD`` is set; see
    the documentation on :func:`~urllib.request.getproxies`.


Sockets and Layers
==================

The Python support for fetching resources from the web is layered. urllib uses
the :mod:`http.client` library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response
before timing out. This can be useful in applications which have to fetch web
pages. By default the socket module has *no timeout* and can hang. Currently,
the socket timeout is not exposed at the http.client or urllib.request levels.
However, you can set the default timeout globally for all sockets using ::

    import socket
    import urllib.request

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib.request.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib.request.Request('http://www.voidspace.org.uk')
    response = urllib.request.urlopen(req)


-------


Footnotes
=========

This document was reviewed and revised by John Lee.

.. [#] Google for example.
.. [#] Browser sniffing is a very bad practice for website design - building
       sites using web standards is much more sensible. Unfortunately a lot of
       sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is
       *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see
       `Quick Reference to HTTP Headers`_.
.. [#] In my case I have to use a proxy to access the internet at work. If you
       attempt to fetch *localhost* URLs through this proxy it blocks them. IE
       is set to use the proxy, which urllib picks up on. In order to test
       scripts with a localhost server, I have to prevent urllib from using
       the proxy.
.. [#] urllib opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
       <https://code.activestate.com/recipes/456195/>`_.
Georg Brandl48310cd2009-01-03 21:18:54 +0000605