.. _urllib-howto:

***********************************************************
  HOWTO Fetch Internet Resources Using The urllib Package
***********************************************************

:Author: `Michael Foord <http://www.voidspace.org.uk/python/index.shtml>`_

.. note::

    There is a French translation of an earlier revision of this
    HOWTO, available at `urllib2 - Le Manuel manquant
    <http://www.voidspace.org.uk/python/articles/urllib2_francais.shtml>`_.


Introduction
============

.. sidebar:: Related Articles

    You may also find useful the following article on fetching web resources
    with Python:

    * `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_

      A tutorial on *Basic Authentication*, with examples in Python.

**urllib.request** is a Python module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the *urlopen* function. This is capable of fetching URLs using a variety of
different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and so
on. These are provided by objects called handlers and openers.

urllib.request supports fetching URLs for many "URL schemes" (identified by the string
before the ``":"`` in the URL - for example ``"ftp"`` is the URL scheme of
``"ftp://python.org/"``) using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.

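The scheme can be picked out programmatically with :func:`urllib.parse.urlparse`;
a quick sketch::

    from urllib.parse import urlparse

    # the scheme is everything before the ":" in the URL
    scheme = urlparse('ftp://python.org/').scheme  # 'ftp'
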
For straightforward situations *urlopen* is very easy to use. But as soon as you
encounter errors or non-trivial cases when opening HTTP URLs, you will need some
understanding of the HyperText Transfer Protocol. The most comprehensive and
authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
not intended to be easy to read. This HOWTO aims to illustrate using *urllib*,
with enough detail about HTTP to help you through. It is not intended to replace
the :mod:`urllib.request` docs, but is supplementary to them.


Fetching URLs
=============

The simplest way to use urllib.request is as follows::

    import urllib.request
    with urllib.request.urlopen('http://python.org/') as response:
        html = response.read()

If you wish to retrieve a resource via URL and store it in a temporary
location, you can do so via the :func:`shutil.copyfileobj` and
:func:`tempfile.NamedTemporaryFile` functions::

    import shutil
    import tempfile
    import urllib.request

    with urllib.request.urlopen('http://python.org/') as response:
        with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
            shutil.copyfileobj(response, tmp_file)

    with open(tmp_file.name) as html:
        pass

Many uses of urllib will be that simple (note that instead of an 'http:' URL we
could have used a URL starting with 'ftp:', 'file:', etc.). However, it's the
purpose of this tutorial to explain the more complicated cases, concentrating on
HTTP.

HTTP is based on requests and responses - the client makes requests and servers
send responses. urllib.request mirrors this with a ``Request`` object which represents
the HTTP request you are making. In its simplest form you create a Request
object that specifies the URL you want to fetch. Calling ``urlopen`` with this
Request object returns a response object for the URL requested. This response is
a file-like object, which means you can for example call ``.read()`` on the
response::

    import urllib.request

    req = urllib.request.Request('http://www.voidspace.org.uk')
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

Note that urllib.request makes use of the same Request interface to handle all URL
schemes. For example, you can make an FTP request like so::

    req = urllib.request.Request('ftp://example.com/')

In the case of HTTP, there are two extra things that Request objects allow you
to do: First, you can pass data to be sent to the server. Second, you can pass
extra information ("metadata") *about* the data or about the request itself, to
the server - this information is sent as HTTP "headers". Let's look at each of
these in turn.

Data
----

Sometimes you want to send data to a URL (often the URL will refer to a CGI
(Common Gateway Interface) script or other web application). With HTTP,
this is often done using what's known as a **POST** request. This is often what
your browser does when you submit an HTML form that you filled in on the web. Not
all POSTs have to come from forms: you can use a POST to transmit arbitrary data
to your own application. In the common case of HTML forms, the data needs to be
encoded in a standard way, and then passed to the Request object as the ``data``
argument. The encoding is done using a function from the :mod:`urllib.parse`
library. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name': 'Michael Foord',
              'location': 'Northampton',
              'language': 'Python'}

    data = urllib.parse.urlencode(values)
    data = data.encode('ascii')  # data should be bytes
    req = urllib.request.Request(url, data)
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from HTML
forms - see `HTML Specification, Form Submission
<https://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
details).

If you do not pass the ``data`` argument, urllib uses a **GET** request. One
way in which GET and POST requests differ is that POST requests often have
"side-effects": they change the state of the system in some way (for example by
placing an order with the website for a hundredweight of tinned spam to be
delivered to your door). Though the HTTP standard makes it clear that POSTs are
intended to *always* cause side-effects, and GET requests *never* to cause
side-effects, nothing prevents a GET request from having side-effects, nor a
POST request from having no side-effects. Data can also be passed in an HTTP
GET request by encoding it in the URL itself.

This is done as follows::

    >>> import urllib.request
    >>> import urllib.parse
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.parse.urlencode(data)
    >>> print(url_values)  # The order may differ from below.  #doctest: +SKIP
    name=Somebody+Here&language=Python&location=Northampton
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> data = urllib.request.urlopen(full_url)

Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.

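The encoding can be reversed with :func:`urllib.parse.parse_qs`, which is handy
when checking what a query string actually contains (note that it returns a
*list* of values per key)::

    from urllib.parse import urlencode, parse_qs

    query = urlencode({'name': 'Somebody Here', 'language': 'Python'})
    # parse_qs reverses the encoding, giving a list of values per key
    decoded = parse_qs(query)  # {'name': ['Somebody Here'], 'language': ['Python']}
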
Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to add headers
to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send different versions
to different browsers [#]_. By default urllib identifies itself as
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
not work. The way a browser identifies itself is through the
``User-Agent`` header [#]_. When you create a Request object you can
pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
    values = {'name': 'Michael Foord',
              'location': 'Northampton',
              'language': 'Python'}
    headers = {'User-Agent': user_agent}

    data = urllib.parse.urlencode(values)
    data = data.encode('ascii')
    req = urllib.request.Request(url, data, headers)
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

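As an aside, the default ``Python-urllib`` agent string mentioned above is
attached to every opener, so you can inspect what would be sent (the exact
version number depends on your Python)::

    import urllib.request

    # every opener carries the default agent string in its addheaders list
    opener = urllib.request.build_opener()
    print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.6')]
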
The response also has two useful methods. See the section on `info and geturl`_
which comes after we have a look at what happens when things go wrong.


Handling Exceptions
===================

*urlopen* raises :exc:`URLError` when it cannot handle a response (though as
usual with Python APIs, built-in exceptions such as :exc:`ValueError`,
:exc:`TypeError` etc. may also be raised).

:exc:`HTTPError` is the subclass of :exc:`URLError` raised in the specific case of
HTTP URLs.

The exception classes are exported from the :mod:`urllib.error` module.

URLError
--------

Often, URLError is raised because there is no network connection (no route to
the specified server), or the specified server doesn't exist. In this case, the
exception raised will have a 'reason' attribute, which is a tuple containing an
error code and a text error message.

e.g. ::

    >>> req = urllib.request.Request('http://www.pretend_server.org')
    >>> try: urllib.request.urlopen(req)
    ... except urllib.error.URLError as e:
    ...     print(e.reason)  #doctest: +SKIP
    ...
    (4, 'getaddrinfo failed')


HTTPError
---------

Every HTTP response from the server contains a numeric "status code". Sometimes
the status code indicates that the server is unable to fulfil the request. The
default handlers will handle some of these responses for you (for example, if
the response is a "redirection" that requests the client fetch the document from
a different URL, urllib will handle that for you). For those it can't handle,
urlopen will raise an :exc:`HTTPError`. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication required).

See section 10 of :rfc:`2616` for a reference on all the HTTP error codes.

The :exc:`HTTPError` instance raised will have an integer 'code' attribute, which
corresponds to the error sent by the server.

Error Codes
~~~~~~~~~~~

Because the default handlers handle redirects (codes in the 300 range), and
codes in the 100--299 range indicate success, you will usually only see error
codes in the 400--599 range.

:attr:`http.server.BaseHTTPRequestHandler.responses` is a useful dictionary of
response codes that shows all the response codes used by :rfc:`2616`. The
dictionary is reproduced here for convenience ::

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
    }

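A related mapping (the short messages only) is also available at runtime as
``http.client.responses``, so you can turn a numeric code into its standard
reason phrase without reproducing the table yourself::

    import http.client

    # look up the standard reason phrase for a status code
    print(http.client.responses[404])  # Not Found
    print(http.client.responses[503])  # Service Unavailable
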
When an error is raised the server responds by returning an HTTP error code
*and* an error page. You can use the :exc:`HTTPError` instance as a response on the
page returned. This means that as well as the code attribute, it also has read,
geturl, and info methods as returned by the ``urllib.response`` module::

    >>> req = urllib.request.Request('http://www.python.org/fish.html')
    >>> try:
    ...     urllib.request.urlopen(req)
    ... except urllib.error.HTTPError as e:
    ...     print(e.code)
    ...     print(e.read())  #doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ...
    404
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
      ...
      <title>Page Not Found</title>\n
      ...

Wrapping it Up
--------------

So if you want to be prepared for :exc:`HTTPError` *or* :exc:`URLError` there are two
basic approaches. I prefer the second approach.

Number 1
~~~~~~~~

::

    from urllib.request import Request, urlopen
    from urllib.error import URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError as e:
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    except URLError as e:
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
    else:
        # everything is fine


.. note::

    The ``except HTTPError`` *must* come first, otherwise ``except URLError``
    will *also* catch an :exc:`HTTPError`.

Number 2
~~~~~~~~

::

    from urllib.request import Request, urlopen
    from urllib.error import URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError as e:
        if hasattr(e, 'reason'):
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
        elif hasattr(e, 'code'):
            print('The server couldn\'t fulfill the request.')
            print('Error code: ', e.code)
    else:
        # everything is fine


info and geturl
===============

The response returned by urlopen (or the :exc:`HTTPError` instance) has two
useful methods, :meth:`info` and :meth:`geturl`, and is defined in the module
:mod:`urllib.response`.

**geturl** - this returns the real URL of the page fetched. This is useful
because ``urlopen`` (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL requested.

**info** - this returns a dictionary-like object that describes the page
fetched, particularly the headers sent by the server. It is currently an
:class:`http.client.HTTPMessage` instance.

Typical headers include 'Content-length', 'Content-type', and so on. See the
`Quick Reference to HTTP Headers <http://jkorpela.fi/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.

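Both methods can be tried without touching the network by fetching a ``file:``
URL - a sketch (this assumes a POSIX-style path; the temporary filename is
whatever your system assigns)::

    import tempfile
    import urllib.request

    # write a small local file to fetch
    with tempfile.NamedTemporaryFile(suffix='.html', delete=False) as f:
        f.write(b'<html></html>')

    with urllib.request.urlopen('file://' + f.name) as response:
        final_url = response.geturl()   # the URL actually opened
        headers = response.info()       # the response "headers"
        print(headers['Content-type'])  # guessed from the suffix: text/html
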

Openers and Handlers
====================

When you fetch a URL you use an opener (an instance of the perhaps
confusingly-named :class:`urllib.request.OpenerDirector`). Normally we have been using
the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL scheme (http,
ftp, etc.), or how to handle an aspect of URL opening, for example HTTP
redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers
installed, for example to get an opener that handles cookies, or to get an
opener that does not handle redirections.

To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.

Alternatively, you can use ``build_opener``, which is a convenience function for
creating opener objects with a single function call. ``build_opener`` adds
several handlers by default, but provides a quick way to add more and/or
override the default handlers.

Other sorts of handlers you might want can handle proxies, authentication,
and other common but slightly specialised situations.

``install_opener`` can be used to make an ``opener`` object the (global) default
opener. This means that calls to ``urlopen`` will use the opener you have
installed.

Opener objects have an ``open`` method, which can be called directly to fetch
urls in the same way as the ``urlopen`` function: there's no need to call
``install_opener``, except as a convenience.

453
454Basic Authentication
455====================
456
457To illustrate creating and installing a handler we will use the
458``HTTPBasicAuthHandler``. For a more detailed discussion of this subject --
459including an explanation of how Basic Authentication works - see the `Basic
460Authentication Tutorial
461<http://www.voidspace.org.uk/python/articles/authentication.shtml>`_.
462
463When authentication is required, the server sends a header (as well as the 401
464error code) requesting authentication. This specifies the authentication scheme
Serhiy Storchakaf47036c2013-12-24 11:04:36 +0200465and a 'realm'. The header looks like: ``WWW-Authenticate: SCHEME
Georg Brandl116aa622007-08-15 14:28:22 +0000466realm="REALM"``.
467
Serhiy Storchaka46936d52018-04-08 19:18:04 +0300468e.g.
469
470.. code-block:: none
Georg Brandl116aa622007-08-15 14:28:22 +0000471
Sandro Tosi08ccbf42012-04-24 17:36:41 +0200472 WWW-Authenticate: Basic realm="cPanel Users"
Georg Brandl116aa622007-08-15 14:28:22 +0000473
474
475The client should then retry the request with the appropriate name and password
476for the realm included as a header in the request. This is 'basic
477authentication'. In order to simplify this process we can create an instance of
478``HTTPBasicAuthHandler`` and an opener to use this handler.
479
480The ``HTTPBasicAuthHandler`` uses an object called a password manager to handle
481the mapping of URLs and realms to passwords and usernames. If you know what the
482realm is (from the authentication header sent by the server), then you can use a
483``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In that
484case, it is convenient to use ``HTTPPasswordMgrWithDefaultRealm``. This allows
485you to specify a default username and password for a URL. This will be supplied
486in the absence of you providing an alternative combination for a specific
487realm. We indicate this by providing ``None`` as the realm argument to the
488``add_password`` method.
489
490The top-level URL is the first URL that requires authentication. URLs "deeper"
491than the URL you pass to .add_password() will also match. ::
492
493 # create a password manager
Georg Brandl48310cd2009-01-03 21:18:54 +0000494 password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
Georg Brandl116aa622007-08-15 14:28:22 +0000495
496 # Add the username and password.
Georg Brandl1f01deb2009-01-03 22:47:39 +0000497 # If we knew the realm, we could use it instead of None.
Georg Brandl116aa622007-08-15 14:28:22 +0000498 top_level_url = "http://example.com/foo/"
499 password_mgr.add_password(None, top_level_url, username, password)
500
Georg Brandl48310cd2009-01-03 21:18:54 +0000501 handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
Georg Brandl116aa622007-08-15 14:28:22 +0000502
503 # create "opener" (OpenerDirector instance)
Georg Brandl48310cd2009-01-03 21:18:54 +0000504 opener = urllib.request.build_opener(handler)
Georg Brandl116aa622007-08-15 14:28:22 +0000505
506 # use the opener to fetch a URL
Georg Brandl48310cd2009-01-03 21:18:54 +0000507 opener.open(a_url)
Georg Brandl116aa622007-08-15 14:28:22 +0000508
509 # Install the opener.
Senthil Kumaranaca8fd72008-06-23 04:41:59 +0000510 # Now all calls to urllib.request.urlopen use our opener.
Georg Brandl48310cd2009-01-03 21:18:54 +0000511 urllib.request.install_opener(opener)
Georg Brandl116aa622007-08-15 14:28:22 +0000512
513.. note::
514
Ezio Melotti8e87fec2009-07-21 20:37:52 +0000515 In the above example we only supplied our ``HTTPBasicAuthHandler`` to
Georg Brandl116aa622007-08-15 14:28:22 +0000516 ``build_opener``. By default openers have the handlers for normal situations
R David Murray5aea37a2013-04-28 11:07:16 -0400517 -- ``ProxyHandler`` (if a proxy setting such as an :envvar:`http_proxy`
518 environment variable is set), ``UnknownHandler``, ``HTTPHandler``,
Georg Brandl116aa622007-08-15 14:28:22 +0000519 ``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``,
R David Murray5aea37a2013-04-28 11:07:16 -0400520 ``FileHandler``, ``DataHandler``, ``HTTPErrorProcessor``.
Georg Brandl116aa622007-08-15 14:28:22 +0000521
522``top_level_url`` is in fact *either* a full URL (including the 'http:' scheme
523component and the hostname and optionally the port number)
Serhiy Storchakad97b7dc2017-05-16 23:18:09 +0300524e.g. ``"http://example.com/"`` *or* an "authority" (i.e. the hostname,
525optionally including the port number) e.g. ``"example.com"`` or ``"example.com:8080"``
Georg Brandl116aa622007-08-15 14:28:22 +0000526(the latter example includes a port number). The authority, if present, must
Serhiy Storchakad97b7dc2017-05-16 23:18:09 +0300527NOT contain the "userinfo" component - for example ``"joe:password@example.com"`` is
Georg Brandl116aa622007-08-15 14:28:22 +0000528not correct.
529
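The "deeper URL" matching described above can be checked directly on a password
manager; the credentials here are purely illustrative::

    import urllib.request

    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, 'http://example.com/foo/', 'joe', 'secret')

    # a URL below the registered prefix matches, whatever the realm...
    match = mgr.find_user_password('Some Realm', 'http://example.com/foo/bar.html')
    # ...but an unrelated path does not
    no_match = mgr.find_user_password('Some Realm', 'http://example.com/other/')
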

Proxies
=======

**urllib** will auto-detect your proxy settings and use those. This is through
the ``ProxyHandler``, which is part of the normal handler chain when a proxy
setting is detected. Normally that's a good thing, but there are occasions
when it may not be helpful [#]_. One way to disable automatic proxy handling is
to set up our own ``ProxyHandler``, with no proxies defined. This is done using
similar steps to setting up a `Basic Authentication`_ handler: ::

    >>> proxy_support = urllib.request.ProxyHandler({})
    >>> opener = urllib.request.build_opener(proxy_support)
    >>> urllib.request.install_opener(opener)

.. note::

    Currently ``urllib.request`` *does not* support fetching of ``https`` locations
    through a proxy. However, this can be enabled by extending urllib.request as
    shown in the recipe [#]_.

.. note::

    ``HTTP_PROXY`` will be ignored if a variable ``REQUEST_METHOD`` is set; see
    the documentation on :func:`~urllib.request.getproxies`.


Sockets and Layers
==================

The Python support for fetching resources from the web is layered. urllib uses
the :mod:`http.client` library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response
before timing out. This can be useful in applications which have to fetch web
pages. By default the socket module has *no timeout* and can hang. Currently,
the socket timeout is not exposed at the http.client or urllib.request levels.
However, you can set the default timeout globally for all sockets using ::

    import socket
    import urllib.request

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib.request.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib.request.Request('http://www.voidspace.org.uk')
    response = urllib.request.urlopen(req)


-------


Footnotes
=========

This document was reviewed and revised by John Lee.

.. [#] Google for example.
.. [#] Browser sniffing is a very bad practice for website design - building
       sites using web standards is much more sensible. Unfortunately a lot of
       sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is
       *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see
       `Quick Reference to HTTP Headers`_.
.. [#] In my case I have to use a proxy to access the internet at work. If you
       attempt to fetch *localhost* URLs through this proxy it blocks them. IE
       is set to use the proxy, which urllib picks up on. In order to test
       scripts with a localhost server, I have to prevent urllib from using
       the proxy.
.. [#] urllib opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
       <https://code.activestate.com/recipes/456195/>`_.
Georg Brandl48310cd2009-01-03 21:18:54 +0000605