.. _urllib-howto:

***********************************************************
  HOWTO Fetch Internet Resources Using The urllib Package
***********************************************************

:Author: `Michael Foord <http://www.voidspace.org.uk/python/index.shtml>`_

.. note::

    There is a French translation of an earlier revision of this
    HOWTO, available at `urllib2 - Le Manuel manquant
    <http://www.voidspace.org.uk/python/articles/urllib2_francais.shtml>`_.


Introduction
============

.. sidebar:: Related Articles

    You may also find useful the following article on fetching web resources
    with Python:

    * `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_

        A tutorial on *Basic Authentication*, with examples in Python.

**urllib.request** is a `Python <http://www.python.org>`_ module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the *urlopen* function. This is capable of fetching URLs using a variety of
different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and so
on. These are provided by objects called handlers and openers.

urllib.request supports fetching URLs for many "URL schemes" (identified by the
string before the ":" in the URL - for example "ftp" is the URL scheme of
"ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.

For straightforward situations *urlopen* is very easy to use. But as soon as you
encounter errors or non-trivial cases when opening HTTP URLs, you will need some
understanding of the HyperText Transfer Protocol. The most comprehensive and
authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
not intended to be easy to read. This HOWTO aims to illustrate using *urllib*,
with enough detail about HTTP to help you through. It is not intended to replace
the :mod:`urllib.request` docs, but is supplementary to them.


Fetching URLs
=============

The simplest way to use urllib.request is as follows::

    import urllib.request
    response = urllib.request.urlopen('http://python.org/')
    html = response.read()

If you wish to retrieve a resource via URL and store it in a temporary location,
you can do so via the :func:`~urllib.request.urlretrieve` function::

    import urllib.request
    local_filename, headers = urllib.request.urlretrieve('http://python.org/')
    html = open(local_filename)

Many uses of urllib will be that simple (note that instead of an 'http:' URL we
could have used a URL starting with 'ftp:', 'file:', etc.).  However, it's the
purpose of this tutorial to explain the more complicated cases, concentrating on
HTTP.

HTTP is based on requests and responses - the client makes requests and servers
send responses. urllib.request mirrors this with a ``Request`` object which
represents the HTTP request you are making. In its simplest form you create a
Request object that specifies the URL you want to fetch. Calling ``urlopen``
with this Request object returns a response object for the URL requested. This
response is a file-like object, which means you can for example call ``.read()``
on the response::

    import urllib.request

    req = urllib.request.Request('http://www.voidspace.org.uk')
    response = urllib.request.urlopen(req)
    the_page = response.read()

Note that urllib.request makes use of the same Request interface to handle all
URL schemes.  For example, you can make an FTP request like so::

    req = urllib.request.Request('ftp://example.com/')

In the case of HTTP, there are two extra things that Request objects allow you
to do: First, you can pass data to be sent to the server.  Second, you can pass
extra information ("metadata") *about* the data or about the request itself, to
the server - this information is sent as HTTP "headers".  Let's look at each of
these in turn.

Data
----

Sometimes you want to send data to a URL (often the URL will refer to a CGI
(Common Gateway Interface) script [#]_ or other web application). With HTTP,
this is often done using what's known as a **POST** request. This is often what
your browser does when you submit an HTML form that you filled in on the web.
Not all POSTs have to come from forms: you can use a POST to transmit arbitrary
data to your own application. In the common case of HTML forms, the data needs
to be encoded in a standard way, and then passed to the Request object as the
``data`` argument. The encoding is done using a function from the
:mod:`urllib.parse` library. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }

    data = urllib.parse.urlencode(values)
    data = data.encode('utf-8') # data should be bytes
    req = urllib.request.Request(url, data)
    response = urllib.request.urlopen(req)
    the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from HTML
forms - see `HTML Specification, Form Submission
<http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
details).
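
To see what the encoding step produces on its own (the server URL in the
example above is hypothetical, so no request is made here), note that
``urlencode`` turns a dictionary into a percent-encoded query string, which
must then be converted to bytes before use as the ``data`` argument:

```python
import urllib.parse

# urlencode produces an application/x-www-form-urlencoded string;
# spaces become '+' and reserved characters are percent-encoded.
values = {'name': 'Michael Foord', 'language': 'Python'}
data = urllib.parse.urlencode(values)
print(data)  # name=Michael+Foord&language=Python

# The data argument of Request must be bytes, not str, so encode it:
data_bytes = data.encode('utf-8')
print(type(data_bytes))  # <class 'bytes'>
```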

If you do not pass the ``data`` argument, urllib uses a **GET** request. One
way in which GET and POST requests differ is that POST requests often have
"side-effects": they change the state of the system in some way (for example by
placing an order with the website for a hundredweight of tinned spam to be
delivered to your door).  Though the HTTP standard makes it clear that POSTs are
intended to *always* cause side-effects, and GET requests *never* to cause
side-effects, nothing prevents a GET request from having side-effects, nor a
POST request from having no side-effects. Data can also be passed in an HTTP
GET request by encoding it in the URL itself.

This is done as follows::

    >>> import urllib.request
    >>> import urllib.parse
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.parse.urlencode(data)
    >>> print(url_values)  # The order may differ from below.  #doctest: +SKIP
    name=Somebody+Here&language=Python&location=Northampton
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> data = urllib.request.urlopen(full_url)

Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.

Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to add headers
to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send different versions
to different browsers [#]_. By default urllib identifies itself as
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
not work. The way a browser identifies itself is through the
``User-Agent`` header [#]_. When you create a Request object you can
pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }
    headers = { 'User-Agent' : user_agent }

    data = urllib.parse.urlencode(values)
    data = data.encode('utf-8')
    req = urllib.request.Request(url, data, headers)
    response = urllib.request.urlopen(req)
    the_page = response.read()

The response also has two useful methods. See the section on `info and geturl`_
which comes after we have a look at what happens when things go wrong.


Handling Exceptions
===================

*urlopen* raises :exc:`URLError` when it cannot handle a response (though as
usual with Python APIs, built-in exceptions such as :exc:`ValueError`,
:exc:`TypeError` etc. may also be raised).

:exc:`HTTPError` is the subclass of :exc:`URLError` raised in the specific case
of HTTP URLs.

The exception classes are exported from the :mod:`urllib.error` module.

URLError
--------

Often, URLError is raised because there is no network connection (no route to
the specified server), or the specified server doesn't exist.  In this case, the
exception raised will have a 'reason' attribute, which is a tuple containing an
error code and a text error message.

e.g. ::

    >>> import urllib.request
    >>> import urllib.error
    >>> req = urllib.request.Request('http://www.pretend_server.org')
    >>> try: urllib.request.urlopen(req)
    ... except urllib.error.URLError as e:
    ...    print(e.reason)      #doctest: +SKIP
    ...
    (4, 'getaddrinfo failed')
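
You can also see a :exc:`URLError` without any network connection, because the
``file:`` scheme goes through the same machinery. A minimal sketch, using a
deliberately bogus local path:

```python
import urllib.request
import urllib.error

# Opening a nonexistent local file raises URLError, just as an
# unreachable server would (the path here is deliberately bogus):
try:
    urllib.request.urlopen('file:///this/path/should/not/exist')
except urllib.error.URLError as e:
    caught = e
    print('Failed:', e.reason)
```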


HTTPError
---------

Every HTTP response from the server contains a numeric "status code". Sometimes
the status code indicates that the server is unable to fulfil the request. The
default handlers will handle some of these responses for you (for example, if
the response is a "redirection" that requests the client fetch the document from
a different URL, urllib will handle that for you). For those it can't handle,
urlopen will raise an :exc:`HTTPError`. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication required).

See section 10 of :rfc:`2616` for a reference on all the HTTP error codes.

The :exc:`HTTPError` instance raised will have an integer 'code' attribute,
which corresponds to the error sent by the server.

Error Codes
~~~~~~~~~~~

Because the default handlers handle redirects (codes in the 300 range), and
codes in the 100-299 range indicate success, you will usually only see error
codes in the 400-599 range.

:attr:`http.server.BaseHTTPRequestHandler.responses` is a useful dictionary of
response codes that shows all the response codes used by :rfc:`2616`. The
dictionary is reproduced here for convenience ::

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
        }
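
Rather than consulting the copy above, you can look entries up in the live
dictionary directly:

```python
from http.server import BaseHTTPRequestHandler

# Each entry maps a status code to a (shortmessage, longmessage) pair:
short_msg, long_msg = BaseHTTPRequestHandler.responses[404]
print(short_msg)  # Not Found
print(long_msg)
```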

When an error is raised the server responds by returning an HTTP error code
*and* an error page. You can use the :exc:`HTTPError` instance as a response on
the page returned. This means that as well as the code attribute, it also has
read, geturl, and info methods as returned by the ``urllib.response`` module::

    >>> req = urllib.request.Request('http://www.python.org/fish.html')
    >>> try:
    ...     urllib.request.urlopen(req)
    ... except urllib.error.HTTPError as e:
    ...     print(e.code)
    ...     print(e.read())  #doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ...
    404
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
      ...
      <title>Page Not Found</title>\n
      ...

Wrapping it Up
--------------

So if you want to be prepared for :exc:`HTTPError` *or* :exc:`URLError` there
are two basic approaches. I prefer the second approach.

Number 1
~~~~~~~~

::

    from urllib.request import Request, urlopen
    from urllib.error import URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError as e:
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    except URLError as e:
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
    else:
        pass  # everything is fine


.. note::

    The ``except HTTPError`` *must* come first, otherwise ``except URLError``
    will *also* catch an :exc:`HTTPError`.
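
The reason this ordering matters is visible in the class hierarchy itself:

```python
from urllib.error import URLError, HTTPError

# Because HTTPError subclasses URLError, an "except URLError" clause
# placed first would swallow HTTPError instances as well:
print(issubclass(HTTPError, URLError))  # True
```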

Number 2
~~~~~~~~

::

    from urllib.request import Request, urlopen
    from urllib.error import URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError as e:
        if hasattr(e, 'reason'):
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
        elif hasattr(e, 'code'):
            print('The server couldn\'t fulfill the request.')
            print('Error code: ', e.code)
    else:
        pass  # everything is fine


info and geturl
===============

The response returned by urlopen (or the :exc:`HTTPError` instance) has two
useful methods, :meth:`info` and :meth:`geturl`, and is defined in the module
:mod:`urllib.response`.

**geturl** - this returns the real URL of the page fetched. This is useful
because ``urlopen`` (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL requested.

**info** - this returns a dictionary-like object that describes the page
fetched, particularly the headers sent by the server. It is currently an
:class:`http.client.HTTPMessage` instance.

Typical headers include 'Content-length', 'Content-type', and so on. See the
`Quick Reference to HTTP Headers <http://www.cs.tut.fi/~jkorpela/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.
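
To show the dictionary-like interface without making a network request, we can
construct an :class:`http.client.HTTPMessage` by hand (a real one would come
from ``response.info()``):

```python
from http.client import HTTPMessage

# Build an HTTPMessage locally purely for illustration; normally
# response.info() returns one populated from the server's headers.
msg = HTTPMessage()
msg['Content-Type'] = 'text/html; charset=utf-8'
msg['Content-Length'] = '1234'

# Headers are accessed case-insensitively, like a dictionary:
print(msg['content-type'])     # text/html; charset=utf-8
print(msg.get_content_type())  # text/html
```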


Openers and Handlers
====================

When you fetch a URL you use an opener (an instance of the perhaps
confusingly-named :class:`urllib.request.OpenerDirector`). Normally we have
been using the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL scheme (http,
ftp, etc.), or how to handle an aspect of URL opening, for example HTTP
redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers
installed, for example to get an opener that handles cookies, or to get an
opener that does not handle redirections.

To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.

Alternatively, you can use ``build_opener``, which is a convenience function for
creating opener objects with a single function call.  ``build_opener`` adds
several handlers by default, but provides a quick way to add more and/or
override the default handlers.

Other sorts of handlers can handle proxies, authentication, and other common
but slightly specialised situations.

``install_opener`` can be used to make an ``opener`` object the (global) default
opener. This means that calls to ``urlopen`` will use the opener you have
installed.

Opener objects have an ``open`` method, which can be called directly to fetch
URLs in the same way as the ``urlopen`` function: there's no need to call
``install_opener``, except as a convenience.
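
As a sketch of the pieces described above, here is an opener built with
``build_opener`` that refuses to follow redirections (``NoRedirectHandler`` is
our own hypothetical name for this example, not part of the standard library):

```python
import urllib.request

# A handler that disables redirection handling: returning None from
# redirect_request makes a redirect raise HTTPError instead of being
# followed.  NoRedirectHandler is our own (hypothetical) name.
class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib.request.build_opener(NoRedirectHandler())

# The opener can be used directly:
#     response = opener.open('http://www.example.com/')
# or installed globally, after which plain urlopen() uses it too:
urllib.request.install_opener(opener)
```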
 | 444 |  | 
 | 445 |  | 
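As a sketch of the "opener that does not handle redirections" idea mentioned
above (the subclass name is our own; ``build_opener`` replaces a default
handler when it is given an instance of a subclass of that handler):

```python
import urllib.request

class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects: returning None from redirect_request
    makes a 3xx response surface as an HTTPError instead."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

# Our instance displaces the default HTTPRedirectHandler.
opener = urllib.request.build_opener(NoRedirectHandler())

# opener.open(some_url) would now raise HTTPError on any redirect;
# install_opener is only needed if we want this behaviour globally.
```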
Basic Authentication
====================

To illustrate creating and installing a handler we will use the
``HTTPBasicAuthHandler``. For a more detailed discussion of this subject --
including an explanation of how Basic Authentication works -- see the `Basic
Authentication Tutorial
<http://www.voidspace.org.uk/python/articles/authentication.shtml>`_.

When authentication is required, the server sends a header (as well as the 401
error code) requesting authentication.  This specifies the authentication scheme
and a 'realm'. The header looks like: ``WWW-Authenticate: SCHEME
realm="REALM"``.

e.g. ::

    WWW-Authenticate: Basic realm="cPanel Users"


The client should then retry the request with the appropriate name and password
for the realm included as a header in the request. This is 'basic
authentication'. In order to simplify this process we can create an instance of
``HTTPBasicAuthHandler`` and an opener to use this handler.

The ``HTTPBasicAuthHandler`` uses an object called a password manager to handle
the mapping of URLs and realms to passwords and usernames. If you know what the
realm is (from the authentication header sent by the server), then you can use
an ``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In that
case, it is convenient to use ``HTTPPasswordMgrWithDefaultRealm``. This allows
you to specify a default username and password for a URL, which will be used
unless you provide an alternative combination for a specific realm. We indicate
this by providing ``None`` as the realm argument to the ``add_password`` method.

The top-level URL is the first URL that requires authentication. URLs "deeper"
than the URL you pass to ``add_password()`` will also match. ::

    # create a password manager
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password.
    # If we knew the realm, we could use it instead of None.
    top_level_url = "http://example.com/foo/"
    password_mgr.add_password(None, top_level_url, username, password)

    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

    # create "opener" (OpenerDirector instance)
    opener = urllib.request.build_opener(handler)

    # use the opener to fetch a URL
    opener.open(a_url)

    # Install the opener.
    # Now all calls to urllib.request.urlopen use our opener.
    urllib.request.install_opener(opener)

.. note::

    In the above example we only supplied our ``HTTPBasicAuthHandler`` to
    ``build_opener``. By default openers have the handlers for normal situations
    -- ``ProxyHandler`` (if a proxy setting such as an :envvar:`http_proxy`
    environment variable is set), ``UnknownHandler``, ``HTTPHandler``,
    ``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``,
    ``FileHandler``, ``DataHandler``, ``HTTPErrorProcessor``.

``top_level_url`` is in fact *either* a full URL (including the 'http:' scheme
component and the hostname and optionally the port number),
e.g. ``"http://example.com/"``, *or* an "authority" (i.e. the hostname,
optionally including the port number), e.g. ``"example.com"`` or
``"example.com:8080"``.  The authority, if present, must NOT contain the
"userinfo" component - for example ``"joe:password@example.com"`` is
not correct.

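The matching behaviour can be seen without any network traffic (the URL,
username, and password here are placeholders):

```python
import urllib.request

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, "http://example.com/foo/", "joe", "secret")

# A URL "deeper" than the registered top-level URL matches...
print(password_mgr.find_user_password(None, "http://example.com/foo/bar/"))

# ...but an unrelated path on the same host does not.
print(password_mgr.find_user_password(None, "http://example.com/other/"))
```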

Proxies
=======

**urllib** will auto-detect your proxy settings and use those. This is through
the ``ProxyHandler``, which is part of the normal handler chain when a proxy
setting is detected.  Normally that's a good thing, but there are occasions
when it may not be helpful [#]_. One way to prevent automatic proxy use is to
set up our own ``ProxyHandler`` with no proxies defined. This is done using
similar steps to setting up a `Basic Authentication`_ handler::

    >>> proxy_support = urllib.request.ProxyHandler({})
    >>> opener = urllib.request.build_opener(proxy_support)
    >>> urllib.request.install_opener(opener)

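Conversely, the same mechanism can force requests through a specific proxy
rather than disable one (the proxy host and port below are hypothetical):

```python
import urllib.request

# Route all http requests through a (hypothetical) proxy; an empty
# dict, as above, disables proxies entirely.
proxy_support = urllib.request.ProxyHandler(
    {'http': 'http://proxy.example.com:3128/'})
opener = urllib.request.build_opener(proxy_support)
# urllib.request.install_opener(opener)  # optional: make it the global default
```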
.. note::

    Currently ``urllib.request`` *does not* support fetching of ``https`` locations
    through a proxy.  However, this can be enabled by extending urllib.request as
    shown in the recipe [#]_.


Sockets and Layers
==================

The Python support for fetching resources from the web is layered.  urllib uses
the :mod:`http.client` library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response
before timing out. This can be useful in applications which have to fetch web
pages. By default the socket module has *no timeout* and can hang.
``urllib.request.urlopen`` accepts a *timeout* argument for individual
requests; alternatively, you can set the default timeout globally for all
sockets using ::

    import socket
    import urllib.request

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib.request.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib.request.Request('http://www.voidspace.org.uk')
    response = urllib.request.urlopen(req)

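Since the global default affects every socket in the process, a lighter-weight
alternative is the *timeout* argument to ``urlopen`` itself. A sketch (no
request is actually made; the URL is a placeholder):

```python
import socket
import urllib.request

# Per-request timeout: overrides any global default for this call only.
# response = urllib.request.urlopen('http://example.com/', timeout=10)

# The global default can be inspected, changed, and restored:
previous = socket.getdefaulttimeout()
socket.setdefaulttimeout(10)
socket.setdefaulttimeout(previous)
```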

-------


Footnotes
=========

This document was reviewed and revised by John Lee.

.. [#] For an introduction to the CGI protocol see
       `Writing Web Applications in Python <http://www.pyzine.com/Issue008/Section_Articles/article_CGIOne.html>`_.
.. [#] Like Google for example. The *proper* way to use Google from a program
       is to use `PyGoogle <http://pygoogle.sourceforge.net>`_ of course. See
       `Voidspace Google <http://www.voidspace.org.uk/python/recipebook.shtml#google>`_
       for some examples of using the Google API.
.. [#] Browser sniffing is a very bad practice for website design - building
       sites using web standards is much more sensible. Unfortunately a lot of
       sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is
       *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see
       `Quick Reference to HTTP Headers`_.
.. [#] In my case I have to use a proxy to access the internet at work. If you
       attempt to fetch *localhost* URLs through this proxy it blocks them. IE
       is set to use the proxy, which urllib picks up on. In order to test
       scripts with a localhost server, I have to prevent urllib from using
       the proxy.
.. [#] urllib opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
       <http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/456195>`_.
