***********************************************************
  HOWTO Fetch Internet Resources Using The urllib Package
***********************************************************

:Author: `Michael Foord <http://www.voidspace.org.uk/python/index.shtml>`_

.. note::

    There is a French translation of an earlier revision of this
    HOWTO, available at `urllib2 - Le Manuel manquant
    <http://www.voidspace.org.uk/python/articles/urllib2_francais.shtml>`_.



Introduction
============

.. sidebar:: Related Articles

    You may also find the following article on fetching web resources
    with Python useful:

    * `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_

        A tutorial on Basic Authentication, with examples in Python.

urllib.request is a `Python <http://www.python.org>`_ module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the urlopen function. This is capable of fetching URLs using a variety of
different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and so
on. These are provided by objects called handlers and openers.

urllib.request supports fetching URLs for many "URL schemes" (identified by the string
before the ":" in the URL - for example "ftp" is the URL scheme of
"ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.

For straightforward situations urlopen is very easy to use. But as soon as you
encounter errors or non-trivial cases when opening HTTP URLs, you will need some
understanding of the HyperText Transfer Protocol. The most comprehensive and
authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
is not intended to be easy to read. This HOWTO aims to illustrate using urllib,
with enough detail about HTTP to help you through. It is not intended to replace
the :mod:`urllib.request` docs, but is supplementary to them.


Fetching URLs
=============

The simplest way to use urllib.request is as follows::

    import urllib.request
    response = urllib.request.urlopen('http://python.org/')
    html = response.read()

Many uses of urllib will be that simple (note that instead of an 'http:' URL we
could have used a URL starting with 'ftp:', 'file:', etc.). However, it's the
purpose of this tutorial to explain the more complicated cases, concentrating on
HTTP.

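As a small variation on the snippet above (a sketch, not the only way to do it), the response can be used as a context manager so that it is closed promptly. Note that ``read()`` returns bytes, which have to be decoded before being treated as text. The ``data:`` URL is an assumption made purely so the example runs without a network connection; an 'http:' URL is opened in the same way.

```python
import urllib.request

# A data: URL carries its payload inline, so no network access is needed.
# (Assumption for illustration; substitute any http:, ftp: or file: URL.)
url = 'data:text/plain;charset=utf-8,hello%20world'

with urllib.request.urlopen(url) as response:
    raw = response.read()          # bytes, exactly as received

text = raw.decode('utf-8')         # decode explicitly to get a str
print(text)
```
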
HTTP is based on requests and responses - the client makes requests and servers
send responses. urllib.request mirrors this with a ``Request`` object which represents
the HTTP request you are making. In its simplest form you create a Request
object that specifies the URL you want to fetch. Calling ``urlopen`` with this
Request object returns a response object for the URL requested. This response is
a file-like object, which means you can for example call ``.read()`` on the
response::

    import urllib.request

    req = urllib.request.Request('http://www.voidspace.org.uk')
    response = urllib.request.urlopen(req)
    the_page = response.read()

Note that urllib.request makes use of the same Request interface to handle all URL
schemes. For example, you can make an FTP request like so::

    req = urllib.request.Request('ftp://example.com/')

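To make the "same interface for every scheme" point concrete, here is a minimal sketch that builds two Request objects and looks at what they recorded. Creating a Request does not contact any server, so this runs offline. The attribute names ``type``, ``host`` and ``full_url`` are those of ``urllib.request.Request``; the URLs are the ones used above.

```python
import urllib.request

for url in ('http://www.voidspace.org.uk', 'ftp://example.com/'):
    req = urllib.request.Request(url)
    # type is the URL scheme; host is the network location part
    print(req.type, req.host, req.full_url)
```
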
In the case of HTTP, there are two extra things that Request objects allow you
to do: First, you can pass data to be sent to the server. Second, you can pass
extra information ("metadata") about the data or about the request itself to
the server - this information is sent as HTTP "headers". Let's look at each of
these in turn.

Data
----

Sometimes you want to send data to a URL (often the URL will refer to a CGI
(Common Gateway Interface) script [#]_ or other web application). With HTTP,
this is often done using what's known as a POST request. This is often what
your browser does when you submit an HTML form that you filled in on the web. Not
all POSTs have to come from forms: you can use a POST to transmit arbitrary data
to your own application. In the common case of HTML forms, the data needs to be
encoded in a standard way, and then passed to the Request object as the ``data``
argument. The encoding is done using a function from the :mod:`urllib.parse`
library. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }

    data = urllib.parse.urlencode(values)
    data = data.encode('ascii')  # data must be bytes
    req = urllib.request.Request(url, data)
    response = urllib.request.urlopen(req)
    the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from HTML
forms - see `HTML Specification, Form Submission
<http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
details).

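The encoding step can be checked without sending anything. The sketch below, which assumes Python 3 (where the ``data`` argument must be bytes), builds the same Request as above and prints the encoded form and the HTTP method the Request would use. No network access takes place.

```python
import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}

encoded = urllib.parse.urlencode(values)   # str: key=value pairs joined by &
data = encoded.encode('ascii')             # bytes, as Request expects in Python 3

req = urllib.request.Request(url, data)
print(encoded)
print(req.get_method())    # a Request carrying data defaults to POST
```
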
If you do not pass the ``data`` argument, urllib uses a GET request. One
way in which GET and POST requests differ is that POST requests often have
"side-effects": they change the state of the system in some way (for example by
placing an order with the website for a hundredweight of tinned spam to be
delivered to your door). Though the HTTP standard makes it clear that POSTs are
intended to always cause side-effects, and GET requests never to cause
side-effects, nothing prevents a GET request from having side-effects, nor a
POST request from having no side-effects. Data can also be passed in an HTTP
GET request by encoding it in the URL itself.

This is done as follows::

    >>> import urllib.request
    >>> import urllib.parse
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.parse.urlencode(data)
    >>> print(url_values)
    name=Somebody+Here&location=Northampton&language=Python
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> data = urllib.request.urlopen(full_url)

Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.

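As a quick check of the URL construction just described, the following sketch builds a GET URL and then takes it apart again with ``urllib.parse``. The parameter values are illustrative only, and nothing is sent over the network.

```python
import urllib.parse
import urllib.request

params = {'name': 'Somebody Here', 'language': 'Python'}
full_url = 'http://www.example.com/example.cgi?' + urllib.parse.urlencode(params)

req = urllib.request.Request(full_url)     # no data argument, so this is a GET

# Split the URL again and decode the query string back into a dict of lists.
parts = urllib.parse.urlsplit(full_url)
decoded = urllib.parse.parse_qs(parts.query)
print(req.get_method(), decoded)
```
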
Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to add headers
to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send different versions
to different browsers [#]_. By default urllib identifies itself as
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
not work. The way a browser identifies itself is through the
``User-Agent`` header [#]_. When you create a Request object you can
pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }
    headers = { 'User-Agent' : user_agent }

    data = urllib.parse.urlencode(values)
    data = data.encode('ascii')  # data must be bytes
    req = urllib.request.Request(url, data, headers)
    response = urllib.request.urlopen(req)
    the_page = response.read()

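If you want to see which headers a Request will carry before sending it, the object can be inspected directly. This is a minimal sketch: the ``Accept-Language`` header is an extra added only for illustration, and note that Request stores header names capitalised with ``str.capitalize()``, so ``User-Agent`` is looked up as ``User-agent``.

```python
import urllib.request

req = urllib.request.Request(
    'http://www.someserver.com/cgi-bin/register.cgi',
    headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})

# Headers can also be added after construction.
req.add_header('Accept-Language', 'en')

# Request normalises names with str.capitalize(), hence 'User-agent'.
print(req.header_items())
print(req.get_header('User-agent'))
```
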
The response also has two useful methods. See the section on `info and geturl`_
which comes after we have a look at what happens when things go wrong.


Handling Exceptions
===================

urlopen raises ``URLError`` when it cannot handle a response (though as usual
with Python APIs, built-in exceptions such as ValueError, TypeError etc. may also
be raised).

``HTTPError`` is the subclass of ``URLError`` raised in the specific case of
HTTP URLs.

The exception classes are exported from the :mod:`urllib.error` module.

URLError
--------

Often, URLError is raised because there is no network connection (no route to
the specified server), or the specified server doesn't exist. In this case, the
exception raised will have a 'reason' attribute describing the failure; in the
example below it is a tuple containing an error code and a text error message.

e.g. ::

    >>> import urllib.error
    >>> import urllib.request
    >>> req = urllib.request.Request('http://www.pretend_server.org')
    >>> try:
    ...     urllib.request.urlopen(req)
    ... except urllib.error.URLError as e:
    ...     print(e.reason)
    ...
    (4, 'getaddrinfo failed')


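The example above needs a failing DNS lookup to reproduce. As an offline sketch of the same pattern, a URL with a scheme that no handler understands also makes urlopen raise ``URLError``. The scheme and hostname below are made up for the purpose, and in this case the 'reason' attribute is a plain message string.

```python
import urllib.error
import urllib.request

# 'nosuchscheme' is a made-up URL scheme, chosen so the failure is
# reproducible without a network connection.
try:
    urllib.request.urlopen('nosuchscheme://www.pretend-server.example/')
except urllib.error.URLError as e:
    reason = e.reason
    print('We failed to reach a server:', reason)
```
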
HTTPError
---------

Every HTTP response from the server contains a numeric "status code". Sometimes
the status code indicates that the server is unable to fulfil the request. The
default handlers will handle some of these responses for you (for example, if
the response is a "redirection" that requests the client fetch the document from
a different URL, urllib will handle that for you). For those it can't handle,
urlopen will raise an ``HTTPError``. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication required).

See section 10 of :rfc:`2616` for a reference on all the HTTP error codes.

The ``HTTPError`` instance raised will have an integer 'code' attribute, which
corresponds to the error sent by the server.

Error Codes
~~~~~~~~~~~

Because the default handlers handle redirects (codes in the 300 range), and
codes in the 100-299 range indicate success, you will usually only see error
codes in the 400-599 range.

:attr:`http.server.BaseHTTPRequestHandler.responses` is a useful dictionary of
response codes that shows all the response codes used by RFC 2616. The
dictionary is reproduced here for convenience ::

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
        }

When an error is raised the server responds by returning an HTTP error code
*and* an error page. You can use the ``HTTPError`` instance as a response for the
page returned. This means that as well as the code attribute, it also has ``read``,
``geturl``, and ``info`` methods, as provided by the ``urllib.response`` module::

    >>> req = urllib.request.Request('http://www.python.org/fish.html')
    >>> try:
    ...     urllib.request.urlopen(req)
    ... except urllib.error.HTTPError as e:
    ...     print(e.code)
    ...     print(e.read())
    ...
    404
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
    <?xml-stylesheet href="./css/ht2html.css"
        type="text/css"?>
    <html><head><title>Error 404: File Not Found</title>
    ...... etc...

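To see the response-like behaviour of ``HTTPError`` without depending on a live server, the sketch below constructs one by hand and then treats it as it would be treated after a failed urlopen call. The headers and body are invented for the example; in real use urlopen builds the exception for you.

```python
import io
import email.message
import urllib.error

# Build an HTTPError by hand, standing in for what urlopen would raise on a
# 404. The URL, headers and body here are invented for the example.
headers = email.message.Message()
headers['Content-Type'] = 'text/html'
body = io.BytesIO(b'<html><head><title>Error 404</title></head></html>')

try:
    raise urllib.error.HTTPError(
        'http://www.python.org/fish.html', 404, 'Not Found', headers, body)
except urllib.error.URLError as e:       # HTTPError is a URLError subclass
    code = e.code
    page = e.read()                      # the error page, as bytes
    content_type = e.info()['Content-Type']
    print(code, e.geturl(), content_type)
```
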
Wrapping it Up
--------------

So if you want to be prepared for ``HTTPError`` *or* ``URLError`` there are two
basic approaches. I prefer the second approach.

Number 1
~~~~~~~~

::


    from urllib.request import Request, urlopen
    from urllib.error import URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError as e:
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    except URLError as e:
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
    else:
        # everything is fine
        pass


.. note::

    The ``except HTTPError`` *must* come first, otherwise ``except URLError``
    will *also* catch an ``HTTPError``.

Number 2
~~~~~~~~

::

    from urllib.request import Request, urlopen
    from urllib.error import URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError as e:
        # Check for 'code' first: an HTTPError may carry a 'reason' as well.
        if hasattr(e, 'code'):
            print('The server couldn\'t fulfill the request.')
            print('Error code: ', e.code)
        elif hasattr(e, 'reason'):
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
    else:
        # everything is fine
        pass


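Either approach can be wrapped in a small function so that callers receive a result instead of an exception. The following is one possible sketch: ``fetch`` is a hypothetical helper, not part of urllib, and the two URLs are offline stand-ins (an inline ``data:`` URL and a made-up scheme) chosen so that both the success path and the ``URLError`` path can be exercised without a network.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch(url):
    """Return a (status, detail) pair instead of raising; a hypothetical helper."""
    try:
        with urlopen(url) as response:
            return 'ok', response.read()
    except HTTPError as e:               # must precede URLError
        return 'http-error', e.code
    except URLError as e:
        return 'url-error', e.reason

print(fetch('data:,hello'))
print(fetch('nosuchscheme://example.invalid/'))
```
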
info and geturl
===============

The response returned by urlopen (or the ``HTTPError`` instance) has two useful
methods, ``info`` and ``geturl``, and is defined in the module
:mod:`urllib.response`.

**geturl** - this returns the real URL of the page fetched. This is useful
because ``urlopen`` (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL requested.

**info** - this returns a dictionary-like object that describes the page
fetched, particularly the headers sent by the server. It is currently an
:class:`http.client.HTTPMessage` instance.

Typical headers include 'Content-length', 'Content-type', and so on. See the
`Quick Reference to HTTP Headers <http://www.cs.tut.fi/~jkorpela/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.


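Here is a brief sketch of both methods in use. The ``data:`` URL is again an assumption made so the example runs offline. For that scheme the headers object is a plain ``email`` message, not an ``http.client.HTTPMessage``, but it is queried in the same dictionary-like, case-insensitive way.

```python
import urllib.request

url = 'data:text/plain;charset=utf-8,hello%20world'   # offline stand-in URL

with urllib.request.urlopen(url) as response:
    final_url = response.geturl()      # URL actually fetched, after any redirects
    headers = response.info()          # dictionary-like, case-insensitive keys
    content_type = headers['Content-type']
    length = headers['content-length']

print(final_url, content_type, length)
```
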
Openers and Handlers
====================

When you fetch a URL you use an opener (an instance of the perhaps
confusingly-named :class:`urllib.request.OpenerDirector`). Normally we have been using
the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL scheme (http,
ftp, etc.), or how to handle an aspect of URL opening, for example HTTP
redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers
installed, for example to get an opener that handles cookies, or to get an
opener that does not handle redirections.

To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.

Alternatively, you can use ``build_opener``, which is a convenience function for
creating opener objects with a single function call. ``build_opener`` adds
several handlers by default, but provides a quick way to add more and/or
override the default handlers.

Other sorts of handlers you might want can handle proxies, authentication,
and other common but slightly specialised situations.

``install_opener`` can be used to make an ``opener`` object the (global) default
opener. This means that calls to ``urlopen`` will use the opener you have
installed.

Opener objects have an ``open`` method, which can be called directly to fetch
URLs in the same way as the ``urlopen`` function: there's no need to call
``install_opener``, except as a convenience.


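The two ways of creating an opener described above can be sketched side by side. The choice of ``DataHandler`` and ``HTTPCookieProcessor`` is only for illustration (the first lets the example run offline, the second is the cookie handler mentioned earlier), and ``handlers`` is the list an ``OpenerDirector`` keeps of what has been added to it.

```python
import urllib.request

# Route 1: start from an empty OpenerDirector and add handlers one by one.
bare = urllib.request.OpenerDirector()
bare.add_handler(urllib.request.DataHandler())   # knows only the data: scheme
payload = bare.open('data:,hi').read()

# Route 2: build_opener() installs the default handlers and appends ours.
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())
handler_names = [type(h).__name__ for h in opener.handlers]
print(payload, handler_names)
```
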
Basic Authentication
====================

To illustrate creating and installing a handler we will use the
``HTTPBasicAuthHandler``. For a more detailed discussion of this subject -
including an explanation of how Basic Authentication works - see the `Basic
Authentication Tutorial
<http://www.voidspace.org.uk/python/articles/authentication.shtml>`_.

When authentication is required, the server sends a header (as well as the 401
error code) requesting authentication. This specifies the authentication scheme
and a 'realm'. The header looks like: ``Www-authenticate: SCHEME
realm="REALM"``.

e.g. ::

    Www-authenticate: Basic realm="cPanel Users"


The client should then retry the request with the appropriate name and password
for the realm included as a header in the request. This is 'basic
authentication'. In order to simplify this process we can create an instance of
``HTTPBasicAuthHandler`` and an opener to use this handler.

The ``HTTPBasicAuthHandler`` uses an object called a password manager to handle
the mapping of URLs and realms to passwords and usernames. If you know what the
realm is (from the authentication header sent by the server), then you can use a
``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In that
case, it is convenient to use ``HTTPPasswordMgrWithDefaultRealm``. This allows
you to specify a default username and password for a URL. This will be supplied
unless you provide an alternative combination for a specific
realm. We indicate this by providing ``None`` as the realm argument to the
``add_password`` method.

The top-level URL is the first URL that requires authentication. URLs "deeper"
than the URL you pass to ``.add_password()`` will also match. ::

    # create a password manager
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password.
    # If we knew the realm, we could use it instead of None.
    top_level_url = "http://example.com/foo/"
    password_mgr.add_password(None, top_level_url, username, password)

    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

    # create "opener" (OpenerDirector instance)
    opener = urllib.request.build_opener(handler)

    # use the opener to fetch a URL
    opener.open(a_url)

    # Install the opener.
    # Now all calls to urllib.request.urlopen use our opener.
    urllib.request.install_opener(opener)

.. note::

    In the above example we only supplied our ``HTTPBasicAuthHandler`` to
    ``build_opener``. By default openers have the handlers for normal situations
    -- ``ProxyHandler``, ``UnknownHandler``, ``HTTPHandler``,
    ``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``,
    ``FileHandler``, ``HTTPErrorProcessor``.

``top_level_url`` is in fact *either* a full URL (including the 'http:' scheme
component and the hostname and optionally the port number)
e.g. "http://example.com/" *or* an "authority" (i.e. the hostname,
optionally including the port number) e.g. "example.com" or "example.com:8080"
(the latter example includes a port number). The authority, if present, must
NOT contain the "userinfo" component - for example "joe:password@example.com" is
not correct.


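The matching rule for "deeper" URLs can be observed directly on the password manager, with no server involved. In this sketch the credentials ``'joe'`` and ``'secret'`` and the realm name are placeholders; ``find_user_password`` is the lookup method the auth handler itself calls.

```python
import urllib.request

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# 'joe' and 'secret' are placeholder credentials for the example.
password_mgr.add_password(None, 'http://example.com/foo/', 'joe', 'secret')

# A URL below the registered one matches, whatever realm the server names.
deeper = password_mgr.find_user_password('Any Realm',
                                         'http://example.com/foo/bar.html')
# A URL outside that subtree does not.
elsewhere = password_mgr.find_user_password('Any Realm',
                                            'http://example.com/other/')
print(deeper, elsewhere)
```
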
Proxies
=======

urllib will auto-detect your proxy settings and use those. This is through
the ``ProxyHandler`` which is part of the normal handler chain. Normally that's
a good thing, but there are occasions when it may not be helpful [#]_. One way
to prevent this is to set up our own ``ProxyHandler``, with no proxies defined. This
is done using similar steps to setting up a `Basic Authentication`_ handler::

    >>> proxy_support = urllib.request.ProxyHandler({})
    >>> opener = urllib.request.build_opener(proxy_support)
    >>> urllib.request.install_opener(opener)

.. note::

    Currently ``urllib.request`` *does not* support fetching of ``https`` locations
    through a proxy. However, this can be enabled by extending urllib.request as
    shown in the recipe [#]_.


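If changing the global default is more than you need, the proxy-free opener can simply be used directly. A minimal sketch follows; the ``data:`` URL is only a stand-in so the example runs offline, and in practice you would pass the localhost or intranet URL you want fetched without the proxy.

```python
import urllib.request

# An empty mapping means "no proxies", overriding auto-detected settings.
proxy_support = urllib.request.ProxyHandler({})
opener = urllib.request.build_opener(proxy_support)

# Use the opener directly; install_opener() would make it the global default.
payload = opener.open('data:,no%20proxy%20involved').read()
print(proxy_support.proxies, payload)
```
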
Sockets and Layers
==================

The Python support for fetching resources from the web is layered. urllib uses
the :mod:`http.client` library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response
before timing out. This can be useful in applications which have to fetch web
pages. By default the socket module has *no timeout* and can hang.
You can set the default timeout globally for all sockets using ::

    import socket
    import urllib.request

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib.request.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib.request.Request('http://www.voidspace.org.uk')
    response = urllib.request.urlopen(req)


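Because the default timeout is process-wide, it can be worth restoring the previous value once you are done. The sketch below does that, and also shows the ``timeout`` argument that ``urlopen`` accepts in Python 3 for a per-call limit. The ``data:`` URL is an offline stand-in, so the timeout has no practical effect here.

```python
import socket
import urllib.request

previous = socket.getdefaulttimeout()     # None means "no timeout"
socket.setdefaulttimeout(10)              # seconds, applies to new sockets
try:
    current = socket.getdefaulttimeout()
    # A per-call timeout; the data: URL is an offline stand-in for a real one.
    with urllib.request.urlopen('data:,ok', timeout=5) as response:
        body = response.read()
finally:
    socket.setdefaulttimeout(previous)    # restore the earlier global setting
print(current, body)
```
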
-------


Footnotes
=========

This document was reviewed and revised by John Lee.

.. [#] For an introduction to the CGI protocol see
       `Writing Web Applications in Python <http://www.pyzine.com/Issue008/Section_Articles/article_CGIOne.html>`_.
.. [#] Like Google for example. The *proper* way to use Google from a program
       is to use `PyGoogle <http://pygoogle.sourceforge.net>`_ of course. See
       `Voidspace Google <http://www.voidspace.org.uk/python/recipebook.shtml#google>`_
       for some examples of using the Google API.
.. [#] Browser sniffing is a very bad practice for website design - building
       sites using web standards is much more sensible. Unfortunately a lot of
       sites still send different versions to different browsers.
.. [#] For details of more HTTP request headers, see
       `Quick Reference to HTTP Headers`_.
.. [#] The user agent for MSIE 6 is
       *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] In my case I have to use a proxy to access the internet at work. If you
       attempt to fetch *localhost* URLs through this proxy it blocks them. IE
       is set to use the proxy, which urllib picks up on. In order to test
       scripts with a localhost server, I have to prevent urllib from using
       the proxy.
.. [#] urllib opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
       <http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/456195>`_.
