Blame - Doc/library/urllib.rst - platform/external/python/cpython2

Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1	:mod:`urllib` --- Open arbitrary resources by URL
				2	=================================================
				3
				4	.. module:: urllib
				5	:synopsis: Open an arbitrary network resource by URL (requires sockets).
				6
				7
				8	.. index::
				9	single: WWW
				10	single: World Wide Web
				11	single: URL
				12
				13	This module provides a high-level interface for fetching data across the World
				14	Wide Web. In particular, the :func:`urlopen` function is similar to the
				15	built-in function :func:`open`, but accepts Uniform Resource Locators (URLs)
				16	instead of filenames. Some restrictions apply --- it can only open URLs for
				17	reading, and no seek operations are available.
				18
Christian Heimes	790c823	2008-01-07 21:14:23 +0000	[diff] [blame]	19	High-level interface
				20	--------------------
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	21
				22	.. function:: urlopen(url[, data[, proxies]])
				23
				24	Open a network object denoted by a URL for reading. If the URL does not have a
				25	scheme identifier, or if it has :file:`file:` as its scheme identifier, this
				26	opens a local file (without universal newlines); otherwise it opens a socket to
				27	a server somewhere on the network. If the connection cannot be made, the
				28	:exc:`IOError` exception is raised. If all went well, a file-like object is
				29	returned. This supports the following methods: :meth:`read`, :meth:`readline`,
Christian Heimes	9bd667a	2008-01-20 15:14:11 +0000	[diff] [blame^]	30	:meth:`readlines`, :meth:`fileno`, :meth:`close`, :meth:`info`, :meth:`getcode` and
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	31	:meth:`geturl`. It also has proper support for the :term:`iterator` protocol. One
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	32	caveat: the :meth:`read` method, if the size argument is omitted or negative,
				33	may not read until the end of the data stream; there is no good way to determine
				34	that the entire stream from a socket has been read in the general case.
				35
Christian Heimes	9bd667a	2008-01-20 15:14:11 +0000	[diff] [blame^]	36	Except for the :meth:`info`, :meth:`getcode` and :meth:`geturl` methods,
				37	these methods have the same interface as for file objects --- see section
				38	:ref:`bltin-file-objects` in this manual. (It is not a built-in file object,
				39	however, so it can't be used at those few places where a true built-in file
				40	object is required.)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	41
				42	.. index:: module: mimetools
				43
				44	The :meth:`info` method returns an instance of the class
				45	:class:`mimetools.Message` containing meta-information associated with the
				46	URL. When the method is HTTP, these headers are those returned by the server
				47	at the head of the retrieved HTML page (including Content-Length and
				48	Content-Type). When the method is FTP, a Content-Length header will be
				49	present if (as is now usual) the server passed back a file length in response
				50	to the FTP retrieval request. A Content-Type header will be present if the
				51	MIME type can be guessed. When the method is local-file, returned headers
				52	will include a Date representing the file's last-modified time, a
				53	Content-Length giving file size, and a Content-Type containing a guess at the
				54	file's type. See also the description of the :mod:`mimetools` module.
				55
				56	The :meth:`geturl` method returns the real URL of the page. In some cases, the
				57	HTTP server redirects a client to another URL. The :func:`urlopen` function
				58	handles this transparently, but in some cases the caller needs to know which URL
				59	the client was redirected to. The :meth:`geturl` method can be used to get at
				60	this redirected URL.
				61
Christian Heimes	9bd667a	2008-01-20 15:14:11 +0000	[diff] [blame^]	62	The :meth:`getcode` method returns the HTTP status code that was sent with the
				63	response, or ``None`` if the URL is not an HTTP URL.
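
The methods described above can be tried without a network connection by
opening a temporary local file through a ``file:`` URL. This is a minimal
sketch, not a definitive recipe: the helper name ``describe_resource`` is
hypothetical, and the fallback import from ``urllib.request`` is an assumption
added so the snippet also runs on Python 3.

```python
import os
import tempfile

try:  # Python 2 layout, as documented here
    from urllib import urlopen, pathname2url
except ImportError:  # assumption: Python 3 keeps these in urllib.request
    from urllib.request import urlopen, pathname2url


def describe_resource(url):
    """Open url and return (body, final_url, status, content_length)."""
    f = urlopen(url)
    try:
        body = f.read()
        headers = f.info()
        return body, f.geturl(), f.getcode(), headers.get('Content-Length')
    finally:
        f.close()


# Create a small local file so the example needs no network access.
fd, path = tempfile.mkstemp(suffix='.txt')
try:
    os.write(fd, b'hello, urllib\n')
    os.close(fd)
    # Build the file: URL from the quoted local path.
    body, final_url, status, length = describe_resource(
        'file:' + pathname2url(path))
finally:
    os.remove(path)
```

For a local file, ``status`` is ``None`` because no HTTP exchange took place,
while ``length`` holds the Content-Length header as a string.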
				64
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	65	If the url uses the :file:`http:` scheme identifier, the optional data
				66	argument may be given to specify a ``POST`` request (normally the request type
				67	is ``GET``). The data argument must be in standard
				68	:mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
				69	function below.
				70
				71	The :func:`urlopen` function works transparently with proxies which do not
				72	require authentication. In a Unix or Windows environment, set the
				73	:envvar:`http_proxy` or :envvar:`ftp_proxy` environment variables to a URL that
				74	identifies the proxy server before starting the Python interpreter. For example
				75	(the ``'%'`` is the command prompt)::
				76
				77	% http_proxy="http://www.someproxy.com:3128"
				78	% export http_proxy
				79	% python
				80	...
				81
Christian Heimes	9bd667a	2008-01-20 15:14:11 +0000	[diff] [blame^]	82	The :envvar:`no_proxy` environment variable can be used to specify hosts which
				83	shouldn't be reached via proxy; if set, it should be a comma-separated list
				84	of hostname suffixes, optionally with ``:port`` appended, for example
				85	``cern.ch,ncsa.uiuc.edu,some.host:8080``.
				86
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	87	In a Windows environment, if no proxy environment variables are set, proxy
				88	settings are obtained from the registry's Internet Settings section.
				89
				90	.. index:: single: Internet Config
				91
				92	In a Macintosh environment, :func:`urlopen` will retrieve proxy information from
				93	Internet Config.
				94
				95	Alternatively, the optional proxies argument may be used to explicitly specify
				96	proxies. It must be a dictionary mapping scheme names to proxy URLs, where an
				97	empty dictionary causes no proxies to be used, and ``None`` (the default value)
				98	causes environmental proxy settings to be used as discussed above. For
				99	example::
				100
				101	# Use http://www.someproxy.com:3128 for http proxying
				102	proxies = {'http': 'http://www.someproxy.com:3128'}
				103	filehandle = urllib.urlopen(some_url, proxies=proxies)
				104	# Don't use any proxies
				105	filehandle = urllib.urlopen(some_url, proxies={})
				106	# Use proxies from environment - both versions are equivalent
				107	filehandle = urllib.urlopen(some_url, proxies=None)
				108	filehandle = urllib.urlopen(some_url)
				109
				110	The :func:`urlopen` function accepts explicit proxies only through the proxies
				111	argument described above. If you need an opener object that carries its own
				112	proxy settings, use :class:`URLopener`, or a subclass such as :class:`FancyURLopener`.
				113
				114	Proxies which require authentication for use are not currently supported; this
				115	is considered an implementation limitation.
				116
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	117
				118	.. function:: urlretrieve(url[, filename[, reporthook[, data]]])
				119
				120	Copy a network object denoted by a URL to a local file, if necessary. If the URL
				121	points to a local file, or a valid cached copy of the object exists, the object
				122	is not copied. Return a tuple ``(filename, headers)`` where filename is the
				123	local file name under which the object can be found, and headers is whatever
				124	the :meth:`info` method of the object returned by :func:`urlopen` returned (for
				125	a remote object, possibly cached). Exceptions are the same as for
				126	:func:`urlopen`.
				127
				128	The second argument, if present, specifies the file location to copy to (if
				129	absent, the location will be a tempfile with a generated name). The third
				130	argument, if present, is a hook function that will be called once on
				131	establishment of the network connection and once after each block read
				132	thereafter. The hook will be passed three arguments: a count of blocks
				133	transferred so far, a block size in bytes, and the total size of the file. The
				134	total size may be ``-1`` on older FTP servers which do not return a file
				135	size in response to a retrieval request.
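
A report hook following the three-argument convention described above might
look like the sketch below. The factory name ``make_progress_hook`` and the
message format are hypothetical. The loop only simulates the calls that
:func:`urlretrieve` would make, so nothing is downloaded.

```python
def make_progress_hook(log):
    """Return a reporthook that appends one progress message to log per call."""
    def hook(block_count, block_size, total_size):
        received = block_count * block_size
        if total_size > 0:
            # The last block is usually partial, so clamp to the total.
            received = min(received, total_size)
            log.append('%d/%d bytes (%d%%)'
                       % (received, total_size, 100 * received // total_size))
        else:
            # Size unknown, e.g. -1 from an FTP server that reports no length.
            log.append('%d bytes' % received)
    return hook


messages = []
hook = make_progress_hook(messages)
# Simulate a 20000-byte download read in 8192-byte blocks:
# one call when the connection is established, then one per block.
for count in range(4):
    hook(count, 8192, 20000)
```

In real use the hook would be passed as the third argument, as in
``urllib.urlretrieve(url, filename, hook)``.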
				136
				137	If the url uses the :file:`http:` scheme identifier, the optional data
				138	argument may be given to specify a ``POST`` request (normally the request type
				139	is ``GET``). The data argument must be in standard
				140	:mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
				141	function below.
				142
Georg Brandl	55ac8f0	2007-09-01 13:51:09 +0000	[diff] [blame]	143	:func:`urlretrieve` will raise :exc:`ContentTooShortError` when it detects that
				144	the amount of data available was less than the expected amount (which is the
				145	size reported by a Content-Length header). This can occur, for example, when
				146	the download is interrupted.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	147
Georg Brandl	55ac8f0	2007-09-01 13:51:09 +0000	[diff] [blame]	148	The Content-Length is treated as a lower bound: if there's more data to read,
				149	urlretrieve reads more data, but if less data is available, it raises the
				150	exception.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	151
Georg Brandl	55ac8f0	2007-09-01 13:51:09 +0000	[diff] [blame]	152	You can still retrieve the downloaded data in this case; it is stored in the
				153	:attr:`content` attribute of the exception instance.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	154
Georg Brandl	55ac8f0	2007-09-01 13:51:09 +0000	[diff] [blame]	155	If no Content-Length header was supplied, urlretrieve cannot check the size
				156	of the data it has downloaded, and just returns it. In this case you just have
				157	to assume that the download was successful.
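
One way to recover from a short read is sketched below. The names
``retrieve_or_partial`` and ``fake_truncated_retrieve`` are hypothetical, the
fallback imports are an assumption for Python 3, and the payload stored in
``content`` here is made up for the simulation. What a real interrupted
download stores there should be checked against the implementation in use.

```python
try:  # Python 2 layout, as documented here
    from urllib import urlretrieve, ContentTooShortError
except ImportError:  # assumption: Python 3 splits these across submodules
    from urllib.request import urlretrieve
    from urllib.error import ContentTooShortError


def retrieve_or_partial(url, retrieve=urlretrieve):
    """Return (result, complete); complete is False after a short read.

    retrieve is injectable so the function can be tried without a network.
    """
    try:
        return retrieve(url), True
    except ContentTooShortError as exc:
        # The truncated download is still available on the exception.
        return exc.content, False


def fake_truncated_retrieve(url):
    # Hypothetical stand-in for urlretrieve that simulates an interrupted
    # download; the content value passed here is invented for the example.
    raise ContentTooShortError('retrieval incomplete', ('partial.tmp', None))


result, complete = retrieve_or_partial('http://www.example.com/big.bin',
                                       retrieve=fake_truncated_retrieve)
```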
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	158
				159
				160	.. data:: _urlopener
				161
				162	The public functions :func:`urlopen` and :func:`urlretrieve` create an instance
				163	of the :class:`FancyURLopener` class and use it to perform their requested
				164	actions. To override this functionality, programmers can create a subclass of
				165	:class:`URLopener` or :class:`FancyURLopener`, then assign an instance of that
				166	class to the ``urllib._urlopener`` variable before calling the desired function.
				167	For example, applications may want to specify a different
				168	:mailheader:`User-Agent` header than :class:`URLopener` defines. This can be
				169	accomplished with the following code::
				170
				171	import urllib
				172
				173	class AppURLopener(urllib.FancyURLopener):
				174	    version = "App/1.7"
				175
				176	urllib._urlopener = AppURLopener()
				177
				178
				179	.. function:: urlcleanup()
				180
				181	Clear the cache that may have been built up by previous calls to
				182	:func:`urlretrieve`.
				183
				184
Christian Heimes	790c823	2008-01-07 21:14:23 +0000	[diff] [blame]	185	Utility functions
				186	-----------------
				187
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	188	.. function:: quote(string[, safe])
				189
				190	Replace special characters in string using the ``%xx`` escape. Letters,
				191	digits, and the characters ``'_.-'`` are never quoted. The optional safe
				192	parameter specifies additional characters that should not be quoted --- its
				193	default value is ``'/'``.
				194
				195	Example: ``quote('/~connolly/')`` yields ``'/%7Econnolly/'``.
				196
				197
				198	.. function:: quote_plus(string[, safe])
				199
				200	Like :func:`quote`, but also replaces spaces by plus signs, as required for
				201	quoting HTML form values. Plus signs in the original string are escaped unless
				202	they are included in safe. Unlike :func:`quote`, safe does not default to ``'/'``.
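
The difference between the two quoting functions, and the effect of the safe
parameter, can be illustrated with a short sketch. The sample strings are
invented, and the fallback import from ``urllib.parse`` is an assumption so
that the snippet also runs on Python 3.

```python
try:  # Python 2 layout, as documented here
    from urllib import quote, quote_plus
except ImportError:  # assumption: Python 3 keeps these in urllib.parse
    from urllib.parse import quote, quote_plus

path = '/docs/annual report.txt'

# With the default safe='/', slashes survive and the space becomes %20.
quoted_path = quote(path)

# With safe='', the slashes are escaped as well.
quoted_all = quote(path, safe='')

# quote_plus turns spaces into '+' and escapes literal plus signs.
form_value = quote_plus('fish & chips+peas')
```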
				203
				204
				205	.. function:: unquote(string)
				206
				207	Replace ``%xx`` escapes by their single-character equivalent.
				208
				209	Example: ``unquote('/%7Econnolly/')`` yields ``'/~connolly/'``.
				210
				211
				212	.. function:: unquote_plus(string)
				213
				214	Like :func:`unquote`, but also replaces plus signs by spaces, as required for
				215	unquoting HTML form values.
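
As a rough illustration of how the two unquoting functions differ, the sketch
below decodes the same kind of input with each. The sample strings are
invented, and the ``urllib.parse`` fallback is an assumption for Python 3.

```python
try:  # Python 2 layout, as documented here
    from urllib import unquote, unquote_plus
except ImportError:  # assumption: Python 3 keeps these in urllib.parse
    from urllib.parse import unquote, unquote_plus

# %xx escapes are decoded by both functions.
decoded_path = unquote('/docs/annual%20report.txt')

# Only unquote_plus treats '+' as an encoded space.
plain = unquote('fish+%26+chips%2Bpeas')
form_value = unquote_plus('fish+%26+chips%2Bpeas')
```

Using :func:`unquote` on form data leaves the plus signs in place, which is
why form values should be decoded with :func:`unquote_plus`.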
				216
				217
				218	.. function:: urlencode(query[, doseq])
				219
				220	Convert a mapping object or a sequence of two-element tuples to a "url-encoded"
				221	string, suitable to pass to :func:`urlopen` above as the optional data
				222	argument. This is useful to pass a dictionary of form fields to a ``POST``
				223	request. The resulting string is a series of ``key=value`` pairs separated by
				224	``'&'`` characters, where both key and value are quoted using
				225	:func:`quote_plus` above. If the optional parameter doseq is present and
				226	evaluates to true, individual ``key=value`` pairs are generated for each element
				227	of the sequence. When a sequence of two-element tuples is used as the query
				228	argument, the first element of each tuple is a key and the second is a value.
				229	The order of parameters in the encoded string will match the order of parameter
				230	tuples in the sequence. The :mod:`cgi` module provides the functions
				231	:func:`parse_qs` and :func:`parse_qsl` which are used to parse query strings
				232	into Python data structures.
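
The sketch below shows both input forms and the effect of doseq. The field
names and values are invented, and the ``urllib.parse`` fallback is an
assumption for Python 3. A sequence of tuples is used so that the parameter
order in the result is predictable.

```python
try:  # Python 2 layout, as documented here
    from urllib import urlencode
except ImportError:  # assumption: Python 3 keeps this in urllib.parse
    from urllib.parse import urlencode

# Keys and values are quoted with quote_plus and joined with '&'.
pairs = [('spam', 1), ('eggs', 'two words'), ('bacon', 'a&b')]
query = urlencode(pairs)

# With doseq true, each element of a sequence value gets its own pair.
multi = urlencode([('tag', ['red', 'green']), ('page', 2)], doseq=True)
```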
				233
				234
				235	.. function:: pathname2url(path)
				236
				237	Convert the pathname path from the local syntax for a path to the form used in
				238	the path component of a URL. This does not produce a complete URL. The return
				239	value will already be quoted using the :func:`quote` function.
				240
				241
				242	.. function:: url2pathname(path)
				243
				244	Convert the path component path from an encoded URL to the local syntax for a
				245	path. This does not accept a complete URL. This function uses :func:`unquote`
				246	to decode path.
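
A round trip through :func:`pathname2url` and :func:`url2pathname` might look
like the sketch below. The relative path is invented, and the fallback import
from ``urllib.request`` is an assumption for Python 3. A relative path is
used because the handling of absolute paths and drive letters depends on the
platform.

```python
import os

try:  # Python 2 layout, as documented here
    from urllib import pathname2url, url2pathname
except ImportError:  # assumption: Python 3 keeps these in urllib.request
    from urllib.request import pathname2url, url2pathname

# Build the path with the local separator.
local_path = os.path.join('reports', 'annual report.txt')

# The result uses '/' separators and is already quoted.
url_path = pathname2url(local_path)

# Converting back restores the local syntax and undoes the quoting.
round_trip = url2pathname(url_path)
```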
				247
				248
Christian Heimes	790c823	2008-01-07 21:14:23 +0000	[diff] [blame]	249	URL Opener objects
				250	------------------
				251
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	252	.. class:: URLopener([proxies[, **x509]])
				253
				254	Base class for opening and reading URLs. Unless you need to support opening
				255	objects using schemes other than :file:`http:`, :file:`ftp:`, or :file:`file:`,
				256	you probably want to use :class:`FancyURLopener`.
				257
				258	By default, the :class:`URLopener` class sends a :mailheader:`User-Agent` header
				259	of ``urllib/VVV``, where VVV is the :mod:`urllib` version number.
				260	Applications can define their own :mailheader:`User-Agent` header by subclassing
				261	:class:`URLopener` or :class:`FancyURLopener` and setting the class attribute
				262	:attr:`version` to an appropriate string value in the subclass definition.
				263
				264	The optional proxies parameter should be a dictionary mapping scheme names to
				265	proxy URLs, where an empty dictionary turns proxies off completely. Its default
				266	value is ``None``, in which case environmental proxy settings will be used if
				267	present, as discussed in the definition of :func:`urlopen`, above.
				268
				269	Additional keyword parameters, collected in x509, may be used for
				270	authentication of the client when using the :file:`https:` scheme. The keywords
				271	key_file and cert_file are supported to provide an SSL key and certificate;
				272	both are needed to support client authentication.
				273
				274	:class:`URLopener` objects will raise an :exc:`IOError` exception if the server
				275	returns an error code.
				276
Christian Heimes	790c823	2008-01-07 21:14:23 +0000	[diff] [blame]	277	.. method:: open(fullurl[, data])
				278
				279	Open fullurl using the appropriate protocol. This method sets up cache and
				280	proxy information, then calls the appropriate open method with its input
				281	arguments. If the scheme is not recognized, :meth:`open_unknown` is called.
				282	The data argument has the same meaning as the data argument of
				283	:func:`urlopen`.
				284
				285
				286	.. method:: open_unknown(fullurl[, data])
				287
				288	Overridable interface to open unknown URL types.
				289
				290
				291	.. method:: retrieve(url[, filename[, reporthook[, data]]])
				292
				293	Retrieves the contents of url and places it in filename. The return value
				294	is a tuple consisting of a local filename and either a
				295	:class:`mimetools.Message` object containing the response headers (for remote
				296	URLs) or ``None`` (for local URLs). The caller must then open and read the
				297	contents of filename. If filename is not given and the URL refers to a
				298	local file, the input filename is returned. If the URL is non-local and
				299	filename is not given, the filename is the output of :func:`tempfile.mktemp`
				300	with a suffix that matches the suffix of the last path component of the input
				301	URL. If reporthook is given, it must be a function accepting three numeric
				302	parameters. It will be called after each chunk of data is read from the
				303	network. reporthook is ignored for local URLs.
				304
				305	If the url uses the :file:`http:` scheme identifier, the optional data
				306	argument may be given to specify a ``POST`` request (normally the request type
				307	is ``GET``). The data argument must be in standard
				308	:mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
				309	function above.
				310
				311
				312	.. attribute:: version
				313
				314	Variable that specifies the user agent of the opener object. To get
				315	:mod:`urllib` to tell servers that it is a particular user agent, set this in a
				316	subclass as a class variable or in the constructor before calling the base
				317	constructor.
				318
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	319
				320	.. class:: FancyURLopener(...)
				321
				322	:class:`FancyURLopener` subclasses :class:`URLopener` providing default handling
				323	for the following HTTP response codes: 301, 302, 303, 307 and 401. For the 30x
				324	response codes listed above, the :mailheader:`Location` header is used to fetch
				325	the actual URL. For 401 response codes (authentication required), basic HTTP
				326	authentication is performed. For the 30x response codes, recursion is bounded
				327	by the value of the maxtries attribute, which defaults to 10.
				328
				329	For all other response codes, the method :meth:`http_error_default` is called
				330	which you can override in subclasses to handle the error appropriately.
				331
				332	.. note::
				333
				334	According to the letter of :rfc:`2616`, 301 and 302 responses to POST requests
				335	must not be automatically redirected without confirmation by the user. In
				336	reality, browsers do allow automatic redirection of these responses, changing
				337	the POST to a GET, and :mod:`urllib` reproduces this behaviour.
				338
				339	The parameters to the constructor are the same as those for :class:`URLopener`.
				340
				341	.. note::
				342
				343	When performing basic authentication, a :class:`FancyURLopener` instance calls
				344	its :meth:`prompt_user_passwd` method. The default implementation asks the
				345	user for the required information on the controlling terminal. A subclass may
				346	override this method to support more appropriate behavior if needed.
				347
Christian Heimes	790c823	2008-01-07 21:14:23 +0000	[diff] [blame]	348	The :class:`FancyURLopener` class offers one additional method that should be
				349	overridden to provide the appropriate behavior:
				350
				351	.. method:: prompt_user_passwd(host, realm)
				352
				353	Return information needed to authenticate the user at the given host in the
				354	specified security realm. The return value should be a tuple, ``(user,
				355	password)``, which can be used for basic authentication.
				356
				357	The implementation prompts for this information on the terminal; an application
				358	should override this method to use an appropriate interaction model in the local
				359	environment.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	360
				361	.. exception:: ContentTooShortError(msg[, content])
				362
				363	This exception is raised when the :func:`urlretrieve` function detects that the
				364	amount of the downloaded data is less than the expected amount (given by the
				365	Content-Length header). The :attr:`content` attribute stores the downloaded
				366	(and supposedly truncated) data.
				367
Christian Heimes	790c823	2008-01-07 21:14:23 +0000	[diff] [blame]	368
				369	:mod:`urllib` Restrictions
				370	--------------------------
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	371
				372	.. index::
				373	pair: HTTP; protocol
				374	pair: FTP; protocol
				375
				376	* Currently, only the following protocols are supported: HTTP (versions 0.9 and
				377	1.0), FTP, and local files.
				378
				379	* The caching feature of :func:`urlretrieve` has been disabled until proper
				380	processing of Expiration time headers is implemented.
				381
				382	* There should be a function to query whether a particular URL is in the cache.
				383
				384	* For backward compatibility, if a URL appears to point to a local file but the
				385	file can't be opened, the URL is re-interpreted using the FTP protocol. This
				386	can sometimes cause confusing error messages.
				387
				388	* The :func:`urlopen` and :func:`urlretrieve` functions can cause arbitrarily
				389	long delays while waiting for a network connection to be set up. This means
				390	that it is difficult to build an interactive Web client using these functions
				391	without using threads.
				392
				393	.. index::
				394	single: HTML
				395	pair: HTTP; protocol
				396	module: htmllib
				397
				398	* The data returned by :func:`urlopen` or :func:`urlretrieve` is the raw data
				399	returned by the server. This may be binary data (such as an image), plain text
				400	or (for example) HTML. The HTTP protocol provides type information in the reply
				401	header, which can be inspected by looking at the :mailheader:`Content-Type`
				402	header. If the returned data is HTML, you can use the module :mod:`htmllib` to
				403	parse it.
				404
				405	.. index:: single: FTP
				406
				407	* The code handling the FTP protocol cannot differentiate between a file and a
				408	directory. This can lead to unexpected behavior when attempting to read a URL
				409	that points to a file that is not accessible. If the URL ends in a ``/``, it is
				410	assumed to refer to a directory and will be handled accordingly. But if an
				411	attempt to read a file leads to a 550 error (meaning the URL cannot be found or
				412	is not accessible, often for permission reasons), then the path is treated as a
				413	directory in order to handle the case when a directory is specified by a URL but
				414	the trailing ``/`` has been left off. This can cause misleading results when
				415	you try to fetch a file whose read permissions make it inaccessible; the FTP
				416	code will try to read it, fail with a 550 error, and then perform a directory
				417	listing for the unreadable file. If fine-grained control is needed, consider
				418	using the :mod:`ftplib` module, subclassing :class:`FancyURLopener`, or changing
				419	_urlopener to meet your needs.
				420
				421	* This module does not support the use of proxies which require authentication.
				422	This may be implemented in the future.
				423
				424	.. index:: module: urlparse
				425
				426	* Although the :mod:`urllib` module contains (undocumented) routines to parse
				427	and unparse URL strings, the recommended interface for URL manipulation is in
				428	module :mod:`urlparse`.
				429
				430
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	431	.. _urllib-examples:
				432
				433	Examples
				434	--------
				435
				436	Here is an example session that uses the ``GET`` method to retrieve a URL
				437	containing parameters::
				438
				439	>>> import urllib
				440	>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
				441	>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
Collin Winter	c79461b	2007-09-01 23:34:30 +0000	[diff] [blame]	442	>>> print(f.read())
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	443
				444	The following example uses the ``POST`` method instead::
				445
				446	>>> import urllib
				447	>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
				448	>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
Collin Winter	c79461b	2007-09-01 23:34:30 +0000	[diff] [blame]	449	>>> print(f.read())
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	450
				451	The following example uses an explicitly specified HTTP proxy, overriding
				452	environment settings::
				453
				454	>>> import urllib
				455	>>> proxies = {'http': 'http://proxy.example.com:8080/'}
				456	>>> opener = urllib.FancyURLopener(proxies)
				457	>>> f = opener.open("http://www.python.org")
				458	>>> f.read()
				459
				460	The following example uses no proxies at all, overriding environment settings::
				461
				462	>>> import urllib
				463	>>> opener = urllib.FancyURLopener({})
				464	>>> f = opener.open("http://www.python.org/")
				465	>>> f.read()
				466