Blame - Doc/library/urllib.rst - platform/external/python/cpython2

Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1	:mod:`urllib` --- Open arbitrary resources by URL
				2	=================================================
				3
				4	.. module:: urllib
				5	:synopsis: Open an arbitrary network resource by URL (requires sockets).
				6
				7
				8	.. index::
				9	single: WWW
				10	single: World Wide Web
				11	single: URL
				12
				13	This module provides a high-level interface for fetching data across the World
				14	Wide Web. In particular, the :func:`urlopen` function is similar to the
				15	built-in function :func:`open`, but accepts Uniform Resource Locators (URLs)
				16	instead of filenames. Some restrictions apply --- it can only open URLs for
				17	reading, and no seek operations are available.
				18
Georg Brandl	6264765	2008-01-07 18:23:27 +0000	[diff] [blame]	19	High-level interface
				20	--------------------
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	21
				22	.. function:: urlopen(url[, data[, proxies]])
				23
				24	Open a network object denoted by a URL for reading. If the URL does not have a
				25	scheme identifier, or if it has :file:`file:` as its scheme identifier, this
				26	opens a local file (without universal newlines); otherwise it opens a socket to
				27	a server somewhere on the network. If the connection cannot be made the
				28	:exc:`IOError` exception is raised. If all went well, a file-like object is
				29	returned. This supports the following methods: :meth:`read`, :meth:`readline`,
Georg Brandl	9b0d46d	2008-01-20 11:43:03 +0000	[diff] [blame]	30	:meth:`readlines`, :meth:`fileno`, :meth:`close`, :meth:`info`, :meth:`getcode` and
Georg Brandl	e7a0990	2007-10-21 12:10:28 +0000	[diff] [blame]	31	:meth:`geturl`. It also has proper support for the :term:`iterator` protocol. One
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	32	caveat: the :meth:`read` method, if the size argument is omitted or negative,
				33	may not read until the end of the data stream; there is no good way to determine
				34	that the entire stream from a socket has been read in the general case.
				35
Georg Brandl	9b0d46d	2008-01-20 11:43:03 +0000	[diff] [blame]	36	Except for the :meth:`info`, :meth:`getcode` and :meth:`geturl` methods,
				37	these methods have the same interface as for file objects --- see section
				38	:ref:`bltin-file-objects` in this manual. (It is not a built-in file object,
				39	however, so it can't be used at those few places where a true built-in file
				40	object is required.)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	41
				42	.. index:: module: mimetools
				43
				44	The :meth:`info` method returns an instance of the class
				45	:class:`mimetools.Message` containing meta-information associated with the
				46	URL. When the method is HTTP, these headers are those returned by the server
				47	at the head of the retrieved HTML page (including Content-Length and
				48	Content-Type). When the method is FTP, a Content-Length header will be
				49	present if (as is now usual) the server passed back a file length in response
				50	to the FTP retrieval request. A Content-Type header will be present if the
				51	MIME type can be guessed. When the method is local-file, returned headers
				52	will include a Date representing the file's last-modified time, a
				53	Content-Length giving file size, and a Content-Type containing a guess at the
				54	file's type. See also the description of the :mod:`mimetools` module.
				55
				56	The :meth:`geturl` method returns the real URL of the page. In some cases, the
				57	HTTP server redirects a client to another URL. The :func:`urlopen` function
				58	handles this transparently, but in some cases the caller needs to know which URL
				59	the client was redirected to. The :meth:`geturl` method can be used to get at
				60	this redirected URL.
				61
Georg Brandl	9b0d46d	2008-01-20 11:43:03 +0000	[diff] [blame]	62	The :meth:`getcode` method returns the HTTP status code that was sent with the
				63	response, or ``None`` if the URL is not an HTTP URL.
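
As a rough illustration of the file-like interface described above, the following sketch opens a ``file:`` URL so that no network access is needed. The temporary file and its contents are assumptions made for the example, and the ``try``/``except`` import is only there so that the sketch also runs on Python 3, where these functions live in ``urllib.request``.

```python
import os
import tempfile

try:
    from urllib import urlopen, pathname2url          # Python 2
except ImportError:
    from urllib.request import urlopen, pathname2url  # Python 3 location

# Create a small local file to stand in for a network resource
# (hypothetical content, used only for this example).
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"spam and eggs\n")
tmp.close()

url = "file:" + pathname2url(tmp.name)
f = urlopen(url)
try:
    body = f.read()                      # whole content as a byte string
    length = f.info()["Content-Length"]  # header lookup is case-insensitive
    final_url = f.geturl()               # URL the data was actually read from
finally:
    f.close()
    os.remove(tmp.name)
```

For a local file, ``body`` holds the file contents and ``length`` is the file size as a string; for an HTTP URL the same calls would return whatever the server sent.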
				64
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	65	If the url uses the :file:`http:` scheme identifier, the optional data
				66	argument may be given to specify a ``POST`` request (normally the request type
				67	is ``GET``). The data argument must be in standard
				68	:mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
				69	function below.
				70
				71	The :func:`urlopen` function works transparently with proxies which do not
				72	require authentication. In a Unix or Windows environment, set the
				73	:envvar:`http_proxy` or :envvar:`ftp_proxy` environment variables to a URL that
				74	identifies the proxy server before starting the Python interpreter. For example
				75	(the ``'%'`` is the command prompt)::
				76
				77	% http_proxy="http://www.someproxy.com:3128"
				78	% export http_proxy
				79	% python
				80	...
				81
Georg Brandl	2235011	2008-01-20 12:05:43 +0000	[diff] [blame]	82	The :envvar:`no_proxy` environment variable can be used to specify hosts which
				83	shouldn't be reached via proxy; if set, it should be a comma-separated list
				84	of hostname suffixes, optionally with ``:port`` appended, for example
				85	``cern.ch,ncsa.uiuc.edu,some.host:8080``.
				86
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	87	In a Windows environment, if no proxy environment variables are set, proxy
				88	settings are obtained from the registry's Internet Settings section.
				89
				90	.. index:: single: Internet Config
				91
				92	In a Macintosh environment, :func:`urlopen` will retrieve proxy information from
				93	Internet Config.
				94
				95	Alternatively, the optional proxies argument may be used to explicitly specify
				96	proxies. It must be a dictionary mapping scheme names to proxy URLs, where an
				97	empty dictionary causes no proxies to be used, and ``None`` (the default value)
				98	causes environmental proxy settings to be used as discussed above. For
				99	example::
				100
				101	# Use http://www.someproxy.com:3128 for http proxying
				102	proxies = {'http': 'http://www.someproxy.com:3128'}
				103	filehandle = urllib.urlopen(some_url, proxies=proxies)
				104	# Don't use any proxies
				105	filehandle = urllib.urlopen(some_url, proxies={})
				106	# Use proxies from environment - both versions are equivalent
				107	filehandle = urllib.urlopen(some_url, proxies=None)
				108	filehandle = urllib.urlopen(some_url)
				109
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	110	Proxies which require authentication for use are not currently supported; this
				111	is considered an implementation limitation.
				112
				113	.. versionchanged:: 2.3
				114	Added the proxies support.
				115
Georg Brandl	2235011	2008-01-20 12:05:43 +0000	[diff] [blame]	116	.. versionchanged:: 2.6
				117	Added :meth:`getcode` to returned object and support for the
				118	:envvar:`no_proxy` environment variable.
				119
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	120
				121	.. function:: urlretrieve(url[, filename[, reporthook[, data]]])
				122
				123	Copy a network object denoted by a URL to a local file, if necessary. If the URL
				124	points to a local file, or a valid cached copy of the object exists, the object
				125	is not copied. Return a tuple ``(filename, headers)`` where filename is the
				126	local file name under which the object can be found, and headers is whatever
				127	the :meth:`info` method of the object returned by :func:`urlopen` returned (for
				128	a remote object, possibly cached). Exceptions are the same as for
				129	:func:`urlopen`.
				130
				131	The second argument, if present, specifies the file location to copy to (if
				132	absent, the location will be a tempfile with a generated name). The third
				133	argument, if present, is a hook function that will be called once on
				134	establishment of the network connection and once after each block read
				135	thereafter. The hook will be passed three arguments: a count of blocks
				136	transferred so far, a block size in bytes, and the total size of the file. The
				137	total size may be ``-1`` on older FTP servers which do not return a file
				138	size in response to a retrieval request.
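
The hook signature described above can be sketched as follows. The names ``percent_done`` and ``reporthook`` are hypothetical, and treating a total size of zero like ``-1`` is an assumption of this example, not documented behaviour.

```python
def percent_done(block_count, block_size, total_size):
    """Return download progress in percent, or None if the size is unknown."""
    if total_size <= 0:
        # -1 means the server reported no size; 0 is treated the same way
        # here (an assumption of this sketch).
        return None
    transferred = block_count * block_size
    # The last block is usually only partly filled, so cap the estimate at 100.
    return min(100.0, 100.0 * transferred / total_size)


def reporthook(block_count, block_size, total_size):
    """Hook taking the three arguments described above."""
    done = percent_done(block_count, block_size, total_size)
    if done is None:
        print("%d bytes so far" % (block_count * block_size))
    else:
        print("%.1f%% done" % done)
```

Such a hook would be passed as the third argument, for example ``urllib.urlretrieve(some_url, some_filename, reporthook)``, where ``some_url`` and ``some_filename`` are placeholders.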
				139
				140	If the url uses the :file:`http:` scheme identifier, the optional data
				141	argument may be given to specify a ``POST`` request (normally the request type
				142	is ``GET``). The data argument must be in standard
				143	:mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
				144	function below.
				145
				146	.. versionchanged:: 2.5
				147	:func:`urlretrieve` will raise :exc:`ContentTooShortError` when it detects that
				148	the amount of data available was less than the expected amount (which is the
				149	size reported by a Content-Length header). This can occur, for example, when
				150	the download is interrupted.
				151
				152	The Content-Length is treated as a lower bound: if there's more data to read,
				153	urlretrieve reads more data, but if less data is available, it raises the
				154	exception.
				155
				156	You can still retrieve the downloaded data in this case; it is stored in the
				157	:attr:`content` attribute of the exception instance.
				158
				159	If no Content-Length header was supplied, urlretrieve cannot check the size
				160	of the data it has downloaded, and just returns it. In this case you just have
				161	to assume that the download was successful.
				162
				163
				164	.. data:: _urlopener
				165
				166	The public functions :func:`urlopen` and :func:`urlretrieve` create an instance
				167	of the :class:`FancyURLopener` class and use it to perform their requested
				168	actions. To override this functionality, programmers can create a subclass of
				169	:class:`URLopener` or :class:`FancyURLopener`, then assign an instance of that
				170	class to the ``urllib._urlopener`` variable before calling the desired function.
				171	For example, applications may want to specify a different
				172	:mailheader:`User-Agent` header than :class:`URLopener` defines. This can be
				173	accomplished with the following code::
				174
				175	import urllib
				176
				177	class AppURLopener(urllib.FancyURLopener):
				178	    version = "App/1.7"
				179
				180	urllib._urlopener = AppURLopener()
				181
				182
				183	.. function:: urlcleanup()
				184
				185	Clear the cache that may have been built up by previous calls to
				186	:func:`urlretrieve`.
				187
				188
Georg Brandl	6264765	2008-01-07 18:23:27 +0000	[diff] [blame]	189	Utility functions
				190	-----------------
				191
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	192	.. function:: quote(string[, safe])
				193
				194	Replace special characters in string using the ``%xx`` escape. Letters,
				195	digits, and the characters ``'_.-'`` are never quoted. The optional safe
				196	parameter specifies additional characters that should not be quoted --- its
				197	default value is ``'/'``.
				198
				199	Example: ``quote('/~connolly/')`` yields ``'/%7Econnolly/'``.
				200
				201
				202	.. function:: quote_plus(string[, safe])
				203
				204	Like :func:`quote`, but also replaces spaces by plus signs, as required for
				205	quoting HTML form values. Plus signs in the original string are escaped unless
				206	they are included in safe. Unlike :func:`quote`, safe does not default to ``'/'``.
				207
				208
				209	.. function:: unquote(string)
				210
				211	Replace ``%xx`` escapes by their single-character equivalent.
				212
				213	Example: ``unquote('/%7Econnolly/')`` yields ``'/~connolly/'``.
				214
				215
				216	.. function:: unquote_plus(string)
				217
				218	Like :func:`unquote`, but also replaces plus signs by spaces, as required for
				219	unquoting HTML form values.
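
The four quoting functions above can be compared side by side in a short sketch. The sample strings are made up for illustration, and the fallback import is only there so the example also runs on Python 3, where these functions live in ``urllib.parse``.

```python
try:
    from urllib import quote, quote_plus, unquote, unquote_plus        # Python 2
except ImportError:
    from urllib.parse import quote, quote_plus, unquote, unquote_plus  # Python 3

# '/' is in the default safe set of quote(), so it is left alone.
path = quote("/docs/my file.txt")

# With safe='' the slash is escaped as well.
segment = quote("a/b c", safe="")

# quote_plus() turns spaces into '+' and escapes a literal '+'.
field = quote_plus("fish & chips+vinegar")

# The unquote functions reverse the corresponding quote functions.
plain_path = unquote(path)
plain_field = unquote_plus(field)
```

Note that :func:`unquote` would leave the ``+`` characters in ``field`` untouched; only :func:`unquote_plus` maps them back to spaces.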
				220
				221
				222	.. function:: urlencode(query[, doseq])
				223
				224	Convert a mapping object or a sequence of two-element tuples to a "url-encoded"
				225	string, suitable to pass to :func:`urlopen` above as the optional data
				226	argument. This is useful to pass a dictionary of form fields to a ``POST``
				227	request. The resulting string is a series of ``key=value`` pairs separated by
				228	``'&'`` characters, where both key and value are quoted using
				229	:func:`quote_plus` above. If the optional parameter doseq is present and
				230	evaluates to true, a separate ``key=value`` pair is generated for each element
				231	of any value that is itself a sequence. When a sequence of two-element tuples is used as the query
				232	argument, the first element of each tuple is a key and the second is a value.
				233	The order of parameters in the encoded string will match the order of parameter
				234	tuples in the sequence. The :mod:`cgi` module provides the functions
				235	:func:`parse_qs` and :func:`parse_qsl` which are used to parse query strings
				236	into Python data structures.
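
A brief sketch of the effect of doseq, using made-up field names. The import of ``parse_qs`` from ``urlparse`` (or ``urllib.parse`` on Python 3) is a choice made for this example; the text above refers to the equivalent functions in :mod:`cgi`.

```python
try:
    from urllib import urlencode                  # Python 2
    from urlparse import parse_qs                 # Python 2.6+; cgi.parse_qs is equivalent
except ImportError:
    from urllib.parse import urlencode, parse_qs  # Python 3

# A sequence of pairs keeps its order in the encoded string.
ordered = urlencode([("spam", 1), ("eggs", 2), ("bacon", 0)])

# With doseq true, each element of a sequence value gets its own pair.
multi = urlencode([("tag", ["a", "b"]), ("q", "x y")], doseq=True)

# Without doseq, the list is converted with str() and quoted as one value.
single = urlencode([("tag", ["a", "b"])])

# Parsing the encoded string recovers the repeated key as a list.
decoded = parse_qs(multi)
```

Here ``multi`` contains two ``tag=`` pairs, while ``single`` contains one pair whose value is the quoted text of the whole list, which is rarely what a form handler expects.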
				237
				238
				239	.. function:: pathname2url(path)
				240
				241	Convert the pathname path from the local syntax for a path to the form used in
				242	the path component of a URL. This does not produce a complete URL. The return
				243	value will already be quoted using the :func:`quote` function.
				244
				245
				246	.. function:: url2pathname(path)
				247
				248	Convert the path component path from an encoded URL to the local syntax for a
				249	path. This does not accept a complete URL. This function uses :func:`unquote`
				250	to decode path.
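
The two conversions above are intended to be inverses of each other, which the following sketch illustrates. The path is hypothetical, and the exact form of the converted path depends on the platform and Python version, so the example only relies on the round trip.

```python
import os

try:
    from urllib import pathname2url, url2pathname          # Python 2
except ImportError:
    from urllib.request import pathname2url, url2pathname  # Python 3

# Hypothetical local path, built with os.path.join so that it uses the
# local path syntax on any platform.
local_path = os.path.join(os.sep, "tmp", "my file.txt")

url_path = pathname2url(local_path)   # quoted path component, not a full URL
round_trip = url2pathname(url_path)   # back to the local syntax
```

The space in the file name is escaped as ``%20`` in ``url_path``, and ``round_trip`` is expected to equal ``local_path`` again.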
				251
				252
Georg Brandl	6264765	2008-01-07 18:23:27 +0000	[diff] [blame]	253	URL Opener objects
				254	------------------
				255
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	256	.. class:: URLopener([proxies[, **x509]])
				257
				258	Base class for opening and reading URLs. Unless you need to support opening
				259	objects using schemes other than :file:`http:`, :file:`ftp:`, or :file:`file:`,
				260	you probably want to use :class:`FancyURLopener`.
				261
				262	By default, the :class:`URLopener` class sends a :mailheader:`User-Agent` header
				263	of ``urllib/VVV``, where VVV is the :mod:`urllib` version number.
				264	Applications can define their own :mailheader:`User-Agent` header by subclassing
				265	:class:`URLopener` or :class:`FancyURLopener` and setting the class attribute
				266	:attr:`version` to an appropriate string value in the subclass definition.
				267
				268	The optional proxies parameter should be a dictionary mapping scheme names to
				269	proxy URLs, where an empty dictionary turns proxies off completely. Its default
				270	value is ``None``, in which case environmental proxy settings will be used if
				271	present, as discussed in the definition of :func:`urlopen`, above.
				272
				273	Additional keyword parameters, collected in x509, may be used for
				274	authentication of the client when using the :file:`https:` scheme. The keywords
				275	key_file and cert_file are supported to provide an SSL key and certificate;
				276	both are needed to support client authentication.
				277
				278	:class:`URLopener` objects will raise an :exc:`IOError` exception if the server
				279	returns an error code.
				280
Georg Brandl	6264765	2008-01-07 18:23:27 +0000	[diff] [blame]	281	.. method:: open(fullurl[, data])
				282
				283	Open fullurl using the appropriate protocol. This method sets up cache and
				284	proxy information, then calls the appropriate open method with its input
				285	arguments. If the scheme is not recognized, :meth:`open_unknown` is called.
				286	The data argument has the same meaning as the data argument of
				287	:func:`urlopen`.
				288
				289
				290	.. method:: open_unknown(fullurl[, data])
				291
				292	Overridable interface to open unknown URL types.
				293
				294
				295	.. method:: retrieve(url[, filename[, reporthook[, data]]])
				296
				297	Retrieves the contents of url and places it in filename. The return value
				298	is a tuple consisting of a local filename and either a
				299	:class:`mimetools.Message` object containing the response headers (for remote
				300	URLs) or ``None`` (for local URLs). The caller must then open and read the
				301	contents of filename. If filename is not given and the URL refers to a
				302	local file, the input filename is returned. If the URL is non-local and
				303	filename is not given, the filename is the output of :func:`tempfile.mktemp`
				304	with a suffix that matches the suffix of the last path component of the input
				305	URL. If reporthook is given, it must be a function accepting three numeric
				306	parameters. It will be called after each chunk of data is read from the
				307	network. reporthook is ignored for local URLs.
				308
				309	If the url uses the :file:`http:` scheme identifier, the optional data
				310	argument may be given to specify a ``POST`` request (normally the request type
				311	is ``GET``). The data argument must be in standard
				312	:mimetype:`application/x-www-form-urlencoded` format; see the :func:`urlencode`
				313	function above.
				314
				315
				316	.. attribute:: version
				317
				318	Variable that specifies the user agent of the opener object. To get
				319	:mod:`urllib` to tell servers that it is a particular user agent, set this in a
				320	subclass as a class variable or in the constructor before calling the base
				321	constructor.
				322
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	323
				324	.. class:: FancyURLopener(...)
				325
				326	:class:`FancyURLopener` subclasses :class:`URLopener` providing default handling
				327	for the following HTTP response codes: 301, 302, 303, 307 and 401. For the 30x
				328	response codes listed above, the :mailheader:`Location` header is used to fetch
				329	the actual URL. For 401 response codes (authentication required), basic HTTP
				330	authentication is performed. For the 30x response codes, recursion is bounded
				331	by the value of the maxtries attribute, which defaults to 10.
				332
				333	For all other response codes, the method :meth:`http_error_default` is called
				334	which you can override in subclasses to handle the error appropriately.
				335
				336	.. note::
				337
				338	According to the letter of :rfc:`2616`, 301 and 302 responses to POST requests
				339	must not be automatically redirected without confirmation by the user. In
				340	reality, browsers do allow automatic redirection of these responses, changing
				341	the POST to a GET, and :mod:`urllib` reproduces this behaviour.
				342
				343	The parameters to the constructor are the same as those for :class:`URLopener`.
				344
				345	.. note::
				346
				347	When performing basic authentication, a :class:`FancyURLopener` instance calls
				348	its :meth:`prompt_user_passwd` method. The default implementation asks the
				349	users for the required information on the controlling terminal. A subclass may
				350	override this method to support more appropriate behavior if needed.
				351
Georg Brandl	6264765	2008-01-07 18:23:27 +0000	[diff] [blame]	352	The :class:`FancyURLopener` class offers one additional method that should be
				353	overridden to provide the appropriate behavior:
				354
				355	.. method:: prompt_user_passwd(host, realm)
				356
				357	Return information needed to authenticate the user at the given host in the
				358	specified security realm. The return value should be a tuple, ``(user,
				359	password)``, which can be used for basic authentication.
				360
				361	The default implementation prompts for this information on the terminal; an application
				362	should override this method to use an appropriate interaction model in the local
				363	environment.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	364
				365	.. exception:: ContentTooShortError(msg[, content])
				366
				367	This exception is raised when the :func:`urlretrieve` function detects that the
				368	amount of the downloaded data is less than the expected amount (given by the
				369	Content-Length header). The :attr:`content` attribute stores the downloaded
				370	(and supposedly truncated) data.
				371
				372	.. versionadded:: 2.5
				373
Georg Brandl	6264765	2008-01-07 18:23:27 +0000	[diff] [blame]	374
				375	:mod:`urllib` Restrictions
				376	--------------------------
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	377
				378	.. index::
				379	pair: HTTP; protocol
				380	pair: FTP; protocol
				381
				382	* Currently, only the following protocols are supported: HTTP (versions 0.9 and
				383	1.0), FTP, and local files.
				384
				385	* The caching feature of :func:`urlretrieve` has been disabled until proper
				386	processing of Expiration time headers is implemented.
				387
				388	* There should be a function to query whether a particular URL is in the cache.
				389
				390	* For backward compatibility, if a URL appears to point to a local file but the
				391	file can't be opened, the URL is re-interpreted using the FTP protocol. This
				392	can sometimes cause confusing error messages.
				393
				394	* The :func:`urlopen` and :func:`urlretrieve` functions can cause arbitrarily
				395	long delays while waiting for a network connection to be set up. This means
				396	that it is difficult to build an interactive Web client using these functions
				397	without using threads.
				398
				399	.. index::
				400	single: HTML
				401	pair: HTTP; protocol
				402	module: htmllib
				403
				404	* The data returned by :func:`urlopen` or :func:`urlretrieve` is the raw data
				405	returned by the server. This may be binary data (such as an image), plain text
				406	or (for example) HTML. The HTTP protocol provides type information in the reply
				407	header, which can be inspected by looking at the :mailheader:`Content-Type`
				408	header. If the returned data is HTML, you can use the module :mod:`htmllib` to
				409	parse it.
				410
				411	.. index:: single: FTP
				412
				413	* The code handling the FTP protocol cannot differentiate between a file and a
				414	directory. This can lead to unexpected behavior when attempting to read a URL
				415	that points to a file that is not accessible. If the URL ends in a ``/``, it is
				416	assumed to refer to a directory and will be handled accordingly. But if an
				417	attempt to read a file leads to a 550 error (meaning the URL cannot be found or
				418	is not accessible, often for permission reasons), then the path is treated as a
				419	directory in order to handle the case when a directory is specified by a URL but
				420	the trailing ``/`` has been left off. This can cause misleading results when
				421	you try to fetch a file whose read permissions make it inaccessible; the FTP
				422	code will try to read it, fail with a 550 error, and then perform a directory
				423	listing for the unreadable file. If fine-grained control is needed, consider
				424	using the :mod:`ftplib` module, subclassing :class:`FancyURLopener`, or changing
				425	``_urlopener`` to meet your needs.
				426
				427	* This module does not support the use of proxies which require authentication.
				428	This may be implemented in the future.
				429
				430	.. index:: module: urlparse
				431
				432	* Although the :mod:`urllib` module contains (undocumented) routines to parse
				433	and unparse URL strings, the recommended interface for URL manipulation is in
				434	module :mod:`urlparse`.
				435
				436
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	437	.. _urllib-examples:
				438
				439	Examples
				440	--------
				441
				442	Here is an example session that uses the ``GET`` method to retrieve a URL
				443	containing parameters::
				444
				445	>>> import urllib
				446	>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
				447	>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
				448	>>> print f.read()
				449
				450	The following example uses the ``POST`` method instead::
				451
				452	>>> import urllib
				453	>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
				454	>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
				455	>>> print f.read()
				456
				457	The following example uses an explicitly specified HTTP proxy, overriding
				458	environment settings::
				459
				460	>>> import urllib
				461	>>> proxies = {'http': 'http://proxy.example.com:8080/'}
				462	>>> opener = urllib.FancyURLopener(proxies)
				463	>>> f = opener.open("http://www.python.org")
				464	>>> f.read()
				465
				466	The following example uses no proxies at all, overriding environment settings::
				467
				468	>>> import urllib
				469	>>> opener = urllib.FancyURLopener({})
				470	>>> f = opener.open("http://www.python.org/")
				471	>>> f.read()
				472