\section{\module{urlparse} ---
         Parse URLs into components}
\declaremodule{standard}{urlparse}

\modulesynopsis{Parse URLs into components.}

\index{WWW}
\index{World Wide Web}
\index{URL}
\indexii{URL}{parsing}
\indexii{relative}{URL}


This module defines a standard interface to break Uniform Resource
Locator (URL) strings into components (addressing scheme, network
location, path, etc.), to combine the components back into a URL
string, and to convert a ``relative URL'' to an absolute URL given a
``base URL.''

The module has been designed to match the Internet RFC on Relative
Uniform Resource Locators (and discovered a bug in an earlier
draft!). It supports the following URL schemes:
\code{file}, \code{ftp}, \code{gopher}, \code{hdl}, \code{http},
\code{https}, \code{imap}, \code{mailto}, \code{mms}, \code{news},
\code{nntp}, \code{prospero}, \code{rsync}, \code{rtsp}, \code{rtspu},
\code{sftp}, \code{shttp}, \code{sip}, \code{sips}, \code{snews}, \code{svn},
\code{svn+ssh}, \code{telnet}, \code{wais}.

\versionadded[Support for the \code{sftp} and \code{sips} schemes]{2.5}

The \module{urlparse} module defines the following functions:

\begin{funcdesc}{urlparse}{urlstring\optional{,
                 default_scheme\optional{, allow_fragments}}}
Parse a URL into six components, returning a 6-tuple. This
corresponds to the general structure of a URL:
\code{\var{scheme}://\var{netloc}/\var{path};\var{parameters}?\var{query}\#\var{fragment}}.
Each tuple item is a string, possibly empty.
The components are not broken up into smaller parts (for example, the network
location is a single string), and \% escapes are not expanded.
The delimiters as shown above are not part of the result,
except for a leading slash in the \var{path} component, which is
retained if present. For example:

\begin{verbatim}
>>> from urlparse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o
('http', 'www.cwi.nl:80', '/%7Eguido/Python.html', '', '', '')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'
\end{verbatim}

If the \var{default_scheme} argument is specified, it gives the
default addressing scheme, to be used only if the URL does not
specify one. The default value for this argument is the empty string.

If the \var{allow_fragments} argument is false, fragment identifiers
are not allowed, even if the URL's addressing scheme normally does
support them. The default value for this argument is \constant{True}.

The return value is actually an instance of a subclass of
\pytype{tuple}. This class has the following additional read-only
convenience attributes:

\begin{tableiv}{l|c|l|c}{member}{Attribute}{Index}{Value}{Value if not present}
  \lineiv{scheme}  {0} {URL scheme specifier}             {empty string}
  \lineiv{netloc}  {1} {Network location part}            {empty string}
  \lineiv{path}    {2} {Hierarchical path}                {empty string}
  \lineiv{params}  {3} {Parameters for last path element} {empty string}
  \lineiv{query}   {4} {Query component}                  {empty string}
  \lineiv{fragment}{5} {Fragment identifier}              {empty string}
  \lineiv{username}{ } {User name}                        {\constant{None}}
  \lineiv{password}{ } {Password}                         {\constant{None}}
  \lineiv{hostname}{ } {Host name (lower case)}           {\constant{None}}
  \lineiv{port}    { } {Port number as integer, if present} {\constant{None}}
\end{tableiv}

See section~\ref{urlparse-result-object}, ``Results of
\function{urlparse()} and \function{urlsplit()},'' for more
information on the result object.

\versionchanged[Added attributes to return value]{2.5}
\end{funcdesc}

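The optional arguments and the convenience attributes of
\function{urlparse()} can be illustrated with a minimal sketch. The
URLs and credentials are made up for illustration, the optional
arguments are passed positionally, and the fallback import from
\module{urllib.parse} is an assumption for interpreters that do not
ship the module under the name \module{urlparse}.

```python
try:
    from urlparse import urlparse          # module documented here
except ImportError:
    from urllib.parse import urlparse      # assumed equivalent elsewhere

# default_scheme is used only when the URL itself names no scheme.
no_scheme = urlparse('//www.cwi.nl:80/%7Eguido/Python.html', 'http')
print(no_scheme[0])        # http
print(no_scheme[1])        # www.cwi.nl:80

# With allow_fragments false, '#' is not treated as a delimiter,
# so the would-be fragment stays in the path component.
no_frag = urlparse('http://www.cwi.nl/doc.html#top', '', False)
print(no_frag[2])          # /doc.html#top
print(repr(no_frag[5]))    # ''

# Convenience attributes: hostname is lower-cased, port is an integer.
full = urlparse('http://guido:secret@WWW.CWI.NL:80/index.html')
print((full.username, full.password, full.hostname, full.port))

# Attributes for parts that are absent from the URL are None.
bare = urlparse('http://www.cwi.nl/index.html')
print((bare.username, bare.port))
```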
\begin{funcdesc}{urlunparse}{parts}
Construct a URL from a tuple as returned by \function{urlparse()}.
The \var{parts} argument can be any six-item iterable.
This may result in a slightly different, but equivalent URL, if the
URL that was parsed originally had unnecessary delimiters (for example,
a \code{?} with an empty query; the RFC states that these are equivalent).
\end{funcdesc}

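A short sketch of the round trip described for
\function{urlunparse()}: the first part shows an unnecessary delimiter
being dropped, the second builds a URL from a plain list. The URLs are
illustrative, and the fallback import from \module{urllib.parse} is an
assumption for interpreters without a module named \module{urlparse}.

```python
try:
    from urlparse import urlparse, urlunparse
except ImportError:
    from urllib.parse import urlparse, urlunparse   # assumed equivalent

# A trailing '?' with an empty query is an unnecessary delimiter,
# so it is not reproduced when the URL is put back together.
original = 'http://www.cwi.nl/%7Eguido/Python.html?'
rebuilt = urlunparse(urlparse(original))
print(rebuilt)     # http://www.cwi.nl/%7Eguido/Python.html

# Any six-item iterable is accepted, for example a list:
# scheme, netloc, path, params, query, fragment.
parts = ['http', 'www.cwi.nl', '/index.html', '', 'q=1', 'top']
from_list = urlunparse(parts)
print(from_list)   # http://www.cwi.nl/index.html?q=1#top
```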
\begin{funcdesc}{urlsplit}{urlstring\optional{,
                 default_scheme\optional{, allow_fragments}}}
This is similar to \function{urlparse()}, but does not split the
params from the URL. This should generally be used instead of
\function{urlparse()} if the more recent URL syntax allowing
parameters to be applied to each segment of the \var{path} portion of
the URL (see \rfc{2396}) is wanted. A separate function is needed to
separate the path segments and parameters. This function returns a
5-tuple: (addressing scheme, network location, path, query, fragment
identifier).

The return value is actually an instance of a subclass of
\pytype{tuple}. This class has the following additional read-only
convenience attributes:

\begin{tableiv}{l|c|l|c}{member}{Attribute}{Index}{Value}{Value if not present}
  \lineiv{scheme}   {0} {URL scheme specifier}   {empty string}
  \lineiv{netloc}   {1} {Network location part}  {empty string}
  \lineiv{path}     {2} {Hierarchical path}      {empty string}
  \lineiv{query}    {3} {Query component}        {empty string}
  \lineiv{fragment} {4} {Fragment identifier}    {empty string}
  \lineiv{username} { } {User name}              {\constant{None}}
  \lineiv{password} { } {Password}               {\constant{None}}
  \lineiv{hostname} { } {Host name (lower case)} {\constant{None}}
  \lineiv{port}     { } {Port number as integer, if present} {\constant{None}}
\end{tableiv}

See section~\ref{urlparse-result-object}, ``Results of
\function{urlparse()} and \function{urlsplit()},'' for more
information on the result object.

\versionadded{2.2}
\versionchanged[Added attributes to return value]{2.5}
\end{funcdesc}

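The difference between \function{urlsplit()} and
\function{urlparse()} shows up when path segments carry parameters.
The sketch below uses an illustrative URL; the helper
\code{segments_with_params()} is hypothetical (it is not part of the
module) and only suggests what the ``separate function'' mentioned
above could look like. The fallback import from \module{urllib.parse}
is an assumption for interpreters without a module named
\module{urlparse}.

```python
try:
    from urlparse import urlparse, urlsplit
except ImportError:
    from urllib.parse import urlparse, urlsplit   # assumed equivalent

url = 'http://www.cwi.nl/a;x=1/b;y=2?q=1#top'

# urlsplit() leaves all parameters in the path ...
split = urlsplit(url)
print(split[2])                 # /a;x=1/b;y=2

# ... while urlparse() separates those of the last path segment.
parsed = urlparse(url)
print((parsed[2], parsed[3]))   # ('/a;x=1/b', 'y=2')


def segments_with_params(path):
    """Hypothetical helper: split a path into (segment, params) pairs."""
    pairs = []
    for segment in path.split('/'):
        name, _, params = segment.partition(';')
        pairs.append((name, params))
    return pairs


pairs = segments_with_params(split[2])
print(pairs)                    # [('', ''), ('a', 'x=1'), ('b', 'y=2')]
```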
\begin{funcdesc}{urlunsplit}{parts}
Combine the elements of a tuple as returned by \function{urlsplit()}
into a complete URL as a string.
The \var{parts} argument can be any five-item iterable.
This may result in a slightly different, but equivalent URL, if the
URL that was parsed originally had unnecessary delimiters (for example,
a \code{?} with an empty query; the RFC states that these are equivalent).
\versionadded{2.2}
\end{funcdesc}

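Because \function{urlunsplit()} accepts any five-item iterable, one
common pattern is to copy a split result into a list, change a single
component, and recombine. The following is a sketch with an
illustrative URL; the fallback import from \module{urllib.parse} is an
assumption for interpreters without a module named \module{urlparse}.

```python
try:
    from urlparse import urlsplit, urlunsplit
except ImportError:
    from urllib.parse import urlsplit, urlunsplit   # assumed equivalent

# The result of urlsplit() is a tuple, so copy it into a list
# before changing a component.
parts = list(urlsplit('http://www.cwi.nl/doc/?lang=nl#intro'))
parts[3] = 'lang=en'            # index 3 is the query component

new_url = urlunsplit(parts)
print(new_url)                  # http://www.cwi.nl/doc/?lang=en#intro
```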
\begin{funcdesc}{urljoin}{base, url\optional{, allow_fragments}}
Construct a full (``absolute'') URL by combining a ``base URL''
(\var{base}) with another URL (\var{url}). Informally, this
uses components of the base URL, in particular the addressing scheme,
the network location and (part of) the path, to provide missing
components in the relative URL. For example:

\begin{verbatim}
>>> from urlparse import urljoin
>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
'http://www.cwi.nl/%7Eguido/FAQ.html'
\end{verbatim}

The \var{allow_fragments} argument has the same meaning and default as
for \function{urlparse()}.

\note{If \var{url} is an absolute URL (that is, starting with \code{//}
or \code{scheme://}), the \var{url}'s host name and/or scheme
will be present in the result. For example:}

\begin{verbatim}
>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
...         '//www.python.org/%7Eguido')
'http://www.python.org/%7Eguido'
\end{verbatim}

If you do not want that behavior, preprocess
the \var{url} with \function{urlsplit()} and \function{urlunsplit()},
removing possible \emph{scheme} and \emph{netloc} parts.
\end{funcdesc}

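The preprocessing suggested for \function{urljoin()} can be sketched
as follows. The helper name \code{join_same_host()} is hypothetical;
it simply discards the \emph{scheme} and \emph{netloc} parts of
\var{url} before joining, so the result always stays on the host of
\var{base}. The fallback import from \module{urllib.parse} is an
assumption for interpreters without a module named \module{urlparse}.

```python
try:
    from urlparse import urljoin, urlsplit, urlunsplit
except ImportError:
    from urllib.parse import urljoin, urlsplit, urlunsplit  # assumed equivalent


def join_same_host(base, url):
    """Hypothetical helper: join url to base, ignoring url's scheme and host."""
    parts = urlsplit(url)
    # Keep only path, query and fragment; blank out scheme and netloc.
    relative = urlunsplit(('', '', parts[2], parts[3], parts[4]))
    return urljoin(base, relative)


base = 'http://www.cwi.nl/%7Eguido/Python.html'

other_host = join_same_host(base, '//www.python.org/%7Eguido')
print(other_host)     # http://www.cwi.nl/%7Eguido

plain = join_same_host(base, 'FAQ.html')
print(plain)          # http://www.cwi.nl/%7Eguido/FAQ.html
```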
\begin{funcdesc}{urldefrag}{url}
If \var{url} contains a fragment identifier, return a modified
version of \var{url} with no fragment identifier, and the fragment
identifier as a separate string. If there is no fragment identifier
in \var{url}, return \var{url} unmodified and an empty string.
\end{funcdesc}


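Both cases described for \function{urldefrag()} can be seen in a
short sketch. The URLs are illustrative, and the fallback import from
\module{urllib.parse} is an assumption for interpreters without a
module named \module{urlparse}.

```python
try:
    from urlparse import urldefrag
except ImportError:
    from urllib.parse import urldefrag   # assumed equivalent

# With a fragment: the URL comes back without it, plus the fragment.
url, fragment = urldefrag('http://www.python.org/doc/#intro')
print(url)               # http://www.python.org/doc/
print(fragment)          # intro

# Without a fragment: the URL is unchanged and the fragment is empty.
same_url, empty = urldefrag('http://www.python.org/doc/')
print(same_url)          # http://www.python.org/doc/
print(repr(empty))       # ''
```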
\begin{seealso}
  \seerfc{1738}{Uniform Resource Locators (URL)}{
        This specifies the formal syntax and semantics of absolute
        URLs.}
  \seerfc{1808}{Relative Uniform Resource Locators}{
        This Request For Comments includes the rules for joining an
        absolute and a relative URL, including a fair number of
        ``Abnormal Examples'' which govern the treatment of border
        cases.}
  \seerfc{2396}{Uniform Resource Identifiers (URI): Generic Syntax}{
        Document describing the generic syntactic requirements for
        both Uniform Resource Names (URNs) and Uniform Resource
        Locators (URLs).}
\end{seealso}


\subsection{Results of \function{urlparse()} and \function{urlsplit()}
            \label{urlparse-result-object}}

The result objects from the \function{urlparse()} and
\function{urlsplit()} functions are subclasses of the \pytype{tuple}
type. These subclasses add the attributes described for those
functions and provide an additional method:

\begin{methoddesc}[ParseResult]{geturl}{}
  Return the re-combined version of the original URL as a string.
  This may differ from the original URL in that the scheme will always
  be normalized to lower case and empty components may be dropped.
  Specifically, empty parameters, queries, and fragment identifiers
  will be removed.

  The result of this method is a fixed point if passed back through the
  original parsing function:

\begin{verbatim}
>>> import urlparse
>>> url = 'HTTP://www.Python.org/doc/#'

>>> r1 = urlparse.urlsplit(url)
>>> r1.geturl()
'http://www.Python.org/doc/'

>>> r2 = urlparse.urlsplit(r1.geturl())
>>> r2.geturl()
'http://www.Python.org/doc/'
\end{verbatim}

\versionadded{2.5}
\end{methoddesc}

The following classes provide the implementations of the parse results:

\begin{classdesc*}{BaseResult}
  Base class for the concrete result classes. This provides most of
  the attribute definitions. It does not provide a \method{geturl()}
  method. It is derived from \class{tuple}, but does not override the
  \method{__init__()} or \method{__new__()} methods.
\end{classdesc*}


\begin{classdesc}{ParseResult}{scheme, netloc, path, params, query, fragment}
  Concrete class for \function{urlparse()} results. The
  \method{__new__()} method is overridden to support checking that the
  right number of arguments is passed.
\end{classdesc}


\begin{classdesc}{SplitResult}{scheme, netloc, path, query, fragment}
  Concrete class for \function{urlsplit()} results. The
  \method{__new__()} method is overridden to support checking that the
  right number of arguments is passed.
\end{classdesc}
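The concrete result classes can also be instantiated directly. The
sketch below does so for \class{SplitResult}; the component values
are illustrative. It assumes that passing the wrong number of
arguments surfaces as a \exception{TypeError}, and the fallback
import from \module{urllib.parse} is an assumption for interpreters
without a module named \module{urlparse}.

```python
try:
    from urlparse import SplitResult, urlsplit
except ImportError:
    from urllib.parse import SplitResult, urlsplit   # assumed equivalent

# Build a result by hand: scheme, netloc, path, query, fragment.
built = SplitResult('http', 'www.python.org', '/doc/', 'q=1', '')
print(isinstance(built, tuple))    # True
print(built.geturl())              # http://www.python.org/doc/?q=1

# A hand-built result compares equal to the parsed equivalent,
# because both are tuples with the same items.
matches = (built == urlsplit('http://www.python.org/doc/?q=1'))
print(matches)                     # True

# Passing too few components is rejected (assumed: TypeError).
try:
    SplitResult('http', 'www.python.org')
except TypeError:
    wrong_count_rejected = True
else:
    wrong_count_rejected = False
print(wrong_count_rejected)        # True
```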