\section{\module{urlparse} ---
         Parse URLs into components}
\declaremodule{standard}{urlparse}

\modulesynopsis{Parse URLs into components.}

\index{WWW}
\index{World Wide Web}
\index{URL}
\indexii{URL}{parsing}
\indexii{relative}{URL}

This module defines a standard interface to break Uniform Resource
Locator (URL) strings up into components (addressing scheme, network
location, path, etc.), to combine the components back into a URL
string, and to convert a ``relative URL'' to an absolute URL given a
``base URL.''

The module has been designed to match the Internet RFC on Relative
Uniform Resource Locators (and discovered a bug in an earlier
draft!).  It supports the following URL schemes:
\code{file}, \code{ftp}, \code{gopher}, \code{hdl}, \code{http},
\code{https}, \code{imap}, \code{mailto}, \code{mms}, \code{news},
\code{nntp}, \code{prospero}, \code{rsync}, \code{rtsp}, \code{rtspu},
\code{sftp}, \code{shttp}, \code{sip}, \code{sips}, \code{snews}, \code{svn},
\code{svn+ssh}, \code{telnet}, \code{wais}.

\versionadded[Support for the \code{sftp} and \code{sips} schemes]{2.5}

The \module{urlparse} module defines the following functions:

\begin{funcdesc}{urlparse}{urlstring\optional{,
                 default_scheme\optional{, allow_fragments}}}
Parse a URL into six components, returning a 6-tuple.  This
corresponds to the general structure of a URL:
\code{\var{scheme}://\var{netloc}/\var{path};\var{parameters}?\var{query}\#\var{fragment}}.
Each tuple item is a string, possibly empty.
The components are not broken up into smaller parts (for example, the network
location is a single string), and \% escapes are not expanded.
The delimiters as shown above are not part of the result,
except for a leading slash in the \var{path} component, which is
retained if present.  For example:

\begin{verbatim}
>>> from urlparse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o
('http', 'www.cwi.nl:80', '/%7Eguido/Python.html', '', '', '')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'
\end{verbatim}

If the \var{default_scheme} argument is specified, it gives the
default addressing scheme, to be used only if the URL does not
specify one.  The default value for this argument is the empty string.

If the \var{allow_fragments} argument is false, fragment identifiers
are not allowed, even if the URL's addressing scheme normally does
support them.  The default value for this argument is \constant{True}.
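
The effect of the two optional arguments can be sketched as follows.
This is only an illustration: the URLs are made up, the default scheme
is passed positionally, and the fallback import assumes that on
Python~3 the same function is available from \module{urllib.parse}.

```python
# Assumption: on Python 3 the same function lives in urllib.parse.
try:
    from urlparse import urlparse
except ImportError:
    from urllib.parse import urlparse

# The URL names no scheme, so the default scheme fills the gap.
no_scheme = urlparse('//www.cwi.nl/%7Eguido/Python.html', 'http')

# The URL names its own scheme, so the default scheme is ignored.
own_scheme = urlparse('ftp://www.cwi.nl/pub/', 'http')

# With allow_fragments false, '#' is not treated as a delimiter and
# the would-be fragment stays in the preceding component.
no_frag = urlparse('http://www.cwi.nl/doc/index.html#intro',
                   allow_fragments=False)

print('%s %s' % (no_scheme[0], no_scheme[1]))
print(own_scheme[0])
print('%s %r' % (no_frag[2], no_frag[5]))
```

Note that a URL without a leading \code{//} has no network location,
so the default scheme alone does not turn a bare host name into one.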

The return value is actually an instance of a subclass of
\pytype{tuple}.  This class has the following additional read-only
convenience attributes:

\begin{tableiv}{l|c|l|c}{member}{Attribute}{Index}{Value}{Value if not present}
  \lineiv{scheme}  {0} {URL scheme specifier}               {empty string}
  \lineiv{netloc}  {1} {Network location part}              {empty string}
  \lineiv{path}    {2} {Hierarchical path}                  {empty string}
  \lineiv{params}  {3} {Parameters for last path element}   {empty string}
  \lineiv{query}   {4} {Query component}                    {empty string}
  \lineiv{fragment}{5} {Fragment identifier}                {empty string}
  \lineiv{username}{ } {User name}                          {\constant{None}}
  \lineiv{password}{ } {Password}                           {\constant{None}}
  \lineiv{hostname}{ } {Host name (lower case)}             {\constant{None}}
  \lineiv{port}    { } {Port number as integer, if present} {\constant{None}}
\end{tableiv}

See section~\ref{urlparse-result-object}, ``Results of
\function{urlparse()} and \function{urlsplit()},'' for more
information on the result object.
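
As a rough sketch of how the attributes relate to the tuple items, the
following reads both from one result.  The credentials and port in the
URL are invented for the example, and the fallback import assumes the
Python~3 location of the function.

```python
# Assumption: on Python 3 the same function lives in urllib.parse.
try:
    from urlparse import urlparse
except ImportError:
    from urllib.parse import urlparse

# Hypothetical URL; the user name, password and port are made up.
o = urlparse('http://guido:secret@WWW.CWI.nl:8080/%7Eguido/Python.html')

# Indexed components are plain strings; netloc is kept verbatim.
netloc = o.netloc          # same item as o[1]

# The derived attributes pick the network location apart.
user = o.username
password = o.password
host = o.hostname          # lower-cased
port = o.port              # an integer, not a string

# Attributes that are absent from the URL come back as None.
bare = urlparse('http://www.cwi.nl/')
```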

\versionchanged[Added attributes to return value]{2.5}
\end{funcdesc}

\begin{funcdesc}{urlunparse}{parts}
Construct a URL from a tuple as returned by \function{urlparse()}.
The \var{parts} argument can be any six-item iterable.
This may result in a slightly different, but equivalent URL, if the
URL that was parsed originally had unnecessary delimiters (for example,
a ? with an empty query; the RFC states that these are equivalent).
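
A small sketch of both points, using a made-up URL with a trailing
\code{?} and a hand-built list of components (the fallback import
assumes the Python~3 location of the functions):

```python
# Assumption: on Python 3 the same functions live in urllib.parse.
try:
    from urlparse import urlparse, urlunparse
except ImportError:
    from urllib.parse import urlparse, urlunparse

# The trailing '?' introduces an empty query, which is not reproduced.
original = 'http://www.cwi.nl/%7Eguido/Python.html?'
rebuilt = urlunparse(urlparse(original))

# Any six-item iterable works, for example a list built by hand:
# scheme, netloc, path, params, query, fragment.
by_hand = urlunparse(['http', 'www.cwi.nl', '/%7Eguido/Python.html',
                      '', 'lang=en', 'top'])

print(rebuilt)
print(by_hand)
```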
\end{funcdesc}

\begin{funcdesc}{urlsplit}{urlstring\optional{,
                 default_scheme\optional{, allow_fragments}}}
This is similar to \function{urlparse()}, but does not split the
parameters from the path.  Use it instead of \function{urlparse()}
when the more recent URL syntax is wanted, which allows parameters to
be applied to each segment of the \var{path} portion of the URL (see
\rfc{2396}).  A separate function is then needed to separate the path
segments and their parameters.  This function returns a
5-tuple: (addressing scheme, network location, path, query, fragment
identifier).
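
The difference from \function{urlparse()} can be sketched with a URL
that carries parameters on two path segments.  The URL is invented,
and \code{split_path_params()} is a hypothetical helper written for
this example; it is not part of the module.

```python
# Assumption: on Python 3 the same functions live in urllib.parse.
try:
    from urlparse import urlparse, urlsplit
except ImportError:
    from urllib.parse import urlparse, urlsplit

url = 'http://www.cwi.nl/docs;v=1/Python.html;lang=en?q=1'

parsed = urlparse(url)   # params of the last segment are split off
split = urlsplit(url)    # params stay inside the path


def split_path_params(path):
    """Hypothetical helper: return (segment, params) pairs for a path."""
    pairs = []
    for segment in path.split('/'):
        if ';' in segment:
            name, params = segment.split(';', 1)
        else:
            name, params = segment, ''
        pairs.append((name, params))
    return pairs


# The leading slash yields an empty first segment.
segments = split_path_params(split[2])
```

Splitting on \code{/} first and on \code{;} second keeps each
parameter string attached to the segment it belongs to.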

The return value is actually an instance of a subclass of
\pytype{tuple}.  This class has the following additional read-only
convenience attributes:

\begin{tableiv}{l|c|l|c}{member}{Attribute}{Index}{Value}{Value if not present}
  \lineiv{scheme}   {0} {URL scheme specifier}               {empty string}
  \lineiv{netloc}   {1} {Network location part}              {empty string}
  \lineiv{path}     {2} {Hierarchical path}                  {empty string}
  \lineiv{query}    {3} {Query component}                    {empty string}
  \lineiv{fragment} {4} {Fragment identifier}                {empty string}
  \lineiv{username} { } {User name}                          {\constant{None}}
  \lineiv{password} { } {Password}                           {\constant{None}}
  \lineiv{hostname} { } {Host name (lower case)}             {\constant{None}}
  \lineiv{port}     { } {Port number as integer, if present} {\constant{None}}
\end{tableiv}

See section~\ref{urlparse-result-object}, ``Results of
\function{urlparse()} and \function{urlsplit()},'' for more
information on the result object.

\versionadded{2.2}
\versionchanged[Added attributes to return value]{2.5}
\end{funcdesc}

\begin{funcdesc}{urlunsplit}{parts}
Combine the elements of a tuple as returned by \function{urlsplit()}
into a complete URL as a string.
The \var{parts} argument can be any five-item iterable.
This may result in a slightly different, but equivalent URL, if the
URL that was parsed originally had unnecessary delimiters (for example,
a ? with an empty query; the RFC states that these are equivalent).
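
Because any five-item iterable is accepted, one way to change a single
component is to copy the result into a list, edit it, and recombine
it.  The following is a sketch with an invented URL, not the only way
to do this:

```python
# Assumption: on Python 3 the same functions live in urllib.parse.
try:
    from urlparse import urlsplit, urlunsplit
except ImportError:
    from urllib.parse import urlsplit, urlunsplit

parts = list(urlsplit('http://www.cwi.nl/%7Eguido/search?page=1#results'))
parts[3] = 'page=2'     # index 3 is the query
parts[4] = ''           # index 4 is the fragment; empty drops it
next_page = urlunsplit(parts)

print(next_page)
```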
\versionadded{2.2}
\end{funcdesc}

\begin{funcdesc}{urljoin}{base, url\optional{, allow_fragments}}
Construct a full (``absolute'') URL by combining a ``base URL''
(\var{base}) with another URL (\var{url}).  Informally, this
uses components of the base URL, in particular the addressing scheme,
the network location and (part of) the path, to provide missing
components in the relative URL.  For example:

\begin{verbatim}
>>> from urlparse import urljoin
>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
'http://www.cwi.nl/%7Eguido/FAQ.html'
\end{verbatim}

The \var{allow_fragments} argument has the same meaning and default as
for \function{urlparse()}.
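
Two further cases may help to show which part of the base path
survives: a reference that climbs out of the base directory, and one
that starts at the root.  The relative URLs are invented for this
sketch.

```python
# Assumption: on Python 3 the same function lives in urllib.parse.
try:
    from urlparse import urljoin
except ImportError:
    from urllib.parse import urljoin

base = 'http://www.cwi.nl/%7Eguido/Python.html'

# '..' removes the last directory of the base path.
parent = urljoin(base, '../index.html')

# A leading slash replaces the whole base path.
rooted = urljoin(base, '/doc/lib/')

print(parent)
print(rooted)
```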

\note{If \var{url} is an absolute URL (that is, starting with \code{//}
      or \code{scheme://}), the \var{url}'s host name and/or scheme
      will be present in the result.  For example:}

\begin{verbatim}
>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
...         '//www.python.org/%7Eguido')
'http://www.python.org/%7Eguido'
\end{verbatim}

If you do not want that behavior, preprocess
the \var{url} with \function{urlsplit()} and \function{urlunsplit()},
removing possible \emph{scheme} and \emph{netloc} parts.
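
One possible shape for that preprocessing step is sketched below.
\code{join_same_host()} is a hypothetical helper named for this
example only; it blanks the scheme and network location of \var{url}
before joining.

```python
# Assumption: on Python 3 the same functions live in urllib.parse.
try:
    from urlparse import urljoin, urlsplit, urlunsplit
except ImportError:
    from urllib.parse import urljoin, urlsplit, urlunsplit


def join_same_host(base, url):
    """Hypothetical helper: join url onto base, ignoring url's own
    scheme and network location."""
    parts = urlsplit(url)
    # Keep only path, query and fragment of the second URL.
    local = urlunsplit(('', '', parts[2], parts[3], parts[4]))
    return urljoin(base, local)


base = 'http://www.cwi.nl/%7Eguido/Python.html'
pinned = join_same_host(base, '//www.python.org/%7Eguido')

print(pinned)
```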
\end{funcdesc}

\begin{funcdesc}{urldefrag}{url}
If \var{url} contains a fragment identifier, return a modified
version of \var{url} with no fragment identifier, and the fragment
identifier as a separate string.  If there is no fragment identifier
in \var{url}, return \var{url} unmodified and an empty string.
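
Both cases can be sketched with an invented URL; the result is
unpacked into the stripped URL and the fragment:

```python
# Assumption: on Python 3 the same function lives in urllib.parse.
try:
    from urlparse import urldefrag
except ImportError:
    from urllib.parse import urldefrag

# A fragment is present: it is split off and returned separately.
stripped, fragment = urldefrag('http://www.cwi.nl/%7Eguido/Python.html#intro')

# No fragment: the URL comes back as given, with an empty string.
same, empty = urldefrag('http://www.cwi.nl/%7Eguido/Python.html')
```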
\end{funcdesc}


\begin{seealso}
  \seerfc{1738}{Uniform Resource Locators (URL)}{
          This specifies the formal syntax and semantics of absolute
          URLs.}
  \seerfc{1808}{Relative Uniform Resource Locators}{
          This Request For Comments includes the rules for joining an
          absolute and a relative URL, including a fair number of
          ``Abnormal Examples'' which govern the treatment of border
          cases.}
  \seerfc{2396}{Uniform Resource Identifiers (URI): Generic Syntax}{
          Document describing the generic syntactic requirements for
          both Uniform Resource Names (URNs) and Uniform Resource
          Locators (URLs).}
\end{seealso}


\subsection{Results of \function{urlparse()} and \function{urlsplit()}
            \label{urlparse-result-object}}

The result objects from the \function{urlparse()} and
\function{urlsplit()} functions are subclasses of the \pytype{tuple}
type.  These subclasses add the attributes described in those
functions, as well as provide an additional method:

\begin{methoddesc}[ParseResult]{geturl}{}
  Return the re-combined version of the original URL as a string.
  This may differ from the original URL in that the scheme will always
  be normalized to lower case and empty components may be dropped.
  Specifically, empty parameters, queries, and fragment identifiers
  will be removed.

  The result of this method is a fixpoint if passed back through the
  original parsing function:

\begin{verbatim}
>>> import urlparse
>>> url = 'HTTP://www.Python.org/doc/#'

>>> r1 = urlparse.urlsplit(url)
>>> r1.geturl()
'http://www.Python.org/doc/'

>>> r2 = urlparse.urlsplit(r1.geturl())
>>> r2.geturl()
'http://www.Python.org/doc/'
\end{verbatim}

  \versionadded{2.5}
\end{methoddesc}

The following classes provide the implementations of the parse results:

\begin{classdesc*}{BaseResult}
  Base class for the concrete result classes.  This provides most of
  the attribute definitions.  It does not provide a \method{geturl()}
  method.  It is derived from \class{tuple}, but does not override the
  \method{__init__()} or \method{__new__()} methods.
\end{classdesc*}


\begin{classdesc}{ParseResult}{scheme, netloc, path, params, query, fragment}
  Concrete class for \function{urlparse()} results.  The
  \method{__new__()} method is overridden to support checking that the
  right number of arguments is passed.
\end{classdesc}


\begin{classdesc}{SplitResult}{scheme, netloc, path, query, fragment}
  Concrete class for \function{urlsplit()} results.  The
  \method{__new__()} method is overridden to support checking that the
  right number of arguments is passed.
\end{classdesc}
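
As an illustration of the argument check, the sketch below builds a
\class{ParseResult} by hand and then passes too few components.  The
URL is invented, and catching \exception{TypeError} reflects what the
check is expected to raise rather than something stated above.

```python
# Assumption: on Python 3 the same names live in urllib.parse.
try:
    from urlparse import urlparse, ParseResult
except ImportError:
    from urllib.parse import urlparse, ParseResult

result = urlparse('http://www.cwi.nl/%7Eguido/Python.html')
is_tuple = isinstance(result, tuple)
is_parse_result = isinstance(result, ParseResult)

# Building a result by hand; geturl() recombines the six components.
built = ParseResult('http', 'www.cwi.nl', '/%7Eguido/Python.html',
                    '', '', '')
rebuilt = built.geturl()

# Passing the wrong number of components is rejected.
try:
    ParseResult('http', 'www.cwi.nl')
    rejected = False
except TypeError:
    rejected = True
```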