:mod:`urlparse` --- Parse URLs into components
==============================================

.. module:: urlparse
   :synopsis: Parse URLs into or assemble them from components.


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   pair: URL; parsing
   pair: relative; URL

.. note::
   The :mod:`urlparse` module is renamed to :mod:`urllib.parse` in Python 3.
   The :term:`2to3` tool will automatically adapt imports when converting
   your sources to Python 3.

**Source code:** :source:`Lib/urlparse.py`

--------------

This module defines a standard interface to break Uniform Resource Locator (URL)
strings up in components (addressing scheme, network location, path etc.), to
combine the components back into a URL string, and to convert a "relative URL"
to an absolute URL given a "base URL."

The module has been designed to match the Internet RFC on Relative Uniform
Resource Locators. It supports the following URL schemes: ``file``, ``ftp``,
``gopher``, ``hdl``, ``http``, ``https``, ``imap``, ``mailto``, ``mms``,
``news``, ``nntp``, ``prospero``, ``rsync``, ``rtsp``, ``rtspu``, ``sftp``,
``shttp``, ``sip``, ``sips``, ``snews``, ``svn``, ``svn+ssh``, ``telnet``,
``wais``.

.. versionadded:: 2.5
   Support for the ``sftp`` and ``sips`` schemes.

The :mod:`urlparse` module defines the following functions:


.. function:: urlparse(urlstring[, scheme[, allow_fragments]])

   Parse a URL into six components, returning a 6-tuple. This corresponds to the
   general structure of a URL: ``scheme://netloc/path;parameters?query#fragment``.
   Each tuple item is a string, possibly empty. The components are not broken up in
   smaller parts (for example, the network location is a single string), and %
   escapes are not expanded. The delimiters as shown above are not part of the
   result, except for a leading slash in the *path* component, which is retained if
   present. For example:

      >>> from urlparse import urlparse
      >>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
      >>> o   # doctest: +NORMALIZE_WHITESPACE
      ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> o.scheme
      'http'
      >>> o.port
      80
      >>> o.geturl()
      'http://www.cwi.nl:80/%7Eguido/Python.html'


   Following the syntax specifications in :rfc:`1808`, urlparse recognizes
   a netloc only if it is properly introduced by '//'. Otherwise the
   input is presumed to be a relative URL and thus to start with
   a path component.

      >>> from urlparse import urlparse
      >>> urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
      ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                 params='', query='', fragment='')
      >>> urlparse('www.cwi.nl/%7Eguido/Python.html')
      ParseResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html',
                 params='', query='', fragment='')
      >>> urlparse('help/Python.html')
      ParseResult(scheme='', netloc='', path='help/Python.html', params='',
                 query='', fragment='')

   If the *scheme* argument is specified, it gives the default addressing
   scheme, to be used only if the URL does not specify one. The default value for
   this argument is the empty string.

   If the *allow_fragments* argument is false, fragment identifiers are not
   recognized and are instead parsed as part of the preceding component, even if
   the URL's addressing scheme normally does support them. The default value for
   this argument is :const:`True`.

   The return value is actually an instance of a subclass of :class:`tuple`. This
   class has the following additional read-only convenience attributes:

   +------------------+-------+--------------------------+----------------------+
   | Attribute        | Index | Value                    | Value if not present |
   +==================+=======+==========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier     | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part    | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path        | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`params`   | 3     | Parameters for last path | empty string         |
   |                  |       | element                  |                      |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`query`    | 4     | Query component          | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`fragment` | 5     | Fragment identifier      | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`username` |       | User name                | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`password` |       | Password                 | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)   | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer,  | :const:`None`        |
   |                  |       | if present               |                      |
   +------------------+-------+--------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionchanged:: 2.5
      Added attributes to return value.

   .. versionchanged:: 2.7
      Added IPv6 URL parsing capabilities.

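The effect of the *scheme* and *allow_fragments* arguments of :func:`urlparse`
can be sketched as follows. This is only an illustration: the sample URLs and
variable names are made up, and the fallback import assumes the Python 3 name
:mod:`urllib.parse` mentioned in the note at the top of this page.

```python
try:
    from urlparse import urlparse  # Python 2
except ImportError:
    from urllib.parse import urlparse  # same function under its Python 3 name

# The default scheme is applied only when the URL itself does not carry one.
no_scheme = urlparse('//www.cwi.nl/%7Eguido/Python.html', scheme='http')
has_scheme = urlparse('ftp://www.cwi.nl/%7Eguido/Python.html', scheme='http')
print(no_scheme.scheme)    # http
print(has_scheme.scheme)   # ftp

# With allow_fragments false, '#top' stays attached to the preceding
# component (here the query) instead of becoming the fragment.
sample = 'http://www.cwi.nl/index.html?lang=en#top'
with_frag = urlparse(sample)
without_frag = urlparse(sample, allow_fragments=False)
print(with_frag.query)      # lang=en
print(with_frag.fragment)   # top
print(without_frag.query)   # lang=en#top
```

In the last call the ``fragment`` item of the result is the empty string.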

.. function:: parse_qs(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`). Data are returned as a
   dictionary. The dictionary keys are the unique query variable names and the
   values are lists of values for each name.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in percent-encoded queries should be treated as blank strings. A true value
   indicates that blanks should be retained as blank strings. The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors. If false (the default), errors are silently ignored. If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.urlencode` function to convert such dictionaries into
   query strings.

   .. versionadded:: 2.6
      Copied from the :mod:`cgi` module.

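A minimal sketch of how the two flags of :func:`parse_qs` change the result.
The query strings and variable names are invented for the example, and the
fallback import assumes the Python 3 module name :mod:`urllib.parse`.

```python
try:
    from urlparse import parse_qs  # Python 2
except ImportError:
    from urllib.parse import parse_qs  # Python 3

query = 'name=alice&tag=a&tag=b&empty='

# Repeated names are collected into one list; blank values are dropped
# by default.
default = parse_qs(query)

# keep_blank_values retains 'empty' with an empty-string value.
kept = parse_qs(query, keep_blank_values=True)

# strict_parsing turns a malformed field (no '=' sign) into a ValueError.
try:
    parse_qs('name=alice&broken', strict_parsing=True)
    strict_error = None
except ValueError as exc:
    strict_error = exc

print(sorted(default))   # ['name', 'tag']
print(sorted(kept))      # ['empty', 'name', 'tag']
print(default['tag'])    # ['a', 'b']
```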

.. function:: parse_qsl(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`). Data are returned as a list of
   name, value pairs.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in percent-encoded queries should be treated as blank strings. A true value
   indicates that blanks should be retained as blank strings. The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors. If false (the default), errors are silently ignored. If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.urlencode` function to convert such lists of pairs into
   query strings.

   .. versionadded:: 2.6
      Copied from the :mod:`cgi` module.

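Because :func:`parse_qsl` keeps the pairs in their original order, its output
can be passed back through ``urlencode`` to rebuild the query string. The sketch
below uses an invented query string; the second import branch assumes the
Python 3 location of both functions.

```python
try:
    from urlparse import parse_qsl   # Python 2
    from urllib import urlencode
except ImportError:
    from urllib.parse import parse_qsl, urlencode  # Python 3

# Each name/value pair is returned in order, including repeated names.
pairs = parse_qsl('tag=a&name=alice&tag=b')
print(pairs)     # [('tag', 'a'), ('name', 'alice'), ('tag', 'b')]

# urlencode() accepts a sequence of pairs and produces a query string.
rebuilt = urlencode(pairs)
print(rebuilt)   # tag=a&name=alice&tag=b
```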

.. function:: urlunparse(parts)

   Construct a URL from a tuple as returned by ``urlparse()``. The *parts* argument
   can be any six-item iterable. This may result in a slightly different, but
   equivalent URL, if the URL that was parsed originally had unnecessary delimiters
   (for example, a ? with an empty query; the RFC states that these are
   equivalent).

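As a small illustration of :func:`urlunparse`, the sketch below first assembles
a URL from a hand-written six-item tuple and then shows the "unnecessary
delimiter" case described above. The URLs are examples only; the fallback
import assumes the Python 3 module name.

```python
try:
    from urlparse import urlparse, urlunparse  # Python 2
except ImportError:
    from urllib.parse import urlparse, urlunparse  # Python 3

# Order: scheme, netloc, path, params, query, fragment.
parts = ('http', 'www.cwi.nl:80', '/%7Eguido/Python.html', '', 'lang=en', 'top')
url = urlunparse(parts)
print(url)          # http://www.cwi.nl:80/%7Eguido/Python.html?lang=en#top

# A trailing '?' with an empty query is not reproduced on reassembly.
round_trip = urlunparse(urlparse('http://www.cwi.nl/index.html?'))
print(round_trip)   # http://www.cwi.nl/index.html
```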

.. function:: urlsplit(urlstring[, scheme[, allow_fragments]])

   This is similar to :func:`urlparse`, but does not split the params from the URL.
   This should generally be used instead of :func:`urlparse` if the more recent URL
   syntax allowing parameters to be applied to each segment of the *path* portion
   of the URL (see :rfc:`2396`) is wanted. A separate function is needed to
   separate the path segments and parameters. This function returns a 5-tuple:
   (addressing scheme, network location, path, query, fragment identifier).

   The return value is actually an instance of a subclass of :class:`tuple`. This
   class has the following additional read-only convenience attributes:

   +------------------+-------+-------------------------+----------------------+
   | Attribute        | Index | Value                   | Value if not present |
   +==================+=======+=========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier    | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part   | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path       | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`query`    | 3     | Query component         | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`fragment` | 4     | Fragment identifier     | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`username` |       | User name               | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`password` |       | Password                | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)  | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer, | :const:`None`        |
   |                  |       | if present              |                      |
   +------------------+-------+-------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionadded:: 2.2

   .. versionchanged:: 2.5
      Added attributes to return value.

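The difference between :func:`urlsplit` and :func:`urlparse`, together with
the convenience attributes from the tables above, can be seen on a URL that
carries per-segment parameters. The URL below is invented for illustration and
the fallback import assumes the Python 3 module name.

```python
try:
    from urlparse import urlparse, urlsplit  # Python 2
except ImportError:
    from urllib.parse import urlparse, urlsplit  # Python 3

url = 'http://user:secret@WWW.Example.COM:8080/docs;type=a/index.html;v=2?x=1'

parsed = urlparse(url)   # 6-tuple: params of the last segment are split off
split = urlsplit(url)    # 5-tuple: the path is left untouched

print(parsed.path)       # /docs;type=a/index.html
print(parsed.params)     # v=2
print(split.path)        # /docs;type=a/index.html;v=2

# The netloc is one string; the extra attributes pick it apart.
print(split.netloc)      # user:secret@WWW.Example.COM:8080
print(split.username)    # user
print(split.password)    # secret
print(split.hostname)    # www.example.com
print(split.port)        # 8080
```

Only the parameters of the last path segment are separated by
:func:`urlparse`; the ``;type=a`` on the first segment stays in the path in
both results.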

.. function:: urlunsplit(parts)

   Combine the elements of a tuple as returned by :func:`urlsplit` into a complete
   URL as a string. The *parts* argument can be any five-item iterable. This may
   result in a slightly different, but equivalent URL, if the URL that was parsed
   originally had unnecessary delimiters (for example, a ? with an empty query; the
   RFC states that these are equivalent).

   .. versionadded:: 2.2


.. function:: urljoin(base, url[, allow_fragments])

   Construct a full ("absolute") URL by combining a "base URL" (*base*) with
   another URL (*url*). Informally, this uses components of the base URL, in
   particular the addressing scheme, the network location and (part of) the path,
   to provide missing components in the relative URL. For example:

      >>> from urlparse import urljoin
      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
      'http://www.cwi.nl/%7Eguido/FAQ.html'

   The *allow_fragments* argument has the same meaning and default as for
   :func:`urlparse`.

   .. note::

      If *url* is an absolute URL (that is, starting with ``//`` or ``scheme://``),
      the *url*'s host name and/or scheme will be present in the result. For example:

   .. doctest::

      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
      ...         '//www.python.org/%7Eguido')
      'http://www.python.org/%7Eguido'

   If you do not want that behavior, preprocess the *url* with :func:`urlsplit` and
   :func:`urlunsplit`, removing possible *scheme* and *netloc* parts.

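One possible way to do the preprocessing suggested for :func:`urljoin` is
sketched below. The helper name ``join_ignoring_origin`` is hypothetical and
not part of the module; it simply blanks the scheme and network location
before joining. The fallback import assumes the Python 3 module name.

```python
try:
    from urlparse import urljoin, urlsplit, urlunsplit  # Python 2
except ImportError:
    from urllib.parse import urljoin, urlsplit, urlunsplit  # Python 3


def join_ignoring_origin(base, url):
    """Join *url* onto *base*, discarding any scheme or netloc in *url*.

    Hypothetical helper written for this example only.
    """
    parts = urlsplit(url)
    # Rebuild the URL with empty scheme and netloc; path, query and
    # fragment are kept as they were.
    local = urlunsplit(('', '', parts.path, parts.query, parts.fragment))
    return urljoin(base, local)


base = 'http://www.cwi.nl/%7Eguido/Python.html'
plain = urljoin(base, '//www.python.org/%7Eguido')
stripped = join_ignoring_origin(base, '//www.python.org/%7Eguido')
print(plain)      # http://www.python.org/%7Eguido
print(stripped)   # http://www.cwi.nl/%7Eguido
```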

.. function:: urldefrag(url)

   If *url* contains a fragment identifier, returns a modified version of *url*
   with no fragment identifier, and the fragment identifier as a separate string.
   If there is no fragment identifier in *url*, returns *url* unmodified and an
   empty string.

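Both cases described for :func:`urldefrag` are shown in this short sketch. The
URLs are examples only and the fallback import assumes the Python 3 module
name.

```python
try:
    from urlparse import urldefrag  # Python 2
except ImportError:
    from urllib.parse import urldefrag  # Python 3

# The result unpacks into the URL without its fragment and the fragment.
url, fragment = urldefrag('http://www.python.org/doc/#intro')
print(url)        # http://www.python.org/doc/
print(fragment)   # intro

# Without a fragment, the URL comes back unchanged with an empty string.
same_url, no_fragment = urldefrag('http://www.python.org/doc/')
print(same_url)   # http://www.python.org/doc/
```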

.. seealso::

   :rfc:`3986` - Uniform Resource Identifiers
      This is the current standard (STD66). Any changes to the urlparse module
      should conform to this. Certain deviations could be observed, which are
      mostly for backward compatibility purposes and for certain de-facto
      parsing requirements as commonly observed in major browsers.

   :rfc:`2732` - Format for Literal IPv6 Addresses in URL's.
      This specifies the parsing requirements of IPv6 URLs.

   :rfc:`2396` - Uniform Resource Identifiers (URI): Generic Syntax
      Document describing the generic syntactic requirements for both Uniform Resource
      Names (URNs) and Uniform Resource Locators (URLs).

   :rfc:`2368` - The mailto URL scheme.
      Parsing requirements for mailto URL schemes.

   :rfc:`1808` - Relative Uniform Resource Locators
      This Request For Comments includes the rules for joining an absolute and a
      relative URL, including a fair number of "Abnormal Examples" which govern the
      treatment of border cases.

   :rfc:`1738` - Uniform Resource Locators (URL)
      This specifies the formal syntax and semantics of absolute URLs.


.. _urlparse-result-object:

Results of :func:`urlparse` and :func:`urlsplit`
------------------------------------------------

The result objects from the :func:`urlparse` and :func:`urlsplit` functions are
subclasses of the :class:`tuple` type. These subclasses add the attributes
described in those functions, as well as provide an additional method:


.. method:: ParseResult.geturl()

   Return the re-combined version of the original URL as a string. This may differ
   from the original URL in that the scheme will always be normalized to lower case
   and empty components may be dropped. Specifically, empty parameters, queries,
   and fragment identifiers will be removed.

   The result of this method is a fixpoint if passed back through the original
   parsing function:

      >>> import urlparse
      >>> url = 'HTTP://www.Python.org/doc/#'

      >>> r1 = urlparse.urlsplit(url)
      >>> r1.geturl()
      'http://www.Python.org/doc/'

      >>> r2 = urlparse.urlsplit(r1.geturl())
      >>> r2.geturl()
      'http://www.Python.org/doc/'

   .. versionadded:: 2.5

The following classes provide the implementations of the parse results:


.. class:: BaseResult

   Base class for the concrete result classes. This provides most of the attribute
   definitions. It does not provide a :meth:`geturl` method. It is derived from
   :class:`tuple`, but does not override the :meth:`__init__` or :meth:`__new__`
   methods.


.. class:: ParseResult(scheme, netloc, path, params, query, fragment)

   Concrete class for :func:`urlparse` results. The :meth:`__new__` method is
   overridden to support checking that the right number of arguments are passed.


.. class:: SplitResult(scheme, netloc, path, query, fragment)

   Concrete class for :func:`urlsplit` results. The :meth:`__new__` method is
   overridden to support checking that the right number of arguments are passed.

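The concrete result classes can also be instantiated directly, which gives a
quick way to see both the :meth:`geturl` method and the argument-count check.
The component values below are invented for the example, and the fallback
import assumes that the same class names are available from
:mod:`urllib.parse` in Python 3.

```python
try:
    from urlparse import ParseResult, SplitResult  # Python 2
except ImportError:
    from urllib.parse import ParseResult, SplitResult  # Python 3

# Five components for SplitResult, six for ParseResult.
split = SplitResult('http', 'www.python.org', '/doc/', 'q=1', '')
parsed = ParseResult('http', 'www.python.org', '/doc/', '', 'q=1', 'top')

print(split.geturl())    # http://www.python.org/doc/?q=1
print(parsed.geturl())   # http://www.python.org/doc/?q=1#top
print(split.hostname)    # www.python.org

# Passing the wrong number of components is rejected.
try:
    SplitResult('http', 'www.python.org')
    arity_error = None
except TypeError as exc:
    arity_error = exc
```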