:mod:`urllib.parse` --- Parse URLs into components
==================================================

.. module:: urllib.parse
   :synopsis: Parse URLs into or assemble them from components.


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   pair: URL; parsing
   pair: relative; URL

This module defines a standard interface to break Uniform Resource Locator (URL)
strings up into components (addressing scheme, network location, path, etc.), to
combine the components back into a URL string, and to convert a "relative URL"
to an absolute URL given a "base URL."

The module has been designed to match the Internet RFC on Relative Uniform
Resource Locators (and discovered a bug in an earlier draft!). It supports the
following URL schemes: ``file``, ``ftp``, ``gopher``, ``hdl``, ``http``,
``https``, ``imap``, ``mailto``, ``mms``, ``news``, ``nntp``, ``prospero``,
``rsync``, ``rtsp``, ``rtspu``, ``sftp``, ``shttp``, ``sip``, ``sips``,
``snews``, ``svn``, ``svn+ssh``, ``telnet``, ``wais``.

The :mod:`urllib.parse` module defines the following functions:

.. function:: urlparse(urlstring, default_scheme='', allow_fragments=True)

   Parse a URL into six components, returning a 6-tuple. This corresponds to the
   general structure of a URL: ``scheme://netloc/path;parameters?query#fragment``.
   Each tuple item is a string, possibly empty. The components are not broken up
   into smaller parts (for example, the network location is a single string), and
   % escapes are not expanded. The delimiters as shown above are not part of the
   result, except for a leading slash in the *path* component, which is retained if
   present. For example:

      >>> from urllib.parse import urlparse
      >>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
      >>> o   # doctest: +NORMALIZE_WHITESPACE
      ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> o.scheme
      'http'
      >>> o.port
      80
      >>> o.geturl()
      'http://www.cwi.nl:80/%7Eguido/Python.html'

   If the *default_scheme* argument is specified, it gives the default addressing
   scheme, to be used only if the URL does not specify one. The default value for
   this argument is the empty string.

   If the *allow_fragments* argument is false, fragment identifiers are not
   allowed, even if the URL's addressing scheme normally does support them. The
   default value for this argument is :const:`True`.
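
   The effect of the two optional arguments can be sketched as follows. The
   sample URLs are made up for illustration, and the default scheme is passed
   positionally so that the call does not depend on the parameter's name:

   .. code-block:: python

      from urllib.parse import urlparse

      # The URL has no scheme of its own, so the supplied default is used.
      with_default = urlparse('//www.cwi.nl:80/%7Eguido/Python.html', 'ftp')
      print(with_default.scheme)      # ftp

      # A scheme present in the URL takes precedence over the default.
      explicit = urlparse('http://www.cwi.nl/', 'ftp')
      print(explicit.scheme)          # http

      # With allow_fragments false, '#' is not treated as a delimiter, so the
      # fragment attribute stays empty and the text remains in the path.
      no_frag = urlparse('http://www.cwi.nl/doc#intro', allow_fragments=False)
      print(no_frag.fragment == '')   # True
      print(no_frag.path)             # /doc#intro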

   The return value is actually an instance of a subclass of :class:`tuple`. This
   class has the following additional read-only convenience attributes:

   +------------------+-------+--------------------------+----------------------+
   | Attribute        | Index | Value                    | Value if not present |
   +==================+=======+==========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier     | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part    | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path        | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`params`   | 3     | Parameters for last path | empty string         |
   |                  |       | element                  |                      |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`query`    | 4     | Query component          | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`fragment` | 5     | Fragment identifier      | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`username` |       | User name                | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`password` |       | Password                 | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)   | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer,  | :const:`None`        |
   |                  |       | if present               |                      |
   +------------------+-------+--------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionchanged:: 3.2
      Added IPv6 URL parsing capabilities.


.. function:: parse_qs(qs, keep_blank_values=False, strict_parsing=False)

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`). Data are returned as a
   dictionary. The dictionary keys are the unique query variable names and the
   values are lists of values for each name.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in URL encoded queries should be treated as blank strings. A true value
   indicates that blanks should be retained as blank strings. The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors. If false (the default), errors are silently ignored. If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.parse.urlencode` function to convert such
   dictionaries into query strings.
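
   A small illustration of the two flags, using an invented query string:

   .. code-block:: python

      from urllib.parse import parse_qs

      query = 'name=Guido&lang=py&lang=c&empty='

      # Repeated names are collected into one list; the blank value is dropped.
      dropped = parse_qs(query)
      print(dropped)   # {'name': ['Guido'], 'lang': ['py', 'c']}

      # Blank values are kept when keep_blank_values is true.
      kept = parse_qs(query, keep_blank_values=True)
      print(kept['empty'])   # ['']

      # A field without '=' is skipped by default, but raises ValueError
      # when strict_parsing is true.
      lenient = parse_qs('a=1&broken')
      print(lenient)   # {'a': ['1']}
      try:
          parse_qs('a=1&broken', strict_parsing=True)
          strict_failed = False
      except ValueError as exc:
          strict_failed = True
          print('rejected:', exc)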


.. function:: parse_qsl(qs, keep_blank_values=False, strict_parsing=False)

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`). Data are returned as a list of
   name, value pairs.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in URL encoded queries should be treated as blank strings. A true value
   indicates that blanks should be retained as blank strings. The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors. If false (the default), errors are silently ignored. If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.parse.urlencode` function to convert such lists of pairs into
   query strings.
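
   Unlike the dictionary form, the list of pairs keeps the original order and
   repeated names, so a round trip through :func:`urlencode` can reproduce the
   input. A minimal sketch with an invented query string:

   .. code-block:: python

      from urllib.parse import parse_qsl, urlencode

      pairs = parse_qsl('lang=py&name=Guido&lang=c')
      print(pairs)   # [('lang', 'py'), ('name', 'Guido'), ('lang', 'c')]

      # Order and repeated names survive the round trip.
      rebuilt = urlencode(pairs)
      print(rebuilt)   # lang=py&name=Guido&lang=c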


.. function:: urlunparse(parts)

   Construct a URL from a tuple as returned by ``urlparse()``. The *parts*
   argument can be any six-item iterable. This may result in a slightly
   different, but equivalent URL, if the URL that was parsed originally had
   unnecessary delimiters (for example, a ``?`` with an empty query; the RFC
   states that these are equivalent).
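
   As a sketch of both points, the following builds a URL from a plain tuple
   and then shows a redundant ``?`` disappearing after a parse and reassembly.
   The URLs are illustrative only:

   .. code-block:: python

      from urllib.parse import urlparse, urlunparse

      # Six items: scheme, netloc, path, params, query, fragment.
      parts = ('http', 'www.cwi.nl:80', '/%7Eguido/Python.html', '', '', '')
      built = urlunparse(parts)
      print(built)   # http://www.cwi.nl:80/%7Eguido/Python.html

      # The trailing '?' introduces an empty query, which is not emitted again.
      original = 'http://www.cwi.nl/%7Eguido/Python.html?'
      rebuilt = urlunparse(urlparse(original))
      print(rebuilt)   # http://www.cwi.nl/%7Eguido/Python.html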


.. function:: urlsplit(urlstring, default_scheme='', allow_fragments=True)

   This is similar to :func:`urlparse`, but does not split the params from the URL.
   This should generally be used instead of :func:`urlparse` if the more recent URL
   syntax allowing parameters to be applied to each segment of the *path* portion
   of the URL (see :rfc:`2396`) is wanted. A separate function is needed to
   separate the path segments and parameters. This function returns a 5-tuple:
   (addressing scheme, network location, path, query, fragment identifier).

   The return value is actually an instance of a subclass of :class:`tuple`. This
   class has the following additional read-only convenience attributes:

   +------------------+-------+-------------------------+----------------------+
   | Attribute        | Index | Value                   | Value if not present |
   +==================+=======+=========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier    | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part   | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path       | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`query`    | 3     | Query component         | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`fragment` | 4     | Fragment identifier     | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`username` |       | User name               | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`password` |       | Password                | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)  | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer, | :const:`None`        |
   |                  |       | if present              |                      |
   +------------------+-------+-------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.
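
   The difference from :func:`urlparse`, and the convenience attributes listed
   in the table, can be seen on a URL that carries parameters on more than one
   path segment. The URL and credentials below are invented for illustration:

   .. code-block:: python

      from urllib.parse import urlparse, urlsplit

      url = 'http://guido:secret@WWW.CWI.NL:8080/a;x=1/b;y=2?q=1#top'

      # urlsplit() leaves all parameters inside the path.
      split = urlsplit(url)
      print(len(split))       # 5
      print(split.path)       # /a;x=1/b;y=2

      # urlparse() splits off the parameters of the last segment only.
      parsed = urlparse(url)
      print(parsed.path)      # /a;x=1/b
      print(parsed.params)    # y=2

      # Attributes derived from the network location.
      print(split.username)   # guido
      print(split.password)   # secret
      print(split.hostname)   # www.cwi.nl
      print(split.port)       # 8080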


.. function:: urlunsplit(parts)

   Combine the elements of a tuple as returned by :func:`urlsplit` into a
   complete URL as a string. The *parts* argument can be any five-item
   iterable. This may result in a slightly different, but equivalent URL, if the
   URL that was parsed originally had unnecessary delimiters (for example, a
   ``?`` with an empty query; the RFC states that these are equivalent).
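
   One common use is to change a single component and reassemble the URL. The
   sketch below passes a list once and a tuple once, since any five-item
   iterable is accepted. The URLs are illustrative only:

   .. code-block:: python

      from urllib.parse import urlsplit, urlunsplit

      # Any five-item iterable works; here a list.
      built = urlunsplit(['http', 'www.cwi.nl', '/doc/index.html', 'v=1', 'top'])
      print(built)   # http://www.cwi.nl/doc/index.html?v=1#top

      # Split, swap the query, and put the pieces back together.
      scheme, netloc, path, query, fragment = urlsplit(built)
      changed = urlunsplit((scheme, netloc, path, 'v=2', fragment))
      print(changed)   # http://www.cwi.nl/doc/index.html?v=2#top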


.. function:: urljoin(base, url, allow_fragments=True)

   Construct a full ("absolute") URL by combining a "base URL" (*base*) with
   another URL (*url*). Informally, this uses components of the base URL, in
   particular the addressing scheme, the network location and (part of) the
   path, to provide missing components in the relative URL. For example:

      >>> from urllib.parse import urljoin
      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
      'http://www.cwi.nl/%7Eguido/FAQ.html'

   The *allow_fragments* argument has the same meaning and default as for
   :func:`urlparse`.

   .. note::

      If *url* is an absolute URL (that is, starting with ``//`` or ``scheme://``),
      the *url*'s host name and/or scheme will be present in the result. For example:

   .. doctest::

      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
      ...         '//www.python.org/%7Eguido')
      'http://www.python.org/%7Eguido'

   If you do not want that behavior, preprocess the *url* with :func:`urlsplit` and
   :func:`urlunsplit`, removing possible *scheme* and *netloc* parts.
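
   One way to do that preprocessing is sketched below. The helper name
   ``join_same_host`` is hypothetical and not part of this module:

   .. code-block:: python

      from urllib.parse import urljoin, urlsplit, urlunsplit

      def join_same_host(base, url):
          """Join url onto base, ignoring any scheme or netloc in url."""
          parts = urlsplit(url)
          # Rebuild the reference with empty scheme and netloc components.
          relative = urlunsplit(('', '', parts.path, parts.query, parts.fragment))
          return urljoin(base, relative)

      base = 'http://www.cwi.nl/%7Eguido/Python.html'

      pinned = join_same_host(base, '//www.python.org/%7Eguido')
      print(pinned)   # http://www.cwi.nl/%7Eguido

      # Ordinary relative references behave as with urljoin() itself.
      plain = join_same_host(base, 'FAQ.html')
      print(plain)    # http://www.cwi.nl/%7Eguido/FAQ.html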


.. function:: urldefrag(url)

   If *url* contains a fragment identifier, return a modified version of *url*
   with no fragment identifier, and the fragment identifier as a separate
   string. If there is no fragment identifier in *url*, return *url* unmodified
   and an empty string.
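
   Both cases can be sketched with tuple unpacking. The URLs are made up for
   illustration:

   .. code-block:: python

      from urllib.parse import urldefrag

      # A fragment is split off and returned separately.
      url, fragment = urldefrag('http://www.cwi.nl/doc/index.html#intro')
      print(url)        # http://www.cwi.nl/doc/index.html
      print(fragment)   # intro

      # Without a fragment, the URL comes back unchanged with an empty string.
      same_url, no_fragment = urldefrag('http://www.cwi.nl/doc/index.html')
      print(no_fragment == '')   # True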


.. function:: quote(string, safe='/', encoding=None, errors=None)

   Replace special characters in *string* using the ``%xx`` escape. Letters,
   digits, and the characters ``'_.-'`` are never quoted. By default, this
   function is intended for quoting the path section of a URL. The optional
   *safe* parameter specifies additional ASCII characters that should not be
   quoted; its default value is ``'/'``.

   *string* may be either a :class:`str` or a :class:`bytes`.

   The optional *encoding* and *errors* parameters specify how to deal with
   non-ASCII characters, as accepted by the :meth:`str.encode` method.
   *encoding* defaults to ``'utf-8'``.
   *errors* defaults to ``'strict'``, meaning unsupported characters raise a
   :class:`UnicodeEncodeError`.
   *encoding* and *errors* must not be supplied if *string* is a
   :class:`bytes`, or a :class:`TypeError` is raised.

   Note that ``quote(string, safe, encoding, errors)`` is equivalent to
   ``quote_from_bytes(string.encode(encoding, errors), safe)``.

   Example: ``quote('/El Niño/')`` yields ``'/El%20Ni%C3%B1o/'``.
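
   The following sketch exercises the *safe* and *encoding* parameters and the
   equivalence with :func:`quote_from_bytes` stated above. The input strings
   are arbitrary samples:

   .. code-block:: python

      from urllib.parse import quote, quote_from_bytes

      # safe controls which extra ASCII characters are left alone.
      default_safe = quote('a b&c/d')
      print(default_safe)   # a%20b%26c/d
      nothing_safe = quote('a b&c/d', safe='')
      print(nothing_safe)   # a%20b%26c%2Fd
      more_safe = quote('a b&c/d', safe='/&')
      print(more_safe)      # a%20b&c/d

      # encoding decides which bytes a non-ASCII character turns into.
      utf8 = quote('Niño')
      print(utf8)     # Ni%C3%B1o
      latin1 = quote('Niño', encoding='latin-1')
      print(latin1)   # Ni%F1o

      # The documented equivalence, spelled out.
      text = '/El Niño/'
      same = (quote(text, '/', 'utf-8', 'strict')
              == quote_from_bytes(text.encode('utf-8', 'strict'), '/'))
      print(same)   # True

      # Supplying encoding together with bytes input is an error.
      try:
          quote(b'abc', encoding='utf-8')
          bytes_rejected = False
      except TypeError:
          bytes_rejected = True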


.. function:: quote_plus(string, safe='', encoding=None, errors=None)

   Like :func:`quote`, but also replace spaces by plus signs, as required for
   quoting HTML form values when building up a query string to go into a URL.
   Plus signs in the original string are escaped unless they are included in
   *safe*. It also does not have *safe* default to ``'/'``.

   Example: ``quote_plus('/El Niño/')`` yields ``'%2FEl+Ni%C3%B1o%2F'``.
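
   A side-by-side sketch of the differences from :func:`quote`, using an
   arbitrary sample value:

   .. code-block:: python

      from urllib.parse import quote, quote_plus

      value = 'a+b c/d'

      # quote() escapes the space and keeps '/', its default safe character.
      as_path = quote(value)
      print(as_path)    # a%2Bb%20c/d

      # quote_plus() turns the space into '+' and escapes the literal '+'
      # and the '/'.
      as_field = quote_plus(value)
      print(as_field)   # a%2Bb+c%2Fd

      # Listing '+' in safe leaves literal plus signs alone, which makes them
      # indistinguishable from encoded spaces when the value is decoded.
      ambiguous = quote_plus(value, safe='+/')
      print(ambiguous)  # a+b+c/d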


.. function:: quote_from_bytes(bytes, safe='/')

   Like :func:`quote`, but accepts a :class:`bytes` object rather than a
   :class:`str`, and does not perform string-to-bytes encoding.

   Example: ``quote_from_bytes(b'a&\xef')`` yields
   ``'a%26%EF'``.


.. function:: unquote(string, encoding='utf-8', errors='replace')

   Replace ``%xx`` escapes by their single-character equivalent.
   The optional *encoding* and *errors* parameters specify how to decode
   percent-encoded sequences into Unicode characters, as accepted by the
   :meth:`bytes.decode` method.

   *string* must be a :class:`str`.

   *encoding* defaults to ``'utf-8'``.
   *errors* defaults to ``'replace'``, meaning invalid sequences are replaced
   by a placeholder character.

   Example: ``unquote('/El%20Ni%C3%B1o/')`` yields ``'/El Niño/'``.
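
   The interaction of *encoding* and *errors* can be sketched with an escape
   that is valid Latin-1 but not valid UTF-8. The input is an arbitrary sample:

   .. code-block:: python

      from urllib.parse import unquote

      # %F1 is 'ñ' in Latin-1 but an incomplete sequence in UTF-8.
      as_latin1 = unquote('Ni%F1o', encoding='latin-1')
      print(as_latin1)   # Niño

      # With the defaults, the invalid byte becomes U+FFFD, the replacement
      # character.
      replaced = unquote('Ni%F1o')
      print(replaced == 'Ni\ufffdo')   # True

      # errors='strict' raises instead of substituting a placeholder.
      try:
          unquote('Ni%F1o', errors='strict')
          strict_failed = False
      except UnicodeDecodeError as exc:
          strict_failed = True
          print('rejected:', exc.reason)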


.. function:: unquote_plus(string, encoding='utf-8', errors='replace')

   Like :func:`unquote`, but also replace plus signs by spaces, as required for
   unquoting HTML form values.

   *string* must be a :class:`str`.

   Example: ``unquote_plus('/El+Ni%C3%B1o/')`` yields ``'/El Niño/'``.


.. function:: unquote_to_bytes(string)

   Replace ``%xx`` escapes by their single-octet equivalent, and return a
   :class:`bytes` object.

   *string* may be either a :class:`str` or a :class:`bytes`.

   If it is a :class:`str`, unescaped non-ASCII characters in *string*
   are encoded into UTF-8 bytes.

   Example: ``unquote_to_bytes('a%26%EF')`` yields
   ``b'a&\xef'``.
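
   A brief sketch of the three input cases described above, with arbitrary
   sample values:

   .. code-block:: python

      from urllib.parse import unquote_to_bytes

      # Escapes in a str are turned into the corresponding octets.
      from_str = unquote_to_bytes('Ni%C3%B1o')
      print(from_str)       # b'Ni\xc3\xb1o'

      # bytes input is accepted as well.
      from_bytes = unquote_to_bytes(b'a%26b')
      print(from_bytes)     # b'a&b'

      # An unescaped non-ASCII character in a str is encoded as UTF-8.
      from_unicode = unquote_to_bytes('Niño')
      print(from_unicode)   # b'Ni\xc3\xb1o'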


.. function:: urlencode(query, doseq=False)

   Convert a mapping object or a sequence of two-element tuples to a "url-encoded"
   string, suitable to pass to :func:`urllib.request.urlopen` as the optional
   *data* argument. This is useful to pass a dictionary of form fields to a
   ``POST`` request. The resulting string is a series of ``key=value`` pairs
   separated by ``'&'`` characters, where both *key* and *value* are quoted using
   :func:`quote_plus` above. If the optional parameter *doseq* is present and
   evaluates to true, individual ``key=value`` pairs are generated for each element
   of the sequence. When a sequence of two-element tuples is used as the *query*
   argument, the first element of each tuple is a key and the second is a value.
   The order of parameters in the encoded string will match the order of parameter
   tuples in the sequence. This module provides the functions
   :func:`parse_qs` and :func:`parse_qsl` which are used to parse query strings
   into Python data structures.
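
   A short sketch of the quoting and of the *doseq* flag. The field names and
   values are invented for illustration:

   .. code-block:: python

      from urllib.parse import parse_qs, urlencode

      # Keys and values are quoted with quote_plus(); tuple order is kept.
      fields = [('name', 'El Niño'), ('q', 'a&b=c')]
      encoded = urlencode(fields)
      print(encoded)   # name=El+Ni%C3%B1o&q=a%26b%3Dc

      # With doseq true, each element of a sequence value gets its own pair.
      tags = {'tag': ['a', 'b']}
      expanded = urlencode(tags, doseq=True)
      print(expanded)  # tag=a&tag=b

      # Without doseq, the whole list is quoted as one value instead.
      print(urlencode(tags) == expanded)   # False

      # parse_qs() reverses the doseq form.
      print(parse_qs(expanded) == tags)    # True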


.. seealso::

   :rfc:`3986` - Uniform Resource Identifiers
      This is the current standard (STD66). Any changes to the
      :mod:`urllib.parse` module should conform to this. Certain deviations
      can be observed, mostly for backward compatibility and to meet de facto
      parsing requirements commonly observed in major browsers.

   :rfc:`2732` - Format for Literal IPv6 Addresses in URL's.
      This specifies the parsing requirements of IPv6 URLs.

   :rfc:`2396` - Uniform Resource Identifiers (URI): Generic Syntax
      Document describing the generic syntactic requirements for both Uniform Resource
      Names (URNs) and Uniform Resource Locators (URLs).

   :rfc:`2368` - The mailto URL scheme.
      Parsing requirements for mailto URL schemes.

   :rfc:`1808` - Relative Uniform Resource Locators
      This Request For Comments includes the rules for joining an absolute and a
      relative URL, including a fair number of "Abnormal Examples" which govern the
      treatment of border cases.

   :rfc:`1738` - Uniform Resource Locators (URL)
      This specifies the formal syntax and semantics of absolute URLs.


.. _urlparse-result-object:

Results of :func:`urlparse` and :func:`urlsplit`
------------------------------------------------

The result objects from the :func:`urlparse` and :func:`urlsplit` functions are
subclasses of the :class:`tuple` type. These subclasses add the attributes
described in those functions, as well as provide an additional method:

.. method:: ParseResult.geturl()

   Return the re-combined version of the original URL as a string. This may differ
   from the original URL in that the scheme will always be normalized to lower case
   and empty components may be dropped. Specifically, empty parameters, queries,
   and fragment identifiers will be removed.

   The result of this method is a fixpoint if passed back through the original
   parsing function:

      >>> import urllib.parse
      >>> url = 'HTTP://www.Python.org/doc/#'

      >>> r1 = urllib.parse.urlsplit(url)
      >>> r1.geturl()
      'http://www.Python.org/doc/'

      >>> r2 = urllib.parse.urlsplit(r1.geturl())
      >>> r2.geturl()
      'http://www.Python.org/doc/'


The following classes provide the implementations of the parse results:

.. class:: BaseResult

   Base class for the concrete result classes. This provides most of the
   attribute definitions. It does not provide a :meth:`geturl` method. It is
   derived from :class:`tuple`, but does not override the :meth:`__init__` or
   :meth:`__new__` methods.


.. class:: ParseResult(scheme, netloc, path, params, query, fragment)

   Concrete class for :func:`urlparse` results. The :meth:`__new__` method is
   overridden to support checking that the right number of arguments is passed.


.. class:: SplitResult(scheme, netloc, path, query, fragment)

   Concrete class for :func:`urlsplit` results. The :meth:`__new__` method is
   overridden to support checking that the right number of arguments is passed.
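
The concrete result classes can also be instantiated directly. The following
is a minimal sketch with illustrative component values. It shows that a result
behaves like a plain tuple and that passing the wrong number of arguments is
rejected:

.. code-block:: python

   from urllib.parse import ParseResult, SplitResult, urlparse

   # Build a result by hand and turn it back into a URL.
   result = ParseResult('http', 'www.cwi.nl:80', '/%7Eguido/Python.html',
                        '', '', '')
   rebuilt = result.geturl()
   print(rebuilt)   # http://www.cwi.nl:80/%7Eguido/Python.html

   # It compares equal to what urlparse() returns for the same URL.
   print(result == urlparse(rebuilt))   # True

   # Tuple indexing and attribute access refer to the same items.
   print(result[1] == result.netloc)    # True

   # SplitResult takes five components and has no params item.
   split = SplitResult('http', 'www.cwi.nl', '/doc/', '', '')
   print(len(split))   # 5

   # Too few arguments raise TypeError.
   try:
       ParseResult('http', 'www.cwi.nl')
       count_checked = False
   except TypeError as exc:
       count_checked = True
       print('rejected:', exc)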