:mod:`urllib.parse` --- Parse URLs into components
==================================================

.. module:: urllib.parse
   :synopsis: Parse URLs into or assemble them from components.


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   pair: URL; parsing
   pair: relative; URL

This module defines a standard interface to break Uniform Resource Locator (URL)
strings up into components (addressing scheme, network location, path etc.), to
combine the components back into a URL string, and to convert a "relative URL"
to an absolute URL given a "base URL."

The module has been designed to match the Internet RFC on Relative Uniform
Resource Locators (and discovered a bug in an earlier draft!). It supports the
following URL schemes: ``file``, ``ftp``, ``gopher``, ``hdl``, ``http``,
``https``, ``imap``, ``mailto``, ``mms``, ``news``, ``nntp``, ``prospero``,
``rsync``, ``rtsp``, ``rtspu``, ``sftp``, ``shttp``, ``sip``, ``sips``,
``snews``, ``svn``, ``svn+ssh``, ``telnet``, ``wais``.

The :mod:`urllib.parse` module defines the following functions:

.. function:: urlparse(urlstring, scheme='', allow_fragments=True)

   Parse a URL into six components, returning a 6-tuple. This corresponds to the
   general structure of a URL: ``scheme://netloc/path;parameters?query#fragment``.
   Each tuple item is a string, possibly empty. The components are not broken up
   into smaller parts (for example, the network location is a single string), and
   ``%`` escapes are not expanded. The delimiters as shown above are not part of
   the result, except for a leading slash in the *path* component, which is
   retained if present. For example:

      >>> from urllib.parse import urlparse
      >>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
      >>> o   # doctest: +NORMALIZE_WHITESPACE
      ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> o.scheme
      'http'
      >>> o.port
      80
      >>> o.geturl()
      'http://www.cwi.nl:80/%7Eguido/Python.html'

   If the *scheme* argument is specified, it gives the default addressing
   scheme, to be used only if the URL does not specify one. The default value for
   this argument is the empty string.

   If the *allow_fragments* argument is false, fragment identifiers are not
   recognized, even if the URL's addressing scheme normally supports them; the
   ``#`` and the text after it stay part of the preceding component. The
   default value for this argument is :const:`True`.

   The return value is actually an instance of a subclass of :class:`tuple`. This
   class has the following additional read-only convenience attributes:

   +------------------+-------+--------------------------+----------------------+
   | Attribute        | Index | Value                    | Value if not present |
   +==================+=======+==========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier     | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part    | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path        | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`params`   | 3     | Parameters for last path | empty string         |
   |                  |       | element                  |                      |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`query`    | 4     | Query component          | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`fragment` | 5     | Fragment identifier      | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`username` |       | User name                | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`password` |       | Password                 | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)   | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer,  | :const:`None`        |
   |                  |       | if present               |                      |
   +------------------+-------+--------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionchanged:: 3.2
      Added IPv6 URL parsing capabilities.


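As a small illustration of the *scheme* and *allow_fragments* arguments of
``urlparse()`` and of the convenience attributes it returns, the following
sketch parses a few URLs. The host ``www.Example.COM`` and the user name are
placeholders chosen for this example only.

```python
from urllib.parse import urlparse

# The URL carries no scheme of its own, so the *scheme* argument supplies one.
with_default = urlparse('//www.cwi.nl:80/%7Eguido/Python.html', scheme='http')
print(with_default.scheme)   # http
print(with_default.netloc)   # www.cwi.nl:80

# With allow_fragments=False the '#' is not treated as a delimiter.
no_fragment = urlparse('http://www.cwi.nl/doc#intro', allow_fragments=False)
print(no_fragment.path)      # /doc#intro
print(repr(no_fragment.fragment))  # ''

# Convenience attributes derived from the network location
# (placeholder user and host, assumed for illustration).
parts = urlparse('http://guido@www.Example.COM:8080/index.html')
print(parts.username)        # guido
print(parts.hostname)        # www.example.com
print(parts.port)            # 8080

# Components that are absent from the URL are reported as None.
bare = urlparse('http://www.cwi.nl/')
print(bare.port)             # None
```

Note that ``netloc`` keeps the original spelling of the host, while
``hostname`` is the lower-cased form.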
.. function:: parse_qs(qs, keep_blank_values=False, strict_parsing=False)

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`). Data are returned as a
   dictionary. The dictionary keys are the unique query variable names and the
   values are lists of values for each name.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in URL encoded queries should be treated as blank strings. A true value
   indicates that blanks should be retained as blank strings. The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors. If false (the default), errors are silently ignored. If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.parse.urlencode` function to convert such
   dictionaries into query strings.


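The effect of the two flags of ``parse_qs()`` can be sketched as follows. The
query string is made up for this example.

```python
from urllib.parse import parse_qs

query = 'name=Guido&lang=python&lang=c&empty='

# Repeated names are collected into one list; blank values are dropped by default.
default_result = parse_qs(query)
print(default_result)  # {'name': ['Guido'], 'lang': ['python', 'c']}

# keep_blank_values=True retains the blank value as an empty string.
kept = parse_qs(query, keep_blank_values=True)
print(kept['empty'])   # ['']

# A field without '=' is a parsing error: ignored by default, reported in strict mode.
lenient = parse_qs('a=1&b')
print(lenient)         # {'a': ['1']}

strict_error = None
try:
    parse_qs('a=1&b', strict_parsing=True)
except ValueError as exc:
    strict_error = exc
print(type(strict_error).__name__)  # ValueError
```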
.. function:: parse_qsl(qs, keep_blank_values=False, strict_parsing=False)

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`). Data are returned as a list of
   name, value pairs.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in URL encoded queries should be treated as blank strings. A true value
   indicates that blanks should be retained as blank strings. The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors. If false (the default), errors are silently ignored. If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.parse.urlencode` function to convert such lists of pairs
   into query strings.


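Unlike ``parse_qs()``, ``parse_qsl()`` keeps the original order and does not
merge repeated names, which makes a round trip through ``urlencode()``
straightforward. A minimal sketch with a made-up query string:

```python
from urllib.parse import parse_qsl, urlencode

pairs = parse_qsl('a=1&b=2&a=3')
print(pairs)       # [('a', '1'), ('b', '2'), ('a', '3')]

# The list of pairs can be turned back into a query string.
round_trip = urlencode(pairs)
print(round_trip)  # a=1&b=2&a=3
```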
.. function:: urlunparse(parts)

   Construct a URL from a tuple as returned by :func:`urlparse`. The *parts*
   argument can be any six-item iterable. This may result in a slightly
   different, but equivalent URL, if the URL that was parsed originally had
   unnecessary delimiters (for example, a ``?`` with an empty query; the RFC
   states that these are equivalent).


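A brief sketch of ``urlunparse()``, first building a URL from six parts and
then showing how an unnecessary delimiter disappears on a round trip. The
path and query values are invented for the example.

```python
from urllib.parse import urlparse, urlunparse

# Any six-item iterable works; a list is used here.
built = urlunparse(['http', 'www.cwi.nl', '/index.html', '', 'lang=en', 'top'])
print(built)    # http://www.cwi.nl/index.html?lang=en#top

# A trailing '?' with an empty query is dropped when the URL is reassembled.
rebuilt = urlunparse(urlparse('http://www.cwi.nl/index.html?'))
print(rebuilt)  # http://www.cwi.nl/index.html
```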
.. function:: urlsplit(urlstring, scheme='', allow_fragments=True)

   This is similar to :func:`urlparse`, but does not split the params from the
   URL. It should generally be used instead of :func:`urlparse` when the more
   recent URL syntax is wanted, which allows parameters to be applied to each
   segment of the *path* portion of the URL (see :rfc:`2396`). A separate
   function is needed to separate the path segments and parameters. This
   function returns a 5-tuple: (addressing scheme, network location, path,
   query, fragment identifier).

   The return value is actually an instance of a subclass of :class:`tuple`. This
   class has the following additional read-only convenience attributes:

   +------------------+-------+-------------------------+----------------------+
   | Attribute        | Index | Value                   | Value if not present |
   +==================+=======+=========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier    | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part   | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path       | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`query`    | 3     | Query component         | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`fragment` | 4     | Fragment identifier     | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`username` |       | User name               | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`password` |       | Password                | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)  | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer, | :const:`None`        |
   |                  |       | if present              |                      |
   +------------------+-------+-------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.


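The difference between ``urlsplit()`` and ``urlparse()`` shows up when path
segments carry parameters. The sketch below uses an invented URL; the final
list comprehension is only one possible way to separate per-segment
parameters and is not part of the module.

```python
from urllib.parse import urlparse, urlsplit

url = 'http://www.cwi.nl/a;x=1/b;y=2?q=1#top'

# urlsplit() leaves all parameters inside the path and returns five items.
split = urlsplit(url)
print(split.path)     # /a;x=1/b;y=2
print(len(split))     # 5

# urlparse() splits off the parameters of the last path segment only.
parsed = urlparse(url)
print(parsed.path)    # /a;x=1/b
print(parsed.params)  # y=2

# Separating per-segment parameters is left to the caller, for example:
segments = [segment.partition(';')[::2] for segment in split.path.split('/')[1:]]
print(segments)       # [('a', 'x=1'), ('b', 'y=2')]
```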
.. function:: urlunsplit(parts)

   Combine the elements of a tuple as returned by :func:`urlsplit` into a
   complete URL as a string. The *parts* argument can be any five-item
   iterable. This may result in a slightly different, but equivalent URL, if the
   URL that was parsed originally had unnecessary delimiters (for example, a
   ``?`` with an empty query; the RFC states that these are equivalent).


.. function:: urljoin(base, url, allow_fragments=True)

   Construct a full ("absolute") URL by combining a "base URL" (*base*) with
   another URL (*url*). Informally, this uses components of the base URL, in
   particular the addressing scheme, the network location and (part of) the
   path, to provide missing components in the relative URL. For example:

      >>> from urllib.parse import urljoin
      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
      'http://www.cwi.nl/%7Eguido/FAQ.html'

   The *allow_fragments* argument has the same meaning and default as for
   :func:`urlparse`.

   .. note::

      If *url* is an absolute URL (that is, starting with ``//`` or ``scheme://``),
      the *url*'s host name and/or scheme will be present in the result. For example:

      .. doctest::

         >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
         ...         '//www.python.org/%7Eguido')
         'http://www.python.org/%7Eguido'

      If you do not want that behavior, preprocess the *url* with :func:`urlsplit` and
      :func:`urlunsplit`, removing possible *scheme* and *netloc* parts.


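The preprocessing suggested in the note for ``urljoin()`` could look like the
sketch below. ``join_same_host`` is a hypothetical helper written for this
example, not a function of the module.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def join_same_host(base, url):
    """Join *url* to *base*, ignoring any scheme or host given in *url*.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # Rebuild the URL with empty scheme and netloc so only the
    # path, query and fragment are taken from *url*.
    relative = urlunsplit(('', '', parts.path, parts.query, parts.fragment))
    return urljoin(base, relative)


base = 'http://www.cwi.nl/%7Eguido/Python.html'

plain = urljoin(base, '//www.python.org/%7Eguido')
print(plain)     # http://www.python.org/%7Eguido

stripped = join_same_host(base, '//www.python.org/%7Eguido')
print(stripped)  # http://www.cwi.nl/%7Eguido

# Ordinary relative references are resolved against the base path.
parent = urljoin(base, '../index.html')
print(parent)    # http://www.cwi.nl/index.html
```

Whether dropping the foreign host is appropriate depends on the application;
rejecting such URLs outright may be the safer choice.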
.. function:: urldefrag(url)

   If *url* contains a fragment identifier, return a modified version of *url*
   with no fragment identifier, and the fragment identifier as a separate
   string. If there is no fragment identifier in *url*, return *url* unmodified
   and an empty string.


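Both cases of ``urldefrag()`` can be sketched with tuple unpacking. The URLs
are illustrative.

```python
from urllib.parse import urldefrag

# A URL with a fragment: the two parts come back separately.
url, fragment = urldefrag('http://www.cwi.nl/doc/#intro')
print(url)       # http://www.cwi.nl/doc/
print(fragment)  # intro

# A URL without a fragment is returned unchanged, with an empty string.
same_url, no_fragment = urldefrag('http://www.cwi.nl/doc/')
print(same_url)           # http://www.cwi.nl/doc/
print(repr(no_fragment))  # ''
```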
.. function:: quote(string, safe='/', encoding=None, errors=None)

   Replace special characters in *string* using the ``%xx`` escape. Letters,
   digits, and the characters ``'_.-'`` are never quoted. By default, this
   function is intended for quoting the path section of a URL. The optional
   *safe* parameter specifies additional ASCII characters that should not be
   quoted; its default value is ``'/'``.

   *string* may be either a :class:`str` or a :class:`bytes`.

   The optional *encoding* and *errors* parameters specify how to deal with
   non-ASCII characters, as accepted by the :meth:`str.encode` method.
   *encoding* defaults to ``'utf-8'``.
   *errors* defaults to ``'strict'``, meaning unsupported characters raise a
   :class:`UnicodeEncodeError`.
   *encoding* and *errors* must not be supplied if *string* is a
   :class:`bytes`, or a :class:`TypeError` is raised.

   Note that ``quote(string, safe, encoding, errors)`` is equivalent to
   ``quote_from_bytes(string.encode(encoding, errors), safe)``.

   Example: ``quote('/El Niño/')`` yields ``'/El%20Ni%C3%B1o/'``.


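The interplay of *safe*, *encoding* and the input type in ``quote()`` can be
sketched as follows, reusing the sample string from the example above.

```python
from urllib.parse import quote, quote_from_bytes

default_quoted = quote('/El Niño/')
print(default_quoted)   # /El%20Ni%C3%B1o/

# With an empty *safe*, the slashes are quoted as well.
all_quoted = quote('/El Niño/', safe='')
print(all_quoted)       # %2FEl%20Ni%C3%B1o%2F

# A different encoding changes the bytes that get escaped.
latin1_quoted = quote('Niño', encoding='latin-1')
print(latin1_quoted)    # Ni%F1o

# bytes input is accepted, but then *encoding* must not be given.
bytes_quoted = quote(b'a b')
print(bytes_quoted)     # a%20b

bytes_error = None
try:
    quote(b'a b', encoding='utf-8')
except TypeError as exc:
    bytes_error = exc

# The stated equivalence with quote_from_bytes(), spelled out for UTF-8.
equivalent = quote('Niño') == quote_from_bytes('Niño'.encode('utf-8', 'strict'))
print(equivalent)       # True
```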
.. function:: quote_plus(string, safe='', encoding=None, errors=None)

   Like :func:`quote`, but also replace spaces by plus signs, as required for
   quoting HTML form values when building up a query string to go into a URL.
   Plus signs in the original string are escaped unless they are included in
   *safe*. Unlike :func:`quote`, *safe* defaults to the empty string rather
   than ``'/'``.

   Example: ``quote_plus('/El Niño/')`` yields ``'%2FEl+Ni%C3%B1o%2F'``.


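A short sketch of how ``quote_plus()`` treats spaces, plus signs and slashes.
The input strings are invented for the example.

```python
from urllib.parse import quote_plus

# Spaces become '+', while literal '+' and '/' are escaped by default.
form_value = quote_plus('a b+c/d')
print(form_value)  # a+b%2Bc%2Fd

# Listing '+' in *safe* leaves literal plus signs untouched, which makes
# them indistinguishable from encoded spaces.
ambiguous = quote_plus('a b+c', safe='+')
print(ambiguous)   # a+b+c
```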
.. function:: quote_from_bytes(bytes, safe='/')

   Like :func:`quote`, but accepts a :class:`bytes` object rather than a
   :class:`str`, and does not perform string-to-bytes encoding.

   Example: ``quote_from_bytes(b'a&\xef')`` yields
   ``'a%26%EF'``.


.. function:: unquote(string, encoding='utf-8', errors='replace')

   Replace ``%xx`` escapes by their single-character equivalent.
   The optional *encoding* and *errors* parameters specify how to decode
   percent-encoded sequences into Unicode characters, as accepted by the
   :meth:`bytes.decode` method.

   *string* must be a :class:`str`.

   *encoding* defaults to ``'utf-8'``.
   *errors* defaults to ``'replace'``, meaning invalid sequences are replaced
   by a placeholder character.

   Example: ``unquote('/El%20Ni%C3%B1o/')`` yields ``'/El Niño/'``.


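The *encoding* and *errors* parameters of ``unquote()`` matter when the
escaped bytes are not valid UTF-8. A minimal sketch, using the Latin-1
encoded form of the sample string:

```python
from urllib.parse import unquote

decoded = unquote('/El%20Ni%C3%B1o/')
print(decoded)          # /El Niño/

# %F1 on its own is not valid UTF-8, so the default errors='replace'
# substitutes U+FFFD, the Unicode replacement character.
replaced = unquote('Ni%F1o')
print(ascii(replaced))  # 'Ni\ufffdo'

# Naming the right encoding recovers the intended character.
latin1 = unquote('Ni%F1o', encoding='latin-1')
print(latin1)           # Niño

# errors='strict' raises instead of substituting.
strict_error = None
try:
    unquote('Ni%F1o', errors='strict')
except UnicodeDecodeError as exc:
    strict_error = exc
```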
.. function:: unquote_plus(string, encoding='utf-8', errors='replace')

   Like :func:`unquote`, but also replace plus signs by spaces, as required for
   unquoting HTML form values.

   *string* must be a :class:`str`.

   Example: ``unquote_plus('/El+Ni%C3%B1o/')`` yields ``'/El Niño/'``.


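The only difference between ``unquote()`` and ``unquote_plus()`` is the
treatment of literal plus signs, as this sketch with an invented form value
shows.

```python
from urllib.parse import unquote, unquote_plus

encoded = 'a+b%2Bc'

# unquote() leaves '+' alone and only expands the %2B escape.
path_style = unquote(encoded)
print(path_style)  # a+b+c

# unquote_plus() first turns '+' into a space, then expands escapes.
form_style = unquote_plus(encoded)
print(form_style)  # a b+c
```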
.. function:: unquote_to_bytes(string)

   Replace ``%xx`` escapes by their single-octet equivalent, and return a
   :class:`bytes` object.

   *string* may be either a :class:`str` or a :class:`bytes`.

   If it is a :class:`str`, unescaped non-ASCII characters in *string*
   are encoded into UTF-8 bytes.

   Example: ``unquote_to_bytes('a%26%EF')`` yields
   ``b'a&\xef'``.


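A short sketch of the three input cases of ``unquote_to_bytes()`` described
above:

```python
from urllib.parse import unquote_to_bytes

# Escapes become single octets, with no decoding step afterwards.
from_escapes = unquote_to_bytes('a%26%EF')
print(from_escapes)  # b'a&\xef'

# Unescaped non-ASCII characters in a str are encoded as UTF-8.
from_text = unquote_to_bytes('Niño')
print(from_text)     # b'Ni\xc3\xb1o'

# bytes input is accepted as well.
from_bytes = unquote_to_bytes(b'a%20b')
print(from_bytes)    # b'a b'
```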
.. function:: urlencode(query, doseq=False, safe='', encoding=None, errors=None)

   Convert a mapping object or a sequence of two-element tuples, whose elements
   may be either :class:`str` or :class:`bytes` objects, to a "url-encoded"
   string, suitable to pass to :func:`urllib.request.urlopen` as the optional
   *data* argument. This is useful to pass a dictionary of form fields to a
   ``POST`` request. The resulting string is a series of ``key=value`` pairs
   separated by ``'&'`` characters, where both *key* and *value* are quoted
   using :func:`quote_plus`.

   When a sequence of two-element tuples is used as the *query* argument, the
   first element of each tuple is a key and the second is a value. The value
   element can itself be a sequence; in that case, if the optional parameter
   *doseq* evaluates to true, an individual ``key=value`` pair is generated for
   each element of the value sequence, and the pairs are joined by ``'&'``. The
   order of parameters in the encoded string matches the order of the tuples
   in the sequence.

   This module provides the functions :func:`parse_qs` and :func:`parse_qsl`,
   which parse query strings back into Python data structures.

   When *query* contains :class:`str` keys or values, the *safe*, *encoding*
   and *errors* parameters are passed to :func:`quote_plus` for encoding.

   .. versionchanged:: 3.2
      The *query* parameter supports bytes and string objects.


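The following sketch exercises the main options of ``urlencode()``. The field
names and values are invented for illustration, and no request is actually
sent.

```python
from urllib.parse import parse_qs, urlencode

# A mapping of form fields; non-string values are converted with str().
fields = urlencode({'q': 'El Niño', 'page': 2})
print(fields)      # q=El+Ni%C3%B1o&page=2

# With doseq=True each element of a sequence value gets its own pair.
expanded = urlencode([('tag', ['a', 'b'])], doseq=True)
print(expanded)    # tag=a&tag=b

# Without doseq the sequence is quoted as a single string.
collapsed = urlencode([('tag', ['a', 'b'])])
print(collapsed)   # tag=%5B%27a%27%2C+%27b%27%5D

# *safe* is handed on to quote_plus(), here keeping the slash unescaped.
with_safe = urlencode({'path': '/a b'}, safe='/')
print(with_safe)   # path=/a+b

# parse_qs() reverses the encoding; all values come back as strings.
decoded = parse_qs(fields)
print(decoded)     # {'q': ['El Niño'], 'page': ['2']}
```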
.. seealso::

   :rfc:`3986` - Uniform Resource Identifiers
      This is the current standard (STD66). Any changes to the
      :mod:`urllib.parse` module should conform to this. Certain deviations
      may be observed, mostly for backward compatibility and to meet de facto
      parsing requirements commonly observed in major browsers.

   :rfc:`2732` - Format for Literal IPv6 Addresses in URLs
      This specifies the parsing requirements of IPv6 URLs.

   :rfc:`2396` - Uniform Resource Identifiers (URI): Generic Syntax
      Document describing the generic syntactic requirements for both Uniform Resource
      Names (URNs) and Uniform Resource Locators (URLs).

   :rfc:`2368` - The mailto URL scheme
      Parsing requirements for mailto URL schemes.

   :rfc:`1808` - Relative Uniform Resource Locators
      This Request For Comments includes the rules for joining an absolute and a
      relative URL, including a fair number of "Abnormal Examples" which govern the
      treatment of border cases.

   :rfc:`1738` - Uniform Resource Locators (URL)
      This specifies the formal syntax and semantics of absolute URLs.


.. _urlparse-result-object:

Results of :func:`urlparse` and :func:`urlsplit`
------------------------------------------------

The result objects from the :func:`urlparse` and :func:`urlsplit` functions are
subclasses of the :class:`tuple` type. These subclasses add the attributes
described in those functions, as well as provide an additional method:

.. method:: ParseResult.geturl()

   Return the re-combined version of the original URL as a string. This may differ
   from the original URL in that the scheme will always be normalized to lower case
   and empty components may be dropped. Specifically, empty parameters, queries,
   and fragment identifiers will be removed.

   The result of this method is a fixed point if passed back through the original
   parsing function:

      >>> import urllib.parse
      >>> url = 'HTTP://www.Python.org/doc/#'

      >>> r1 = urllib.parse.urlsplit(url)
      >>> r1.geturl()
      'http://www.Python.org/doc/'

      >>> r2 = urllib.parse.urlsplit(r1.geturl())
      >>> r2.geturl()
      'http://www.Python.org/doc/'


The following classes provide the implementations of the parse results:

.. class:: BaseResult

   Base class for the concrete result classes. This provides most of the
   attribute definitions. It does not provide a :meth:`geturl` method. It is
   derived from :class:`tuple`, but does not override the :meth:`__init__` or
   :meth:`__new__` methods.


.. class:: ParseResult(scheme, netloc, path, params, query, fragment)

   Concrete class for :func:`urlparse` results. The :meth:`__new__` method is
   overridden to support checking that the right number of arguments are passed.


.. class:: SplitResult(scheme, netloc, path, query, fragment)

   Concrete class for :func:`urlsplit` results. The :meth:`__new__` method is
   overridden to support checking that the right number of arguments are passed.

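Because the concrete result classes are tuple subclasses, they can be indexed,
unpacked and constructed directly. The sketch below relies on ``_replace()``,
which is available on the assumption that the result classes are named
tuples, as they are in current Python 3 releases. It does not use
``BaseResult``.

```python
from urllib.parse import ParseResult, SplitResult, urlparse

result = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')

# Index access and attribute access refer to the same items.
print(result[0] == result.scheme)  # True
scheme, netloc, path, params, query, fragment = result

# _replace() returns a modified copy (assumes a named tuple implementation).
secure = result._replace(scheme='https', netloc='www.cwi.nl')
secure_url = secure.geturl()
print(secure_url)                  # https://www.cwi.nl/%7Eguido/Python.html

# The classes can also be instantiated directly.
split_url = SplitResult('http', 'www.cwi.nl', '/doc/', '', '').geturl()
print(split_url)                   # http://www.cwi.nl/doc/

# Passing the wrong number of arguments is rejected.
arity_error = None
try:
    ParseResult('http', 'www.cwi.nl')
except TypeError as exc:
    arity_error = exc
print(type(arity_error).__name__)  # TypeError
```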