:mod:`urllib.parse` --- Parse URLs into components
==================================================

.. module:: urllib.parse
   :synopsis: Parse URLs into or assemble them from components.


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   pair: URL; parsing
   pair: relative; URL

This module defines a standard interface to break Uniform Resource Locator (URL)
strings up into components (addressing scheme, network location, path etc.), to
combine the components back into a URL string, and to convert a "relative URL"
to an absolute URL given a "base URL."

The module has been designed to match the Internet RFC on Relative Uniform
Resource Locators (and discovered a bug in an earlier draft!). It supports the
following URL schemes: ``file``, ``ftp``, ``gopher``, ``hdl``, ``http``,
``https``, ``imap``, ``mailto``, ``mms``, ``news``, ``nntp``, ``prospero``,
``rsync``, ``rtsp``, ``rtspu``, ``sftp``, ``shttp``, ``sip``, ``sips``,
``snews``, ``svn``, ``svn+ssh``, ``telnet``, ``wais``.

The :mod:`urllib.parse` module defines the following functions:

.. function:: urlparse(urlstring, scheme='', allow_fragments=True)

   Parse a URL into six components, returning a 6-tuple. This corresponds to the
   general structure of a URL: ``scheme://netloc/path;parameters?query#fragment``.
   Each tuple item is a string, possibly empty. The components are not broken up
   into smaller parts (for example, the network location is a single string), and
   % escapes are not expanded. The delimiters as shown above are not part of the
   result, except for a leading slash in the *path* component, which is retained if
   present. For example:

      >>> from urllib.parse import urlparse
      >>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
      >>> o   # doctest: +NORMALIZE_WHITESPACE
      ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> o.scheme
      'http'
      >>> o.port
      80
      >>> o.geturl()
      'http://www.cwi.nl:80/%7Eguido/Python.html'

   If the *scheme* argument is specified, it gives the default addressing
   scheme, to be used only if the URL does not specify one. The default value for
   this argument is the empty string.

   If the *allow_fragments* argument is false, fragment identifiers are not
   recognized, even if the URL's addressing scheme normally does support them.
   Instead, they are left in the preceding component and the *fragment* item of
   the result is the empty string. The default value for this argument is
   :const:`True`.

   The return value is actually an instance of a subclass of :class:`tuple`. This
   class has the following additional read-only convenience attributes:

   +------------------+-------+--------------------------+----------------------+
   | Attribute        | Index | Value                    | Value if not present |
   +==================+=======+==========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier     | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part    | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path        | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`params`   | 3     | Parameters for last path | empty string         |
   |                  |       | element                  |                      |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`query`    | 4     | Query component          | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`fragment` | 5     | Fragment identifier      | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`username` |       | User name                | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`password` |       | Password                 | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)   | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer,  | :const:`None`        |
   |                  |       | if present               |                      |
   +------------------+-------+--------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionchanged:: 3.2
      Added IPv6 URL parsing capabilities.


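As a rough illustration of the attributes and optional arguments of
:func:`urlparse` described above, the following sketch parses a URL that uses
every component. The host ``www.example.com`` and the credentials are
placeholder values chosen for this example only:

```python
from urllib.parse import urlparse

# Placeholder URL that exercises all six components plus the userinfo part.
o = urlparse('http://user:secret@Example.COM:8080/a/b;type=a?x=1#top')

print(tuple(o))
# ('http', 'user:secret@Example.COM:8080', '/a/b', 'type=a', 'x=1', 'top')
print(o.username, o.password, o.hostname, o.port)
# user secret example.com 8080

# *scheme* is only a fallback for URLs that do not carry their own scheme.
fallback = urlparse('//www.example.com/index.html', scheme='https')
print(fallback.scheme, fallback.netloc)   # https www.example.com

# With allow_fragments=False the '#...' part is not split off.
nofrag = urlparse('http://www.example.com/a#top', allow_fragments=False)
print(repr(nofrag.path), repr(nofrag.fragment))   # '/a#top' ''
```

Note that :attr:`netloc` keeps the original spelling of the host, while
:attr:`hostname` is lower-cased.
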
.. function:: parse_qs(qs, keep_blank_values=False, strict_parsing=False)

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`). Data are returned as a
   dictionary. The dictionary keys are the unique query variable names and the
   values are lists of values for each name.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in URL encoded queries should be treated as blank strings. A true value
   indicates that blanks should be retained as blank strings. The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors. If false (the default), errors are silently ignored. If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.parse.urlencode` function (with the *doseq* parameter
   set to true) to convert such dictionaries into query strings.


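The effect of the two flags of :func:`parse_qs` can be sketched as follows.
The query string is made up for this example:

```python
from urllib.parse import parse_qs

qs = 'name=Guido&lang=python&lang=c&empty='

# Blank values are dropped by default ...
default = parse_qs(qs)
print(default)
# {'name': ['Guido'], 'lang': ['python', 'c']}

# ... and kept as empty strings when keep_blank_values is true.
with_blanks = parse_qs(qs, keep_blank_values=True)
print(with_blanks)
# {'name': ['Guido'], 'lang': ['python', 'c'], 'empty': ['']}

# A field without '=' is a parsing error; strict_parsing turns it into ValueError.
try:
    parse_qs('name=Guido&broken', strict_parsing=True)
    strict_failed = False
except ValueError as exc:
    strict_failed = True
    print('rejected:', exc)
```
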
.. function:: parse_qsl(qs, keep_blank_values=False, strict_parsing=False)

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`). Data are returned as a list of
   name, value pairs.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in URL encoded queries should be treated as blank strings. A true value
   indicates that blanks should be retained as blank strings. The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors. If false (the default), errors are silently ignored. If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.parse.urlencode` function to convert such lists of pairs into
   query strings.


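Unlike :func:`parse_qs`, :func:`parse_qsl` keeps the original order of the
fields and does not group repeated names. A small sketch with a made-up query
string, including the round trip through :func:`urlencode`:

```python
from urllib.parse import parse_qsl, urlencode

pairs = parse_qsl('lang=python&name=Guido&lang=c')
print(pairs)
# [('lang', 'python'), ('name', 'Guido'), ('lang', 'c')]

# The list of pairs can be turned back into a query string.
rebuilt = urlencode(pairs)
print(rebuilt)
# lang=python&name=Guido&lang=c
```
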
.. function:: urlunparse(parts)

   Construct a URL from a tuple as returned by ``urlparse()``. The *parts*
   argument can be any six-item iterable. This may result in a slightly
   different, but equivalent URL, if the URL that was parsed originally had
   unnecessary delimiters (for example, a ``?`` with an empty query; the RFC
   states that these are equivalent).


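A minimal sketch of :func:`urlunparse`, first with a hand-built six-item list
and then as a round trip that drops an unnecessary ``?``. The host name is a
placeholder:

```python
from urllib.parse import urlparse, urlunparse

# Any six-item iterable works: scheme, netloc, path, params, query, fragment.
built = urlunparse(['http', 'www.example.com', '/path', '', 'q=1', 'top'])
print(built)
# http://www.example.com/path?q=1#top

# The trailing '?' introduces an empty query and is not reproduced.
round_trip = urlunparse(urlparse('http://www.example.com/path?'))
print(round_trip)
# http://www.example.com/path
```
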
.. function:: urlsplit(urlstring, scheme='', allow_fragments=True)

   This is similar to :func:`urlparse`, but does not split the params from the URL.
   This should generally be used instead of :func:`urlparse` if the more recent URL
   syntax allowing parameters to be applied to each segment of the *path* portion
   of the URL (see :rfc:`2396`) is wanted. A separate function is needed to
   separate the path segments and parameters. This function returns a 5-tuple:
   (addressing scheme, network location, path, query, fragment identifier).

   The return value is actually an instance of a subclass of :class:`tuple`. This
   class has the following additional read-only convenience attributes:

   +------------------+-------+-------------------------+----------------------+
   | Attribute        | Index | Value                   | Value if not present |
   +==================+=======+=========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier    | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part   | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path       | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`query`    | 3     | Query component         | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`fragment` | 4     | Fragment identifier     | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`username` |       | User name               | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`password` |       | Password                | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)  | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer, | :const:`None`        |
   |                  |       | if present              |                      |
   +------------------+-------+-------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.


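The difference between :func:`urlsplit` and :func:`urlparse` shows up only
when the path carries ``;parameters``. The following sketch uses a placeholder
URL with parameters on two path segments:

```python
from urllib.parse import urlparse, urlsplit

url = 'http://www.example.com/a;x=1/b;y=2?q=1#top'

# urlsplit() leaves every ';...' parameter inside the path.
s = urlsplit(url)
print(s.path)             # /a;x=1/b;y=2

# urlparse() splits off the parameters of the last path segment only.
p = urlparse(url)
print(p.path, p.params)   # /a;x=1/b y=2
```
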
.. function:: urlunsplit(parts)

   Combine the elements of a tuple as returned by :func:`urlsplit` into a
   complete URL as a string. The *parts* argument can be any five-item
   iterable. This may result in a slightly different, but equivalent URL, if the
   URL that was parsed originally had unnecessary delimiters (for example, a ``?``
   with an empty query; the RFC states that these are equivalent).


.. function:: urljoin(base, url, allow_fragments=True)

   Construct a full ("absolute") URL by combining a "base URL" (*base*) with
   another URL (*url*). Informally, this uses components of the base URL, in
   particular the addressing scheme, the network location and (part of) the
   path, to provide missing components in the relative URL. For example:

      >>> from urllib.parse import urljoin
      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
      'http://www.cwi.nl/%7Eguido/FAQ.html'

   The *allow_fragments* argument has the same meaning and default as for
   :func:`urlparse`.

   .. note::

      If *url* is an absolute URL (that is, starting with ``//`` or ``scheme://``),
      the *url*'s host name and/or scheme will be present in the result. For example:

   .. doctest::

      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
      ...         '//www.python.org/%7Eguido')
      'http://www.python.org/%7Eguido'

   If you do not want that behavior, preprocess the *url* with :func:`urlsplit` and
   :func:`urlunsplit`, removing possible *scheme* and *netloc* parts.


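The preprocessing suggested for :func:`urljoin` could look roughly like the
sketch below. The variable names are illustrative, and blanking the first two
items is just one possible way to remove the *scheme* and *netloc* parts:

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

base = 'http://www.cwi.nl/%7Eguido/Python.html'

# Ordinary relative references are resolved against the base URL.
sibling = urljoin(base, 'FAQ.html')       # http://www.cwi.nl/%7Eguido/FAQ.html
parent = urljoin(base, '../index.html')   # http://www.cwi.nl/index.html
rooted = urljoin(base, '/doc/')           # http://www.cwi.nl/doc/

# A reference with its own network location would replace the base host.
# Rebuilding it with an empty scheme and netloc keeps the result on the base host.
parts = urlsplit('//www.python.org/%7Eguido')
relative = urlunsplit(('', '', parts.path, parts.query, parts.fragment))
pinned = urljoin(base, relative)
print(pinned)                             # http://www.cwi.nl/%7Eguido
```
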
.. function:: urldefrag(url)

   If *url* contains a fragment identifier, return a modified version of *url*
   with no fragment identifier, and the fragment identifier as a separate
   string. If there is no fragment identifier in *url*, return *url* unmodified
   and an empty string.


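A short sketch of both cases of :func:`urldefrag`, using placeholder URLs and
unpacking the returned pair:

```python
from urllib.parse import urldefrag

url, fragment = urldefrag('http://www.example.com/doc.html#section-2')
print(url, fragment)
# http://www.example.com/doc.html section-2

# Without a fragment the URL comes back unchanged, paired with ''.
url2, fragment2 = urldefrag('http://www.example.com/doc.html')
print(repr(fragment2))   # ''
```
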
.. function:: quote(string, safe='/', encoding=None, errors=None)

   Replace special characters in *string* using the ``%xx`` escape. Letters,
   digits, and the characters ``'_.-'`` are never quoted. By default, this
   function is intended for quoting the path section of a URL. The optional *safe*
   parameter specifies additional ASCII characters that should not be quoted;
   its default value is ``'/'``.

   *string* may be either a :class:`str` or a :class:`bytes`.

   The optional *encoding* and *errors* parameters specify how to deal with
   non-ASCII characters, as accepted by the :meth:`str.encode` method.
   *encoding* defaults to ``'utf-8'``.
   *errors* defaults to ``'strict'``, meaning unsupported characters raise a
   :class:`UnicodeEncodeError`.
   *encoding* and *errors* must not be supplied if *string* is a
   :class:`bytes`, or a :class:`TypeError` is raised.

   Note that ``quote(string, safe, encoding, errors)`` is equivalent to
   ``quote_from_bytes(string.encode(encoding, errors), safe)``.

   Example: ``quote('/El Niño/')`` yields ``'/El%20Ni%C3%B1o/'``.


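The interaction of *safe*, *encoding* and the :class:`bytes` restriction of
:func:`quote` can be sketched as follows. The input strings are arbitrary
examples:

```python
from urllib.parse import quote, quote_from_bytes

default = quote('/El Niño/')               # '/' is safe by default
no_safe = quote('a b/c', safe='')          # now '/' is escaped as well
latin1 = quote('ñ', encoding='latin-1')    # one byte instead of two
print(default, no_safe, latin1)
# /El%20Ni%C3%B1o/ a%20b%2Fc %F1

# For str input, quote() behaves like encoding first, then quote_from_bytes().
text = 'El Niño'
same = quote(text) == quote_from_bytes(text.encode('utf-8'))
print(same)                                # True

# bytes input must not be combined with *encoding* or *errors*.
try:
    quote(b'abc', encoding='utf-8')
    bytes_rejected = False
except TypeError as exc:
    bytes_rejected = True
    print('rejected:', exc)
```
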
.. function:: quote_plus(string, safe='', encoding=None, errors=None)

   Like :func:`quote`, but also replace spaces by plus signs, as required for
   quoting HTML form values when building up a query string to go into a URL.
   Plus signs in the original string are escaped unless they are included in
   *safe*. Unlike :func:`quote`, *safe* does not default to ``'/'``.

   Example: ``quote_plus('/El Niño/')`` yields ``'%2FEl+Ni%C3%B1o%2F'``.


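To make the difference between :func:`quote` and :func:`quote_plus` concrete,
the following sketch quotes one arbitrary value three ways:

```python
from urllib.parse import quote, quote_plus

value = 'a+b c/d'

as_path = quote(value)                    # space -> %20, '/' kept
as_form = quote_plus(value)               # space -> '+', '/' and '+' escaped
keep_plus = quote_plus(value, safe='+')   # literal '+' left alone

print(as_path)     # a%2Bb%20c/d
print(as_form)     # a%2Bb+c%2Fd
print(keep_plus)   # a+b+c%2Fd
```

In the last call a literal plus sign and an encoded space become
indistinguishable, which is why ``'+'`` is normally left out of *safe*.
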
.. function:: quote_from_bytes(bytes, safe='/')

   Like :func:`quote`, but accepts a :class:`bytes` object rather than a
   :class:`str`, and does not perform string-to-bytes encoding.

   Example: ``quote_from_bytes(b'a&\xef')`` yields
   ``'a%26%EF'``.


.. function:: unquote(string, encoding='utf-8', errors='replace')

   Replace ``%xx`` escapes by their single-character equivalent.
   The optional *encoding* and *errors* parameters specify how to decode
   percent-encoded sequences into Unicode characters, as accepted by the
   :meth:`bytes.decode` method.

   *string* must be a :class:`str`.

   *encoding* defaults to ``'utf-8'``.
   *errors* defaults to ``'replace'``, meaning invalid sequences are replaced
   by a placeholder character.

   Example: ``unquote('/El%20Ni%C3%B1o/')`` yields ``'/El Niño/'``.


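The following sketch shows how *encoding* and *errors* change the result of
:func:`unquote`. The escape ``%FF`` is used here because it is not valid UTF-8
on its own:

```python
from urllib.parse import unquote

utf8 = unquote('/El%20Ni%C3%B1o/')             # default UTF-8 decoding
latin1 = unquote('%F1', encoding='latin-1')    # single byte decoded as Latin-1
print(utf8, latin1)
# /El Niño/ ñ

# Invalid sequences become U+FFFD (the replacement character) by default ...
replaced = unquote('%FF')
print(replaced == '\ufffd')                    # True

# ... and raise UnicodeDecodeError when errors='strict' is requested.
try:
    unquote('%FF', errors='strict')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```
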
.. function:: unquote_plus(string, encoding='utf-8', errors='replace')

   Like :func:`unquote`, but also replace plus signs by spaces, as required for
   unquoting HTML form values.

   *string* must be a :class:`str`.

   Example: ``unquote_plus('/El+Ni%C3%B1o/')`` yields ``'/El Niño/'``.


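A small comparison of :func:`unquote` and :func:`unquote_plus` on the same
made-up form value, in which ``%2B`` stands for a literal plus sign:

```python
from urllib.parse import unquote, unquote_plus

field = 'El+Ni%C3%B1o+%2B+friends'

plain = unquote(field)        # '+' is left untouched
form = unquote_plus(field)    # '+' becomes a space, '%2B' becomes '+'

print(plain)   # El+Niño+++friends
print(form)    # El Niño + friends
```
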
.. function:: unquote_to_bytes(string)

   Replace ``%xx`` escapes by their single-octet equivalent, and return a
   :class:`bytes` object.

   *string* may be either a :class:`str` or a :class:`bytes`.

   If it is a :class:`str`, unescaped non-ASCII characters in *string*
   are encoded into UTF-8 bytes.

   Example: ``unquote_to_bytes('a%26%EF')`` yields
   ``b'a&\xef'``.


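The three input cases of :func:`unquote_to_bytes` can be sketched like this,
with arbitrary sample inputs:

```python
from urllib.parse import unquote_to_bytes

from_str = unquote_to_bytes('a%26%EF')     # escapes become raw octets
from_bytes = unquote_to_bytes(b'a%20b')    # bytes input is accepted as well
non_ascii = unquote_to_bytes('ñ%20')       # unescaped 'ñ' is encoded as UTF-8

print(from_str)     # b'a&\xef'
print(from_bytes)   # b'a b'
print(non_ascii)    # b'\xc3\xb1 '
```
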
.. function:: urlencode(query, doseq=False)

   Convert a mapping object or a sequence of two-element tuples to a
   "url-encoded" string, suitable to pass to :func:`urllib.request.urlopen` as
   the optional *data* argument. This is useful to pass a dictionary of form
   fields to a ``POST`` request. The resulting string is a series of
   ``key=value`` pairs separated by ``'&'`` characters, where both *key* and
   *value* are quoted using :func:`quote_plus` above. When a sequence of
   two-element tuples is used as the *query* argument, the first element of
   each tuple is a key and the second is a value. The value element can itself
   be a sequence; in that case, if the optional parameter *doseq* evaluates to
   true, individual ``key=value`` pairs separated by ``'&'`` are
   generated for each element of the value sequence for the key. The order of
   parameters in the encoded string will match the order of parameter tuples in
   the sequence. This module provides the functions :func:`parse_qs` and
   :func:`parse_qsl` which are used to parse query strings into Python data
   structures.


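As a sketch of the behaviour of :func:`urlencode` described above, the
following example encodes a sequence of pairs, then a sequence-valued field
with *doseq* set to true, and finally parses the result back. The field names
are invented for the example:

```python
from urllib.parse import parse_qs, urlencode

# Keys and values are quoted with quote_plus(); non-string values go through str().
simple = urlencode([('q', 'El Niño'), ('page', 2)])
print(simple)
# q=El+Ni%C3%B1o&page=2

# With doseq=True each element of a sequence value gets its own key=value pair.
multi = urlencode([('tag', ['a', 'b']), ('sort', 'date')], doseq=True)
print(multi)
# tag=a&tag=b&sort=date

# parse_qs() reverses the encoding into a dictionary of lists.
decoded = parse_qs(multi)
print(decoded)
# {'tag': ['a', 'b'], 'sort': ['date']}
```
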
.. seealso::

   :rfc:`3986` - Uniform Resource Identifiers
      This is the current standard (STD66). Any changes to the
      :mod:`urllib.parse` module should conform to this. Certain deviations
      could be observed, which are mostly for backward compatibility purposes
      and for certain de-facto parsing requirements as commonly observed in
      major browsers.

   :rfc:`2732` - Format for Literal IPv6 Addresses in URL's.
      This specifies the parsing requirements of IPv6 URLs.

   :rfc:`2396` - Uniform Resource Identifiers (URI): Generic Syntax
      Document describing the generic syntactic requirements for both Uniform Resource
      Names (URNs) and Uniform Resource Locators (URLs).

   :rfc:`2368` - The mailto URL scheme.
      Parsing requirements for mailto URL schemes.

   :rfc:`1808` - Relative Uniform Resource Locators
      This Request For Comments includes the rules for joining an absolute and a
      relative URL, including a fair number of "Abnormal Examples" which govern the
      treatment of border cases.

   :rfc:`1738` - Uniform Resource Locators (URL)
      This specifies the formal syntax and semantics of absolute URLs.


.. _urlparse-result-object:

Results of :func:`urlparse` and :func:`urlsplit`
------------------------------------------------

The result objects from the :func:`urlparse` and :func:`urlsplit` functions are
subclasses of the :class:`tuple` type. These subclasses add the attributes
described in those functions, as well as provide an additional method:

.. method:: ParseResult.geturl()

   Return the re-combined version of the original URL as a string. This may differ
   from the original URL in that the scheme will always be normalized to lower case
   and empty components may be dropped. Specifically, empty parameters, queries,
   and fragment identifiers will be removed.

   The result of this method is a fixpoint if passed back through the original
   parsing function:

      >>> import urllib.parse
      >>> url = 'HTTP://www.Python.org/doc/#'

      >>> r1 = urllib.parse.urlsplit(url)
      >>> r1.geturl()
      'http://www.Python.org/doc/'

      >>> r2 = urllib.parse.urlsplit(r1.geturl())
      >>> r2.geturl()
      'http://www.Python.org/doc/'


The following classes provide the implementations of the parse results:

.. class:: BaseResult

   Base class for the concrete result classes. This provides most of the
   attribute definitions. It does not provide a :meth:`geturl` method. It is
   derived from :class:`tuple`, but does not override the :meth:`__init__` or
   :meth:`__new__` methods.


.. class:: ParseResult(scheme, netloc, path, params, query, fragment)

   Concrete class for :func:`urlparse` results. The :meth:`__new__` method is
   overridden to support checking that the right number of arguments are passed.


.. class:: SplitResult(scheme, netloc, path, query, fragment)

   Concrete class for :func:`urlsplit` results. The :meth:`__new__` method is
   overridden to support checking that the right number of arguments are passed.

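
The concrete result classes can also be instantiated directly. The sketch
below, which uses a placeholder host, builds a :class:`ParseResult` by hand,
confirms that :func:`urlsplit` returns a :class:`SplitResult`, and shows that
the argument count is checked:

```python
from urllib.parse import ParseResult, SplitResult, urlsplit

# Six positional arguments, in the same order as the tuple items.
r = ParseResult('http', 'www.example.com', '/index.html', '', 'x=1', '')
print(r.geturl())
# http://www.example.com/index.html?x=1

# Results are tuple subclasses, so ordinary unpacking works.
s = urlsplit('http://www.example.com/index.html?x=1')
print(isinstance(s, SplitResult), isinstance(s, tuple))   # True True
scheme, netloc, path, query, fragment = s

# Passing the wrong number of arguments is rejected with TypeError.
try:
    SplitResult('http', 'www.example.com')
    count_checked = False
except TypeError:
    count_checked = True
```
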