:mod:`urlparse` --- Parse URLs into components
==============================================

.. module:: urlparse
   :synopsis: Parse URLs into or assemble them from components.


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   pair: URL; parsing
   pair: relative; URL

.. note::
   The :mod:`urlparse` module is renamed to :mod:`urllib.parse` in Python 3.
   The :term:`2to3` tool will automatically adapt imports when converting
   your sources to Python 3.

**Source code:** :source:`Lib/urlparse.py`

--------------

This module defines a standard interface to break Uniform Resource Locator (URL)
strings up in components (addressing scheme, network location, path etc.), to
combine the components back into a URL string, and to convert a "relative URL"
to an absolute URL given a "base URL."

The module has been designed to match the Internet RFC on Relative Uniform
Resource Locators. It supports the following URL schemes: ``file``, ``ftp``,
``gopher``, ``hdl``, ``http``, ``https``, ``imap``, ``mailto``, ``mms``,
``news``, ``nntp``, ``prospero``, ``rsync``, ``rtsp``, ``rtspu``, ``sftp``,
``shttp``, ``sip``, ``sips``, ``snews``, ``svn``, ``svn+ssh``, ``telnet``,
``wais``.

.. versionadded:: 2.5
   Support for the ``sftp`` and ``sips`` schemes.

The :mod:`urlparse` module defines the following functions:


.. function:: urlparse(urlstring[, scheme[, allow_fragments]])

   Parse a URL into six components, returning a 6-tuple.  This corresponds to the
   general structure of a URL: ``scheme://netloc/path;parameters?query#fragment``.
   Each tuple item is a string, possibly empty. The components are not broken up
   into smaller parts (for example, the network location is a single string), and %
   escapes are not expanded. The delimiters as shown above are not part of the
   result, except for a leading slash in the *path* component, which is retained if
   present.  For example:

      >>> from urlparse import urlparse
      >>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
      >>> o   # doctest: +NORMALIZE_WHITESPACE
      ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> o.scheme
      'http'
      >>> o.port
      80
      >>> o.geturl()
      'http://www.cwi.nl:80/%7Eguido/Python.html'


   Following the syntax specifications in :rfc:`1808`, urlparse recognizes
   a netloc only if it is properly introduced by '//'.  Otherwise the
   input is presumed to be a relative URL and thus to start with
   a path component.

      >>> from urlparse import urlparse
      >>> urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
      ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> urlparse('www.cwi.nl/%7Eguido/Python.html')
      ParseResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> urlparse('help/Python.html')
      ParseResult(scheme='', netloc='', path='help/Python.html', params='',
                  query='', fragment='')

   If the *scheme* argument is specified, it gives the default addressing
   scheme, to be used only if the URL does not specify one.  The default value for
   this argument is the empty string.

   If the *allow_fragments* argument is false, fragment identifiers are not
   recognized; they are instead parsed as part of the preceding component, even
   if the URL's addressing scheme normally does support them.  The default value
   for this argument is :const:`True`.

   The return value is actually an instance of a subclass of :class:`tuple`.  This
   class has the following additional read-only convenience attributes:

   +------------------+-------+--------------------------+----------------------+
   | Attribute        | Index | Value                    | Value if not present |
   +==================+=======+==========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier     | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part    | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path        | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`params`   | 3     | Parameters for last path | empty string         |
   |                  |       | element                  |                      |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`query`    | 4     | Query component          | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`fragment` | 5     | Fragment identifier      | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`username` |       | User name                | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`password` |       | Password                 | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)   | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer,  | :const:`None`        |
   |                  |       | if present               |                      |
   +------------------+-------+--------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionchanged:: 2.5
      Added attributes to return value.

   .. versionchanged:: 2.7
      Added IPv6 URL parsing capabilities.

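The *scheme* and *allow_fragments* arguments of :func:`urlparse`, and the
attributes derived from the network location, can be illustrated with a small
sketch. The URLs and credentials are made up for illustration, and the
``try``/``except`` import is only an assumption that lets the same lines run
under the Python 3 module name mentioned in the note at the top of this page.

```python
# Import under either name: urlparse (Python 2) or urllib.parse (Python 3).
try:
    from urlparse import urlparse
except ImportError:
    from urllib.parse import urlparse

# The default scheme applies only when the URL does not name one itself.
defaulted = urlparse('//www.cwi.nl/index.html', 'https')
explicit = urlparse('http://www.cwi.nl/index.html', 'https')
# defaulted.scheme == 'https', explicit.scheme == 'http'

# With allow_fragments false, '#top' is left in the preceding component.
split_frag = urlparse('http://www.cwi.nl/doc.html#top')
kept_frag = urlparse('http://www.cwi.nl/doc.html#top', allow_fragments=False)
# split_frag.path == '/doc.html',    split_frag.fragment == 'top'
# kept_frag.path == '/doc.html#top', kept_frag.fragment == ''

# Convenience attributes derived from the network location.
auth = urlparse('http://guido:secret@WWW.CWI.NL:8080/')
# auth.username == 'guido', auth.hostname == 'www.cwi.nl', auth.port == 8080
```

Note that ``hostname`` is lower-cased while ``netloc`` keeps the original
spelling.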

.. function:: parse_qs(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`).  Data are returned as a
   dictionary.  The dictionary keys are the unique query variable names and the
   values are lists of values for each name.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in percent-encoded queries should be treated as blank strings.   A true value
   indicates that blanks should be retained as blank strings.  The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors.  If false (the default), errors are silently ignored.  If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.urlencode` function to convert such dictionaries into
   query strings.

   .. versionadded:: 2.6
      Copied from the :mod:`cgi` module.

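As a rough illustration of the two flags of :func:`parse_qs` and of the round
trip through ``urlencode``, consider the sketch below. The query string is
invented, and passing ``doseq=True`` is an assumption about how one would
usually re-encode the list-valued dictionary that :func:`parse_qs` returns.

```python
# Import under either name: Python 2 modules first, Python 3 as a fallback.
try:
    from urlparse import parse_qs
    from urllib import urlencode
except ImportError:
    from urllib.parse import parse_qs, urlencode

query = 'lang=en&tag=a&tag=b&empty='

# Blank values are dropped by default and kept on request.
default = parse_qs(query)
# default == {'lang': ['en'], 'tag': ['a', 'b']}
kept = parse_qs(query, keep_blank_values=True)
# kept == {'lang': ['en'], 'tag': ['a', 'b'], 'empty': ['']}

# A field without '=' is skipped silently unless strict_parsing is true.
lenient = parse_qs('lang=en&broken')
try:
    parse_qs('lang=en&broken', strict_parsing=True)
    strict_error = None
except ValueError as exc:
    strict_error = exc

# doseq=True makes urlencode() emit one name=value pair per list element.
encoded = urlencode(kept, doseq=True)
```

Because dictionaries do not promise a particular key order in every Python
version, the order of fields in ``encoded`` should not be relied upon.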

.. function:: parse_qsl(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`).  Data are returned as a list of
   name, value pairs.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in percent-encoded queries should be treated as blank strings.   A true value
   indicates that blanks should be retained as blank strings.  The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors.  If false (the default), errors are silently ignored.  If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.urlencode` function to convert such lists of pairs into
   query strings.

   .. versionadded:: 2.6
      Copied from the :mod:`cgi` module.

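Unlike the dictionary form, the list returned by :func:`parse_qsl` keeps
repeated names and the original field order. The following sketch, with an
invented query string, shows that property and the re-encoding step.

```python
# Import under either name: Python 2 modules first, Python 3 as a fallback.
try:
    from urlparse import parse_qsl
    from urllib import urlencode
except ImportError:
    from urllib.parse import parse_qsl, urlencode

# Repeated names and field order are preserved in the list of pairs.
pairs = parse_qsl('b=2&a=1&b=3')
# pairs == [('b', '2'), ('a', '1'), ('b', '3')]

# The list can be edited like any other list before re-encoding.
pairs.append(('c', 'x y'))
encoded = urlencode(pairs)
# encoded == 'b=2&a=1&b=3&c=x+y'
```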

.. function:: urlunparse(parts)

   Construct a URL from a tuple as returned by ``urlparse()``. The *parts* argument
   can be any six-item iterable. This may result in a slightly different, but
   equivalent URL, if the URL that was parsed originally had unnecessary delimiters
   (for example, a ? with an empty query; the RFC states that these are
   equivalent).

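Both properties of :func:`urlunparse` described above, acceptance of any
six-item iterable and the loss of unnecessary delimiters, can be seen in a
short sketch. The URLs are examples only.

```python
# Import under either name: urlparse (Python 2) or urllib.parse (Python 3).
try:
    from urlparse import urlparse, urlunparse
except ImportError:
    from urllib.parse import urlparse, urlunparse

# Any six-item iterable works; here a plain list with empty params.
built = urlunparse(['https', 'www.python.org', '/doc/', '', 'q=1', 'top'])
# built == 'https://www.python.org/doc/?q=1#top'

# A trailing '?' with an empty query is not reproduced.
rebuilt = urlunparse(urlparse('http://www.cwi.nl/doc.html?'))
# rebuilt == 'http://www.cwi.nl/doc.html'
```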

.. function:: urlsplit(urlstring[, scheme[, allow_fragments]])

   This is similar to :func:`urlparse`, but does not split the params from the URL.
   This should generally be used instead of :func:`urlparse` if the more recent URL
   syntax allowing parameters to be applied to each segment of the *path* portion
   of the URL (see :rfc:`2396`) is wanted.  A separate function is needed to
   separate the path segments and parameters.  This function returns a 5-tuple:
   (addressing scheme, network location, path, query, fragment identifier).

   The return value is actually an instance of a subclass of :class:`tuple`.  This
   class has the following additional read-only convenience attributes:

   +------------------+-------+-------------------------+----------------------+
   | Attribute        | Index | Value                   | Value if not present |
   +==================+=======+=========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier    | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part   | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path       | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`query`    | 3     | Query component         | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`fragment` | 4     | Fragment identifier     | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`username` |       | User name               | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`password` |       | Password                | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)  | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer, | :const:`None`        |
   |                  |       | if present              |                      |
   +------------------+-------+-------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionadded:: 2.2

   .. versionchanged:: 2.5
      Added attributes to return value.

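The difference between :func:`urlsplit` and :func:`urlparse` shows up when
more than one path segment carries parameters. The sketch below compares the
two and adds one possible version of the "separate function" mentioned above;
``split_path_params`` is a hypothetical helper written for this example, not
part of the module.

```python
# Import under either name: urlparse (Python 2) or urllib.parse (Python 3).
try:
    from urlparse import urlparse, urlsplit
except ImportError:
    from urllib.parse import urlparse, urlsplit

url = 'http://www.cwi.nl/a;x=1/b;y=2?q=1'

# urlparse() strips only the parameters of the last path segment.
parsed = urlparse(url)
# parsed.path == '/a;x=1/b' and parsed.params == 'y=2'

# urlsplit() leaves every segment's parameters in the path.
split = urlsplit(url)
# split.path == '/a;x=1/b;y=2'


def split_path_params(path):
    """Hypothetical helper: pair each path segment with its ';' parameters."""
    pairs = []
    for segment in path.split('/'):
        name, _, params = segment.partition(';')
        pairs.append((name, params))
    return pairs


segments = split_path_params(split.path)
# segments == [('', ''), ('a', 'x=1'), ('b', 'y=2')]
```

The leading ``('', '')`` entry comes from the empty segment before the first
slash of an absolute path.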

.. function:: urlunsplit(parts)

   Combine the elements of a tuple as returned by :func:`urlsplit` into a complete
   URL as a string. The *parts* argument can be any five-item iterable. This may
   result in a slightly different, but equivalent URL, if the URL that was parsed
   originally had unnecessary delimiters (for example, a ? with an empty query; the
   RFC states that these are equivalent).

   .. versionadded:: 2.2

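Because :func:`urlunsplit` accepts any five-item iterable, one simple way to
change a single component is to copy the split result into a list, edit it,
and reassemble. This is only a sketch with an invented URL.

```python
# Import under either name: urlparse (Python 2) or urllib.parse (Python 3).
try:
    from urlparse import urlsplit, urlunsplit
except ImportError:
    from urllib.parse import urlsplit, urlunsplit

# Copy the 5-tuple into a mutable list.
parts = list(urlsplit('http://www.python.org/doc/?page=1#intro'))
parts[3] = 'page=2'  # index 3 is the query component
updated = urlunsplit(parts)
# updated == 'http://www.python.org/doc/?page=2#intro'
```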

.. function:: urljoin(base, url[, allow_fragments])

   Construct a full ("absolute") URL by combining a "base URL" (*base*) with
   another URL (*url*).  Informally, this uses components of the base URL, in
   particular the addressing scheme, the network location and (part of) the path,
   to provide missing components in the relative URL.  For example:

      >>> from urlparse import urljoin
      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
      'http://www.cwi.nl/%7Eguido/FAQ.html'

   The *allow_fragments* argument has the same meaning and default as for
   :func:`urlparse`.

   .. note::

      If *url* is an absolute URL (that is, starting with ``//`` or ``scheme://``),
      the *url*'s host name and/or scheme will be present in the result.  For example:

   .. doctest::

      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
      ...         '//www.python.org/%7Eguido')
      'http://www.python.org/%7Eguido'

   If you do not want that behavior, preprocess the *url* with :func:`urlsplit` and
   :func:`urlunsplit`, removing possible *scheme* and *netloc* parts.

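The preprocessing suggested for :func:`urljoin` might look like the sketch
below. ``join_on_base_host`` is a hypothetical helper name chosen for this
example, and it assumes that simply discarding the scheme and network
location of *url* is the desired policy.

```python
# Import under either name: urlparse (Python 2) or urllib.parse (Python 3).
try:
    from urlparse import urljoin, urlsplit, urlunsplit
except ImportError:
    from urllib.parse import urljoin, urlsplit, urlunsplit


def join_on_base_host(base, url):
    """Hypothetical helper: resolve *url* against *base*, ignoring any
    scheme or network location that *url* carries itself."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    # Rebuild the URL with empty scheme and netloc so it is purely relative.
    relative = urlunsplit(('', '', path, query, fragment))
    return urljoin(base, relative)


base = 'http://www.cwi.nl/%7Eguido/Python.html'

# Plain urljoin() lets the second URL replace the host.
plain = urljoin(base, '//www.python.org/%7Eguido')
# plain == 'http://www.python.org/%7Eguido'

# The helper keeps the scheme and host of the base URL.
pinned = join_on_base_host(base, '//www.python.org/%7Eguido')
# pinned == 'http://www.cwi.nl/%7Eguido'
```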

.. function:: urldefrag(url)

   If *url* contains a fragment identifier, returns a modified version of *url*
   with no fragment identifier, and the fragment identifier as a separate string.
   If there is no fragment identifier in *url*, returns *url* unmodified and an
   empty string.

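Both cases described for :func:`urldefrag` can be checked with a short
sketch; the URL is only an example.

```python
# Import under either name: urlparse (Python 2) or urllib.parse (Python 3).
try:
    from urlparse import urldefrag
except ImportError:
    from urllib.parse import urldefrag

# A URL with a fragment is split into the bare URL and the fragment.
url, fragment = urldefrag('http://www.python.org/doc/#intro')
# url == 'http://www.python.org/doc/' and fragment == 'intro'

# A URL without a fragment comes back unchanged with an empty string.
same, empty = urldefrag('http://www.python.org/doc/')
# same == 'http://www.python.org/doc/' and empty == ''
```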

.. seealso::

   :rfc:`3986` - Uniform Resource Identifiers
      This is the current standard (STD66). Any changes to the urlparse module
      should conform to this. Certain deviations could be observed, which are
      mostly for backward compatibility purposes and for certain de facto
      parsing requirements as commonly observed in major browsers.

   :rfc:`2732` - Format for Literal IPv6 Addresses in URL's.
      This specifies the parsing requirements of IPv6 URLs.

   :rfc:`2396` - Uniform Resource Identifiers (URI): Generic Syntax
      Document describing the generic syntactic requirements for both Uniform Resource
      Names (URNs) and Uniform Resource Locators (URLs).

   :rfc:`2368` - The mailto URL scheme.
      Parsing requirements for mailto URL schemes.

   :rfc:`1808` - Relative Uniform Resource Locators
      This Request For Comments includes the rules for joining an absolute and a
      relative URL, including a fair number of "Abnormal Examples" which govern the
      treatment of border cases.

   :rfc:`1738` - Uniform Resource Locators (URL)
      This specifies the formal syntax and semantics of absolute URLs.


.. _urlparse-result-object:

Results of :func:`urlparse` and :func:`urlsplit`
------------------------------------------------

The result objects from the :func:`urlparse` and :func:`urlsplit` functions are
subclasses of the :class:`tuple` type.  These subclasses add the attributes
described in those functions, as well as provide an additional method:


.. method:: ParseResult.geturl()

   Return the re-combined version of the original URL as a string. This may differ
   from the original URL in that the scheme will always be normalized to lower case
   and empty components may be dropped. Specifically, empty parameters, queries,
   and fragment identifiers will be removed.

   The result of this method is a fixpoint if passed back through the original
   parsing function:

      >>> import urlparse
      >>> url = 'HTTP://www.Python.org/doc/#'

      >>> r1 = urlparse.urlsplit(url)
      >>> r1.geturl()
      'http://www.Python.org/doc/'

      >>> r2 = urlparse.urlsplit(r1.geturl())
      >>> r2.geturl()
      'http://www.Python.org/doc/'

   .. versionadded:: 2.5

The following classes provide the implementations of the parse results:


.. class:: BaseResult

   Base class for the concrete result classes.  This provides most of the attribute
   definitions.  It does not provide a :meth:`geturl` method.  It is derived from
   :class:`tuple`, but does not override the :meth:`__init__` or :meth:`__new__`
   methods.


.. class:: ParseResult(scheme, netloc, path, params, query, fragment)

   Concrete class for :func:`urlparse` results.  The :meth:`__new__` method is
   overridden to support checking that the right number of arguments are passed.


.. class:: SplitResult(scheme, netloc, path, query, fragment)

   Concrete class for :func:`urlsplit` results.  The :meth:`__new__` method is
   overridden to support checking that the right number of arguments are passed.

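The concrete result classes can also be instantiated directly from their
components. The sketch below does this for :class:`SplitResult`; it assumes
that a wrong argument count surfaces as :exc:`TypeError`, which the text
above implies but does not state.

```python
# Import under either name: urlparse (Python 2) or urllib.parse (Python 3).
try:
    from urlparse import SplitResult, urlsplit
except ImportError:
    from urllib.parse import SplitResult, urlsplit

# Build a result directly from its five components.
manual = SplitResult('http', 'www.python.org', '/doc/', '', 'intro')
# manual.geturl() == 'http://www.python.org/doc/#intro'

# The result still behaves as a plain tuple for comparison and unpacking.
same = (manual == urlsplit('http://www.python.org/doc/#intro'))
is_tuple = isinstance(manual, tuple)

# Passing the wrong number of arguments is rejected (assumed: TypeError).
try:
    SplitResult('http', 'www.python.org')
    rejected = False
except TypeError:
    rejected = True
```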