:mod:`urlparse` --- Parse URLs into components
==============================================

.. module:: urlparse
   :synopsis: Parse URLs into or assemble them from components.


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   pair: URL; parsing
   pair: relative; URL

.. note::
   The :mod:`urlparse` module is renamed to :mod:`urllib.parse` in Python 3.0.
   The :term:`2to3` tool will automatically adapt imports when converting
   your sources to 3.0.


This module defines a standard interface to break Uniform Resource Locator (URL)
strings up in components (addressing scheme, network location, path etc.), to
combine the components back into a URL string, and to convert a "relative URL"
to an absolute URL given a "base URL."

The module has been designed to match the Internet RFC on Relative Uniform
Resource Locators (and discovered a bug in an earlier draft!). It supports the
following URL schemes: ``file``, ``ftp``, ``gopher``, ``hdl``, ``http``,
``https``, ``imap``, ``mailto``, ``mms``, ``news``, ``nntp``, ``prospero``,
``rsync``, ``rtsp``, ``rtspu``, ``sftp``, ``shttp``, ``sip``, ``sips``,
``snews``, ``svn``, ``svn+ssh``, ``telnet``, ``wais``.

.. versionadded:: 2.5
   Support for the ``sftp`` and ``sips`` schemes.

.. seealso::

   Latest version of the `urlparse module Python source code
   <http://svn.python.org/view/python/branches/release27-maint/Lib/urlparse.py?view=markup>`_

The :mod:`urlparse` module defines the following functions:


.. function:: urlparse(urlstring[, scheme[, allow_fragments]])

   Parse a URL into six components, returning a 6-tuple. This corresponds to the
   general structure of a URL: ``scheme://netloc/path;parameters?query#fragment``.
   Each tuple item is a string, possibly empty. The components are not broken up
   into smaller parts (for example, the network location is a single string), and
   % escapes are not expanded. The delimiters as shown above are not part of the
   result, except for a leading slash in the *path* component, which is retained if
   present. For example:

      >>> from urlparse import urlparse
      >>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
      >>> o   # doctest: +NORMALIZE_WHITESPACE
      ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> o.scheme
      'http'
      >>> o.port
      80
      >>> o.geturl()
      'http://www.cwi.nl:80/%7Eguido/Python.html'


   Following the syntax specifications in :rfc:`1808`, :func:`urlparse`
   recognizes a netloc only if it is properly introduced by ``//``. Otherwise the
   input is presumed to be a relative URL and thus to start with
   a path component.

      >>> from urlparse import urlparse
      >>> urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
      ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')
      ParseResult(scheme='', netloc='', path='www.cwi.nl:80/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> urlparse('help/Python.html')
      ParseResult(scheme='', netloc='', path='help/Python.html', params='',
                  query='', fragment='')

   If the *scheme* argument is specified, it gives the default addressing
   scheme, to be used only if the URL does not specify one. The default value for
   this argument is the empty string.

   If the *allow_fragments* argument is false, fragment identifiers are not
   allowed, even if the URL's addressing scheme normally does support them. The
   default value for this argument is :const:`True`.
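
   The effect of the two optional arguments can be sketched as follows. This is
   an illustrative example, not part of the reference text: the URLs are made
   up, and the ``try``/``except`` import is an assumption added so that the
   snippet also runs on Python 3, where the module is named :mod:`urllib.parse`.

   .. code-block:: python

      try:
          from urlparse import urlparse      # Python 2
      except ImportError:
          from urllib.parse import urlparse  # Python 3 (renamed module)

      # The default scheme is used only when the URL does not carry one.
      with_default = urlparse('//www.cwi.nl/index.html', scheme='http')
      explicit = urlparse('ftp://www.cwi.nl/index.html', scheme='http')
      print(with_default.scheme)     # prints: http
      print(explicit.scheme)         # prints: ftp

      # With allow_fragments false, '#' is not treated as a delimiter, so the
      # would-be fragment stays in the preceding component (here, the path).
      no_frag = urlparse('http://www.cwi.nl/doc#intro', allow_fragments=False)
      print(no_frag.path)            # prints: /doc#intro
      print(repr(no_frag.fragment))  # prints: ''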

   The return value is actually an instance of a subclass of :class:`tuple`. This
   class has the following additional read-only convenience attributes:

   +------------------+-------+--------------------------+----------------------+
   | Attribute        | Index | Value                    | Value if not present |
   +==================+=======+==========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier     | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part    | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path        | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`params`   | 3     | Parameters for last path | empty string         |
   |                  |       | element                  |                      |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`query`    | 4     | Query component          | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`fragment` | 5     | Fragment identifier      | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`username` |       | User name                | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`password` |       | Password                 | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)   | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer,  | :const:`None`        |
   |                  |       | if present               |                      |
   +------------------+-------+--------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.
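
   As a brief illustration of the convenience attributes, the sketch below
   parses a made-up URL that carries credentials and a port. The fallback import
   is an assumption so that the example also runs under Python 3.

   .. code-block:: python

      try:
          from urlparse import urlparse      # Python 2
      except ImportError:
          from urllib.parse import urlparse  # Python 3 (renamed module)

      parsed = urlparse('http://guido:secret@WWW.CWI.NL:8080/index.html')
      print(parsed.netloc)     # prints: guido:secret@WWW.CWI.NL:8080
      print(parsed.username)   # prints: guido
      print(parsed.password)   # prints: secret
      print(parsed.hostname)   # prints: www.cwi.nl (lower-cased)
      print(parsed.port)       # prints: 8080 (an integer, not a string)

      # The six tuple items are reachable by index or by attribute name.
      same_item = parsed[1] == parsed.netloc

      # A relative URL has no network location, so these are None.
      relative = urlparse('help/Python.html')
      print(relative.hostname) # prints: None
      print(relative.port)     # prints: None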

   .. versionchanged:: 2.5
      Added attributes to return value.

   .. versionchanged:: 2.7
      Added IPv6 URL parsing capabilities.


.. function:: parse_qs(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`). Data are returned as a
   dictionary. The dictionary keys are the unique query variable names and the
   values are lists of values for each name.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in percent-encoded queries should be treated as blank strings. A true value
   indicates that blanks should be retained as blank strings. The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors. If false (the default), errors are silently ignored. If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.urlencode` function to convert such dictionaries into
   query strings.
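
   A minimal sketch of the two flags, using an invented query string. The
   fallback import is an assumption so that the example also runs under
   Python 3.

   .. code-block:: python

      try:
          from urlparse import parse_qs      # Python 2.6+
      except ImportError:
          from urllib.parse import parse_qs  # Python 3 (renamed module)

      qs = 'name=Guido&lang=python&lang=c&empty='

      # By default the blank value is ignored; repeated names share one list.
      default = parse_qs(qs)
      # default == {'name': ['Guido'], 'lang': ['python', 'c']}

      # keep_blank_values retains the blank value as an empty string.
      kept = parse_qs(qs, keep_blank_values=True)
      # kept['empty'] == ['']

      # strict_parsing turns a malformed field (no '=') into a ValueError.
      try:
          parse_qs('novalue', strict_parsing=True)
      except ValueError:
          strict_failed = True
      else:
          strict_failed = False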
149
Georg Brandla6714b22009-11-03 18:34:27 +0000150 .. versionadded:: 2.6
151 Copied from the :mod:`cgi` module.
152
Facundo Batistac585df92008-09-03 22:35:50 +0000153
154.. function:: parse_qsl(qs[, keep_blank_values[, strict_parsing]])
155
156 Parse a query string given as a string argument (data of type
157 :mimetype:`application/x-www-form-urlencoded`). Data are returned as a list of
158 name, value pairs.
159
160 The optional argument *keep_blank_values* is a flag indicating whether blank
Senthil Kumaranbd13f452010-08-09 20:14:11 +0000161 values in percent-encoded queries should be treated as blank strings. A true value
Facundo Batistac585df92008-09-03 22:35:50 +0000162 indicates that blanks should be retained as blank strings. The default false
163 value indicates that blank values are to be ignored and treated as if they were
164 not included.
165
166 The optional argument *strict_parsing* is a flag indicating what to do with
167 parsing errors. If false (the default), errors are silently ignored. If true,
168 errors raise a :exc:`ValueError` exception.
169
170 Use the :func:`urllib.urlencode` function to convert such lists of pairs into
171 query strings.
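
   The round trip between a query string and a list of pairs might look like
   the sketch below. The query string is invented, and the fallback import is
   an assumption so that the example also runs under Python 3, where
   ``urlencode`` lives in :mod:`urllib.parse`.

   .. code-block:: python

      try:
          from urlparse import parse_qsl                  # Python 2.6+
          from urllib import urlencode
      except ImportError:
          from urllib.parse import parse_qsl, urlencode   # Python 3

      # Unlike parse_qs(), the order and repetition of names are preserved.
      pairs = parse_qsl('a=1&b=2&a=3')
      # pairs == [('a', '1'), ('b', '2'), ('a', '3')]

      # urlencode() accepts the list of pairs and rebuilds the query string.
      rebuilt = urlencode(pairs)
      print(rebuilt)   # prints: a=1&b=2&a=3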

   .. versionadded:: 2.6
      Copied from the :mod:`cgi` module.


.. function:: urlunparse(parts)

   Construct a URL from a tuple as returned by ``urlparse()``. The *parts* argument
   can be any six-item iterable. This may result in a slightly different, but
   equivalent URL, if the URL that was parsed originally had unnecessary delimiters
   (for example, a ? with an empty query; the RFC states that these are
   equivalent).
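
   For illustration, the sketch below builds a URL from a plain six-item tuple
   and shows the dropped delimiter mentioned above. The URLs are examples only,
   and the fallback import is an assumption so that the snippet also runs under
   Python 3.

   .. code-block:: python

      try:
          from urlparse import urlparse, urlunparse      # Python 2
      except ImportError:
          from urllib.parse import urlparse, urlunparse  # Python 3

      # Any six-item iterable works, not only a ParseResult.
      parts = ('http', 'www.cwi.nl:80', '/%7Eguido/Python.html', '', '', '')
      url = urlunparse(parts)
      print(url)          # prints: http://www.cwi.nl:80/%7Eguido/Python.html

      # A trailing '?' with an empty query does not survive the round trip.
      round_trip = urlunparse(urlparse('http://www.cwi.nl/doc?'))
      print(round_trip)   # prints: http://www.cwi.nl/doc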


.. function:: urlsplit(urlstring[, scheme[, allow_fragments]])

   This is similar to :func:`urlparse`, but does not split the params from the URL.
   This should generally be used instead of :func:`urlparse` if the more recent URL
   syntax allowing parameters to be applied to each segment of the *path* portion
   of the URL (see :rfc:`2396`) is wanted. A separate function is needed to
   separate the path segments and parameters. This function returns a 5-tuple:
   (addressing scheme, network location, path, query, fragment identifier).
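
   The difference from :func:`urlparse` can be seen by running both functions
   on the same URL. The URL below is made up for the purpose, and the fallback
   import is an assumption so that the example also runs under Python 3.

   .. code-block:: python

      try:
          from urlparse import urlparse, urlsplit      # Python 2
      except ImportError:
          from urllib.parse import urlparse, urlsplit  # Python 3

      url = 'http://www.cwi.nl/path;type=a?q=1#frag'

      # urlsplit() leaves the ';type=a' parameters attached to the path.
      split = urlsplit(url)
      print(len(split), split.path)                  # 5 items, path /path;type=a

      # urlparse() separates them into the extra 'params' item.
      parsed = urlparse(url)
      print(len(parsed), parsed.path, parsed.params) # 6 items, /path and type=a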

   The return value is actually an instance of a subclass of :class:`tuple`. This
   class has the following additional read-only convenience attributes:

   +------------------+-------+-------------------------+----------------------+
   | Attribute        | Index | Value                   | Value if not present |
   +==================+=======+=========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier    | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part   | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path       | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`query`    | 3     | Query component         | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`fragment` | 4     | Fragment identifier     | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`username` |       | User name               | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`password` |       | Password                | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)  | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer, | :const:`None`        |
   |                  |       | if present              |                      |
   +------------------+-------+-------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionadded:: 2.2

   .. versionchanged:: 2.5
      Added attributes to return value.


.. function:: urlunsplit(parts)

   Combine the elements of a tuple as returned by :func:`urlsplit` into a complete
   URL as a string. The *parts* argument can be any five-item iterable. This may
   result in a slightly different, but equivalent URL, if the URL that was parsed
   originally had unnecessary delimiters (for example, a ? with an empty query; the
   RFC states that these are equivalent).

   .. versionadded:: 2.2


.. function:: urljoin(base, url[, allow_fragments])

   Construct a full ("absolute") URL by combining a "base URL" (*base*) with
   another URL (*url*). Informally, this uses components of the base URL, in
   particular the addressing scheme, the network location and (part of) the path,
   to provide missing components in the relative URL. For example:

      >>> from urlparse import urljoin
      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
      'http://www.cwi.nl/%7Eguido/FAQ.html'

   The *allow_fragments* argument has the same meaning and default as for
   :func:`urlparse`.

   .. note::

      If *url* is an absolute URL (that is, starting with ``//`` or ``scheme://``),
      the *url*'s host name and/or scheme will be present in the result. For example:

      .. doctest::

         >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
         ...         '//www.python.org/%7Eguido')
         'http://www.python.org/%7Eguido'

      If you do not want that behavior, preprocess the *url* with :func:`urlsplit` and
      :func:`urlunsplit`, removing possible *scheme* and *netloc* parts.
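
   The preprocessing suggested in the note could be sketched as follows. The
   helper name ``join_ignoring_host`` is hypothetical and not part of the
   module, and the fallback import is an assumption so that the example also
   runs under Python 3.

   .. code-block:: python

      try:
          from urlparse import urljoin, urlsplit, urlunsplit      # Python 2
      except ImportError:
          from urllib.parse import urljoin, urlsplit, urlunsplit  # Python 3

      def join_ignoring_host(base, url):
          """Join url onto base, discarding any scheme or netloc in url.

          Hypothetical helper for illustration only.
          """
          parts = urlsplit(url)
          # Rebuild the URL with empty scheme and netloc items.
          relative = urlunsplit(('', '', parts.path, parts.query, parts.fragment))
          return urljoin(base, relative)

      base = 'http://www.cwi.nl/%7Eguido/Python.html'
      joined = join_ignoring_host(base, '//www.python.org/%7Eguido')
      print(joined)   # prints: http://www.cwi.nl/%7Eguido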


.. function:: urldefrag(url)

   If *url* contains a fragment identifier, returns a modified version of *url*
   with no fragment identifier, and the fragment identifier as a separate string.
   If there is no fragment identifier in *url*, returns *url* unmodified and an
   empty string.
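
   Both cases are shown in the short sketch below, using example URLs. The
   fallback import is an assumption so that the snippet also runs under
   Python 3.

   .. code-block:: python

      try:
          from urlparse import urldefrag      # Python 2
      except ImportError:
          from urllib.parse import urldefrag  # Python 3 (renamed module)

      # With a fragment: the URL comes back stripped, the fragment separately.
      url, fragment = urldefrag('http://www.cwi.nl/doc/#intro')
      print(url)        # prints: http://www.cwi.nl/doc/
      print(fragment)   # prints: intro

      # Without a fragment: the URL is unchanged and the second item is empty.
      same, empty = urldefrag('http://www.cwi.nl/doc/')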


.. seealso::

   :rfc:`3986` - Uniform Resource Identifiers
      This is the current standard (STD66). Any changes to the urlparse module
      should conform to this. Certain deviations could be observed, which are
      mostly for backward compatibility purposes and for certain de-facto
      parsing requirements as commonly observed in major browsers.

   :rfc:`2732` - Format for Literal IPv6 Addresses in URL's.
      This specifies the parsing requirements of IPv6 URLs.

   :rfc:`2396` - Uniform Resource Identifiers (URI): Generic Syntax
      Document describing the generic syntactic requirements for both Uniform Resource
      Names (URNs) and Uniform Resource Locators (URLs).

   :rfc:`2368` - The mailto URL scheme.
      Parsing requirements for mailto URL schemes.

   :rfc:`1808` - Relative Uniform Resource Locators
      This Request For Comments includes the rules for joining an absolute and a
      relative URL, including a fair number of "Abnormal Examples" which govern the
      treatment of border cases.

   :rfc:`1738` - Uniform Resource Locators (URL)
      This specifies the formal syntax and semantics of absolute URLs.


.. _urlparse-result-object:

Results of :func:`urlparse` and :func:`urlsplit`
------------------------------------------------

The result objects from the :func:`urlparse` and :func:`urlsplit` functions are
subclasses of the :class:`tuple` type. These subclasses add the attributes
described in those functions, as well as provide an additional method:


.. method:: ParseResult.geturl()

   Return the re-combined version of the original URL as a string. This may differ
   from the original URL in that the scheme will always be normalized to lower case
   and empty components may be dropped. Specifically, empty parameters, queries,
   and fragment identifiers will be removed.

   The result of this method is a fixpoint if passed back through the original
   parsing function:

      >>> import urlparse
      >>> url = 'HTTP://www.Python.org/doc/#'

      >>> r1 = urlparse.urlsplit(url)
      >>> r1.geturl()
      'http://www.Python.org/doc/'

      >>> r2 = urlparse.urlsplit(r1.geturl())
      >>> r2.geturl()
      'http://www.Python.org/doc/'

   .. versionadded:: 2.5

The following classes provide the implementations of the parse results:


.. class:: BaseResult

   Base class for the concrete result classes. This provides most of the attribute
   definitions. It does not provide a :meth:`geturl` method. It is derived from
   :class:`tuple`, but does not override the :meth:`__init__` or :meth:`__new__`
   methods.


.. class:: ParseResult(scheme, netloc, path, params, query, fragment)

   Concrete class for :func:`urlparse` results. The :meth:`__new__` method is
   overridden to support checking that the right number of arguments are passed.


.. class:: SplitResult(scheme, netloc, path, query, fragment)

   Concrete class for :func:`urlsplit` results. The :meth:`__new__` method is
   overridden to support checking that the right number of arguments are passed.
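
The concrete classes can also be instantiated directly from their components.
The sketch below is an illustration under stated assumptions rather than a
recommended pattern: the component values are examples, and the fallback import
is added so that the snippet also runs under Python 3.

.. code-block:: python

   try:
       from urlparse import SplitResult, urlsplit      # Python 2
   except ImportError:
       from urllib.parse import SplitResult, urlsplit  # Python 3

   # Build a result by hand from the five urlsplit() components.
   built = SplitResult('http', 'www.python.org', '/doc/', '', '')
   print(built.geturl())     # prints: http://www.python.org/doc/
   print(built.hostname)     # prints: www.python.org

   # It compares equal to the tuple produced by parsing the same URL.
   matches = built == urlsplit('http://www.python.org/doc/')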