:mod:`urlparse` --- Parse URLs into components
==============================================

.. module:: urlparse
   :synopsis: Parse URLs into or assemble them from components.


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   pair: URL; parsing
   pair: relative; URL

.. note::
   The :mod:`urlparse` module is renamed to :mod:`urllib.parse` in Python 3.0.
   The :term:`2to3` tool will automatically adapt imports when converting
   your sources to 3.0.


This module defines a standard interface to break Uniform Resource Locator (URL)
strings up in components (addressing scheme, network location, path etc.), to
combine the components back into a URL string, and to convert a "relative URL"
to an absolute URL given a "base URL."

The module has been designed to match the Internet RFC on Relative Uniform
Resource Locators (and discovered a bug in an earlier draft!). It supports the
following URL schemes: ``file``, ``ftp``, ``gopher``, ``hdl``, ``http``,
``https``, ``imap``, ``mailto``, ``mms``, ``news``, ``nntp``, ``prospero``,
``rsync``, ``rtsp``, ``rtspu``, ``sftp``, ``shttp``, ``sip``, ``sips``,
``snews``, ``svn``, ``svn+ssh``, ``telnet``, ``wais``.

.. versionadded:: 2.5
   Support for the ``sftp`` and ``sips`` schemes.

.. seealso::

   Latest version of the `urlparse module Python source code
   <http://svn.python.org/view/python/branches/release27-maint/Lib/urlparse.py?view=markup>`_

The :mod:`urlparse` module defines the following functions:


.. function:: urlparse(urlstring[, scheme[, allow_fragments]])

   Parse a URL into six components, returning a 6-tuple. This corresponds to the
   general structure of a URL: ``scheme://netloc/path;parameters?query#fragment``.
   Each tuple item is a string, possibly empty. The components are not broken up in
   smaller parts (for example, the network location is a single string), and %
   escapes are not expanded. The delimiters as shown above are not part of the
   result, except for a leading slash in the *path* component, which is retained if
   present. For example:

      >>> from urlparse import urlparse
      >>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
      >>> o   # doctest: +NORMALIZE_WHITESPACE
      ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> o.scheme
      'http'
      >>> o.port
      80
      >>> o.geturl()
      'http://www.cwi.nl:80/%7Eguido/Python.html'


   If the scheme value is not specified, :func:`urlparse`, following the syntax
   specifications of :rfc:`1808`, recognizes a netloc only if it is introduced
   by ``//``. Otherwise the network location cannot be distinguished from the
   path, and the whole component is classified as a path, as in a relative
   URL.

      >>> from urlparse import urlparse
      >>> urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
      ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')
      ParseResult(scheme='', netloc='', path='www.cwi.nl:80/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> urlparse('help/Python.html')
      ParseResult(scheme='', netloc='', path='help/Python.html', params='',
                  query='', fragment='')

   If the *scheme* argument is specified, it gives the default addressing
   scheme, to be used only if the URL does not specify one. The default value for
   this argument is the empty string.

   If the *allow_fragments* argument is false, fragment identifiers are not
   allowed, even if the URL's addressing scheme normally does support them. The
   default value for this argument is :const:`True`.

   The return value is actually an instance of a subclass of :class:`tuple`. This
   class has the following additional read-only convenience attributes:

   +------------------+-------+--------------------------+----------------------+
   | Attribute        | Index | Value                    | Value if not present |
   +==================+=======+==========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier     | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part    | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path        | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`params`   | 3     | Parameters for last path | empty string         |
   |                  |       | element                  |                      |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`query`    | 4     | Query component          | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`fragment` | 5     | Fragment identifier      | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`username` |       | User name                | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`password` |       | Password                 | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)   | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer,  | :const:`None`        |
   |                  |       | if present               |                      |
   +------------------+-------+--------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionchanged:: 2.5
      Added attributes to return value.

   .. versionchanged:: 2.7
      Added IPv6 URL parsing capabilities.

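The effect of the *scheme* and *allow_fragments* arguments, and of the
convenience attributes, can be sketched as follows. This is a minimal
illustration rather than part of the module: the sample URLs and variable
names are made up, and the ``try``/``except`` import is only an assumption
that lets the same snippet run under the Python 3 module name.

```python
try:
    from urlparse import urlparse        # Python 2 module name
except ImportError:
    from urllib.parse import urlparse    # same function after the Python 3 rename

# The default scheme is applied only when the URL itself carries none.
with_default = urlparse('//www.cwi.nl/%7Eguido/', 'https')
explicit = urlparse('http://www.cwi.nl/%7Eguido/', 'https')
# with_default.scheme is 'https'; explicit.scheme stays 'http'

# With allow_fragments false, '#' is not treated as a fragment delimiter,
# so the text after it stays in the path.
no_frag = urlparse('http://www.cwi.nl/doc.html#intro', allow_fragments=False)
# no_frag.path is '/doc.html#intro' and no_frag.fragment is ''

# The convenience attributes break the network location down further.
auth = urlparse('http://guido:secret@WWW.CWI.NL:8080/')
# auth.username is 'guido', auth.password is 'secret',
# auth.hostname is 'www.cwi.nl' (lower case), auth.port is 8080 (an integer)
```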

.. function:: parse_qs(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`). Data are returned as a
   dictionary. The dictionary keys are the unique query variable names and the
   values are lists of values for each name.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in percent-encoded queries should be treated as blank strings. A true value
   indicates that blanks should be retained as blank strings. The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors. If false (the default), errors are silently ignored. If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.urlencode` function (with its *doseq* argument set to
   a true value, since the values are lists) to convert such dictionaries into
   query strings.

   .. versionadded:: 2.6
      Copied from the :mod:`cgi` module.

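As a rough illustration of the two flags described for :func:`parse_qs`, the
sketch below parses one made-up query string three ways. The query string and
variable names are hypothetical, and the import fallback is an assumption for
running the snippet under Python 3.

```python
try:
    from urlparse import parse_qs        # Python 2 module name
except ImportError:
    from urllib.parse import parse_qs    # same function after the Python 3 rename

query = 'name=guido&lang=python&lang=c&empty='

# By default, blank values are dropped and repeated names are collected.
default = parse_qs(query)
# {'name': ['guido'], 'lang': ['python', 'c']}

# With keep_blank_values true, 'empty' survives with a blank string.
kept = parse_qs(query, keep_blank_values=True)
# {'name': ['guido'], 'lang': ['python', 'c'], 'empty': ['']}

# With strict_parsing true, a field without '=' raises ValueError.
try:
    parse_qs('name=guido&novalue', strict_parsing=True)
    strict_error = False
except ValueError:
    strict_error = True
```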

.. function:: parse_qsl(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`). Data are returned as a list of
   name, value pairs.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in percent-encoded queries should be treated as blank strings. A true value
   indicates that blanks should be retained as blank strings. The default false
   value indicates that blank values are to be ignored and treated as if they were
   not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors. If false (the default), errors are silently ignored. If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.urlencode` function to convert such lists of pairs into
   query strings.

   .. versionadded:: 2.6
      Copied from the :mod:`cgi` module.

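The round trip between :func:`parse_qsl` and ``urlencode`` mentioned above
might look like the following sketch. The query string is invented for the
example, and the import fallback is an assumption so that the snippet also
runs under Python 3, where both functions live in :mod:`urllib.parse`.

```python
try:
    from urlparse import parse_qsl               # Python 2 locations
    from urllib import urlencode
except ImportError:
    from urllib.parse import parse_qsl, urlencode  # Python 3 location

# Unlike parse_qs, the list form keeps the order and repetition of fields.
pairs = parse_qsl('a=1&b=two+words&a=3')
# [('a', '1'), ('b', 'two words'), ('a', '3')]

# urlencode turns the list of pairs back into a query string.
rebuilt = urlencode(pairs)
# 'a=1&b=two+words&a=3'
```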

.. function:: urlunparse(parts)

   Construct a URL from a tuple as returned by ``urlparse()``. The *parts* argument
   can be any six-item iterable. This may result in a slightly different, but
   equivalent URL, if the URL that was parsed originally had unnecessary delimiters
   (for example, a ``?`` with an empty query; the RFC states that these are
   equivalent).

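A short sketch of :func:`urlunparse`, first building a URL from a plain
six-item tuple and then showing the "slightly different, but equivalent"
result for a URL with an empty query. The component values are illustrative
only, and the import fallback is an assumption for Python 3.

```python
try:
    from urlparse import urlparse, urlunparse        # Python 2 module name
except ImportError:
    from urllib.parse import urlparse, urlunparse    # Python 3 module name

# Any six-item iterable works: scheme, netloc, path, params, query, fragment.
parts = ('http', 'www.cwi.nl', '/%7Eguido/Python.html', '', 'q=1', 'top')
built = urlunparse(parts)
# 'http://www.cwi.nl/%7Eguido/Python.html?q=1#top'

# A trailing '?' with an empty query is an unnecessary delimiter,
# so it does not survive the round trip.
round_trip = urlunparse(urlparse('http://www.cwi.nl/doc/?'))
# 'http://www.cwi.nl/doc/'
```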

.. function:: urlsplit(urlstring[, scheme[, allow_fragments]])

   This is similar to :func:`urlparse`, but does not split the params from the URL.
   This should generally be used instead of :func:`urlparse` if the more recent URL
   syntax allowing parameters to be applied to each segment of the *path* portion
   of the URL (see :rfc:`2396`) is wanted. A separate function is needed to
   separate the path segments and parameters. This function returns a 5-tuple:
   (addressing scheme, network location, path, query, fragment identifier).

   The return value is actually an instance of a subclass of :class:`tuple`. This
   class has the following additional read-only convenience attributes:

   +------------------+-------+-------------------------+----------------------+
   | Attribute        | Index | Value                   | Value if not present |
   +==================+=======+=========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier    | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part   | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path       | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`query`    | 3     | Query component         | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`fragment` | 4     | Fragment identifier     | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`username` |       | User name               | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`password` |       | Password                | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)  | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer, | :const:`None`        |
   |                  |       | if present              |                      |
   +------------------+-------+-------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionadded:: 2.2

   .. versionchanged:: 2.5
      Added attributes to return value.

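The difference between :func:`urlsplit` and :func:`urlparse` shows up when a
path carries ``;parameters``. The URL below is a made-up example with
parameters on two path segments; the import fallback is an assumption for
Python 3.

```python
try:
    from urlparse import urlparse, urlsplit        # Python 2 module name
except ImportError:
    from urllib.parse import urlparse, urlsplit    # Python 3 module name

url = 'http://www.cwi.nl/a;x=1/b;y=2?q=1'

# urlparse splits off only the parameters of the last path segment.
parsed = urlparse(url)
# parsed.path is '/a;x=1/b' and parsed.params is 'y=2'

# urlsplit leaves every parameter inside the path and returns five items.
split = urlsplit(url)
# split.path is '/a;x=1/b;y=2'
```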

.. function:: urlunsplit(parts)

   Combine the elements of a tuple as returned by :func:`urlsplit` into a complete
   URL as a string. The *parts* argument can be any five-item iterable. This may
   result in a slightly different, but equivalent URL, if the URL that was parsed
   originally had unnecessary delimiters (for example, a ``?`` with an empty query; the
   RFC states that these are equivalent).

   .. versionadded:: 2.2

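Because *parts* can be any five-item iterable, one way to change a single
component is to copy the split result into a list, edit it, and reassemble
it. This is only a sketch with an invented URL; the import fallback is an
assumption for Python 3.

```python
try:
    from urlparse import urlsplit, urlunsplit        # Python 2 module name
except ImportError:
    from urllib.parse import urlsplit, urlunsplit    # Python 3 module name

# Index 3 of the 5-tuple is the query component.
parts = list(urlsplit('http://www.cwi.nl/search?page=1#results'))
parts[3] = 'page=2'

next_page = urlunsplit(parts)
# 'http://www.cwi.nl/search?page=2#results'
```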

.. function:: urljoin(base, url[, allow_fragments])

   Construct a full ("absolute") URL by combining a "base URL" (*base*) with
   another URL (*url*). Informally, this uses components of the base URL, in
   particular the addressing scheme, the network location and (part of) the path,
   to provide missing components in the relative URL. For example:

      >>> from urlparse import urljoin
      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
      'http://www.cwi.nl/%7Eguido/FAQ.html'

   The *allow_fragments* argument has the same meaning and default as for
   :func:`urlparse`.

   .. note::

      If *url* is an absolute URL (that is, starting with ``//`` or ``scheme://``),
      the *url*'s host name and/or scheme will be present in the result. For example:

   .. doctest::

      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
      ...         '//www.python.org/%7Eguido')
      'http://www.python.org/%7Eguido'

   If you do not want that behavior, preprocess the *url* with :func:`urlsplit` and
   :func:`urlunsplit`, removing possible *scheme* and *netloc* parts.

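The preprocessing suggested for :func:`urljoin` could be sketched as below.
The helper name ``join_on_base_host`` is hypothetical and not part of the
module, and the import fallback is an assumption for Python 3. The sketch
does not handle paths that themselves begin with ``//``, which would need
extra care.

```python
try:
    from urlparse import urljoin, urlsplit, urlunsplit        # Python 2
except ImportError:
    from urllib.parse import urljoin, urlsplit, urlunsplit    # Python 3


def join_on_base_host(base, url):
    """Join *url* onto *base*, ignoring any scheme or host given in *url*.

    Hypothetical helper: it blanks the scheme and netloc of *url* before
    handing the remainder to urljoin.
    """
    parts = urlsplit(url)
    local = urlunsplit(('', '', parts.path, parts.query, parts.fragment))
    return urljoin(base, local)


base = 'http://www.cwi.nl/%7Eguido/Python.html'
plain = urljoin(base, '//www.python.org/%7Eguido')
# 'http://www.python.org/%7Eguido' (host taken from *url*)
pinned = join_on_base_host(base, '//www.python.org/%7Eguido')
# 'http://www.cwi.nl/%7Eguido' (host taken from *base*)
```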

.. function:: urldefrag(url)

   If *url* contains a fragment identifier, returns a modified version of *url*
   with no fragment identifier, and the fragment identifier as a separate string.
   If there is no fragment identifier in *url*, returns *url* unmodified and an
   empty string.

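Both cases described for :func:`urldefrag` are shown in this small sketch.
The URLs are invented, and the import fallback is an assumption for Python 3.

```python
try:
    from urlparse import urldefrag        # Python 2 module name
except ImportError:
    from urllib.parse import urldefrag    # Python 3 module name

# With a fragment: the URL without it, and the fragment on its own.
bare, fragment = urldefrag('http://www.cwi.nl/doc.html#intro')
# bare is 'http://www.cwi.nl/doc.html' and fragment is 'intro'

# Without a fragment: the URL unchanged, and an empty string.
same, empty = urldefrag('http://www.cwi.nl/doc.html')
# same is 'http://www.cwi.nl/doc.html' and empty is ''
```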

.. seealso::

   :rfc:`3986` - Uniform Resource Identifiers
      This is the current standard (STD66). Any changes to the :mod:`urlparse`
      module should conform to this. Certain deviations may be observed, which
      are mostly for backward compatibility purposes and for certain de facto
      parsing requirements as commonly observed in major browsers.

   :rfc:`2732` - Format for Literal IPv6 Addresses in URL's.
      This specifies the parsing requirements of IPv6 URLs.

   :rfc:`2396` - Uniform Resource Identifiers (URI): Generic Syntax
      Document describing the generic syntactic requirements for both Uniform Resource
      Names (URNs) and Uniform Resource Locators (URLs).

   :rfc:`2368` - The mailto URL scheme.
      Parsing requirements for mailto URL schemes.

   :rfc:`1808` - Relative Uniform Resource Locators
      This Request For Comments includes the rules for joining an absolute and a
      relative URL, including a fair number of "Abnormal Examples" which govern the
      treatment of border cases.

   :rfc:`1738` - Uniform Resource Locators (URL)
      This specifies the formal syntax and semantics of absolute URLs.


.. _urlparse-result-object:

Results of :func:`urlparse` and :func:`urlsplit`
------------------------------------------------

The result objects from the :func:`urlparse` and :func:`urlsplit` functions are
subclasses of the :class:`tuple` type. These subclasses add the attributes
described in those functions, as well as provide an additional method:


.. method:: ParseResult.geturl()

   Return the re-combined version of the original URL as a string. This may differ
   from the original URL in that the scheme will always be normalized to lower case
   and empty components may be dropped. Specifically, empty parameters, queries,
   and fragment identifiers will be removed.

   The result of this method is a fixpoint if passed back through the original
   parsing function:

      >>> import urlparse
      >>> url = 'HTTP://www.Python.org/doc/#'

      >>> r1 = urlparse.urlsplit(url)
      >>> r1.geturl()
      'http://www.Python.org/doc/'

      >>> r2 = urlparse.urlsplit(r1.geturl())
      >>> r2.geturl()
      'http://www.Python.org/doc/'

   .. versionadded:: 2.5

The following classes provide the implementations of the parse results:


.. class:: BaseResult

   Base class for the concrete result classes. This provides most of the attribute
   definitions. It does not provide a :meth:`geturl` method. It is derived from
   :class:`tuple`, but does not override the :meth:`__init__` or :meth:`__new__`
   methods.


.. class:: ParseResult(scheme, netloc, path, params, query, fragment)

   Concrete class for :func:`urlparse` results. The :meth:`__new__` method is
   overridden to support checking that the right number of arguments are passed.


.. class:: SplitResult(scheme, netloc, path, query, fragment)

   Concrete class for :func:`urlsplit` results. The :meth:`__new__` method is
   overridden to support checking that the right number of arguments are passed.

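The concrete result classes can also be constructed directly from component
strings, as in the sketch below. The component values are illustrative, and
the import fallback is an assumption for Python 3. The snippet relies on a
wrong argument count being reported as :exc:`TypeError`, which is how the
check described above is expected to surface.

```python
try:
    from urlparse import ParseResult, SplitResult        # Python 2 module name
except ImportError:
    from urllib.parse import ParseResult, SplitResult    # Python 3 module name

# Five components for SplitResult, six for ParseResult.
split = SplitResult('http', 'www.cwi.nl', '/doc/', 'q=1', '')
parsed = ParseResult('http', 'www.cwi.nl', '/doc/', '', 'q=1', '')

# Both behave as tuples and both provide geturl().
scheme, netloc, path, query, fragment = split
split_url = split.geturl()      # 'http://www.cwi.nl/doc/?q=1'
parsed_url = parsed.geturl()    # 'http://www.cwi.nl/doc/?q=1'

# Passing the wrong number of components is rejected.
try:
    SplitResult('http', 'www.cwi.nl')
    arity_checked = False
except TypeError:
    arity_checked = True
```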