:mod:`urlparse` --- Parse URLs into components
==============================================

.. module:: urlparse
   :synopsis: Parse URLs into or assemble them from components.


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   pair: URL; parsing
   pair: relative; URL

.. note::
   The :mod:`urlparse` module is renamed to :mod:`urllib.parse` in Python 3.0.
   The :term:`2to3` tool will automatically adapt imports when converting
   your sources to 3.0.


This module defines a standard interface to break Uniform Resource Locator (URL)
strings up into components (addressing scheme, network location, path, etc.), to
combine the components back into a URL string, and to convert a "relative URL"
to an absolute URL given a "base URL."

The module has been designed to match the Internet RFC on Relative Uniform
Resource Locators (and discovered a bug in an earlier draft!). It supports the
following URL schemes: ``file``, ``ftp``, ``gopher``, ``hdl``, ``http``,
``https``, ``imap``, ``mailto``, ``mms``, ``news``, ``nntp``, ``prospero``,
``rsync``, ``rtsp``, ``rtspu``, ``sftp``, ``shttp``, ``sip``, ``sips``,
``snews``, ``svn``, ``svn+ssh``, ``telnet``, ``wais``.

.. versionadded:: 2.5
   Support for the ``sftp`` and ``sips`` schemes.

The :mod:`urlparse` module defines the following functions:


.. function:: urlparse(urlstring[, scheme[, allow_fragments]])

   Parse a URL into six components, returning a 6-tuple. This corresponds to the
   general structure of a URL: ``scheme://netloc/path;parameters?query#fragment``.
   Each tuple item is a string, possibly empty. The components are not broken up
   into smaller parts (for example, the network location is a single string), and
   % escapes are not expanded. The delimiters as shown above are not part of the
   result, except for a leading slash in the *path* component, which is retained if
   present. For example:

      >>> from urlparse import urlparse
      >>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
      >>> o   # doctest: +NORMALIZE_WHITESPACE
      ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> o.scheme
      'http'
      >>> o.port
      80
      >>> o.geturl()
      'http://www.cwi.nl:80/%7Eguido/Python.html'


   If the URL does not specify a scheme, :func:`urlparse` follows the syntax
   specification in :rfc:`1808` and recognizes a netloc only if it is introduced
   by ``//``. Otherwise the netloc cannot be distinguished from the path, so the
   input is treated as a relative URL and the ambiguous component is classified
   as part of the path.

      >>> from urlparse import urlparse
      >>> urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
      ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')
      ParseResult(scheme='', netloc='', path='www.cwi.nl:80/%7Eguido/Python.html',
                  params='', query='', fragment='')
      >>> urlparse('help/Python.html')
      ParseResult(scheme='', netloc='', path='help/Python.html', params='',
                  query='', fragment='')

   If the *scheme* argument is specified, it gives the default addressing
   scheme, to be used only if the URL does not specify one. The default value for
   this argument is the empty string.

   If the *allow_fragments* argument is false, fragment identifiers are not
   allowed, even if the URL's addressing scheme normally does support them. The
   default value for this argument is :const:`True`.

   The return value is actually an instance of a subclass of :class:`tuple`. This
   class has the following additional read-only convenience attributes:

   +------------------+-------+--------------------------+----------------------+
   | Attribute        | Index | Value                    | Value if not present |
   +==================+=======+==========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier     | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part    | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path        | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`params`   | 3     | Parameters for last path | empty string         |
   |                  |       | element                  |                      |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`query`    | 4     | Query component          | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`fragment` | 5     | Fragment identifier      | empty string         |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`username` |       | User name                | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`password` |       | Password                 | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)   | :const:`None`        |
   +------------------+-------+--------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer,  | :const:`None`        |
   |                  |       | if present               |                      |
   +------------------+-------+--------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionchanged:: 2.5
      Added attributes to return value.

   .. versionchanged:: 2.7
      Added IPv6 URL parsing capabilities.


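As an illustration of the *scheme* and *allow_fragments* arguments and of the
convenience attributes described for :func:`urlparse`, here is a minimal
sketch. The host ``www.example.com`` and the credentials are placeholders, and
the fallback import is an assumption added only so that the snippet also runs
where the module is available under its Python 3 name, :mod:`urllib.parse`.

```python
try:
    from urlparse import urlparse          # Python 2 module name
except ImportError:
    from urllib.parse import urlparse      # same function after the rename

# The default scheme is applied only because the URL itself names none.
rel = urlparse('//www.example.com/index.html', 'https')

# A scheme given in the URL takes precedence over the default.
abs_ = urlparse('http://www.example.com/index.html', 'https')

# With allow_fragments false, '#top' is left in the path instead of
# being split off into the fragment component.
nofrag = urlparse('http://www.example.com/doc/page.html#top',
                  allow_fragments=False)

# Attributes derived from the network location (placeholder credentials).
auth = urlparse('http://user:secret@WWW.Example.com:8080/path')

print(rel.scheme)                      # 'https'
print(abs_.scheme)                     # 'http'
print(nofrag.path)                     # '/doc/page.html#top'
print(auth.hostname)                   # 'www.example.com'
print(auth.port)                       # 8080
```

Note that :attr:`hostname` is lower-cased while :attr:`netloc` keeps the
spelling used in the input.
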
.. function:: parse_qs(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`). Data are returned as a
   dictionary. The dictionary keys are the unique query variable names and the
   values are lists of values for each name.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in percent-encoded queries should be treated as blank strings. A true
   value indicates that blanks should be retained as blank strings. The default
   false value indicates that blank values are to be ignored and treated as if
   they were not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors. If false (the default), errors are silently ignored. If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.urlencode` function to convert such dictionaries into
   query strings.

   .. versionadded:: 2.6
      Copied from the :mod:`cgi` module.


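The effect of the two flags of :func:`parse_qs` can be sketched as follows.
The query strings are made-up sample data, and the fallback import is an
assumption so that the snippet also runs under the Python 3 module name.

```python
try:
    from urlparse import parse_qs          # Python 2 module name
except ImportError:
    from urllib.parse import parse_qs      # same function after the rename

qs = 'name=Guido&lang=python&lang=c&empty='

# By default the blank value of 'empty' is dropped entirely.
default = parse_qs(qs)

# keep_blank_values retains it as an empty string.
kept = parse_qs(qs, keep_blank_values=True)

# A field without '=' is a parsing error: ignored by default,
# reported as ValueError when strict_parsing is true.
lenient = parse_qs('name=Guido&novalue')
try:
    parse_qs('name=Guido&novalue', strict_parsing=True)
    strict_error = None
except ValueError as exc:
    strict_error = exc

print(sorted(default.items()))   # [('lang', ['python', 'c']), ('name', ['Guido'])]
print(kept['empty'])             # ['']
print(strict_error is not None)  # True
```
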
.. function:: parse_qsl(qs[, keep_blank_values[, strict_parsing]])

   Parse a query string given as a string argument (data of type
   :mimetype:`application/x-www-form-urlencoded`). Data are returned as a list of
   (name, value) pairs.

   The optional argument *keep_blank_values* is a flag indicating whether blank
   values in percent-encoded queries should be treated as blank strings. A true
   value indicates that blanks should be retained as blank strings. The default
   false value indicates that blank values are to be ignored and treated as if
   they were not included.

   The optional argument *strict_parsing* is a flag indicating what to do with
   parsing errors. If false (the default), errors are silently ignored. If true,
   errors raise a :exc:`ValueError` exception.

   Use the :func:`urllib.urlencode` function to convert such lists of pairs into
   query strings.

   .. versionadded:: 2.6
      Copied from the :mod:`cgi` module.


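Because :func:`parse_qsl` returns a list rather than a dictionary, the order
of the fields and repeated names are preserved, which makes a round trip
through ``urlencode`` straightforward. A small sketch with made-up sample
data (the fallback imports are an assumption for running under the Python 3
module names):

```python
try:
    from urlparse import parse_qsl         # Python 2 module names
    from urllib import urlencode
except ImportError:
    from urllib.parse import parse_qsl, urlencode

# Repeated names stay separate and keep their original position.
pairs = parse_qsl('a=1&b=2&a=3')

# The list of pairs can be turned back into a query string.
rebuilt = urlencode(pairs)

print(pairs)     # [('a', '1'), ('b', '2'), ('a', '3')]
print(rebuilt)   # a=1&b=2&a=3
```
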
.. function:: urlunparse(parts)

   Construct a URL from a tuple as returned by :func:`urlparse`. The *parts*
   argument can be any six-item iterable. This may result in a slightly
   different, but equivalent URL, if the URL that was parsed originally had
   unnecessary delimiters (for example, a ``?`` with an empty query; the RFC
   states that these are equivalent).


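A minimal sketch of :func:`urlunparse`, showing both the dropped redundant
delimiter and the use of an arbitrary six-item sequence. The URL is a
placeholder, and the fallback import is an assumption for running under the
Python 3 module name.

```python
try:
    from urlparse import urlparse, urlunparse        # Python 2 module name
except ImportError:
    from urllib.parse import urlparse, urlunparse    # after the rename

# The trailing '?' introduces an empty query, so it is not reproduced.
original = 'http://www.example.com/index.html?'
parts = urlparse(original)
rebuilt = urlunparse(parts)

# Any six-item iterable works, e.g. a list with one component replaced.
edited = list(parts)
edited[4] = 'page=2'          # index 4 is the query component
with_query = urlunparse(edited)

print(rebuilt)      # http://www.example.com/index.html
print(with_query)   # http://www.example.com/index.html?page=2
```
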
.. function:: urlsplit(urlstring[, scheme[, allow_fragments]])

   This is similar to :func:`urlparse`, but does not split the params from the URL.
   This should generally be used instead of :func:`urlparse` if the more recent URL
   syntax allowing parameters to be applied to each segment of the *path* portion
   of the URL (see :rfc:`2396`) is wanted. A separate function is needed to
   separate the path segments and parameters. This function returns a 5-tuple:
   (addressing scheme, network location, path, query, fragment identifier).

   The return value is actually an instance of a subclass of :class:`tuple`. This
   class has the following additional read-only convenience attributes:

   +------------------+-------+-------------------------+----------------------+
   | Attribute        | Index | Value                   | Value if not present |
   +==================+=======+=========================+======================+
   | :attr:`scheme`   | 0     | URL scheme specifier    | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`netloc`   | 1     | Network location part   | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`path`     | 2     | Hierarchical path       | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`query`    | 3     | Query component         | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`fragment` | 4     | Fragment identifier     | empty string         |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`username` |       | User name               | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`password` |       | Password                | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`hostname` |       | Host name (lower case)  | :const:`None`        |
   +------------------+-------+-------------------------+----------------------+
   | :attr:`port`     |       | Port number as integer, | :const:`None`        |
   |                  |       | if present              |                      |
   +------------------+-------+-------------------------+----------------------+

   See section :ref:`urlparse-result-object` for more information on the result
   object.

   .. versionadded:: 2.2

   .. versionchanged:: 2.5
      Added attributes to return value.


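The difference between :func:`urlparse` and :func:`urlsplit` shows up only
when the path carries ``;parameters``. The sketch below also includes
``split_segment_params``, a hypothetical helper (not part of this module)
standing in for the "separate function" mentioned above; it assumes an
absolute path and splits each segment at its first ``;``. The URLs are
placeholders and the fallback import is an assumption for Python 3.

```python
try:
    from urlparse import urlparse, urlsplit        # Python 2 module name
except ImportError:
    from urllib.parse import urlparse, urlsplit    # after the rename

url = 'http://www.example.com/a/b;type=x?q=1'

parsed = urlparse(url)    # params of the last segment are split off
split = urlsplit(url)     # params stay inside the path


def split_segment_params(path):
    """Hypothetical helper: (segment, params) pairs for an absolute path."""
    pairs = []
    for segment in path.split('/')[1:]:      # skip the empty leading piece
        name, _, params = segment.partition(';')
        pairs.append((name, params))
    return pairs


print(parsed.path, parsed.params)    # /a/b type=x
print(split.path)                    # /a/b;type=x
print(split_segment_params('/docs;v=2/api;type=x'))
# [('docs', 'v=2'), ('api', 'type=x')]
```
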
.. function:: urlunsplit(parts)

   Combine the elements of a tuple as returned by :func:`urlsplit` into a complete
   URL as a string. The *parts* argument can be any five-item iterable. This may
   result in a slightly different, but equivalent URL, if the URL that was parsed
   originally had unnecessary delimiters (for example, a ``?`` with an empty
   query; the RFC states that these are equivalent).

   .. versionadded:: 2.2


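A short sketch of :func:`urlunsplit`, rebuilding a URL after swapping one
component. The host names are placeholders, and the fallback import is an
assumption for running under the Python 3 module name.

```python
try:
    from urlparse import urlsplit, urlunsplit        # Python 2 module name
except ImportError:
    from urllib.parse import urlsplit, urlunsplit    # after the rename

parts = urlsplit('http://www.example.com/doc/?q=1#top')

# Build the same URL on a different (placeholder) host from a plain tuple.
moved = urlunsplit((parts.scheme, 'mirror.example.org',
                    parts.path, parts.query, parts.fragment))

# Empty query and fragment delimiters are not reproduced.
trimmed = urlunsplit(urlsplit('http://www.example.com/doc/?#'))

print(moved)     # http://mirror.example.org/doc/?q=1#top
print(trimmed)   # http://www.example.com/doc/
```
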
.. function:: urljoin(base, url[, allow_fragments])

   Construct a full ("absolute") URL by combining a "base URL" (*base*) with
   another URL (*url*). Informally, this uses components of the base URL, in
   particular the addressing scheme, the network location and (part of) the path,
   to provide missing components in the relative URL. For example:

      >>> from urlparse import urljoin
      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
      'http://www.cwi.nl/%7Eguido/FAQ.html'

   The *allow_fragments* argument has the same meaning and default as for
   :func:`urlparse`.

   .. note::

      If *url* is an absolute URL (that is, starting with ``//`` or ``scheme://``),
      the *url*'s host name and/or scheme will be present in the result. For example:

   .. doctest::

      >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
      ...         '//www.python.org/%7Eguido')
      'http://www.python.org/%7Eguido'

   If you do not want that behavior, preprocess the *url* with :func:`urlsplit` and
   :func:`urlunsplit`, removing possible *scheme* and *netloc* parts.


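The preprocessing suggested for :func:`urljoin` could look roughly like the
sketch below. ``join_same_host`` is a hypothetical helper name, not part of
this module; it simply discards the *scheme* and *netloc* of *url* before
joining. The fallback import is an assumption for the Python 3 module name.

```python
try:
    from urlparse import urljoin, urlsplit, urlunsplit       # Python 2
except ImportError:
    from urllib.parse import urljoin, urlsplit, urlunsplit   # after rename


def join_same_host(base, url):
    """Hypothetical helper: join *url* to *base*, ignoring url's own host."""
    parts = urlsplit(url)
    # Rebuild url with empty scheme and netloc, keeping the rest.
    local = urlunsplit(('', '', parts.path, parts.query, parts.fragment))
    return urljoin(base, local)


base = 'http://www.cwi.nl/%7Eguido/Python.html'

print(urljoin(base, '../index.html'))
# http://www.cwi.nl/index.html
print(urljoin(base, '//www.python.org/%7Eguido'))
# http://www.python.org/%7Eguido
print(join_same_host(base, '//www.python.org/%7Eguido'))
# http://www.cwi.nl/%7Eguido
```

This is only one possible policy; whether dropping the foreign host silently
is appropriate depends on the application.
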
.. function:: urldefrag(url)

   If *url* contains a fragment identifier, return a modified version of *url*
   with no fragment identifier, and the fragment identifier as a separate string.
   If there is no fragment identifier in *url*, return *url* unmodified and an
   empty string.


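Both cases of :func:`urldefrag` can be sketched with placeholder URLs; the
result is unpacked as a pair. The fallback import is an assumption for
running under the Python 3 module name.

```python
try:
    from urlparse import urldefrag         # Python 2 module name
except ImportError:
    from urllib.parse import urldefrag     # same function after the rename

# A URL with a fragment: the fragment is split off.
page, fragment = urldefrag('http://www.example.com/doc/page.html#section')

# A URL without a fragment comes back unchanged with an empty string.
same, empty = urldefrag('http://www.example.com/')

print(page)       # http://www.example.com/doc/page.html
print(fragment)   # section
print(repr(empty))  # ''
```
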
.. seealso::

   :rfc:`3986` - Uniform Resource Identifiers
      This is the current standard (STD66). Any changes to the :mod:`urlparse`
      module should conform to this. Certain deviations could be observed, which
      are mostly for backward compatibility purposes and for certain de facto
      parsing requirements as commonly observed in major browsers.

   :rfc:`2732` - Format for Literal IPv6 Addresses in URL's.
      This specifies the parsing requirements of IPv6 URLs.

   :rfc:`2396` - Uniform Resource Identifiers (URI): Generic Syntax
      Document describing the generic syntactic requirements for both Uniform Resource
      Names (URNs) and Uniform Resource Locators (URLs).

   :rfc:`2368` - The mailto URL scheme.
      Parsing requirements for mailto URL schemes.

   :rfc:`1808` - Relative Uniform Resource Locators
      This Request For Comments includes the rules for joining an absolute and a
      relative URL, including a fair number of "Abnormal Examples" which govern the
      treatment of border cases.

   :rfc:`1738` - Uniform Resource Locators (URL)
      This specifies the formal syntax and semantics of absolute URLs.


.. _urlparse-result-object:

Results of :func:`urlparse` and :func:`urlsplit`
------------------------------------------------

The result objects from the :func:`urlparse` and :func:`urlsplit` functions are
subclasses of the :class:`tuple` type. These subclasses add the attributes
described in those functions, as well as provide an additional method:


.. method:: ParseResult.geturl()

   Return the re-combined version of the original URL as a string. This may differ
   from the original URL in that the scheme will always be normalized to lower case
   and empty components may be dropped. Specifically, empty parameters, queries,
   and fragment identifiers will be removed.

   The result of this method is a fixed point if passed back through the original
   parsing function:

      >>> import urlparse
      >>> url = 'HTTP://www.Python.org/doc/#'

      >>> r1 = urlparse.urlsplit(url)
      >>> r1.geturl()
      'http://www.Python.org/doc/'

      >>> r2 = urlparse.urlsplit(r1.geturl())
      >>> r2.geturl()
      'http://www.Python.org/doc/'

   .. versionadded:: 2.5

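Since the result objects are tuples, they can be indexed and unpacked as well
as accessed by attribute. A small sketch (the fallback import is an assumption
for running under the Python 3 module name):

```python
try:
    from urlparse import urlsplit          # Python 2 module name
except ImportError:
    from urllib.parse import urlsplit      # same function after the rename

result = urlsplit('HTTP://www.Python.org/doc/?#')

# Tuple behaviour: unpacking and indexing.
scheme, netloc, path, query, fragment = result
first = result[0]

# Attribute access and the extra method.
normalized = result.geturl()

print(first == result.scheme == scheme)   # True
print(netloc, result.hostname)            # www.Python.org www.python.org
print(normalized)                         # http://www.Python.org/doc/
print(urlsplit(normalized).geturl() == normalized)   # True
```
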
The following classes provide the implementations of the parse results:


.. class:: BaseResult

   Base class for the concrete result classes. This provides most of the attribute
   definitions. It does not provide a :meth:`geturl` method. It is derived from
   :class:`tuple`, but does not override the :meth:`__init__` or :meth:`__new__`
   methods.


.. class:: ParseResult(scheme, netloc, path, params, query, fragment)

   Concrete class for :func:`urlparse` results. The :meth:`__new__` method is
   overridden to support checking that the right number of arguments are passed.


.. class:: SplitResult(scheme, netloc, path, query, fragment)

   Concrete class for :func:`urlsplit` results. The :meth:`__new__` method is
   overridden to support checking that the right number of arguments are passed.
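
The concrete result classes can also be instantiated directly, which is
occasionally handy for building a URL from known components. The sketch below
uses placeholder values; the exact exception type for a wrong argument count is
assumed here to be :exc:`TypeError`, and the fallback import is an assumption
for running under the Python 3 module name.

```python
try:
    from urlparse import ParseResult, SplitResult        # Python 2
except ImportError:
    from urllib.parse import ParseResult, SplitResult    # after the rename

# Six components, in the order documented for ParseResult.
r = ParseResult('http', 'www.example.com', '/index.html', '', 'q=1', '')
built = r.geturl()

# Passing too few components is rejected when the object is created.
try:
    SplitResult('http', 'www.example.com')
    rejected = False
except TypeError:
    rejected = True

print(built)      # http://www.example.com/index.html?q=1
print(rejected)   # True
```
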
355