\section{\module{urlparse} ---
         Parse URLs into components}
\declaremodule{standard}{urlparse}

\modulesynopsis{Parse URLs into components.}

\index{WWW}
\index{World Wide Web}
\index{URL}
\indexii{URL}{parsing}
\indexii{relative}{URL}

This module defines a standard interface to break Uniform Resource
Locator (URL) strings into components (addressing scheme, network
location, path, etc.), to combine the components back into a URL
string, and to convert a ``relative URL'' to an absolute URL given a
``base URL.''

The module has been designed to match the Internet RFC on Relative
Uniform Resource Locators, \rfc{1808} (and discovered a bug in an
earlier draft!).  It supports the following URL schemes:
\code{file}, \code{ftp}, \code{gopher}, \code{hdl}, \code{http},
\code{https}, \code{imap}, \code{mailto}, \code{mms}, \code{news},
\code{nntp}, \code{prospero}, \code{rsync}, \code{rtsp}, \code{rtspu},
\code{sftp}, \code{shttp}, \code{sip}, \code{sips}, \code{snews}, \code{svn},
\code{svn+ssh}, \code{telnet}, \code{wais}.
\versionadded[Support for the \code{sftp} and \code{sips} schemes]{2.5}

The \module{urlparse} module defines the following functions:

\begin{funcdesc}{urlparse}{urlstring\optional{, default_scheme\optional{, allow_fragments}}}
Parse a URL into six components, returning a 6-tuple: (addressing
scheme, network location, path, parameters, query, fragment
identifier).  This corresponds to the general structure of a URL:
\code{\var{scheme}://\var{netloc}/\var{path};\var{parameters}?\var{query}\#\var{fragment}}.
Each tuple item is a string, possibly empty.
The components are not broken up into smaller parts (e.g.\ the network
location is a single string), and \% escapes are not expanded.
The delimiters shown above are not part of the tuple items,
except for a leading slash in the \var{path} component, which is
retained if present.

Example:

\begin{verbatim}
urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
\end{verbatim}

yields the tuple

\begin{verbatim}
('http', 'www.cwi.nl:80', '/%7Eguido/Python.html', '', '', '')
\end{verbatim}

If the \var{default_scheme} argument is specified, it gives the
default addressing scheme, to be used only if the URL string does not
specify one.  The default value for this argument is the empty string.

If the \var{allow_fragments} argument is zero, fragment identifiers
are not recognized (any \code{\#} stays part of the preceding
component), even if the URL's addressing scheme normally does
support them.  The default value for this argument is \code{1}.
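
The following sketch illustrates a URL that uses every component, and
the effect of the two optional arguments.  The URLs are made up for
illustration, and the \code{try}/\code{except} import is an assumption
that lets the snippet also run on Python 3, where the same function
is provided by \module{urllib.parse}.

\begin{verbatim}
try:
    from urlparse import urlparse        # the module documented here
except ImportError:
    from urllib.parse import urlparse    # assumption: Python 3 location

# A URL with all six components (illustrative only).
full = tuple(urlparse(
    'http://www.cwi.nl:80/%7Eguido/Python.html;type=a?lang=en#intro'))
# ('http', 'www.cwi.nl:80', '/%7Eguido/Python.html', 'type=a', 'lang=en', 'intro')

# The default scheme is used only because the URL string has none.
with_default = tuple(urlparse('//www.cwi.nl/index.html', 'http'))
# ('http', 'www.cwi.nl', '/index.html', '', '', '')

# With allow_fragments false, '#intro' is left in the path.
no_frag = tuple(urlparse('http://www.cwi.nl/doc.html#intro', '', 0))
# ('http', 'www.cwi.nl', '/doc.html#intro', '', '', '')
\end{verbatim}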
\end{funcdesc}

\begin{funcdesc}{urlunparse}{tuple}
Construct a URL string from a tuple as returned by \function{urlparse()}.
This may result in a slightly different, but equivalent, URL if the
URL that was parsed originally had redundant delimiters, e.g.\ a
\code{?} with an empty query (the draft states that these are equivalent).
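
As a small illustration of the round trip described above, the sketch
below rebuilds a URL that ends in a redundant \code{?} and also
assembles a URL directly from a 6-tuple.  The URLs are illustrative,
and the fallback import is an assumption for running the snippet on
Python 3.

\begin{verbatim}
try:
    from urlparse import urlparse, urlunparse        # module documented here
except ImportError:
    from urllib.parse import urlparse, urlunparse    # assumption: Python 3

# The trailing '?' introduces an empty query, so it is dropped on output.
original = 'http://www.cwi.nl/doc.html?'
rebuilt = urlunparse(urlparse(original))
# 'http://www.cwi.nl/doc.html'

# A URL can also be assembled from a hand-written 6-tuple.
built = urlunparse(('http', 'www.cwi.nl:80', '/%7Eguido/Python.html',
                    '', '', ''))
# 'http://www.cwi.nl:80/%7Eguido/Python.html'
\end{verbatim}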
\end{funcdesc}

\begin{funcdesc}{urlsplit}{urlstring\optional{,
                           default_scheme\optional{, allow_fragments}}}
This is similar to \function{urlparse()}, but does not split the
parameters from the URL.  Use it instead of \function{urlparse()}
when the more recent URL syntax, which allows parameters to be
applied to each segment of the \var{path} portion of the URL
(see \rfc{2396}), is wanted.  A separate function is needed to
separate the path segments and parameters.  This function returns a
5-tuple: (addressing scheme, network location, path, query, fragment
identifier).
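
The difference from \function{urlparse()} can be seen on a URL whose
path segments each carry parameters.  This is only a sketch: the URL
is invented, and the fallback import is an assumption for running the
snippet on Python 3.

\begin{verbatim}
try:
    from urlparse import urlparse, urlsplit        # module documented here
except ImportError:
    from urllib.parse import urlparse, urlsplit    # assumption: Python 3

url = 'http://www.cwi.nl/a;x=1/b;y=2?q=1#frag'

# urlsplit() leaves every ';parameter' inside the path.
split = tuple(urlsplit(url))
# ('http', 'www.cwi.nl', '/a;x=1/b;y=2', 'q=1', 'frag')

# urlparse() splits off the parameters of the last path segment only.
parsed = tuple(urlparse(url))
# ('http', 'www.cwi.nl', '/a;x=1/b', 'y=2', 'q=1', 'frag')
\end{verbatim}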
\versionadded{2.2}
\end{funcdesc}

\begin{funcdesc}{urlunsplit}{tuple}
Combine the elements of a tuple as returned by \function{urlsplit()}
into a complete URL as a string.
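
One possible use, sketched below under the assumption of an invented
URL, is to split a URL, change a single component, and put the pieces
back together.  The fallback import is likewise an assumption for
running the snippet on Python 3.

\begin{verbatim}
try:
    from urlparse import urlsplit, urlunsplit        # module documented here
except ImportError:
    from urllib.parse import urlsplit, urlunsplit    # assumption: Python 3

# Split into a mutable list of five components.
parts = list(urlsplit('http://www.cwi.nl/search?q=python#results'))

# Index 3 is the query; replace it and leave the rest untouched.
parts[3] = 'q=urlparse'

new_url = urlunsplit(tuple(parts))
# 'http://www.cwi.nl/search?q=urlparse#results'
\end{verbatim}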
\versionadded{2.2}
\end{funcdesc}

\begin{funcdesc}{urljoin}{base, url\optional{, allow_fragments}}
Construct a full (``absolute'') URL by combining a ``base URL''
(\var{base}) with a ``relative URL'' (\var{url}).  Informally, this
uses components of the base URL, in particular the addressing scheme,
the network location and (part of) the path, to provide missing
components in the relative URL.

Example:

\begin{verbatim}
urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
\end{verbatim}

yields the string

\begin{verbatim}
'http://www.cwi.nl/%7Eguido/FAQ.html'
\end{verbatim}

The \var{allow_fragments} argument has the same meaning as for
\function{urlparse()}.
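
A few further cases, given here as an informal sketch rather than a
complete account of the joining rules, show which parts of the base
URL are reused.  The relative URLs are invented for illustration, and
the fallback import is an assumption for running the snippet on
Python 3.

\begin{verbatim}
try:
    from urlparse import urljoin        # module documented here
except ImportError:
    from urllib.parse import urljoin    # assumption: Python 3 location

base = 'http://www.cwi.nl/%7Eguido/Python.html'

# '..' climbs one directory above the base document's directory.
up = urljoin(base, '../index.html')
# 'http://www.cwi.nl/index.html'

# A leading slash replaces the whole base path.
rooted = urljoin(base, '/index.html')
# 'http://www.cwi.nl/index.html'

# A URL that is already absolute is returned as given.
absolute = urljoin(base, 'http://www.python.org/')
# 'http://www.python.org/'

# A bare fragment keeps the base document and adds the fragment.
anchored = urljoin(base, '#intro')
# 'http://www.cwi.nl/%7Eguido/Python.html#intro'
\end{verbatim}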
\end{funcdesc}

\begin{funcdesc}{urldefrag}{url}
If \var{url} contains a fragment identifier, return a pair consisting
of a modified version of \var{url} with no fragment identifier and the
fragment identifier as a separate string.  If there is no fragment
identifier in \var{url}, return \var{url} unmodified and an empty string.
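
Both cases can be sketched as follows.  The URLs are illustrative, and
the fallback import is an assumption for running the snippet on
Python 3.

\begin{verbatim}
try:
    from urlparse import urldefrag        # module documented here
except ImportError:
    from urllib.parse import urldefrag    # assumption: Python 3 location

# A URL with a fragment: the two parts come back separately.
url, fragment = urldefrag('http://www.cwi.nl/doc.html#intro')
# url == 'http://www.cwi.nl/doc.html', fragment == 'intro'

# A URL without a fragment: unchanged URL and an empty string.
plain_url, plain_fragment = urldefrag('http://www.cwi.nl/doc.html')
# plain_url == 'http://www.cwi.nl/doc.html', plain_fragment == ''
\end{verbatim}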
\end{funcdesc}


\begin{seealso}
  \seerfc{1738}{Uniform Resource Locators (URL)}{
        This specifies the formal syntax and semantics of absolute
        URLs.}
  \seerfc{1808}{Relative Uniform Resource Locators}{
        This Request For Comments includes the rules for joining an
        absolute and a relative URL, including a fair number of
        ``Abnormal Examples'' which govern the treatment of border
        cases.}
  \seerfc{2396}{Uniform Resource Identifiers (URI): Generic Syntax}{
        Document describing the generic syntactic requirements for
        both Uniform Resource Names (URNs) and Uniform Resource
        Locators (URLs).}
\end{seealso}