\section{Standard Module \module{urllib}}
\declaremodule{standard}{urllib}

\modulesynopsis{Open an arbitrary object given by URL (requires sockets).}

\index{WWW}
\index{World-Wide Web}
\index{URL}

This module provides a high-level interface for fetching data across
the World-Wide Web. In particular, the \function{urlopen()} function
is similar to the built-in function \function{open()}, but accepts
Uniform Resource Locators (URLs) instead of filenames. Some
restrictions apply: it can only open URLs for reading, and no seek
operations are available.

It defines the following public functions:

\begin{funcdesc}{urlopen}{url\optional{, data}}
Open a network object denoted by a URL for reading. If the URL does
not have a scheme identifier, or if it has \file{file:} as its scheme
identifier, this opens a local file; otherwise it opens a socket to a
server somewhere on the network. If the connection cannot be made, or
if the server returns an error code, the \exception{IOError} exception
is raised. If all goes well, a file-like object is returned. This
object supports the following methods: \method{read()},
\method{readline()}, \method{readlines()}, \method{fileno()},
\method{close()} and \method{info()}.

Except for the \method{info()} method,
these methods have the same interface as for
file objects; see section \ref{bltin-file-objects} in this
manual. (It is not a built-in file object, however, so it can't be
used at those few places where a true built-in file object is
required.)

The \method{info()} method returns an instance of the class
\class{mimetools.Message} containing the headers received from the
server, if the protocol uses such headers (currently the only
supported protocol that uses this is HTTP). See the description of
the \module{mimetools}\refstmodindex{mimetools} module.

If the \var{url} uses the \file{http:} scheme identifier, the optional
\var{data} argument may be given to specify a \code{POST} request
(normally the request type is \code{GET}). The \var{data} argument
must be in standard \file{application/x-www-form-urlencoded} format;
see the \function{urlencode()} function below.
\end{funcdesc}
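The file-like behaviour described for \function{urlopen()} can be
sketched as follows. The sketch uses a \file{file:} URL so that no
network access is needed. The import from \code{urllib.request} is an
assumption about a recent Python release; with the module documented
here the call would be spelled \code{urllib.urlopen()}. The file name
\file{hello.txt} is made up for illustration.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen  # assumption: location in recent Python releases

# Create a small local file so the example needs no network access.
with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "hello.txt"  # hypothetical file name
    path.write_bytes(b"first line\nsecond line\n")

    # A file: URL opens the local file; the result is file-like and read-only.
    response = urlopen(path.as_uri())
    try:
        first = response.readline()  # one line, as with an ordinary file
        rest = response.read()       # everything that remains
    finally:
        response.close()

print(first, rest)
```

In recent releases the data comes back as bytes rather than as a
string, so callers may need to decode it themselves.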

\begin{funcdesc}{urlretrieve}{url}
Copy a network object denoted by a URL to a local file, if necessary.
If the URL points to a local file, or a valid cached copy of the
object exists, the object is not copied. Return a tuple
\code{(\var{filename}, \var{headers})} where \var{filename} is the
local file name under which the object can be found, and \var{headers}
is either \code{None} (for a local object) or whatever the
\method{info()} method of the object returned by \function{urlopen()}
returned (for a remote object, possibly cached). Exceptions are the
same as for \function{urlopen()}.
\end{funcdesc}
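The ``not copied'' case for a local file can be illustrated with a
short sketch. It assumes a recent Python release, where the function is
imported from \code{urllib.request} rather than called as
\code{urllib.urlretrieve()}, and it uses a made-up file name. The
\var{headers} value is deliberately not inspected, since it may differ
from the \code{None} described above depending on the release.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve  # assumption: location in recent Python releases

with tempfile.TemporaryDirectory() as tmpdir:
    source = Path(tmpdir) / "data.txt"  # hypothetical file name
    source.write_text("payload\n")

    # For a local file no copy should be needed, so the returned name
    # is expected to refer to the original file.
    filename, headers = urlretrieve(source.as_uri())
    same_file = os.path.samefile(filename, source)
    with open(filename) as f:
        contents = f.read()

print(same_file, repr(contents))
```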

\begin{funcdesc}{urlcleanup}{}
Clear the cache that may have been built up by previous calls to
\function{urlretrieve()}.
\end{funcdesc}

\begin{funcdesc}{quote}{string\optional{, safe}}
Replace special characters in \var{string} using the \samp{\%xx} escape.
Letters, digits, and the characters \character{_,.-} are never quoted.
The optional \var{safe} parameter specifies additional characters
that should not be quoted; its default value is \code{'/'}.

Example: \code{quote('/\~connolly/')} yields \code{'/\%7econnolly/'}.
\end{funcdesc}
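The quoting rule can be sketched in a few lines. This is an
illustration written from the description above, not the module's own
code; the names \code{ALWAYS_SAFE} and \code{quote_sketch} are
hypothetical, and the sketch assumes single-byte characters. Recent
Python releases treat some characters differently, so their
\function{quote()} need not produce the same output.

```python
import string

# Characters the description says are never quoted: letters, digits, "_,.-".
ALWAYS_SAFE = string.ascii_letters + string.digits + "_,.-"


def quote_sketch(text, safe="/"):
    """Replace each character outside the safe set with a %xx escape."""
    keep = ALWAYS_SAFE + safe
    pieces = []
    for ch in text:
        if ch in keep:
            pieces.append(ch)
        else:
            # Two lowercase hex digits, as in the documented example.
            pieces.append("%%%02x" % ord(ch))
    return "".join(pieces)


print(quote_sketch("/~connolly/"))   # /%7econnolly/
print(quote_sketch("a/b c", safe=""))  # a%2fb%20c
```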

\begin{funcdesc}{quote_plus}{string\optional{, safe}}
Like \function{quote()}, but also replaces spaces by plus signs, as
required for quoting HTML form values. Plus signs in the original
string are escaped unless they are included in \var{safe}.
\end{funcdesc}
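A short example of the treatment of spaces and plus signs follows. It
assumes a recent Python release, where the function is imported from
\code{urllib.parse}; the input strings are arbitrary.

```python
from urllib.parse import quote_plus  # assumption: location in recent Python releases

# Spaces become plus signs; a literal plus sign is escaped.
escaped = quote_plus("a b+c")
print(escaped)  # a+b%2Bc

# Listing "+" in safe leaves literal plus signs alone.
kept = quote_plus("a b+c", safe="+")
print(kept)  # a+b+c
```

Note that with \code{safe='+'} the result no longer distinguishes an
original space from an original plus sign.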

\begin{funcdesc}{unquote}{string}
Replace \samp{\%xx} escapes by their single-character equivalent.

Example: \code{unquote('/\%7Econnolly/')} yields \code{'/\~connolly/'}.
\end{funcdesc}

\begin{funcdesc}{unquote_plus}{string}
Like \function{unquote()}, but also replaces plus signs by spaces, as
required for unquoting HTML form values.
\end{funcdesc}
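The difference between the two unquoting functions shows up only when
the input contains plus signs. The example below assumes a recent
Python release, where both functions are imported from
\code{urllib.parse}.

```python
from urllib.parse import unquote, unquote_plus  # assumption: recent Python location

plain = unquote("/%7Econnolly/")
print(plain)  # /~connolly/

# unquote() leaves plus signs alone; unquote_plus() turns them into spaces.
form_value = "a+b%2Bc"
left_alone = unquote(form_value)
as_spaces = unquote_plus(form_value)
print(left_alone)  # a+b+c
print(as_spaces)   # a b+c
```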

\begin{funcdesc}{urlencode}{dict}
Convert a dictionary to a ``url-encoded'' string, suitable to pass to
\function{urlopen()} above as the optional \var{data} argument. This
is useful to pass a dictionary of form fields to a \code{POST}
request. The resulting string is a series of \var{key}\code{=}\var{value}
pairs separated by \code{\&} characters, where both \var{key} and
\var{value} are quoted using \function{quote_plus()} above.
\end{funcdesc}
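As an illustration, the following encodes a small dictionary of form
fields. It assumes a recent Python release, where the function is
imported from \code{urllib.parse} and dictionaries keep their insertion
order; the field names and values are invented.

```python
from urllib.parse import urlencode  # assumption: location in recent Python releases

# Hypothetical form fields; keys and values are both quoted.
fields = {"name": "Guido van Rossum", "q": "a&b=c"}
data = urlencode(fields)
print(data)  # name=Guido+van+Rossum&q=a%26b%3Dc
```

The resulting string is what would be passed as the \var{data}
argument of \function{urlopen()} for a \code{POST} request.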

Restrictions:

\begin{itemize}

\item
Currently, only the following protocols are supported: HTTP (versions
0.9 and 1.0), Gopher (but not Gopher-+), FTP, and local files.
\indexii{HTTP}{protocol}
\indexii{Gopher}{protocol}
\indexii{FTP}{protocol}

\item
The caching feature of \function{urlretrieve()} is currently disabled,
pending proper processing of expiration time headers.

\item
There is currently no function to query whether a particular URL is in
the cache.

\item
For backward compatibility, if a URL appears to point to a local file
but the file can't be opened, the URL is re-interpreted using the FTP
protocol. This can sometimes cause confusing error messages.

\item
The \function{urlopen()} and \function{urlretrieve()} functions can
cause arbitrarily long delays while waiting for a network connection
to be set up. This means that it is difficult to build an interactive
web client using these functions without using threads.

\item
The data returned by \function{urlopen()} or \function{urlretrieve()}
is the raw data returned by the server. This may be binary data
(e.g. an image), plain text or (for example) HTML. The HTTP protocol
provides type information in the reply header, which can be inspected
by looking at the \code{content-type} header. For the Gopher protocol,
type information is encoded in the URL; there is currently no easy way
to extract it. If the returned data is HTML, you can use the module
\module{htmllib}\refstmodindex{htmllib} to parse it.
\index{HTML}
\indexii{HTTP}{protocol}
\indexii{Gopher}{protocol}

\item
Although the \module{urllib} module contains (undocumented) routines
to parse and unparse URL strings, the recommended interface for URL
manipulation is in module \module{urlparse}\refstmodindex{urlparse}.

\end{itemize}
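Inspecting the \code{content-type} header through \method{info()}, as
mentioned in the restriction on raw data above, might look like the
sketch below. It rests on two assumptions about recent Python releases:
\function{urlopen()} is imported from \code{urllib.request}, and a
type guessed from the file name is reported for \file{file:} URLs, which
lets the example run without a network. The file name
\file{index.html} is made up.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen  # assumption: location in recent Python releases

with tempfile.TemporaryDirectory() as tmpdir:
    page = Path(tmpdir) / "index.html"  # hypothetical file name
    page.write_bytes(b"<html><body>hi</body></html>")

    response = urlopen(page.as_uri())
    try:
        headers = response.info()
        # Header lookup is case-insensitive.
        content_type = headers.get("content-type")
        body = response.read()  # raw data, returned unmodified
    finally:
        response.close()

# Decide how to treat the raw data based on the reported type.
is_html = content_type is not None and content_type.startswith("text/html")
print(content_type, is_html)
```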