\section{\module{urllib} ---
         Open an arbitrary object given by URL.}
\declaremodule{standard}{urllib}

\modulesynopsis{Open an arbitrary object given by URL (requires sockets).}

\index{WWW}
\index{World-Wide Web}
\index{URL}

This module provides a high-level interface for fetching data across
the World-Wide Web. In particular, the \function{urlopen()} function
is similar to the built-in function \function{open()}, but accepts
Uniform Resource Locators (URLs) instead of filenames. Some
restrictions apply: it can only open URLs for reading, and no seek
operations are available.

It defines the following public functions:

\begin{funcdesc}{urlopen}{url\optional{, data}}
Open a network object denoted by a URL for reading. If the URL does
not have a scheme identifier, or if it has \file{file:} as its scheme
identifier, this opens a local file; otherwise it opens a socket to a
server somewhere on the network. If the connection cannot be made, or
if the server returns an error code, the \exception{IOError} exception
is raised. If all goes well, a file-like object is returned. This
object supports the following methods: \method{read()},
\method{readline()}, \method{readlines()}, \method{fileno()},
\method{close()} and \method{info()}.

Except for the \method{info()} method,
these methods have the same interface as for
file objects; see section \ref{bltin-file-objects} in this
manual. (The returned object is not a built-in file object, however,
so it can't be used at those few places where a true built-in file
object is required.)

The \method{info()} method returns an instance of the class
\class{mimetools.Message} containing meta-information associated
with the URL. When the access method is HTTP, these headers are those
returned by the server at the head of the retrieved HTML page
(including Content-Length and Content-Type). When the method is FTP,
a Content-Length header will be present if (as is now usual) the
server passed back a file length in response to the FTP retrieval
request. When the method is local-file, returned headers will include
a Date representing the file's last-modified time, a Content-Length
giving file size, and a Content-Type containing a guess at the file's
type. See also the description of the
\module{mimetools}\refstmodindex{mimetools} module.

If the \var{url} uses the \file{http:} scheme identifier, the optional
\var{data} argument may be given to specify a \code{POST} request
(normally the request type is \code{GET}). The \var{data} argument
must be in standard \file{application/x-www-form-urlencoded} format;
see the \function{urlencode()} function below.
\end{funcdesc}

\begin{funcdesc}{urlretrieve}{url\optional{, filename}\optional{, hook}}
Copy a network object denoted by a URL to a local file, if necessary.
If the URL points to a local file, or a valid cached copy of the
object exists, the object is not copied. Return a tuple
\code{(\var{filename}, \var{headers})} where \var{filename} is the
local file name under which the object can be found, and \var{headers}
is either \code{None} (for a local object) or whatever the
\method{info()} method of the object returned by \function{urlopen()}
returned (for a remote object, possibly cached). Exceptions are the
same as for \function{urlopen()}.

The second argument, if present, specifies the file location to copy
to (if absent, the location will be a temporary file with a generated
name). The third argument, if present, is a hook function that will be
called once on establishment of the network connection and once after
each block read thereafter. The hook will be passed three arguments: a
count of blocks transferred so far, a block size in bytes, and the
total size of the file. The total size passed to the hook may be
\code{-1} on older FTP servers which do not return a file size in
response to a retrieval request.
\end{funcdesc}
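
The three-argument hook convention described for \function{urlretrieve()}
can be sketched as follows. This is only an illustration: the names
\code{progress_fraction} and \code{report} are hypothetical and are not
part of the module, and the sketch makes no network call itself.

```python
def progress_fraction(block_count, block_size, total_size):
    """Return the fraction transferred so far, or None if the size is unknown.

    Hypothetical helper; the arguments mirror the three values the text
    says are passed to the hook.
    """
    if total_size < 0:
        # Size not reported, e.g. -1 from an older FTP server.
        return None
    if total_size == 0:
        return 1.0
    # The last block is usually partial, so cap the result at 1.0.
    return min(1.0, float(block_count * block_size) / total_size)


def report(block_count, block_size, total_size):
    """Hypothetical hook: print a one-line progress message per call."""
    fraction = progress_fraction(block_count, block_size, total_size)
    if fraction is None:
        print("%d bytes read (total size unknown)" % (block_count * block_size))
    else:
        print("%.0f%% complete" % (fraction * 100))
```

A function such as \code{report} could then be supplied as the third
argument to \function{urlretrieve()}. The first call arrives with a block
count of zero, which is why the sketch reports 0\% at that point.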

\begin{funcdesc}{urlcleanup}{}
Clear the cache that may have been built up by previous calls to
\function{urlretrieve()}.
\end{funcdesc}

\begin{funcdesc}{quote}{string\optional{, safe}}
Replace special characters in \var{string} using the \samp{\%xx} escape.
Letters, digits, and the characters \character{_,.-} are never quoted.
The optional \var{safe} parameter specifies additional characters
that should not be quoted; its default value is \code{'/'}.

Example: \code{quote('/\~connolly/')} yields \code{'/\%7econnolly/'}.
\end{funcdesc}

\begin{funcdesc}{quote_plus}{string\optional{, safe}}
Like \function{quote()}, but also replaces spaces by plus signs, as
required for quoting HTML form values. Plus signs in the original
string are escaped unless they are included in \var{safe}.
\end{funcdesc}
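
The difference between \function{quote()} and \function{quote_plus()} can
be seen in a short sketch. It assumes it may be run on a modern Python 3
interpreter, where these functions live in \module{urllib.parse} rather
than in the module documented here; that import location is an assumption
of the example, not something this section states. Note also that such
interpreters emit upper-case hex digits and leave \samp{\~} unquoted, so
their output for the \code{'/\~connolly/'} example above differs.

```python
try:
    # Assumption: Python 3 keeps these functions in urllib.parse.
    from urllib.parse import quote, quote_plus
except ImportError:
    # The module documented in this section.
    from urllib import quote, quote_plus

# quote() leaves '/' alone by default and escapes the space as %20.
path = quote('/docs/my file.txt')

# quote_plus() turns spaces into '+' and escapes a literal '+' and '='.
field = quote_plus('1+1 = 2')

# Listing '+' in the safe argument keeps plus signs unescaped.
kept = quote_plus('a+b', safe='+')

print(path)   # /docs/my%20file.txt
print(field)  # 1%2B1+%3D+2
print(kept)   # a+b
```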

\begin{funcdesc}{unquote}{string}
Replace \samp{\%xx} escapes by their single-character equivalent.

Example: \code{unquote('/\%7Econnolly/')} yields \code{'/\~connolly/'}.
\end{funcdesc}

\begin{funcdesc}{unquote_plus}{string}
Like \function{unquote()}, but also replaces plus signs by spaces, as
required for unquoting HTML form values.
\end{funcdesc}
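
A brief sketch of how \function{unquote()} and \function{unquote_plus()}
treat the same input. As an assumption of the example, the import falls
back between \module{urllib.parse} (modern Python 3) and the module
documented here.

```python
try:
    # Assumption: Python 3 keeps these functions in urllib.parse.
    from urllib.parse import unquote, unquote_plus
except ImportError:
    # The module documented in this section.
    from urllib import unquote, unquote_plus

home = unquote('/%7Econnolly/')   # %7E decodes to '~'
plain = unquote('a+b%20c')        # the plus sign is left untouched
form = unquote_plus('a+b%20c')    # the plus sign becomes a space

print(home)   # /~connolly/
print(plain)  # a+b c
print(form)   # a b c
```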

\begin{funcdesc}{urlencode}{dict}
Convert a dictionary to a ``url-encoded'' string, suitable to pass to
\function{urlopen()} above as the optional \var{data} argument. This
is useful to pass a dictionary of form fields to a \code{POST}
request. The resulting string is a series of \var{key}\code{=}\var{value}
pairs separated by \code{\&} characters, where both \var{key} and
\var{value} are quoted using \function{quote_plus()} above.
\end{funcdesc}
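
The output format of \function{urlencode()} can be illustrated with a
small sketch. The field names and values are made up for the example,
and the import again assumes either \module{urllib.parse} (modern
Python 3) or the module documented here. The order of the pairs follows
the dictionary's own ordering, so the sketch sorts them before printing.

```python
try:
    # Assumption: Python 3 keeps this function in urllib.parse.
    from urllib.parse import urlencode
except ImportError:
    # The module documented in this section.
    from urllib import urlencode

# Hypothetical form fields; spaces and '&' need quoting.
fields = {'name': 'Ada Lovelace', 'topic': 'a&b'}

data = urlencode(fields)

# Split on the '&' separators to inspect the individual key=value pairs.
pairs = sorted(data.split('&'))
print(pairs)  # ['name=Ada+Lovelace', 'topic=a%26b']
```

A string like \code{data} is what the text describes passing as the
optional \var{data} argument of \function{urlopen()} for a \code{POST}
request.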

Restrictions:

\begin{itemize}

\item
Currently, only the following protocols are supported: HTTP (versions
0.9 and 1.0), Gopher (but not Gopher-+), FTP, and local files.
\indexii{HTTP}{protocol}
\indexii{Gopher}{protocol}
\indexii{FTP}{protocol}

\item
The caching feature of \function{urlretrieve()} has been disabled
until proper processing of expiration time headers is implemented.

\item
There should be a function to query whether a particular URL is in
the cache.

\item
For backward compatibility, if a URL appears to point to a local file
but the file can't be opened, the URL is re-interpreted using the FTP
protocol. This can sometimes cause confusing error messages.

\item
The \function{urlopen()} and \function{urlretrieve()} functions can
cause arbitrarily long delays while waiting for a network connection
to be set up. This means that it is difficult to build an interactive
web client using these functions without using threads.

\item
The data returned by \function{urlopen()} or \function{urlretrieve()}
is the raw data returned by the server. This may be binary data
(e.g. an image), plain text or (for example) HTML. The HTTP protocol
provides type information in the reply header, which can be inspected
by looking at the \code{content-type} header. For the Gopher protocol,
type information is encoded in the URL; there is currently no easy way
to extract it. If the returned data is HTML, you can use the module
\module{htmllib}\refstmodindex{htmllib} to parse it.
\index{HTML}
\indexii{HTTP}{protocol}
\indexii{Gopher}{protocol}

\item
Although the \module{urllib} module contains (undocumented) routines
to parse and unparse URL strings, the recommended interface for URL
manipulation is in module \module{urlparse}\refstmodindex{urlparse}.

\end{itemize}