\section{\module{urllib} ---
         Open an arbitrary object given by URL.}
\declaremodule{standard}{urllib}

\modulesynopsis{Open an arbitrary object given by URL (requires sockets).}

\index{WWW}
\index{World-Wide Web}
\index{URL}

This module provides a high-level interface for fetching data across
the World-Wide Web.  In particular, the \function{urlopen()} function
is similar to the built-in function \function{open()}, but accepts
Uniform Resource Locators (URLs) instead of filenames.  Some
restrictions apply: it can only open URLs for reading, and no seek
operations are available.

It defines the following public functions:

\begin{funcdesc}{urlopen}{url\optional{, data}}
Open a network object denoted by a URL for reading.  If the URL does
not have a scheme identifier, or if it has \file{file:} as its scheme
identifier, this opens a local file; otherwise it opens a socket to a
server somewhere on the network.  If the connection cannot be made, or
if the server returns an error code, the \exception{IOError} exception
is raised.  If all goes well, a file-like object is returned.  This
supports the following methods: \method{read()}, \method{readline()},
\method{readlines()}, \method{fileno()}, \method{close()} and
\method{info()}.

Except for the \method{info()} method,
these methods have the same interface as for
file objects; see section \ref{bltin-file-objects} in this
manual.  (It is not a built-in file object, however, so it can't be
used at those few places where a true built-in file object is
required.)

The \method{info()} method returns an instance of the class
\class{mimetools.Message} containing meta-information associated
with the URL.  When the scheme is HTTP, these headers are those
returned by the server at the head of the retrieved HTML page
(including Content-Length and Content-Type).  When the scheme is FTP,
a Content-Length header will be present if (as is now usual) the
server passed back a file length in response to the FTP retrieval
request.  When the URL refers to a local file, returned headers will
include a Date representing the file's last-modified time, a
Content-Length giving file size, and a Content-Type containing a guess
at the file's type.  See also the description of the
\module{mimetools}\refstmodindex{mimetools} module.

If the \var{url} uses the \file{http:} scheme identifier, the optional
\var{data} argument may be given to specify a \code{POST} request
(normally the request type is \code{GET}).  The \var{data} argument
must be in standard \file{application/x-www-form-urlencoded} format;
see the \function{urlencode()} function below.
\end{funcdesc}
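
As a rough illustration of the file-like interface described for
\function{urlopen()}, the following sketch reads a local file through a
\file{file:} URL and inspects the headers returned by \method{info()}.
It assumes a modern Python 3 interpreter, where the function is exposed
as \code{urllib.request.urlopen} rather than \code{urllib.urlopen}; the
helper name \code{read_local_url} and the temporary demonstration file
are hypothetical and not part of the module.

```python
import os
import pathlib
import tempfile
import urllib.request  # assumption: Python 3 home of urlopen()


def read_local_url(path):
    """Open a local file via a file: URL; return (data, headers)."""
    url = pathlib.Path(path).resolve().as_uri()  # builds a file: URL
    response = urllib.request.urlopen(url)
    try:
        headers = response.info()  # meta-information about the object
        data = response.read()     # raw bytes, read-only, no seeking
    finally:
        response.close()
    return data, headers


# Hypothetical demonstration file, removed again after reading.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"hello, urllib\n")

demo_data, demo_headers = read_local_url(demo_path)
os.remove(demo_path)
```

For a local file, the Content-Length header should match the number of
bytes read; header lookups on the returned object are not
case-sensitive in this Python 3 setting.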

\begin{funcdesc}{urlretrieve}{url\optional{, filename}\optional{, hook}}
Copy a network object denoted by a URL to a local file, if necessary.
If the URL points to a local file, or a valid cached copy of the
object exists, the object is not copied.  Return a tuple
\code{(\var{filename}, \var{headers})} where \var{filename} is the
local file name under which the object can be found, and \var{headers}
is either \code{None} (for a local object) or whatever the
\method{info()} method of the object returned by \function{urlopen()}
returned (for a remote object, possibly cached).  Exceptions are the
same as for \function{urlopen()}.

The second argument, if present, specifies the file location to copy
to (if absent, the location will be a tempfile with a generated name).
The third argument, if present, is a hook function that will be called
once on establishment of the network connection and once after each
block read thereafter.  The hook will be passed three arguments: a
count of blocks transferred so far, a block size in bytes, and the
total size of the file.  The third argument passed to the hook may be
\code{-1} on older FTP servers which do not return a file size in
response to a retrieval request.
\end{funcdesc}
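
The hook protocol of \function{urlretrieve()} can be sketched without
touching the network.  The example below is only an illustration: the
names \code{percent_done} and \code{ProgressRecorder} are hypothetical,
and the loop merely simulates the sequence of calls described above
(one call when the connection is established, then one per block) for
an assumed 2500-byte object read in 1024-byte blocks.

```python
def percent_done(block_count, block_size, total_size):
    """Return progress as a percentage, or None if the size is unknown."""
    if total_size <= 0:
        # -1 signals that the server reported no size; a size of 0 is
        # treated the same way here to avoid dividing by zero.
        return None
    # The last block is usually short, so cap at the total size.
    transferred = min(block_count * block_size, total_size)
    return 100.0 * transferred / total_size


class ProgressRecorder:
    """Hypothetical hook object that remembers every progress report."""

    def __init__(self):
        self.reports = []

    def __call__(self, block_count, block_size, total_size):
        self.reports.append(percent_done(block_count, block_size, total_size))


# Simulated hook calls: block counts 0, 1, 2 and 3.
recorder = ProgressRecorder()
for blocks in range(4):
    recorder(blocks, 1024, 2500)
```

An instance such as \code{recorder} could then be passed as the third
argument to \function{urlretrieve()}; a real hook would more likely
print the percentage or update a display than store it.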

\begin{funcdesc}{urlcleanup}{}
Clear the cache that may have been built up by previous calls to
\function{urlretrieve()}.
\end{funcdesc}

\begin{funcdesc}{quote}{string\optional{, safe}}
Replace special characters in \var{string} using the \samp{\%xx} escape.
Letters, digits, and the characters \character{_,.-} are never quoted.
The optional \var{safe} parameter specifies additional characters
that should not be quoted; its default value is \code{'/'}.

Example: \code{quote('/\~{}connolly/')} yields \code{'/\%7econnolly/'}.
\end{funcdesc}

\begin{funcdesc}{quote_plus}{string\optional{, safe}}
Like \function{quote()}, but also replaces spaces by plus signs, as
required for quoting HTML form values.  Plus signs in the original
string are escaped unless they are included in \var{safe}.
\end{funcdesc}

\begin{funcdesc}{unquote}{string}
Replace \samp{\%xx} escapes by their single-character equivalent.

Example: \code{unquote('/\%7Econnolly/')} yields \code{'/\~{}connolly/'}.
\end{funcdesc}

\begin{funcdesc}{unquote_plus}{string}
Like \function{unquote()}, but also replaces plus signs by spaces, as
required for unquoting HTML form values.
\end{funcdesc}
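
The difference between the plain and the ``plus'' variants of the
quoting functions can be seen in a short round trip.  This is a sketch
under the assumption of a modern Python 3 interpreter, where the four
functions are found in \code{urllib.parse} rather than directly in
\module{urllib}; the sample path is made up, and the exact set of
characters left unquoted may differ between Python versions.

```python
# Assumption: Python 3, where these functions live in urllib.parse.
from urllib.parse import quote, quote_plus, unquote, unquote_plus

path = "/my docs/a+b.txt"       # made-up sample value

quoted = quote(path)            # space becomes %20; '/' is safe by default
quoted_plus = quote_plus(path)  # space becomes '+'; a literal '+' is escaped

# Each unquoting function reverses its quoting counterpart.
restored = unquote(quoted)
restored_plus = unquote_plus(quoted_plus)
```

Note that \function{unquote()} leaves plus signs alone, so a string
produced by \function{quote_plus()} should be decoded with
\function{unquote_plus()}.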

\begin{funcdesc}{urlencode}{dict}
Convert a dictionary to a ``url-encoded'' string, suitable to pass to
\function{urlopen()} above as the optional \var{data} argument.  This
is useful to pass a dictionary of form fields to a \code{POST}
request.  The resulting string is a series of
\code{\var{key}=\var{value}} pairs separated by \character{\&}
characters, where both \var{key} and \var{value} are quoted using
\function{quote_plus()} above.
\end{funcdesc}
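
A small sketch of how \function{urlencode()} might be used to prepare
form fields follows.  It assumes Python 3, where the function is
available as \code{urllib.parse.urlencode}; the field names and values
are invented, and \code{parse_qs} (also from \code{urllib.parse}) is
used here only to check that the encoding can be reversed.

```python
# Assumption: Python 3, where urlencode() lives in urllib.parse.
from urllib.parse import parse_qs, urlencode

# Invented form fields; the second value contains '&' and '=' on
# purpose, to show that separators inside values get escaped.
fields = {"name": "Guido van Rossum", "q": "a&b=c"}

body = urlencode(fields)   # key=value pairs joined by '&'
decoded = parse_qs(body)   # maps each key to a list of values
```

Because the separator characters inside the values are escaped, the
only unescaped \character{\&} in \code{body} is the one between the two
pairs.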

Restrictions:

\begin{itemize}

\item
Currently, only the following protocols are supported: HTTP (versions
0.9 and 1.0), Gopher (but not Gopher+), FTP, and local files.
\indexii{HTTP}{protocol}
\indexii{Gopher}{protocol}
\indexii{FTP}{protocol}

\item
The caching feature of \function{urlretrieve()} has been disabled
until proper processing of Expiration time headers is implemented.

\item
There should be a function to query whether a particular URL is in
the cache.

\item
For backward compatibility, if a URL appears to point to a local file
but the file can't be opened, the URL is re-interpreted using the FTP
protocol.  This can sometimes cause confusing error messages.

\item
The \function{urlopen()} and \function{urlretrieve()} functions can
cause arbitrarily long delays while waiting for a network connection
to be set up.  This means that it is difficult to build an interactive
web client using these functions without using threads.

\item
The data returned by \function{urlopen()} or \function{urlretrieve()}
is the raw data returned by the server.  This may be binary data
(e.g. an image), plain text or (for example) HTML.  The HTTP protocol
provides type information in the reply header, which can be inspected
by looking at the \code{content-type} header.  For the Gopher protocol,
type information is encoded in the URL; there is currently no easy way
to extract it.  If the returned data is HTML, you can use the module
\module{htmllib}\refstmodindex{htmllib} to parse it.
\index{HTML}
\indexii{HTTP}{protocol}
\indexii{Gopher}{protocol}

\item
Although the \module{urllib} module contains (undocumented) routines
to parse and unparse URL strings, the recommended interface for URL
manipulation is in module \module{urlparse}\refstmodindex{urlparse}.

\end{itemize}
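
The restriction concerning raw data suggests checking the
\code{content-type} header before deciding how to treat a reply.  The
sketch below shows one possible way to do so.  It assumes a header
object that offers a \code{get_content_type()} method, as the message
objects in modern Python 3 do; the \class{mimetools.Message} class
mentioned earlier may expose this information differently.  The
function name \code{classify_reply} and the sample headers are
hypothetical.

```python
import email  # standard library; used here only to build sample headers


def classify_reply(headers):
    """Return 'html', 'text' or 'binary' based on the content-type header."""
    content_type = headers.get_content_type()  # parameters are stripped
    if content_type == "text/html":
        return "html"
    if content_type.startswith("text/"):
        return "text"
    return "binary"


# Hypothetical headers standing in for what info() might return.
html_headers = email.message_from_string(
    "Content-Type: text/html; charset=iso-8859-1\n\n")
text_headers = email.message_from_string("Content-Type: text/plain\n\n")
image_headers = email.message_from_string("Content-Type: image/gif\n\n")
```

Only a reply classified as \code{'html'} would be worth handing to an
HTML parser; anything classified as \code{'binary'} is better written
to a file unchanged.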