\section{Standard Module \module{urllib}}
\label{module-urllib}
\stmodindex{urllib}
\index{WWW}
\index{World-Wide Web}
\index{URL}


This module provides a high-level interface for fetching data across
the World-Wide Web.  In particular, the \function{urlopen()} function
is similar to the built-in function \function{open()}, but accepts
Uniform Resource Locators (URLs) instead of filenames.  Some
restrictions apply: it can only open URLs for reading, and no seek
operations are available.

It defines the following public functions:

\begin{funcdesc}{urlopen}{url\optional{, data}}
Open a network object denoted by a URL for reading.  If the URL does
not have a scheme identifier, or if it has \file{file:} as its scheme
identifier, this opens a local file; otherwise it opens a socket to a
server somewhere on the network.  If the connection cannot be made, or
if the server returns an error code, the \exception{IOError} exception
is raised.  If all goes well, a file-like object is returned.  The
returned object supports the following methods: \method{read()},
\method{readline()}, \method{readlines()}, \method{fileno()},
\method{close()} and \method{info()}.

Except for the \method{info()} method,
these methods have the same interface as for
file objects; see section \ref{bltin-file-objects} in this
manual.  (It is not a built-in file object, however, so it can't be
used at those few places where a true built-in file object is
required.)

The \method{info()} method returns an instance of the class
\class{mimetools.Message} containing the headers received from the
server, if the protocol uses such headers (currently the only
supported protocol that uses this is HTTP).  See the description of
the \module{mimetools}\refstmodindex{mimetools} module.

If the \var{url} uses the \file{http:} scheme identifier, the optional
\var{data} argument may be given to specify a \code{POST} request
(normally the request type is \code{GET}).  The \var{data} argument
must be in standard \file{application/x-www-form-urlencoded} format;
see the \function{urlencode()} function below.
\end{funcdesc}
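The file-like behaviour of \function{urlopen()} can be tried without a
network connection by opening a \file{file:} URL.  The following is a
minimal sketch, not a definitive recipe: the temporary file and its
contents are made up for illustration, and the fallback import from
\module{urllib.request} is an assumption for newer interpreters in
which these functions have moved (there the data comes back as bytes).

```python
import os
import tempfile

try:
    from urllib import urlopen, pathname2url          # the module as documented here
except ImportError:
    from urllib.request import urlopen, pathname2url  # assumption: newer Python location

# Create a small local file to stand in for a network object.
fd, path = tempfile.mkstemp(suffix='.txt')
os.write(fd, b'first line\nsecond line\n')
os.close(fd)

# Build a file: URL for it and read it through the file-like object.
url = 'file:' + pathname2url(path)
f = urlopen(url)
first = f.readline()   # one line, including the trailing newline
rest = f.read()        # everything that is left
f.close()

os.remove(path)
```

Only sequential reading is used here, in keeping with the restriction
that no seek operations are available.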

\begin{funcdesc}{urlretrieve}{url}
Copy a network object denoted by a URL to a local file, if necessary.
If the URL points to a local file, or a valid cached copy of the
object exists, the object is not copied.  Return a tuple
\code{(\var{filename}, \var{headers})} where \var{filename} is the
local file name under which the object can be found, and \var{headers}
is either \code{None} (for a local object) or whatever the
\method{info()} method of the object returned by \function{urlopen()}
returned (for a remote object, possibly cached).  Exceptions are the
same as for \function{urlopen()}.
\end{funcdesc}

\begin{funcdesc}{urlcleanup}{}
Clear the cache that may have been built up by previous calls to
\function{urlretrieve()}.
\end{funcdesc}

\begin{funcdesc}{quote}{string\optional{, safe}}
Replace special characters in \var{string} using the \samp{\%xx} escape.
Letters, digits, and the characters \character{_,.-} are never quoted.
The optional \var{safe} parameter specifies additional characters
that should not be quoted; its default value is \code{'/'}.

Example: \code{quote('/\~connolly/')} yields \code{'/\%7econnolly/'}.
\end{funcdesc}
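The effect of the \var{safe} parameter can be sketched as follows.
The sample string is hypothetical, and the fallback import from
\module{urllib.parse} is an assumption for newer interpreters in which
these functions have moved.

```python
try:
    from urllib import quote, unquote        # the module as documented here
except ImportError:
    from urllib.parse import quote, unquote  # assumption: newer Python location

path = 'my docs/a b.txt'   # hypothetical path, used only for illustration

default_quoted = quote(path)    # '/' is in the default safe set, so it is kept
fully_quoted = quote(path, '')  # empty safe set: '/' is escaped as well

# Unquoting restores the original string in both cases.
restored_default = unquote(default_quoted)
restored_full = unquote(fully_quoted)
```

Passing an empty \var{safe} string is one way to quote a single path
component that may itself contain a slash.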

\begin{funcdesc}{quote_plus}{string\optional{, safe}}
Like \function{quote()}, but also replaces spaces by plus signs, as
required for quoting HTML form values.  Plus signs in the original
string are escaped unless they are included in \var{safe}.
\end{funcdesc}

\begin{funcdesc}{unquote}{string}
Replace \samp{\%xx} escapes by their single-character equivalent.

Example: \code{unquote('/\%7Econnolly/')} yields \code{'/\~connolly/'}.
\end{funcdesc}

\begin{funcdesc}{unquote_plus}{string}
Like \function{unquote()}, but also replaces plus signs by spaces, as
required for unquoting HTML form values.
\end{funcdesc}
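A round trip through \function{quote_plus()} and
\function{unquote_plus()} might look like the sketch below.  The form
value is made up, and the fallback import from \module{urllib.parse}
is an assumption for newer interpreters in which these functions have
moved.

```python
try:
    from urllib import quote_plus, unquote_plus        # the module as documented here
except ImportError:
    from urllib.parse import quote_plus, unquote_plus  # assumption: newer Python location

value = 'fish + chips'   # hypothetical form value with spaces and a literal plus

# Spaces become plus signs; the literal plus sign is escaped as %xx.
encoded = quote_plus(value)

# Decoding reverses both substitutions.
decoded = unquote_plus(encoded)
```

Because the literal plus sign is escaped on the way out, it cannot be
mistaken for an encoded space on the way back.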

\begin{funcdesc}{urlencode}{dict}
Convert a dictionary to a ``url-encoded'' string, suitable to pass to
\function{urlopen()} above as the optional \var{data} argument.  This
is useful to pass a dictionary of form fields to a \code{POST}
request.  The resulting string is a series of \var{key}=\var{value}
pairs separated by \code{\&} characters, where both \var{key} and
\var{value} are quoted using \function{quote_plus()} above.
\end{funcdesc}
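As a small illustration of \function{urlencode()}, the sketch below
encodes a single form field.  The field name and value are
hypothetical, and the fallback import from \module{urllib.parse} is an
assumption for newer interpreters in which the function has moved.

```python
try:
    from urllib import urlencode        # the module as documented here
except ImportError:
    from urllib.parse import urlencode  # assumption: newer Python location

fields = {'query': 'monty python'}   # hypothetical form field

# Key and value are quoted as by quote_plus() and joined with '='.
data = urlencode(fields)
```

A single field is used so that the result does not depend on the order
in which dictionary items are visited.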

Restrictions:

\begin{itemize}

\item
Currently, only the following protocols are supported: HTTP (versions
0.9 and 1.0), Gopher (but not Gopher+), FTP, and local files.
\indexii{HTTP}{protocol}
\indexii{Gopher}{protocol}
\indexii{FTP}{protocol}

\item
The caching feature of \function{urlretrieve()} has been disabled
until proper processing of expiration time headers is implemented.

\item
There is currently no function to query whether a particular URL is in
the cache.

\item
For backward compatibility, if a URL appears to point to a local file
but the file can't be opened, the URL is re-interpreted using the FTP
protocol.  This can sometimes cause confusing error messages.

\item
The \function{urlopen()} and \function{urlretrieve()} functions can
cause arbitrarily long delays while waiting for a network connection
to be set up.  This means that it is difficult to build an interactive
web client using these functions without using threads.

\item
The data returned by \function{urlopen()} or \function{urlretrieve()}
is the raw data returned by the server.  This may be binary data
(e.g. an image), plain text or (for example) HTML.  The HTTP protocol
provides type information in the reply header, which can be inspected
by looking at the \code{content-type} header.  For the Gopher protocol,
type information is encoded in the URL; there is currently no easy way
to extract it.  If the returned data is HTML, you can use the module
\module{htmllib}\refstmodindex{htmllib} to parse it.
\index{HTML}
\indexii{HTTP}{protocol}
\indexii{Gopher}{protocol}

\item
Although the \module{urllib} module contains (undocumented) routines
to parse and unparse URL strings, the recommended interface for URL
manipulation is in module \module{urlparse}\refstmodindex{urlparse}.

\end{itemize}