\section{Standard Module \module{urllib}}
\label{module-urllib}
\stmodindex{urllib}
\index{WWW}
\index{World-Wide Web}
\index{URL}


This module provides a high-level interface for fetching data across
the World-Wide Web.  In particular, the \function{urlopen()} function
is similar to the built-in function \function{open()}, but accepts
Uniform Resource Locators (URLs) instead of filenames.  Some
restrictions apply: it can only open URLs for reading, and no seek
operations are available.

It defines the following public functions:

\begin{funcdesc}{urlopen}{url}
Open a network object denoted by a URL for reading.  If the URL does
not have a scheme identifier, or if it has \file{file:} as its scheme
identifier, this opens a local file; otherwise it opens a socket to a
server somewhere on the network.  If the connection cannot be made, or
if the server returns an error code, the \exception{IOError} exception
is raised.  On success, a file-like object is returned.  This object
supports the following methods: \method{read()}, \method{readline()},
\method{readlines()}, \method{fileno()}, \method{close()} and
\method{info()}.
Except for the last one, these methods have the same interface as for
file objects; see section \ref{bltin-file-objects} in this
manual.  (It is not a built-in file object, however, so it can't be
used at those few places where a true built-in file object is
required.)

The \method{info()} method returns an instance of the class
\class{mimetools.Message} containing the headers received from the
server, if the protocol uses such headers (currently the only
supported protocol that uses them is HTTP).  See the description of
the \module{mimetools}\refstmodindex{mimetools} module.
\end{funcdesc}
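
The file-like interface described above can be sketched as follows.
This is only an illustration under stated assumptions: it targets a
current Python~3 interpreter, where the function is reached as
\function{urllib.request.urlopen()} rather than through
\module{urllib} directly, and it reads a \file{file:} URL so that no
network access is needed.  The temporary file and all variable names
are hypothetical.

```python
import os
import tempfile
import urllib.request  # assumption: Python 3 location of urlopen()
from pathlib import Path

# Create a small local file to stand in for a network object.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"first line\nsecond line\n")

url = Path(path).as_uri()      # a file: URL for the temporary file

# The returned object is used like a file opened for reading.
obj = urllib.request.urlopen(url)
first = obj.readline()         # one line, returned as bytes in Python 3
rest = obj.read()              # everything that remains
obj.close()

os.remove(path)
```

Note that the object offers no seek operation, so the data is consumed
strictly front to back.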

\begin{funcdesc}{urlretrieve}{url}
Copy a network object denoted by a URL to a local file, if necessary.
If the URL points to a local file, or a valid cached copy of the
object exists, the object is not copied.  Return a tuple
\code{(\var{filename}, \var{headers})} where \var{filename} is the
local file name under which the object can be found, and \var{headers}
is either \code{None} (for a local object) or whatever the
\method{info()} method of the object returned by \function{urlopen()}
returned (for a remote object, possibly cached).  Exceptions are the
same as for \function{urlopen()}.
\end{funcdesc}
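
A minimal sketch of the \code{(\var{filename}, \var{headers})} return
value, assuming a current Python~3 interpreter in which the function is
available as \function{urllib.request.urlretrieve()}.  In that
implementation \var{headers} is a message object even for local files,
which differs from the \code{None} described above, so the example
does not rely on its value.  The temporary file is hypothetical.

```python
import os
import tempfile
import urllib.request  # assumption: Python 3 location of urlretrieve()
from pathlib import Path

# A local file plays the role of the object to retrieve.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"payload\n")

# For a local file URL nothing is copied; a 2-tuple is returned.
filename, headers = urllib.request.urlretrieve(Path(path).as_uri())

# The first element names a local file holding the object's data.
with open(filename, "rb") as f:
    data = f.read()

urllib.request.urlcleanup()    # discard temporary files, if any
os.remove(path)
```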

\begin{funcdesc}{urlcleanup}{}
Clear the cache that may have been built up by previous calls to
\function{urlretrieve()}.
\end{funcdesc}

\begin{funcdesc}{quote}{string\optional{, addsafe}}
Replace special characters in \var{string} using the \samp{\%xx} escape.
Letters, digits, and the characters \character{_,.-} are never quoted.
The optional \var{addsafe} parameter specifies additional characters
that should not be quoted; its default value is \code{'/'}.

Example: \code{quote('/\~{}connolly/')} yields \code{'/\%7econnolly/'}.
\end{funcdesc}
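
The effect of the default safe character \code{'/'} can be sketched as
below.  This assumes a current Python~3 interpreter, where the function
is imported from \module{urllib.parse} and the optional parameter is
named \code{safe} rather than \var{addsafe}; the input strings are
purely illustrative.

```python
from urllib.parse import quote  # assumption: Python 3 location of quote()

# By default '/' is left alone; the space and '&' are escaped.
default_quoted = quote("/my docs/a&b.txt")    # '/my%20docs/a%26b.txt'

# With no additional safe characters, '/' is escaped as well.
strict_quoted = quote("/my docs/", safe="")   # '%2Fmy%20docs%2F'
```

Leaving \code{'/'} unquoted suits whole path components; an empty safe
set is the better fit when quoting a single path segment.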

\begin{funcdesc}{quote_plus}{string\optional{, addsafe}}
Like \function{quote()}, but also replaces spaces by plus signs, as
required for quoting HTML form values.
\end{funcdesc}

\begin{funcdesc}{unquote}{string}
Replace \samp{\%xx} escapes by their single-character equivalent.

Example: \code{unquote('/\%7Econnolly/')} yields \code{'/\~{}connolly/'}.
\end{funcdesc}
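
As a small illustration, the sketch below decodes the example above
with both upper-case and lower-case hex digits.  It assumes a current
Python~3 interpreter, where the function is imported from
\module{urllib.parse}.

```python
from urllib.parse import unquote  # assumption: Python 3 location

upper = unquote("/%7Econnolly/")   # '/~connolly/'
lower = unquote("/%7econnolly/")   # hex digits are case-insensitive
```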

\begin{funcdesc}{unquote_plus}{string}
Like \function{unquote()}, but also replaces plus signs by spaces, as
required for unquoting HTML form values.
\end{funcdesc}
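
The two form-value functions are inverses of each other, which the
following sketch illustrates.  It assumes a current Python~3
interpreter, where both are imported from \module{urllib.parse}; the
sample values are hypothetical.

```python
from urllib.parse import quote_plus, unquote_plus  # assumption: Python 3

value = "fish & chips"
encoded = quote_plus(value)       # 'fish+%26+chips'
decoded = unquote_plus(encoded)   # back to the original string

# A literal plus sign must itself be escaped so it is not read as a space.
plus_encoded = quote_plus("1+1")  # '1%2B1'
```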

Restrictions:

\begin{itemize}

\item
Currently, only the following protocols are supported: HTTP (versions
0.9 and 1.0), Gopher (but not Gopher+), FTP, and local files.
\indexii{HTTP}{protocol}
\indexii{Gopher}{protocol}
\indexii{FTP}{protocol}

\item
The caching feature of \function{urlretrieve()} has been disabled
until proper processing of expiration time headers is implemented.

\item
There should be a function to query whether a particular URL is in
the cache.

\item
For backward compatibility, if a URL appears to point to a local file
but the file can't be opened, the URL is re-interpreted using the FTP
protocol.  This can sometimes cause confusing error messages.

\item
The \function{urlopen()} and \function{urlretrieve()} functions can
cause arbitrarily long delays while waiting for a network connection
to be set up.  This means that it is difficult to build an interactive
web client using these functions without using threads.

\item
The data returned by \function{urlopen()} or \function{urlretrieve()}
is the raw data returned by the server.  This may be binary data
(e.g. an image), plain text or (for example) HTML.  The HTTP protocol
provides type information in the reply header, which can be inspected
by looking at the \code{content-type} header.  For the Gopher protocol,
type information is encoded in the URL; there is currently no easy way
to extract it.  If the returned data is HTML, you can use the module
\module{htmllib}\refstmodindex{htmllib} to parse it.
\index{HTML}
\indexii{HTTP}{protocol}
\indexii{Gopher}{protocol}

\item
Although the \module{urllib} module contains (undocumented) routines
to parse and unparse URL strings, the recommended interface for URL
manipulation is in module \module{urlparse}\refstmodindex{urlparse}.

\end{itemize}
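
Inspecting the \code{content-type} header, as mentioned in the
restrictions above, might look like the sketch below.  It rests on
several assumptions: a current Python~3 interpreter, where
\function{urlopen()} lives in \module{urllib.request}; a header object
that offers a \method{get_content_type()} method; and an
implementation that synthesizes headers for \file{file:} URLs from the
file name, which lets the example run without a network.  The
temporary file is hypothetical.

```python
import os
import tempfile
import urllib.request  # assumption: Python 3 location of urlopen()
from pathlib import Path

# A local .html file stands in for a document served over HTTP.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as f:
    f.write(b"<html><body>hello</body></html>\n")

obj = urllib.request.urlopen(Path(path).as_uri())
headers = obj.info()                         # headers for this object
content_type = headers.get_content_type()    # main type/subtype only
body = obj.read()                            # raw, undecoded data
obj.close()
os.remove(path)

# Decide how to treat the raw data based on the reported type.
is_html = content_type == "text/html"
```

Only when \code{is\_html} is true would the raw data be handed to an
HTML parser; other types call for different handling.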