\section{Standard Module \sectcode{urllib}}
\label{module-urllib}
\stmodindex{urllib}
\index{WWW}
\index{World-Wide Web}
\index{URL}

\renewcommand{\indexsubitem}{(in module urllib)}

This module provides a high-level interface for fetching data across
the World-Wide Web. In particular, the \code{urlopen()} function is
similar to the built-in function \code{open()}, but accepts URLs
(Uniform Resource Locators) instead of filenames. Some restrictions
apply: it can only open URLs for reading, and no seek operations
are available.

It defines the following public functions:

\begin{funcdesc}{urlopen}{url}
Open a network object denoted by a URL for reading. If the URL does
not have a scheme identifier, or if it has \samp{file:} as its scheme
identifier, this opens a local file; otherwise it opens a socket to a
server somewhere on the network. If the connection cannot be made, or
if the server returns an error code, the \code{IOError} exception is
raised. If all goes well, a file-like object is returned. It
supports the following methods: \code{read()}, \code{readline()},
\code{readlines()}, \code{fileno()}, \code{close()} and \code{info()}.
Except for the last one, these methods have the same interface as for
file objects; see the section on File Objects earlier in this
manual. (It's not a built-in file object, however, so it can't be
used at those few places where a true built-in file object is
required.)

The \code{info()} method returns an instance of the class
\code{mimetools.Message} containing the headers received from the server,
if the protocol uses such headers (currently the only supported
protocol that uses them is HTTP). See the description of the
\code{mimetools} module.
\refstmodindex{mimetools}
\end{funcdesc}
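As a minimal sketch of \code{urlopen()} (an illustration, not code taken
from the module), the following reads a local file through a
\samp{file:} URL so that it runs without a network connection. It
assumes a current Python 3 interpreter, where the function is imported
from \code{urllib.request} rather than from \code{urllib} itself; the
helper \code{read_url()} and the temporary file are hypothetical names
introduced only for this example.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen  # assumption: Python 3 home of urlopen()


def read_url(url):
    """Return the complete body of the object that url refers to."""
    f = urlopen(url)
    try:
        return f.read()      # same interface as a file object's read()
    finally:
        f.close()


# Create a small local file so the example stays off the network.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(b"hello from a local file\n")

data = read_url(Path(path).as_uri())   # a file: URL for the temporary file
os.remove(path)
print(data)
```

The same helper should work unchanged for an HTTP URL, at the cost of
the delays and error conditions described in this section.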

\begin{funcdesc}{urlretrieve}{url}
Copy a network object denoted by a URL to a local file, if necessary.
If the URL points to a local file, or a valid cached copy of the
object exists, the object is not copied. Return a tuple (\var{filename},
\var{headers}) where \var{filename} is the local file name under which
the object can be found, and \var{headers} is either \code{None} (for
a local object) or the result of the \code{info()} method of the object
returned by \code{urlopen()} (for a remote object, possibly
cached). Exceptions are the same as for \code{urlopen()}.
\end{funcdesc}

\begin{funcdesc}{urlcleanup}{}
Clear the cache that may have been built up by previous calls to
\code{urlretrieve()}.
\end{funcdesc}
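The following sketch shows one way \code{urlretrieve()} and
\code{urlcleanup()} might be used together. It assumes a current
Python 3 interpreter (where both are imported from
\code{urllib.request}) and a Unix-style temporary path; the temporary
file is a stand-in for a real object. Only the returned file name is
used, since the value of \var{headers} for local files can differ
between versions.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlcleanup, urlretrieve  # assumption: Python 3 location

# A throwaway local file stands in for the object to retrieve.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(b"the caller just gets a file name\n")

# urlretrieve() returns (filename, headers); only the file name is used here.
filename, headers = urlretrieve(Path(path).as_uri())
with open(filename, "rb") as f:
    body = f.read()

urlcleanup()       # discard anything earlier urlretrieve() calls left behind
os.remove(path)
print(filename)
```

Because the URL points to a local file, no copy is made and the
returned name refers to the original file.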

\begin{funcdesc}{quote}{string\optional{\, addsafe}}
Replace special characters in \var{string} using the \code{\%xx} escape.
Letters, digits, and the characters ``\code{_,.-}'' are never quoted.
The optional \var{addsafe} parameter specifies additional characters
that should not be quoted; its default value is \code{'/'}.

Example: \code{quote('/\~connolly/')} yields \code{'/\%7econnolly/'}.
\end{funcdesc}

\begin{funcdesc}{quote_plus}{string\optional{\, addsafe}}
Like \code{quote()}, but also replaces spaces by plus signs, as
required for quoting HTML form values.
\end{funcdesc}
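To make the difference between \code{quote()} and \code{quote_plus()}
concrete, here is a short sketch. It assumes a current Python 3
interpreter, where both functions are imported from
\code{urllib.parse}; the sample strings are made up. The case of the
hexadecimal digits and the treatment of \code{\~} may differ between
versions, so the sketch avoids relying on the exact escape shown in the
example above.

```python
from urllib.parse import quote, quote_plus  # assumption: Python 3 location

# '/' is left alone by default, so a path keeps its structure.
path = quote("/my docs/read me.txt")

# Passing an empty second argument removes '/' from the safe set.
segment = quote("a/b", "")

# quote_plus() turns spaces into '+', as expected in HTML form values.
field = quote_plus("fish & chips")

print(path)     # /my%20docs/read%20me.txt
print(segment)  # a%2Fb
print(field)    # fish+%26+chips
```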

\begin{funcdesc}{unquote}{string}
Replace \samp{\%xx} escapes by their single-character equivalent.

Example: \code{unquote('/\%7Econnolly/')} yields \code{'/\~connolly/'}.
\end{funcdesc}

\begin{funcdesc}{unquote_plus}{string}
Like \code{unquote()}, but also replaces plus signs by spaces, as
required for unquoting HTML form values.
\end{funcdesc}
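The unquoting functions reverse the operations above. The sketch
below, again assuming a current Python 3 interpreter with the functions
imported from \code{urllib.parse}, shows that only
\code{unquote_plus()} treats a plus sign as a space; the sample strings
are illustrative.

```python
from urllib.parse import unquote, unquote_plus  # assumption: Python 3 location

home = unquote("/%7Econnolly/")          # %7E decodes to '~'
literal = unquote("fish+%26+chips")      # '+' is left untouched
field = unquote_plus("fish+%26+chips")   # '+' becomes a space

print(home)     # /~connolly/
print(literal)  # fish+&+chips
print(field)    # fish & chips
```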

Restrictions:

\begin{itemize}

\item
Currently, only the following protocols are supported: HTTP (versions
0.9 and 1.0), Gopher (but not Gopher+), FTP, and local files.
\index{HTTP}
\index{Gopher}
\index{FTP}

\item
The caching feature of \code{urlretrieve()} has been disabled until I
find the time to hack proper processing of Expiration time headers.

\item
There should be a function to query whether a particular URL is in
the cache.

\item
For backward compatibility, if a URL appears to point to a local file
but the file can't be opened, the URL is re-interpreted using the FTP
protocol. This can sometimes cause confusing error messages.

\item
The \code{urlopen()} and \code{urlretrieve()} functions can cause
arbitrarily long delays while waiting for a network connection to be
set up. This means that it is difficult to build an interactive
web client using these functions without using threads.

\item
The data returned by \code{urlopen()} or \code{urlretrieve()} is the
raw data returned by the server. This may be binary data (e.g. an
image), plain text or (for example) HTML. The HTTP protocol provides
type information in the reply header, which can be inspected by
looking at the \code{Content-type} header. For the Gopher protocol,
type information is encoded in the URL; there is currently no easy way
to extract it. If the returned data is HTML, you can use the module
\code{htmllib} to parse it.
\index{HTML}%
\index{HTTP}%
\index{Gopher}%
\refstmodindex{htmllib}

\item
Although the \code{urllib} module contains (undocumented) routines to
parse and unparse URL strings, the recommended interface for URL
manipulation is in module \code{urlparse}.
\refstmodindex{urlparse}

\end{itemize}
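Two of the restrictions above, the blocking behaviour of
\code{urlopen()} and the need to inspect the \code{Content-type}
header, can be sketched together. This is only one possible approach,
assuming a current Python 3 interpreter (\code{urlopen()} imported from
\code{urllib.request}, with \code{IOError} an alias of
\code{OSError}); the worker function \code{fetch()} and the result
queue are hypothetical names. A local \samp{file:} URL is used so the
example runs offline; a current interpreter also synthesizes a
\code{Content-type} header for local files.

```python
import os
import queue
import tempfile
import threading
from pathlib import Path
from urllib.request import urlopen  # assumption: Python 3 location


def fetch(url, results):
    """Worker: put (content_type, data) on results, or the exception raised."""
    try:
        f = urlopen(url)
        try:
            content_type = f.info().get("Content-type")
            data = f.read()
        finally:
            f.close()
        results.put((content_type, data))
    except OSError as exc:           # IOError is the same class in Python 3
        results.put(exc)


# A small text file stands in for a remote object.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(b"plain text\n")

results = queue.Queue()
worker = threading.Thread(target=fetch, args=(Path(path).as_uri(), results))
worker.start()
# The calling thread stays free to do other work while the fetch runs.
worker.join()
outcome = results.get()
os.remove(path)
print(outcome)
```

Passing the outcome back through a queue keeps the exception handling
in the caller's hands instead of losing it inside the worker thread.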