\section{Standard Module \sectcode{urllib}}
\stmodindex{urllib}
\index{WWW}
\index{World-Wide Web}
\index{URL}

\renewcommand{\indexsubitem}{(in module urllib)}

This module provides a high-level interface for fetching data across
the World-Wide Web. In particular, the \code{urlopen} function is
similar to the built-in function \code{open}, but accepts URLs
(Uniform Resource Locators) instead of filenames. Some restrictions
apply: it can only open URLs for reading, and no seek operations
are available.

It defines the following public functions:

\begin{funcdesc}{urlopen}{url}
Open a network object denoted by a URL for reading. If the URL does
not have a scheme identifier, or if it has \samp{file:} as its scheme
identifier, this opens a local file; otherwise it opens a socket to a
server somewhere on the network. If the connection cannot be made, or
if the server returns an error code, the \code{IOError} exception is
raised. If all goes well, a file-like object is returned. This
supports the following methods: \code{read()}, \code{readline()},
\code{readlines()}, \code{fileno()}, \code{close()} and \code{info()}.
Except for the last one, these methods have the same interface as for
file objects; see the section on File Objects earlier in this
manual. (It's not a built-in file object, however, so it can't be
used at those few places where a true built-in file object is
required.)

The \code{info()} method returns an instance of the class
\code{rfc822.Message} containing the headers received from the server,
if the protocol uses such headers (currently the only supported
protocol that uses this is HTTP). See the description of the
\code{rfc822} module.
\end{funcdesc}
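
The following is a minimal sketch of the \code{urlopen()} behaviour described above, not a definitive recipe. It assumes a modern Python 3 interpreter, where the function lives in \code{urllib.request} rather than directly in \code{urllib} and where data is returned as bytes; the temporary file and its contents are invented for illustration. A \samp{file:} URL is used so that no network access is needed.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen  # assumption: Python 3 location of urlopen

# Create a throwaway local file to stand in for a network object.
fd, name = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"first line\nsecond line\n")

url = Path(name).resolve().as_uri()  # a file: URL for the temporary file

f = urlopen(url)
first = f.readline()   # same interface as a file object
rest = f.read()        # everything after the first line
headers = f.info()     # header object; the extra method beyond file objects
f.close()

os.remove(name)
```

Because the returned object is only file-like, the sketch sticks to the methods listed above and never calls \code{seek()}.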

\begin{funcdesc}{urlretrieve}{url}
Copy a network object denoted by a URL to a local file, if necessary.
If the URL points to a local file, or a valid cached copy of the
object exists, the object is not copied. Return a tuple (\var{filename},
\var{headers}) where \var{filename} is the local file name under which
the object can be found, and \var{headers} is either \code{None} (for
a local object) or whatever the \code{info()} method of the object
returned by \code{urlopen()} returned (for a remote object, possibly
cached). Exceptions are the same as for \code{urlopen()}.
\end{funcdesc}
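
As a hedged illustration of the \code{(\var{filename}, \var{headers})} return value, the sketch below assumes Python 3, where \code{urlretrieve} and \code{urlcleanup} are found in \code{urllib.request}. It also passes an explicit destination file name, a second argument that the modern function accepts but that is not part of the signature documented above; the file names are made up for the example.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve, urlcleanup  # assumption: Python 3 names

# A small local "page" standing in for a remote object.
work_dir = tempfile.mkdtemp()
src = Path(work_dir) / "page.html"
src.write_bytes(b"<html><body>hello</body></html>")
dest = os.path.join(work_dir, "copy.html")  # hypothetical destination name

# The explicit destination (not in the signature documented above) forces a copy.
filename, headers = urlretrieve(src.resolve().as_uri(), dest)

with open(filename, "rb") as copied:
    data = copied.read()

urlcleanup()  # discard anything urlretrieve may have left behind
```

The first element of the tuple is the name under which the object can be read back, which is all a caller normally needs.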

\begin{funcdesc}{urlcleanup}{}
Clear the cache that may have been built up by previous calls to
\code{urlretrieve()}.
\end{funcdesc}

\begin{funcdesc}{quote}{string\optional{\, addsafe}}
Replace special characters in \var{string} using the \code{\%xx} escape.
Letters, digits, and the characters ``\code{_,.-}'' are never quoted.
The optional \var{addsafe} parameter specifies additional characters
that should not be quoted; its default value is \code{'/'}.

Example: \code{quote('/\~connolly/')} yields \code{'/\%7econnolly/'}.
\end{funcdesc}
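
A short sketch of the quoting rules, assuming Python 3, where the function is \code{urllib.parse.quote} and the optional parameter is called \code{safe} rather than \var{addsafe}. The set of characters that are never quoted has varied between versions, so the example sticks to a space and a slash, whose treatment is unambiguous.

```python
from urllib.parse import quote  # assumption: Python 3 location of quote

# '/' is left alone by default; the space becomes a %xx escape.
default_safe = quote("/my docs/")

# Passing an empty set of safe characters makes '/' subject to quoting too.
nothing_safe = quote("a/b c", safe="")
```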

\begin{funcdesc}{unquote}{string}
Replace \samp{\%xx} escapes by their single-character equivalent.

Example: \code{unquote('/\%7Econnolly/')} yields \code{'/\~connolly/'}.
\end{funcdesc}
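
The example above can be checked with a small sketch, again assuming the Python 3 names \code{urllib.parse.quote} and \code{urllib.parse.unquote}. It also shows that unquoting reverses quoting for a simple path.

```python
from urllib.parse import quote, unquote  # assumption: Python 3 locations

# The hexadecimal digits of an escape may be upper or lower case.
decoded_upper = unquote("/%7Econnolly/")
decoded_lower = unquote("/%7econnolly/")

# Quoting followed by unquoting returns the original string.
original = "/my docs/"
round_trip = unquote(quote(original))
```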

Restrictions:

\begin{itemize}

\item
Currently, only the following protocols are supported: HTTP (versions
0.9 and 1.0), Gopher (but not Gopher+), FTP, and local files.
\index{HTTP}
\index{Gopher}
\index{FTP}

\item
The caching feature of \code{urlretrieve()} has been disabled until
proper processing of expiration time headers is implemented.

\item
There should be a function to query whether a particular URL is in
the cache.

\item
For backward compatibility, if a URL appears to point to a local file
but the file can't be opened, the URL is re-interpreted using the FTP
protocol. This can sometimes cause confusing error messages.

\item
The \code{urlopen()} and \code{urlretrieve()} functions can cause
arbitrarily long delays while waiting for a network connection to be
set up. This means that it is difficult to build an interactive
web client using these functions without using threads.

\item
The data returned by \code{urlopen()} or \code{urlretrieve()} is the
raw data returned by the server. This may be binary data (e.g. an
image), plain text or (for example) HTML. The HTTP protocol provides
type information in the reply header, which can be inspected by
looking at the \code{Content-type} header. For the Gopher protocol,
type information is encoded in the URL; there is currently no easy way
to extract it. If the returned data is HTML, you can use the module
\code{htmllib} to parse it.
\index{HTML}
\index{HTTP}
\index{Gopher}
\stmodindex{htmllib}

\item
Although the \code{urllib} module contains (undocumented) routines to
parse and unparse URL strings, the recommended interface for URL
manipulation is in module \code{urlparse}.
\stmodindex{urlparse}

\end{itemize}
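
Two of the restrictions above, the possibly long delays and the need to inspect the \code{Content-type} header, can be handled together. The sketch below is one possible arrangement, not the module's own mechanism. It assumes Python 3 (\code{urllib.request.urlopen}, with the header object offering \code{get_content_type()}), and the helper \code{fetch}, the queue, and the temporary file are hypothetical names introduced for illustration. The fetch runs in a worker thread so the main thread is never blocked by the connection.

```python
import os
import queue
import tempfile
import threading
from pathlib import Path
from urllib.request import urlopen  # assumption: Python 3 location of urlopen


def fetch(url, results):
    """Hypothetical worker: put (content_type, data) or the exception on results."""
    try:
        f = urlopen(url)
        try:
            results.put((f.info().get_content_type(), f.read()))
        finally:
            f.close()
    except OSError as exc:  # IOError is an alias of OSError in Python 3
        results.put(exc)


# A local HTML file stands in for a remote page, so no network is required.
tmp_dir = tempfile.mkdtemp()
page = Path(tmp_dir) / "index.html"
page.write_bytes(b"<html><body>hi</body></html>")

results = queue.Queue()
worker = threading.Thread(target=fetch, args=(page.resolve().as_uri(), results))
worker.start()
# The main thread remains free here, e.g. to keep a user interface responsive.
worker.join()
outcome = results.get()

os.remove(page)
os.rmdir(tmp_dir)
```

Passing the result back through a queue keeps the worker free of shared state; a caller would check whether \code{outcome} is an exception before using the content type to decide how to interpret the data.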