\section{Built-in module \sectcode{urllib}}
\stmodindex{urllib}
\index{WWW}
\indexii{World-Wide}{Web}
\index{URL}

This module provides a high-level interface for fetching data across
the World-Wide Web. In particular, the \code{urlopen} function is
similar to the built-in function \code{open}, but accepts URLs
(Uniform Resource Locators) instead of filenames. Some restrictions
apply: it can only open URLs for reading, and no seek operations
are available.

It defines the following public functions:

\begin{funcdesc}{urlopen}{url}
Open a network object denoted by a URL for reading. If the URL does
not have a scheme identifier, or if it has \code{file:} as its scheme
identifier, this opens a local file; otherwise it opens a socket to a
server somewhere on the network. If the connection cannot be made, or
if the server returns an error code, the \code{IOError} exception is
raised. If all goes well, a file-like object is returned. This object
supports the following methods: \code{read()}, \code{readline()},
\code{readlines()}, \code{fileno()}, \code{close()} and \code{info()}.
Except for the last one, these methods have the same interface as for
file objects; see the section on File Objects earlier in this
manual.

The \code{info()} method returns an instance of the class
\code{rfc822.Message} containing the headers received from the server,
if the protocol uses such headers (currently the only supported
protocol that uses them is HTTP). See the description of the
\code{rfc822} module.
\end{funcdesc}
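
As an illustration of the \code{urlopen} interface described above, here
is a minimal sketch. It assumes a recent Python 3 interpreter, where the
function is reached as \code{urllib.request.urlopen} rather than
\code{urllib.urlopen}. The helper name \code{read\_url} and the temporary
file are invented for this example, and a \code{file:} URL is used so
that no network access is needed.

```python
import os
import tempfile
import urllib.request  # assumption: Python 3 location of urlopen
from pathlib import Path


def read_url(url):
    """Hypothetical helper: return (data, headers) for the object at url."""
    f = urllib.request.urlopen(url)
    try:
        headers = f.info()  # header object; contents depend on the protocol
        data = f.read()     # same interface as an ordinary file object
    finally:
        f.close()
    return data, headers


# Create a small local file so the sketch runs without a network.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(b"hello, web\n")

url = Path(path).as_uri()  # build a file: URL for the temporary file
data, headers = read_url(url)

# Once the file is gone, opening the same URL should raise IOError
# (in Python 3 this is the same class as OSError).
os.remove(path)
try:
    read_url(url)
    missing_raises = False
except IOError:
    missing_raises = True
```

The object returned by \code{urlopen} is closed explicitly in a
\code{finally} clause, so the underlying file or socket is released even
if \code{read()} fails.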

\begin{funcdesc}{urlretrieve}{url}
Copy a network object denoted by a URL to a local file, if necessary.
If the URL points to a local file, or a valid cached copy of the
object exists, the object is not copied. Return a tuple (\var{filename},
\var{headers}) where \var{filename} is the local file name under which
the object can be found, and \var{headers} is either \code{None} (for
a local object) or whatever the \code{info()} method of the object
returned by \code{urlopen()} returned (for a remote object, possibly
cached). Exceptions are the same as for \code{urlopen()}.
\end{funcdesc}
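
The tuple returned by \code{urlretrieve} can be used as in the sketch
below. It makes several assumptions that are not part of the description
above: a recent Python 3 interpreter, where the function lives in
\code{urllib.request}, and an explicit second argument naming the
destination file, which that version accepts. In that version the
\var{headers} item may be a header object even for a local file, so the
sketch does not rely on its value. All file names are invented for the
example.

```python
import os
import tempfile
import urllib.request  # assumption: Python 3 location of urlretrieve
from pathlib import Path

# Prepare a local source object; a file: URL avoids any network access.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "source.txt")
target = os.path.join(workdir, "copy.txt")
with open(source, "wb") as out:
    out.write(b"cached object\n")

# Assumption: the second argument (destination file name) is accepted by
# the Python 3 version of urlretrieve; it forces a copy to be made.
filename, headers = urllib.request.urlretrieve(Path(source).as_uri(), target)

# filename is the local file name under which the object can be found.
with open(filename, "rb") as f:
    copied = f.read()

# Discard anything left behind by earlier urlretrieve() calls.
urllib.request.urlcleanup()
```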

\begin{funcdesc}{urlcleanup}{}
Clear the cache that may have been built up by previous calls to
\code{urlretrieve()}.
\end{funcdesc}

\begin{funcdesc}{quote}{string\optional{\, addsafe}}
Replace special characters in \var{string} using the \code{\%xx} escape.
Letters, digits, and the characters ``\code{_,.-}'' are never quoted.
The optional \var{addsafe} parameter specifies additional characters
that should not be quoted; its default value is \code{'/'}.

Example: \code{quote('/~connolly/')} yields \code{'/\%7econnolly/'}.
\end{funcdesc}

\begin{funcdesc}{unquote}{string}
Replace \code{\%xx} escapes by their single-character equivalent.

Example: \code{unquote('/\%7Econnolly/')} yields \code{'/~connolly/'}.
\end{funcdesc}
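
The quoting functions can be tried out with the short sketch below. It
assumes a recent Python 3 interpreter, where \code{quote} and
\code{unquote} are found in \code{urllib.parse} and the optional
parameter is spelled \code{safe} rather than \var{addsafe}. The sample
strings are invented for the example.

```python
from urllib.parse import quote, unquote  # assumption: Python 3 location

original = "/my docs/read me.txt"

# By default '/' is treated as safe, so only the spaces are escaped.
quoted = quote(original)        # '/my%20docs/read%20me.txt'

# Passing an empty set of safe characters escapes '/' as well.
strict = quote("/my docs/", safe="")  # '%2Fmy%20docs%2F'

# unquote() replaces each %xx escape by the character it stands for.
restored = unquote("/%7Econnolly/")   # '/~connolly/'
round_trip = unquote(quoted)
```

Quoting followed by unquoting returns the original string, which makes
the pair suitable for building and then decoding the path part of a URL.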

Restrictions:

\begin{itemize}

\item
Currently, only the following protocols are supported: HTTP (versions
0.9 and 1.0), Gopher (but not Gopher-+), FTP, and local files.
\index{HTTP}
\index{Gopher}
\index{FTP}

\item
The caching feature of \code{urlretrieve()} has been disabled until I
find the time to hack proper processing of Expiration time headers.

\item
There should be a function to query whether a particular URL is in
the cache.

\item
For backward compatibility, if a URL appears to point to a local file
but the file can't be opened, the URL is re-interpreted using the FTP
protocol. This can sometimes cause confusing error messages.

\item
The \code{urlopen()} and \code{urlretrieve()} functions can cause
arbitrarily long delays while waiting for a network connection to be
set up. This means that it is difficult to build an interactive
web client using these functions without using threads.

\item
The data returned by \code{urlopen()} or \code{urlretrieve()} is the
raw data returned by the server. This may be binary data (e.g. an
image), plain text or (for example) HTML. The HTTP protocol provides
type information in the reply headers; it can be inspected by
looking at the \code{Content-type} header. For the Gopher protocol,
type information is encoded in the URL; there is currently no easy way
to extract it. If the returned data is HTML, you can use the module
\code{htmllib} to parse it.
\index{HTML}
\index{HTTP}
\index{Gopher}
\stmodindex{htmllib}

\item
Although the \code{urllib} module contains (undocumented) routines to
parse and unparse URL strings, the recommended interface for URL
manipulation is in module \code{urlparse}.
\stmodindex{urlparse}

\end{itemize}
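
The \code{Content-type} inspection mentioned in the restrictions above
can be sketched as follows. The sketch assumes a recent Python 3
interpreter (\code{urllib.request}), in which the object returned by
\code{info()} offers a \code{get\_content\_type()} method and a type is
also reported for local files, apparently guessed from the file name.
The helper name \code{content\_type} and the temporary file are invented
for the example.

```python
import os
import tempfile
import urllib.request  # assumption: Python 3 location of urlopen
from pathlib import Path


def content_type(url):
    """Hypothetical helper: return the Content-type reported for url."""
    f = urllib.request.urlopen(url)
    try:
        # info() returns the header object; get_content_type() gives the
        # value of the Content-type header in lower case.
        return f.info().get_content_type()
    finally:
        f.close()


# A local .html file stands in for a document served over HTTP.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as out:
    out.write(b"<html><body>hi</body></html>\n")

kind = content_type(Path(path).as_uri())
os.remove(path)

# Decide how to treat the raw data based on the reported type.
is_html = (kind == "text/html")
```

A client would typically branch on this value before deciding whether
to hand the raw data to an HTML parser or to treat it as binary.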