| \section{\module{robotparser} --- |
| Parser for robots.txt} |
| |
| \declaremodule{standard}{robotparser} |
| \modulesynopsis{Loads a \protect\file{robots.txt} file and |
| answers questions about fetchability of other URLs.} |
| \sectionauthor{Skip Montanaro}{skip@mojam.com} |
| |
| \index{WWW} |
| \index{World Wide Web} |
| \index{URL} |
| \index{robots.txt} |
| |
This module provides a single class, \class{RobotFileParser}, which answers
questions about whether a particular user agent can fetch a URL on the
Web site that published the \file{robots.txt} file. For more details on
| the structure of \file{robots.txt} files, see |
| \url{http://info.webcrawler.com/mak/projects/robots/norobots.html}. |
| |
| \begin{classdesc}{RobotFileParser}{} |
| |
| This class provides a set of methods to read, parse and answer questions |
| about a single \file{robots.txt} file. |
| |
| \begin{methoddesc}{set_url}{url} |
| Sets the URL referring to a \file{robots.txt} file. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{read}{} |
| Reads the \file{robots.txt} URL and feeds it to the parser. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{parse}{lines} |
Parses \var{lines}, a sequence of lines from a \file{robots.txt} file.
| \end{methoddesc} |
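
If the file has already been retrieved by other means, its lines can be
passed to \method{parse()} directly. A minimal sketch, using
\module{urllib} here only as a stand-in for whatever actually fetched the
file:

\begin{verbatim}
>>> import robotparser, urllib
>>> rp = robotparser.RobotFileParser()
>>> lines = urllib.urlopen("http://www.musi-cal.com/robots.txt").readlines()
>>> rp.parse(lines)
\end{verbatim}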
| |
| \begin{methoddesc}{can_fetch}{useragent, url} |
| Returns \code{True} if the \var{useragent} is allowed to fetch the \var{url} |
| according to the rules contained in the parsed \file{robots.txt} file. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{mtime}{} |
Returns the time the \file{robots.txt} file was last fetched. This is
useful for long-running web spiders that need to check for new
\file{robots.txt} files periodically; see the second example at the end
of this section.
| \end{methoddesc} |
| |
| \begin{methoddesc}{modified}{} |
Sets the time the \file{robots.txt} file was last fetched to the current
time.
| \end{methoddesc} |
| |
| \end{classdesc} |
| |
The following example demonstrates basic use of the \class{RobotFileParser}
class.
| |
| \begin{verbatim} |
| >>> import robotparser |
| >>> rp = robotparser.RobotFileParser() |
| >>> rp.set_url("http://www.musi-cal.com/robots.txt") |
| >>> rp.read() |
| >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco") |
| False |
| >>> rp.can_fetch("*", "http://www.musi-cal.com/") |
| True |
| \end{verbatim} |
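
For a long-running spider, \method{mtime()} and \method{modified()} can be
combined into a simple staleness check. The sketch below is illustrative
only: the one-hour limit and the \code{fetch_allowed()} wrapper are
arbitrary choices, not part of the module, and \method{modified()} is
called explicitly to record each fetch time.

\begin{verbatim}
import robotparser
import time

MAX_AGE = 3600    # seconds before re-fetching robots.txt; arbitrary

rp = robotparser.RobotFileParser()
rp.set_url("http://www.musi-cal.com/robots.txt")
rp.read()
rp.modified()     # explicitly record when the file was fetched

def fetch_allowed(useragent, url):
    # Re-read robots.txt if the cached copy is too old.
    if time.time() - rp.mtime() > MAX_AGE:
        rp.read()
        rp.modified()
    return rp.can_fetch(useragent, url)
\end{verbatim}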