\section{\module{shlex} ---
         Simple lexical analysis}

\declaremodule{standard}{shlex}
\modulesynopsis{Simple lexical analysis for \UNIX\ shell-like languages.}
\moduleauthor{Eric S. Raymond}{esr@snark.thyrsus.com}
\moduleauthor{Gustavo Niemeyer}{niemeyer@conectiva.com}
\sectionauthor{Eric S. Raymond}{esr@snark.thyrsus.com}
\sectionauthor{Gustavo Niemeyer}{niemeyer@conectiva.com}

\versionadded{1.5.2}

The \class{shlex} class makes it easy to write lexical analyzers for
simple syntaxes resembling that of the \UNIX{} shell. This is often
useful for writing minilanguages (for example, in run control
files for Python applications) or for parsing quoted strings.

\note{The \module{shlex} module currently does not support Unicode input.}

The \module{shlex} module defines the following function:

\begin{funcdesc}{split}{s\optional{, comments}}
Split the string \var{s} using shell-like syntax. If \var{comments} is
\constant{False} (the default), the parsing of comments in the given
string will be disabled (setting the \member{commenters} member of the
\class{shlex} instance to the empty string). This function operates
in \POSIX{} mode.
\versionadded{2.3}
\note{Since the \function{split()} function instantiates a \class{shlex}
      instance, passing \code{None} for \var{s} will read the string
      to split from standard input.}
\end{funcdesc}
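As an illustrative sketch (assuming a current Python 3 interpreter; the
command strings are made up for the example), the following shows how
\function{split()} handles quotes and the \var{comments} argument:

```python
import shlex

# POSIX-mode splitting: quotes group words and are stripped from the result.
args = shlex.split('cp "my file.txt" /tmp/backup')
print(args)           # ['cp', 'my file.txt', '/tmp/backup']

# Comment parsing is off by default, so '#' is an ordinary character.
no_comments = shlex.split('ls -l # list files')
print(no_comments)    # ['ls', '-l', '#', 'list', 'files']

# With comments=True, everything from '#' to the end of the line is dropped.
with_comments = shlex.split('ls -l # list files', comments=True)
print(with_comments)  # ['ls', '-l']
```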

The \module{shlex} module defines the following class:

\begin{classdesc}{shlex}{\optional{instream\optional{,
                         infile\optional{, posix}}}}
A \class{shlex} instance or subclass instance is a lexical analyzer
object. The first argument, \var{instream}, if present, specifies where to
read characters from. It must be a file-/stream-like object with
\method{read()} and \method{readline()} methods, or a string (strings
are accepted since Python 2.3). If no argument is given, input will be
taken from \code{sys.stdin}. The second optional argument is a filename
string, which sets the initial value of the \member{infile} member. If
the \var{instream} argument is omitted or equal to \code{sys.stdin},
this second argument defaults to ``stdin''. The \var{posix} argument
was introduced in Python 2.3, and defines the operational mode. When
\var{posix} is not true (the default), the \class{shlex} instance will
operate in compatibility mode. When operating in \POSIX{} mode,
\class{shlex} will try to be as close as possible to the \POSIX{} shell
parsing rules. See section~\ref{shlex-objects}.
\end{classdesc}
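The difference between the two operational modes can be sketched as
follows. This is only an illustration, assuming a current Python 3
interpreter; the input text and the file name \code{demo.rc} are
hypothetical and serve only to label the source.

```python
import shlex

text = 'name = "Guido van Rossum"'

# Compatibility (non-POSIX) mode is the default: quotes stay in the token.
compat = shlex.shlex(text, "demo.rc")
compat_tokens = list(iter(compat.get_token, compat.eof))
print(compat_tokens)   # ['name', '=', '"Guido van Rossum"']

# POSIX mode strips the quotes from the quoted word.
posix = shlex.shlex(text, "demo.rc", posix=True)
posix_tokens = list(iter(posix.get_token, posix.eof))
print(posix_tokens)    # ['name', '=', 'Guido van Rossum']

# The second argument only sets the infile member.
print(posix.infile)    # demo.rc
```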

\begin{seealso}
  \seemodule{ConfigParser}{Parser for configuration files similar to the
                           Windows \file{.ini} files.}
\end{seealso}


\subsection{shlex Objects \label{shlex-objects}}

A \class{shlex} instance has the following methods:

\begin{methoddesc}{get_token}{}
Return a token. If tokens have been stacked using
\method{push_token()}, pop a token off the stack. Otherwise, read one
from the input stream. If reading encounters an immediate
end-of-file, \member{self.eof} is returned (the empty string (\code{''})
in non-\POSIX{} mode, and \code{None} in \POSIX{} mode).
\end{methoddesc}

\begin{methoddesc}{push_token}{str}
Push the argument onto the token stack.
\end{methoddesc}
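A minimal sketch of how \method{get_token()} and \method{push_token()}
interact, assuming a current Python 3 interpreter; the input strings
are arbitrary:

```python
import shlex

lexer = shlex.shlex("alpha beta")   # non-POSIX (compatibility) mode
first = lexer.get_token()           # read from the input: 'alpha'
lexer.push_token(first)             # put it back on the token stack
again = lexer.get_token()           # popped from the stack, not re-read
second = lexer.get_token()          # 'beta'
end = lexer.get_token()             # end of input: '' in non-POSIX mode

posix_lexer = shlex.shlex("alpha", posix=True)
posix_first = posix_lexer.get_token()
posix_end = posix_lexer.get_token() # end of input: None in POSIX mode
```

Pushing a token back in this way is a simple means of implementing
one-token lookahead in a hand-written parser.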
75
Guido van Rossumd67ddbb2000-05-01 20:14:47 +000076\begin{methoddesc}{read_token}{}
77Read a raw token. Ignore the pushback stack, and do not interpret source
78requests. (This is not ordinarily a useful entry point, and is
79documented here only for the sake of completeness.)
80\end{methoddesc}
81
Fred Drake52dc76c2000-07-03 09:56:23 +000082\begin{methoddesc}{sourcehook}{filename}
83When \class{shlex} detects a source request (see
84\member{source} below) this method is given the following token as
85argument, and expected to return a tuple consisting of a filename and
86an open file-like object.
Guido van Rossumd67ddbb2000-05-01 20:14:47 +000087
Fred Drake52dc76c2000-07-03 09:56:23 +000088Normally, this method first strips any quotes off the argument. If
89the result is an absolute pathname, or there was no previous source
90request in effect, or the previous source was a stream
Fred Drakeaf785122003-12-31 05:18:46 +000091(such as \code{sys.stdin}), the result is left alone. Otherwise, if the
Fred Drake52dc76c2000-07-03 09:56:23 +000092result is a relative pathname, the directory part of the name of the
93file immediately before it on the source inclusion stack is prepended
94(this behavior is like the way the C preprocessor handles
Eric S. Raymondbd1a4892001-01-16 14:18:55 +000095\code{\#include "file.h"}).
96
97The result of the manipulations is treated as a filename, and returned
98as the first component of the tuple, with
99\function{open()} called on it to yield the second component. (Note:
100this is the reverse of the order of arguments in instance initialization!)
Guido van Rossumd67ddbb2000-05-01 20:14:47 +0000101
Fred Drake52dc76c2000-07-03 09:56:23 +0000102This hook is exposed so that you can use it to implement directory
103search paths, addition of file extensions, and other namespace hacks.
Guido van Rossumd67ddbb2000-05-01 20:14:47 +0000104There is no corresponding `close' hook, but a shlex instance will call
Fred Drake52dc76c2000-07-03 09:56:23 +0000105the \method{close()} method of the sourced input stream when it
106returns \EOF.
Eric S. Raymondbd1a4892001-01-16 14:18:55 +0000107
Fred Drake25be1932001-01-16 20:52:41 +0000108For more explicit control of source stacking, use the
109\method{push_source()} and \method{pop_source()} methods.
Eric S. Raymondbd1a4892001-01-16 14:18:55 +0000110\end{methoddesc}
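One possible way to override \method{sourcehook()} is sketched below,
assuming a current Python 3 interpreter. The subclass name
\code{IncludeLexer}, the dictionary \code{FAKE_FILES} and the file name
\code{extra.rc} are hypothetical; they stand in for a real lookup
strategy so that the example runs without touching the file system.

```python
import io
import shlex

# Hypothetical in-memory "files" used instead of the real file system.
FAKE_FILES = {"extra.rc": "beta gamma"}

class IncludeLexer(shlex.shlex):
    def sourcehook(self, filename):
        # In non-POSIX mode the token still carries its quotes.
        name = filename.strip('"')
        # Return (filename, open file-like object), as the hook requires.
        return name, io.StringIO(FAKE_FILES[name])

lexer = IncludeLexer('alpha source "extra.rc" delta')
lexer.source = "source"   # the keyword that triggers a source request

tokens = []
tok = lexer.get_token()
while tok != lexer.eof:
    tokens.append(tok)
    tok = lexer.get_token()

print(tokens)   # ['alpha', 'beta', 'gamma', 'delta']
```

The tokens of the included text appear in place of the source request,
and reading resumes in the original input once the included stream is
exhausted.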

\begin{methoddesc}{push_source}{stream\optional{, filename}}
Push an input source stream onto the input stack. If the filename
argument is specified it will later be available for use in error
messages. This is the same method used internally by the
\method{sourcehook()} method.
\versionadded{2.1}
\end{methoddesc}

\begin{methoddesc}{pop_source}{}
Pop the last-pushed input source from the input stack.
This is the same method used internally when the lexer reaches
\EOF{} on a stacked input stream.
\versionadded{2.1}
\end{methoddesc}
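The following sketch shows explicit source stacking with
\method{push_source()} and \method{pop_source()}, assuming a current
Python 3 interpreter. The stream contents and the names
\code{outer.rc} and \code{inner.rc} are made up for the illustration.

```python
import io
import shlex

lexer = shlex.shlex("outer1 outer2", "outer.rc")
seen = [lexer.get_token()]            # 'outer1'

# Stack a second stream; its name becomes the current infile.
lexer.push_source(io.StringIO("inner1"), "inner.rc")
name_while_stacked = lexer.infile     # 'inner.rc'

seen.append(lexer.get_token())        # 'inner1'
# The stacked stream is now at EOF, so it is popped automatically
# and reading continues in the outer input.
seen.append(lexer.get_token())        # 'outer2'
name_after_pop = lexer.infile         # 'outer.rc'

# pop_source() can also be called directly to discard a stacked stream.
other = shlex.shlex("a b")
other.push_source(io.StringIO("ignored"))
other.pop_source()
remaining = [other.get_token(), other.get_token()]   # ['a', 'b']
```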

\begin{methoddesc}{error_leader}{\optional{file\optional{, line}}}
This method generates an error message leader in the format of a
\UNIX{} C compiler error label; the format is \code{'"\%s", line \%d: '},
where the \samp{\%s} is replaced with the name of the current source
file and the \samp{\%d} with the current input line number (the
optional arguments can be used to override these).

This convenience is provided to encourage \module{shlex} users to
generate error messages in the standard, parseable format understood
by Emacs and other \UNIX{} tools.
\end{methoddesc}
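For example, under the assumption of a current Python 3 interpreter and
with the hypothetical file names \code{demo.rc} and \code{other.rc},
\method{error_leader()} produces labels such as these:

```python
import shlex

lexer = shlex.shlex("first\nsecond\n", "demo.rc")

# Without arguments, the current infile and lineno are used.
at_start = lexer.error_leader()
print(at_start)      # "demo.rc", line 1: 

# Both values can be overridden explicitly.
overridden = lexer.error_leader("other.rc", 7)
print(overridden)    # "other.rc", line 7: 
```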

Instances of \class{shlex} and its subclasses have some public instance
variables which either control lexical analysis or can be used for
debugging:

\begin{memberdesc}{commenters}
The string of characters that are recognized as comment beginners.
All characters from the comment beginner to end of line are ignored.
Includes just \character{\#} by default.
\end{memberdesc}

\begin{memberdesc}{wordchars}
The string of characters that will accumulate into multi-character
tokens. By default, includes all \ASCII{} alphanumerics and
underscore.
\end{memberdesc}

\begin{memberdesc}{whitespace}
Characters that will be considered whitespace and skipped. Whitespace
bounds tokens. By default, includes space, tab, linefeed and
carriage-return.
\end{memberdesc}
160
Gustavo Niemeyer68d8cef2003-04-17 21:31:33 +0000161\begin{memberdesc}{escape}
162Characters that will be considered as escape. This will be only used
Fred Drakeaa3b5d22003-04-17 21:49:04 +0000163in \POSIX{} mode, and includes just \character{\textbackslash} by default.
Gustavo Niemeyer68d8cef2003-04-17 21:31:33 +0000164\versionadded{2.3}
165\end{memberdesc}
166
Guido van Rossum5e97c9d1998-12-22 05:18:24 +0000167\begin{memberdesc}{quotes}
168Characters that will be considered string quotes. The token
169accumulates until the same quote is encountered again (thus, different
Fred Drake184e8361999-05-11 15:14:15 +0000170quote types protect each other as in the shell.) By default, includes
Fred Drake1189fa91998-12-22 18:24:13 +0000171\ASCII{} single and double quotes.
Guido van Rossum5e97c9d1998-12-22 05:18:24 +0000172\end{memberdesc}
173
Gustavo Niemeyer68d8cef2003-04-17 21:31:33 +0000174\begin{memberdesc}{escapedquotes}
175Characters in \member{quotes} that will interpret escape characters
Fred Drakeaa3b5d22003-04-17 21:49:04 +0000176defined in \member{escape}. This is only used in \POSIX{} mode, and
177includes just \character{"} by default.
Gustavo Niemeyer68d8cef2003-04-17 21:31:33 +0000178\versionadded{2.3}
179\end{memberdesc}
180
181\begin{memberdesc}{whitespace_split}
Neal Norwitz10cf2182003-04-17 23:09:08 +0000182If \code{True}, tokens will only be split in whitespaces. This is useful, for
Gustavo Niemeyer68d8cef2003-04-17 21:31:33 +0000183example, for parsing command lines with \class{shlex}, getting tokens
184in a similar way to shell arguments.
185\versionadded{2.3}
186\end{memberdesc}
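The effect of \member{wordchars} and \member{whitespace_split} on
tokenization can be sketched as follows, assuming a current Python 3
interpreter; the command line is an arbitrary example.

```python
import shlex

line = "ls -l /usr/local/bin"

# Default settings: characters outside wordchars ('-' and '/') are
# returned as single-character tokens.
default_tokens = list(shlex.shlex(line))
print(default_tokens)
# ['ls', '-', 'l', '/', 'usr', '/', 'local', '/', 'bin']

# Splitting only on whitespace keeps each argument intact.
ws_lexer = shlex.shlex(line)
ws_lexer.whitespace_split = True
ws_tokens = list(ws_lexer)
print(ws_tokens)       # ['ls', '-l', '/usr/local/bin']

# Extending wordchars gives the same result for this particular input.
wc_lexer = shlex.shlex(line)
wc_lexer.wordchars += "-/"
wc_tokens = list(wc_lexer)
print(wc_tokens)       # ['ls', '-l', '/usr/local/bin']
```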
187
Guido van Rossumd67ddbb2000-05-01 20:14:47 +0000188\begin{memberdesc}{infile}
189The name of the current input file, as initially set at class
190instantiation time or stacked by later source requests. It may
191be useful to examine this when constructing error messages.
192\end{memberdesc}
193
194\begin{memberdesc}{instream}
Fred Drake52dc76c2000-07-03 09:56:23 +0000195The input stream from which this \class{shlex} instance is reading
196characters.
Guido van Rossumd67ddbb2000-05-01 20:14:47 +0000197\end{memberdesc}
198
199\begin{memberdesc}{source}
Fred Drake52dc76c2000-07-03 09:56:23 +0000200This member is \code{None} by default. If you assign a string to it,
201that string will be recognized as a lexical-level inclusion request
202similar to the \samp{source} keyword in various shells. That is, the
203immediately following token will opened as a filename and input taken
204from that stream until \EOF, at which point the \method{close()}
205method of that stream will be called and the input source will again
206become the original input stream. Source requests may be stacked any
207number of levels deep.
Guido van Rossumd67ddbb2000-05-01 20:14:47 +0000208\end{memberdesc}

\begin{memberdesc}{debug}
If this member is numeric and \code{1} or more, a \class{shlex}
instance will print verbose progress output on its behavior. If you
need to use this, you can read the module source code to learn the
details.
\end{memberdesc}

\begin{memberdesc}{lineno}
Source line number (count of newlines seen so far plus one).
\end{memberdesc}

\begin{memberdesc}{token}
The token buffer. It may be useful to examine this when catching
exceptions.
\end{memberdesc}

\begin{memberdesc}{eof}
Token used to determine end of file. This will be set to the empty
string (\code{''}) in non-\POSIX{} mode, and to \code{None} in
\POSIX{} mode.
\versionadded{2.3}
\end{memberdesc}

\subsection{Parsing Rules\label{shlex-parsing-rules}}

When operating in non-\POSIX{} mode, \class{shlex} will try to obey
the following rules.

\begin{itemize}
\item Quote characters are not recognized within words
      (\code{Do"Not"Separate} is parsed as the single word
      \code{Do"Not"Separate});
\item Escape characters are not recognized;
\item Enclosing characters in quotes preserves the literal value of
      all characters within the quotes;
\item Closing quotes separate words (\code{"Do"Separate} is parsed
      as \code{"Do"} and \code{Separate});
\item If \member{whitespace_split} is \code{False}, any character not
      declared to be a word character, whitespace, or a quote will be
      returned as a single-character token. If it is \code{True},
      \class{shlex} will only split words on whitespace;
\item EOF is signaled with an empty string (\code{''});
\item It's not possible to parse empty strings, even if quoted.
\end{itemize}
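The non-\POSIX{} rules listed above can be checked with a short sketch
such as the following, assuming a current Python 3 interpreter. The
helper name \code{compat_tokens} is hypothetical and not part of the
module.

```python
import shlex

def compat_tokens(text, whitespace_split=False):
    """Tokenize text in non-POSIX (compatibility) mode."""
    lexer = shlex.shlex(text)          # posix defaults to False
    lexer.whitespace_split = whitespace_split
    return list(lexer)

# Quote characters inside a word do not split it and are kept.
inside_word = compat_tokens('Do"Not"Separate')    # ['Do"Not"Separate']

# A closing quote ends the token; the quotes themselves are kept.
closing_quote = compat_tokens('"Do"Separate')     # ['"Do"', 'Separate']

# Other characters become single-character tokens ...
punctuation = compat_tokens('a+b')                # ['a', '+', 'b']

# ... unless tokens are split on whitespace only.
ws_only = compat_tokens('a+b', whitespace_split=True)   # ['a+b']

# End of input is signaled with the empty string.
at_eof = shlex.shlex('').get_token()              # ''
```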
254
Fred Drakeaa3b5d22003-04-17 21:49:04 +0000255When operating in \POSIX{} mode, \class{shlex} will try to obey to the
Gustavo Niemeyer68d8cef2003-04-17 21:31:33 +0000256following parsing rules.
257
258\begin{itemize}
259\item Quotes are stripped out, and do not separate words
260 (\code{"Do"Not"Separate"} is parsed as the single word
261 \code{DoNotSeparate});
262\item Non-quoted escape characters (e.g. \character{\textbackslash})
263 preserve the literal value of the next character that follows;
264\item Enclosing characters in quotes which are not part of
265 \member{escapedquotes} (e.g. \character{'}) preserve the literal
266 value of all characters within the quotes;
267\item Enclosing characters in quotes which are part of
268 \member{escapedquotes} (e.g. \character{"}) preserves the literal
269 value of all characters within the quotes, with the exception of
270 the characters mentioned in \member{escape}. The escape characters
271 retain its special meaning only when followed by the quote in use,
272 or the escape character itself. Otherwise the escape character
273 will be considered a normal character.
Fred Drakeaf785122003-12-31 05:18:46 +0000274\item EOF is signaled with a \constant{None} value;
Fred Drakeaa3b5d22003-04-17 21:49:04 +0000275\item Quoted empty strings (\code{''}) are allowed;
Gustavo Niemeyer68d8cef2003-04-17 21:31:33 +0000276\end{itemize}
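The \POSIX{} rules listed above can be illustrated in the same way,
again assuming a current Python 3 interpreter. The helper name
\code{posix_tokens} is hypothetical and not part of the module.

```python
import shlex

def posix_tokens(text):
    """Tokenize text in POSIX mode."""
    return list(shlex.shlex(text, posix=True))

# Quotes are stripped and do not separate words.
joined = posix_tokens('"Do"Not"Separate"')        # ['DoNotSeparate']

# An unquoted backslash preserves the next character literally.
escaped_space = posix_tokens(r'a\ b')             # ['a b']

# Single quotes are not in escapedquotes: the backslash stays literal.
single_quoted = posix_tokens(r"'C:\dir'")         # ['C:\\dir']

# Inside double quotes the backslash is special only before the quote
# in use or another backslash; before 'n' it is kept as an ordinary
# character.
double_quoted = posix_tokens(r'"say \"hi\" \n"')  # ['say "hi" \\n']

# Quoted empty strings yield an empty token.
with_empty = posix_tokens("a '' b")               # ['a', '', 'b']

# End of input is signaled with None.
at_eof = shlex.shlex('', posix=True).get_token()  # None
```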