\section{\module{shlex} ---
         Simple lexical analysis}

\declaremodule{standard}{shlex}
\modulesynopsis{Simple lexical analysis for \UNIX\ shell-like languages.}
\moduleauthor{Eric S. Raymond}{esr@snark.thyrsus.com}
\moduleauthor{Gustavo Niemeyer}{niemeyer@conectiva.com}
\sectionauthor{Eric S. Raymond}{esr@snark.thyrsus.com}
\sectionauthor{Gustavo Niemeyer}{niemeyer@conectiva.com}

\versionadded{1.5.2}

The \class{shlex} class makes it easy to write lexical analyzers for
simple syntaxes resembling that of the \UNIX{} shell. This will often
be useful for writing minilanguages (for example, in run control
files for Python applications) or for parsing quoted strings.

\note{The \module{shlex} module currently does not support Unicode input.}

The \module{shlex} module defines the following function:

\begin{funcdesc}{split}{s\optional{, comments}}
Split the string \var{s} using shell-like syntax. If \var{comments} is
\constant{False} (the default), the parsing of comments in the given
string will be disabled (setting the \member{commenters} member of the
\class{shlex} instance to the empty string). This function operates
in \POSIX{} mode.
\versionadded{2.3}
\end{funcdesc}
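As an illustration of \function{split()}, the following minimal sketch
shows quoted arguments being kept together and the effect of the
\var{comments} flag. The command strings are made up for the example,
and the \code{print()} call syntax assumes a reasonably recent Python.

```python
import shlex

# Quoted substrings stay together; in POSIX mode the quotes are removed.
args = shlex.split('cp "my file.txt" /tmp/backup')
print(args)             # ['cp', 'my file.txt', '/tmp/backup']

# By default comment parsing is disabled, so '#' is an ordinary token.
no_comments = shlex.split('ls -l # list files')
print(no_comments)      # ['ls', '-l', '#', 'list', 'files']

# With comments=True, everything from '#' to end of line is dropped.
with_comments = shlex.split('ls -l # list files', comments=True)
print(with_comments)    # ['ls', '-l']
```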

The \module{shlex} module defines the following class:

\begin{classdesc}{shlex}{\optional{instream\optional{,
                         infile\optional{, posix}}}}
A \class{shlex} instance or subclass instance is a lexical analyzer
object. The first initialization argument, if present, specifies where to
read characters from. It must be a file-/stream-like object with
\method{read()} and \method{readline()} methods, or a string (strings
are accepted since Python 2.3). If no argument is given, input will be
taken from \code{sys.stdin}. The second optional argument is a filename
string, which sets the initial value of the \member{infile} member. If
the \var{instream} argument is omitted or equal to \code{sys.stdin},
this second argument defaults to ``stdin''. The \var{posix} argument
was introduced in Python 2.3, and defines the operational mode. When
\var{posix} is not true (default), the \class{shlex} instance will
operate in compatibility mode. When operating in \POSIX{} mode,
\class{shlex} will try to be as close as possible to the \POSIX{} shell
parsing rules. See section~\ref{shlex-parsing-rules}.
\end{classdesc}
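The difference between the two operational modes can be sketched as
follows. The helper \code{read_all()} and the input text are
hypothetical names introduced only for this example; the lexer is
constructed from a string, which assumes Python 2.3 or later.

```python
import shlex

text = 'alpha = "beta gamma"; delta'

def read_all(lexer):
    """Collect tokens until the lexer returns its end-of-file marker."""
    tokens = []
    while True:
        tok = lexer.get_token()
        if tok == lexer.eof:
            break
        tokens.append(tok)
    return tokens

# Compatibility (non-POSIX) mode: the quotes stay part of the token.
compat_tokens = read_all(shlex.shlex(text))
# ['alpha', '=', '"beta gamma"', ';', 'delta']

# POSIX mode: the quotes are stripped from the token.
posix_tokens = read_all(shlex.shlex(text, posix=True))
# ['alpha', '=', 'beta gamma', ';', 'delta']
```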

\begin{seealso}
  \seemodule{ConfigParser}{Parser for configuration files similar to the
                           Windows \file{.ini} files.}
\end{seealso}


\subsection{shlex Objects \label{shlex-objects}}

A \class{shlex} instance has the following methods:

\begin{methoddesc}{get_token}{}
Return a token. If tokens have been stacked using
\method{push_token()}, pop a token off the stack. Otherwise, read one
from the input stream. If reading encounters an immediate
end-of-file, \member{self.eof} is returned (the empty string (\code{''})
in non-\POSIX{} mode, and \code{None} in \POSIX{} mode).
\end{methoddesc}

\begin{methoddesc}{push_token}{str}
Push the argument onto the token stack.
\end{methoddesc}
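A common use of \method{get_token()} together with \method{push_token()}
is one-token lookahead. The following is a small sketch with made-up
input, not a prescribed usage pattern:

```python
import shlex

lexer = shlex.shlex('one two three')

first = lexer.get_token()       # read from the input stream
lexer.push_token(first)         # put it back on the token stack
again = lexer.get_token()       # popped off the stack, not re-read
following = lexer.get_token()   # the stream continues where it left off
```

Here \code{first} and \code{again} are both \code{'one'}, and
\code{following} is \code{'two'}.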

\begin{methoddesc}{read_token}{}
Read a raw token. Ignore the pushback stack, and do not interpret source
requests. (This is not ordinarily a useful entry point, and is
documented here only for the sake of completeness.)
\end{methoddesc}

\begin{methoddesc}{sourcehook}{filename}
When \class{shlex} detects a source request (see
\member{source} below) this method is given the following token as
argument, and is expected to return a tuple consisting of a filename and
an open file-like object.

Normally, this method first strips any quotes off the argument. If
the result is an absolute pathname, or there was no previous source
request in effect, or the previous source was a stream
(such as \code{sys.stdin}), the result is left alone. Otherwise, if the
result is a relative pathname, the directory part of the name of the
file immediately before it on the source inclusion stack is prepended
(this behavior is like the way the C preprocessor handles
\code{\#include "file.h"}).

The result of the manipulations is treated as a filename, and returned
as the first component of the tuple, with
\function{open()} called on it to yield the second component. (Note:
this is the reverse of the order of arguments in instance initialization!)

This hook is exposed so that you can use it to implement directory
search paths, addition of file extensions, and other namespace hacks.
There is no corresponding `close' hook, but a \class{shlex} instance will
call the \method{close()} method of the sourced input stream when it
returns \EOF.

For more explicit control of source stacking, use the
\method{push_source()} and \method{pop_source()} methods.
\end{methoddesc}
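One way to use this hook is to override it in a subclass. The sketch
below resolves source requests against an in-memory table instead of
the file system. The class \code{MemoryLexer}, the table \code{FILES},
the keyword \code{include} and the file name are all hypothetical, and
\code{io.StringIO} assumes a Python version that provides it for
ordinary strings.

```python
import io
import shlex

# Hypothetical "files", kept in memory so the example is self-contained.
FILES = {'inc.cfg': 'b c'}

class MemoryLexer(shlex.shlex):
    def sourcehook(self, newfile):
        # In non-POSIX mode the token still carries its quotes.
        name = newfile.strip('"')
        # Return (filename, open file-like object), as the hook requires.
        return name, io.StringIO(FILES[name])

lexer = MemoryLexer('a include "inc.cfg" d')
lexer.source = 'include'    # the token that triggers a source request

tokens = list(lexer)        # ['a', 'b', 'c', 'd']
```

The included stream is read to its end and closed, after which lexing
resumes in the original input.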

\begin{methoddesc}{push_source}{stream\optional{, filename}}
Push an input source stream onto the input stack. If the filename
argument is specified it will later be available for use in error
messages. This is the same method used internally by the
\method{sourcehook} method.
\versionadded{2.1}
\end{methoddesc}

\begin{methoddesc}{pop_source}{}
Pop the last-pushed input source from the input stack.
This is the same method used internally when the lexer reaches
\EOF{} on a stacked input stream.
\versionadded{2.1}
\end{methoddesc}
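Explicit source stacking might look like the following sketch. The
stream contents and the name \code{inner.txt} are invented for the
example, and \code{io.StringIO} is assumed to be available for ordinary
strings.

```python
import io
import shlex

lexer = shlex.shlex('outer1 outer2')
first = lexer.get_token()               # 'outer1'

# Stack a second stream; its name is recorded in lexer.infile.
lexer.push_source(io.StringIO('inner1 inner2'), 'inner.txt')
name_while_stacked = lexer.infile       # 'inner.txt'

# The stacked stream is drained first and popped at EOF, then the
# original stream resumes.
rest = list(lexer)                      # ['inner1', 'inner2', 'outer2']
```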

\begin{methoddesc}{error_leader}{\optional{file\optional{, line}}}
This method generates an error message leader in the format of a
\UNIX{} C compiler error label; the format is \code{'"\%s", line \%d: '},
where the \samp{\%s} is replaced with the name of the current source
file and the \samp{\%d} with the current input line number (the
optional arguments can be used to override these).

This convenience is provided to encourage \module{shlex} users to
generate error messages in the standard, parseable format understood
by Emacs and other \UNIX{} tools.
\end{methoddesc}
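For instance, with a lexer reading from a string and a made-up file
name (\code{settings.rc} and \code{other.rc} are placeholders), the
leader could be produced like this:

```python
import shlex

# The second argument sets the initial value of lexer.infile.
lexer = shlex.shlex('x = 1\ny = 2\n', 'settings.rc')
lexer.get_token()                            # read 'x' from line 1

leader = lexer.error_leader()                # '"settings.rc", line 1: '
custom = lexer.error_leader('other.rc', 7)   # '"other.rc", line 7: '
```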

Instances of \class{shlex} subclasses have some public instance
variables which either control lexical analysis or can be used for
debugging:

\begin{memberdesc}{commenters}
The string of characters that are recognized as comment beginners.
All characters from the comment beginner to end of line are ignored.
Includes just \character{\#} by default.
\end{memberdesc}

\begin{memberdesc}{wordchars}
The string of characters that will accumulate into multi-character
tokens. By default, includes all \ASCII{} alphanumerics and
underscore.
\end{memberdesc}

\begin{memberdesc}{whitespace}
Characters that will be considered whitespace and skipped. Whitespace
bounds tokens. By default, includes space, tab, linefeed and
carriage-return.
\end{memberdesc}
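These members can be adjusted on an instance before any tokens are
read. The sketch below, using an invented input line, extends
\member{wordchars} so that path separators accumulate into words and
replaces the comment character:

```python
import shlex

text = 'path=/usr/local/bin ; trailing note'

# Defaults: '/' and ';' are not word characters, so each one becomes a
# single-character token, and ';' does not start a comment.
default_tokens = list(shlex.shlex(text))

lexer = shlex.shlex(text)
lexer.wordchars += '/'      # let '/' accumulate into words
lexer.commenters = ';'      # treat ';' (instead of '#') as a comment starter
custom_tokens = list(lexer) # ['path', '=', '/usr/local/bin']
```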

\begin{memberdesc}{escape}
Characters that will be considered as escape. This is only used
in \POSIX{} mode, and includes just \character{\textbackslash} by default.
\versionadded{2.3}
\end{memberdesc}

\begin{memberdesc}{quotes}
Characters that will be considered string quotes. The token
accumulates until the same quote is encountered again (thus, different
quote types protect each other as in the shell). By default, includes
\ASCII{} single and double quotes.
\end{memberdesc}

\begin{memberdesc}{escapedquotes}
Characters in \member{quotes} that will interpret escape characters
defined in \member{escape}. This is only used in \POSIX{} mode, and
includes just \character{"} by default.
\versionadded{2.3}
\end{memberdesc}

\begin{memberdesc}{whitespace_split}
If \code{True}, tokens will only be split on whitespace. This is useful, for
example, for parsing command lines with \class{shlex}, getting tokens
in a similar way to shell arguments.
\versionadded{2.3}
\end{memberdesc}
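The effect of \member{whitespace_split} on a command line can be
sketched as follows (the command itself is only an illustration):

```python
import shlex

text = "grep -i 'hello world' /var/log/app.log"

# Without whitespace_split, characters such as '-' and '/' are returned
# as separate single-character tokens.
plain = list(shlex.shlex(text, posix=True))

# With whitespace_split, tokens resemble shell arguments.
lexer = shlex.shlex(text, posix=True)
lexer.whitespace_split = True
argv = list(lexer)   # ['grep', '-i', 'hello world', '/var/log/app.log']
```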

\begin{memberdesc}{infile}
The name of the current input file, as initially set at class
instantiation time or stacked by later source requests. It may
be useful to examine this when constructing error messages.
\end{memberdesc}

\begin{memberdesc}{instream}
The input stream from which this \class{shlex} instance is reading
characters.
\end{memberdesc}

\begin{memberdesc}{source}
This member is \code{None} by default. If you assign a string to it,
that string will be recognized as a lexical-level inclusion request
similar to the \samp{source} keyword in various shells. That is, the
immediately following token will be opened as a filename and input taken
from that stream until \EOF, at which point the \method{close()}
method of that stream will be called and the input source will again
become the original input stream. Source requests may be stacked any
number of levels deep.
\end{memberdesc}

\begin{memberdesc}{debug}
If this member is numeric and \code{1} or more, a \class{shlex}
instance will print verbose progress output on its behavior. If you
need to use this, you can read the module source code to learn the
details.
\end{memberdesc}

\begin{memberdesc}{lineno}
Source line number (count of newlines seen so far plus one).
\end{memberdesc}

\begin{memberdesc}{token}
The token buffer. It may be useful to examine this when catching
exceptions.
\end{memberdesc}

\begin{memberdesc}{eof}
Token used to determine end of file. This will be set to the empty
string (\code{''}) in non-\POSIX{} mode, and to \code{None} in
\POSIX{} mode.
\versionadded{2.3}
\end{memberdesc}
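Because the end-of-file marker differs between modes, a loop that
compares each token against \member{eof} works in both. A brief sketch
on empty input:

```python
import shlex

compat = shlex.shlex('')               # non-POSIX mode
posix = shlex.shlex('', posix=True)    # POSIX mode

compat_eof = compat.get_token()        # '' (equal to compat.eof)
posix_eof = posix.get_token()          # None (equal to posix.eof)
```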

\subsection{Parsing Rules\label{shlex-parsing-rules}}

When operating in non-\POSIX{} mode, \class{shlex} will try to obey
the following rules.

\begin{itemize}
\item Quote characters are not recognized within words
      (\code{Do"Not"Separate} is parsed as the single word
      \code{Do"Not"Separate});
\item Escape characters are not recognized;
\item Enclosing characters in quotes preserve the literal value of
      all characters within the quotes;
\item Closing quotes separate words (\code{"Do"Separate} is parsed
      as \code{"Do"} and \code{Separate});
\item If \member{whitespace_split} is \code{False}, any character not
      declared to be a word character, whitespace, or a quote will be
      returned as a single-character token. If it is \code{True},
      \class{shlex} will only split words on whitespace;
\item \EOF{} is signaled with an empty string (\code{''});
\item It's not possible to parse empty strings, even if quoted.
\end{itemize}
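The non-\POSIX{} rules above can be checked with a short sketch. The
helper \code{tokens()} is a hypothetical convenience function, not part
of the module:

```python
import shlex

def tokens(text, whitespace_split=False):
    lexer = shlex.shlex(text)          # non-POSIX (compatibility) mode
    lexer.whitespace_split = whitespace_split
    return list(lexer)

inside_word = tokens('Do"Not"Separate')     # ['Do"Not"Separate']
closing_quote = tokens('"Do"Separate')      # ['"Do"', 'Separate']

# The backslash is not an escape here, just a single-character token.
no_escape = tokens(r'a\b')                  # ['a', '\\', 'b']

punct = tokens('a+b')                              # ['a', '+', 'b']
punct_ws = tokens('a+b', whitespace_split=True)    # ['a+b']
```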

When operating in \POSIX{} mode, \class{shlex} will try to obey the
following parsing rules.

\begin{itemize}
\item Quotes are stripped out, and do not separate words
      (\code{"Do"Not"Separate"} is parsed as the single word
      \code{DoNotSeparate});
\item Non-quoted escape characters (e.g. \character{\textbackslash})
      preserve the literal value of the next character that follows;
\item Enclosing characters in quotes which are not part of
      \member{escapedquotes} (e.g. \character{'}) preserve the literal
      value of all characters within the quotes;
\item Enclosing characters in quotes which are part of
      \member{escapedquotes} (e.g. \character{"}) preserve the literal
      value of all characters within the quotes, with the exception of
      the characters mentioned in \member{escape}. The escape characters
      retain their special meaning only when followed by the quote in use,
      or the escape character itself. Otherwise the escape character
      will be considered a normal character;
\item \EOF{} is signaled with a \constant{None} value;
\item Quoted empty strings (\code{''}) are allowed.
\end{itemize}
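The \POSIX{} mode rules can be sketched in the same way. Again,
\code{tokens()} is a hypothetical helper introduced only for this
example, and raw string literals are used so that the backslashes reach
the lexer unchanged:

```python
import shlex

def tokens(text):
    return list(shlex.shlex(text, posix=True))

joined = tokens('"Do"Not"Separate"')    # ['DoNotSeparate']

# An unquoted backslash preserves the following space.
escaped_space = tokens(r'a\ b')         # ['a b']

# Single quotes are not in escapedquotes: the backslash stays literal.
single_quoted = tokens(r"'a\b'")        # ['a\\b']

# Inside double quotes the backslash escapes only the quote or itself.
double_quote = tokens(r'"a\"b"')        # ['a"b']
double_other = tokens(r'"a\nb"')        # ['a\\nb'], backslash kept

# A quoted empty string is a real (empty) token, distinct from None.
with_empty = tokens("a '' b")           # ['a', '', 'b']
```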