\section{\module{shlex} ---
         Simple lexical analysis}

\declaremodule{standard}{shlex}
\modulesynopsis{Simple lexical analysis for \UNIX\ shell-like languages.}
\moduleauthor{Eric S. Raymond}{esr@snark.thyrsus.com}
\moduleauthor{Gustavo Niemeyer}{niemeyer@conectiva.com}
\sectionauthor{Eric S. Raymond}{esr@snark.thyrsus.com}
\sectionauthor{Gustavo Niemeyer}{niemeyer@conectiva.com}

\versionadded{1.5.2}

The \class{shlex} class makes it easy to write lexical analyzers for
simple syntaxes resembling that of the \UNIX{} shell.  This is often
useful for writing minilanguages (for example, in run control
files for Python applications) or for parsing quoted strings.

The \module{shlex} module defines the following function:

\begin{funcdesc}{split}{s\optional{, comments}}
Split the string \var{s} using shell-like syntax.  If \var{comments} is
\constant{False} (the default), the parsing of comments in the given
string is disabled (by setting the \member{commenters} member of the
\class{shlex} instance to the empty string).  This function operates
in \POSIX{} mode.
\versionadded{2.3}
\end{funcdesc}
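
As a minimal sketch of \function{split()}, the following splits a
made-up command line (the string itself is only an illustration) with
and without comment parsing:

```python
import shlex

# Hypothetical command line used only for illustration.
command = 'copy "my file.txt" backup/ # keep a copy'

# Comment parsing is off by default, so '#' and the words after it
# come back as ordinary tokens.
plain = shlex.split(command)

# With a true value for comments, everything from '#' to the end of
# the line is dropped.
stripped = shlex.split(command, comments=True)

print(plain)
print(stripped)
```

Because the function operates in \POSIX{} mode, the double quotes group
\code{my file.txt} into one token and are not part of the result.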

The \module{shlex} module defines the following class:

\begin{classdesc}{shlex}{\optional{instream\optional{,
                  infile\optional{, posix}}}}
A \class{shlex} instance or subclass instance is a lexical analyzer
object.  The \var{instream} argument, if present, specifies where to
read characters from.  It must be a file-/stream-like object with
\method{read()} and \method{readline()} methods, or a string (strings
are accepted since Python 2.3).  If no argument is given, input will be
taken from \code{sys.stdin}.  The second optional argument is a filename
string, which sets the initial value of the \member{infile} member.  If
the \var{instream} argument is omitted or equal to \code{sys.stdin},
this second argument defaults to ``stdin''.  The \var{posix} argument
was introduced in Python 2.3, and defines the operational mode.  When
\var{posix} is not true (the default), the \class{shlex} instance will
operate in compatibility mode.  When operating in \POSIX{} mode,
\class{shlex} will try to be as close as possible to the \POSIX{} shell
parsing rules.  See section~\ref{shlex-objects}.
\end{classdesc}
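
The two operational modes can be compared with a short sketch.  The
input string is an arbitrary example, and collecting tokens with
\code{iter()} and the \member{eof} sentinel is just one convenient way
to drain a lexer:

```python
import shlex

text = 'name "John Smith"'

compat = shlex.shlex(text)              # compatibility mode (the default)
posix = shlex.shlex(text, posix=True)   # POSIX mode

# Call get_token() repeatedly until the lexer returns its eof value.
compat_tokens = list(iter(compat.get_token, compat.eof))
posix_tokens = list(iter(posix.get_token, posix.eof))

print(compat_tokens)
print(posix_tokens)
```

In compatibility mode the quote characters stay in the token; in
\POSIX{} mode they are stripped.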

\begin{seealso}
  \seemodule{ConfigParser}{Parser for configuration files similar to the
                           Windows \file{.ini} files.}
\end{seealso}


\subsection{shlex Objects \label{shlex-objects}}

A \class{shlex} instance has the following methods:

\begin{methoddesc}{get_token}{}
Return a token.  If tokens have been stacked using
\method{push_token()}, pop a token off the stack.  Otherwise, read one
from the input stream.  If reading encounters an immediate
end-of-file, \member{self.eof} is returned (the empty string (\code{''})
in non-\POSIX{} mode, and \code{None} in \POSIX{} mode).
\end{methoddesc}
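
A typical read loop, sketched here with an arbitrary input string,
calls \method{get_token()} until it returns the \member{eof} value:

```python
import shlex

lexer = shlex.shlex("width = 80")

tokens = []
while True:
    token = lexer.get_token()
    if token == lexer.eof:   # '' in compatibility mode, None in POSIX mode
        break
    tokens.append(token)

print(tokens)
```

Comparing against \member{eof} rather than a literal keeps the loop
valid in both modes.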

\begin{methoddesc}{push_token}{str}
Push the argument onto the token stack.
\end{methoddesc}
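
One possible use of \method{push_token()} is one-token lookahead.  The
sketch below (variable names are illustrative) reads a token, pushes it
back, and reads it again:

```python
import shlex

lexer = shlex.shlex("alpha beta")

peeked = lexer.get_token()     # look at the next token...
lexer.push_token(peeked)       # ...then put it back on the token stack
again = lexer.get_token()      # the pushed token is returned first
following = lexer.get_token()  # then reading resumes from the input

print(peeked, again, following)
```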

\begin{methoddesc}{read_token}{}
Read a raw token.  Ignore the pushback stack, and do not interpret source
requests.  (This is not ordinarily a useful entry point, and is
documented here only for the sake of completeness.)
\end{methoddesc}

\begin{methoddesc}{sourcehook}{filename}
When \class{shlex} detects a source request (see
\member{source} below) this method is given the following token as
argument, and is expected to return a tuple consisting of a filename and
an open file-like object.

Normally, this method first strips any quotes off the argument.  If
the result is an absolute pathname, or there was no previous source
request in effect, or the previous source was a stream
(such as \code{sys.stdin}), the result is left alone.  Otherwise, if the
result is a relative pathname, the directory part of the name of the
file immediately before it on the source inclusion stack is prepended
(this behavior is like the way the C preprocessor handles
\code{\#include "file.h"}).

The result of the manipulations is treated as a filename, and returned
as the first component of the tuple, with
\function{open()} called on it to yield the second component.  (Note:
this is the reverse of the order of arguments in instance initialization!)

This hook is exposed so that you can use it to implement directory
search paths, addition of file extensions, and other namespace hacks.
There is no corresponding `close' hook, but a \class{shlex} instance will
call the \method{close()} method of the sourced input stream when it
returns \EOF.

For more explicit control of source stacking, use the
\method{push_source()} and \method{pop_source()} methods.
\end{methoddesc}
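
As a sketch of overriding the hook, the subclass below resolves source
requests from an in-memory table instead of the file system.  The class
name \code{DictSourceLexer}, the \code{SNIPPETS} table and the
\samp{source} keyword are assumptions made for this example; the
example is written for current Python releases, where
\code{io.StringIO} is available.

```python
import io
import shlex

# Hypothetical table standing in for files on disk.
SNIPPETS = {"common": "alpha beta"}

class DictSourceLexer(shlex.shlex):
    """Illustrative lexer that looks up sourced names in SNIPPETS."""

    def sourcehook(self, name):
        name = name.strip('"')   # drop quotes, as the default hook does
        # Return (filename, open file-like object), in that order.
        return name, io.StringIO(SNIPPETS[name])

lexer = DictSourceLexer("start source common end")
lexer.source = "source"          # token that triggers a source request

tokens = list(iter(lexer.get_token, lexer.eof))
print(tokens)
```

The tokens of the sourced text appear in place of the request, and
reading then continues with the original input.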

\begin{methoddesc}{push_source}{stream\optional{, filename}}
Push an input source stream onto the input stack.  If the filename
argument is specified it will later be available for use in error
messages.  This is the same method used internally by the
\method{sourcehook} method.
\versionadded{2.1}
\end{methoddesc}

\begin{methoddesc}{pop_source}{}
Pop the last-pushed input source from the input stack.
This is the same method used internally when the lexer reaches
\EOF{} on a stacked input stream.
\versionadded{2.1}
\end{methoddesc}
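
The following sketch stacks a second input stream by hand.  The stream
contents and the name \file{extra.conf} are made up for the example; it
assumes a current Python release for \code{io.StringIO}.

```python
import io
import shlex

lexer = shlex.shlex("one two")
first = lexer.get_token()

# Stack a new stream; its name becomes the current value of infile.
lexer.push_source(io.StringIO("inserted"), "extra.conf")
name_while_stacked = lexer.infile
second = lexer.get_token()

# At end-of-file on the stacked stream the lexer pops it automatically
# and resumes reading the original input.
third = lexer.get_token()

print(first, second, third)
```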

\begin{methoddesc}{error_leader}{\optional{file\optional{, line}}}
This method generates an error message leader in the format of a
\UNIX{} C compiler error label; the format is \code{'"\%s", line \%d: '},
where the \samp{\%s} is replaced with the name of the current source
file and the \samp{\%d} with the current input line number (the
optional arguments can be used to override these).

This convenience is provided to encourage \module{shlex} users to
generate error messages in the standard, parseable format understood
by Emacs and other \UNIX{} tools.
\end{methoddesc}
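
A small sketch of both call styles follows; the file names and the
error text are invented for the example.

```python
import shlex

# The second constructor argument sets the initial value of infile.
lexer = shlex.shlex("mode fast\nlevel ?\n", "settings.rc")

message = None
for token in iter(lexer.get_token, lexer.eof):
    if token == "?":
        # Uses the current source name and line number.
        message = lexer.error_leader() + "unexpected '?'"

# Both values can also be overridden explicitly.
explicit = lexer.error_leader("other.rc", 7)

print(message)
print(explicit)
```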

Instances of \class{shlex} subclasses have some public instance
variables which either control lexical analysis or can be used for
debugging:

\begin{memberdesc}{commenters}
The string of characters that are recognized as comment beginners.
All characters from the comment beginner to end of line are ignored.
Includes just \character{\#} by default.
\end{memberdesc}
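
For instance, a syntax that also uses \character{;} for comments could
be handled as sketched below (the input text is illustrative):

```python
import shlex

lexer = shlex.shlex("alpha ; rest of line\nbeta # also ignored\ngamma")
lexer.commenters = "#;"   # treat both '#' and ';' as comment beginners

tokens = list(iter(lexer.get_token, lexer.eof))
print(tokens)
```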

\begin{memberdesc}{wordchars}
The string of characters that will accumulate into multi-character
tokens.  By default, includes all \ASCII{} alphanumerics and
underscore.
\end{memberdesc}
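
The effect of extending \member{wordchars} can be sketched as follows,
using a file-name-like string chosen for illustration:

```python
import shlex

default = shlex.shlex("log-file.txt")
default_tokens = list(iter(default.get_token, default.eof))

extended = shlex.shlex("log-file.txt")
extended.wordchars += "-."   # let '-' and '.' accumulate into words
extended_tokens = list(iter(extended.get_token, extended.eof))

print(default_tokens)
print(extended_tokens)
```

With the default setting the hyphen and the dot are returned as
separate single-character tokens.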

\begin{memberdesc}{whitespace}
Characters that will be considered whitespace and skipped.  Whitespace
bounds tokens.  By default, includes space, tab, linefeed and
carriage-return.
\end{memberdesc}

\begin{memberdesc}{escape}
Characters that will be considered escape characters.  This is used only
in \POSIX{} mode, and includes just \character{\textbackslash} by default.
\versionadded{2.3}
\end{memberdesc}

\begin{memberdesc}{quotes}
Characters that will be considered string quotes.  The token
accumulates until the same quote is encountered again (thus, different
quote types protect each other as in the shell).  By default, includes
\ASCII{} single and double quotes.
\end{memberdesc}

\begin{memberdesc}{escapedquotes}
Characters in \member{quotes} that will interpret escape characters
defined in \member{escape}.  This is only used in \POSIX{} mode, and
includes just \character{"} by default.
\versionadded{2.3}
\end{memberdesc}
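
The difference between the two default quote characters in \POSIX{}
mode can be sketched with \function{split()}; the sample strings are
arbitrary:

```python
import shlex

# The double quote is in escapedquotes, so a backslash inside it
# escapes the quote character.
double = shlex.split(r'"say \"hi\""')

# The single quote is not in escapedquotes, so the backslashes are
# kept as ordinary characters.
single = shlex.split(r"'say \"hi\"'")

print(double)
print(single)
```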

\begin{memberdesc}{whitespace_split}
If \code{True}, tokens will only be split on whitespace.  This is useful, for
example, for parsing command lines with \class{shlex}, getting tokens
in a similar way to shell arguments.
\versionadded{2.3}
\end{memberdesc}
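
A brief sketch of the difference, using an illustrative command line:

```python
import shlex

line = "ls -l /tmp/my-dir"

default = shlex.shlex(line)
default_tokens = list(iter(default.get_token, default.eof))

splitter = shlex.shlex(line)
splitter.whitespace_split = True   # split only on whitespace
split_tokens = list(iter(splitter.get_token, splitter.eof))

print(default_tokens)
print(split_tokens)
```

Without \member{whitespace_split}, characters such as \character{-} and
\character{/} are returned as separate single-character tokens.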

\begin{memberdesc}{infile}
The name of the current input file, as initially set at class
instantiation time or stacked by later source requests.  It may
be useful to examine this when constructing error messages.
\end{memberdesc}

\begin{memberdesc}{instream}
The input stream from which this \class{shlex} instance is reading
characters.
\end{memberdesc}

\begin{memberdesc}{source}
This member is \code{None} by default.  If you assign a string to it,
that string will be recognized as a lexical-level inclusion request
similar to the \samp{source} keyword in various shells.  That is, the
immediately following token will be opened as a filename and input taken
from that stream until \EOF, at which point the \method{close()}
method of that stream will be called and the input source will again
become the original input stream.  Source requests may be stacked any
number of levels deep.
\end{memberdesc}

\begin{memberdesc}{debug}
If this member is numeric and \code{1} or more, a \class{shlex}
instance will print verbose progress output on its behavior.  If you
need to use this, you can read the module source code to learn the
details.
\end{memberdesc}

\begin{memberdesc}{lineno}
Source line number (count of newlines seen so far plus one).
\end{memberdesc}

\begin{memberdesc}{token}
The token buffer.  It may be useful to examine this when catching
exceptions.
\end{memberdesc}

\begin{memberdesc}{eof}
Token used to determine end of file.  This will be set to the empty
string (\code{''}) in non-\POSIX{} mode, and to \code{None} in
\POSIX{} mode.
\versionadded{2.3}
\end{memberdesc}

\subsection{Parsing Rules\label{shlex-parsing-rules}}

When operating in non-\POSIX{} mode, \class{shlex} will try to obey
the following rules.

\begin{itemize}
\item Quote characters are not recognized within words
      (\code{Do"Not"Separate} is parsed as the single word
      \code{Do"Not"Separate});
\item Escape characters are not recognized;
\item Enclosing characters in quotes preserves the literal value of
      all characters within the quotes;
\item Closing quotes separate words (\code{"Do"Separate} is parsed
      as \code{"Do"} and \code{Separate});
\item If \member{whitespace_split} is \code{False}, any character not
      declared to be a word character, whitespace, or a quote will be
      returned as a single-character token.  If it is \code{True},
      \class{shlex} will only split words on whitespace;
\item EOF is signaled with an empty string (\code{''});
\item It's not possible to parse empty strings, even if quoted.
\end{itemize}
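
Some of the non-\POSIX{} rules above can be checked with a short
sketch.  The helper \code{compat_tokens()} is defined here only for the
example and is not part of the module:

```python
import shlex

def compat_tokens(text):
    """Tokenize text with a shlex instance in compatibility mode."""
    lexer = shlex.shlex(text)
    return list(iter(lexer.get_token, lexer.eof))

inside = compat_tokens('Do"Not"Separate')   # quotes inside a word
closing = compat_tokens('"Do"Separate')     # closing quote ends the word
punct = compat_tokens('a+b')                # '+' is a one-character token

print(inside)
print(closing)
print(punct)
```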

When operating in \POSIX{} mode, \class{shlex} will try to obey the
following parsing rules.

\begin{itemize}
\item Quotes are stripped out, and do not separate words
      (\code{"Do"Not"Separate"} is parsed as the single word
      \code{DoNotSeparate});
\item Non-quoted escape characters (e.g. \character{\textbackslash})
      preserve the literal value of the next character that follows;
\item Enclosing characters in quotes which are not part of
      \member{escapedquotes} (e.g. \character{'}) preserves the literal
      value of all characters within the quotes;
\item Enclosing characters in quotes which are part of
      \member{escapedquotes} (e.g. \character{"}) preserves the literal
      value of all characters within the quotes, with the exception of
      the characters mentioned in \member{escape}.  The escape characters
      retain their special meaning only when followed by the quote in use,
      or the escape character itself.  Otherwise the escape character
      will be considered a normal character;
\item EOF is signaled with a \constant{None} value;
\item Quoted empty strings (\code{''}) are allowed.
\end{itemize}
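
The \POSIX{} rules can be exercised in the same way.  Again, the helper
\code{posix_tokens()} exists only for this sketch:

```python
import shlex

def posix_tokens(text):
    """Tokenize text with a shlex instance in POSIX mode."""
    lexer = shlex.shlex(text, posix=True)
    return list(iter(lexer.get_token, lexer.eof))

joined = posix_tokens('"Do"Not"Separate"')  # quotes stripped, one word
escaped = posix_tokens(r'one\ word')        # backslash protects the space
empty = posix_tokens("a '' b")              # quoted empty string is a token

print(joined)
print(escaped)
print(empty)
```

Because end-of-file is signaled with \code{None} in this mode, the
empty-string token in the last example does not end the loop.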