\section{\module{shlex} ---
         Simple lexical analysis}

\declaremodule{standard}{shlex}
\modulesynopsis{Simple lexical analysis for \UNIX\ shell-like languages.}
\moduleauthor{Eric S. Raymond}{esr@snark.thyrsus.com}
\moduleauthor{Gustavo Niemeyer}{niemeyer@conectiva.com}
\sectionauthor{Eric S. Raymond}{esr@snark.thyrsus.com}
\sectionauthor{Gustavo Niemeyer}{niemeyer@conectiva.com}

\versionadded{1.5.2}

The \class{shlex} class makes it easy to write lexical analyzers for
simple syntaxes resembling that of the \UNIX{} shell. This will often
be useful for writing minilanguages (e.g.\ in run control files for
Python applications) or for parsing quoted strings.

\begin{seealso}
  \seemodule{ConfigParser}{Parser for configuration files similar to the
                           Windows \file{.ini} files.}
\end{seealso}


\subsection{Module Contents}

The \module{shlex} module defines the following function:

\begin{funcdesc}{split}{s\optional{, posix=\code{True}\optional{,
                        spaces=\code{True}}}}
Split the string \var{s} using shell-like syntax. If \var{posix} is
\code{True}, operate in POSIX mode. If \var{spaces} is \code{True},
words are split only on whitespace (this sets the
\member{whitespace_split} member of the \class{shlex} instance).
\versionadded{2.3}
\end{funcdesc}
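As a minimal sketch of typical use, the following splits a command line
that contains a quoted file name. The helper name \code{split_command}
and the sample command are illustrative assumptions. Only the string
argument is passed, so the example relies on the default arguments as
found in a current Python 3 interpreter.

```python
import shlex

def split_command(line):
    """Split a command line into arguments using shell-like quoting rules."""
    # Only the string is passed; the defaults select POSIX-style splitting.
    return shlex.split(line)

# The quoted file name stays together as a single argument.
tokens = split_command('cp "my file.txt" /tmp/backup')
print(tokens)
```

The quotes group the words but do not appear in the resulting tokens.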

The \module{shlex} module defines the following class:

\begin{classdesc}{shlex}{\optional{instream=\code{sys.stdin}\optional{,
                         infile=\code{None}\optional{,
                         posix=\code{False}}}}}
A \class{shlex} instance or subclass instance is a lexical analyzer
object. The first argument, \var{instream}, if present, specifies where
to read characters from. It must be a file- or stream-like object with
\method{read()} and \method{readline()} methods, or a string (strings
are accepted since Python 2.3). If no argument is given, input will be
taken from \code{sys.stdin}. The second optional argument, \var{infile},
is a filename string, which sets the initial value of the
\member{infile} member. If the \var{instream} argument is omitted or
equal to \code{sys.stdin}, this second argument defaults to ``stdin''.
The \var{posix} argument was introduced in Python 2.3 and selects the
operational mode. When \var{posix} is not true (the default), the
\class{shlex} instance operates in compatibility mode. When operating
in POSIX mode, \class{shlex} tries to follow the POSIX shell parsing
rules as closely as possible. See section~\ref{shlex-objects}.
\end{classdesc}
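The difference between the two modes can be illustrated with a short
sketch that reads the same string through two lexers. The helper
\code{tokenize} and the sample input are assumptions made for this
example; it was written against a current Python 3 interpreter, where
\var{posix} can be passed as a keyword argument.

```python
import shlex

def tokenize(text, posix):
    """Collect every token a shlex instance produces for a string."""
    lexer = shlex.shlex(text, posix=posix)
    tokens = []
    while True:
        token = lexer.get_token()
        if token == lexer.eof:  # "" in compatibility mode, None in POSIX mode
            break
        tokens.append(token)
    return tokens

# Compatibility mode keeps the quote characters in the token.
compat_tokens = tokenize('name = "John Smith"', posix=False)
# POSIX mode strips them, as a shell would.
posix_tokens = tokenize('name = "John Smith"', posix=True)
print(compat_tokens, posix_tokens)
```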

\subsection{shlex Objects \label{shlex-objects}}

A \class{shlex} instance has the following methods:

\begin{methoddesc}{get_token}{}
Return a token. If tokens have been stacked using
\method{push_token()}, pop a token off the stack. Otherwise, read one
from the input stream. If reading encounters an immediate
end-of-file, \member{self.eof} is returned (the empty string
(\code{""}) in non-POSIX mode, and \code{None} in POSIX mode).
\end{methoddesc}

\begin{methoddesc}{push_token}{str}
Push the argument onto the token stack.
\end{methoddesc}
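One common use of \method{push_token()} is one-token lookahead. The
sketch below is an illustration only: \code{peek_token} is a
hypothetical helper, not part of the module, and the input string is
made up.

```python
import shlex

def peek_token(lexer):
    """Return the next token without consuming it (hypothetical helper)."""
    token = lexer.get_token()
    if token != lexer.eof:
        # Put the token back so the next get_token() returns it again.
        lexer.push_token(token)
    return token

lexer = shlex.shlex("alpha beta", posix=True)
peeked = peek_token(lexer)   # looks at "alpha" but leaves it in place
first = lexer.get_token()    # returns the pushed-back "alpha"
second = lexer.get_token()   # continues with the input stream
end = lexer.get_token()      # end of input: lexer.eof
```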

\begin{methoddesc}{read_token}{}
Read a raw token. Ignore the pushback stack, and do not interpret source
requests. (This is not ordinarily a useful entry point, and is
documented here only for the sake of completeness.)
\end{methoddesc}

\begin{methoddesc}{sourcehook}{filename}
When \class{shlex} detects a source request (see
\member{source} below), this method is given the following token as
its argument, and is expected to return a tuple consisting of a
filename and an open file-like object.

Normally, this method first strips any quotes off the argument. If
the result is an absolute pathname, or there was no previous source
request in effect, or the previous source was a stream
(e.g.\ \code{sys.stdin}), the result is left alone. Otherwise, if the
result is a relative pathname, the directory part of the name of the
file immediately before it on the source inclusion stack is prepended
(this behavior is like the way the C preprocessor handles
\code{\#include "file.h"}).

The result of the manipulations is treated as a filename, and returned
as the first component of the tuple, with
\function{open()} called on it to yield the second component. (Note:
this is the reverse of the order of arguments in instance initialization!)

This hook is exposed so that you can use it to implement directory
search paths, addition of file extensions, and other namespace hacks.
There is no corresponding `close' hook, but a \class{shlex} instance
will call the \method{close()} method of the sourced input stream when
it returns \EOF.

For more explicit control of source stacking, use the
\method{push_source()} and \method{pop_source()} methods.
\end{methoddesc}
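As a rough sketch of the kind of namespace hack this hook allows, the
subclass below resolves source requests against an in-memory dictionary
instead of the file system. The class name \code{VirtualIncludeLexer},
the \code{virtual_files} mapping, and the choice of \code{"source"} as
the inclusion keyword are all assumptions made for this example, which
targets a current Python 3 interpreter.

```python
import io
import shlex

class VirtualIncludeLexer(shlex.shlex):
    """Lexer whose source requests are served from a dict (illustrative)."""

    def __init__(self, instream, virtual_files):
        super().__init__(instream, posix=True)
        self.whitespace_split = True   # keep names like inc.cfg in one token
        self.source = "source"         # keyword that triggers an inclusion
        self.virtual_files = virtual_files

    def sourcehook(self, newfile):
        # Return (filename, open file-like object), as the hook requires.
        if newfile in self.virtual_files:
            return newfile, io.StringIO(self.virtual_files[newfile])
        # Fall back to the default behavior for unknown names.
        return super().sourcehook(newfile)

def all_tokens(lexer):
    """Read tokens until the lexer reports end of input."""
    tokens = []
    while True:
        token = lexer.get_token()
        if token == lexer.eof:
            break
        tokens.append(token)
    return tokens

lexer = VirtualIncludeLexer("a source inc.cfg b", {"inc.cfg": "x y\n"})
included = all_tokens(lexer)
print(included)
```

The tokens of the included text appear in place of the source request,
and lexing then resumes with the original input.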

\begin{methoddesc}{push_source}{stream\optional{, filename}}
Push an input source stream onto the input stack. If the \var{filename}
argument is specified, it will later be available for use in error
messages. This is the same method used internally by the
\method{sourcehook()} method.
\versionadded{2.1}
\end{methoddesc}

\begin{methoddesc}{pop_source}{}
Pop the last-pushed input source from the input stack.
This is the same method used internally when the lexer reaches
\EOF{} on a stacked input stream.
\versionadded{2.1}
\end{methoddesc}
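The following sketch shows explicit source stacking: a second stream is
pushed after the first token has been read, and the lexer returns to
the original input once the pushed stream is exhausted. The stream
contents and the name \code{inner.cfg} are made up for illustration.

```python
import io
import shlex

lexer = shlex.shlex("outer1 outer2", posix=True)
tokens = [lexer.get_token()]  # read "outer1" from the original input

# Splice another stream in front of the remaining input.
lexer.push_source(io.StringIO("inner1 inner2"), "inner.cfg")
name_while_stacked = lexer.infile  # the filename given to push_source()

while True:
    token = lexer.get_token()
    if token == lexer.eof:
        break
    tokens.append(token)

print(tokens)
```

No explicit \method{pop_source()} call is needed here, because the
lexer pops the stacked stream itself when that stream reaches \EOF.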

\begin{methoddesc}{error_leader}{\optional{file\optional{, line}}}
This method generates an error message leader in the format of a
\UNIX{} C compiler error label; the format is \code{'"\%s", line \%d: '},
where the \samp{\%s} is replaced with the name of the current source
file and the \samp{\%d} with the current input line number (the
optional arguments can be used to override these).

This convenience is provided to encourage \module{shlex} users to
generate error messages in the standard, parseable format understood
by Emacs and other \UNIX{} tools.
\end{methoddesc}
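A brief sketch of both calling styles follows. The file names
\code{app.rc} and \code{other.rc} and the message text are illustrative
assumptions.

```python
import shlex

lexer = shlex.shlex("key value\n", infile="app.rc", posix=True)

# Defaults: the lexer's own infile and lineno (nothing has been read yet).
default_leader = lexer.error_leader()
# Explicit overrides for both the file name and the line number.
custom_leader = lexer.error_leader("other.rc", 12)

message = custom_leader + "unexpected token"
print(message)
```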

Instances of \class{shlex} and its subclasses have some public instance
variables which either control lexical analysis or can be used for
debugging:

\begin{memberdesc}{commenters}
The string of characters that are recognized as comment beginners.
All characters from the comment beginner to end of line are ignored.
Includes just \character{\#} by default.
\end{memberdesc}
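As an illustration, the sketch below lexes one input with the default
comment character and another with \character{;} as the comment
beginner. The helper name \code{tokens_with_commenters} and the inputs
are assumptions made for this example.

```python
import shlex

def tokens_with_commenters(text, commenters):
    """Lex text with the given set of comment-starting characters."""
    lexer = shlex.shlex(text, posix=True)
    lexer.commenters = commenters
    # iter() with a sentinel stops when get_token() returns lexer.eof.
    return list(iter(lexer.get_token, lexer.eof))

hash_comments = tokens_with_commenters("a b # note\nc", "#")
semicolon_comments = tokens_with_commenters("a ; note\nb", ";")
print(hash_comments, semicolon_comments)
```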

\begin{memberdesc}{wordchars}
The string of characters that will accumulate into multi-character
tokens. By default, includes all \ASCII{} alphanumerics and
underscore.
\end{memberdesc}
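Extending \member{wordchars} is a simple way to keep names containing
punctuation together. In this sketch the helper name and the sample
file name are illustrative assumptions.

```python
import shlex

def tokens_with_extra_wordchars(text, extra=""):
    """Lex text in compatibility mode after extending wordchars."""
    lexer = shlex.shlex(text)
    lexer.wordchars += extra
    return list(iter(lexer.get_token, lexer.eof))

# With the defaults, "-" and "." are returned as separate tokens.
default_tokens = tokens_with_extra_wordchars("backup-2024.tar")
# Once they count as word characters, the name is a single token.
joined_tokens = tokens_with_extra_wordchars("backup-2024.tar", extra="-.")
print(default_tokens, joined_tokens)
```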

\begin{memberdesc}{whitespace}
Characters that will be considered whitespace and skipped. Whitespace
bounds tokens. By default, includes space, tab, linefeed and
carriage-return.
\end{memberdesc}

\begin{memberdesc}{escape}
Characters that will be considered escape characters. This is used only
in POSIX mode, and includes just \character{\textbackslash} by default.
\versionadded{2.3}
\end{memberdesc}

\begin{memberdesc}{quotes}
Characters that will be considered string quotes. The token
accumulates until the same quote is encountered again (thus, different
quote types protect each other as in the shell). By default, includes
\ASCII{} single and double quotes.
\end{memberdesc}

\begin{memberdesc}{escapedquotes}
Characters in \member{quotes} that will interpret escape characters
defined in \member{escape}. This is only used in POSIX mode, and includes
just \character{"} by default.
\versionadded{2.3}
\end{memberdesc}
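The interplay of \member{escape} and \member{escapedquotes} can be
sketched with \function{split()}, which operates in POSIX mode by
default. The sample strings are made up, and the behavior shown is that
of a current Python 3 interpreter.

```python
import shlex

# Inside double quotes (listed in escapedquotes), a backslash escapes the quote.
double_quoted = shlex.split(r'"say \"hi\""')

# Inside single quotes the backslash is an ordinary character.
single_quoted = shlex.split(r"'say \"hi\"'")

# Before any other character inside double quotes, the backslash is kept.
kept_backslash = shlex.split(r'"C:\temp"')

print(double_quoted, single_quoted, kept_backslash)
```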

\begin{memberdesc}{whitespace_split}
If true, tokens will only be split on whitespace. This is useful, for
example, for parsing command lines with \class{shlex}, getting tokens
in a similar way to shell arguments.
\versionadded{2.3}
\end{memberdesc}
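The effect is easiest to see by lexing the same command line twice. The
helper name \code{lex} and the sample command are assumptions made for
this sketch.

```python
import shlex

def lex(text, whitespace_split):
    """Lex text in compatibility mode with the given whitespace_split value."""
    lexer = shlex.shlex(text)
    lexer.whitespace_split = whitespace_split
    return list(iter(lexer.get_token, lexer.eof))

# Punctuation such as "-" and "/" becomes single-character tokens.
punctuated = lex("ls -l /tmp", whitespace_split=False)
# Splitting on whitespace only yields shell-style arguments.
arguments = lex("ls -l /tmp", whitespace_split=True)
print(punctuated, arguments)
```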

\begin{memberdesc}{infile}
The name of the current input file, as initially set at class
instantiation time or stacked by later source requests. It may
be useful to examine this when constructing error messages.
\end{memberdesc}

\begin{memberdesc}{instream}
The input stream from which this \class{shlex} instance is reading
characters.
\end{memberdesc}

\begin{memberdesc}{source}
This member is \code{None} by default. If you assign a string to it,
that string will be recognized as a lexical-level inclusion request
similar to the \samp{source} keyword in various shells. That is, the
immediately following token will be opened as a filename and input taken
from that stream until \EOF, at which point the \method{close()}
method of that stream will be called and the input source will again
become the original input stream. Source requests may be stacked any
number of levels deep.
\end{memberdesc}

\begin{memberdesc}{debug}
If this member is numeric and \code{1} or more, a \class{shlex}
instance will print verbose progress output on its behavior. If you
need to use this, you can read the module source code to learn the
details.
\end{memberdesc}

\begin{memberdesc}{lineno}
Source line number (count of newlines seen so far plus one).
\end{memberdesc}

\begin{memberdesc}{token}
The token buffer. It may be useful to examine this when catching
exceptions.
\end{memberdesc}

\begin{memberdesc}{eof}
Token used to determine end of file. This will be set to the empty
string (\code{""}) in non-POSIX mode, and to \code{None} in POSIX
mode.
\versionadded{2.3}
\end{memberdesc}
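Because the end-of-file marker differs between the modes, loops are
best written against \member{eof} rather than a hard-coded value. The
sketch below, with the hypothetical helper \code{read_all}, shows the
two sentinel values and a loop that works in either mode.

```python
import shlex

def read_all(lexer):
    """Read tokens until the lexer returns its own end-of-file marker."""
    tokens = []
    token = lexer.get_token()
    while token != lexer.eof:
        tokens.append(token)
        token = lexer.get_token()
    return tokens

# On empty input the very first token is already the end-of-file marker.
compat_end = shlex.shlex("").get_token()
posix_end = shlex.shlex("", posix=True).get_token()

compat_words = read_all(shlex.shlex("one two"))
posix_words = read_all(shlex.shlex("one two", posix=True))
```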

\subsection{Parsing Rules\label{shlex-parsing-rules}}

When operating in non-POSIX mode, \class{shlex} will try to obey the
following rules.

\begin{itemize}
\item Quote characters are not recognized within words
      (\code{Do"Not"Separate} is parsed as the single word
      \code{Do"Not"Separate});
\item Escape characters are not recognized;
\item Enclosing characters in quotes preserves the literal value of
      all characters within the quotes;
\item Closing quotes separate words (\code{"Do"Separate} is parsed
      as \code{"Do"} and \code{Separate});
\item If \member{whitespace_split} is \code{False}, any character not
      declared to be a word character, whitespace, or a quote will be
      returned as a single-character token. If it is \code{True},
      \class{shlex} will only split words on whitespace;
\item EOF is signaled with an empty string (\code{""});
\item It's not possible to parse empty strings, even if quoted.
\end{itemize}
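Several of the non-POSIX rules above can be checked with a short
sketch. The helper name \code{compat_tokens} is an assumption made for
this example.

```python
import shlex

def compat_tokens(text):
    """Lex text in compatibility mode (posix defaults to False)."""
    lexer = shlex.shlex(text)
    return list(iter(lexer.get_token, lexer.eof))

# Quotes inside a word are ordinary characters of that word.
inner_quotes = compat_tokens('Do"Not"Separate')
# A closing quote ends the token, and the quotes are kept.
closing_quote = compat_tokens('"Do"Separate')
# Other punctuation is returned as single-character tokens.
punctuation = compat_tokens('a=b')
print(inner_quotes, closing_quote, punctuation)
```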

When operating in POSIX mode, \class{shlex} will try to obey the
following parsing rules.

\begin{itemize}
\item Quotes are stripped out, and do not separate words
      (\code{"Do"Not"Separate"} is parsed as the single word
      \code{DoNotSeparate});
\item Non-quoted escape characters (e.g.\ \character{\textbackslash})
      preserve the literal value of the next character that follows;
\item Enclosing characters in quotes which are not part of
      \member{escapedquotes} (e.g.\ \character{'}) preserves the literal
      value of all characters within the quotes;
\item Enclosing characters in quotes which are part of
      \member{escapedquotes} (e.g.\ \character{"}) preserves the literal
      value of all characters within the quotes, with the exception of
      the characters mentioned in \member{escape}. The escape characters
      retain their special meaning only when followed by the quote in use,
      or the escape character itself. Otherwise the escape character
      will be considered a normal character;
\item EOF is signaled with a \code{None} value;
\item Quoted empty strings (\code{""}) are allowed.
\end{itemize}
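The POSIX-mode rules above can be sketched in the same way, here using
\function{split()} with its default arguments. The sample strings are
illustrative.

```python
import shlex

# Quotes are stripped and do not separate words.
joined = shlex.split('"Do"Not"Separate"')
# An unquoted backslash preserves the following character, here a space.
escaped_space = shlex.split(r'one\ word two')
# A quoted empty string is a token in its own right.
empty_string = shlex.split('x "" y')
print(joined, escaped_space, empty_string)
```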