| .\" $OpenBSD: re_format.7,v 1.14 2007/05/31 19:19:30 jmc Exp $ |
| .\" |
| .\" Copyright (c) 1997, Phillip F Knaack. All rights reserved. |
| .\" |
| .\" Copyright (c) 1992, 1993, 1994 Henry Spencer. |
| .\" Copyright (c) 1992, 1993, 1994 |
| .\" The Regents of the University of California. All rights reserved. |
| .\" |
| .\" This code is derived from software contributed to Berkeley by |
| .\" Henry Spencer. |
| .\" |
| .\" Redistribution and use in source and binary forms, with or without |
| .\" modification, are permitted provided that the following conditions |
| .\" are met: |
| .\" 1. Redistributions of source code must retain the above copyright |
| .\" notice, this list of conditions and the following disclaimer. |
| .\" 2. Redistributions in binary form must reproduce the above copyright |
| .\" notice, this list of conditions and the following disclaimer in the |
| .\" documentation and/or other materials provided with the distribution. |
| .\" 3. Neither the name of the University nor the names of its contributors |
| .\" may be used to endorse or promote products derived from this software |
| .\" without specific prior written permission. |
| .\" |
| .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND |
| .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE |
| .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE |
| .\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE |
| .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL |
| .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS |
| .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) |
| .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT |
| .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY |
| .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF |
| .\" SUCH DAMAGE. |
| .\" |
| .\" @(#)re_format.7 8.3 (Berkeley) 3/20/94 |
| .\" |
| .Dd $Mdocdate: May 31 2007 $ |
| .Dt RE_FORMAT 7 |
| .Os |
| .Sh NAME |
| .Nm re_format |
| .Nd POSIX regular expressions |
| .Sh DESCRIPTION |
| Regular expressions (REs), |
| as defined in |
| .St -p1003.1-2004 , |
| come in two forms: |
| basic regular expressions |
| (BREs) |
| and extended regular expressions |
| (EREs). |
| Both forms of regular expressions are supported |
| by the interfaces described in |
| .Xr regex 3 . |
| Applications dealing with regular expressions |
| may use one or the other form |
| (or indeed both). |
| For example, |
| .Xr ed 1 |
| uses BREs, |
| whilst |
| .Xr egrep 1 |
| uses EREs. |
| Consult the manual page for the specific application to find out which |
| it uses. |
| .Pp |
| POSIX leaves some aspects of RE syntax and semantics open; |
| .Sq ** |
| marks decisions on these aspects that |
| may not be fully portable to other POSIX implementations. |
| .Pp |
| This manual page first describes regular expressions in general, |
| specifically extended regular expressions, |
| and then discusses differences between them and basic regular expressions. |
| .Sh EXTENDED REGULAR EXPRESSIONS |
| An ERE is one** or more non-empty** |
| .Em branches , |
| separated by |
| .Sq \*(Ba . |
| It matches anything that matches one of the branches. |
| .Pp |
| A branch is one** or more |
| .Em pieces , |
| concatenated. |
| It matches a match for the first piece, followed by a match for the second, and so on. |
| .Pp |
| A piece is an |
| .Em atom |
| possibly followed by a single** |
| .Sq * , |
| .Sq + , |
| .Sq ?\& , |
| or |
| .Em bound . |
| An atom followed by |
| .Sq * |
| matches a sequence of 0 or more matches of the atom. |
| An atom followed by |
| .Sq + |
| matches a sequence of 1 or more matches of the atom. |
| An atom followed by |
| .Sq ?\& |
| matches a sequence of 0 or 1 matches of the atom. |
| .Pp |
| A bound is |
| .Sq { |
| followed by an unsigned decimal integer, |
| possibly followed by |
| .Sq ,\& |
| possibly followed by another unsigned decimal integer, |
| always followed by |
| .Sq } . |
| The integers must lie between 0 and |
| .Dv RE_DUP_MAX |
| (255**) inclusive, |
| and if there are two of them, the first may not exceed the second. |
| An atom followed by a bound containing one integer |
| .Ar i |
| and no comma matches |
| a sequence of exactly |
| .Ar i |
| matches of the atom. |
| An atom followed by a bound |
| containing one integer |
| .Ar i |
| and a comma matches |
| a sequence of |
| .Ar i |
| or more matches of the atom. |
| An atom followed by a bound |
| containing two integers |
| .Ar i |
| and |
| .Ar j |
| matches a sequence of |
| .Ar i |
| through |
| .Ar j |
| (inclusive) matches of the atom. |
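| .Pp |
| As an illustrative sketch only, the three bound forms can be exercised from C through |
| .Xr regex 3 . |
| The helper name |
| .Fn ere_match_len |
| is hypothetical and not part of any library; |
| it assumes only the standard |
| .Fn regcomp , |
| .Fn regexec , |
| and |
| .Fn regfree |
| interfaces. |

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the length in bytes of the first
 * (leftmost) match of the ERE `pattern` in `text`, or -1 if there is
 * no match or the pattern fails to compile.
 */
int
ere_match_len(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t m[1];
	int len = -1;

	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;
	if (regexec(&re, text, 1, m, 0) == 0)
		len = (int)(m[0].rm_eo - m[0].rm_so);
	regfree(&re);
	return len;
}
```

| Under these assumptions, matching against |
| .Sq aaaaa |
| should give a length of 3 for |
| .Sq a{3} , |
| 5 for |
| .Sq a{2,} , |
| and 4 for |
| .Sq a{2,4} . |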
| .Pp |
| An atom is a regular expression enclosed in |
| .Sq () |
| (matching a part of the regular expression), |
| an empty set of |
| .Sq () |
| (matching the null string)**, |
| a |
| .Em bracket expression |
| (see below), |
| .Sq .\& |
| (matching any single character), |
| .Sq ^ |
| (matching the null string at the beginning of a line), |
| .Sq $ |
| (matching the null string at the end of a line), |
| a |
| .Sq \e |
| followed by one of the characters |
| .Sq ^.[$()|*+?{\e |
| (matching that character taken as an ordinary character), |
| a |
| .Sq \e |
| followed by any other character** |
| (matching that character taken as an ordinary character, |
| as if the |
| .Sq \e |
| had not been present**), |
| or a single character with no other significance (matching that character). |
| A |
| .Sq { |
| followed by a character other than a digit is an ordinary character, |
| not the beginning of a bound**. |
| It is illegal to end an RE with |
| .Sq \e . |
| .Pp |
| A bracket expression is a list of characters enclosed in |
| .Sq [] . |
| It normally matches any single character from the list (but see below). |
| If the list begins with |
| .Sq ^ , |
| it matches any single character |
| .Em not |
| from the rest of the list |
| (but see below). |
| If two characters in the list are separated by |
| .Sq - , |
| this is shorthand for the full |
| .Em range |
| of characters between those two (inclusive) in the |
| collating sequence, e.g.\& |
| .Sq [0-9] |
| in ASCII matches any decimal digit. |
| It is illegal** for two ranges to share an endpoint, e.g.\& |
| .Sq a-c-e . |
| Ranges are very collating-sequence-dependent, |
| and portable programs should avoid relying on them. |
| .Pp |
| To include a literal |
| .Sq ]\& |
| in the list, make it the first character |
| (following a possible |
| .Sq ^ ) . |
| To include a literal |
| .Sq - , |
| make it the first or last character, |
| or the second endpoint of a range. |
| To use a literal |
| .Sq - |
| as the first endpoint of a range, |
| enclose it in |
| .Sq [. |
| and |
| .Sq .] |
| to make it a collating element (see below). |
| With the exception of these and some combinations using |
| .Sq [ |
| (see next paragraphs), |
| all other special characters, including |
| .Sq \e , |
| lose their special significance within a bracket expression. |
| .Pp |
| Within a bracket expression, a collating element |
| (a character, |
| a multi-character sequence that collates as if it were a single character, |
| or a collating-sequence name for either) |
| enclosed in |
| .Sq [. |
| and |
| .Sq .] |
| stands for the sequence of characters of that collating element. |
| The sequence is a single element of the bracket expression's list. |
| A bracket expression containing a multi-character collating element |
| can thus match more than one character, |
| e.g. if the collating sequence includes a |
| .Sq ch |
| collating element, |
| then the RE |
| .Sq [[.ch.]]*c |
| matches the first five characters of |
| .Sq chchcc . |
| .Pp |
| Within a bracket expression, a collating element enclosed in |
| .Sq [= |
| and |
| .Sq =] |
| is an equivalence class, standing for the sequences of characters |
| of all collating elements equivalent to that one, including itself. |
| (If there are no other equivalent collating elements, |
| the treatment is as if the enclosing delimiters were |
| .Sq [. |
| and |
| .Sq .] . ) |
| For example, if |
| .Sq x |
| and |
| .Sq y |
| are the members of an equivalence class, |
| then |
| .Sq [[=x=]] , |
| .Sq [[=y=]] , |
| and |
| .Sq [xy] |
| are all synonymous. |
| An equivalence class may not** be an endpoint of a range. |
| .Pp |
| Within a bracket expression, the name of a |
| .Em character class |
| enclosed |
| in |
| .Sq [: |
| and |
| .Sq :] |
| stands for the list of all characters belonging to that class. |
| Standard character class names are: |
| .Bd -literal -offset indent |
| alnum digit punct |
| alpha graph space |
| blank lower upper |
| cntrl print xdigit |
| .Ed |
| .Pp |
| These stand for the character classes defined in |
| .Xr ctype 3 . |
| A locale may provide others. |
| A character class may not be used as an endpoint of a range. |
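| .Pp |
| The following is a minimal sketch of character classes inside bracket expressions, |
| assuming the |
| .Xr regex 3 |
| interfaces and the default C locale. |
| The function name |
| .Fn is_identifier |
| and the pattern it uses are illustrative choices, not anything defined by this page. |

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if `text` looks like a C-style
 * identifier (a letter or underscore followed by letters, digits or
 * underscores), 0 if it does not, and -1 if the pattern cannot be
 * compiled.
 */
int
is_identifier(const char *text)
{
	regex_t re;
	int rc;

	if (regcomp(&re, "^[[:alpha:]_][[:alnum:]_]*$",
	    REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

| Note that each class name appears inside an enclosing bracket expression, |
| alongside the ordinary character |
| .Sq _ . |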
| .Pp |
| There are two special cases** of bracket expressions: |
| the bracket expressions |
| .Sq [[:<:]] |
| and |
| .Sq [[:>:]] |
| match the null string at the beginning and end of a word, respectively. |
| A word is defined as a sequence of |
| characters starting and ending with a word character |
| which is neither preceded nor followed by |
| word characters. |
| A word character is an |
| .Em alnum |
| character (as defined by |
| .Xr ctype 3 ) |
| or an underscore. |
| This is an extension, |
| compatible with but not specified by POSIX, |
| and should be used with |
| caution in software intended to be portable to other systems. |
| .Pp |
| In the event that an RE could match more than one substring of a given |
| string, |
| the RE matches the one starting earliest in the string. |
| If the RE could match more than one substring starting at that point, |
| it matches the longest. |
| Subexpressions also match the longest possible substrings, subject to |
| the constraint that the whole match be as long as possible, |
| with subexpressions starting earlier in the RE taking priority over |
| ones starting later. |
| Note that higher-level subexpressions thus take priority over |
| their lower-level component subexpressions. |
| .Pp |
| Match lengths are measured in characters, not collating elements. |
| A null string is considered longer than no match at all. |
| For example, |
| .Sq bb* |
| matches the three middle characters of |
| .Sq abbbc ; |
| .Sq (wee|week)(knights|nights) |
| matches all ten characters of |
| .Sq weeknights ; |
| when |
| .Sq (.*).* |
| is matched against |
| .Sq abc , |
| the parenthesized subexpression matches all three characters; |
| and when |
| .Sq (a*)* |
| is matched against |
| .Sq bc , |
| both the whole RE and the parenthesized subexpression match the null string. |
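| .Pp |
| The leftmost-longest rule for the whole match can be observed with a small C sketch built on |
| .Xr regex 3 . |
| The names |
| .Vt struct span |
| and |
| .Fn ere_span |
| are hypothetical and introduced only for this illustration. |

```c
#include <regex.h>
#include <stddef.h>

/* Byte offsets of a match; start is -1 when nothing matched. */
struct span {
	int start;
	int end;
};

/*
 * Hypothetical helper: report where the ERE `pattern` matches in
 * `text`.  The offsets come straight from regexec(), which selects
 * the match starting earliest and, among those, the longest.
 */
struct span
ere_span(const char *pattern, const char *text)
{
	struct span s = { -1, -1 };
	regex_t re;
	regmatch_t m[1];

	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return s;
	if (regexec(&re, text, 1, m, 0) == 0) {
		s.start = (int)m[0].rm_so;
		s.end = (int)m[0].rm_eo;
	}
	regfree(&re);
	return s;
}
```

| With this helper, |
| .Sq bb* |
| against |
| .Sq abbbc |
| should report offsets 1 to 4, and |
| .Sq (wee|week)(knights|nights) |
| against |
| .Sq weeknights |
| should report offsets 0 to 10. |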
| .Pp |
| If case-independent matching is specified, |
| the effect is much as if all case distinctions had vanished from the |
| alphabet. |
| When an alphabetic character that exists in multiple cases appears as an |
| ordinary character outside a bracket expression, it is effectively |
| transformed into a bracket expression containing both cases, |
| e.g.\& |
| .Sq x |
| becomes |
| .Sq [xX] . |
| When it appears inside a bracket expression, |
| all case counterparts of it are added to the bracket expression, |
| so that, for example, |
| .Sq [x] |
| becomes |
| .Sq [xX] |
| and |
| .Sq [^x] |
| becomes |
| .Sq [^xX] . |
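| .Pp |
| As a hedged sketch, case-independent matching can be requested from C with the |
| .Dv REG_ICASE |
| flag of |
| .Xr regex 3 . |
| The helper name |
| .Fn icase_match |
| is hypothetical. |

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if the ERE `pattern` matches somewhere
 * in `text` when case distinctions are ignored, 0 if it does not, and
 * -1 if the pattern cannot be compiled.
 */
int
icase_match(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	if (regcomp(&re, pattern,
	    REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

| Because |
| .Sq [^x] |
| behaves like |
| .Sq [^xX] |
| under this flag, the anchored pattern |
| .Sq ^[^x]$ |
| is expected to reject both |
| .Sq x |
| and |
| .Sq X . |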
| .Pp |
| No particular limit is imposed on the length of REs**. |
| Programs intended to be portable should not employ REs longer |
| than 256 bytes, |
| as an implementation can refuse to accept such REs and remain |
| POSIX-compliant. |
| .Pp |
| The following is a summary of extended regular expression constructs: |
| .Bl -tag -width Ds |
| .It Ar c |
| Any character |
| .Ar c |
| not listed below matches itself. |
| .It \e Ns Ar c |
| Any backslash-escaped character |
| .Ar c |
| matches itself. |
| .It \&. |
| Matches any single character that is not a newline |
| .Pq Sq \en . |
| .It Bq Ar char-class |
| Matches any single character in |
| .Ar char-class . |
| To include a |
| .Ql \&] |
| in |
| .Ar char-class , |
| it must be the first character. |
| A range of characters may be specified by separating the end characters |
| of the range with a |
| .Ql - ; |
| e.g.\& |
| .Ar a-z |
| specifies the lower case characters. |
| The following literal expressions can also be used in |
| .Ar char-class |
| to specify sets of characters: |
| .Bd -unfilled -offset indent |
| [:alnum:] [:cntrl:] [:lower:] [:space:] |
| [:alpha:] [:digit:] [:print:] [:upper:] |
| [:blank:] [:graph:] [:punct:] [:xdigit:] |
| .Ed |
| .Pp |
| If |
| .Ql - |
| appears as the first or last character of |
| .Ar char-class , |
| then it matches itself. |
| All other characters in |
| .Ar char-class |
| match themselves. |
| .Pp |
| Patterns in |
| .Ar char-class |
| of the form |
| .Eo [. |
| .Ar col-elm |
| .Ec .]\& |
| or |
| .Eo [= |
| .Ar col-elm |
| .Ec =]\& , |
| where |
| .Ar col-elm |
| is a collating element, are interpreted according to |
| .Xr setlocale 3 |
| .Pq not currently supported . |
| .It Bq ^ Ns Ar char-class |
| Matches any single character, other than newline, not in |
| .Ar char-class . |
| .Ar char-class |
| is defined as above. |
| .It ^ |
| If |
| .Sq ^ |
| is the first character of a regular expression, then it |
| anchors the regular expression to the beginning of a line. |
| Otherwise, it matches itself. |
| .It $ |
| If |
| .Sq $ |
| is the last character of a regular expression, |
| it anchors the regular expression to the end of a line. |
| Otherwise, it matches itself. |
| .It [[:<:]] |
| Anchors the single character regular expression or subexpression |
| immediately following it to the beginning of a word. |
| .It [[:>:]] |
| Anchors the single character regular expression or subexpression |
| immediately preceding it to the end of a word. |
| .It Pq Ar re |
| Defines a subexpression |
| .Ar re . |
| Any set of characters enclosed in parentheses |
| matches whatever the set of characters without parentheses matches |
| (that is a long-winded way of saying the constructs |
| .Sq (re) |
| and |
| .Sq re |
| match identically). |
| .It * |
| Matches the single character regular expression or subexpression |
| immediately preceding it zero or more times. |
| If |
| .Sq * |
| is the first character of a regular expression or subexpression, |
| then it matches itself. |
| The |
| .Sq * |
| operator sometimes yields unexpected results. |
| For example, the regular expression |
| .Ar b* |
| matches the beginning of the string |
| .Qq abbb |
| (as opposed to the substring |
| .Qq bbb ) , |
| since a null match is the only leftmost match. |
| .It + |
| Matches the single character regular expression |
| or subexpression immediately preceding it |
| one or more times. |
| .It ? |
| Matches the single character regular expression |
| or subexpression immediately preceding it |
| 0 or 1 times. |
| .Sm off |
| .It Xo |
| .Pf { Ar n , m No }\ \& |
| .Pf { Ar n , No }\ \& |
| .Pf { Ar n No } |
| .Xc |
| .Sm on |
| Matches the single character regular expression or subexpression |
| immediately preceding it at least |
| .Ar n |
| and at most |
| .Ar m |
| times. |
| If |
| .Ar m |
| is omitted, then it matches at least |
| .Ar n |
| times. |
| If the comma is also omitted, then it matches exactly |
| .Ar n |
| times. |
| .It \*(Ba |
| Used to separate patterns. |
| For example, |
| the pattern |
| .Sq cat\*(Badog |
| matches either |
| .Sq cat |
| or |
| .Sq dog . |
| .El |
| .Sh BASIC REGULAR EXPRESSIONS |
| Basic regular expressions differ in several respects: |
| .Bl -bullet -offset 3n |
| .It |
| .Sq \*(Ba , |
| .Sq + , |
| and |
| .Sq ?\& |
| are ordinary characters and there is no equivalent |
| for their functionality. |
| .It |
| The delimiters for bounds are |
| .Sq \e{ |
| and |
| .Sq \e} , |
| with |
| .Sq { |
| and |
| .Sq } |
| by themselves ordinary characters. |
| .It |
| The parentheses for nested subexpressions are |
| .Sq \e( |
| and |
| .Sq \e) , |
| with |
| .Sq ( |
| and |
| .Sq )\& |
| by themselves ordinary characters. |
| .It |
| .Sq ^ |
| is an ordinary character except at the beginning of the |
| RE or** the beginning of a parenthesized subexpression. |
| .It |
| .Sq $ |
| is an ordinary character except at the end of the |
| RE or** the end of a parenthesized subexpression. |
| .It |
| .Sq * |
| is an ordinary character if it appears at the beginning of the |
| RE or the beginning of a parenthesized subexpression |
| (after a possible leading |
| .Sq ^ ) . |
| .It |
| Finally, there is one new type of atom, a |
| .Em back-reference : |
| .Sq \e |
| followed by a non-zero decimal digit |
| .Ar d |
| matches the same sequence of characters matched by the |
| .Ar d Ns th |
| parenthesized subexpression |
| (numbering subexpressions by the positions of their opening parentheses, |
| left to right), |
| so that, for example, |
| .Sq \e([bc]\e)\e1 |
| matches |
| .Sq bb\& |
| or |
| .Sq cc |
| but not |
| .Sq bc . |
| .El |
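| .Pp |
| The differences listed above can be tried out with a short C sketch. |
| Passing no |
| .Dv REG_EXTENDED |
| flag to |
| .Fn regcomp |
| selects basic regular expressions; |
| the helper name |
| .Fn bre_match |
| is hypothetical. |
| Note that each backslash in the RE must be doubled inside a C string literal. |

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if the BRE `pattern` matches somewhere
 * in `text`, 0 if it does not, and -1 if the pattern cannot be
 * compiled.  A cflags value of 0 requests basic regular expressions.
 */
int
bre_match(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	if (regcomp(&re, pattern, 0) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

| For instance, |
| .Sq \e([bc]\e)\e1 |
| is written as the C string "\e\e([bc]\e\e)\e\e1"; |
| it should match |
| .Sq bb\& |
| and |
| .Sq cc |
| but not |
| .Sq bc . |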
| .Pp |
| The following is a summary of basic regular expression constructs: |
| .Bl -tag -width Ds |
| .It Ar c |
| Any character |
| .Ar c |
| not listed below matches itself. |
| .It \e Ns Ar c |
| Any backslash-escaped character |
| .Ar c , |
| except for |
| .Sq { , |
| .Sq } , |
| .Sq \&( , |
| and |
| .Sq \&) , |
| matches itself. |
| .It \&. |
| Matches any single character that is not a newline |
| .Pq Sq \en . |
| .It Bq Ar char-class |
| Matches any single character in |
| .Ar char-class . |
| To include a |
| .Ql \&] |
| in |
| .Ar char-class , |
| it must be the first character. |
| A range of characters may be specified by separating the end characters |
| of the range with a |
| .Ql - ; |
| e.g.\& |
| .Ar a-z |
| specifies the lower case characters. |
| The following literal expressions can also be used in |
| .Ar char-class |
| to specify sets of characters: |
| .Bd -unfilled -offset indent |
| [:alnum:] [:cntrl:] [:lower:] [:space:] |
| [:alpha:] [:digit:] [:print:] [:upper:] |
| [:blank:] [:graph:] [:punct:] [:xdigit:] |
| .Ed |
| .Pp |
| If |
| .Ql - |
| appears as the first or last character of |
| .Ar char-class , |
| then it matches itself. |
| All other characters in |
| .Ar char-class |
| match themselves. |
| .Pp |
| Patterns in |
| .Ar char-class |
| of the form |
| .Eo [. |
| .Ar col-elm |
| .Ec .]\& |
| or |
| .Eo [= |
| .Ar col-elm |
| .Ec =]\& , |
| where |
| .Ar col-elm |
| is a collating element, are interpreted according to |
| .Xr setlocale 3 |
| .Pq not currently supported . |
| .It Bq ^ Ns Ar char-class |
| Matches any single character, other than newline, not in |
| .Ar char-class . |
| .Ar char-class |
| is defined as above. |
| .It ^ |
| If |
| .Sq ^ |
| is the first character of a regular expression, then it |
| anchors the regular expression to the beginning of a line. |
| Otherwise, it matches itself. |
| .It $ |
| If |
| .Sq $ |
| is the last character of a regular expression, |
| it anchors the regular expression to the end of a line. |
| Otherwise, it matches itself. |
| .It [[:<:]] |
| Anchors the single character regular expression or subexpression |
| immediately following it to the beginning of a word. |
| .It [[:>:]] |
| Anchors the single character regular expression or subexpression |
| immediately preceding it to the end of a word. |
| .It \e( Ns Ar re Ns \e) |
| Defines a subexpression |
| .Ar re . |
| Subexpressions may be nested. |
| A subsequent backreference of the form |
| .Pf \e Ns Ar n , |
| where |
| .Ar n |
| is a number in the range [1,9], expands to the text matched by the |
| .Ar n Ns th |
| subexpression. |
| For example, the regular expression |
| .Ar \e(.*\e)\e1 |
| matches any string consisting of identical adjacent substrings. |
| Subexpressions are ordered relative to their left delimiter. |
| .It * |
| Matches the single character regular expression or subexpression |
| immediately preceding it zero or more times. |
| If |
| .Sq * |
| is the first character of a regular expression or subexpression, |
| then it matches itself. |
| The |
| .Sq * |
| operator sometimes yields unexpected results. |
| For example, the regular expression |
| .Ar b* |
| matches the beginning of the string |
| .Qq abbb |
| (as opposed to the substring |
| .Qq bbb ) , |
| since a null match is the only leftmost match. |
| .Sm off |
| .It Xo |
| .Pf \e{ Ar n , m No \e}\ \& |
| .Pf \e{ Ar n , No \e}\ \& |
| .Pf \e{ Ar n No \e} |
| .Xc |
| .Sm on |
| Matches the single character regular expression or subexpression |
| immediately preceding it at least |
| .Ar n |
| and at most |
| .Ar m |
| times. |
| If |
| .Ar m |
| is omitted, then it matches at least |
| .Ar n |
| times. |
| If the comma is also omitted, then it matches exactly |
| .Ar n |
| times. |
| .El |
| .Sh SEE ALSO |
| .Xr ctype 3 , |
| .Xr regex 3 |
| .Sh STANDARDS |
| .St -p1003.1-2004 : |
| Base Definitions, Chapter 9 (Regular Expressions). |
| .Sh BUGS |
| Having two kinds of REs is a botch. |
| .Pp |
| The current POSIX spec says that |
| .Sq )\& |
| is an ordinary character in the absence of an unmatched |
| .Sq ( ; |
| this was an unintentional result of a wording error, |
| and change is likely. |
| Avoid relying on it. |
| .Pp |
| Back-references are a dreadful botch, |
| posing major problems for efficient implementations. |
| They are also somewhat vaguely defined |
| (does |
| .Sq a\e(\e(b\e)*\e2\e)*d |
| match |
| .Sq abbbd ? ) . |
| Avoid using them. |
| .Pp |
| POSIX's specification of case-independent matching is vague. |
| The |
| .Dq one case implies all cases |
| definition given above |
| is the current consensus among implementors as to the right interpretation. |
| .Pp |
| The syntax for word boundaries is incredibly ugly. |