NAME
regex - POSIX 1003.2 regular expressions
DESCRIPTION
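Regular expressions (``RE''s), as defined in POSIX 1003.2, come in two forms: modern REs (roughly those of egrep; 1003.2 calls these ``extended'' REs) and obsolete REs (roughly those of ed(1); 1003.2 calls these ``basic'' REs). Obsolete REs exist mainly for backward compatibility with some old programs; they are discussed at the end. 1003.2 leaves some aspects of RE syntax and semantics open; `(!)' marks decisions on these aspects that may not be fully portable to other 1003.2 implementations.
A (modern) RE is one(!) or more non-empty(!) branches, separated by `|'. It matches any string that matches one of the branches.
A branch is one(!) or more pieces, concatenated. A string must first match the first piece, then its remainder must match the second piece, and so on.
A piece is an atom possibly followed by a single(!) `*', `+', `?', or bound. An atom followed by `*' matches a sequence of 0 or more matches of the atom. An atom followed by `+' matches a sequence of 1 or more matches of the atom. An atom followed by `?' matches a sequence of 0 or 1 matches of the atom.
A bound is `{' followed by an unsigned decimal integer, possibly followed by `,', possibly followed by another unsigned decimal integer, always followed by `}'. The integers must lie between 0 and RE_DUP_MAX (255(!)) inclusive, and if there are two of them, the first may not exceed the second. An atom followed by a bound containing one integer i and no comma matches a sequence of exactly i matches of the atom. An atom followed by a bound containing one integer i and a comma matches a sequence of i or more matches of the atom. An atom followed by a bound containing two integers i and j matches a sequence of i through j (inclusive) matches of the atom.
An atom is a regular expression enclosed in `()' (matching whatever that regular expression matches), an empty set of `()' (matching the null string), a bracket expression (see below), `.' (matching any single character), `^' (matching the null string at the beginning of a line), `$' (matching the null string at the end of a line), a `\' followed by one of the characters `^.[$()|*+?{\' (matching that character taken as an ordinary character), a `\' followed by any other character(!) (matching that character taken as an ordinary character, as if the `\' had not been present(!)), or a single character with no other significance (matching that character). A `{' followed by a character other than a digit is an ordinary character, not the beginning of a bound(!). It is illegal to end an RE with `\'.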
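A bracket expression is a list of characters enclosed in `[]'. It normally matches any single character from the list (but see below). If the list begins with `^', it matches any single character (but see below) not from the rest of the list. If two characters in the list are separated by `-', this is shorthand for the full range of characters between those two (inclusive) in the collating sequence, e.g. `[0-9]' in ASCII matches any decimal digit. It is illegal(!) for two ranges to share an endpoint, e.g. `a-c-e'. Ranges depend heavily on the collating sequence, and portable programs should avoid relying on them.
To include a literal `]' in the list, make it the first character (following a possible `^'). To include a literal `-', make it the first or last character, or the second endpoint of a range. To use a literal `-' as the first endpoint of a range, enclose it in `[.' and `.]' to make it a collating element (see below). With the exception of these and some combinations using `[' (see next paragraphs), all other special characters, including `\', lose their special significance within a bracket expression.
Within a bracket expression, a collating element (a character, a multi-character sequence that collates as if it were a single character, or a collating-sequence name for either) enclosed in `[.' and `.]' stands for the sequence of characters of that collating element. The sequence is a single element of the bracket expression's list. A bracket expression containing a multi-character collating element can thus match more than one character, e.g. if the collating sequence includes a `ch' collating element, then the RE `[[.ch.]]*c' matches the first five characters of `chchcc'.
Within a bracket expression, a collating element enclosed in `[=' and `=]' is an equivalence class, standing for the sequences of characters of all collating elements equivalent to that one, including itself. (If there are no other equivalent collating elements, the treatment is as if the enclosing delimiters were `[.' and `.]'.) For example, if o and ô are the members of an equivalence class, then `[[=o=]]', `[[=ô=]]', and `[oô]' are all synonymous. An equivalence class may not(!) be an endpoint of a range.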
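Within a bracket expression, the name of a character class enclosed in `[:' and `:]' stands for the list of all characters belonging to that class. Standard character class names are:
alnum   digit   punct
alpha   graph   space
blank   lower   upper
cntrl   print   xdigit
These stand for the character classes defined in wctype(3). A locale may provide others. A character class may not be used as an endpoint of a range.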
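There are two special cases(!) of bracket expressions: the bracket expressions `[[:<:]]' and `[[:>:]]' match the null string at the beginning and end of a word respectively. A word is defined as a sequence of word characters which is neither preceded nor followed by word characters. A word character is an alnum character (as defined by wctype(3)) or an underscore. This is an extension, compatible with but not specified by POSIX 1003.2, and should be used with caution in software intended to be portable to other systems.
In the event that an RE could match more than one substring of a given string, the RE matches the one starting earliest in the string. If the RE could match more than one substring starting at that point, it matches the longest. Subexpressions also match the longest possible substrings, subject to the constraint that the whole match be as long as possible, with subexpressions starting earlier in the RE taking priority over ones starting later. Note that higher-level subexpressions thus take priority over their lower-level component subexpressions.
Match lengths are measured in characters, not collating elements. A null string is considered longer than no match at all. For example, `bb*' matches the three middle characters of `abbbc', `(wee|week)(knights|nights)' matches all ten characters of `weeknights', when `(.*).*' is matched against `abc' the parenthesized subexpression matches all three characters, and when `(a*)*' is matched against `bc' both the whole RE and the parenthesized subexpression match the null string.
If case-independent matching is specified, the effect is much as if all case distinctions had vanished from the alphabet. When an alphabetic that exists in multiple cases appears as an ordinary character outside a bracket expression, it is effectively transformed into a bracket expression containing both cases, e.g. `x' becomes `[xX]'. When it appears inside a bracket expression, all case counterparts of it are added to the bracket expression, so that (e.g.) `[x]' becomes `[xX]' and `[^x]' becomes `[^xX]'.
No particular limit is imposed on the length of REs. Programs intended to be portable should not employ REs longer than 256 bytes, as an implementation can refuse to accept such REs and remain POSIX-compliant.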
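Obsolete (``basic'') regular expressions differ in several respects. `|', `+', and `?' are ordinary characters and there is no equivalent for their functionality. The delimiters for bounds are `\{' and `\}', with `{' and `}' by themselves ordinary characters. The parentheses for nested subexpressions are `\(' and `\)', with `(' and `)' by themselves ordinary characters. `^' is an ordinary character except at the beginning of the RE or(!) the beginning of a parenthesized subexpression, `$' is an ordinary character except at the end of the RE or(!) the end of a parenthesized subexpression, and `*' is an ordinary character if it appears at the beginning of the RE or the beginning of a parenthesized subexpression (after a possible leading `^'). Finally, there is one new type of atom, a back reference: `\' followed by a non-zero decimal digit d matches the same sequence of characters matched by the dth parenthesized subexpression (numbering subexpressions by the positions of their opening parentheses, left to right), so that (e.g.) `\([bc]\)\1' matches `bb' or `cc' but not `bc'.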
SEE ALSO
regex(3)
POSIX 1003.2, section 2.8 (Regular Expression Notation).
NAME
regex - POSIX 1003.2 regular expressions
DESCRIPTION
Regular expressions (``RE''s), as defined in POSIX 1003.2, come in two forms: modern REs (roughly those of egrep; 1003.2 calls these ``extended'' REs) and obsolete REs (roughly those of ed(1); 1003.2 ``basic'' REs). Obsolete REs mostly exist for backward compatibility in some old programs; they will be discussed at the end. 1003.2 leaves some aspects of RE syntax and semantics open; `(!)' marks decisions on these aspects that may not be fully portable to other 1003.2 implementations.
A (modern) RE is one(!) or more non-empty(!) branches, separated by `|'. It matches anything that matches one of the branches.
A branch is one(!) or more pieces, concatenated. It matches a match for the first, followed by a match for the second, etc.
A piece is an atom possibly followed by a single(!) `*', `+', `?', or bound. An atom followed by `*' matches a sequence of 0 or more matches of the atom. An atom followed by `+' matches a sequence of 1 or more matches of the atom. An atom followed by `?' matches a sequence of 0 or 1 matches of the atom.
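The three repetition operators can be tried out with a small sketch. Python's `re` module is used here only as a stand-in: it is not a POSIX 1003.2 implementation, but for simple pieces like these its `*', `+', and `?' behave as described above. The helper name matches_whole is hypothetical.

```python
import re


def matches_whole(pattern, text):
    """Return True if `pattern` matches all of `text` (Python `re`, not POSIX)."""
    return re.fullmatch(pattern, text) is not None


# `b*' accepts zero or more b's, `b+' one or more, `b?' zero or one.
star_zero = matches_whole("ab*c", "ac")      # True: zero b's are allowed
plus_zero = matches_whole("ab+c", "ac")      # False: at least one b is required
opt_two = matches_whole("ab?c", "abbc")      # False: at most one b is allowed
```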
A bound is `{' followed by an unsigned decimal integer, possibly followed by `,' possibly followed by another unsigned decimal integer, always followed by `}'. The integers must lie between 0 and RE_DUP_MAX (255(!)) inclusive, and if there are two of them, the first may not exceed the second. An atom followed by a bound containing one integer i and no comma matches a sequence of exactly i matches of the atom. An atom followed by a bound containing one integer i and a comma matches a sequence of i or more matches of the atom. An atom followed by a bound containing two integers i and j matches a sequence of i through j (inclusive) matches of the atom.
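The rules for bounds can be sketched as a small validator. This is an illustration only: parse_bound is a hypothetical helper, not part of any regex library, and the limit of 255 is the value quoted above, which other implementations may not share.

```python
import re

RE_DUP_MAX = 255  # value given in the text; marked (!) as implementation-specific

# `{', an integer, an optional `,', an optional second integer, `}'.
_BOUND = re.compile(r"\{([0-9]+)(,([0-9]*))?\}")


def parse_bound(text):
    """Return (low, high) for a bound; high is None for the `i or more' form."""
    m = _BOUND.fullmatch(text)
    if m is None:
        raise ValueError(f"not a bound: {text!r}")
    low = int(m.group(1))
    if m.group(2) is None:        # `{i}': exactly i
        high = low
    elif m.group(3) == "":        # `{i,}': i or more
        high = None
    else:                         # `{i,j}': i through j inclusive
        high = int(m.group(3))
    for n in (low, high):
        if n is not None and n > RE_DUP_MAX:
            raise ValueError(f"{n} exceeds RE_DUP_MAX ({RE_DUP_MAX})")
    if high is not None and low > high:
        raise ValueError("the first integer may not exceed the second")
    return low, high
```

For instance, parse_bound("{2,5}") yields (2, 5), while "{5,2}" and "{256}" are rejected.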
An atom is a regular expression enclosed in `()' (matching a match for the regular expression), an empty set of `()' (matching the null string)(!), a bracket expression (see below), `.' (matching any single character), `^' (matching the null string at the beginning of a line), `$' (matching the null string at the end of a line), a `\' followed by one of the characters `^.[$()|*+?{\' (matching that character taken as an ordinary character), a `\' followed by any other character(!) (matching that character taken as an ordinary character, as if the `\' had not been present(!)), or a single character with no other significance (matching that character). A `{' followed by a character other than a digit is an ordinary character, not the beginning of a bound(!). It is illegal to end an RE with `\'.
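One practical consequence is how to quote arbitrary text so that it matches itself in a modern RE. The sketch below escapes only the characters listed above, since a `\' before any other character is marked (!) and may not be portable. escape_ere is a hypothetical helper name.

```python
# Characters that the text says may portably follow a `\' in a modern RE.
ERE_SPECIAL = set("^.[$()|*+?{\\")


def escape_ere(text):
    """Prefix each special character with `\\' so the text matches literally."""
    return "".join("\\" + ch if ch in ERE_SPECIAL else ch for ch in text)
```

With this helper, the text `1+1=2?` becomes `1\+1=2\?`, and text without special characters is returned unchanged.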
A bracket expression is a list of characters enclosed in `[]'. It normally matches any single character from the list (but see below). If the list begins with `^', it matches any single character (but see below) not from the rest of the list. If two characters in the list are separated by `-', this is shorthand for the full range of characters between those two (inclusive) in the collating sequence, e.g. `[0-9]' in ASCII matches any decimal digit. It is illegal(!) for two ranges to share an endpoint, e.g. `a-c-e'. Ranges are very collating-sequence-dependent, and portable programs should avoid relying on them.
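As a rough illustration of what a range stands for, the sketch below expands a range under the assumption that the collating sequence is plain code point order, as in ASCII. In other locales the members of a range may differ, which is the portability concern raised above. expand_range is a hypothetical helper.

```python
def expand_range(first, last):
    """List the characters from `first' through `last', assuming code point order."""
    if ord(first) > ord(last):
        raise ValueError("range endpoints are out of order")
    return "".join(chr(c) for c in range(ord(first), ord(last) + 1))
```

Under that assumption, expand_range("0", "9") gives the ten decimal digits.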
To include a literal `]' in the list, make it the first character (following a possible `^'). To include a literal `-', make it the first or last character, or the second endpoint of a range. To use a literal `-' as the first endpoint of a range, enclose it in `[.' and `.]' to make it a collating element (see below). With the exception of these and some combinations using `[' (see next paragraphs), all other special characters, including `\', lose their special significance within a bracket expression.
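The placement rules for `]' and `-' can be sketched as a builder that assembles a bracket expression from a set of characters. This is a simplified illustration: bracket_for is a hypothetical helper, and it deliberately refuses `^' and `[', whose placement needs more care than shown here.

```python
def bracket_for(chars):
    """Build a bracket expression matching any single character in `chars'."""
    members = set(chars)
    if not members:
        raise ValueError("the list must not be empty")
    if members & {"^", "["}:
        raise ValueError("`^' and `[' are not handled by this sketch")
    body = ""
    if "]" in members:
        body += "]"                      # a literal `]' must come first
    body += "".join(sorted(members - {"]", "-"}))
    if "-" in members:
        body += "-"                      # a literal `-' is safe as the last character
    return "[" + body + "]"
```

For example, bracket_for("a]") gives `[]a]` and bracket_for("-a") gives `[a-]`. A backslash needs no special treatment because it is an ordinary character inside a bracket expression.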
Within a bracket expression, a collating element (a character, a multi-character sequence that collates as if it were a single character, or a collating-sequence name for either) enclosed in `[.' and `.]' stands for the sequence of characters of that collating element. The sequence is a single element of the bracket expression's list. A bracket expression containing a multi-character collating element can thus match more than one character, e.g. if the collating sequence includes a `ch' collating element, then the RE `[[.ch.]]*c' matches the first five characters of `chchcc'.
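The `chchcc' example can be approximated in Python, with a loud caveat: Python's `re` module has no collating elements, so the non-capturing group `(?:ch)' stands in for `[[.ch.]]' here. This reproduces only this one example and is not a general translation. matched_prefix is a hypothetical helper.

```python
import re

# Assumption: `(?:ch)' emulates a bracket expression whose only member is
# the multi-character collating element `ch'.
EMULATED = re.compile(r"(?:ch)*c")


def matched_prefix(text):
    """Return the part of `text' matched at its start, or None if no match."""
    m = EMULATED.match(text)
    return text[:m.end()] if m else None
```

Applied to `chchcc', the helper returns the first five characters, `chchc'.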
Within a bracket expression, a collating element enclosed in `[=' and `=]' is an equivalence class, standing for the sequences of characters of all collating elements equivalent to that one, including itself. (If there are no other equivalent collating elements, the treatment is as if the enclosing delimiters were `[.' and `.]'.) For example, if o and ô are the members of an equivalence class, then `[[=o=]]', `[[=ô=]]', and `[oô]' are all synonymous. An equivalence class may not(!) be an endpoint of a range.
Within a bracket expression, the name of a character class enclosed in `[:' and `:]' stands for the list of all characters belonging to that class. Standard character class names are:
alnum   digit   punct
alpha   graph   space
blank   lower   upper
cntrl   print   xdigit
These stand for the character classes defined in wctype(3). A locale may provide others. A character class may not be used as an endpoint of a range.
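For orientation, the membership of the twelve standard classes can be tabulated for one specific case. The sketch assumes the "C" (POSIX) locale and ASCII characters only; other locales may assign different characters to each class and may define additional classes. POSIX_CLASSES and in_class are hypothetical names.

```python
import string

# Assumption: "C" locale, ASCII only.
POSIX_CLASSES = {
    "alnum": set(string.ascii_letters + string.digits),
    "alpha": set(string.ascii_letters),
    "blank": set(" \t"),
    "cntrl": {chr(c) for c in range(32)} | {chr(127)},
    "digit": set(string.digits),
    "graph": {chr(c) for c in range(33, 127)},   # printable, excluding space
    "lower": set(string.ascii_lowercase),
    "print": {chr(c) for c in range(32, 127)},   # printable, including space
    "punct": set(string.punctuation),
    "space": set(string.whitespace),
    "upper": set(string.ascii_uppercase),
    "xdigit": set(string.hexdigits),
}


def in_class(ch, name):
    """Return True if the character belongs to the named class (C locale)."""
    return ch in POSIX_CLASSES[name]
```

In this locale, graph is exactly the union of alnum and punct, and print adds the space character.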
There are two special cases(!) of bracket expressions: the bracket expressions `[[:<:]]' and `[[:>:]]' match the null string at the beginning and end of a word respectively. A word is defined as a sequence of word characters which is neither preceded nor followed by word characters. A word character is an alnum character (as defined by wctype(3)) or an underscore. This is an extension, compatible with but not specified by POSIX 1003.2, and should be used with caution in software intended to be portable to other systems.
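The definition of a word can be made concrete by listing the positions where `[[:<:]]' and `[[:>:]]' would match. The sketch below follows the definition above but assumes ASCII alnum characters only; it does not call any regex implementation. word_boundaries is a hypothetical helper.

```python
import string

# Assumption: ASCII alnum characters plus underscore.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def word_boundaries(text):
    """Return (starts, ends): offsets where a word begins and where one ends."""
    starts, ends = [], []
    for i in range(len(text) + 1):
        before = i > 0 and text[i - 1] in WORD_CHARS
        after = i < len(text) and text[i] in WORD_CHARS
        if after and not before:
            starts.append(i)     # null string at the beginning of a word
        if before and not after:
            ends.append(i)       # null string at the end of a word
    return starts, ends
```

For the text `foo_1 bar`, words begin at offsets 0 and 6 and end at offsets 5 and 9.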
In the event that an RE could match more than one substring of a given string, the RE matches the one starting earliest in the string. If the RE could match more than one substring starting at that point, it matches the longest. Subexpressions also match the longest possible substrings, subject to the constraint that the whole match be as long as possible, with subexpressions starting earlier in the RE taking priority over ones starting later. Note that higher-level subexpressions thus take priority over their lower-level component subexpressions.
Match lengths are measured in characters, not collating elements. A null string is considered longer than no match at all. For example, `bb*' matches the three middle characters of `abbbc', `(wee|week)(knights|nights)' matches all ten characters of `weeknights', when `(.*).*' is matched against `abc' the parenthesized subexpression matches all three characters, and when `(a*)*' is matched against `bc' both the whole RE and the parenthesized subexpression match the null string.
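The earliest-then-longest rule can be emulated by brute force. Python's `re` module is not a POSIX implementation (it stops at the first alternative that succeeds), so the sketch tries every start offset from left to right and, for each, every end offset from longest to shortest. It assumes a pattern whose syntax means the same in both dialects and that contains no `^' or `$'; it reports only the overall match, not subexpression matches. posix_search is a hypothetical helper.

```python
import re


def posix_search(pattern, text):
    """Return (start, end) of the earliest, then longest, match, or None."""
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):              # earliest start first
        for end in range(len(text), start - 1, -1):  # longest candidate first
            if compiled.fullmatch(text, start, end):
                return start, end
    return None
```

The difference shows with `wee|week' against `weeknights': re.search stops after `wee', while posix_search reports the four characters of `week'. The nested loops make this quadratic in the text length, so it is suitable for experiments only.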
If case-independent matching is specified, the effect is much as if all case distinctions had vanished from the alphabet. When an alphabetic that exists in multiple cases appears as an ordinary character outside a bracket expression, it is effectively transformed into a bracket expression containing both cases, e.g. `x' becomes `[xX]'. When it appears inside a bracket expression, all case counterparts of it are added to the bracket expression, so that (e.g.) `[x]' becomes `[xX]' and `[^x]' becomes `[^xX]'.
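The transformation described for ordinary characters can be sketched directly. The helper below assumes its input consists only of ordinary characters (no RE special characters and no bracket expressions), and fold_case_literal is a hypothetical name.

```python
def fold_case_literal(text):
    """Rewrite ordinary characters so that each letter matches in both cases."""
    out = []
    for ch in text:
        lower, upper = ch.lower(), ch.upper()
        # Only letters with exactly one counterpart in each case are expanded.
        if lower != upper and len(lower) == 1 and len(upper) == 1:
            out.append("[" + lower + upper + "]")
        else:
            out.append(ch)
    return "".join(out)
```

As in the text, `x' becomes `[xX]'; characters without case, such as digits, are left alone.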
No particular limit is imposed on the length of REs(!). Programs intended to be portable should not employ REs longer than 256 bytes, as an implementation can refuse to accept such REs and remain POSIX-compliant.
Obsolete (``basic'') regular expressions differ in several respects. `|', `+', and `?' are ordinary characters and there is no equivalent for their functionality. The delimiters for bounds are `\{' and `\}', with `{' and `}' by themselves ordinary characters. The parentheses for nested subexpressions are `\(' and `\)', with `(' and `)' by themselves ordinary characters. `^' is an ordinary character except at the beginning of the RE or(!) the beginning of a parenthesized subexpression, `$' is an ordinary character except at the end of the RE or(!) the end of a parenthesized subexpression, and `*' is an ordinary character if it appears at the beginning of the RE or the beginning of a parenthesized subexpression (after a possible leading `^'). Finally, there is one new type of atom, a back reference: `\' followed by a non-zero decimal digit d matches the same sequence of characters matched by the dth parenthesized subexpression (numbering subexpressions by the positions of their opening parentheses, left to right), so that (e.g.) `\([bc]\)\1' matches `bb' or `cc' but not `bc'.
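The back reference example can be checked with a short sketch. Python's `re` module is not a POSIX implementation and writes the group with unescaped parentheses, so the obsolete RE `\([bc]\)\1' is written as `([bc])\1' below; the meaning of `\1' is the same. is_doubled is a hypothetical helper.

```python
import re

# Python spelling of the obsolete RE \([bc]\)\1 (assumption: same semantics).
DOUBLED = re.compile(r"([bc])\1")


def is_doubled(text):
    """Return True if `text' is `bb' or `cc', i.e. the group followed by itself."""
    return DOUBLED.fullmatch(text) is not None
```

The back reference repeats what the group actually matched, not the group's pattern, which is why `bc' is rejected.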
SEE ALSO
regex(3)
POSIX 1003.2, section 2.8 (Regular Expression Notation).