Regular Expressions
CHAPTER 9: REGULAR EXPRESSIONS
What Is A Regular Expression (RE)?
- Pattern for string matching
- Used by many UNIX programs, such as sed, awk, nawk, grep, egrep,
vi and emacs
- Any RE that can be described by one of the above tools can also
be written in Perl
- Perl's REs are similar to those in egrep
Single-Character Patterns
- A dot '.' matches any single character except a newline
- A backslash followed by two or three octal digits (e.g. \015)
matches the character having that octal value
- A backslashed 'x' followed by two hexadecimal digits (e.g. \x7f)
matches the character having that hexadecimal value
- A backslashed c followed by a single character (e.g. \cD) matches
the corresponding control character
- Any non-special character (see below) matches itself
- Ex.
a # Matches 'a'
\004 # Matches '<Ctrl-D>'
\cD # Matches '<Ctrl-D>'
\x7f # Matches '<Delete>'
. # Matches any character except '\n'
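The escapes above can be checked with a minimal sketch, assuming a Perl 5 interpreter. The variable names are made up for illustration, and =~ (covered under The Pattern Binding Operator) binds each one-character string to a pattern.

```perl
use strict;
use warnings;

# Hypothetical one-character test strings.
my $ctrl_d = chr(4);        # <Ctrl-D>, octal 004
my $delete = chr(0x7f);     # <Delete>, hexadecimal 7f

my $octal_ok   = ($ctrl_d =~ /\004/) ? 1 : 0;   # octal escape
my $control_ok = ($ctrl_d =~ /\cD/)  ? 1 : 0;   # control-character escape
my $hex_ok     = ($delete =~ /\x7f/) ? 1 : 0;   # hexadecimal escape
my $dot_ok     = ("a"     =~ /./)    ? 1 : 0;   # dot matches 'a'
my $dot_nl     = ("\n"    =~ /./)    ? 1 : 0;   # dot does not match "\n"

print "octal=$octal_ok control=$control_ok hex=$hex_ok dot=$dot_ok newline=$dot_nl\n";
```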
Predefined Single-Character Patterns
\n Newline
\r Carriage Return
\t Tab
\f Formfeed
Character Class
- Another type of single-character pattern
- List (that is, group, not really a Perl list) of characters
enclosed in []
- Represents choice: matches any single character that is in the
list
- If the first character in the list is a caret (^), a negated
character class is created. This matches any single character
which is NOT in the list.
- Character ranges can be created by using a dash (-)
- To include a dash, circumflex, left bracket or right bracket in
the list, precede them with a backslash
- Ex.
[0123456789] # Matches any single digit
[0-9] # Same thing
[^0-9] # Matches any single non-digit
[a-zA-Z] # Matches any single alphabetic
[z-a] # Error: Invalid range.
Predefined Character Classes
\d Digit: [0-9]
\D Non-digits [^0-9]
\w Word character: [a-zA-Z0-9_]
\W Non-word character: [^a-zA-Z0-9_]
\s Whitespace character: [ \t\n\r\f]
\S Non-whitespace character: [^ \t\n\r\f]
Note: The above may be used in the definition of a new
character class
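A short sketch of character classes in use, assuming Perl 5. The sample string is hypothetical, and the 'g' option (see Global Matching) is used only to collect every matching character.

```perl
use strict;
use warnings;

my $text = "Room 42B, floor-3";          # hypothetical sample string

my @digits   = $text =~ /[0-9]/g;        # each single digit
my @letters  = $text =~ /[a-zA-Z]/g;     # each single alphabetic character
my @specials = $text =~ /[^\w\s]/g;      # predefined classes inside a new,
                                         # negated class

print "digits: @digits\n";
print "letters: ", scalar (@letters), "\n";
print "specials: @specials\n";
```

The last pattern shows the note above at work: \w and \s appear inside [], so the class matches anything that is neither a word character nor whitespace.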
Backreference
- Whenever a part of a RE enclosed in grouping parentheses is
matched, the matched substring is remembered for subsequent use
- A backslashed single digit number matches whatever the
corresponding parenthesized group matched
- Called a backreference to a substring
- The parenthesized groups are numbered starting from 1 by counting
the left parentheses from left to right in the RE
- A backslashed multi-digit number will also be considered a
backreference if the RE contains that many parenthesized groups
and the number does not begin with a 0
- Note that parentheses serve two purposes: grouping and
backreferencing. Use of parentheses for grouping triggers
backreferencing.
- Ex.
(.).*\1 # Matches any substring of two or
# more chars that begins and ends
# with the same char
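A minimal sketch of backreferences, assuming Perl 5; the test words are hypothetical. Without anchors the pattern may match anywhere inside the string, so ^ and $ (see Assertions) are added where the whole string is meant.

```perl
use strict;
use warnings;

# \1 must repeat exactly what the first group matched.
my $doubled    = ("bookkeeper" =~ /(.)\1/)     ? 1 : 0;   # "oo" inside the word
my $same_ends  = ("level"      =~ /^(.).*\1$/) ? 1 : 0;   # 'l' ... 'l'
my $diff_ends  = ("lever"      =~ /^(.).*\1$/) ? 1 : 0;   # 'l' ... 'r'
my $unanchored = ("lever"      =~ /(.).*\1/)   ? 1 : 0;   # "eve" inside the word

print "doubled=$doubled same=$same_ends diff=$diff_ends unanchored=$unanchored\n";
```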
Atom
- Any single-character pattern OR
- Any RE enclosed in grouping parentheses
Atom Sequence
- A sequence of atoms
- A sequence of atoms matches if every atom in the sequence matches
in the order the atoms occur
- Ex.
Bob # Matches "Bob"
(B)(o)(b) # Matches "Bob"
ch\d # Matches "ch" followed by a
# single digit
Quantified Atoms
- A quantifier (or multiplier) following an atom indicates how many
times the atom must occur for a match
- Quantifiers are:
{n,m} Must occur at least n times but no more
than m times
{n,} Must occur at least n times
{n} Must occur "exactly" n times
* Must occur 0 or more times (same as {0,})
+ Must occur 1 or more times (same as {1,})
? Must occur 0 or 1 times (same as {0,1})
- These quantified-atom patterns are greedy - the longest possible
string is matched
- If a RE contains two or more quantified-atom patterns, then
leftmost is greediest
- Ex.
\w+\d? # Matches one or more word
# characters followed by
# an optional digit
x{3,7} # Matches 3-7 x's
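Greediness can be seen by capturing what each quantified atom consumed. This is a sketch assuming Perl 5, with made-up sample strings; the list-context form of the match is described under The Match Operator.

```perl
use strict;
use warnings;

# {3,7} takes as many x's as it is allowed: seven of the nine.
my ($run) = "xxxxxxxxx" =~ /(x{3,7})/;

# Leftmost is greediest: the first group takes everything,
# leaving the second group with the empty string.
my ($left, $right) = "aaaa" =~ /(a+)(a*)/;

# \w also matches digits, so \w+ swallows the 9 and \d? matches nothing.
my ($word, $digit) = "ch9" =~ /(\w+)(\d?)/;

print "run=$run left=$left right='$right' word=$word digit='$digit'\n";
```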
Assertions (Anchoring Patterns)
- A pattern can be anchored to a particular part of a string using
an assertion
- Assertions are:
^ Matches the beginning of the string
$ Matches the end of the string (or BEFORE any
newline at the end of the string)
\b Matches on a word boundary (the place
between characters matching \w and \W
or \W and \w)
\B Matches on a non-word boundary
- Ex.
\bBob\b # Matches Bob, but not Bobby or
# ohBob
\bBob\B # Matches Bobby, but not "Bob Tarr"
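The word-boundary examples above can be tried with a short sketch, assuming Perl 5. The list of names is hypothetical, and grep (see Grep Function) is used only to filter it.

```perl
use strict;
use warnings;

my @names = ("Bob", "Bobby", "ohBob", "Bob Tarr");   # hypothetical test data

my @whole_word = grep (/\bBob\b/, @names);   # "Bob" as a complete word
my @word_start = grep (/\bBob\B/, @names);   # "Bob" starting a longer word

# $ also matches just before a newline at the end of the string.
my $before_newline = ("the end\n" =~ /end$/) ? 1 : 0;

print "whole: @whole_word\n";
print "start: @word_start\n";
print "before newline: $before_newline\n";
```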
Item
- An assertion OR
- A quantified atom OR
- A sequence of atoms
Item Sequence
- A sequence of items
- A sequence of items matches if every item in the sequence matches
in the order the items occur
- Ex.
end$ # Matches "end" at the end of
# the string
^\d.*\d$ # Matches a string that begins
# and ends with a digit
Regular Expression (RE)
- A group of item sequences separated by the vertical-bar (|)
character
- Each item sequence represents an alternative
- The RE matches if any of the alternatives match
- The alternatives are always evaluated left-to-right, stopping at
the first complete match
- Recall that the definition of an atom included any RE enclosed in
grouping parentheses. This allows for the alternation symbol to
also be used inside an atom (or item).
- Ex.
Bob|Joe # Matches "Bob" or "Joe"
^(Bob|Joe) # Matches "Bob" or "Joe" at
# the beginning of the string
(Bob|Joe)Bob # Matches "BobBob" or "JoeBob"
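Left-to-right evaluation of alternatives matters when one alternative is a prefix of another. A minimal sketch, assuming Perl 5 and hypothetical sample strings:

```perl
use strict;
use warnings;

# The first alternative that completes a match wins, even if a later
# alternative would have matched more of the string.
my ($short) = "Bobby" =~ /(Bob|Bobby)/;
my ($long)  = "Bobby" =~ /(Bobby|Bob)/;

# Alternation inside grouping parentheses, followed by a literal "Bob".
my ($pair) = "xxJoeBobxx" =~ /((Bob|Joe)Bob)/;

print "short=$short long=$long pair=$pair\n";
```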
Precedence
- The precedence of the RE meta-characters is (highest to lowest):
Grouping Parentheses ()
Quantifiers +, *, ?, {n,m}
Sequences And Assertions ^, $, \b, \B
Alternation |
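The precedence table explains why parentheses are often needed. The following sketch, assuming Perl 5 and made-up test data, contrasts patterns with and without grouping.

```perl
use strict;
use warnings;

my @names = ("Bob", "Joe", "Hi Joe", "Hi Bob");   # hypothetical test data

# Alternation binds loosest: this means (^Bob) or (Joe anywhere).
my @loose = grep (/^Bob|Joe/, @names);

# Grouping makes the anchor apply to both alternatives.
my @tight = grep (/^(Bob|Joe)/, @names);

# A quantifier binds tighter than a sequence: b+ repeats only the b.
my $seq_repeat   = ("abab" =~ /^ab+$/)   ? 1 : 0;
my $group_repeat = ("abab" =~ /^(ab)+$/) ? 1 : 0;

print "loose: ", join (", ", @loose), "\n";
print "tight: ", join (", ", @tight), "\n";
print "ab+=$seq_repeat (ab)+=$group_repeat\n";
```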
Special RE Rules
- The following characters must be backslashed to get their literal
meaning:
( ) { } * + ? . [ ] | \
- The '^' character must be backslashed in two instances to get its
literal meaning: 1) if it is the first character inside [], or
2) if it is the first atom of an atom sequence
- The '$' character must be backslashed in one instance to get its
literal meaning: if it is the last atom of an atom sequence
- The '-' character must be backslashed inside [] to get its literal
meaning if it would otherwise be interpreted as delimiting a
range
- The ']' character does not terminate a character class definition
if it is the first character (after any ^) in the class
- The following backslash sequences have special meaning:
\n \r \t \f \d \D \w \W \s \S \b \B
\0nn \xnn \c<control char>
- Inside [], \b means the backspace character
- Any other backslashed character matches the character itself.
Thus, \y is the same as y.
- Any other character is non-special and matches itself
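A few of the escaping rules above, sketched under the assumption of Perl 5; the sample strings are hypothetical.

```perl
use strict;
use warnings;

my $text = 'Cost: $5.00 (approx)';   # single quotes: no interpolation

# Backslashes give $, . and the parentheses their literal meaning.
my $escaped   = ($text  =~ /\$5\.00 \(approx\)/) ? 1 : 0;

# An unescaped dot matches any character; an escaped dot only '.'.
my $wild_dot  = ("5x00" =~ /5.00/)  ? 1 : 0;
my $plain_dot = ("5x00" =~ /5\.00/) ? 1 : 0;

# Inside [], a backslashed dash is a literal dash, not a range.
my $dash_in   = ("-" =~ /[a\-z]/) ? 1 : 0;
my $m_in      = ("m" =~ /[a\-z]/) ? 1 : 0;

# A ] placed first in the class does not terminate it.
my $bracket   = ("]" =~ /[]a]/) ? 1 : 0;

print "escaped=$escaped wild=$wild_dot plain=$plain_dot ",
      "dash=$dash_in m=$m_in bracket=$bracket\n";
```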
The Match Operator
- Used to search a string for a pattern specified by a RE
- /RE/
- In a scalar context, returns True if the RE matched, else
returns False. Any matched backreferences are saved in the
special variables $1, $2, etc.
NOTE!! The variables $1, $2, etc. should NOT be used to
represent backreferences in the RE, since they
have not been created yet when the RE is being
evaluated. Instead use the \1, \2, etc. forms.
- In an array context, returns the matched backreferences if any
matched substrings exist. If no matched substrings exist because
the RE has no parenthesized groups, but the RE matches part of the
search string, returns (1). If no match occurs, the empty list is
returned. If no matched backreferences exist, the $1, $2,...
variables are NOT set.
- Searches $_ by default
- If the RE is empty, the most recent RE from a previous match
operation (or substitution operation) is used
- Ex.
if (/Bob|Joe/) # True if $_ contains "Bob"
# or "Joe"
{
print ("$_");
}
- Ex.
$_ = "One Two Three Four";
($first, $second) = /(\w+)\W+(\w+)/;
print ("The first word was $first, the second was $second\n");
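The return values described above can be compared side by side. This is a sketch assuming Perl 5; the variable names are made up.

```perl
use strict;
use warnings;

$_ = "One Two Three Four";

# Scalar context: true or false.
my $found = /Two/ ? 1 : 0;

# List (array) context with groups: the matched substrings.
my ($first, $second) = /(\w+)\W+(\w+)/;

# List context without groups: (1) on a match, () on failure.
my @no_groups = /Three/;
my @no_match  = /Five/;

print "found=$found first=$first second=$second\n";
print "no_groups=(@no_groups) no_match has ", scalar (@no_match), " elements\n";
```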
The Pattern Binding Operator
- To change the target of the matching operator, use the pattern
binding operator, =~
- EXPR =~ /RE/
- EXPR can be any expression that yields a scalar string value
- Ex.
print ("Continue [y/n]?");
if (<STDIN> =~ /^[yY]/) # True if line read in
# begins with y or Y
{
print ("Let's do it!\n");
}
- Ex.
$string = "aacbbxxxaacbb";
@matches = $string =~ /(a)c(b)/; # @matches is ("a", "b")
Note that =~ has higher precedence than =, so the above
works as expected. If you did this:
(@matches = $string) =~ /(a)c(b)/;
@matches becomes the one-element list ($string). Then
the pattern binding operator forces its left operand to be
evaluated in a scalar context, which in this case would be
the number of elements in @matches which is 1. So the
string "1" is bound to the pattern match! This always fails
with the above pattern.
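The precedence pitfall described above can be reproduced with a small sketch, assuming Perl 5. The names @copy and $bound_ok are hypothetical.

```perl
use strict;
use warnings;

my $string = "aacbbxxxaacbb";

# =~ binds tighter than =, so the match runs first, in list context.
my @matches = $string =~ /(a)c(b)/;

# Parenthesizing the assignment changes what the pattern is bound to:
# the list assignment is evaluated in scalar context and yields 1.
my @copy;
my $bound_ok = ((@copy = $string) =~ /(a)c(b)/) ? 1 : 0;

print "matches: @matches\n";
print "copy: @copy\n";
print "bound_ok: $bound_ok\n";
```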
Case-Insensitive Matching
- To ignore case in a match operation, use the 'i' option
- EXPR =~ /RE/i
- Ex.
print ("Continue [y/n]?");
if (<STDIN> =~ /^[y]/i) # True if line read in
# begins with y or Y
{
print ("Let's do it!\n");
}
Using A Different Delimiter
- To use a delimiter other than /, use the 'm' prefix
- EXPR =~ m#RE#
- Useful for patterns that contain /'s (e.g. pathnames); it avoids
lots of backslashed slashes
- Ex.
if ($filename =~ m#/usr/local/etc#)
{
print ("Found it!\n");
}
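The same pathname test written with and without an alternate delimiter, as a sketch assuming Perl 5. The path is hypothetical, and braces are used as a second example of a delimiter pair.

```perl
use strict;
use warnings;

my $filename = "/usr/local/etc/hosts";   # hypothetical path

# Default delimiter: every slash in the pattern needs a backslash.
my $with_slashes = ($filename =~ /\/usr\/local\/etc/) ? 1 : 0;

# The m prefix with # as the delimiter: no backslashes needed.
my $with_hashes  = ($filename =~ m#/usr/local/etc#) ? 1 : 0;

# Braces as delimiters; the greedy .* runs up to the last slash.
my ($dir) = $filename =~ m{^(.*)/};

print "slashes=$with_slashes hashes=$with_hashes dir=$dir\n";
```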
Global Matching
- To match ALL occurrences of the pattern in the target string,
use the 'g' option
- EXPR =~ /RE/g
- In a scalar context, only the first occurrence is matched, but
Perl remembers where it stopped searching and will continue at
that point on the next search. Any matched backreferences are
saved in $1, $2, etc. on each iteration.
- In an array context, returns the matched backreferences if any
matched substrings exist, for EACH occurrence matched. If no
matched substrings exist because the RE has no parenthesized
groups, returns the matched part of the string for EACH
occurrence matched. Otherwise returns the empty list. The $1,
$2,... variables are set for the first occurrence only of each
matched backreference.
- Ex.
$string = "aacbbxxxaacbb";
@matches = $string =~ /(aa)c(bb)/; # @matches is ("aa", "bb")
@matches = $string =~ /(aa)c(bb)/g; # @matches is now ("aa",
# "bb", "aa", "bb")
@matches = $string =~ /aacbb/; # @matches is now (1)
@matches = $string =~ /aacbb/g; # @matches is now ("aacbb",
# "aacbb")
- Ex.
To extract all words from the string in $x:
while ($x =~ /\w+/g) # Global matching in a
{ # scalar context
push (@words, $&); # $& contains the matched
} # substring
Or using global matching in an array context:
push (@words, $x =~ /\w+/g);
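The "remembers where it stopped" behavior can be observed with Perl's built-in pos function, which is not mentioned above and is used here as an assumption about how to inspect the search position. A minimal sketch for Perl 5:

```perl
use strict;
use warnings;

my $string = "aacbbxxxaacbb";

# Scalar context: one match per iteration; pos() reports where the
# last match ended, so the start offset can be computed from it.
my @starts;
while ($string =~ /aacbb/g)
{
    push (@starts, pos ($string) - length ("aacbb"));
}

# List context: all occurrences at once.
my @groups = $string =~ /(aa)c(bb)/g;
my @whole  = $string =~ /aacbb/g;

print "starts: @starts\n";
print "groups: @groups\n";
print "whole: @whole\n";
```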
Variable Interpolation In The Match Operator
- Each time the match operator is encountered, the RE is scanned
for possible scalar variable references and then the values of
these variables are interpolated into the pattern
- The RE is then "compiled" for execution
- If the RE does not have any variable references, the pattern is
compiled only once. But if it does have variable references,
variable interpolation and compilation is done each time the RE
is encountered
- If the variable contains a pattern which does NOT change over the
life of the Perl program, use the 'o' option to compile the pattern
only once and avoid needless runtime recompilations
- Ex.
$word = "Bob";
if ($string =~ /\b$word\b/o)
{
print ("Found $word in the string!\n");
}
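The 'o' option is only safe when the interpolated variable really is constant. The sketch below, assuming Perl 5 and made-up data, shows what happens when it is not: the pattern compiled on the first pass is reused on every later pass.

```perl
use strict;
use warnings;

my $string = "Joe was here";   # hypothetical target string
my (@fresh, @stale);

foreach my $word ("Bob", "Joe")
{
    # Recompiled whenever $word changes: finds "Joe" on the second pass.
    push (@fresh, $word) if $string =~ /\b$word\b/;

    # Compiled once with $word equal to "Bob": never finds "Joe".
    push (@stale, $word) if $string =~ /\b$word\b/o;
}

print "fresh: @fresh\n";
print "stale: ", scalar (@stale), " matches\n";
```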
Special Variables Used In The Match Operator
- Recall that $1, $2, etc. contain the matched backreferences
- Other special variables include:
$& The part of the string that matched
$` The part of the string before the part
that matched
$' The part of the string after the part
that matched
- Note that these special variables are NOT set if the match
operator is used in an array context and contains backreferences
- These are all read-only variables and cannot be set
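A minimal sketch of the three variables after a successful match in scalar context, assuming Perl 5; the sample string is hypothetical.

```perl
use strict;
use warnings;

my $string = "Hello Bob Tarr";   # hypothetical sample string
my ($before, $match, $after);

if ($string =~ /Bob/)
{
    # Copy the special variables right away; the next successful
    # match will overwrite them.
    ($before, $match, $after) = ($`, $&, $');
}

print "before=[$before] match=[$match] after=[$after]\n";
```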
The Match-Once Operator
- Same as the match operator (/RE/) except that it matches
only once between calls to the reset operator
- ?RE?
- The i and o options can be used with the match-once operator
- Once a match is found with the match-once operator, it will
fail to find any more matches until the reset operator is
called
- Useful for finding the first occurrence of a pattern in each
file of a set of files
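A sketch of match-once behavior, assuming Perl 5. Recent Perl releases require the explicit m prefix (m?RE?), so that form is used here; the helper starts_with_a and the sample lines are hypothetical.

```perl
use strict;
use warnings;

my @lines = ("apple", "avocado", "banana");   # hypothetical input lines

# m?RE? succeeds once, then fails until reset is called.
sub starts_with_a
{
    my ($line) = @_;
    return 1 if $line =~ m?^a?;
    return 0;
}

my @first_pass  = grep (starts_with_a ($_), @lines);   # first hit only
my @second_pass = grep (starts_with_a ($_), @lines);   # already matched once
reset;                                                 # re-arm the operator
my @third_pass  = grep (starts_with_a ($_), @lines);   # first hit again

print "first: @first_pass\n";
print "second: ", scalar (@second_pass), " matches\n";
print "third: @third_pass\n";
```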
The Substitution Operator
- Used to search a target string for a pattern and replace the
matched substring with a new string
- s/RE/REPLACEMENT_STRING/
- Returns the number of substitutions made, or 0 (False) if the
pattern was not found in the target string
- Searches $_ by default. The '=~' operator can be used to change
the target string.
- The 'i' (ignore case) and 'o' (compile only once) options can be
used
- Without any options, the replacement is done on only the first
match encountered in the target string. To replace all occurrences
of the pattern in the target string with the replacement string,
use the 'g' option.
- Both the RE and the REPLACEMENT_STRING are variable interpolated
- The REPLACEMENT_STRING can have backreferences in the form $n
NOTE!! Although the \1, \2, etc. forms will sometimes work
in the REPLACEMENT_STRING as a backreference, the
$1, $2, etc. forms should be used here.
So the rule is: Use \1, etc. in the RE
Use $1, etc. in the REPLACEMENT_STRING
- The sed and awk programs allow the use of '&' in the replacement
string to refer to that part of the target string that matched.
In Perl you must use $&.
- Any delimiter may be used instead of /
- If the RE is empty, the most recent RE from a previous match
operation (or substitution operation) is used
- Ex.
$pat = "Bob";
$rs = "Joe";
if ($times = ($string =~ s/\b$pat\b/$rs/go))
{
print ("Replaced $pat $times times in the string!\n");
}
Note that
$times = $string =~ s/RE/REPLACEMENT_STRING/;
is the same as
$times = ($string =~ s/RE/REPLACEMENT_STRING/);
Both perform the substitution on $string and return the number
of substitutions in $times.
But
($times = $string) =~ s/RE/REPLACEMENT_STRING/;
copies $string to $times and does the substitution on $times,
leaving $string unchanged.
- Ex.
# Put parentheses around each word in $string
$string =~ s/(\w+)/($1)/g;
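The return value, the 'g' option and the copy-then-substitute form can be compared in one sketch, assuming Perl 5; all strings and names are made up for illustration.

```perl
use strict;
use warnings;

my $string = "Bob and Bobby met Bob";   # hypothetical target string

# Without g: only the first match is replaced.
my $first_only = $string;
$first_only =~ s/Bob/Joe/;

# With g: every whole-word "Bob" is replaced; the count is returned.
my $all = $string;
my $times = ($all =~ s/\bBob\b/Joe/g);

# No match: returns a false value and leaves the target unchanged.
my $none = ($all =~ s/Zed/Joe/) ? 1 : 0;

# Copy, then substitute on the copy; $1 refers to the group in the RE.
(my $wrapped = "One Two") =~ s/(\w+)/($1)/g;

print "$first_only\n$all ($times)\nnone=$none\n$wrapped\n";
```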
Split Function
- Splits a string into an array of strings (fields) according to a
delimiter specified by a pattern (RE)
- split (/RE/, EXPR, LIMIT)
split (/RE/, EXPR)
split (/RE/)
split
- Returns the array of strings (fields) in an array context. (In
a scalar context, returns the number of fields and splits into
the @_ array.)
- The RE identifies the delimiters that separate the desired fields
to be extracted from the string in EXPR. The delimiter may be
longer than one character.
- If the RE is not found, returns a one-element array containing
the original string
- The delimiters are NOT returned
- LIMIT can be used to limit the number of fields extracted
- If EXPR is omitted, $_ is used
- If the RE is omitted, splits on whitespace: /[ \t\n]+/. In
this case, leading whitespace produces a null first field.
- As a special case, the pattern " " (a space without the slashes),
splits on whitespace WITHOUT creating a null first field for
any leading whitespace. This emulates awk's default behavior.
- Typical use:
@fields = split (/:/, $line);
- Ex.
#!/usr/bin/perl
# Parses /etc/passwd from STDIN
while (<STDIN>)
{
($user, $passwd, $uid, $gid, $gcos) = split (/:/);
print ("User $user has gcos $gcos\n");
}
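Several of the split rules above, sketched under the assumption of Perl 5. The passwd-style line and the other strings are hypothetical sample data.

```perl
use strict;
use warnings;

my $line = "root:x:0:0:Super User:/root:/bin/sh";   # hypothetical entry

my @fields  = split (/:/, $line);      # all seven fields
my @limited = split (/:/, $line, 3);   # LIMIT: the third field keeps the rest

# The delimiter may be longer than one character.
my @multi = split (/, */, "a, b,c");

# An explicit whitespace RE yields a null first field for leading
# whitespace; the special pattern " " does not.
my @with_null = split (/\s+/, "  one two");
my @awk_style = split (" ", "  one two");

# RE not found: a one-element list holding the original string.
my @whole = split (/;/, "no delimiter");

print scalar (@fields), " fields, gcos is $fields[4]\n";
print "limited: $limited[2]\n";
```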
Join Function
- Joins a list of values into a single string with the values
separated by a specified delimiter
- join (EXPR, LIST)
- Returns the created string
- The EXPR identifies the delimiter to use. It is NOT a RE! (Well,
it may look like one, but it does not act like one! Actually,
join has nothing to do with REs. Just convenient to talk about
join here, since we just mentioned split.)
- The LIST identifies the list of values to be joined
- Typical use:
$bigstring = join (":", @fields);
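A minimal sketch of join, assuming Perl 5 and made-up field values. It also shows that the delimiter is taken literally, so "." is a plain dot and not the RE meaning "any character".

```perl
use strict;
use warnings;

my @fields = ("tarr", "x", "100");   # hypothetical field values

# join returns a single string, so the result goes into a scalar.
my $joined = join (":", @fields);

# The delimiter is a plain string, not a RE.
my $dotted = join (".", "www", "umbc", "edu");

# split with the matching RE reverses the join.
my @again = split (/:/, $joined);

print "$joined\n$dotted\n", scalar (@again), " fields\n";
```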
Grep Function
- Evaluates an expression for each element of a list, locally
setting $_ to each list element
- grep (EXPR, LIST)
- In an array context, returns an array of those list elements for
which the expression evaluated to True. In a scalar context,
returns the number of elements for which the expression evaluated
to True.
- The expression can be an RE
- Typical use:
@files = grep (-d, @ARGV); # @files contains all files
# from the command line
# which are directories
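Since the expression can be a RE, grep is a convenient way to filter a list by pattern. A short sketch, assuming Perl 5 and a hypothetical list of names:

```perl
use strict;
use warnings;

my @names = ("Bob", "bobby", "Joe", "Rob");   # hypothetical test data

# List (array) context: the elements for which the RE matched.
my @bobs = grep (/^bob/i, @names);

# Scalar context: the number of elements for which the RE matched.
my $count = grep (/ob/, @names);

# Any expression works; $_ is set to each element in turn.
my @long = grep (length ($_) > 3, @names);

print "bobs: @bobs\ncount: $count\nlong: @long\n";
```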
Comparison Of RE Meta-Characters In Various Programs
Meta-char         grep   egrep    sed    awk      perl
.                 Yes    Yes      Yes    Yes      Yes
\                 Yes    Yes      Yes    Yes      Yes
^                 Yes    Yes      Yes    Yes      Yes
$                 Yes    Yes      Yes    Yes      Yes
[]                Yes    Yes      Yes    Yes      Yes
RE*               Yes    Yes      Yes    Yes      Yes
RE+               No     Yes      No     Yes      Yes
RE?               No     Yes      No     Yes      Yes
RE|RE             No     Yes      No     Yes      Yes
\(RE\) Backref    Yes    No       Yes    No       No
(RE) Grouping     No     Yes      No     Yes      Yes(1)
\n                Yes    No       Yes    No       Yes
RE\{m\}           Yes    Yes(2)   Yes    Yes(2)   No
RE{m}             No     No       No     No       Yes
RE\{m,\}          Yes    Yes(2)   Yes    Yes(2)   No
RE{m,}            No     No       No     No       Yes
RE\{m,n\}         Yes    Yes(2)   Yes    Yes(2)   No
RE{m,n}           No     No       No     No       Yes
\<RE\>            No     Yes(3)   No     Yes(3)   No
\bRE\b            No     No       No     No       Yes
Notes:
1) In Perl, () are used for BOTH backreferences and grouping.
2) The \{,\} constructs are NOT available in all versions of
egrep and awk!
3) The \<,\> construct is NOT available in all versions of
egrep and awk!
Bob Tarr
University of Maryland, Baltimore County
tarr@umbc.edu