Regular Expressions





                    CHAPTER 9: REGULAR EXPRESSIONS


   What Is A Regular Expression (RE)?

     - Pattern for string matching

     - Used by many UNIX programs, such as sed, awk, nawk, grep, egrep, 
       vi and emacs

     - Any RE that can be described by one of the above tools can also 
       be written in Perl

     - Perl's REs are similar to those in egrep


   Single-Character Patterns

     - A dot '.' matches any single character except a newline

     - A backslash followed by two or three octal digits (e.g. \015) 
       matches the character having that octal value

     - A backslashed 'x' followed by two hexadecimal digits (e.g. \x7f) 
       matches the character having that hexadecimal value

     - A backslashed c followed by a single character (e.g. \cD) matches 
       the corresponding control character

     - Any non-special character (see below) matches itself

     - Ex.

         a                         # Matches 'a'
         \004                      # Matches '<Ctrl-D>'
         \cD                       # Matches '<Ctrl-D>'
         \x7f                      # Matches '<Delete>'
         .                         # Matches any character except '\n'
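
     The single-character patterns above can be checked with a short
     script.  This is a minimal sketch; the variable names are made up
     for illustration.

```perl
use strict;
use warnings;

# Ctrl-D is character 4; octal, control and hex escapes all match it.
my $ctrl_d  = chr(4);
my $octal   = ($ctrl_d =~ /\004/) ? 1 : 0;
my $control = ($ctrl_d =~ /\cD/)  ? 1 : 0;
my $hex     = ($ctrl_d =~ /\x04/) ? 1 : 0;

# A dot matches any single character except a newline.
my $dot_x  = ("x"  =~ /^.$/) ? 1 : 0;    # 1
my $dot_nl = ("\n" =~ /^.$/) ? 1 : 0;    # 0

print ("octal=$octal control=$control hex=$hex\n");
print ("dot_x=$dot_x dot_nl=$dot_nl\n");
```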


   Predefined Single-Character Patterns

         \n		Newline
         \r		Carriage Return
         \t 		Tab
         \f		Formfeed


   Character Class

     - Another type of single-character pattern

     - List (that is, group, not really a Perl list) of characters 
       enclosed in []

     - Represents choice: matches any single character that is in the 
       list 

     - If the first character in the list is a caret (^), a negated 
       character class is created.  This matches any single character 
       which is NOT in the list.

     - Character ranges can be created by using a dash (-)

     - To include a dash, caret, left bracket or right bracket in 
       the list, precede it with a backslash  

     - Ex.

         [0123456789]              # Matches any single digit
         [0-9]                     # Same thing
         [^0-9]                    # Matches any single non-digit
         [a-zA-Z]                  # Matches any single alphabetic
         [z-a]                     # Error: Invalid range.
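
     As a rough illustration of the classes above, the sketch below
     sorts a few one-character strings by class.  The sample data and
     variable names are assumptions; grep is described later in this
     chapter.

```perl
use strict;
use warnings;

my @samples = ("7", "q", "Q", "-", "]");

my @digits     = grep { /^[0-9]$/ }    @samples;   # ("7")
my @non_digits = grep { /^[^0-9]$/ }   @samples;   # ("q", "Q", "-", "]")
my @letters    = grep { /^[a-zA-Z]$/ } @samples;   # ("q", "Q")

# A backslashed dash and right bracket are literal inside [].
my @punct      = grep { /^[\-\]]$/ }   @samples;   # ("-", "]")

print ("digits: @digits\n");
print ("letters: @letters\n");
```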


   Predefined Character Classes

         \d		Digit: [0-9]
         \D		Non-digit: [^0-9]
         \w		Word character: [a-zA-Z0-9_]
         \W		Non-word character: [^a-zA-Z0-9_]
         \s		Whitespace character: [ \t\n\r\f]
         \S		Non-whitespace character: [^ \t\n\r\f]

         Note: The above may be used in the definition of a new
               character class

  
   Backreference

     - Whenever a part of a RE enclosed in grouping parentheses is 
       matched, the matched substring is remembered for subsequent use

     - A backslashed single-digit number matches whatever the
       corresponding parenthesized group matched

     - Called a backreference to a substring

     - The parenthesized groups are numbered starting from 1 by counting 
       the left parentheses from left to right in the RE
   
     - A backslashed multi-digit number will also be considered a 
       backreference if the RE contains that many parenthesized groups 
       and the number does not begin with a 0

     - Note that parentheses serve two purposes: grouping and
       backreferencing.  Use of parentheses for grouping triggers
       backreferencing.

     - Ex.

         (.).*\1                   # Matches any string in which some
                                   #   char occurs at least twice
         ^(.).*\1$                 # Matches any string that begins
                                   #   and ends with the same char
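
     Without anchors, a backreference pattern may match in the middle
     of a string.  The sketch below (sample strings are assumptions)
     compares the unanchored form with an anchored one.

```perl
use strict;
use warnings;

# Unanchored: the first char that recurs later is remembered as \1.
my ($repeated) = "xabay" =~ /(.).*\1/;                 # "a"

# Anchored: the first and last characters must be the same.
my $same_ends = ("abca"  =~ /^(.).*\1$/) ? 1 : 0;      # 1
my $diff_ends = ("xabay" =~ /^(.).*\1$/) ? 1 : 0;      # 0

print ("repeated=$repeated same=$same_ends diff=$diff_ends\n");
```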


   Atom

     - Any single-character pattern             OR

     - Any RE enclosed in grouping parentheses


   Atom Sequence 

     - A sequence of atoms

     - A sequence of atoms matches if every atom in the sequence matches 
       in the order the atoms occur

     - Ex.

         Bob                       # Matches "Bob"
         (B)(o)(b)                 # Matches "Bob"
         ch\d                      # Matches "ch" followed by a
                                   #   single digit


   Quantified Atoms

     - A quantifier (or multiplier) following an atom indicates how many 
       times the atom must occur for a match

     - Quantifiers are: 

         {n,m}          Must occur at least n times but no more
                        than m times
         {n,}           Must occur at least n times 
         {n}            Must occur "exactly" n times 
         *              Must occur 0 or more times (same as {0,})
         +              Must occur 1 or more times (same as {1,})
         ?              Must occur 0 or 1 times (same as {0,1})

     - These quantified-atom patterns are greedy - the longest possible 
       string is matched

     - If a RE contains two or more quantified-atom patterns, then 
       leftmost is greediest
     
     - Ex.

         \w+\d?                    # Matches one or more word
                                   #   characters followed by
                                   #   an optional digit

         x{3,7}                    # Matches 3-7 x's
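
     The "leftmost is greediest" rule can be seen by capturing each
     quantified atom.  A small sketch with made-up sample strings:

```perl
use strict;
use warnings;

# \w+ is leftmost, so it takes everything; \d? is left with nothing.
my ($word, $digit) = "aaa123" =~ /(\w+)(\d?)/;
# $word is "aaa123", $digit is ""

# x{3,7} takes as many x's as it is allowed (7); x* gets the rest.
my ($first, $rest) = "xxxxxxxxx" =~ /(x{3,7})(x*)/;

print ("word=$word digit='$digit'\n");
print (length ($first), " then ", length ($rest), "\n");
```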


   Assertions (Anchoring Patterns)

     - A pattern can be anchored to a particular part of a string using 
       an assertion

     - Assertions are: 

         ^              Matches the beginning of the string
         $              Matches the end of the string (or BEFORE any
                          newline at the end of the string)
         \b             Matches on a word boundary (the place 
                          between characters matching \w and \W
                          or \W and \w)
         \B             Matches on a non-word boundary

     - Ex.

         \bBob\b                   # Matches Bob, but not Bobby or
                                   #   ohBob
         \bBob\B                   # Matches Bobby, but not "Bob Tarr"
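
     The word-boundary examples above can be tried as follows.  This is
     only a sketch; the list of names is an assumption.

```perl
use strict;
use warnings;

my @names = ("Bob", "Bobby", "ohBob", "Bob Tarr");

# \b on both sides: "Bob" must stand alone as a word.
my @whole  = grep { /\bBob\b/ } @names;    # ("Bob", "Bob Tarr")

# \B after "Bob": another word character must follow.
my @prefix = grep { /\bBob\B/ } @names;    # ("Bobby")

# $ also matches just before a newline at the end of the string.
my $eol = ("the end\n" =~ /end$/) ? 1 : 0; # 1

print (join (", ", @whole), "\n");
print (join (", ", @prefix), "\n");
```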


   Item

     - An assertion                            OR

     - A quantified atom                       OR

     - A sequence of atoms


   Item Sequence 

     - A sequence of items

     - A sequence of items matches if every item in the sequence matches 
       in the order the items occur

     - Ex.

         end$                      # Matches "end" at the end of 
                                   #   the string
         ^\d.*\d$                  # Matches a string that begins
                                   #   and ends with a digit


   Regular Expression (RE)

     - A group of item sequences separated by the vertical-bar (|)
       character

     - Each item sequence represents an alternative

     - The RE matches if any of the alternatives match

     - The alternatives are always evaluated left-to-right, stopping at 
       the first complete match

     - Recall that the definition of an atom included any RE enclosed in 
       grouping parentheses.  This allows for the alternation symbol to 
       also be used inside an atom (or item).

     - Ex.

         Bob|Joe                   # Matches "Bob" or "Joe"
         ^(Bob|Joe)                # Matches "Bob" or "Joe" at
                                   #   the beginning of the string
         (Bob|Joe)Bob              # Matches "BobBob" or "JoeBob"


   Precedence

     - The precedence of the RE meta-characters is (highest to lowest):

         Grouping Parentheses        ()
         Quantifiers                 +, *, ?, {n,m}
         Sequences And Assertions    ^, $, \b, \B
         Alternation                 |
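
     Because alternation has the lowest precedence, an anchor applies
     only to the alternative it appears in unless parentheses are used.
     A brief sketch (sample strings are assumptions):

```perl
use strict;
use warnings;

# ^Bob|Joe is read as (^Bob)|(Joe): "Joe" may appear anywhere.
my $loose = ("Hi Joe" =~ /^Bob|Joe/)   ? 1 : 0;    # 1

# Grouping makes the anchor apply to both alternatives.
my $tight = ("Hi Joe" =~ /^(Bob|Joe)/) ? 1 : 0;    # 0

# Quantifiers bind tighter than sequences: ab+ means a(b+).
my $single  = ("abab" =~ /^ab+$/)   ? 1 : 0;       # 0
my $grouped = ("abab" =~ /^(ab)+$/) ? 1 : 0;       # 1

print ("loose=$loose tight=$tight\n");
print ("single=$single grouped=$grouped\n");
```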


   Special RE Rules

     - The following characters must be backslashed to get their literal 
       meaning:

         ( ) { } * + ? . [ ] | \

     - The '^' character must be backslashed in two instances to get its
       literal meaning: 1) if it is the first character inside [], or 
       2) if it is the first atom of an atom sequence 

     - The '$' character must be backslashed in one instance to get its
       literal meaning: if it is the last atom of an atom sequence

     - The '-' character must be backslashed inside [] to get its literal 
       meaning if it would otherwise be interpreted as delimiting a
       range

     - The ']' character does not terminate a character class definition 
       if it is the first character (after any ^) in the class

     - The following backslash sequences have special meaning:

         \n \r \t \f \d \D \w \W \s \S \b \B
         \0nn \xnn \c<control char>

     - Inside [], \b means the backspace character

     - Any other backslashed character matches the character itself.
       Thus, \y is the same as y.

     - Any other character is non-special and matches itself
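
     A few of the rules above, tried out in a short sketch (the test
     strings are assumptions):

```perl
use strict;
use warnings;

# A backslashed dot is literal; a bare dot matches any character.
my $literal_dot = ("axb" =~ /a\.b/) ? 1 : 0;       # 0
my $any_char    = ("axb" =~ /a.b/)  ? 1 : 0;       # 1

# ']' placed first in a class is an ordinary member of the class.
my $bracket = ("]" =~ /^[]a]$/) ? 1 : 0;           # 1

# A backslashed dash inside [] is literal, not a range delimiter.
my $dash  = ("-" =~ /^[a\-z]$/) ? 1 : 0;           # 1
my $range = ("m" =~ /^[a\-z]$/) ? 1 : 0;           # 0

# Inside [], \b is the backspace character (character 8).
my $backspace = (chr (8) =~ /^[\b]$/) ? 1 : 0;     # 1

print ("dot=$literal_dot any=$any_char bracket=$bracket\n");
print ("dash=$dash range=$range backspace=$backspace\n");
```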


   The Match Operator

     - Used to search a string for a pattern specified by a RE

     - /RE/

     - In a scalar context, returns True if the RE matched, else 
       returns False.  Any matched backreferences are saved in the 
       special variables $1, $2, etc.

       NOTE!! The variables $1, $2, etc. should NOT be used to 
              represent backreferences in the RE, since they
              have not been created yet when the RE is being
              evaluated.  Instead use the \1, \2, etc. forms.

     - In an array context, returns the matched backreferences if any 
       matched substrings exist.  If no matched substrings exist because 
       the RE has no parenthesized groups, but the RE matches part of the 
       search string, returns (1).  If no match occurs, the empty list is
       returned.  If no matched backreferences exist, the $1, $2,... 
       variables are NOT set.

     - Searches $_ by default

     - If the RE is empty, the most recent RE from a previous match 
       operation (or substitution operation) is used

     - Ex.

         if (/Bob|Joe/)            # True if $_ contains "Bob"
                                   #   or "Joe"
         {
           print ("$_");
         }

     - Ex.

         $_ = "One Two Three Four";
         ($first, $second) = /(\w+)\W+(\w+)/;
         print ("The first word was $first, the second was $second\n");
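
     The different return values of the match operator can be compared
     side by side.  A minimal sketch, reusing the string from the
     example above; the variable names are assumptions.

```perl
use strict;
use warnings;

$_ = "One Two Three Four";

# Scalar context: true or false.
my $found = /Two/ ? 1 : 0;              # 1

# Array context with groups: the matched substrings.
my @groups = /(\w+)\W+(\w+)/;           # ("One", "Two")

# Array context without groups: (1) on success, () on failure.
my @no_groups = /Three/;                # (1)
my @no_match  = /Five/;                 # ()

print ("found=$found groups=@groups\n");
print (scalar (@no_groups), " ", scalar (@no_match), "\n");
```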


   The Pattern Binding Operator

     - To change the target of the matching operator, use the pattern
       binding operator, =~

     - EXPR =~ /RE/

     - EXPR can be any expression that yields a scalar string value

     - Ex.

         print ("Continue [y/n]?");
         if (<STDIN> =~ /^[yY]/)            # True if line read in
                                            #   begins with y or Y
         {
           print ("Let's do it!\n");
         }

     - Ex.

         $string = "aacbbxxxaacbb";
         @matches = $string =~ /(a)c(b)/;  # @matches is ("a", "b")

         Note that =~ has higher precedence than =, so the above
         works as expected.  If you did this:

         (@matches = $string) =~ /(a)c(b)/; 

         @matches becomes the one-element list ($string). Then
         the pattern binding operator forces its left operand to be
         evaluated in a scalar context, which in this case would be
         the number of elements in @matches which is 1.  So the
         string "1" is bound to the pattern match!  This always fails
         with the above pattern.
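
     The precedence pitfall described above can be reproduced with the
     sketch below (the names @copy and $result are assumptions):

```perl
use strict;
use warnings;

my $string = "aacbbxxxaacbb";

# =~ binds tighter than =, so the match runs against $string.
my @matches = $string =~ /(a)c(b)/;                     # ("a", "b")

# Parenthesizing the assignment binds the pattern to the assignment's
# scalar value, the element count 1, so the match fails.
my @copy;
my $result = ((@copy = $string) =~ /(a)c(b)/) ? 1 : 0;  # 0

print ("matches=@matches result=$result\n");
```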


   Case-Insensitive Matching

     - To ignore case in a match operation, use the 'i' option

     - EXPR =~ /RE/i

     - Ex.

         print ("Continue [y/n]?");
         if (<STDIN> =~ /^[y]/i)            # True if line read in
                                            #   begins with y or Y
         {
           print ("Let's do it!\n");
         }


   Using A Different Delimiter

     - To use a delimiter other than /, use the 'm' prefix

     - EXPR =~ m/RE/

     - Useful for patterns that contain /'s (e.g. pathnames); it avoids 
       lots of backslashed slashes

     - Ex.

         if ($filename =~ m#/usr/local/etc#)
         {
           print ("Found it!\n");
         }
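
     For comparison, here is the same pathname test written with three
     different delimiters.  This is a sketch; the filename is a made-up
     example.

```perl
use strict;
use warnings;

my $filename = "/usr/local/etc/app.conf";    # hypothetical path

# Default delimiter: every slash in the pattern must be backslashed.
my $slashes = ($filename =~ /\/usr\/local\/etc/) ? 1 : 0;

# With the 'm' prefix, another delimiter avoids the backslashes.
my $hashes  = ($filename =~ m#/usr/local/etc#)   ? 1 : 0;
my $braces  = ($filename =~ m{/usr/local/etc})   ? 1 : 0;

print ("slashes=$slashes hashes=$hashes braces=$braces\n");
```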


   Global Matching 

     - To match ALL occurrences of the pattern in the target string, 
       use the 'g' option

     - EXPR =~ /RE/g

     - In a scalar context, only the first occurrence is matched, but 
       Perl remembers where it stopped searching and will continue at 
       that point on the next search.  Any matched backreferences are 
       saved in $1, $2, etc. on each iteration.

     - In an array context, returns the matched backreferences if any 
       matched substrings exist, for EACH occurrence matched.  If no 
       matched substrings exist because the RE has no parenthesized 
       groups, returns the matched part of the string for EACH 
       occurrence matched.  Otherwise returns the empty list.  The $1, 
       $2,... variables are set for the first occurrence only of each 
       matched backreference.

     - Ex.

         $string = "aacbbxxxaacbb";
         @matches = $string =~ /(aa)c(bb)/;   # @matches is ("aa", "bb")
         @matches = $string =~ /(aa)c(bb)/g;  # @matches is now ("aa",
                                              #    "bb", "aa", "bb")
         @matches = $string =~ /aacbb/;       # @matches is now (1)
         @matches = $string =~ /aacbb/g;      # @matches is now ("aacbb",
                                              #    "aacbb")
 
     - Ex.

         To extract all words from the string in $x:

           while ($x =~ /\w+/g)               # Global matching in a 
           {                                  #   scalar context
             push (@words, $&);               # $& contains the matched
           }                                  #   substring

         Or using global matching in an array context:

           push (@words, $x =~ /\w+/g);
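
     Both styles of global matching are shown together in the sketch
     below, with made-up input strings:

```perl
use strict;
use warnings;

my $x = "the quick brown fox";

# Scalar context: each test resumes where the previous match ended.
my @words;
while ($x =~ /(\w+)/g)
{
  push (@words, $1);
}

# Array context: all occurrences are returned at once.
my @same = $x =~ /\w+/g;

# With groups, every group of every occurrence is returned.
my @pairs = "a=1,b=2" =~ /(\w)=(\d)/g;      # ("a", "1", "b", "2")

print (scalar (@words), " words: @words\n");
print ("pairs: @pairs\n");
```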


   Variable Interpolation In The Match Operator

     - Each time the match operator is encountered, the RE is scanned 
       for possible scalar variable references and then the values of 
       these variables are interpolated into the pattern

     - The RE is then "compiled" for execution

     - If the RE does not have any variable references, the pattern is 
       compiled only once.  But if it does have variable references, 
       variable interpolation and compilation are done each time the RE 
       is encountered

     - If the variable contains a pattern which does NOT change over the 
       life of the Perl program, use the 'o' option to compile the pattern 
       only once and avoid needless runtime recompilations

     - Ex.

         $word = "Bob";
         if ($string =~ /\b$word\b/o)
         {
           print ("Found $word in the string!\n");
         }
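
     The 'o' option is only safe when the variable really never
     changes.  The sketch below (the loop and its data are assumptions)
     shows what happens when it does change: the pattern compiled on the
     first pass keeps being used.

```perl
use strict;
use warnings;

my (@with_o, @without_o);

foreach my $word ("Bob", "Joe")
{
  # With 'o', the pattern stays /\bBob\b/ on the second pass.
  push (@with_o,    ("Hi Joe" =~ /\b$word\b/o) ? 1 : 0);

  # Without 'o', $word is interpolated again on every pass.
  push (@without_o, ("Hi Joe" =~ /\b$word\b/)  ? 1 : 0);
}

print ("with o:    @with_o\n");       # 0 0
print ("without o: @without_o\n");    # 0 1
```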


   Special Variables Used In The Match Operator

     - Recall that $1, $2, etc. contain the matched backreferences

     - Other special variables include:

         $&             The part of the string that matched
         $`             The part of the string before the part 
                          that matched
         $'             The part of the string after the part 
                          that matched

     - Note that these special variables are NOT set if the match 
       operator is used in an array context and contains backreferences

     - These are all read-only variables and cannot be set
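
     A short sketch of the three variables after a successful match in
     a scalar context (the sample string is an assumption):

```perl
use strict;
use warnings;

my $string = "Hello, Bob Tarr!";
my ($before, $matched, $after) = ("", "", "");

if ($string =~ /Bob/)
{
  # Copy the special variables before another match overwrites them.
  ($before, $matched, $after) = ($`, $&, $');
}

print ("before='$before' matched='$matched' after='$after'\n");
```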


   The Match-Once Operator

     - Same as the match operator (/RE/) except that it matches
       only once between calls to the reset operator

     - ?RE?

     - The i and o options can be used with the match-once operator

     - Once a match is found with the match-once operator, it will
       fail to find any more matches until the reset operator is 
       called

     - Useful for finding the first occurrence of a pattern in each
       file of a set of files
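
     A minimal sketch of match-once behavior follows.  It uses the
     m?RE? spelling, since newer Perl releases accept the match-once
     operator only with the 'm' prefix; the subroutine name and the
     sample lines are assumptions.

```perl
use strict;
use warnings;

# Returns the lines on which the match-once pattern succeeded.
sub first_hit_only
{
  my @hits;
  foreach my $line (@_)
  {
    push (@hits, $line) if $line =~ m?error?;
  }
  return @hits;
}

my @file1 = first_hit_only ("ok", "error 1", "error 2");   # ("error 1")

# Without a reset, the same operator will not match again.
my @stale = first_hit_only ("error 3");                    # ()

reset;                       # re-arm all match-once operators
my @file2 = first_hit_only ("error 4", "error 5");         # ("error 4")

print ("file1: @file1\n");
print ("file2: @file2\n");
```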


   The Substitution Operator

     - Used to search a target string for a pattern and replace the 
       matched substring with a new string

     - s/RE/REPLACEMENT_STRING/

     - Returns the number of substitutions made, or 0 (False) if the 
       pattern was not found in the target string

     - Searches $_ by default.  The '=~' operator can be used to change 
       the target string.
   
     - The 'i' (ignore case) and 'o' (compile only once) options can be 
       used

     - Without any options, the replacement is done on only the first 
       match encountered in the target string.  To replace all occurrences 
       of the pattern in the target string with the replacement string, 
       use the 'g' option.

     - Both the RE and the REPLACEMENT_STRING are variable interpolated

     - The REPLACEMENT_STRING can have backreferences in the form $n 

       NOTE!! Although the \1, \2, etc. forms will sometimes work
              in the REPLACEMENT_STRING as a backreference, the
              $1, $2, etc. forms should be used here.

              So the rule is: Use \1, etc. in the RE
                              Use $1, etc. in the REPLACEMENT_STRING

     - The sed and awk programs allow the use of '&' in the replacement 
       string to refer to that part of the target string that matched.  
       In Perl you must use $&.

     - Any delimiter may be used instead of /

     - If the RE is empty, the most recent RE from a previous match 
       operation (or substitution operation) is used

     - Ex.

         $pat = "Bob";
         $rs = "Joe";
         if ($times = ($string =~ s/\b$pat\b/$rs/go))
         {
           print ("Replaced $pat $times time(s) in the string!\n");
         }

         Note that 

            $times = $string =~ s/RE/REPLACEMENT_STRING/;

         is the same as 

            $times = ($string =~ s/RE/REPLACEMENT_STRING/);

         Both perform the substitution on $string and return the number 
         of substitutions in $times.

         But

            ($times = $string) =~ s/RE/REPLACEMENT_STRING/;

         copies $string to $times and does the substitution on $times, 
         leaving $string unchanged.
          
     - Ex.

         # Put parentheses around each word in $string
         $string =~ s/(\w+)/($1)/g;
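
     The return value and the copy-then-substitute form discussed above
     can be tried with this sketch (sample strings are assumptions):

```perl
use strict;
use warnings;

my $string = "Bob met Bob and Bobby";
my $pat    = "Bob";
my $rs     = "Joe";

# With 'g', all whole-word matches are replaced; the count is returned.
my $times = ($string =~ s/\b$pat\b/$rs/g);       # 2

# Copy first, then substitute on the copy: the original is untouched.
my $original = "one two";
(my $wrapped = $original) =~ s/(\w+)/($1)/g;     # "(one) (two)"

print ("$times: $string\n");
print ("$original -> $wrapped\n");
```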


   Split Function

     - Splits a string into an array of strings (fields) according to a 
       delimiter specified by a pattern (RE)

     - split (/RE/, EXPR, LIMIT)
       split (/RE/, EXPR)
       split (/RE/)
       split 

     - Returns the array of strings (fields) in an array context.  (In 
       a scalar context, returns the number of fields and splits into 
       the @_ array.)

     - The RE identifies the delimiters that separate the desired fields 
       to be extracted from the string in EXPR.  The delimiter may be 
       longer than one character.

     - If the RE is not found, returns a one-element array containing
       the original string 

     - The delimiters are NOT returned

     - LIMIT can be used to limit the number of fields extracted

     - If EXPR is omitted, $_ is used

     - If the RE is omitted, splits on whitespace: /[ \t\n]+/.  In
       this case, leading whitespace produces a null first field.

     - As a special case, the pattern " " (a space without the slashes)
       splits on whitespace WITHOUT creating a null first field for
       any leading whitespace.  This emulates awk's default behavior.

     - Typical use:

         @fields = split (/:/, $line);

     - Ex.

         #!/usr/bin/perl
         # Parses /etc/passwd from STDIN

         while (<STDIN>)
         {
           ($user, $passwd, $uid, $gid, $gcos) = split (/:/);
           print ("User $user has gcos $gcos\n");
         }
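
     The effect of LIMIT and of the " " special case is easiest to see
     on a fixed string.  A minimal sketch; the passwd-style line below
     is invented sample data.

```perl
use strict;
use warnings;

my $line = "tarr:x:1001:100:Bob Tarr:/home/tarr:/bin/sh";   # made up

my @fields  = split (/:/, $line);       # 7 fields
my @limited = split (/:/, $line, 3);    # third field keeps the rest

# A whitespace RE gives a null first field for leading whitespace;
# the special pattern " " does not.
my @with_null = split (/\s+/, "  a b");     # ("", "a", "b")
my @awk_style = split (" ",   "  a b");     # ("a", "b")

# No delimiter found: one element, the original string.
my @whole = split (/:/, "nodelimiter");

print (scalar (@fields), " fields, gcos is $fields[4]\n");
print ("limited: $limited[2]\n");
```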


   Join Function

     - Joins a list of values into a single string with the values
       separated by a specified delimiter

     - join (EXPR, LIST)

     - Returns the created string

     - The EXPR identifies the delimiter to use.  It is NOT a RE!  (Well, 
       it may look like one, but it does not act like one!  Actually, 
       join has nothing to do with REs.  Just convenient to talk about 
       join here, since we just mentioned split.)

     - The LIST identifies the list of values to be joined

     - Typical use:

         $bigstring = join (":", @fields);
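
     Since join returns a single string, the result belongs in a scalar.
     A small sketch with made-up field values:

```perl
use strict;
use warnings;

my @fields = ("tarr", "x", "1001");

# The delimiter is a plain string and is inserted between elements.
my $bigstring = join (":", @fields);        # "tarr:x:1001"

# RE meta-characters in the delimiter have no special meaning here.
my $dotted = join (".*", "a", "b");         # "a.*b"

print ("$bigstring\n");
print ("$dotted\n");
```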


   Grep Function

     - Evaluates an expression for each element of a list, locally
       setting $_ to each list element

     - grep (EXPR, LIST)

     - In an array context, returns an array of those list elements for 
       which the expression evaluated to True.  In a scalar context, 
       returns the number of elements for which the expression evaluated 
       to True.

     - The expression can be an RE 

     - Typical use:

         @files = grep (-d, @ARGV);  # @files contains all files
                                     #   from the command line
                                     #   which are directories
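
     When the expression is a RE, grep selects the elements that match
     it.  A brief sketch (the list of names is an assumption):

```perl
use strict;
use warnings;

my @names = ("Bob", "Bobby", "Joe", "bob");

# Array context: the elements for which the RE matched.
my @bobs = grep (/^Bob/, @names);       # ("Bob", "Bobby")

# Scalar context: the number of matching elements.
my $count = grep (/^bob/i, @names);     # 3

print ("bobs: @bobs\n");
print ("count: $count\n");
```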


   Comparison Of RE Meta-Characters In Various Programs

       RE M-C                 grep    egrep   sed     awk     perl

       .                      Yes     Yes     Yes     Yes     Yes 
       \                      Yes     Yes     Yes     Yes     Yes 
       ^                      Yes     Yes     Yes     Yes     Yes 
       $                      Yes     Yes     Yes     Yes     Yes 
       []                     Yes     Yes     Yes     Yes     Yes 
       RE*                    Yes     Yes     Yes     Yes     Yes 
       RE+                    No      Yes     No      Yes     Yes 
       RE?                    No      Yes     No      Yes     Yes 
       RE|RE                  No      Yes     No      Yes     Yes 

       \(RE\) Backref         Yes     No      Yes     No      No
       (RE) Grouping          No      Yes     No      Yes     Yes(1)
       \n (n = 1-9) Backref   Yes     No      Yes     No      Yes 

       RE\{m\}                Yes     Yes(2)  Yes     Yes(2)  No
       RE{m}                  No      No      No      No      Yes

       RE\{m,\}               Yes     Yes(2)  Yes     Yes(2)  No
       RE{m,}                 No      No      No      No      Yes

       RE\{m,n\}              Yes     Yes(2)  Yes     Yes(2)  No
       RE{m,n}                No      No      No      No      Yes

       \<RE\>                 No      Yes(3)  No      Yes(3)  No
       \bRE\b                 No      No      No      No      Yes

    
       Notes:
       1) In Perl, () are used for BOTH backreferences and grouping.
       2) The \{,\} constructs are NOT available in all versions of 
          egrep and awk!
       3) The \<,\> construct is NOT available in all versions of 
          egrep and awk!




Bob Tarr
University of Maryland, Baltimore County
tarr@umbc.edu