Horatio T.P. Webb
Regular Expressions -- SPRING 2004
DISC 3371 Transaction Processing I
Last Updated 10AM 3/26/2009

The purpose of regular expressions is to test for the occurrence of particular characters, strings or patterns of characters within another string. This method is well recognized and is used in many languages. The methods by which regular expressions are evaluated vary from language to language, but the general approach is based on a single set of matching operators. This page demonstrates some of the basic methods and shows examples.

  1. Regular Expression Object in VBScript

    With version 5.0 and later, VBScript employs three basic parts of the test:

    1. A string which is the subject of the examination

      This is just a string to be tested and it possesses no unique characteristic. In the examples we will use the variable string_to_test to hold the string we wish to test and store a string value like this:

      string_to_test = "Hi Ho, Hi Ho, It is off to work we go"

    2. A regular expression object

      This VBScript object must be defined and named according to the following syntax:

      SET you_pick_any_name_for_the_regular_expression_object = New RegExp

      We will use the arbitrary name regexpobj as our object name. Thus the syntax would be:

      SET regexpobj = New RegExp

    3. A set of statements that define the parameters of the regular expression object. The general form is:

      With name_for_the_regular_expression_object
         .Pattern = " the regular expression specification "
         .IgnoreCase = True OR False
         .Global= True OR False
      End With

      The value of .Pattern contains the string pattern or regular expression specification that the regular expression object will search for.

      The value of .IgnoreCase determines whether or not to ignore the case of the string to be searched and of the string or pattern we will attempt to find. True means to ignore upper and lower case differences (i.e., upper and lower case characters produce the SAME result, so the character "A" produces the same result as "a") and False means to honor the differences between upper and lower case characters (so "a" and "A" produce different results). The default is False.

      The value of .Global determines whether to search for one or more occurrences of the test pattern. If True, the string is searched for all occurrences of the test pattern. If False, the test is terminated after the first match is found. The default is False.


      Once:

      1. the value of the string to be tested is set;
      2. the regular expression object created; and
      3. the parameters of the object defined

      three methods can be employed:

      1. The Test Method

        In this simple version a variable is assigned the result of the regular expression test, which is either True or False. The usage is:

        result = name_of_the_regular_expression_object.Test (string_to_test)

        The entire sequence would thus be for example:

        string_to_test = "Hi Ho, Hi Ho, It Is Off To Work We Go"
        SET regexpobj = New RegExp
        ' the name after With must match the name used in SET
        With regexpobj
           .Pattern = "work"
           .IgnoreCase = True
           .Global = True
        End With
        results = regexpobj.Test(string_to_test)

           The variable results will be assigned the value True or False based on the evaluation of the string. In this case the variable results is assigned the value True, since .IgnoreCase is True and the pattern "work" therefore matches "Work" in the string named string_to_test.
           If we change the value of .IgnoreCase to False, the variable results will be assigned the value False, since the .Pattern "work" doesn't match the value "Work" in string_to_test.
           If we change the .Pattern to "FRED" (which doesn't appear anywhere in string_to_test), the variable results will be assigned the value False.
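
The three cases just described can be run one after another. This is only a minimal sketch, assuming the script runs somewhere MsgBox is available (Windows Script Host, for instance); the variable name report is introduced here purely for illustration.

```vbscript
Dim string_to_test, regexpobj, report
string_to_test = "Hi Ho, Hi Ho, It Is Off To Work We Go"
Set regexpobj = New RegExp
' .Global is left alone: Test only needs to know whether one match exists

' Case 1: with case ignored, "work" matches "Work"
regexpobj.Pattern = "work"
regexpobj.IgnoreCase = True
report = "work, IgnoreCase True: " & regexpobj.Test(string_to_test) & vbCrLf

' Case 2: with case honored, "work" no longer matches "Work"
regexpobj.IgnoreCase = False
report = report & "work, IgnoreCase False: " & regexpobj.Test(string_to_test) & vbCrLf

' Case 3: "FRED" does not occur in the string at all
regexpobj.Pattern = "FRED"
regexpobj.IgnoreCase = True
report = report & "FRED, IgnoreCase True: " & regexpobj.Test(string_to_test)

MsgBox report   ' the three results are True, False, False
```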

      2. The Replace Method

        The Replace method is used to find and replace occurrences of the pattern in the string (all of them when .Global is True, only the first when it is False). Its use requires the specification of two strings: (1) the string to test and (2) the replacement value. This method DOES NOT return True or False but returns a new string with the replacement made (if the pattern is found).

        The syntax is:

        new_string = name_of_the_regular_expression_object.Replace (string_to_test , replacement_value )

        The entire sequence would thus be for example:

        string_to_test = "Hi Ho, Hi Ho, It Is Off To Work We Go"
        SET regexpobj = New RegExp
        With regexpobj
           .Pattern = "Hi"
           .IgnoreCase = True
           .Global = True
        End With
        new_string = regexpobj.Replace(string_to_test,"Ho")

        and new_string would be: "Ho Ho, Ho Ho, It Is Off To Work We Go"

           Using our example, to replace "Work" with "Drudgery" we set .Pattern to "Work" and the call would be:

        new_string = regexpobj.Replace(string_to_test,"Drudgery")

        The value of new_string would be "Hi Ho, Hi Ho, It Is Off To Drudgery We Go"

           Suppose we try to replace "FRED" with "Drudgery". The call would be the same:

        new_string = regexpobj.Replace(string_to_test,"Drudgery")

        The value of .Pattern would be set to "FRED" (i.e., we are trying to find "FRED" and replace it with "Drudgery"). The value of new_string would be the same as string_to_test (i.e., no replacement was made since "FRED" wasn't found).

           Suppose we try to replace "Hi" with "Drudgery". The call would be the same:

        new_string = regexpobj.Replace(string_to_test,"Drudgery")

        The value of .Pattern would be set to "Hi" (i.e., we are trying to find "Hi" and replace it with "Drudgery"). With .Global set to True, the value of new_string would contain the string "Drudgery" twice, since there are two occurrences of "Hi" in string_to_test.
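
The effect of .Global on Replace is easiest to see side by side. The following is a minimal sketch under the same assumptions as the examples above; the variable names first_only and every_match are hypothetical.

```vbscript
Dim string_to_test, regexpobj, first_only, every_match
string_to_test = "Hi Ho, Hi Ho, It Is Off To Work We Go"
Set regexpobj = New RegExp
regexpobj.Pattern = "Hi"
regexpobj.IgnoreCase = True

' With Global False only the first "Hi" is replaced
regexpobj.Global = False
first_only = regexpobj.Replace(string_to_test, "Drudgery")
' first_only is "Drudgery Ho, Hi Ho, It Is Off To Work We Go"

' With Global True both occurrences of "Hi" are replaced
regexpobj.Global = True
every_match = regexpobj.Replace(string_to_test, "Drudgery")
' every_match is "Drudgery Ho, Drudgery Ho, It Is Off To Work We Go"

MsgBox first_only & vbCrLf & every_match
```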

      3. The Execute Method

        The Execute method is a more advanced version of the Test method. Instead of returning True or False to a variable (the lefthand side of the equal sign), the method returns an object which is a collection of the matches. This method not only tests for a pattern but also returns (a) the number of matches and (b) where the matches occur. Because the result is an object, the assignment must use SET. The syntax is:

        SET match_object = name_of_the_regular_expression_object.Execute (string_to_test)

        Additionally we have:

        • the property match_object.Count, which is the number of matches found in the match_object collection.
        • the collection match_object.Item. This collection has a zero-based index (much like forms or elements in an HTML form) and the items can be processed with either:
          1. a loop index (i.e., a subscript that starts at zero and goes to match_object.Count - 1)
          2. a For-Each Loop

          Either way, each .Item member of the match_object collection has four properties (only three of which are discussed here -- more later):

          1. .FirstIndex is the number of characters to the left of the matched string (this can also be interpreted as the character position in a string using zero as the first character)
          2. .Length is the number of characters in the matched item
          3. .Value is the string that was matched

        The usage is:

        SET match_collection = name_of_the_regular_expression_object.Execute (string_to_test)

        The entire sequence would thus be for example:

        string_to_test = "Hi Ho, Hi Ho, It Is Off To Work We Go"
        SET regexpobj = New RegExp
        With regexpobj
           .Pattern = "Ho"
           .IgnoreCase = True
           .Global = True
        End With
        ' Execute returns an object, so the assignment requires SET
        SET match_collection = regexpobj.Execute(string_to_test)
        if match_collection.Count = 0 then
            os = "NO matches found"
        else
             os = cstr(match_collection.Count) + " matches found."
             ' the index runs from 0 to Count - 1; each match goes on its own line
             for i=0 to (match_collection.Count - 1)
                 os=os + vbCrLf + match_collection.Item(i).Value
                 os=os + " found at " + cstr(match_collection.Item(i).FirstIndex)
                 os=os + " length = " + cstr(match_collection.Item(i).Length)
             next
        end if
        msgbox os

           Using our example, we will find "Hi" by setting .Pattern to "Hi". The call would be:

        SET match_collection = regexpobj.Execute(string_to_test)

           Using our example, we will find "i" by setting .Pattern to "i". The call would be:

        SET match_collection = regexpobj.Execute(string_to_test)

           Using our example, we will find "o" (that is the letter "oh" -- not zero) by setting .Pattern to "o". The call would be:

        SET match_collection = regexpobj.Execute(string_to_test)

        These are the basics of the regular expression operations. Now...
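
The For-Each form of the loop mentioned earlier can be sketched as follows for the letter "o". This is an illustrative sketch only; the loop variable one_match is a name chosen here, not part of the RegExp object.

```vbscript
Dim string_to_test, regexpobj, match_collection, one_match, os
string_to_test = "Hi Ho, Hi Ho, It Is Off To Work We Go"
Set regexpobj = New RegExp
With regexpobj
   .Pattern = "o"
   .IgnoreCase = True
   .Global = True
End With

' Execute returns a collection object, hence Set
Set match_collection = regexpobj.Execute(string_to_test)

' 6 matches are found here; the first has FirstIndex 4
os = match_collection.Count & " matches found."
For Each one_match In match_collection
    os = os & vbCrLf & one_match.Value & " found at " & one_match.FirstIndex
Next
MsgBox os
```

A For-Each loop avoids the off-by-one risk of the subscript form, since no upper bound has to be computed.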

      4. Patterns

        There is a very large set of pattern matching options for regular expressions. Here are some basics:

        In the following we can test for characters or for a group -- a sequence of characters enclosed in parentheses -- like: (ABC). "Ordinary" characters do not need to be enclosed in parentheses. However, the "special" characters: $ ^ { [ ( | ) ] } * . + ? and \ have specific meanings in regular expressions. To use these special characters in matches you must precede them with a backslash (i.e., \).
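
The need for the backslash can be illustrated with the period and the dollar sign. The following is a small sketch; the test strings "3x50", "3.50" and "$3.50" are made up for this illustration.

```vbscript
Dim regexpobj, report
Set regexpobj = New RegExp

' Unescaped, the period matches any single character
regexpobj.Pattern = "3.50"
report = "3.50 against 3x50: " & regexpobj.Test("3x50") & vbCrLf            ' True

' Escaped, the period matches only a period
regexpobj.Pattern = "3\.50"
report = report & "3\.50 against 3x50: " & regexpobj.Test("3x50") & vbCrLf   ' False
report = report & "3\.50 against 3.50: " & regexpobj.Test("3.50") & vbCrLf   ' True

' The dollar sign must be escaped as well to match a literal "$"
regexpobj.Pattern = "\$3\.50"
report = report & "\$3\.50 against $3.50: " & regexpobj.Test("$3.50")        ' True

MsgBox report
```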

        Pattern Example
        some_string
        matches the exact string

         String to Test: Hi Ho, Hi Ho, It Is Off To Work We Go
         Pattern: Hi Ho
         Pattern: HiHo
         Pattern: Off t

          

        ^(some_string)
        matches some_string at the beginning of the test string

         String to Test: Hi Ho, Hi Ho, It Is Off To Work We Go
         Pattern: ^(Hi Ho)
         Pattern: ^(i Ho)
         Pattern: ^H

          

        (some_string)$
        matches some_string at the end of the test string

         String to Test: Hi Ho, Hi Ho, It Is Off To Work We Go
         Pattern: (We Go)$
         Pattern: o$
         Pattern: G$

          

        (string1).(string2)
        matches: string1 followed by any single character (other than a newline) followed by string2

         String to Test: Hi Ho, Hi Ho, It Is Off To Work We Go
         Pattern: (to wo).(k we go)
         Pattern: o.k
         Pattern: (To).(Work)

          

        (string1)|(string2)
        matches either string1 or string2; the order in which the alternatives appear in the test string does not matter

         String to Test: Hi Ho, Ho Hi, It Is Off To Work We Go
         Pattern: (Ho)|(Hi)
         Pattern: (Hi)|(Ho)|(Go)
         Pattern: (Go)|(Zoo)
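
Several of the sample patterns above can be tried in one pass. The following is a minimal sketch: TryPattern is a hypothetical helper written for this illustration, and it assumes the tests are case-insensitive (.IgnoreCase = True).

```vbscript
' Hypothetical helper: returns one line "pattern -> True/False"
Function TryPattern(pattern_text, string_to_test)
    Dim regexpobj
    Set regexpobj = New RegExp
    regexpobj.Pattern = pattern_text
    regexpobj.IgnoreCase = True
    TryPattern = pattern_text & " -> " & regexpobj.Test(string_to_test) & vbCrLf
End Function

Dim s, os
s = "Hi Ho, Hi Ho, It Is Off To Work We Go"
os = TryPattern("Hi Ho", s)               ' True
os = os & TryPattern("HiHo", s)           ' False: the blank is missing
os = os & TryPattern("^(Hi Ho)", s)       ' True: the string starts with it
os = os & TryPattern("^(i Ho)", s)        ' False: not at the beginning
os = os & TryPattern("(We Go)$", s)       ' True: the string ends with it
os = os & TryPattern("G$", s)             ' False: the last character is "o"
os = os & TryPattern("o.k", s)            ' True: "ork" in "Work"
os = os & TryPattern("(To).(Work)", s)    ' True: the blank is the "any character"
os = os & TryPattern("(Go)|(Zoo)", s)     ' True: "Go" is present
MsgBox os
```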

          

        (string){n}  or  character{n}
        must occur exactly n times and be contiguous (n >= 1)

        (string){n,m}  or  character{n,m}
        must occur between n and m times and be contiguous (0 <= n <= m)
        This is Greedy: tests m first, then m-1,...

        (string){n,m}?  or  character{n,m}?
        must occur between n and m times and be contiguous (0 <= n <= m)
        This is Lazy: tests n first, then n+1,...

        (string){n,}  or  character{n,}
        must occur at least n times and be contiguous

         String to Test: Hi Ho, Ho Hi, It Is Off To Work We Go
         Pattern: f{2}
         Pattern: (hi ho, ){2}
         Pattern: (hi ho, ){3}

         String to Test: row row row your boat
         Pattern: (row ){2,3}
         Pattern: (row ){5,6}
         Pattern: (row){1,5}

         String to Test: Hi Ho, Ho Hi, It Is Off To Work We Go
         Pattern: f{2,3}
         Pattern: (hi ho, ){2,3}
         Pattern: (hi ho, ){3,4}

         String to Test: Hi Ho, Ho Hi, It Is Off To Work We Go
         Pattern: f{2,}
         Pattern: (hi ho, ){2,}
         Pattern: (hi ho, ){3,}
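
The difference between the Greedy and Lazy forms shows up in what is actually matched, which the Execute method can report. The following is a sketch; the pattern (row ){1,3} and the variable names are chosen here for illustration.

```vbscript
Dim regexpobj, matches, greedy_value, lazy_value
Set regexpobj = New RegExp
regexpobj.IgnoreCase = True
regexpobj.Global = False               ' only the first match is needed

' Greedy: takes as many repetitions as it can (three here)
regexpobj.Pattern = "(row ){1,3}"
Set matches = regexpobj.Execute("row row row your boat")
greedy_value = matches.Item(0).Value   ' "row row row "

' Lazy: stops at the smallest number of repetitions that succeeds (one here)
regexpobj.Pattern = "(row ){1,3}?"
Set matches = regexpobj.Execute("row row row your boat")
lazy_value = matches.Item(0).Value     ' "row "

' Brackets make the trailing blank visible
MsgBox "Greedy: [" & greedy_value & "]" & vbCrLf & "Lazy: [" & lazy_value & "]"
```

Both forms give the same answer with the Test method; the distinction only matters when the matched text itself is used, as with Execute or Replace.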

          

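        [list of characters]

        the test string must contain at least one of these characters.
        Parentheses inside the brackets are ordinary characters, not groups,
        so [(off)Z] matches any one of the characters ( o f ) or Z

         String to Test: Off To Work
         Pattern: [wrk]
         Pattern: [abc]
         Pattern: [(off)Z]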

          

        -

        defines a range of characters

        the test string must contain at least one of the characters in the range

        Like:
        [A-Z]
        [a-z]
        [A-Za-z]
        [0-9]
        [A-Za-z0-9]

        Using the ^ character as the first character inside the brackets
        excludes the characters in the range: the test string must contain
        at least one character that is NOT in the range

         String to Test: to work we go (we make the test case-sensitive here)
         Pattern: [A-Z]
         Pattern: [a-z]
         Pattern: [A-Da-d]

         String to Test: 234567
         Pattern: [0-9]
         Pattern: [8-9]
         Pattern: [0-1]

         String to Test: abcdefghijkl
         Pattern: [^m-z]
         Pattern: [^a-z]
         Pattern: [^0-9]
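
Ranges and excluded ranges can be sketched as follows. The first two tests come from the samples above; the Replace call, which uses an excluded range to strip everything except digits, is an extra illustration added here.

```vbscript
Dim regexpobj, report, digits_only
Set regexpobj = New RegExp
regexpobj.IgnoreCase = False      ' ranges are tested case-sensitively here

' No upper case letter occurs in the test string
regexpobj.Pattern = "[A-Z]"
report = "[A-Z] in to work we go: " & regexpobj.Test("to work we go") & vbCrLf         ' False

' Every character is in a-z, so nothing outside the range is found
regexpobj.Pattern = "[^a-z]"
report = report & "[^a-z] in abcdefghijkl: " & regexpobj.Test("abcdefghijkl") & vbCrLf  ' False

' An excluded range with Replace removes every non-digit
regexpobj.Pattern = "[^0-9]"
regexpobj.Global = True
digits_only = regexpobj.Replace("5112-3456-7890-1234", "")   ' "5112345678901234"
report = report & "digits only: " & digits_only

MsgBox report
```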

          

        *
        (string)*  or  some character*

        the test string must contain 0 or more of the preceding group or character
        same as {0,}

         String to Test: rowrowrow your bbboat
         Pattern: (row)*
         Pattern: b*
         Pattern: X*

          

        +
        (string)+  or  some character+

        the test string must contain 1 or more of the preceding group or character
        same as {1,}

         String to Test: rowrowrow your bbboat
         Pattern: (row)+
         Pattern: y+
         Pattern: X+

          

        ?
        (string)?  or  some character?

        the test string must contain 0 or 1 of the preceding group or character
        same as {0,1}

         String to Test: rowrowrow your bbboat
         Pattern: (row)?
         Pattern: b?
         Pattern: X?
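
One consequence of "0 or more" and "0 or 1" is that such a pattern can succeed by matching nothing at all. The following sketch, using the test string above, shows this with both Test and Execute; the pattern b+ is added here for comparison.

```vbscript
Dim regexpobj, matches, report
Set regexpobj = New RegExp
regexpobj.Global = False

' X* succeeds even though the string has no X: zero occurrences is a match
regexpobj.Pattern = "X*"
report = "X* : " & regexpobj.Test("rowrowrow your bbboat") & vbCrLf           ' True

' X+ needs at least one X, so it fails
regexpobj.Pattern = "X+"
report = report & "X+ : " & regexpobj.Test("rowrowrow your bbboat") & vbCrLf  ' False

' b* matches the empty string at position 0, before the "bbb" is reached
regexpobj.Pattern = "b*"
Set matches = regexpobj.Execute("rowrowrow your bbboat")
report = report & "b* length: " & matches.Item(0).Length & vbCrLf             ' 0

' b+ skips ahead to the first real run of b's
regexpobj.Pattern = "b+"
Set matches = regexpobj.Execute("rowrowrow your bbboat")
report = report & "b+ value: " & matches.Item(0).Value & _
         " at " & matches.Item(0).FirstIndex                                  ' bbb at 15

MsgBox report
```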

          

      5. Character Classes

        There is additional shorthand for certain groups of characters:
        Class Explanation
        \d Matches any decimal digit. Same as [0-9]
        \D Matches any non-digit. Same as [^0-9]
        \s Matches any whitespace character: blank space, tab (horizontal tab), newline (linefeed aka LF), return (carriage return aka CR), form feed, vertical tab
        \S Matches any NON-whitespace character, i.e., any character other than blank space, tab, newline (linefeed aka LF), return (carriage return aka CR), form feed, vertical tab
        \w Matches any word character. Equivalent to [A-Za-z0-9_]
        \W Matches any NON-word character. Equivalent to [^A-Za-z0-9_]
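
The shorthand classes combine naturally with the three methods. The following is a brief sketch; the sample string with extra blanks and a tab is invented for this illustration.

```vbscript
Dim regexpobj, matches, tidy, report
Set regexpobj = New RegExp
regexpobj.Global = True

' \w+ matches each run of word characters, so Count is the number of words
regexpobj.Pattern = "\w+"
Set matches = regexpobj.Execute("Hi Ho, Hi Ho, It Is Off To Work We Go")
report = "words: " & matches.Count & vbCrLf                  ' 11

' \s+ matches each run of whitespace; replace each run with one blank
regexpobj.Pattern = "\s+"
tidy = regexpobj.Replace("Off   To" & vbTab & "Work", " ")   ' "Off To Work"
report = report & "tidy: " & tidy & vbCrLf

' \D finds a non-digit; there is none in a string of digits
regexpobj.Pattern = "\D"
report = report & "non-digit in 234567: " & regexpobj.Test("234567")   ' False

MsgBox report
```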

      6. Escapes

        There are numerous special characters that have unique specifications in regular expressions:

        Escaped Character  Explanation
        \a Matches the bell character (aka alarm) ASCII character 007 (decimal)
        \b Matches the backspace character ASCII character 008 (decimal) only inside a bracketed list, as in [\b]; elsewhere \b matches a word boundary
        \t Matches the tab character ASCII character 009 (decimal)
        \r Matches the carriage return character ASCII character 013 (decimal)
        \v Matches the vertical tab character ASCII character 011 (decimal)
        \f Matches the form feed character ASCII character 012 (decimal)
        \n Matches the new line character ASCII character 010 (decimal)
        \e Matches the escape character ASCII character 027 (decimal)
        \xhh Matches any ASCII character where hh is a hexadecimal number (00 through ff -- i.e., 0-255)
        \uxxxx Matches any Unicode character where xxxx is a four digit hexadecimal number
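
A few of these escapes can be checked against a short string built with the VBScript constant vbTab. The following is a sketch only; the variable name tabbed is chosen for this illustration.

```vbscript
Dim regexpobj, tabbed, report
Set regexpobj = New RegExp
tabbed = "A" & vbTab & "B"        ' vbTab is ASCII character 009

' \t matches the tab between the two letters
regexpobj.Pattern = "A\tB"
report = "A\tB : " & regexpobj.Test(tabbed) & vbCrLf             ' True

' hexadecimal 41 is the letter "A"
regexpobj.Pattern = "\x41"
report = report & "\x41 : " & regexpobj.Test(tabbed) & vbCrLf     ' True

' Unicode 0042 is the letter "B"
regexpobj.Pattern = "\u0042"
report = report & "\u0042 : " & regexpobj.Test(tabbed) & vbCrLf   ' True

' the string contains no new line character
regexpobj.Pattern = "\n"
report = report & "\n : " & regexpobj.Test(tabbed)                ' False

MsgBox report
```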

      7. Useful Patterns

        While the previous discussion provides some elementary explanations for regular expressions, the utility of regular expressions is in the work done by many individuals who work to develop accurate and efficient patterns for the myriad of tasks that confront the computer worker. Here are some useful patterns with references.

         
        Some Patterns with Explanation
         
        A debit or credit (maybe negative, one decimal, two digits to the right of the decimal)

        Pattern: ^-{0,1}[0-9]{1,15}\.[0-9]{2}$

        Explanation of the pattern:

        1. ^-{0,1} means at the beginning of the string there is zero or one minus sign, then
        2. [0-9]{1,15} means one to fifteen digits 0 through 9, then
        3. \. means a single decimal point, then
        4. [0-9]{2}$ means two digits at the end of the string

         String: -12345.88
         String: 123456.8
         String: 1234,567.88
         String: 1234567v88
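
The debit or credit pattern can be wrapped in a small function for reuse. The following is a minimal sketch; the function name IsAmount is hypothetical and is not part of VBScript.

```vbscript
' Hypothetical helper: True when amount_text is a well-formed debit or credit
Function IsAmount(amount_text)
    Dim regexpobj
    Set regexpobj = New RegExp
    regexpobj.Pattern = "^-{0,1}[0-9]{1,15}\.[0-9]{2}$"
    IsAmount = regexpobj.Test(amount_text)
End Function

' Results in order: True, False (one decimal digit),
' False (comma), False ("v" instead of a decimal point)
MsgBox "-12345.88: " & IsAmount("-12345.88") & vbCrLf & _
       "123456.8: " & IsAmount("123456.8") & vbCrLf & _
       "1234,567.88: " & IsAmount("1234,567.88") & vbCrLf & _
       "1234567v88: " & IsAmount("1234567v88")
```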

        A reasonable non-empty alphabetic string of no more than 20 characters

        Pattern: ^[A-Za-z\s]{1,20}$

        Explanation of the pattern:

        1. ^[A-Za-z\s] means the string begins with characters that are upper case letters, lower case letters or white space
        2. {1,20}$ means from one to 20 such characters, up to the end of the string

         String: AbdcefgHIJK
         String: AbdcefgHIJK3
         String: A b d cefgHIJK
         String: A-b-dcefgHIJK

        A PeopleSoft ID

        Pattern: ^0{1}[0-9]{6}$

        Explanation of the pattern:

        1. ^0{1} begins with a zero
        2. [0-9]{6}$ means ends with six digits

         String: 0123456
         String: 1234567
         String: -123456
         String: 0123

        MasterCard Number

        Pattern: ^5[1-5][0-9]{14}$

        A MasterCard number begins with two digits between 51 and 55 and has a total of 16 digits. (Note that there is also a checksum digit process that requires further validation; see DISC 3371 Midterm Exam Problem 1 for the account number layout and Problem #3 for the checksum algorithm.)

        Explanation of the pattern:

        1. ^5 must begin with a 5
        2. [1-5] followed by 1,2,3,4 or 5
        3. [0-9]{14}$ end with 14 digits

         String: 5112345678901234
         String: 511234567890123
         String: 5612345678901234
         String: 5112-3456-7890-1234
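
The pattern checks only the layout of the number, not the checksum digit. The following sketch adds a checksum step after the pattern test. It assumes the widely used Luhn checksum, which may or may not be the procedure described in the course exam; the function name LooksLikeMasterCard is hypothetical.

```vbscript
' Hypothetical helper: pattern test followed by a Luhn checksum (an assumption)
Function LooksLikeMasterCard(card_number)
    Dim regexpobj, i, digit, total, double_it
    Set regexpobj = New RegExp
    regexpobj.Pattern = "^5[1-5][0-9]{14}$"
    LooksLikeMasterCard = False
    If Not regexpobj.Test(card_number) Then Exit Function

    ' Luhn: working from the right, double every second digit,
    ' subtract 9 from any doubled value above 9, and add everything up
    total = 0
    double_it = False
    For i = Len(card_number) To 1 Step -1
        digit = CInt(Mid(card_number, i, 1))
        If double_it Then
            digit = digit * 2
            If digit > 9 Then digit = digit - 9
        End If
        total = total + digit
        double_it = Not double_it
    Next
    LooksLikeMasterCard = (total Mod 10 = 0)
End Function

' 5112345678901234 fits the pattern but its Luhn total is 59, so it is False
' 5555555555554444 fits the pattern and its Luhn total is 60, so it is True
MsgBox LooksLikeMasterCard("5112345678901234") & vbCrLf & _
       LooksLikeMasterCard("5555555555554444")
```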

        Email Address:

        Pattern: ^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$

        from Jan Goyvaerts' regular expression pages. Since I am using a case-sensitive regular expression I have modified the test to add a-z to the three string lists.

        Explanation of the pattern:

        1. ^[A-Za-z0-9._%+-]+ must begin with a string that contains 1 or more characters from the list: A-Z, a-z, 0-9, period, underscore, percent, plus, hyphen (this is the user name)
        2. @ one @ sign
        3. [A-Za-z0-9.-]+ must contain a string that contains 1 or more characters from the list: A-Z, a-z, 0-9, period, and hyphen (this is the domain)
        4. \. one period
        5. [A-Za-z]{2,4}$ ends with a 2 to 4 character top level domain

         String: T.P.Webb@abc.com
         String: T.P.Webb@abc.company
         String: T.P. Webb@abc.com
         String: T.P.Webb@a.b.c.com
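
The four sample addresses can be tested in a loop. The following is a short sketch; the variable names candidates and candidate are chosen for this illustration.

```vbscript
Dim regexpobj, candidates, candidate, report
Set regexpobj = New RegExp
regexpobj.Pattern = "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$"

candidates = Array("T.P.Webb@abc.com", "T.P.Webb@abc.company", _
                   "T.P. Webb@abc.com", "T.P.Webb@a.b.c.com")

' Results in order: True, False (top level domain longer than 4),
' False (blank in the user name), True
report = ""
For Each candidate In candidates
    report = report & candidate & " -> " & regexpobj.Test(candidate) & vbCrLf
Next
MsgBox report
```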

        Top Level Domains (TLD, e.g., .com, .edu, .net, etc.): (See: iana)