Horatio T.P. Webb

Regular Expressions -- SPRING 2004
DISC 3371 Transaction Processing I
Last Updated 10AM 3/26/2009

The purpose of regular expressions is to test for the occurrence of particular characters, strings or patterns of characters within another string. This method is well recognized and is used in many languages. The methods by which regular expressions are evaluated vary from language to language but the general approach is based on a single set of matching operators. This page demonstrates some of the basic methods and shows examples.

Regular Expression Object in VBScript

With version 5.0 and later, VBScript employs three basic parts in the test:

A string which is the subject of the examination
This is just a string to be tested and it possesses no unique characteristic. In the examples we will use the variable string_to_test to hold the string we wish to test and store a string value like this:
string_to_test = "Hi Ho, Hi Ho, It is off to work we go"
A regular Expression object
This VBScript object must be defined and named according to the following syntax:
SET you_pick_any_name_for_the_regular_expression_object = New RegExp
We will use the arbitrary name regexpobj as our object name. Thus the syntax would be:
SET regexpobj = New RegExp

A set of statements that define the parameters of the regular expression object. The general form is:

With name_for_the_regular_expression_object
   .Pattern = " the regular expression specification "
   .IgnoreCase = True OR False
   .Global= True OR False
End With

The value of .Pattern contains the string pattern or regular expression specification we will be searching for with the regular expression object.

The value of .IgnoreCase determines whether or not to ignore the case of the string to be searched and the string or pattern we will attempt to find. True means to ignore upper and lower case differences (i.e., upper and lower case characters produce the SAME result, so the character "A" produces the same result as "a") and False means to honor the differences between upper and lower case characters (so "a" and "A" produce different results). The default is False.

The value of .Global determines whether to search for one or for all occurrences of the test pattern. If True then the string is searched for all occurrences of the test pattern. If False the test is terminated after the first match is found. The default is False.

Once:

the value of the string to be tested is set;
the regular expression object created; and
the parameters of the object defined

Three methods can be employed:

The Test Method

In this simple version a variable is assigned the result of the regular expression test, which is either True or False. The usage is:

result = name_of_the_regular_expression_object.Test (string_to_test)

The entire sequence would thus be for example:

string_to_test = "Hi Ho, Hi Ho, It Is Off To Work We Go"
SET regexpobj = New RegExp
With regexpobj
   .Pattern = "work"
   .IgnoreCase = True
   .Global= True
End With
results = regexpobj.Test(string_to_test)

   The variable results will be assigned the value True or False based on the evaluation of the string. In this case the variable results is assigned the value True since, with case ignored, the string "work" is found in the string named string_to_test.

   We change the value of IgnoreCase to False. The variable results will be assigned the value False since the .Pattern='work' doesn't match the value 'Work' in string_to_test.

   We change the Pattern to "FRED" -- which doesn't appear anywhere in string_to_test. The variable results will be assigned the value False.
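
The three cases above can be sketched in one script. This is only an illustration: the function name test_pattern and the result variable names are our own choices, not part of VBScript.

```vbscript
' Hypothetical helper: builds a RegExp object and runs one Test.
Function test_pattern(pattern_text, text_to_test, ignore_case)
    Dim regexpobj
    Set regexpobj = New RegExp
    With regexpobj
        .Pattern = pattern_text
        .IgnoreCase = ignore_case
        .Global = True
    End With
    test_pattern = regexpobj.Test(text_to_test)
End Function

Dim string_to_test, result_ignore_case, result_honor_case, result_fred
string_to_test = "Hi Ho, Hi Ho, It Is Off To Work We Go"

' "work" is found when case is ignored
result_ignore_case = test_pattern("work", string_to_test, True)   ' True
' "work" does not match "Work" when case is honored
result_honor_case = test_pattern("work", string_to_test, False)   ' False
' "FRED" does not appear at all
result_fred = test_pattern("FRED", string_to_test, True)          ' False
```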

The Replace Method

The Replace method is used to find and replace occurrences of the pattern in the string (all of them when .Global is True, only the first when .Global is False). It requires the specification of two strings: (1) the string to test and (2) the replacement value. This method DOES NOT return True or False but returns a new string with the replacements made (if the pattern was found).

The syntax is:

new_string = name_of_the_regular_expression_object.Replace (string_to_test , replacement_value )

The entire sequence would thus be for example:

string_to_test = "Hi Ho, Hi Ho, It Is Off To Work We Go"
SET regexpobj = New RegExp
With regexpobj
   .Pattern = "Hi"
   .IgnoreCase = True
   .Global= True
End With
new_string = regexpobj.Replace(string_to_test,"Ho")

and new_string would be: "Ho Ho, Ho Ho, It Is Off To Work We Go"

   Using our example, to replace "Work" with "Drudgery" the value of Pattern would be set to "Work" and the syntax would be:
new_string = regexpobj.Replace(string_to_test,"Drudgery")
The value of new_string would be "Hi Ho, Hi Ho, It Is Off To Drudgery We Go"

   Suppose we try to replace "FRED" with "Drudgery". The syntax would be the same:
new_string = regexpobj.Replace(string_to_test,"Drudgery")
The value of Pattern would be set to "FRED" (i.e., we are trying to find "FRED" and replace it with "Drudgery"). The value of new_string would be the same as string_to_test (i.e., no replacement was made since FRED wasn't found).

   Suppose we try to replace "Hi" with "Drudgery". The syntax would be the same:
new_string = regexpobj.Replace(string_to_test,"Drudgery")
The value of Pattern would be set to "Hi" (i.e., we are trying to find "Hi" and replace it with "Drudgery"). The value of new_string would contain the string "Drudgery" twice since there are two occurrences of "Hi" in string_to_test.
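
A minimal sketch of how .Global changes what Replace does, using the same test string. The variable names replaced_all, replaced_first and replaced_none are our own, chosen only for this illustration.

```vbscript
Dim string_to_test, regexpobj
Dim replaced_all, replaced_first, replaced_none
string_to_test = "Hi Ho, Hi Ho, It Is Off To Work We Go"

Set regexpobj = New RegExp
With regexpobj
    .Pattern = "Hi"
    .IgnoreCase = False
    .Global = True
End With

' Global = True: both occurrences of "Hi" are replaced
replaced_all = regexpobj.Replace(string_to_test, "Drudgery")
' "Drudgery Ho, Drudgery Ho, It Is Off To Work We Go"

' Global = False: only the first occurrence is replaced
regexpobj.Global = False
replaced_first = regexpobj.Replace(string_to_test, "Drudgery")
' "Drudgery Ho, Hi Ho, It Is Off To Work We Go"

' A pattern that is not found leaves the string unchanged
regexpobj.Pattern = "FRED"
replaced_none = regexpobj.Replace(string_to_test, "Drudgery")
' "Hi Ho, Hi Ho, It Is Off To Work We Go"
```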

The Execute Method

The Execute method is a more advanced version of the Test method. Instead of returning True or False to a variable (the lefthand side of the equal sign), the method returns an object which is a collection of the matches. This method not only tests for a pattern but also returns (a) the number of matches and (b) where the matches occur. Because the result is an object, the assignment requires SET. The syntax is:

SET match_object = name_of_the_regular_expression_object.Execute (string_to_test)

Additionally we have:

the property: match_object.Count which is the number of matches found in the match_object collection.
the collection match_object.Item. This collection has a zero-based index (much like forms or elements in an HTML form) and the items can be processed with either:
1. a loop index (i.e., a subscript that starts at zero and goes to match_object.Count - 1)
2. a For-Each Loop
Either way, each .Item member of the match_object collection has four properties (only three of which are discussed here -- more later):
1. .FirstIndex is the number of characters to the left of the matched string (this can also be interpreted as the character position in a string using zero as the first character)
2. .Length is the number of characters in the matched item
3. .Value is the string that was matched

The usage is:

SET match_collection = name_of_the_regular_expression_object.Execute (string_to_test)

The entire sequence would thus be for example:

string_to_test = "Hi Ho, Hi Ho, It Is Off To Work We Go"
SET regexpobj = New RegExp
With regexpobj
   .Pattern = "Ho"
   .IgnoreCase = True
   .Global= True
End With
' Execute returns an object, so the assignment needs SET
SET match_collection = regexpobj.Execute(string_to_test)
if match_collection.Count = 0 then
    os = "NO matches found"
else
     os = cstr(match_collection.Count) + " matches found."
     ' the index runs from zero to Count - 1
     for i=0 to (match_collection.Count - 1)
         os=os + " " + match_collection.Item(i).Value
         os=os + " found at " + cstr(match_collection.Item(i).FirstIndex)
         os=os + " length = " + cstr(match_collection.Item(i).Length) + "."
     next
end if
msgbox os

   Using our example, we will find "Hi". With Pattern set to "Hi" the syntax would be:
SET match_collection = regexpobj.Execute(string_to_test)

   Using our example, we will find "i". With Pattern set to "i" the syntax would be the same:
SET match_collection = regexpobj.Execute(string_to_test)

   Using our example, we will find "o" (that is the letter "oh" -- not zero). With Pattern set to "o" the syntax would be the same:
SET match_collection = regexpobj.Execute(string_to_test)
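
The collection returned by Execute can also be processed with a For-Each loop instead of a loop index. Here is a minimal sketch for the pattern "Hi"; the loop variable name one_match is our own choice.

```vbscript
Dim string_to_test, regexpobj, match_collection, one_match, os
string_to_test = "Hi Ho, Hi Ho, It Is Off To Work We Go"

Set regexpobj = New RegExp
With regexpobj
    .Pattern = "Hi"
    .IgnoreCase = True
    .Global = True
End With

' Execute returns an object, so the assignment uses Set
Set match_collection = regexpobj.Execute(string_to_test)

os = CStr(match_collection.Count) & " matches found."
' Each item in the collection is one match
For Each one_match In match_collection
    os = os & " " & one_match.Value & " at " & CStr(one_match.FirstIndex)
    os = os & ", length " & CStr(one_match.Length) & "."
Next
' os is now: "2 matches found. Hi at 0, length 2. Hi at 7, length 2."
```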

These are the basics of the regular expression operations. Now...

Patterns

There is a very large set of pattern matching options for regular expressions. Here are some basics:

In the following we can test for characters or for groups (a group is a sequence of characters enclosed in parentheses, like: (ABC)). "Ordinary" characters do not need to be enclosed in parentheses. However, the "special" characters: $ ^ { [ ( | ) ] } * . + ? and \ have specific meanings to regular expressions. To use these special characters in matches you must precede each special character with a backslash (i.e., \).
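
A short sketch of why the backslash matters, using the period as the special character. The test strings "3x50" and "3.50" are made up for this illustration.

```vbscript
Dim regexpobj, unescaped_result, escaped_result, escaped_literal
Set regexpobj = New RegExp

' Unescaped, the period is special and stands for any single character
regexpobj.Pattern = "3.50"
unescaped_result = regexpobj.Test("3x50")   ' True

' Escaped, the period matches only a real period
regexpobj.Pattern = "3\.50"
escaped_result = regexpobj.Test("3x50")     ' False
escaped_literal = regexpobj.Test("3.50")    ' True
```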

Pattern Example

some_string
matches the exact string
String to Test: Hi Ho, Hi Ho, It Is Off To Work We Go

Pattern: Hi Ho

Pattern: HiHo

Pattern: Off t


^(some_string)
matches some_string
at the beginning of the
test string
String to Test: Hi Ho, Hi Ho, It Is Off To Work We Go

Pattern: ^(Hi Ho)

Pattern: ^(i Ho)

Pattern: ^H


(some_string)$
matches some_string
at the end of the
test string
String to Test: Hi Ho, Hi Ho, It Is Off To Work We Go

Pattern: (We Go)$

Pattern: o$

Pattern: G$


(string1).(string2)
matches: string1
followed by any single character
(except a newline)
followed by string2
String to Test: Hi Ho, Hi Ho, It Is Off To Work We Go

Pattern: (to wo).(k we go)

Pattern: o.k

Pattern: (To).(Work)


(string1)|(string2)
matches
either string1 or string2
(whichever is found first)
String to Test: Hi Ho, Ho Hi, It Is Off To Work We Go

Pattern: (Ho)|(Hi)

Pattern: (Hi)|(Ho)|(Go)

Pattern: (Go)|(Zoo)


(string){n} or character{n}
must occur exactly
n times
and be contiguous (n >= 1)
(string){n,m} or character{n,m}
must occur between
n and m times
and be contiguous
(0 <= n <= m)
This is Greedy
Tests m first, then m-1,...
(string){n,m}? or character{n,m}?
must occur between
n and m times
and be contiguous
(0 <= n <= m)
This is Lazy
Tests n first, then n+1,...
(string){n,} or character{n,}
must occur at least
n times
and be contiguous
String to Test: Hi Ho, Ho Hi, It Is Off To Work We Go

Pattern: f{2}

Pattern: (hi ho, ){2}

Pattern: (hi ho, ){3}


String to Test: row row row your boat

Pattern: (row ){2,3}

Pattern: (row ){5,6}

Pattern: (row){1,5}


String to Test: Hi Ho, Ho Hi, It Is Off To Work We Go

Pattern: f{2,3}

Pattern: (hi ho, ){2,3}

Pattern: (hi ho, ){3,4}


String to Test: Hi Ho, Ho Hi, It Is Off To Work We Go

Pattern: f{2,}

Pattern: (hi ho, ){2,}

Pattern: (hi ho, ){3,}


[list of characters]
the test string
must contain at least
one of these characters
(parentheses inside the brackets
are ordinary characters, not a group)
String to Test: Off To Work

Pattern: [wrk]


Pattern: [abc]


Pattern: [(off)Z]


-
defines a range
of characters
the test string
must contain at least
one of the characters
in the range
Like:
[A-Z]
[a-z]
[A-Za-z]
[0-9]
[A-Za-z0-9]
Using the ^ character
as the first character
inside the brackets excludes
the characters in the range
String to Test: to work we go (we make the test case-sensitive here)

Pattern: [A-Z]


Pattern: [a-z]


Pattern: [A-Da-d]


String to Test: 234567

Pattern: [0-9]


Pattern: [8-9]


Pattern: [0-1]


String to Test: abcdefghijkl

Pattern: [^m-z]


Pattern: [^a-z]


Pattern: [^0-9]


*
(string)*
or
some character*
the test string
must contain 0 or more of the
preceding group or character
same as {0,}
String to Test: rowrowrow your bbboat

Pattern: (row)*


Pattern: b*


Pattern: X*


+
(string)+
or
some character+
the test string
must contain 1 or more of the
preceding group or character
same as {1,}
String to Test: rowrowrow your bbboat

Pattern: (row)+


Pattern: y+


Pattern: X+


?
(string)?
or
some character?
the test string
must contain 0 or 1 of the
preceding group or character
same as {0,1}
String to Test: rowrowrow your bbboat

Pattern: (row)?


Pattern: b?


Pattern: X?
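
Several of the patterns in the table above can be tried with a short script. This is a sketch only: first_match is a hypothetical helper of our own that returns the text of the first match (or "(no match)"), and all tests here are case-sensitive.

```vbscript
' Hypothetical helper: returns the text of the first match, or "(no match)".
Function first_match(pattern_text, text_to_test)
    Dim regexpobj, match_collection
    Set regexpobj = New RegExp
    With regexpobj
        .Pattern = pattern_text
        .IgnoreCase = False   ' case-sensitive for every test below
        .Global = False       ' stop after the first match
    End With
    Set match_collection = regexpobj.Execute(text_to_test)
    If match_collection.Count = 0 Then
        first_match = "(no match)"
    Else
        first_match = match_collection.Item(0).Value
    End If
End Function

Dim hi_ho, row_string, boat_string
hi_ho = "Hi Ho, Hi Ho, It Is Off To Work We Go"
row_string = "row row row your boat"
boat_string = "rowrowrow your bbboat"

' Beginning of string, end of string, any character, alternatives
Dim at_start, not_at_start, at_end, any_char, either_one
at_start = first_match("^(Hi Ho)", hi_ho)         ' "Hi Ho"
not_at_start = first_match("^(i Ho)", hi_ho)      ' "(no match)"
at_end = first_match("(We Go)$", hi_ho)           ' "We Go"
any_char = first_match("o.k", hi_ho)              ' "ork"
either_one = first_match("(Go)|(Zoo)", hi_ho)     ' "Go"

' Greedy takes as many repetitions as allowed, lazy as few as allowed
Dim greedy_rows, lazy_rows, too_many_rows
greedy_rows = first_match("(row ){2,3}", row_string)    ' "row row row "
lazy_rows = first_match("(row ){2,3}?", row_string)     ' "row row "
too_many_rows = first_match("(row ){5,6}", row_string)  ' "(no match)"

' Excluded range, zero or more, one or more
Dim excluded, zero_or_more, one_or_more
excluded = first_match("[^a-z]", "abcdefghijkl")  ' "(no match)"
zero_or_more = first_match("b*", boat_string)     ' "" (zero b's found at position 0)
one_or_more = first_match("b+", boat_string)      ' "bbb"
```

Note that b* succeeds with an empty match at the very start of the string, since zero occurrences satisfy the pattern; b+ has to skip ahead to the first real run of b's.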

Character Classes

There is additional shorthand for certain groups of characters

Class Explanation

\d Matches any decimal digit. Same as [0-9]

\D Matches any non-digit. Same as [^0-9]

\s Matches any whitespace character: blank space, tab (horizontal tab), newline (linefeed aka LF), return (carriage return aka CR), form feed, or vertical tab

\S Matches any NON-whitespace character, i.e., anything other than blank space, tab, newline (linefeed aka LF), return (carriage return aka CR), form feed, or vertical tab

\w Matches any word character. Equivalent to [A-Za-z0-9_]

\W Matches any NON-word character. Equivalent to [^A-Za-z0-9_]
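
A brief sketch of the shorthand classes at work with Replace and Execute. The sample strings (including the phone number) are made up for this illustration.

```vbscript
Dim regexpobj, collapsed, digits_only, word_count
Set regexpobj = New RegExp
regexpobj.Global = True   ' act on every occurrence, not just the first

' \s+ : collapse each run of whitespace (spaces, tabs) into one space
regexpobj.Pattern = "\s+"
collapsed = regexpobj.Replace("Off   To" & vbTab & "Work", " ")
' "Off To Work"

' \D : delete everything that is not a digit
regexpobj.Pattern = "\D"
digits_only = regexpobj.Replace("(713) 555-0100", "")
' "7135550100"

' \w+ : each run of word characters is one match
regexpobj.Pattern = "\w+"
word_count = regexpobj.Execute("It Is Off To Work").Count
' 5
```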

Escapes

There are numerous special characters that have unique specifications in regular expressions

Escaped Character Explanation

\a Matches the bell character (aka alarm) ASCII character 007 (decimal)

\b Matches a word boundary when used on its own in a pattern; inside a character list, as in [\b], it matches the backspace character ASCII character 008 (decimal)

\t Matches the tab character ASCII character 009 (decimal)

\r Matches the carriage return character ASCII character 013 (decimal)

\v Matches the vertical tab character ASCII character 011 (decimal)

\f Matches the form feed character ASCII character 012 (decimal)

\n Matches the new line character ASCII character 010 (decimal)

\e Matches the escape character ASCII character 027 (decimal)

\xhh Matches any ASCII character where hh is a hexadecimal number (00 through ff -- i.e., 0-255)

\uxxxx Matches any Unicode character where xxxx is a four digit hexadecimal number
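
A minimal sketch using a few of these escapes. The variable names are our own; \x57 and \u0057 both stand for the letter "W" (decimal 87).

```vbscript
Dim regexpobj, has_tab, has_hex, has_unicode, one_line
Set regexpobj = New RegExp

' \t finds a tab character
regexpobj.Pattern = "\t"
has_tab = regexpobj.Test("Off" & vbTab & "To")   ' True

' \x57 is the two-digit hexadecimal code for "W"
regexpobj.Pattern = "\x57ork"
has_hex = regexpobj.Test("Off To Work")          ' True

' \u0057 is the four-digit hexadecimal code for "W"
regexpobj.Pattern = "\u0057ork"
has_unicode = regexpobj.Test("Off To Work")      ' True

' \r\n finds a carriage return followed by a new line
regexpobj.Pattern = "\r\n"
regexpobj.Global = True
one_line = regexpobj.Replace("Hi Ho," & vbCrLf & "Hi Ho", " ")
' "Hi Ho, Hi Ho"
```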

Useful Patterns

While the previous discussion provides some elementary explanations for regular expressions, the utility of regular expressions is in the work done by many individuals who work to develop accurate and efficient patterns for the myriad of tasks that confront the computer worker. Here are some useful patterns with references.

Some Patterns with Explanation

A debit or credit (maybe negative, one decimal, two digits to the right of the decimal)
Pattern: ^-{0,1}[0-9]{1,15}\.[0-9]{2}$
Explanation of the pattern:

^-{0,1} means at the beginning of the string there is zero or 1 minus, then
[0-9]{1,15} means one to fifteen digits 0 through 9, then
\. means a single decimal point, then
[0-9]{2}$ means two digits at the end of the string

String: -12345.88

String: 123456.8

String: 1234,567.88

String: 1234567v88
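
The four strings above can be checked with the Test method. A minimal sketch; the result variable names are our own.

```vbscript
Dim regexpobj
Dim valid_amount, one_decimal_digit, has_comma, has_letter
Set regexpobj = New RegExp
regexpobj.Pattern = "^-{0,1}[0-9]{1,15}\.[0-9]{2}$"

valid_amount = regexpobj.Test("-12345.88")        ' True
one_decimal_digit = regexpobj.Test("123456.8")    ' False: only one digit after the decimal
has_comma = regexpobj.Test("1234,567.88")         ' False: the comma is not a digit
has_letter = regexpobj.Test("1234567v88")         ' False: "v" is not a decimal point
```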


A reasonable non-empty alphabetic string of no more than 20 characters
Pattern: ^[A-Za-z\s]{1,20}$
Explanation of the pattern:

^[A-Za-z\s] means the string begins with characters that are upper case, lower case or white space
{1,20}$ means there are from one to 20 such characters, running to the end of the string

String: AbdcefgHIJK

String: AbdcefgHIJK3

String: A b d cefgHIJK

String: A-b-dcefgHIJK
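
A sketch that tests the four strings above and also shows why the ^ and $ are needed. The unanchored pattern at the end is our own variation, added only for comparison.

```vbscript
Dim regexpobj
Dim letters_only, with_digit, with_spaces, with_hyphens, unanchored_with_digit
Set regexpobj = New RegExp
regexpobj.Pattern = "^[A-Za-z\s]{1,20}$"

letters_only = regexpobj.Test("AbdcefgHIJK")      ' True
with_digit = regexpobj.Test("AbdcefgHIJK3")       ' False: "3" is not in the list
with_spaces = regexpobj.Test("A b d cefgHIJK")    ' True: white space is allowed
with_hyphens = regexpobj.Test("A-b-dcefgHIJK")    ' False: "-" is not in the list

' Without ^ and $ the pattern only has to match somewhere in the string,
' so the string with the digit now passes
regexpobj.Pattern = "[A-Za-z\s]{1,20}"
unanchored_with_digit = regexpobj.Test("AbdcefgHIJK3")   ' True
```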


A PeopleSoft ID
Pattern: ^0{1}[0-9]{6}$
Explanation of the pattern:

^0{1} means the string begins with exactly one zero
[0-9]{6}$ means the string ends with six digits

String: 0123456

String: 1234567

String: -123456

String: 0123
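
A sketch testing the four strings above. The last test uses a shorter form of the pattern, which is our own variation: since a character with no count must occur exactly once, 0{1} and 0 mean the same thing.

```vbscript
Dim regexpobj
Dim valid_id, wrong_first_digit, leading_minus, too_short, short_form_valid
Set regexpobj = New RegExp
regexpobj.Pattern = "^0{1}[0-9]{6}$"

valid_id = regexpobj.Test("0123456")            ' True
wrong_first_digit = regexpobj.Test("1234567")   ' False: does not begin with zero
leading_minus = regexpobj.Test("-123456")       ' False: begins with a minus
too_short = regexpobj.Test("0123")              ' False: only three digits after the zero

' The same test with the {1} dropped
regexpobj.Pattern = "^0[0-9]{6}$"
short_form_valid = regexpobj.Test("0123456")    ' True
```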


MasterCard Number
Pattern: ^5[1-5][0-9]{14}$
A MasterCard Number begins with two digits between 51 and 55 and has a total of 16 digits. (Note that there is also a checksum digit process that requires further validation; see DISC 3371 Midterm Exam Problem 1 for the account number layout and Problem # 3 for the checksum algorithm.)
Explanation of the pattern:

^5 must begin with a 5
[1-5] followed by 1,2,3,4 or 5
[0-9]{14}$ end with 14 digits

String:
5112345678901234

String:
511234567890123

String:
5612345678901234

String:
5112-3456-7890-1234
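
A sketch testing the four strings above. The final step, removing the hyphens with Replace before testing, is our own assumption about how input might be cleaned up; it is not part of the pattern itself, and none of this checks the checksum digit.

```vbscript
Dim regexpobj, hyphen_remover, cleaned_number
Dim sixteen_digits, fifteen_digits, starts_with_56, with_hyphens, after_cleaning
Set regexpobj = New RegExp
regexpobj.Pattern = "^5[1-5][0-9]{14}$"

sixteen_digits = regexpobj.Test("5112345678901234")    ' True
fifteen_digits = regexpobj.Test("511234567890123")     ' False: one digit short
starts_with_56 = regexpobj.Test("5612345678901234")    ' False: 56 is outside 51-55
with_hyphens = regexpobj.Test("5112-3456-7890-1234")   ' False: hyphens are not digits

' Assumption for illustration: strip the hyphens first, then test again
Set hyphen_remover = New RegExp
hyphen_remover.Pattern = "-"
hyphen_remover.Global = True
cleaned_number = hyphen_remover.Replace("5112-3456-7890-1234", "")
after_cleaning = regexpobj.Test(cleaned_number)        ' True
```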


Email Address:
Pattern: ^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$
from Jan Goyvaerts' regular expression pages. Since I am using a case-sensitive regular expression I have modified the test to add a-z to the three character lists.
Explanation of the pattern:

^[A-Za-z0-9._%+-]+ must begin with a string of 1 or more characters from the list: A-Z, a-z, 0-9, period, underscore, percent, plus, hyphen (this is the user name)
@ one @ sign
[A-Za-z0-9.-]+ must contain a string of 1 or more characters from the list: A-Z, a-z, 0-9, period, and hyphen (this is the domain)
\. one period
[A-Za-z]{2,4}$ ends with a 2 to 4 character top level domain

String:
T.P.Webb@abc.com

String:
T.P.Webb@abc.company

String:
T.P. Webb@abc.com

String:
T.P.Webb@a.b.c.com
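
A sketch testing the four strings above with the case-sensitive pattern. The result variable names are our own.

```vbscript
Dim regexpobj
Dim plain_address, long_tld, space_in_name, several_subdomains
Set regexpobj = New RegExp
regexpobj.Pattern = "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$"
regexpobj.IgnoreCase = False

plain_address = regexpobj.Test("T.P.Webb@abc.com")         ' True
long_tld = regexpobj.Test("T.P.Webb@abc.company")          ' False: "company" is longer than 4 characters
space_in_name = regexpobj.Test("T.P. Webb@abc.com")        ' False: a blank is not in the user name list
several_subdomains = regexpobj.Test("T.P.Webb@a.b.c.com")  ' True: periods are allowed in the domain
```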


Top Level Domains (TLD, e.g., .com, .edu, .net, etc.): (See: iana)
