Regular Expressions -- SPRING 2004 DISC 3371 Transaction Processing I Last Updated 10AM 3/26/2009
The purpose of regular expressions is to test for the occurrence of particular characters, strings or patterns of characters within another string. This method is well recognized and is used in many languages. The way regular expressions are evaluated varies from language to language, but the general approach is based on a single set of matching operators. This page demonstrates some of the basic methods and shows examples.
Regular Expression Object in VBScript
In version 5.0 and later, a regular expression test in VBScript has three basic parts:
A string which is the subject of the examination
This is just a string to be tested and it possesses no unique characteristic. In the examples we will use the variable string_to_test to hold the string we wish to test and store a string value like this:
string_to_test = "Hi Ho, Hi Ho, It is off to work we go"
A regular expression object
This VBScript object must be defined and named according to the following syntax:
SET you_pick_any_name_for_the_regular_expression_object = New RegExp
We will use the arbitrary name regexpobj as our object name. Thus the syntax would be:
SET regexpobj = New RegExp
A set of statements that define the parameters of the regular expression object. The general form is:
With name_for_the_regular_expression_object
.Pattern = " the regular expression specification "
.IgnoreCase = True OR False
.Global= True OR False
End With
The value of .Pattern contains the string pattern or regular expression specification that the regular expression object will search for.
The value of .IgnoreCase determines whether or not to ignore case when comparing the string to be searched with the string or pattern we are attempting to find. True means to ignore upper and lower case differences (i.e., upper and lower case characters produce the SAME result, so the character "A" produces the same result as "a") and False means to honor the differences between upper and lower case characters (so "a" and "A" produce different results). The default is False.
The value of .Global determines whether to search for one or for all occurrences of the test pattern. If True, the string is searched for all occurrences of the test pattern. If False, the test is terminated after the first match is found.
Once:
the value of the string to be tested is set;
the regular expression object created; and
the parameters of the object defined
Three methods can be employed:
The Test Method
In this simple version a variable is assigned the results of the regular expression test. The usage is:
string_to_test = "Hi Ho, Hi Ho, It Is Off To Work We Go"
SET regexpobj = New RegExp
With regexpobj
.Pattern = "work"
.IgnoreCase = True
.Global= True
End With
results = regexpobj.Test(string_to_test)
The variable results is assigned the value True or False based on the evaluation of the string. In this case results is assigned True, since the pattern "work" is found in string_to_test (IgnoreCase is True, so "work" matches "Work").
If we change the value of IgnoreCase to False, results is assigned False, since .Pattern = "work" no longer matches the value "Work" in string_to_test.
If we change the Pattern to "FRED", which doesn't appear anywhere in string_to_test, results is assigned False.
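The three cases above can be checked with a short script. This is a minimal sketch rather than a definitive implementation: PatternFound is a hypothetical helper name (not part of VBScript), and the comments give the result each call should produce.

```vbscript
' Hypothetical helper: builds a RegExp object and runs one Test call.
Function PatternFound(pattern_text, string_to_test, ignore_case)
    Dim regexpobj
    Set regexpobj = New RegExp
    With regexpobj
        .Pattern = pattern_text
        .IgnoreCase = ignore_case
        .Global = True
    End With
    PatternFound = regexpobj.Test(string_to_test)
End Function

Dim s, os
s = "Hi Ho, Hi Ho, It Is Off To Work We Go"
os = "work, IgnoreCase True: " & PatternFound("work", s, True) & vbCrLf        ' True
os = os & "work, IgnoreCase False: " & PatternFound("work", s, False) & vbCrLf ' False
os = os & "FRED, IgnoreCase True: " & PatternFound("FRED", s, True)            ' False
MsgBox os
```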
The Replace Method
The Replace method is used to find and replace occurrences of the pattern in the string (all of them when .Global is True, only the first when it is False). It requires the specification of two strings: (1) the string to test and (2) the replacement value. This method DOES NOT return True or False; it returns a new string with the replacements made (if the pattern was found).
The syntax is:
new_string = name_of_the_regular_expression_object.Replace(string_to_test, replacement_value)
The entire sequence would thus be for example:
string_to_test = "Hi Ho, Hi Ho, It Is Off To Work We Go"
SET regexpobj = New RegExp
With regexpobj
.Pattern = "Hi"
.IgnoreCase = True
.Global= True
End With
new_string = regexpobj.Replace(string_to_test,"Ho")
and new_string would be: "Ho Ho, Ho Ho, It Is Off To Work We Go"
Using our example, to replace "Work" with "Drudgery", the value of Pattern would be set to "Work" and the replacement value to "Drudgery". The value of new_string would be "Hi Ho, Hi Ho, It Is Off To Drudgery We Go".
Suppose the value of Pattern is set to "FRED" (i.e., we are trying to find "FRED" and replace it with "Drudgery"). The value of new_string would be the same as string_to_test (i.e., no replacement is made since FRED isn't found).
Suppose we try to replace "Hi" with "Drudgery"; the syntax would be the same.
The value of Pattern would be set to "Hi" (i.e., we are trying to find "Hi" and replace it with "Drudgery"). The value of new_string would contain the string "Drudgery" twice, since there are two occurrences of "Hi" in string_to_test.
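The replacement cases discussed above can be sketched in one script. This is only an illustration; the final call, with .Global set to False, is an added case that shows just the first occurrence being replaced.

```vbscript
Dim regexpobj, s, os
s = "Hi Ho, Hi Ho, It Is Off To Work We Go"
Set regexpobj = New RegExp
regexpobj.IgnoreCase = True
regexpobj.Global = True

regexpobj.Pattern = "Work"
os = regexpobj.Replace(s, "Drudgery") & vbCrLf       ' ...It Is Off To Drudgery We Go

regexpobj.Pattern = "FRED"
os = os & regexpobj.Replace(s, "Drudgery") & vbCrLf  ' unchanged: FRED is not found

regexpobj.Pattern = "Hi"
os = os & regexpobj.Replace(s, "Drudgery") & vbCrLf  ' Drudgery Ho, Drudgery Ho, It Is...

regexpobj.Global = False                             ' assumption: added case, first match only
os = os & regexpobj.Replace(s, "Drudgery")           ' Drudgery Ho, Hi Ho, It Is...
MsgBox os
```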
The Execute Method
The Execute method is a more advanced version of the Test method. Instead of returning True or False to a variable (the left-hand side of the equal sign), the method returns an object which is a collection of the matches.
This method not only tests for a pattern but also returns (a) the number of matches and (b) where the matches occur. Because the result is an object, it must be assigned with SET.
The syntax is:
SET match_object = name_of_the_regular_expression_object.Execute(string_to_test)
The returned match_object has:
the property match_object.Count, which is the number of matches found in the match collection object.
the collection match_object.Item. This collection has a zero-based index (much like forms or elements in an HTML form) and the items can be processed with either:
a loop index (i.e., a subscript that starts at zero and goes to match_object.Count - 1)
a For-Each Loop
Either way, each .Item member of the match_object collection has four properties (only three of which are discussed here; more later):
.FirstIndex is the number of characters to the left of the matched string (this can also be interpreted as the character position in the string, counting the first character as zero)
.Length is the number of characters in the matched item
.Value is the text of the matched item
string_to_test = "Hi Ho, Hi Ho, It Is Off To Work We Go"
SET regexpobj = New RegExp
With regexpobj
.Pattern = "Ho"
.IgnoreCase = True
.Global= True
End With
' Execute returns an object, so the assignment needs SET
SET match_collection = regexpobj.Execute(string_to_test)
if match_collection.Count = 0 then
os = "NO matches found"
else
os = cstr(match_collection.Count) + " matches found."
for i=0 to (match_collection.Count - 1)
' start each match on its own line of the message
os=os + vbCrLf + match_collection.Item(i).value
os=os + " found at " + cstr(match_collection.Item(i).FirstIndex)
os=os + " length = "+cstr(match_collection.Item(i).Length)
next
end if
msgbox os
Using our example, to find "Hi" instead, only the value of .Pattern changes to "Hi".
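A search for "Hi" can also be written with the For-Each form of the loop mentioned earlier. The following is a minimal sketch; the variable name one_match is an arbitrary choice for illustration.

```vbscript
Dim regexpobj, match_collection, one_match, s, os
s = "Hi Ho, Hi Ho, It Is Off To Work We Go"
Set regexpobj = New RegExp
With regexpobj
    .Pattern = "Hi"
    .IgnoreCase = True
    .Global = True
End With

' Execute returns a collection object, so Set is required.
Set match_collection = regexpobj.Execute(s)
os = match_collection.Count & " matches found." & vbCrLf

' For Each visits every match; the loop body is skipped when there are none.
For Each one_match In match_collection
    os = os & one_match.Value & " found at " & one_match.FirstIndex & _
         ", length = " & one_match.Length & vbCrLf
Next
MsgBox os   ' 2 matches: "Hi" at positions 0 and 7, each of length 2
```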
These are the basics of the regular expression operations. Now for the patterns themselves.
Patterns
There is a very large set of pattern matching options for regular expressions. Here are some basics:
In the following we can test for single characters or for a group, which is a sequence of characters enclosed in parentheses, like (ABC). "Ordinary" characters do not need to be enclosed in parentheses. However, the "special" characters $ ^ { [ ( | ) ] } * . + ? and \ have specific meanings in regular expressions. To match one of these special characters literally, you must precede it with a backslash (i.e., \).
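The effect of the backslash can be seen in a small test. This is a sketch only; the sample string is made up for illustration.

```vbscript
Dim regexpobj, s, os
s = "Total: $5.00 (paid)"          ' hypothetical sample text
Set regexpobj = New RegExp
regexpobj.IgnoreCase = False
regexpobj.Global = False

regexpobj.Pattern = "\$5\.00"      ' escaped $ and . are matched literally
os = "\$5\.00: " & regexpobj.Test(s) & vbCrLf      ' True

regexpobj.Pattern = "\(paid\)"     ' escaped parentheses are matched literally
os = os & "\(paid\): " & regexpobj.Test(s) & vbCrLf ' True

regexpobj.Pattern = "$5.00"        ' unescaped $ means "end of string"
os = os & "$5.00: " & regexpobj.Test(s)             ' False: nothing can follow the end
MsgBox os
```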
Pattern
Example
some_string
matches the exact string
String to Test: Hi Ho, Hi Ho, It Is Off To Work We Go
Pattern: Hi Ho
Pattern: HiHo
Pattern: Off t
^(some_string)
matches some_string
at the beginning of the
test string
String to Test: Hi Ho, Hi Ho, It Is Off To Work We Go
Pattern: ^(Hi Ho)
Pattern: ^(i Ho)
Pattern: ^H
(some_string)$
matches some_string
at the end of the
test string
String to Test: Hi Ho, Hi Ho, It Is Off To Work We Go
Pattern: (We Go)$
Pattern: o$
Pattern: G$
(string1).(string2)
matches: string1 followed by any single character (other than a newline)
followed by string2
String to Test: Hi Ho, Hi Ho, It Is Off To Work We Go
Pattern: (to wo).(k we go)
Pattern: o.k
Pattern: (To).(Work)
(string1)|(string2)
matches string1 or string2
in either order
String to Test: Hi Ho, Ho Hi, It Is Off To Work We Go
Pattern: (Ho)|(Hi)
Pattern: (Hi)|(Ho)|(Go)
Pattern: (Go)|(Zoo)
(string){n}
or
character{n}
must occur exactly n times
and be contiguous (n >= 1)
(string){n,m}
or
character{n,m}
must occur between n and m times
and be contiguous
(0 <= n <= m)
This is Greedy
Tests m first, then m-1,...
(string){n,m}?
or
character{n,m}?
must occur between n and m times
and be contiguous
(0 <= n <= m)
This is Lazy
Tests n first, then n+1,...
(string){n,}
or
character{n,}
must occur at least n times
and be contiguous
String to Test: Hi Ho, Ho Hi, It Is Off To Work We Go
Pattern: f{2}
Pattern: (hi ho, ){2}
Pattern: (hi ho, ){3}
String to Test: row row row your boat
Pattern: (row ){2,3}
Pattern: (row ){5,6}
Pattern: (row){1,5}
String to Test: Hi Ho, Ho Hi, It Is Off To Work We Go
Pattern: f{2,3}
Pattern: (hi ho, ){2,3}
Pattern: (hi ho, ){3,4}
String to Test: Hi Ho, Ho Hi, It Is Off To Work We Go
Pattern: f{2,}
Pattern: (hi ho, ){2,}
Pattern: (hi ho, ){3,}
[list of characters]
the test string
must contain at least
one of these characters
(parentheses inside the brackets are
ordinary characters, not groups)
String to Test: Off To Work
Pattern: [wrk]
Pattern: [abc]
Pattern: [(off)Z]
-
defines a range
of characters
the test string
must contain at least one of the characters
in the range
Like:
[A-Z]
[a-z]
[A-Za-z]
[0-9]
[A-Za-z0-9]
Using the ^ character
as the first character
inside the brackets excludes
the characters in the range
String to Test: to work we go (we make the test case-sensitive here)
Pattern: [A-Z]
Pattern: [a-z]
Pattern: [A-Da-d]
String to Test: 234567
Pattern: [0-9]
Pattern: [8-9]
Pattern: [0-1]
String to Test: abcdefghijkl
Pattern: [^m-z]
Pattern: [^a-z]
Pattern: [^0-9]
*
(string)*
or some character*
the test string
must contain 0 or more of the
preceding group or character
same as {0,}
String to Test: rowrowrow your bbboat
Pattern: (row)*
Pattern: b*
Pattern: X*
+
(string)+
or some character+
the test string
must contain 1 or more of the
preceding group or character
same as {1,}
String to Test: rowrowrow your bbboat
Pattern: (row)+
Pattern: y+
Pattern: X+
?
(string)?
or some character?
the test string
must contain 0 or 1 of the
preceding group or character
same as {0,1}
String to Test: rowrowrow your bbboat
Pattern: (row)?
Pattern: b?
Pattern: X?
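A selection of the example patterns from the table above can be run as a batch to see which ones are found. The following is a minimal sketch, not a definitive implementation: PatternFound and FirstMatch are hypothetical helper names (not part of VBScript), and the comments give the result each call should produce. The last two lines contrast the greedy {n,m} form with the lazy {n,m}? form by showing the text of the first match.

```vbscript
' Hypothetical helper: True if pattern_text is found in string_to_test.
Function PatternFound(pattern_text, string_to_test, ignore_case)
    Dim regexpobj
    Set regexpobj = New RegExp
    With regexpobj
        .Pattern = pattern_text
        .IgnoreCase = ignore_case
        .Global = False
    End With
    PatternFound = regexpobj.Test(string_to_test)
End Function

' Hypothetical helper: text of the first match in brackets, or a note if none.
Function FirstMatch(pattern_text, string_to_test)
    Dim regexpobj, match_collection
    Set regexpobj = New RegExp
    regexpobj.Pattern = pattern_text
    regexpobj.IgnoreCase = False
    regexpobj.Global = False
    Set match_collection = regexpobj.Execute(string_to_test)
    If match_collection.Count = 0 Then
        FirstMatch = "(no match)"
    Else
        FirstMatch = "[" & match_collection.Item(0).Value & "]"
    End If
End Function

Dim s, boat, os
s = "Hi Ho, Hi Ho, It Is Off To Work We Go"
boat = "row row row your boat"

os = "Hi Ho: " & PatternFound("Hi Ho", s, False) & vbCrLf                  ' True
os = os & "HiHo: " & PatternFound("HiHo", s, False) & vbCrLf               ' False
os = os & "^(i Ho): " & PatternFound("^(i Ho)", s, False) & vbCrLf         ' False
os = os & "o$: " & PatternFound("o$", s, False) & vbCrLf                   ' True
os = os & "o.k: " & PatternFound("o.k", s, False) & vbCrLf                 ' True ("ork")
os = os & "(Go)|(Zoo): " & PatternFound("(Go)|(Zoo)", s, False) & vbCrLf   ' True
os = os & "f{2}: " & PatternFound("f{2}", s, False) & vbCrLf               ' True ("ff")
os = os & "(hi ho, ){3}: " & PatternFound("(hi ho, ){3}", s, True) & vbCrLf ' False
os = os & "[abc]: " & PatternFound("[abc]", "Off To Work", False) & vbCrLf ' False
os = os & "[^a-z]: " & PatternFound("[^a-z]", "abcdefghijkl", False) & vbCrLf ' False
os = os & "X*: " & PatternFound("X*", "rowrowrow your bbboat", False) & vbCrLf ' True
os = os & "X+: " & PatternFound("X+", "rowrowrow your bbboat", False) & vbCrLf ' False
os = os & "greedy (row ){2,3}: " & FirstMatch("(row ){2,3}", boat) & vbCrLf ' [row row row ]
os = os & "lazy (row ){2,3}?: " & FirstMatch("(row ){2,3}?", boat)          ' [row row ]
MsgBox os
```

Note that X* is reported as found even though the string contains no X: zero occurrences satisfy the pattern, so * and ? are mostly useful as part of a longer pattern.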
Character Classes
There is additional shorthand for certain groups of characters
Class
Explanation
\d
Matches any decimal digit. Same as [0-9]
\D
Matches any non-digit. Same as [^0-9]
\s
Matches any whitespace character: blank space, tab (horizontal tab), newline (linefeed aka LF), return (carriage return aka CR), form feed, or vertical tab
\S
Matches any NON-whitespace character, i.e., any character other than blank space, tab, newline (linefeed aka LF), return (carriage return aka CR), form feed, or vertical tab
\w
Matches any word character. Equivalent to [A-Za-z0-9_]
\W
Matches any NON-word character. Equivalent to [^A-Za-z0-9_]
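Each class can be tried against a short string. This is a sketch for illustration; ClassFound is a hypothetical helper name and the sample strings are made up.

```vbscript
' Hypothetical helper: True if pattern_text is found in string_to_test.
Function ClassFound(pattern_text, string_to_test)
    Dim regexpobj
    Set regexpobj = New RegExp
    regexpobj.Pattern = pattern_text
    regexpobj.IgnoreCase = False
    regexpobj.Global = False
    ClassFound = regexpobj.Test(string_to_test)
End Function

Dim os
os = "\d in Room 101: " & ClassFound("\d", "Room 101") & vbCrLf         ' True
os = os & "\D in 101: " & ClassFound("\D", "101") & vbCrLf              ' False: digits only
os = os & "\s in Off To Work: " & ClassFound("\s", "Off To Work") & vbCrLf ' True
os = os & "\S in three blanks: " & ClassFound("\S", "   ") & vbCrLf     ' False
os = os & "\w in _: " & ClassFound("\w", "_") & vbCrLf                  ' True: underscore counts
os = os & "\W in Work_101: " & ClassFound("\W", "Work_101")             ' False
MsgBox os
```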
Escapes
There are numerous special characters that have unique specifications in regular expressions
Escaped Character
Explanation
\a
Matches the bell character (aka alarm) ASCII character 007 (decimal)
\b
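Matches the backspace character, ASCII character 008 (decimal), only inside a character list such as [\b]; elsewhere in a pattern \b matches a word boundary (the backspace character can also be written \x08)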
\t
Matches the tab character ASCII character 009 (decimal)
\r
Matches the carriage return character ASCII character 013 (decimal)
\v
Matches the vertical tab character ASCII character 011 (decimal)
\f
Matches the form feed character ASCII character 012 (decimal)
\n
Matches the new line character ASCII character 010 (decimal)
\e
Matches the escape character ASCII character 027 (decimal)
\xhh
Matches any ASCII character where hh is a hexadecimal number (00 through ff -- i.e., 0-255)
\uxxxx
Matches any Unicode character where xxxx is a four digit hexadecimal number
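A few of these escapes can be tried directly. VBScript string literals do not interpret the backslash, so the two characters \t reach the regular expression object unchanged. This is a sketch only; the test strings are built with the VBScript constants vbTab and vbCrLf.

```vbscript
Dim regexpobj, os
Set regexpobj = New RegExp
regexpobj.IgnoreCase = False
regexpobj.Global = False

regexpobj.Pattern = "\t"
os = "\t: " & regexpobj.Test("a" & vbTab & "b") & vbCrLf               ' True

regexpobj.Pattern = "\r\n"
os = os & "\r\n: " & regexpobj.Test("line1" & vbCrLf & "line2") & vbCrLf ' True

regexpobj.Pattern = "\x41"                                             ' hex 41 is "A"
os = os & "\x41 in ABC: " & regexpobj.Test("ABC") & vbCrLf             ' True
os = os & "\x41 in abc: " & regexpobj.Test("abc") & vbCrLf             ' False (case-sensitive)

regexpobj.Pattern = "\u0042"                                           ' Unicode 0042 is "B"
os = os & "\u0042 in ABC: " & regexpobj.Test("ABC")                    ' True
MsgBox os
```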
Useful Patterns
While the previous discussion provides some elementary explanations of regular expressions, much of their practical value comes from the work of the many individuals who develop accurate and efficient patterns for the myriad tasks that confront the computer worker. Here are some useful patterns with references.
Some Patterns with Explanation
A debit or credit (maybe negative, one decimal, two digits to the right of the decimal)
Pattern: ^-{0,1}[0-9]{1,15}\.[0-9]{2}$
Explanation of the pattern:
^-{0,1} means at the beginning of the string there is zero or one minus sign, then
[0-9]{1,15} means one to fifteen digits 0 through 9, then
\. means a single decimal point, then
[0-9]{2}$ means two digits at the end of the string
String: -12345.88
String: 123456.8
String: 1234,567.88
String: 1234567v88
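The four strings above can be run against the pattern as follows. This is a minimal sketch; IsValid is a hypothetical helper name, not part of VBScript.

```vbscript
' Hypothetical helper: True if the whole string fits pattern_text.
Function IsValid(pattern_text, string_to_test)
    Dim regexpobj
    Set regexpobj = New RegExp
    regexpobj.Pattern = pattern_text
    regexpobj.IgnoreCase = False
    regexpobj.Global = False
    IsValid = regexpobj.Test(string_to_test)
End Function

Dim pattern_text, candidate, os
pattern_text = "^-{0,1}[0-9]{1,15}\.[0-9]{2}$"
os = ""
For Each candidate In Array("-12345.88", "123456.8", "1234,567.88", "1234567v88")
    os = os & candidate & " -> " & IsValid(pattern_text, candidate) & vbCrLf
Next
MsgBox os   ' True, False (one decimal digit), False (comma), False (v is not a period)
```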
A reasonable non-empty alphabetic string of no more than 20 characters
Pattern: ^[A-Za-z\s]{1,20}$
Explanation of the pattern:
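^[A-Za-z\s] means that, starting at the beginning of the string, only upper case letters, lower case letters and white space are allowed
{1,20}$ means from one to 20 such characters, running to the end of the string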
String: AbdcefgHIJK
String: AbdcefgHIJK3
String: A b d cefgHIJK
String: A-b-dcefgHIJK
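The same kind of check applies to the alphabetic pattern. This is a sketch; IsValid is a hypothetical helper name.

```vbscript
' Hypothetical helper: True if the whole string fits pattern_text.
Function IsValid(pattern_text, string_to_test)
    Dim regexpobj
    Set regexpobj = New RegExp
    regexpobj.Pattern = pattern_text
    regexpobj.IgnoreCase = False
    regexpobj.Global = False
    IsValid = regexpobj.Test(string_to_test)
End Function

Dim pattern_text, candidate, os
pattern_text = "^[A-Za-z\s]{1,20}$"
os = ""
For Each candidate In Array("AbdcefgHIJK", "AbdcefgHIJK3", "A b d cefgHIJK", "A-b-dcefgHIJK")
    os = os & candidate & " -> " & IsValid(pattern_text, candidate) & vbCrLf
Next
MsgBox os   ' True, False (digit), True (blanks allowed), False (hyphens)
```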
A PeopleSoft ID
Pattern: ^0{1}[0-9]{6}$
Explanation of the pattern:
^0{1} means the string begins with exactly one zero
[0-9]{6}$ means it ends with six digits
String: 0123456
String: 1234567
String: -123456
String: 0123
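A sketch of the ID check against the four strings above; IsValid is a hypothetical helper name.

```vbscript
' Hypothetical helper: True if the whole string fits pattern_text.
Function IsValid(pattern_text, string_to_test)
    Dim regexpobj
    Set regexpobj = New RegExp
    regexpobj.Pattern = pattern_text
    regexpobj.IgnoreCase = False
    regexpobj.Global = False
    IsValid = regexpobj.Test(string_to_test)
End Function

Dim pattern_text, candidate, os
pattern_text = "^0{1}[0-9]{6}$"
os = ""
For Each candidate In Array("0123456", "1234567", "-123456", "0123")
    os = os & candidate & " -> " & IsValid(pattern_text, candidate) & vbCrLf
Next
MsgBox os   ' True, False (no leading zero), False (minus sign), False (too short)
```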
MasterCard Number
Pattern: ^5[1-5][0-9]{14}$
A MasterCard number begins with two digits in the range 51 through 55 and has a total of 16 digits.
(Note that there is also a checksum digit process that requires further validation; see DISC 3371 Midterm Exam Problem 1 for the account number layout and Problem 3 for the checksum algorithm.)
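The format part of the check can be sketched as follows. IsValid is a hypothetical helper name and the card numbers are sample values chosen for illustration; the checksum step is not performed here.

```vbscript
' Hypothetical helper: True if the whole string fits pattern_text.
Function IsValid(pattern_text, string_to_test)
    Dim regexpobj
    Set regexpobj = New RegExp
    regexpobj.Pattern = pattern_text
    regexpobj.IgnoreCase = False
    regexpobj.Global = False
    IsValid = regexpobj.Test(string_to_test)
End Function

Dim pattern_text, candidate, os
pattern_text = "^5[1-5][0-9]{14}$"
os = ""
For Each candidate In Array("5105105105105100", "4111111111111111", _
                            "5610000000000000", "51051051051051")
    os = os & candidate & " -> " & IsValid(pattern_text, candidate) & vbCrLf
Next
MsgBox os   ' True, False (starts with 4), False (starts with 56), False (14 digits)
```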
An E-mail Address
Pattern: ^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$
This pattern is from Jan Goyvaerts' regular expression pages. Since I am using a case-sensitive regular expression I have modified the test to add a-z to the three character lists.
Explanation of the pattern:
^[A-Za-z0-9._%+-]+ must begin with a string of 1 or more characters from the list: A-Z, a-z, 0-9, period, underscore, percent, plus, hyphen (this is the user name)
@ one @ sign
[A-Za-z0-9.-]+ must contain a string of 1 or more characters from the list: A-Z, a-z, 0-9, period, and hyphen (this is the domain)
\. one period
[A-Za-z]{2,4}$ ends with a 2 to 4 character top level domain
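The pieces explained above can be assembled into one pattern and tried on a few addresses. This is a sketch under stated assumptions: the assembled pattern includes an escaped period (\.) between the domain and the top level domain, IsValid is a hypothetical helper name, and the addresses are made up for illustration.

```vbscript
' Hypothetical helper: True if the whole string fits pattern_text.
Function IsValid(pattern_text, string_to_test)
    Dim regexpobj
    Set regexpobj = New RegExp
    regexpobj.Pattern = pattern_text
    regexpobj.IgnoreCase = False
    regexpobj.Global = False
    IsValid = regexpobj.Test(string_to_test)
End Function

Dim pattern_text, candidate, os
' Assumption: the explained pieces joined in order, with \. before the top level domain.
pattern_text = "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$"
os = ""
For Each candidate In Array("student@example.edu", "student@example", _
                            "student.example.edu", "a@b.museum")
    os = os & candidate & " -> " & IsValid(pattern_text, candidate) & vbCrLf
Next
MsgBox os   ' True, False (no top level domain), False (no @), False (6-letter domain)
```

The last case shows a limit of the {2,4} rule: a valid address with a longer top level domain is rejected.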