Introduction to regular expressions

A brief introduction to regular expressions, with emphasis on their application in the MobilChange system environment. MobilChange uses regular expressions in many places, for example in routing configurations for outgoing and incoming messages. This document describes the basics of using regular expressions.

Introduction
Special Characters
How to Find Special Characters?
More Complex Expressions

Introduction--What is a regular expression?

A regular expression (referred to as regexp hereinafter) is essentially an extension of the traditional "asterisk convention" from MS DOS. It lets you write more complex expressions and, more importantly, pick out parts of the matched text and use them for further processing.

Special Characters

In the traditional MS DOS system, the asterisk convention defines two special characters: an asterisk (any number of arbitrary characters) and a question mark (a single arbitrary character). Regexp has more characters of this kind:

. - dot: A dot represents one arbitrary character. The function is identical to the DOS question mark.
* - asterisk: Any number (even zero) of repetitions of the previous character (or group of characters). The equivalent of the DOS asterisk is therefore written as .* (any number of repetitions of any character).
+ - plus: Any non-zero number of repetitions of the previous character (or group of characters). It behaves like an asterisk, but the searched character or character sequence must occur at least once.
? - question mark: The previous character (or group of characters) is optional: it may or may not occur.
AHOY? will match AHOY and AHO.
[] - square brackets: Square brackets give a more detailed definition of a single character. They list the characters permitted at a given position, either by enumeration or as a range:
[123] permits 1, 2 or 3 at a given position
[2-6] permits one of the digits from 2 to 6 at a given position
One pair of brackets can contain several conditions, e.g.
[2-6ABCG-L] permits at this position the digits 2 to 6 or one of the letters A B C G H I J K L
^ - circumflex: A circumflex has two meanings. At the beginning of a regexp, it specifies that the regexp must match right from the beginning of the tested text (without it, a match found in the middle of the text also counts). As the first character inside square brackets, it specifies that the characters listed after it must not appear at the given position. The regexp [^a] thus says that any character except "a" can be at the given position.
$ - dollar: A dollar character at the end of a regexp is the counterpart of the circumflex at its beginning: it says that the regexp must match right up to the end of the tested text.
() - round brackets: Round brackets determine a "sub expression". It can be applied for further processing and it can also be used for more complex searches:
bana(na)* matches bana, banana, bananana, banananana etc.
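
The special characters above can be tried out with a short script. This is only a sketch that assumes Python's re module; the regexp engine used by MobilChange is not named in this document and may differ in details. The names full_match and CASES are made up for this illustration.

```python
import re


def full_match(pattern, text):
    """Return True if the whole text matches the pattern (Python re dialect)."""
    return re.fullmatch(pattern, text) is not None


# Each entry: (pattern, text, expected result of matching the whole text).
CASES = [
    (r"A.C", "ABC", True),              # dot: exactly one arbitrary character
    (r"A.C", "AC", False),
    (r"AB*C", "AC", True),              # asterisk: zero or more repetitions of B
    (r"AB+C", "AC", False),             # plus: at least one B is required
    (r"AB+C", "ABBBC", True),
    (r"AHOY?", "AHO", True),            # question mark: the Y is optional
    (r"AHOY?", "AHOY", True),
    (r"[2-6ABCG-L]", "H", True),        # set combining ranges and single characters
    (r"[2-6ABCG-L]", "7", False),
    (r"[^a]", "b", True),               # negated set: anything except "a"
    (r"[^a]", "a", False),
    (r"bana(na)*", "bananana", True),   # the group (na) is repeated as a unit
    (r"bana(na)*", "banan", False),
]

for pattern, text, expected in CASES:
    assert full_match(pattern, text) == expected, (pattern, text)

# Anchors: re.search looks anywhere in the text unless ^ or $ pins the match.
assert re.search(r"na", "banana") is not None    # found in the middle
assert re.search(r"^na", "banana") is None       # "banana" does not start with "na"
assert re.search(r"na$", "banana") is not None   # "banana" ends with "na"
```

Matching the whole text keeps the checks unambiguous; the anchor examples at the end use a search instead, because that is where ^ and $ make a difference.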

How to Find Special Characters?

If you need to look for a special character itself, just prefix it with "\" (a backslash). Inside square brackets, ^ works as a special character (and, as shown above, - denotes a range); the other special characters do not have to be prefixed by \ there.
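
A minimal sketch of escaping, again assuming Python's re module rather than the engine actually used by MobilChange. The helper re.escape belongs to Python's standard library and is shown only as a convenience; the sample text in the variable literal is made up.

```python
import re

# A literal dot must be escaped; an unescaped dot matches any character.
assert re.fullmatch(r"1\.5", "1.5") is not None
assert re.fullmatch(r"1\.5", "1x5") is None
assert re.fullmatch(r"1.5", "1x5") is not None

# Inside square brackets, characters such as *, + and . are taken literally.
assert re.fullmatch(r"[*+.]", "*") is not None
assert re.fullmatch(r"[*+.]", "a") is None

# re.escape prefixes the special characters of an arbitrary literal text.
literal = "#pizza (large)?"
escaped = re.escape(literal)
assert re.fullmatch(escaped, literal) is not None
```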

More Complex Expressions

All of the constructs described above can be combined to create more complex expressions. Several examples are listed below:

^#pizza (.*)$: This regexp matches all texts beginning with #pizza followed by one space. All following characters (to the end of the text) are stored in sub expression no. 1, which can be used for further processing.
^#([^ ]+) (.*)$: This regexp matches all texts beginning with # followed by any sequence of characters, one space and another arbitrary sequence of characters. The result of the sub expression [^ ]+ (a sequence of any characters up to a space) will be stored as sub expression no. 1; the result of the second sub expression .* (a sequence of any characters) will be stored as sub expression no. 2.
^42060[34][0-9]+,^4616$: It will find all T-Mobile CZ network numbers: 420603xxxxxx, 420604ccccccc, 4616. However, it will not find 420602zzzzzzz (2 at the sixth position does not comply with the expression [34]) or 420603 (the expression [0-9]+ is not fulfilled).
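
The examples above can be sketched in Python as follows. Several things here are assumptions rather than MobilChange behaviour: Python's re module stands in for the actual engine, the function names split_command and is_tmobile_cz are invented, and the comma in the last example is read as separating two alternative expressions.

```python
import re


def split_command(text):
    """Split '#keyword rest' into (sub expression 1, sub expression 2), or None."""
    match = re.search(r"^#([^ ]+) (.*)$", text)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Assumption: the comma-separated entry is a list of two separate expressions.
TMOBILE_CZ_PATTERNS = [r"^42060[34][0-9]+", r"^4616$"]


def is_tmobile_cz(number):
    """Return True if any of the patterns is found in the number."""
    return any(re.search(p, number) for p in TMOBILE_CZ_PATTERNS)


print(split_command("#pizza margherita with olives"))
# → ('pizza', 'margherita with olives')
print(is_tmobile_cz("420603123456"), is_tmobile_cz("420602123456"))
# → True False
```

A text without the leading # or without a space after the keyword yields None, because the whole pattern has to match before any sub expression is available.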

Examples of searching

Expression	What is found?
Praha.	String Praha followed by one arbitrary character
Prah.	Strings Praha, Prahy, Prahu etc.
Prah.*	Strings Praha, Prahy, Prahu, ..., Prahou etc.
^Praha	String Praha at the beginning of the text being compared.
Praha$	String Praha at the end of the text being compared.
^Praha$	String Praha as the exact, complete value of the text being compared.
P[rR][aA][hH][aA]	Praha with letters in either case from the second to the fifth position.
[a-zA-Z]	Any single letter of the English alphabet.
[A-Z]*	String of any length (including empty) consisting only of capital letters of the English alphabet.
[A-Z][A-Z]*	String (at least one character long) consisting only of capital letters of the English alphabet.
[A-Z].*	String (at least one character long) starting with a capital letter; the other characters can be anything.
part[0-9]	Values part0, part1, ..., part9.
[Rr][Ff][Cc][0-9].*\.[Hh][Tt][Mm].*	Request for an RFC document: the file name must begin with the string "RFC" in any letter case, followed by one digit and then any other characters. After the dot comes the extension "htm", which may be extended (to "html", for example).
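
Some of the rows above can be checked with a short script. It is a sketch under two assumptions: Python's re module stands in for the engine used by MobilChange, and an expression counts as found when it occurs anywhere in the compared text. The helper name matching and the sample words are made up.

```python
import re


def matching(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere in the text."""
    return [text for text in candidates if re.search(pattern, text)]


words = ["Praha", "Prahy", "Prahou", "PRAHA", "Velka Praha", "Praha 1"]

print(matching(r"^Praha", words))             # → ['Praha', 'Praha 1']
print(matching(r"Praha$", words))             # → ['Praha', 'Velka Praha']
print(matching(r"^Praha$", words))            # → ['Praha']
print(matching(r"P[rR][aA][hH][aA]", words))  # → ['Praha', 'PRAHA', 'Velka Praha', 'Praha 1']
print(matching(r"part[0-9]", ["part7", "partX", "part12"]))  # → ['part7', 'part12']
```

Note that part[0-9] is also found in part12, because nothing in the expression forbids further characters after the digit; adding $ at the end would exclude it.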