
How-To Validate fields or extract information using regular expressions
by IdanB 06 Jun 2009

This guide/tutorial was made to answer a few of the questions asked by Come2Daddy.
It will give you the tools of the trade for validating fields when processing HTML forms (before saving the information to the DB). More generally, it explains how to "analyze" a block of text and "extract" the information we need from it.

So, let's begin.

Intro:
A regular expression, also known as "regex" or "regexp" for short, is a combination of special chars & letters that describes a "search pattern".
Regular expressions are a big area (entire books have been written about them). In this tutorial I'll focus on the PHP functions, since the goal of this guide is to give you the tools for using regexps inside vBulletin plugins.

PHP has several functions that handle regular expressions.
We'll cover the commonly used search functions:

preg_match
PHP Code:
int preg_match ( pattern, subject [, matches [, flags [, offset]]] )
This function performs Perl-style pattern matching on a string.

ereg/eregi
PHP Code:
int ereg ( pattern, string [, regs ] )
int eregi ( pattern, string [, regs ] )
Both functions perform a regular expression match.
ereg is "Case Sensitive" & eregi is "Case Insensitive".
Note: these functions are easier to handle than preg_match, and this guide will focus mainly on them, since the Perl-style regexp pattern is more complex to understand. Be aware that ereg/eregi were deprecated in PHP 5.3 and removed in PHP 7, so on newer PHP versions preg_match must be used instead.
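
For readers who need preg_match() in place of ereg()/eregi(), the change is small: the pattern is wrapped in delimiters (here /), and case-insensitive matching is requested with the "i" modifier after the closing delimiter. A minimal sketch, with a made-up test string:

```php
<?php
$text = "Hello world";

// Counterpart of ereg("^hello", $text): case-sensitive.
// preg_match() returns 1 on a match, 0 on no match, false on error.
$case_sensitive = preg_match('/^hello/', $text);

// Counterpart of eregi("^hello", $text): the i modifier ignores case.
$case_insensitive = preg_match('/^hello/i', $text);

var_dump($case_sensitive, $case_insensitive);
```

The first call finds nothing because "H" is not "h"; the second one matches.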

Before we can get hands-on, we need to explain how to create regular expression patterns. Note: I won't be listing all of the pattern types, only the most common ones. As said previously, regexp is a big field, and I don't want too many bushes hiding what I want people to learn, if you know what I mean...

Text Patterns:
  1. Literal Characters:
    The most basic pattern: it consists of one or several literal characters that match the exact text, as-is.
    So for example, if I make the pattern "o" and the string I'm searching is "bob going out", the pattern can match at 3 places (1: 2nd letter of "bob", 2: 2nd letter of "going", 3: 1st letter of "out").
    This type of pattern doesn't set any conditions; it matches the character combo exactly as written.
  2. Character Set:
    When we want a pattern that will match char "x", char "y" or char "z", we can write it as "[xyz]". Square brackets mean "one of the characters" inside the given set.
    Example:
    The pattern [pb]each will match the word "beach" and also the word "peach".
    Note: a character set allows us to use character ranges (English chars only).
    So [a-z] will match any lower case letter from a to z. The same can be done with upper case letters: [A-Z]. We can also use number ranges: [0-9].
  3. The Dot Match - Any Character (almost):
    The dot will match any char except the "line break". A line break is "\n" on linux/unix and "\r\n" on windows ("carriage return" & "line feed": there is some long history behind these chars; anyone interested is welcome to read more at this link: https://en.wikipedia.org/wiki/Newline )
    Pattern example: ".*" will match any chars in the string (for * see "Repetition" below).
  4. Anchors:
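    With anchors we match a position in the string rather than a character.
    Examples:
    Start of string: ^
    End of string: $
    End of line: $ (in Perl-style patterns with the "m" modifier, $ also matches at the end of each line)
    Note about the ^ char:
    When used as the first char inside a character set, it has the meaning of NOT.
    For example: the pattern "[^a-z]" will match anything that isn't a lower case letter.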
  5. Alternation:
    Done with the special "pipe" character: |
    It is similar to a Character Set, but allows an OR selection between complete words.
    So the pattern "foo|bar|zoo" will match one of the words "foo", "bar" or "zoo".
  6. Repetition:
    Sometimes we want to state how many times a pattern may occur: "one time or more", "only x to y times" or "optional" (zero or one time).
    Repetition comes in several forms:
    -> one or more: + (plus char). e.g.: "[a-zA-Z]+" will match any word made of alphabet letters (both lower & upper case), from a single letter word up to any length.
    -> zero or one (optional): ? (question mark). e.g.: "colou?r" will match both colour and color.
    -> zero or more (no upper limit): * (asterisk char). e.g.: ".*" will match all chars in the string, up to a line break.
    -> only x times: [a-zA-Z]{4} will match exactly 4 letters in a row (add anchors, e.g. ^[a-zA-Z]{4}$, if the whole string must be a 4 letter word).
    -> only x to y times: [a-zA-Z]{4,6} will match any 4-6 letter word.
  7. Grouping:
    When we want to group a certain part of the pattern, so that the text it matched is returned to us later, we use round brackets ( ). Groups are numbered from left to right, starting at 1.
  8. Special Char Escaping:
    If we want to locate a char that is a special char (such as . or ^ or *) we can escape it with a backslash (\). So to find a literal dot we use this pattern: "\."
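
The pattern types above can be tried out quickly. The following is only a sketch: it uses preg_match() with / delimiters (since ereg() is not available on newer PHP versions), and the test strings are made up for illustration.

```php
<?php
// Each entry: pattern => array(subject, expected preg_match() result).
$checks = array(
    '/o/'               => array('bob going out', 1), // literal character
    '/^[pb]each$/'      => array('peach', 1),         // character set
    '/^[^a-z]+$/'       => array('ABC123', 1),        // negated set
    '/^(foo|bar|zoo)$/' => array('bar', 1),           // alternation
    '/^colou?r$/'       => array('color', 1),         // optional "u"
    '/^[a-zA-Z]{4,6}$/' => array('regexps', 0),       // 7 letters: too long
    '/^3\.14$/'         => array('3x14', 0),          // escaped dot is literal
);

$results = array();
foreach ($checks as $pattern => $check) {
    list($subject, $expected) = $check;
    $results[$pattern] = preg_match($pattern, $subject);
    echo $pattern, ' on "', $subject, '": ', $results[$pattern], "\n";
}

// preg_match() stops at the first match; preg_match_all() counts all of
// them, e.g. every "o" in the literal character example.
$o_count = preg_match_all('/o/', 'bob going out', $all_matches);
echo "o's found: ", $o_count, "\n";
```

Note that the patterns are written in single quotes, so PHP passes the backslash in "\." through to the regexp engine untouched.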

Common & useful patterns:
  1. IP MATCHING:
    We can use either a simple pattern that matches an IP (but may accept an invalid IP, like 999.999.999.999) or a more complex regexp that verifies each octet isn't bigger than 255.
    Simple regexp (with some flaw):
    PHP Code:
    ^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$ 
    More complex version (does not match an octet bigger than 255):
    PHP Code:
    ^(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$ 
    Note: complex regexps can be found on the web; one just needs to know what to look for.
  2. EMAIL ADDRESS MATCHING:
    The next regexp checks for a properly formatted email address (note that the {2,4} at the end rejects longer top-level domains, such as .museum):
    PHP Code:
    ^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$ 
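
Both patterns can be wrapped into small helper functions. This is a sketch only: the function names are hypothetical, and preg_match() is used so that it runs on current PHP versions.

```php
<?php
// Hypothetical helpers; the patterns are the ones listed above,
// wrapped in / delimiters for preg_match().
function is_valid_ip($ip)
{
    // One octet: 250-255, 200-249, or 0-199 (leading zeros allowed).
    $octet = '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)';
    $pattern = '/^' . $octet . '\.' . $octet . '\.' . $octet . '\.' . $octet . '$/';
    return preg_match($pattern, $ip) === 1;
}

function is_valid_email($email)
{
    $pattern = '/^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$/';
    return preg_match($pattern, $email) === 1;
}

var_dump(is_valid_ip('192.168.1.10'));        // valid address
var_dump(is_valid_ip('999.999.999.999'));     // octets too big
var_dump(is_valid_email('idan@example.com')); // accepted
var_dump(is_valid_email('user@site.museum')); // rejected: 6-letter TLD
```

PHP also ships filter_var() with FILTER_VALIDATE_IP and FILTER_VALIDATE_EMAIL, which may be a better fit than a hand-written pattern when only validation is needed.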

Note: the list of regex possibilities is endless, and almost every field or pattern needs close examination to avoid "pitfalls".
Some sites offer listings of such common patterns. Here are references to 2 of them:
http://www.regular-expressions.info/examples.html
http://regexlib.com/DisplayPatterns.aspx

php & vbulletin usage examples:
As mentioned at the beginning of this guide, I will mainly focus on the ereg() & eregi() functions, as they are quite easy to use.

Fetch field based on pattern (and manipulate data in plugin code):
Let's assume we have the following string:
"Welcome back Idan"
(where Idan is my name, and "Welcome back" is constant text)
Suppose I want to extract my name from the above string. I could use the following pattern: "^Welcome back ([a-zA-Z]+)$". In code this would look like this:
PHP Code:
$sample_test_string = "Welcome back Idan";
if (ereg("^Welcome back ([a-zA-Z]+)$", $sample_test_string, $matches_array))
{
     // index 0 holds the whole match, index 1 the first ( ) group
     $my_name = $matches_array[1];

     // do anything you want with matched value
}

This (very) simple example showed how to fetch a specific field out of a known pattern.
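
On PHP versions without ereg(), the same extraction can be sketched with preg_match(). The only changes are the / delimiters around the pattern and the function name:

```php
<?php
$sample_test_string = "Welcome back Idan";
$my_name = null;

// Same pattern as above, wrapped in / delimiters for preg_match().
if (preg_match('/^Welcome back ([a-zA-Z]+)$/', $sample_test_string, $matches_array)) {
    // $matches_array[0] holds the whole match, [1] the first ( ) group.
    $my_name = $matches_array[1];
}

echo $my_name, "\n";
```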
In vBulletin we can find various usage examples that can help us when we want to use regular expressions.
Let's assume we want to extract certain information from a given template, or even analyze data from it.

I'm referring to another mod I've coded that uses regular expressions, called "Alternate Last Post" (Alternate Last Post Display). In that mod I used a regular expression pattern search to match date fields, so I could calculate a specific time from them.
In that mod I run a pattern search on $lastpostinfo[lastpostdate] and $lastpostinfo[lastposttime] to break them down into parts, in a way that gives the day, month, year, hour, minutes, seconds, etc., and with them allows any code manipulation later on.
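
To give an idea of how such a breakdown could look, here is a rough sketch. It is not the code from that mod: the date and time formats below are assumptions (the real values depend on the forum's date/time settings), and preg_match() is used in place of ereg().

```php
<?php
// Assumed sample values, for illustration only.
$lastpostdate = '06-06-2009'; // assumed dd-mm-yyyy
$lastposttime = '10:45 PM';   // assumed 12-hour clock

$parts = array();
if (preg_match('/^([0-9]{2})-([0-9]{2})-([0-9]{4})$/', $lastpostdate, $d)
    && preg_match('/^([0-9]{1,2}):([0-9]{2}) (AM|PM)$/', $lastposttime, $t)) {
    // Convert to a 24-hour clock: 12 AM -> 0, 12 PM -> 12.
    $hour = ((int) $t[1]) % 12;
    if ($t[3] === 'PM') {
        $hour += 12;
    }
    $parts = array(
        'day'    => (int) $d[1],
        'month'  => (int) $d[2],
        'year'   => (int) $d[3],
        'hour'   => $hour,
        'minute' => (int) $t[2],
    );
    // mktime(hour, minute, second, month, day, year) builds a timestamp
    // that can be used for any later calculation.
    $timestamp = mktime($parts['hour'], $parts['minute'], 0,
                        $parts['month'], $parts['day'], $parts['year']);
}
print_r($parts);
```

Each ( ) group in the patterns becomes one element of $d or $t, which is what makes the later calculations possible.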

Require Field Validation (or redirect with error example):
If we want to redirect with an error message based on a regexp pattern check of the field, we can put the following code into a plugin hook:
PHP Code:
if (!(ereg($pattern_to_search, $field_string, $matches_array)))
{
     // if we got here, we didn't have a match
     // redirect with error
     eval(standard_error(fetch_error('my_error_phrase')));
}

This will make the page redirect with the error.
Be sure to first add the error phrase under "Phrase Manager" in the AdminCP.
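
The check itself can be tried outside vBulletin. In this sketch the helper name and the field rule (3 to 15 letters, digits or underscores) are made up for illustration, and preg_match() stands in for ereg():

```php
<?php
// Hypothetical helper: true when $field_string matches $pattern_to_search.
function field_matches($pattern_to_search, $field_string)
{
    // preg_match() returns false on a broken pattern; treat it as no match.
    return preg_match($pattern_to_search, $field_string) === 1;
}

// Assumed rule: 3 to 15 letters, digits or underscores.
$pattern_to_search = '/^[a-zA-Z0-9_]{3,15}$/';

foreach (array('Idan_B', 'x', 'bad name!') as $field_string) {
    if (!field_matches($pattern_to_search, $field_string)) {
        // Inside a vBulletin plugin this is where the redirect would go:
        // eval(standard_error(fetch_error('my_error_phrase')));
        echo "rejected: $field_string\n";
    } else {
        echo "accepted: $field_string\n";
    }
}
```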

One more common "solution" that regular expressions give us is "remote data fetching": open a remote page and, assuming its layout is fixed & known to us, extract the information we need from it using a pattern search.
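
A small sketch of that idea follows. The page content is a made-up stand-in so the example runs on its own; in practice the string could come from something like file_get_contents() on the remote URL.

```php
<?php
// Stand-in for a fetched page (hypothetical content).
$page = "<html><head><title>Forum Stats</title></head>\n"
      . "<body><p>Members online: 42</p></body></html>";

$title = null;
$online = null;

// ([^<]+) captures everything up to the next "<", i.e. the title text.
if (preg_match('/<title>([^<]+)<\/title>/i', $page, $m)) {
    $title = $m[1];
}

// ([0-9]+) captures the number that follows the known label.
if (preg_match('/Members online: ([0-9]+)/', $page, $m)) {
    $online = (int) $m[1];
}

echo $title, " / ", $online, "\n";
```

This only works as long as the remote page keeps the same layout, so it is worth handling the "no match" case in real code.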

Non-English Character Matching:
The "character set" block (square brackets) allows us to match characters, but since the simple ranges shown above cover English chars only, one trick is to match by character code (ASCII or Unicode), like this:
ASCII matching can be performed like this: ([^\x00-\x80]+)
Unicode matching can be performed like this: [^\u0000-\u0080]+ (this \uXXXX notation comes from other regexp flavors such as JavaScript; PHP's preg functions write it as \x{0080} together with the "u" modifier)
In our case, to match only non-English chars use: ([^\x00-\x80]+)
To match ALL chars (English, non-English and some non-printable chars as well) use: ([a-zA-Z\x00-\xFF]+)
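
With preg_match() this can be sketched as below. It assumes the strings are UTF-8 encoded; the sample text is made up for illustration.

```php
<?php
// Assumes this file and its strings are saved as UTF-8.
$text = "Hello Привет world";

// With the u modifier the pattern and subject are treated as UTF-8, and
// code points are written as \x{...}. This grabs the first run of chars
// above the ASCII range (U+0000 to U+007F).
$non_ascii = null;
if (preg_match('/[^\x{0000}-\x{007F}]+/u', $text, $m)) {
    $non_ascii = $m[0];
}

// \p{L} matches a letter in any script.
$letters_only = preg_match('/^\p{L}+$/u', 'Привет');

echo $non_ascii, "\n";
```

Without the u modifier the engine works byte by byte, which is why the byte ranges shown above are needed there.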

That's it for this tutorial.
I really hope it helped to build some basic regexp knowledge, with a few references to go with it, that you can use when coding your own modifications.
