One very useful function in particular is preg_replace(), which allows you to find occurrences of text in an advanced, customized way and replace them with a string of your choice. The search pattern can be as simple as a literal word or as elaborate as a full regular expression (REGEX). These regular expressions are like targeted wildcards, albeit MUCH more expressive. REGEX is used by many text editors, utilities, and programming languages to search and manipulate text based on patterns.
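To make that concrete, here is a minimal sketch of a preg_replace() call. The sample sentence and the replacement word are made up for illustration; the pattern sits between "/" delimiters and the trailing "i" makes it case insensitive.

```php
<?php
// Sample text invented for this example.
$text = "Color me surprised: the colour is gray, not GREY.";

// Replace "gray" or "grey", in any letter case, with "silver".
$result = preg_replace('/gr[ae]y/i', 'silver', $text);

echo $result, "\n";
```

Both spellings are replaced in one pass, which a plain str_replace() call could not do without listing every variant.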
Special Characters
| Characters | Functions |
| / | Delimits the pattern. Modifiers follow the closing delimiter, e.g. "/ /i" makes the pattern case insensitive. |
| \ | Escapes special characters, e.g. "\/" denotes a literal "/". In UTF-8 mode (the u modifier), "\p{Pe}" matches the Unicode Pe (closing punctuation) class of characters. |
| $ | The end of the line. In a replacement string, "$n" refers to the text captured by the nth pair of "()", counted from left to right. |
| ^ | The start of the line. As the first character inside "[ ]", "^" negates the character class. |
| | | Boolean "or". For example, gray|grey can match "gray" or "grey". |
| ? | The question mark indicates there is zero or one of the preceding element. For example, colou?r matches both "color" and "colour". |
| * | The asterisk indicates there are zero or more of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on. |
| + | The plus sign indicates that there is one or more of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac". |
| . | Matches any single character. Within POSIX bracket expressions, the dot character matches a literal dot. For example, a.c matches "abc", etc., but [a.c] matches only "a", ".", or "c". |
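The examples from the table can be checked with a short script. This is only a sketch: the test words are my own picks, and each pattern is wrapped in ^(?:...)$ (a non-capturing group between anchors) so that the whole word has to match rather than just a part of it.

```php
<?php
// Patterns from the table, each with a few sample words to try.
$patterns = [
    'gray|grey' => ['gray', 'grey', 'groy'],
    'colou?r'   => ['color', 'colour', 'colouur'],
    'ab*c'      => ['ac', 'abbc', 'adc'],
    'ab+c'      => ['ac', 'abc', 'abbbc'],
    'a.c'       => ['abc', 'a.c', 'ac'],
    '[a.c]'     => ['a', '.', 'b'],
];

$results = [];
foreach ($patterns as $pattern => $subjects) {
    foreach ($subjects as $subject) {
        // Anchors plus a non-capturing group force a whole-word match.
        $matched = preg_match('/^(?:' . $pattern . ')$/', $subject) === 1;
        $results[$pattern][$subject] = $matched;
        printf("%-10s %-8s %s\n", $pattern, $subject, $matched ? 'match' : 'no match');
    }
}
```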
Basic REGEX
Make no mistake - REGEX is widely used today, and even the wildcard searches in Microsoft Windows follow a similar idea. Let me point you towards a simple example:
*.* - This is a Windows wildcard, not REGEX; in Windows it means "find any file with any extension in a given directory". In PHP REGEX a leading "*" has nothing to repeat, so the pattern is rejected. The REGEX equivalent is .*\..* which means "zero or more characters followed by a literal dot followed by zero or more characters". Let us enhance that a little:
[A-Z]*\..* - The "[A-Z]" is a character class and it means any uppercase letter from A to Z. If you want to collect lowercase letters you would enter "[a-z]". If you would like to collect any letter, the obvious solution would be "[A-Za-z]".
TIP: If you want to check for a custom range of characters you could always use [g-p], etc.
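As a quick sketch of character classes in action, the script below pulls the matching letters out of a sample string with preg_match_all(). The sample string is made up for the example.

```php
<?php
// Sample text invented for this example.
$subject = "PHP regex Tip";

preg_match_all('/[A-Z]/', $subject, $upper);   // uppercase letters only
preg_match_all('/[a-z]/', $subject, $lower);   // lowercase letters only
preg_match_all('/[g-p]/', $subject, $custom);  // custom range: g through p

// Index 0 of each result holds the full matches, in order of appearance.
echo implode('', $upper[0]), "\n";
echo implode('', $lower[0]), "\n";
echo implode('', $custom[0]), "\n";
```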
Occurrence-Counting REGEX
A character class followed by a "*" means "zero or more characters from the selected character class". So this string: [a-z]* would mean "zero or more lowercase letters". If you need to check for at least one occurrence of a letter you would use:
[a-z]+ - A "+" basically means "one or more occurrences". You could also do:
[a-z]{1} - This means "exactly one occurrence of a lowercase letter". A range is written with a comma, so "two to three occurrences of a lowercase letter" would be: [a-z]{2,3}
If you want to check for an optional character you use the question mark (?), like this: [a-z]? - And the explanation of this line is "an optional lowercase character". Now that we have this covered, let's move on...
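Here is a small sketch that tries each quantifier against strings of growing length. The test strings are my own, and the ^ and $ anchors are added so that the quantified class has to cover the entire string. Note that the range form is written with a comma, {2,3}; a hyphenated {2-3} is not a quantifier and is read as literal text.

```php
<?php
// Subjects of increasing length, tested against [a-z] with each quantifier.
$subjects    = ['', 'a', 'ab', 'abc', 'abcd'];
$quantifiers = ['*', '+', '{1}', '{2,3}', '?'];

$accepted = [];
foreach ($quantifiers as $q) {
    // ^ and $ force the quantified class to cover the whole subject.
    $pattern = '/^[a-z]' . $q . '$/';
    $accepted[$q] = [];
    foreach ($subjects as $subject) {
        if (preg_match($pattern, $subject) === 1) {
            $accepted[$q][] = $subject;
        }
    }
    printf("%-6s accepts: %s\n", $q, json_encode($accepted[$q]));
}

// A hyphen is not range syntax: {2-3} is treated as the literal text "{2-3}".
$literal = preg_match('/^[a-z]{2-3}$/', 'a{2-3}');
$twoLetters = preg_match('/^[a-z]{2-3}$/', 'ab');
```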
Character-Counting REGEX
^(.){4,6}$ - In PHP REGEX the caret (^) symbol means the beginning of the line, and the dollar ($) symbol means the end of the line. By default the "line" is the whole subject string (a single trailing "\n" newline is tolerated before the end); with the m modifier, every "\n" ends a line. So this expression means "the start of the line followed by 4 to 6 characters of any kind followed by the end of the line". Yes, the dot (.) character means "any character" (except a newline, unless the s modifier is set). So the line: (.*) would mean "any amount of any character". The caret (^) can also be used for negating character classes. By negating I mean matching only characters that are NOT in the specified range. So a string like ^[^0-9]*$ would mean "start of the line followed by zero or more characters that are NOT digits followed by the end of the line".
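The two anchored patterns can be tried out like this. It is a rough sketch with test strings of my own choosing; note that the length range is written with a comma, {4,6}.

```php
<?php
$lengthPattern = '/^(.){4,6}$/';  // 4 to 6 characters of any kind
$noDigits      = '/^[^0-9]*$/';   // nothing but non-digits (may be empty)

// Map each test string to true/false depending on whether it matches.
$lengthResults = [];
foreach (['abc', 'abcd', 'abcdef', 'abcdefg'] as $subject) {
    $lengthResults[$subject] = preg_match($lengthPattern, $subject) === 1;
}

$digitResults = [];
foreach (['hello', 'hello2', ''] as $subject) {
    $digitResults[$subject] = preg_match($noDigits, $subject) === 1;
}

var_dump($lengthResults, $digitResults);
```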
The Zen of Brackets
By now you have probably noticed all the different brackets that are used. All of them have a different meaning. Let me explain:
The parentheses "(" and ")" are used to group expressions together. If you use preg_replace, you can refer back to each group in the replacement with a simple "$n", where n is the group's position counted from left to right. So, if you want to extract the text from the second group in this: ^([a-z]+)[A-Z]?([0-5]{1,3})$ you would use "$2" (the first group is ([a-z]+) and the second is ([0-5]{1,3})). And, of course, the usual translation of the string to human language is "the start of the line followed by one or more lowercase letters followed by an optional uppercase letter followed by 1 to 3 digits, each no higher than 5, followed by the end of the line".
The curly brackets "{" and "}" represent the widely used minimum/maximum values. As explained earlier, they can be used to further customize checking for characters in a string instead of the usual "one or more" or "zero or more". Syntax would be: {n} for exactly n occurrences, e.g. {1}; {n,} for n or more, e.g. {2,}; or {n,m} for no fewer than n and no more than m occurrences, e.g. {3,7}
And finally, of course, there are the square brackets "[" and "]". These represent a character class, which was also explained earlier. The syntax for a range inside one is: [a-b] where a is the range start and b is the range end, e.g. [A-Z]
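To see all three kinds of brackets working together, here is a minimal sketch built on the grouped pattern from above (with the digit range written as {1,3}). The input strings are invented for the example.

```php
<?php
// Two capture groups: lowercase letters, then 1 to 3 digits from 0 to 5.
$pattern = '/^([a-z]+)[A-Z]?([0-5]{1,3})$/';

// "$2" in the replacement stands for the text of the second group.
$second  = preg_replace($pattern, '$2', 'abcX123');
// Groups can be reordered freely in the replacement.
$swapped = preg_replace($pattern, '$2-$1', 'abcX123');
// 6, 7 and 8 fall outside [0-5], so nothing matches and the input is returned as-is.
$noMatch = preg_replace($pattern, '$2', 'abc678');

// preg_match() exposes the same groups through its third argument.
preg_match($pattern, 'abcX123', $groups);

echo $second, "\n", $swapped, "\n", $noMatch, "\n";
```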
Of course, you don't have to use all REGEX for a string. You can also check for occurrences of words in a more advanced way. If, for example, you would like to search for a string containing the word "military" followed by an optional digit followed by the end of the line, you would write something like this: [Mm]ilitary[0-9]?$ Take note that the "[Mm]" is also a character class - it matches either character in the brackets. You can use all kinds of characters in your searches, but if you want to use a special character (e.g. a bracket) you will need to escape it using the all-saving backslash (\). This is, of course, the rule for PHP in general anyway! So, for example, if you want to search for "[word]" you would write the REGEX like this: \[word\]
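A short sketch of both ideas follows: the "military" pattern, and escaping brackets so they are matched literally. The test strings are made up. PHP's built-in preg_quote() can add the backslashes for you, which is handy when the text to search for comes from a variable.

```php
<?php
// Word with an optional trailing digit at the end of the string.
$military = '/[Mm]ilitary[0-9]?$/';
$oneDigit  = preg_match($military, 'the military7');   // one digit is allowed
$twoDigits = preg_match($military, 'the military77');  // two digits are not

// Escaped by hand: the backslashes make the brackets literal.
$manual = '/\[word\]/';
// preg_quote() does the escaping; the second argument also escapes the delimiter.
$quoted = '/' . preg_quote('[word]', '/') . '/';

$literalHit = preg_match($manual, 'a [word] in brackets');
// Without escaping, [word] is a character class matching w, o, r or d.
$classHit = preg_match('/[word]/', 'bird');

var_dump($oneDigit, $twoDigits, $manual === $quoted, $literalHit, $classHit);
```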
Commonly Used Examples
Now that we have all the advanced theory out of the way, here are some frequently used reference REGEX expressions found in popular PHP-driven scripts:
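\[b\](.*?)\[\/b\] - What you see here is REGEX used to search for text enclosed in [b] and [/b] tags. The brackets are escaped so they match literally, the slash is escaped because "/" is also the pattern delimiter, and the "?" after ".*" makes the match stop at the first closing tag. This is used very widely among forums, news systems of all kinds, etc.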
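Here is roughly how a forum script might turn such tags into HTML. This is a sketch, not taken from any particular forum software: the sample post and the choice of HTML bold tags as output are assumptions. The brackets and the slash are backslash-escaped so they match literally, and the second call shows what goes wrong without the lazy "?".

```php
<?php
// Sample post invented for this example.
$post = 'This is [b]bold[/b] and so is [b]this[/b].';

// Lazy (.*?) stops at the first [/b], so each tag pair is handled separately.
$lazy = preg_replace('/\[b\](.*?)\[\/b\]/', '<b>$1</b>', $post);

// Greedy (.*) runs on to the LAST [/b] and swallows the tags in between.
$greedy = preg_replace('/\[b\](.*)\[\/b\]/', '<b>$1</b>', $post);

echo $lazy, "\n", $greedy, "\n";
```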
^[0-9A-Za-z]{8,15}$ - This could be used in scripts that utilise registration with passwords. This REGEX only accepts a string made up of digits and letters with a minimum of 8 and a maximum of 15 characters. The "^" and "$" anchors are what make it reject longer strings.
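A registration script could wrap that check in a small function. The sketch below assumes a helper name of my own (isValidPassword) and adds ^ and $ anchors plus the D modifier, which stops "$" from tolerating a trailing newline; without the anchors, a longer string that merely contains 8 valid characters would pass.

```php
<?php
// Hypothetical helper, not part of PHP: true for 8 to 15 letters/digits only.
function isValidPassword(string $candidate): bool
{
    // D modifier: "$" matches only at the very end, not before a final "\n".
    return preg_match('/^[0-9A-Za-z]{8,15}$/D', $candidate) === 1;
}

var_dump(isValidPassword('abc12345'));          // 8 characters
var_dump(isValidPassword('short1'));            // too short
var_dump(isValidPassword('abcdefgh12345678'));  // 16 characters, too long
var_dump(isValidPassword('pass word1'));        // contains a space
```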
| POSIX | Perl | ASCII | Description |
| [:alnum:] | | [A-Za-z0-9] | Alphanumeric characters |
| [:word:] | \w | [A-Za-z0-9_] | Alphanumeric characters plus "_" |
| | \W | [^\w] | non-word character |
| [:alpha:] | | [A-Za-z] | Alphabetic characters |
| [:blank:] | | [ \t] | Space and tab |
| [:cntrl:] | | [\x00-\x1F\x7F] | Control characters |
| [:digit:] | \d | [0-9] | Digits |
| | \D | [^\d] | non-digit |
| [:graph:] | | [\x21-\x7E] | Visible characters |
| [:lower:] | | [a-z] | Lowercase letters |
| [:print:] | | [\x20-\x7E] | Visible characters and spaces |
| [:punct:] | | [-!"#$%&'()*+,./:;<=>?@\[\\\]_`{|}~] | Punctuation characters |
| [:space:] | \s | [ \t\r\n\v\f] | Whitespace characters |
| | \S | [^\s] | non-whitespace character |
| [:upper:] | | [A-Z] | Uppercase letters |
| [:xdigit:] | | [A-Fa-f0-9] | Hexadecimal digits |
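The three notations in the table are interchangeable for plain ASCII text, as this small sketch shows. The sample string is invented. One thing to keep in mind in PHP: a POSIX class only works inside a bracket expression, so it is written [[:digit:]], not [:digit:] on its own.

```php
<?php
// Sample text invented for this example.
$subject = "Order #42, ref A7";

preg_match_all('/[[:digit:]]+/', $subject, $posix);  // POSIX class inside [ ]
preg_match_all('/\d+/', $subject, $perl);            // Perl shorthand
preg_match_all('/[0-9]+/', $subject, $ascii);        // explicit ASCII range

// \s covers spaces, tabs and newlines alike; collapse each run to one space.
$collapsed = preg_replace('/\s+/', ' ', "a \t\n b");

print_r($posix[0]);
var_dump($posix[0] === $perl[0], $perl[0] === $ascii[0], $collapsed);
```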
Last Modification: Dec 17 2009, 04:26 PM