Tag Archives: regex

regex cheat sheet





Special Sequences
  • \w - Any “word” character (a-z, A-Z, 0-9, _)
  • \W - Any non “word” character
  • \s - Whitespace (space, tab, CR, LF)
  • \S - Any non whitespace character
  • \d - Digits (0-9)
  • \D - Any non digit character
  • . - (Period) – Any character except newline
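
A minimal PHP sketch of the sequences above, assuming the default (non-UTF-8) mode. The sample string is made up purely for illustration.

```php
<?php
// Hypothetical sample string; the \t inside double quotes is a real tab.
$subject = "Order 42 shipped\tto room_7!";

// \d+ : runs of digits
preg_match_all('/\d+/', $subject, $digits);
// \w+ : runs of "word" characters (letters, digits, underscore)
preg_match_all('/\w+/', $subject, $words);
// \s : split on any single whitespace character (space, tab, newline, ...)
$parts = preg_split('/\s/', $subject);

print_r($digits[0]); // 42, 7
print_r($words[0]);  // Order, 42, shipped, to, room_7
print_r($parts);     // Order, 42, shipped, to, room_7!
```

Note that the "!" survives in the split result but not in the \w+ matches, because it is neither whitespace nor a word character.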

Meta Characters

  • ^ - Start of subject (or line in multiline mode)
  • $ - End of subject (or line in multiline mode)
  • [ - Start character class definition
  • ] - End character class definition
  • | - Alternation, e.g. (a|b) matches a or b
  • ( - Start subpattern
  • ) - End subpattern
  • \ - Escape character
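
To see several of these meta characters working together, here is a small sketch. The file names and the pattern are assumptions invented for the example.

```php
<?php
// Hypothetical file names used only for illustration.
$files = ['report.txt', 'report.csv', 'report_txt', 'my report.txt'];

// ^ and $ anchor the whole subject, [a-z]+ is a character class,
// \. is an escaped (literal) dot, and (txt|csv) is an alternation
// inside a subpattern.
$pattern = '/^[a-z]+\.(txt|csv)$/';

$matched = [];
foreach ($files as $file) {
    if (preg_match($pattern, $file, $m)) {
        $matched[$file] = $m[1]; // $m[1] holds the captured extension
    }
}
print_r($matched); // report.txt => txt, report.csv => csv
```

The last two names are rejected because "_" and the space are not in the [a-z] class and the anchors leave no room for extra characters.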

Quantifiers

  • n* - Zero or more of n
  • n+ - One or more of n
  • n? - Zero or one occurrences of n
  • {n} - n occurrences exactly
  • {n,} - At least n occurrences
  • {n,m} - Between n and m occurrences (inclusive)
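
A quick way to check how each quantifier behaves is to run it against a subject that sits right on the edge. The matches() helper below is hypothetical, written just for this sketch.

```php
<?php
// Hypothetical helper: true when $pattern matches $subject.
function matches(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

$results = [
    matches('/^ab*c$/', 'ac'),        // true:  * allows zero b's
    matches('/^ab+c$/', 'ac'),        // false: + needs at least one b
    matches('/^colou?r$/', 'color'),  // true:  ? makes the u optional
    matches('/^\d{4}$/', '2024'),     // true:  exactly four digits
    matches('/^\d{2,}$/', '7'),       // false: fewer than two digits
    matches('/^\d{2,3}$/', '1234'),   // false: more than three digits
];
var_dump($results);
```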

Pattern Modifiers

  • i - Case Insensitive
  • m - Multiline mode – ^ and $ match start and end of lines
  • s - Dotall – . class includes newline
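  • x - Extended - whitespace in the pattern is ignored and # comments are allowed
  • e - preg_replace only - evaluates the replacement as PHP code (deprecated in PHP 5.5, removed in PHP 7.0; use preg_replace_callback instead)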
  • S - Extra analysis of pattern
  • U - Pattern is ungreedy
  • u - Pattern is treated as UTF-8
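
Here is a small sketch of the i, m and x modifiers in action. The three-line sample text is an assumption made up for the example.

```php
<?php
$text = "first line\nSecond line\nthird line";

// Without m, ^ only matches at the very start of the subject.
$plain = preg_match_all('/^\w+/', $text, $a);
// With m, ^ also matches right after each newline.
$multi = preg_match_all('/^\w+/m', $text, $b);

// i ignores case; x lets the pattern carry whitespace and # comments.
$pattern = '/
    ^second   # the word at the start of a line
    \s+line   # whitespace must be written as \s here, then the word line
/imx';
$found = preg_match($pattern, $text, $c);

echo "$plain $multi $found {$c[0]}\n";
```

With x enabled, a literal space in the pattern is ignored, which is why the gap between the two words is written as \s+.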

Point based assertions

  • \b - Word boundary
  • \B - Not a word boundary
  • \A - Start of subject
  • \Z - End of subject or newline at end
  • \z - End of subject
  • \G - First matching position in subject
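
The difference between these assertions is easiest to see side by side. A minimal sketch, with sample strings made up for illustration:

```php
<?php
// \b keeps "cat" from matching inside "concatenate" or "cats".
$wholeWords = preg_match_all('/\bcat\b/', 'cat concatenate cats cat.');

$subject = "done\n"; // note the trailing newline

// \Z matches at the very end or just before a final newline;
// \z matches only at the very end.
$bigZ   = preg_match('/done\Z/', $subject);
$smallZ = preg_match('/done\z/', $subject);

// In multiline mode ^ matches after every newline,
// but \A still means the start of the whole subject.
$caret = preg_match('/^two/m', "one\ntwo");
$bigA  = preg_match('/\Atwo/m', "one\ntwo");

echo "$wholeWords $bigZ $smallZ $caret $bigA\n";
```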

Assertions

  • (?=) - Positive look ahead assertion foo(?=bar) matches foo when followed by bar
  • (?!) - Negative look ahead assertion foo(?!bar) matches foo when not followed by bar
  • (?<=) - Positive look behind assertion (?<=foo)bar matches bar when preceded by foo
  • (?<!) - Negative look behind assertion (?<!foo)bar matches bar when not preceded by foo
  • (?>) - Once-only subpatterns (?>\d+)bar Performance enhancing when bar not present
  • (?(x)) - Conditional subpatterns
  • (?(3)foo|fu)bar - Matches foo if 3rd subpattern has matched, fu if not
  • (?#) - Comment (?# Pattern does x y or z)
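
The look ahead and look behind assertions check the surrounding text without making it part of the match. A short sketch, using sample strings I made up for illustration:

```php
<?php
$subject = 'foobar foobaz';

// foo(?=bar): "foo" only when followed by "bar"; "bar" is not part of the match.
preg_match_all('/foo(?=bar)/', $subject, $ahead, PREG_OFFSET_CAPTURE);
// foo(?!bar): "foo" only when NOT followed by "bar".
preg_match_all('/foo(?!bar)/', $subject, $notAhead, PREG_OFFSET_CAPTURE);

$prices = 'cost $15, qty 15';
// (?<=\$)\d+ : digits preceded by a literal dollar sign.
$afterDollar = preg_replace('/(?<=\$)\d+/', 'N', $prices);
// (?<!\$)\b\d+ : whole numbers NOT preceded by a dollar sign.
$noDollar = preg_replace('/(?<!\$)\b\d+/', 'N', $prices);

print_r($ahead[0]);    // "foo" at offset 0
print_r($notAhead[0]); // "foo" at offset 7
echo $afterDollar, "\n", $noDollar, "\n";
```

The \b in the last pattern matters: without it the engine could start in the middle of "15" (at the "5", which is preceded by "1" rather than "$") and replace part of the price.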

source: http://www.phpro.org/tutorials/Introduction-to-PHP-Regex.html

regex: match all including new line

I was looking for a way to match any character, including newlines. Using “.” on its own is not enough because:

The dot matches a single character, without caring what that character is. The only exception are newline characters. In all regex flavors discussed in this tutorial, the dot will not match a newline character by default. So by default, the dot is short for the negated character class [^\n] (UNIX regex flavors) or [^\r\n] (Windows regex flavors).

source: http://www.regular-expressions.info/dot.html

The references below point to one answer: \s, which, unlike the dot, does match newlines.

\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

sources:
http://www.amk.ca/python/howto/regex/

http://www.phpro.org/tutorials/Introduction-to-PHP-Regex.html

Putting it all together, the example below uses the s (dotall) pattern modifier so that . also matches newlines:

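<?php
// Double quotes, so that \n is a real newline rather than a backslash and an n
$str = "you \n are \n good";
// The s (dotall) modifier lets . match newlines as well
preg_match('/(.*)/s', $str, $match);

?>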
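
To make the difference visible, here is a small sketch that runs the same pattern with and without the s modifier. The third variant, the [\s\S] class, is my own addition (an assumption, not something taken from the sources above) showing how \s can be combined with \S to get the same effect without a modifier.

```php
<?php
$str = "you \n are \n good";

// No modifier: . stops at the first newline.
preg_match('/(.*)/', $str, $noS);
// s modifier: . matches newlines too, so the whole string is captured.
preg_match('/(.*)/s', $str, $withS);
// [\s\S] means "whitespace or non-whitespace", i.e. any character at all.
preg_match('/([\s\S]*)/', $str, $cls);

var_dump($noS[1]);            // "you " (up to the first newline)
var_dump($withS[1] === $str); // true
var_dump($cls[1] === $str);   // true
```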

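[additions] After testing several examples I found that the s modifier alone is not enough for the string below: .* is greedy, so it matches everything from the first <div> to the last </div>. Adding the U (ungreedy) pattern modifier fixes this, so the call becomes preg_match('/<div>(.*)<\/div>/sU', $a, $match);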

$a = "
<div>
<p>aaa</p>
</div>
<div>
<p>bbb</p>
</div>
<div>
<p>ccc</p>
</div>
";
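
Here is a sketch of what the U modifier changes for this string. It assumes the intended pattern starts with a literal <div>, and it uses preg_match_all (my choice, so that the number of matches is visible) rather than preg_match.

```php
<?php
$a = "
<div>
<p>aaa</p>
</div>
<div>
<p>bbb</p>
</div>
<div>
<p>ccc</p>
</div>
";

// s only: .* is greedy, so a single match runs from the first <div> to the last </div>.
$greedy = preg_match_all('/<div>(.*)<\/div>/s', $a, $g);
// s plus U: .* becomes ungreedy, so every <div> block is matched on its own.
$lazy = preg_match_all('/<div>(.*)<\/div>/sU', $a, $u);
// Same result without U: make only this quantifier lazy by writing .*?
$lazyQ = preg_match_all('/<div>(.*?)<\/div>/s', $a, $q);

echo "$greedy $lazy $lazyQ\n";
print_r(array_map('trim', $u[1])); // <p>aaa</p>, <p>bbb</p>, <p>ccc</p>
```

The .*? form is often easier to live with than U, because U flips the greediness of every quantifier in the pattern at once.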

You can also check regular expressions list here.





preg_match case insensitive


<?php
// The "i" after the pattern delimiter indicates a case-insensitive search
if (preg_match("/php/i", "PHP is the web scripting language of choice.")) {
    echo "A match was found.";
} else {
    echo "A match was not found.";
}
?>

regex: backreferences

Backreferences

Perhaps the most powerful element of the regular expression syntax, backreferences allow you to load the results of a matched pattern into a buffer and then reuse it later in the expression.

In a previous example, we used two separate regular expressions to put something before and after a filename in a list of files. I mentioned at that point that it wasn’t entirely necessary that we use two lines. This is because backreferences allow us to get it down to one line. Here’s how:

s/\(blurfle[0-9][0-9]*\)/fraggelate \1 >>fraggled_files/

The key elements in this example are the parentheses and the “\1”. Earlier we noted that parentheses can be used to limit the scope of a match. They can also be used to save a particular pattern into a temporary buffer. In this example, everything in the “search” half of the sed routine (the “blurfle” part) is saved into a buffer. In the “replace” half we recall the contents of that buffer back into the string by referring to its buffer number. In this case, buffer “\1”. So, this sed routine will do precisely what the earlier one did: find blurfle followed by one or more digits and replace it with “fraggelate blurfle[some number] >>fraggled_files”.

Backreferences allow for something that very few ordinary search engines can manage; namely, strings of data that change slightly from instance to instance. Page numbering schemes provide a perfect example of this. Suppose we had a document that numbered each page with the notation <page n="[some number]" id="[some chapter name]">. The number and the chapter name change from page to page, but the rest of the string stays the same. We can easily write a regular expression that matches on this string, but what if we wanted to match on it and then replace everything but the number and the chapter name?

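s/<page n="\([0-9][0-9]*\)" id="\([A-Za-z][A-Za-z]*\)">/Page \1, Chapter \2/

Buffer number one (“\1”) holds the first matched sequence, \([0-9][0-9]*\); buffer number two (“\2”) holds the second, \([A-Za-z][A-Za-z]*\). In sed's basic syntax a bare + is a literal plus sign, so “one or more” is spelled out here by repeating the class before the *.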

Tools vary in the number of backreferences they can hold. The more common tools (like sed and grep) hold nine, but Python can hold up to ninety-nine. Perl is limited only by the amount of physical memory (which, for all practical purposes, means you can have as many as you want). Perl also lets you assign the buffer number to an ordinary scalar variable ($1, $2, etc.) so you can use it later on in the code block.
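
The same idea carries over to PHP's preg functions. This is a rough sketch rather than a translation of the sed commands above: in PCRE the parentheses are not escaped, and the replacement string refers to the buffers as $1 and $2. The sample strings are made up for illustration.

```php
<?php
// Hypothetical page marker in the notation described above.
$line = '<page n="12" id="Introduction">';

// ([0-9]+) fills buffer 1, ([A-Za-z]+) fills buffer 2;
// the replacement recalls them as $1 and $2.
$out = preg_replace(
    '/<page n="([0-9]+)" id="([A-Za-z]+)">/',
    'Page $1, Chapter $2',
    $line
);
echo $out, "\n"; // Page 12, Chapter Introduction

// A backreference can also be used inside the pattern itself:
// \1 must repeat exactly what group 1 captured, which finds a doubled word.
preg_match('/\b(\w+) \1\b/', 'this is is a test', $m);
echo $m[0], "\n"; // is is
```

The leading \b in the second pattern is what stops the "is" at the end of "this" from pairing up with the following "is".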

More details here: http://etext.virginia.edu/services/helpsheets/unix/regex.html