Archive for the ‘regex’ Category

regex cheat sheet

Tuesday, March 9th, 2010 by dreamluverz




Special Sequences
  • \w - Any “word” character (a-z, A-Z, 0-9, _)
  • \W - Any non-“word” character
  • \s - Whitespace (space, tab, CR, LF)
  • \S - Any non-whitespace character
  • \d - Digits (0-9)
  • \D - Any non digit character
  • . - (Period) – Any character except newline

Meta Characters

  • ^ - Start of subject (or line in multiline mode)
  • $ - End of subject (or line in multiline mode)
  • [ - Start character class definition
  • ] - End character class definition
  • | - Alternation, e.g. (a|b) matches a or b
  • ( - Start subpattern
  • ) - End subpattern
  • \ - Escape character

Quantifiers

  • n* - Zero or more of n
  • n+ - One or more of n
  • n? - Zero or one occurrences of n
  • {n} - n occurrences exactly
  • {n,} - At least n occurrences
  • {n,m} - Between n and m occurrences (inclusive)
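A quick way to check the quantifiers above is to run each one against a few short strings. This is only a sketch using Python's re module (an assumption on my part; the posts here use PHP's preg_* functions, but the quantifier syntax is the same). The helper name matches() and the sample strings are made up for illustration.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True when the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Each entry: quantifier pattern -> results for the sample strings in the comment
star = [matches(r"ab*", s) for s in ("a", "ab", "abbb")]               # zero or more b
plus = [matches(r"ab+", s) for s in ("a", "ab", "abbb")]               # one or more b
optional = [matches(r"ab?", s) for s in ("a", "ab", "abb")]            # zero or one b
exactly_two = [matches(r"ab{2}", s) for s in ("ab", "abb", "abbb")]    # exactly two b
at_least_two = [matches(r"ab{2,}", s) for s in ("ab", "abb", "abbbb")] # two or more b
one_to_two = [matches(r"ab{1,2}", s) for s in ("a", "ab", "abb", "abbb")]

print(star, plus, optional, exactly_two, at_least_two, one_to_two)
```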

Pattern Modifiers

  • i - Case Insensitive
  • m - Multiline mode – ^ and $ match start and end of lines
  • s - Dotall – . class includes newline
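  • x - Extended – whitespace in the pattern is ignored and # comments are allowed
  • e - preg_replace only – evaluates the replacement as PHP code (deprecated in PHP 5.5, removed in PHP 7.0; use preg_replace_callback instead)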
  • S - Extra analysis of pattern
  • U - Pattern is ungreedy
  • u - Pattern and subject are treated as UTF-8
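For comparison, here is a rough sketch of how the i, m, s and x modifiers look in Python's re module, where they are passed as flags rather than appended after the closing delimiter. This is an illustration under that assumption, not PHP code; as far as I know, Python has no direct equivalent of the e, S or U modifiers. The sample text is made up.

```python
import re

text = "First line\nsecond LINE"

# i -> re.I: case-insensitive matching
case_insensitive = re.findall(r"line", text, re.I)

# m -> re.M: ^ and $ match at the start and end of every line
line_starts = re.findall(r"^\w+", text, re.M)

# s -> re.S (DOTALL): the dot also matches newline
with_dotall = re.search(r"First.*LINE", text, re.S) is not None
without_dotall = re.search(r"First.*LINE", text) is not None

# x -> re.X (VERBOSE): whitespace is ignored and # starts a comment
verbose = re.search(
    r"""
    (\w+)   # a word, captured
    \s+     # one or more whitespace characters
    line    # the literal text "line"
    """,
    text,
    re.X | re.I,
)

print(case_insensitive, line_starts, with_dotall, without_dotall, verbose.group(1))
```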

Point based assertions

  • \b - Word boundary
  • \B - Not a word boundary
  • \A - Start of subject
  • \Z - End of subject or newline at end
  • \z - End of subject
  • \G - First matching position in subject
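The sketch below tries out \b, \B and \A in Python's re module (assumed here for illustration; the list above describes PCRE as used by PHP). One difference worth knowing: Python's \Z matches only at the very end of the string, which corresponds to PCRE's \z, and Python's re has no \G. The sample strings are made up.

```python
import re

text = "cat concat cats"

# \b: "cat" as a whole word only
whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \B: "cat" at the end of a longer word (no word boundary before it)
word_suffix = [m.start() for m in re.finditer(r"\Bcat\b", text)]

lines = "cat\ncow"
# ^ in multiline mode matches at every line start, \A only at the start of the subject
caret_multiline = re.findall(r"^c\w+", lines, re.M)
start_of_subject = re.findall(r"\Ac\w+", lines, re.M)

# Python's \Z is strict (like PCRE \z); $ still allows a trailing newline
strict_end = re.search(r"cow\Z", "cat\ncow\n") is not None
dollar_end = re.search(r"cow$", "cat\ncow\n") is not None

print(whole_word, word_suffix, caret_multiline, start_of_subject, strict_end, dollar_end)
```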

Assertions

  • (?=) - Positive look ahead assertion: foo(?=bar) matches foo when followed by bar
  • (?!) - Negative look ahead assertion: foo(?!bar) matches foo when not followed by bar
  • (?<=) - Positive look behind assertion: (?<=foo)bar matches bar when preceded by foo
  • (?<!) - Negative look behind assertion: (?<!foo)bar matches bar when not preceded by foo
  • (?>) - Once-only subpatterns: (?>\d+)bar does not backtrack into the digits, which saves time when bar is not present
  • (?(x)) - Conditional subpatterns: (?(3)foo|fu)bar matches foo if the 3rd subpattern has matched, fu if not
  • (?#) - Comment: (?# Pattern does x y or z)
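The look ahead and look behind examples from the list can be checked with a few lines of Python (again assuming the re module rather than PHP). The conditional pattern at the end is my own hypothetical example: it accepts a word with matching angle brackets or with none at all. Once-only subpatterns are left out because older Python versions do not support them.

```python
import re

def starts(pattern: str, text: str) -> list:
    """Start offsets of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

# Look ahead: only the foo of "foobar" / only the foo of "foobaz"
positive_ahead = starts(r"foo(?=bar)", "foobar foobaz")
negative_ahead = starts(r"foo(?!bar)", "foobar foobaz")

# Look behind: only the bar of "foobar" / only the bar of "bazbar"
positive_behind = starts(r"(?<=foo)bar", "foobar bazbar")
negative_behind = starts(r"(?<!foo)bar", "foobar bazbar")

# Conditional: require ">" only if group 1 (the opening "<") matched
conditional = r"(<)?\w+(?(1)>)"
tag_results = [re.fullmatch(conditional, s) is not None
               for s in ("<tag>", "tag", "<tag", "tag>")]

print(positive_ahead, negative_ahead, positive_behind, negative_behind, tag_results)
```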

source: http://www.phpro.org/tutorials/Introduction-to-PHP-Regex.html

regex: match all including new line

Tuesday, March 9th, 2010 by dreamluverz

I was looking for a way to match all characters including newlines, and “.” alone is not the answer because:

The dot matches a single character, without caring what that character is. The only exception are newline characters. In all regex flavors discussed in this tutorial, the dot will not match a newline character by default. So by default, the dot is short for the negated character class [^\n] (UNIX regex flavors) or [^\r\n](Windows regex flavors).

source: http://www.regular-expressions.info/dot.html

So I found some answers in the references below, using \s:

\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

sources:
http://www.amk.ca/python/howto/regex/

http://www.phpro.org/tutorials/Introduction-to-PHP-Regex.html

Putting it all together: the simplest fix is the s (dotall) pattern modifier, which makes “.” match newlines as well. Example:

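<?php
// Double quotes, so that \n becomes a real newline character
$str = "you \n are \n good";
// The s (dotall) modifier lets . match newlines too
preg_match('/(.*)/s', $str, $match);
print_r($match);
?>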

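[additions] After testing several examples I found out that the s modifier alone is not enough for the example below: .* is greedy, so it runs all the way to the last </div>. Adding the U (ungreedy) pattern modifier stops the match at the first </div>, so it would be like preg_match('/<div>(.*)<\/div>/sU', $a, $match);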

$a = "
<div>
<p>aaa</p>
</div>
<div>
<p>bbb</p>
</div>
<div>
<p>ccc</p>
</div>
";
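The greedy versus ungreedy difference is easy to see side by side. The sketch below uses Python's re module instead of PHP (my assumption, for illustration): Python has no U modifier, so the lazy quantifier .*? plays that role, and re.S stands in for the s modifier.

```python
import re

html = """
<div>
<p>aaa</p>
</div>
<div>
<p>bbb</p>
</div>
<div>
<p>ccc</p>
</div>
"""

# Without dotall, "." stops at each newline, so nothing matches
no_dotall = re.findall(r"<div>(.*)</div>", html)

# Dotall + greedy: one match that runs from the first <div> to the last </div>
greedy = re.findall(r"<div>(.*)</div>", html, re.S)

# Dotall + lazy (.*?): one match per <div>...</div> pair
lazy = [inner.strip() for inner in re.findall(r"<div>(.*?)</div>", html, re.S)]

print(len(no_dotall), len(greedy), lazy)
```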

You can also check regular expressions list here.





regex: matching character +

Friday, May 22nd, 2009 by dreamluverz

regex matching character “+”

Example:
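ad+
matches an “a” followed by one or more “d”: it will match ad, add and the “add” in addng, but not ab or ac or a. In adD it matches only the “ad” part, unless the i (case-insensitive) modifier is used.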
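Here is a small check of ad+ against those words, sketched in Python's re module (an assumption; the same pattern behaves the same way in preg_match). The helper name first_match() is made up for this example.

```python
import re

def first_match(pattern: str, text: str, flags: int = 0):
    """Return the text of the first match, or None when there is no match."""
    m = re.search(pattern, text, flags)
    return m.group(0) if m else None

words = ["ad", "add", "adD", "addng", "ab", "ac", "a"]
found = {w: first_match(r"ad+", w) for w in words}

# With the case-insensitive flag the trailing "D" is matched as well
found_ignore_case = first_match(r"ad+", "adD", re.I)

print(found, found_ignore_case)
```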

Related article: http://dreamluverz.com/developers-tools/regex/regex-matching-characters

preg_match case insensitive

Thursday, May 21st, 2009 by dreamluverz

preg_match case insensitive
<?php
// The "i" after the pattern delimiter indicates a case-insensitive search
if (preg_match("/php/i", "PHP is the web scripting language of choice.")) {
    echo "A match was found.";
} else {
    echo "A match was not found.";
}
?>

regex: matching characters

Monday, May 4th, 2009 by dreamluverz

I wanted to note it down coz I always tend to forget these commonly used special sequences in regular expressions. I hope it will also help you in one way or another.

Matching Characters

Most letters and characters will simply match themselves. For example, the regular expression test will match the string “test” exactly. (You can enable a case-insensitive mode that would let this RE match “Test” or “TEST” as well; more about this later.)

There are exceptions to this rule; some characters are special, and don’t match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the RE by repeating them. Much of this document is devoted to discussing various metacharacters and what they do.

Here’s a complete list of the metacharacters; their meanings will be discussed in the rest of this HOWTO.

. ^ $ * + ? { [ ] \ | ( )

The first metacharacters we’ll look at are “[” and “]”. They’re used for specifying a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a “-”. For example, [abc] will match any of the characters “a”, “b”, or “c”; this is the same as [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, your RE would be [a-z].

Metacharacters are not active inside classes. For example, [akm$] will match any of the characters “a”, “k”, “m”, or “$”; “$” is usually a metacharacter, but inside a character class it’s stripped of its special nature.

You can match the characters not within a range by complementing the set. This is indicated by including a “^” as the first character of the class; “^” elsewhere will simply match the “^” character. For example, [^5] will match any character except “5”.
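The character class rules above can be tried out directly. A minimal sketch in Python's re module, with sample strings made up for illustration:

```python
import re

# A listed set and the equivalent range
listed = re.findall(r"[abc]", "a-k-c")
ranged = re.findall(r"[a-c]", "a-k-c")

# "$" is an ordinary character inside a class
dollar_in_class = re.findall(r"[akm$]", "k$ 9")

# "^" first in the class negates it; anywhere else it is a literal "^"
not_five = re.findall(r"[^5]", "1552")
literal_caret = re.findall(r"[5^]", "5^6")

print(listed, ranged, dollar_in_class, not_five, literal_caret)
```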

Perhaps the most important metacharacter is the backslash, “\”. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a “[” or “\”, you can precede them with a backslash to remove their special meaning: \[ or \\.

Some of the special sequences beginning with "\" represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn't whitespace. The following predefined special sequences are available:

\d
Matches any decimal digit; this is equivalent to the class [0-9].

\D
Matches any non-digit character; this is equivalent to the class [^0-9].

\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

\S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].

\w
Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].

\W
Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].

These sequences can be included inside a character class. For example, [\s,.] is a character class that will match any whitespace character, or “,” or “.”.

The final metacharacter in this section is “.”. It matches anything except a newline character, and there’s an alternate mode (re.DOTALL) where it will match even a newline. “.” is often used where you want to match “any character”.
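To see the predefined sequences and the dot in action, here is a short sketch in Python's re module. Note that for Python 3 text strings \d, \w and \s also match non-ASCII digits, letters and whitespace unless re.ASCII is given, so the equivalences listed above are exact only for ASCII input. The sample strings are made up.

```python
import re

digits = re.findall(r"\d+", "tel 555 1234")
words = re.findall(r"\w+", "snake_case, CamelCase!")

# A predefined sequence used inside a character class
separators = re.findall(r"[\s,.]", "a, b.c")

# The dot skips newlines by default and matches them with re.DOTALL
dot_default = re.findall(r".", "a\nb")
dot_all = re.findall(r".", "a\nb", re.DOTALL)

# [\s\S] is a flag-free way to say "any character including newline"
any_char = re.findall(r"[\s\S]", "a\nb")

print(digits, words, separators, dot_default, dot_all, any_char)
```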

source: http://www.amk.ca/python/howto/regex/

regex: backreferences

Monday, April 27th, 2009 by dreamluverz

Backreferences

Perhaps the most powerful element of the regular expression syntax, backreferences allow you to load the results of a matched pattern into a buffer and then reuse it later in the expression.

In a previous example, we used two separate regular expressions to put something before and after a filename in a list of files. I mentioned at that point that it wasn’t entirely necessary that we use two lines. This is because backreferences allow us to get it down to one line. Here’s how:

s/\(blurfle[0-9]+\)/fraggelate \1 >>fraggled_files/

The key elements in this example are the parentheses and the “\1”. Earlier we noted that parentheses can be used to limit the scope of a match. They can also be used to save a particular pattern into a temporary buffer. In this example, everything in the “search” half of the sed routine (the “blurfle” part) is saved into a buffer. In the “replace” half we recall the contents of that buffer back into the string by referring to its buffer number. In this case, buffer “\1”. So, this sed routine will do precisely what the earlier one did: find all the instances of blurfle followed by one or more digits and replace each with “fraggelate blurfle[some number] >>fraggled_files”.

Backreferences allow for something that very few ordinary search engines can manage; namely, strings of data that change slightly from instance to instance. Page numbering schemes provide a perfect example of this. Suppose we had a document that numbered each page with the notation <page n="[some number]" id="[some chapter name]">. The number and the chapter name change from page to page, but the rest of the string stays the same. We can easily write a regular expression that matches on this string, but what if we wanted to match on it and then replace everything but the number and the chapter name?

s/<page n="\([0-9]+\)" id="\([A-Za-z]+\)">/Page \1, Chapter \2/

Buffer number one (“\1”) holds the first matched sequence, ([0-9]+); buffer number two (“\2”) holds the second, ([A-Za-z]+).

Tools vary in the number of backreferences they can hold. The more common tools (like sed and grep) hold nine, but Python can hold up to ninety-nine. Perl is limited only by the amount of physical memory (which, for all practical purposes, means you can have as many as you want). Perl also lets you assign the buffer number to an ordinary scalar variable ($1, $2, etc.) so you can use it later on in the code block.
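The two sed substitutions above translate fairly directly to Python's re.sub. This is only a sketch of the same idea, not the original sed commands: in Python the capturing parentheses are written unescaped. The sample input strings are hypothetical, and the last pattern (finding a doubled word) is my own example of a backreference used inside the search pattern itself.

```python
import re

# \1 in the replacement recalls the text captured by the first group
fraggled = re.sub(r"(blurfle[0-9]+)",
                  r"fraggelate \1 >>fraggled_files",
                  "blurfle7")

# Two groups, recalled as \1 and \2
page = re.sub(r'<page n="([0-9]+)" id="([A-Za-z]+)">',
              r"Page \1, Chapter \2",
              '<page n="12" id="Introduction">')

# A backreference inside the pattern: the same word twice in a row
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")

print(fraggled)
print(page)
print(doubled.group(0))
```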

More details here: http://etext.virginia.edu/services/helpsheets/unix/regex.html