Tag Archives: regular expression

regex: match all including new line

I was looking for a way to match any character, including newlines. Using "." on its own is not enough, because:

The dot matches a single character, without caring what that character is. The only exception are newline characters. In all regex flavors discussed in this tutorial, the dot will not match a newline character by default. So by default, the dot is short for the negated character class [^\n] (UNIX regex flavors) or [^\r\n] (Windows regex flavors).

source: http://www.regular-expressions.info/dot.html
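The default behaviour is easy to check. Here is a minimal sketch in Python (chosen only because it is quick to run; the sample text is made up for illustration). Without any flags, the dot stops at every line break:

```python
import re

# Made-up sample text containing two newline characters.
text = "you \n are \n good"

# Without flags, "." never matches "\n", so ".+" yields one match per line.
per_line = re.findall(r".+", text)
print(per_line)

# A single search therefore captures only the first line.
first = re.search(r"(.*)", text).group(1)
print(repr(first))
```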

The references below pointed me to \s, which, unlike the dot, does match newline characters:

\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

sources:
http://www.amk.ca/python/howto/regex/

http://www.phpro.org/tutorials/Introduction-to-PHP-Regex.html
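One way to use \s for this, sketched in Python under the assumption that you want a pattern that works without any modifier: combine \s with its complement \S, so the class covers every possible character.

```python
import re

# Made-up sample text containing two newline characters.
text = "you \n are \n good"

# \s matches the newline itself, along with spaces and tabs.
whitespace = re.findall(r"\s", text)

# [\s\S] means "whitespace or non-whitespace", i.e. any character at all,
# so this pattern runs across line breaks without any modifier.
everything = re.search(r"([\s\S]*)", text).group(1)
print(everything == text)
```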

Putting it together, the example below uses the s pattern modifier, which makes the dot match newlines as well:

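<?php
// Double quotes are needed so that \n is a real newline; in single quotes it stays a literal backslash and n.
$str = "you \n are \n good";
// The s modifier (PCRE_DOTALL) lets the dot match newlines, so $match[1] holds the whole string.
preg_match('/(.*)/s', $str, $match);

?>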

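[additions] After testing several examples, I found out that the s modifier alone is not enough for the example below: the greedy .* runs from the first <div> all the way to the last </div>. Adding the U (ungreedy) pattern modifier makes the match stop at the first closing tag instead, so the call becomes preg_match('/<div>(.*)<\/div>/sU', $a, $match);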

$a = "
<div>
<p>aaa</p>
</div>
<div>
<p>bbb</p>
</div>
<div>
<p>ccc</p>
</div>
";
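To see the difference between greedy and ungreedy matching on this input, here is a rough Python equivalent. Python's re module has no U modifier, so this sketch swaps in the lazy quantifier .*? as the closest counterpart, with re.DOTALL playing the role of s:

```python
import re

a = """
<div>
<p>aaa</p>
</div>
<div>
<p>bbb</p>
</div>
<div>
<p>ccc</p>
</div>
"""

# Greedy: with DOTALL, ".*" runs from the first <div> to the last </div>.
greedy = re.search(r"<div>(.*)</div>", a, re.DOTALL).group(1)

# Lazy: ".*?" stops at the first closing tag, so each div is matched on its own.
lazy = re.findall(r"<div>(.*?)</div>", a, re.DOTALL)
print(lazy)
```

The greedy capture swallows all three paragraphs in one go, while the lazy version returns them one by one.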

You can also check regular expressions list here.





regex: matching characters

I wanted to note these down because I always tend to forget the commonly used special sequences in regular expressions. I hope they help you too in one way or another.

Matching Characters

Most letters and characters will simply match themselves. For example, the regular expression test will match the string “test” exactly. (You can enable a case-insensitive mode that would let this RE match “Test” or “TEST” as well; more about this later.)
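A small sketch of literal matching and the case-insensitive mode, using Python's re.IGNORECASE flag (the sample strings are invented for illustration):

```python
import re

# Plain letters match themselves, and matching is case-sensitive by default.
exact = re.search(r"test", "a test string") is not None
no_match = re.search(r"test", "a TEST string") is None

# re.IGNORECASE is the case-insensitive mode mentioned above.
relaxed = re.search(r"test", "a TEST string", re.IGNORECASE) is not None
print(exact, no_match, relaxed)
```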

There are exceptions to this rule; some characters are special, and don’t match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the RE by repeating them. Much of this document is devoted to discussing various metacharacters and what they do.

Here’s a complete list of the metacharacters; their meanings will be discussed in the rest of this HOWTO.

. ^ $ * + ? { [ ] \ | ( )

The first metacharacters we'll look at are "[" and "]". They're used for specifying a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a "-". For example, [abc] will match any of the characters "a", "b", or "c"; this is the same as [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, your RE would be [a-z].

Metacharacters (except \) are not active inside classes. For example, [akm$] will match any of the characters "a", "k", "m", or "$"; "$" is usually a metacharacter, but inside a character class it's stripped of its special nature.

You can match the characters not within a range by complementing the set. This is indicated by including a "^" as the first character of the class; "^" elsewhere will simply match the "^" character. For example, [^5] will match any character except "5".
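The character-class rules above can be tried out with a few one-liners. This is only a sketch, and the input strings are arbitrary examples:

```python
import re

# [abc] and [a-c] describe the same set of characters.
listed = re.findall(r"[abc]", "cabbage")
ranged = re.findall(r"[a-c]", "cabbage")

# Inside a class, "$" is just a literal dollar sign.
dollar = re.findall(r"[akm$]", "make $5")

# A leading "^" complements the set; anywhere else it is a literal "^".
not_five = re.findall(r"[^5]", "1525")
caret_literal = re.findall(r"[5^]", "5^2")

print(listed, dollar, not_five, caret_literal)
```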

Perhaps the most important metacharacter is the backslash, "\". As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It's also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a "[" or "\", you can precede them with a backslash to remove their special meaning: \[ or \\.
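Escaping in practice, as a short sketch. The sample text is made up, and re.escape() is shown as a convenience for cases where the literal text is not known in advance:

```python
import re

# Raw string, so the backslash in the sample text is a real backslash.
text = r"path[0] = C:\temp"

# \[ matches a literal "[" and \\ matches a literal backslash.
brackets = re.findall(r"\[", text)
backslashes = re.findall(r"\\", text)

# re.escape() adds the backslash for you.
escaped = re.escape("[")
print(brackets, backslashes, escaped)
```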

Some of the special sequences beginning with "\" represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn't whitespace. The following predefined special sequences are available:

\d
Matches any decimal digit; this is equivalent to the class [0-9].

\D
Matches any non-digit character; this is equivalent to the class [^0-9].

\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

\S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].

\w
Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].

\W
Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].

These sequences can be included inside a character class. For example, [\s,.] is a character class that will match any whitespace character, or "," or ".".
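A quick sketch of the predefined sequences at work. The sample string is invented and plain ASCII; in Python 3, \d and \w also match non-ASCII digits and letters unless re.ASCII is given, so the bracketed equivalents hold here only because of the input:

```python
import re

sample = "Order 42, shipped 2024.\tOK"

# Runs of digits.
digits = re.findall(r"\d+", sample)

# Runs of "word" characters (letters, digits, underscore).
words = re.findall(r"\w+", sample)

# Split on any run of whitespace, "," or ".".
pieces = re.split(r"[\s,.]+", sample)

print(digits, words, pieces)
```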

The final metacharacter in this section is ".". It matches anything except a newline character, and there's an alternate mode (re.DOTALL) where it will match even a newline. "." is often used where you want to match "any character".
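The alternate mode can be switched on with the re.DOTALL flag or with the inline (?s) marker at the start of the pattern. A minimal sketch with a made-up two-line string:

```python
import re

text = "first line\nsecond line"

# Default mode: the dot stops at the newline.
default = re.match(r".*", text).group(0)

# DOTALL mode: the dot matches the newline too.
dotall = re.match(r".*", text, re.DOTALL).group(0)

# The same mode, enabled from inside the pattern.
inline = re.match(r"(?s).*", text).group(0)

print(repr(default), dotall == text, inline == text)
```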

source: http://www.amk.ca/python/howto/regex/





regex: backreferences

Backreferences

Perhaps the most powerful element of the regular expression syntax, backreferences allow you to load the results of a matched pattern into a buffer and then reuse it later in the expression.
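A classic illustration of reusing a match later in the same expression is finding a doubled word. This is only a sketch in Python, and the sentence is an invented example:

```python
import re

# (\w+) saves a word; \1 then demands the very same text again.
doubled = re.compile(r"\b(\w+)\s+\1\b")

m = doubled.search("Paris in the the spring")
found = m.group(0)     # the whole doubled pair
repeated = m.group(1)  # the saved word
print(found, repeated)
```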

In a previous example, we used two separate regular expressions to put something before and after a filename in a list of files. I mentioned at that point that it wasn’t entirely necessary that we use two lines. This is because backreferences allow us to get it down to one line. Here’s how:

s/\(blurfle[0-9][0-9]*\)/fraggelate \1 >>fraggled_files/

The key elements in this example are the parentheses and the "\1". Earlier we noted that parentheses can be used to limit the scope of a match. They can also be used to save a particular pattern into a temporary buffer. In this example, everything in the "search" half of the sed routine (the "blurfle" part) is saved into a buffer. In the "replace" half we recall the contents of that buffer back into the string by referring to its buffer number, in this case "\1". So, this sed routine will do precisely what the earlier one did: find blurfle followed by one or more digits and replace it with "fraggelate blurfle[some number] >>fraggled_files".
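The same substitution can be sketched in Python with re.sub(), where "\1" in the replacement string recalls the first group. The file listing is made up for illustration:

```python
import re

# Hypothetical list of file names, one per line.
listing = "blurfle1\nblurfle27\nnotes.txt"

# The parentheses save the matched name; \1 puts it back into the replacement.
commands = re.sub(
    r"(blurfle[0-9]+)",
    r"fraggelate \1 >>fraggled_files",
    listing,
)
print(commands)
```

Unlike sed without the g flag, re.sub() replaces every match it finds, not just the first one on each line.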

Backreferences allow for something that very few ordinary search engines can manage; namely, strings of data that change slightly from instance to instance. Page numbering schemes provide a perfect example of this. Suppose we had a document that numbered each page with the notation <page n="[some number]" id="[some chapter name]">. The number and the chapter name change from page to page, but the rest of the string stays the same. We can easily write a regular expression that matches on this string, but what if we wanted to match on it and then replace everything but the number and the chapter name?

s/<page n="\([0-9][0-9]*\)" id="\([A-Za-z][A-Za-z]*\)">/Page \1, Chapter \2/

Buffer number one ("\1") holds the first matched sequence, \([0-9][0-9]*\); buffer number two ("\2") holds the second, \([A-Za-z][A-Za-z]*\).
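For comparison, a Python sketch of the same two-buffer replacement. The tag value is an invented example, and the chapter name is a single word because the pattern only allows letters:

```python
import re

tag = '<page n="12" id="Introduction">'

# Group 1 captures the page number, group 2 the chapter name.
result = re.sub(
    r'<page n="([0-9]+)" id="([A-Za-z]+)">',
    r"Page \1, Chapter \2",
    tag,
)
print(result)
```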

Tools vary in the number of backreferences they can hold. The more common tools (like sed and grep) hold nine, but Python can hold up to ninety-nine. Perl is limited only by the amount of physical memory (which, for all practical purposes, means you can have as many as you want). Perl also exposes each buffer as an ordinary scalar variable ($1, $2, etc.) so you can use it later on in the code block.
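Python has no $1 or $2 variables, but the match object serves a similar purpose: the saved groups stay available after the match. A rough sketch, reusing the page-tag pattern with a made-up tag:

```python
import re

m = re.search(
    r'<page n="([0-9]+)" id="([A-Za-z]+)">',
    '<page n="7" id="Preface">',
)

# The groups can be pulled out and used like any other value.
page_number = int(m.group(1))
chapter = m.group(2)
print(page_number + 1, chapter.upper())
```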

More details here: http://etext.virginia.edu/services/helpsheets/unix/regex.html