regex: backreferences

Backreferences

Perhaps the most powerful element of the regular expression syntax, backreferences allow you to load the results of a matched pattern into a buffer and then reuse it later in the expression.

In a previous example, we used two separate regular expressions to put something before and after a filename in a list of files. I mentioned at that point that it wasn’t entirely necessary that we use two lines. This is because backreferences allow us to get it down to one line. Here’s how:





s/\(blurfle[0-9]+\)/fraggelate \1 >>fraggled_files/





The key elements in this example are the parentheses and the “\1″. Earlier we noted that parentheses can be used to limit the scope of a match. They can also be used to save a particular pattern into a temporary buffer. In this example, everything in the “search” half of the sed routine (the “blurfle” part) is saved into a buffer. In the “replace” half we recall the contents of that buffer back into the string by referring to its buffer number. In this case, buffer “\1″. So, this sed routine will do precisely what the earlier one did: find all the instances of blurfle followed by a number between zero and nine and replace it with “fragellate blurfle[some number] >>fraggled files”.

Backreferences allow for something that very few ordinary search engines can manage; namely, strings of data that change slightly from instance to instance. Page numbering schemes provide a perfect example of this. Suppose we had a document that numbered each page with the notation <page n=”[some number]” id n=”[some chapter name]“>. The number and the chapter name change from page to page, but the rest of the string stays the same. We can easily write a regular expression that matches on this string, but what if we wanted to match on it and then replace everything but the number and the chapter name?

s/<page n="\([0-9]+\)" id="\([A-Za-z]+\)">/Page \1, Chapter \2/

Buffer number one (“\1″) holds the first matched sequence, ([0-9]+); buffer number two (“\2″) holds the second, ([A-Za-z]+).

Tools vary in the number of backreference they can hold. The more common tools (like sed and grep) hold nine, but Python can hold up to ninety-nine. Perl is limited only by the amount of physical memory (which, for all practical purposes, means you can have as many as you want). Perl also lets you assign the buffer number to an ordinary scalar variable ($1, $2, etc.) so you can use it later on in the code block.

More details here: http://etext.virginia.edu/services/helpsheets/unix/regex.html

This entry was posted in regex and tagged , , . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>