Regex

Regular expressions in 1 hour or less

05 Jan 2017

/^([^,]+),\s([A-Z]{2})(?:\s(\d{5}))?$/

When you first encounter regular expressions (“regex”) like the one above, they can seem very foreign. The random symbols look more like something from Stargate than from English, but understanding regex is actually quite simple!

What?

Regex is a sequence of characters that forms a search pattern, which can be as broad or as specific as you need. It is a formalized description of a regular language, a concept introduced by Stephen Kleene in the 1950s. The theory behind it is pretty cool but honestly not very relevant to our understanding.

For our purposes, regex is a way to match strings with patterns using concise code that is common across programming languages (with slight variations or “flavors” as they are sometimes called).

Why?

Many of the search tools you use rely on regex, or something very much like it, to find a match. Regex is a combination of characters and meta-characters used for:

Matching: Determine if a string matches some format (e.g., a phone number, email, address or credit card number)
Replacement: Find and replace patterns in a string (e.g., all whitespace, common misspellings, etc.)
Extraction: Extract specific information (e.g., zip codes, titles, etc.)

Using regex over code has two main advantages:

Portability: Almost every major language has a regular expression library with largely shared syntax, so you can learn it once and use it almost everywhere
Concision: While it may not seem like it at first, a regex is often much shorter to write and read than the equivalent logic in code

Some common use cases are:

Verify the structure of strings (e.g. input forms)
Extract information from strings (e.g. zip code from an address)
Search / replace / rearrange parts of the string (e.g. DD/MM/YYYY to MM/DD/YYYY)
Split a string into tokens

How?

Basics

Let’s learn regex with JavaScript and then we can port that knowledge to other languages as needed.

If you know the pattern beforehand, you can declare a regex using a literal. Nothing magical here–just like wrapping text in quotation marks indicates a string, or brackets indicate an array, or curly braces indicate an object, wrapping text in forward slashes indicates a regex: var regex = /pattern/

If you won’t know the pattern till runtime (e.g. because the user will input it), you can use the RegExp constructor instead: var userInput = 'pattern'; var regex = new RegExp(userInput);
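One catch with the constructor: any meta-characters in the input (more on those below) keep their special meaning. Here is a minimal sketch of escaping the input first. The helper name escapeRegExp is made up for this example and is not a built-in:

```javascript
// Hypothetical helper: put a backslash in front of every regex
// meta-character so the input is matched as literal text.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

var userInput = '$5.00';
var unsafe = new RegExp(userInput);                // $ and . keep their special meaning
var literal = new RegExp(escapeRegExp(userInput)); // matches the text "$5.00" only

var unsafeFound = unsafe.test('Total: $5.00');   // false
var literalFound = literal.test('Total: $5.00'); // true
```

Without the escaping step, the $ is read as an end-of-string anchor, so the unescaped pattern can never match here.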

The most basic operations you can do resemble the ‘Ctrl / Cmd + F’ operations most people are already familiar with.

Find and Replace

var string = 'Zoo';
var regex = /Zoo/;

string.match(regex); ["Zoo"]

'Some other word'.match(regex); null

The match method returns an array of the matched characters if there are any, or null if there are no matches.

By default, partial matches are allowed: 'Zoolander'.match(regex); ["Zoo"]

To match whole words, use \b to define the word boundary:

'Zoolander'.match(/\bZoo\b/); null

'San Francisco Zoo is so cool'.match(/\bZoo\b/); ["Zoo"]

By default, matches are case sensitive. Adding the i modifier makes the match case insensitive:

'zoo'.match(/Zoo/); null

'Zoo'.match(/Zoo/i); ["Zoo"]

'zoo'.match(/Zoo/i); ["zoo"]

There are a few more modifiers–they always go after the ending forward slash.

By default, regex is like a Find Next–it stops looking as soon as a match is found. Adding the g modifier makes it more like a Find All: 'Zoolander went to the SF Zoo'.match(/Zoo/); ["Zoo"] 'Zoolander went to the SF Zoo'.match(/Zoo/g); ["Zoo", "Zoo"]
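The same modifiers apply to the Replace half of Find and Replace. Here is a small sketch using the built-in replace method on strings; the sentence and the variable names are just for illustration:

```javascript
var sentence = 'Zoolander went to the SF Zoo';

// Without g, only the first match is replaced.
var firstOnly = sentence.replace(/Zoo/, 'Park');
// "Parklander went to the SF Zoo"

// With g, every match is replaced.
var all = sentence.replace(/Zoo/g, 'Park');
// "Parklander went to the SF Park"

// Word boundaries keep the replacement away from partial matches.
var wholeWord = sentence.replace(/\bZoo\b/g, 'Park');
// "Zoolander went to the SF Park"
```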

Now, let’s get into the cooler stuff regex can do that goes beyond a simple Find function.

For one, you can match two separate patterns using the or operator: 'gray'.match(/gray|grey/); ["gray"] 'grey'.match(/gray|grey/); ["grey"]

Brackets let you specify multiple possibilities for a single character: 'gray'.match(/gr[ae]y/); ["gray"] 'grey'.match(/gr[ae]y/); ["grey"]

Within the brackets, the or is implied, and | is just another literal character. Writing /gr[a|e]y/ still matches gray and grey, but it matches a stray pipe as well: 'gr|y'.match(/gr[a|e]y/); ["gr|y"]

Say you wanted to match a character with any letter in the alphabet. You can just specify a range [a-z]: 'red'.match(/[a-z]/); ["r"]

As mentioned above, the brackets still match only ONE character. We need to specify how many characters to match.

We could define a specific number of them as such: 'red'.match(/[a-z]{2}/); ["re"] 'red'.match(/[a-z]{3}/); ["red"] 'red'.match(/[a-z]{4}/); null

Rather than a specific amount, we can specify a minimum and maximum (with no space after the comma, or JavaScript will not treat it as a quantifier):

{0,10} = up to 10.
{3,5} = at least 3, at most 5.
{0,1} = either none or one. Can use ? as shorthand in this specific case.
{0,} = any amount, including none. Can use * as shorthand in this specific case.
{1,} = 1 or more. Can use + as shorthand in this specific case.
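To check that the shorthands really behave like their long forms, here is a quick sketch. It leans on the ^ and $ anchors (covered in the Advanced section) so that partial matches don't count, and the results helper is made up for this example:

```javascript
var samples = ['', 'a', 'aaa', 'aaaaaa'];

// Hypothetical helper: test one regex against every sample string.
function results(regex) {
  return samples.map(function (s) { return regex.test(s); });
}

var optionalLong  = results(/^a{0,1}$/); // [true, true, false, false]
var optionalShort = results(/^a?$/);     // [true, true, false, false]

var anyLong  = results(/^a{0,}$/);       // [true, true, true, true]
var anyShort = results(/^a*$/);          // [true, true, true, true]

var someLong  = results(/^a{1,}$/);      // [false, true, true, true]
var someShort = results(/^a+$/);         // [false, true, true, true]

var threeToFive = results(/^a{3,5}$/);   // [false, false, true, false]
```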

What we really want is for there to be one or more letters: 'green'.match(/[a-z]+/); ["green"] 'Pink'.match(/[a-z]+/); ["ink"]

Uh-oh, the capital letter wasn’t matched—need to increase the allowable range to capital letters, too: 'Pink'.match(/[a-zA-Z]+/); ["Pink"]

Or, make the regex case-insensitive: 'Pink'.match(/[a-z]+/i); ["Pink"]

But what about: 'Pink Panther'.match(/[a-z]+/i); ["Pink"]

Why didn’t it match the second word? It’s because there’s a space in the middle and that’s not in the allowed range: 'Pink Panther'.match(/[a-z ]+/i); ["Pink Panther"]

The meta-character \s matches not just a blank space, but any whitespace (tabs, newlines, etc.): 'Pink Panther'.match(/[a-z\s]+/i); ["Pink Panther"] 'Pink Panther 2'.match(/[a-z\s]+/i); ["Pink Panther "]

We need to allow numbers within the range also: 'Pink Panther 2'.match(/[a-z\s0-9]+/i); ["Pink Panther 2"]

Within the brackets, order doesn’t matter: 'Pink Panther 2'.match(/[a-z0-9\s]+/i); ["Pink Panther 2"]

The meta-character \w is the same as [A-Za-z0-9_] (note that it includes the underscore) so we can just do: 'Pink Panther 2'.match(/[\w\s]+/); ["Pink Panther 2"]

Similarly, \d is the same as [0-9]: 'one23four5'.match(/[\d]+/); ["23"] '1twothree4five'.match(/[\d]+/); ["1"]

There is also a handy NOT operator: a ^ at the start of the brackets negates the whole set. For escaped meta-characters, capitalizing the letter does the same thing, so \D is the same as [^\d]. 'one23four5'.match(/[^\d]+/); ["one"] '1twothree4five'.match(/[\D]+/); ["twothree"]
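Here is a short sketch of these shorthand classes on some made-up inputs (the phone number and identifier are just sample data). It also shows that \w accepts the underscore:

```javascript
var phone = '(415) 555-0123';

// \D and [^\d] both mean "anything that is not a digit".
var digitsOnly = phone.replace(/\D/g, '');    // "4155550123"
var sameThing  = phone.replace(/[^\d]/g, ''); // "4155550123"

// \w covers letters, digits and the underscore, but not the hyphen.
var identifier = 'user_name-42'.match(/\w+/); // ["user_name"]

// \W is the negation of \w.
var nonWord = 'user_name-42'.match(/\W/);     // ["-"]
```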

Inside the brackets, characters represent themselves, so [$.] matches the $ and ., but otherwise both are meta-characters that need to be escaped as \$ and \. or else they’ll be confused for something else. '$34.45'.match(/[$\d.]+/); ["$34.45"]

But this is not strict enough: '$34.....45'.match(/[$\d.]+/); ["$34.....45"] '34.456789'.match(/[$\d.]+/); ["34.456789"] '$$$$$'.match(/[$\d.]+/); ["$$$$$"]

A better rule, using the escaped characters, is: '$34.45'.match(/\$[\d]+\.[\d]{2}/); ["$34.45"] '34.45'.match(/\$[\d]+\.[\d]{2}/); null

We can also make the dollar sign optional: '$34.45'.match(/\${0,1}[\d]+\.[\d]{2}/); ["$34.45"] '$34.45'.match(/\$?[\d]+\.[\d]{2}/); ["$34.45"] '$$$$$$34.45'.match(/\$?[\d]+\.[\d]{2}/); ["$34.45"] '34.4567'.match(/\$?[\d]+\.[\d]{2}/); ["34.45"]

But now we have another problem: it’s too strict. '$34'.match(/\$[\d]+\.[\d]{2}/); null '$34.5'.match(/\$[\d]+\.[\d]{2}/); null

We can fix that by making a group of characters optional: '$34'.match(/\$[\d]+(\.[\d]{0,2})?/); ["$34", undefined] '$34.'.match(/\$[\d]+(\.[\d]{0,2})?/); ["$34.", "."] '$34.0'.match(/\$[\d]+(\.[\d]{0,2})?/); ["$34.0", ".0"] '$34.05'.match(/\$[\d]+(\.[\d]{0,2})?/); ["$34.05", ".05"] '$34.056'.match(/\$[\d]+(\.[\d]{0,2})?/); ["$34.05", ".05"] '$34.056.09'.match(/\$[\d]+(\.[\d]{0,2})?/); ["$34.05", ".05"]

Creating groups results in multiple matches showing up in the results–one for the entire pattern, and one for each group. We can get rid of a group from the results by making it a non-capturing group: '$34.056.09'.match(/\$[\d]+(?:\.[\d]{0,2})?/); ["$34.05"]
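Here is a side-by-side sketch of the two kinds of group, using the same price pattern. The variable names are only for illustration:

```javascript
// Capturing group: the cents show up as a second entry in the result.
var capturing = '$34.05'.match(/\$[\d]+(\.[\d]{0,2})?/);
var wholePrice = capturing[0];        // "$34.05"
var cents = capturing[1];             // ".05"
var capturingLength = capturing.length; // 2

// Non-capturing group: same match, but nothing extra is recorded.
var nonCapturing = '$34.05'.match(/\$[\d]+(?:\.[\d]{0,2})?/);
var nonCapturingLength = nonCapturing.length; // 1

// When the optional group does not take part, its slot is undefined.
var noCents = '$34'.match(/\$[\d]+(\.[\d]{0,2})?/);
var missingCents = noCents[1];        // undefined
```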

Capturing groups let us reference parts of a string easily, for example to reformat American date format to British: '07-04-1776'.replace(/(\d{2})-(\d{2})-(\d{4})/, '$2-$1-$3'); "04-07-1776"

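Beware, though, that match() leaves the capturing groups out of its result when the global (g) flag is enabled: you get back only the full matches. The groups themselves still work with replace(), or with exec() called in a loop.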
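To see what the g flag does to groups, here is a small sketch. The sample text and variable names are made up for illustration. With g, match() hands back only the full matches, while exec() in a loop and replace() still see every group:

```javascript
var text = 'Born 07-04-1776, signed 09-17-1787';
var datePattern = /(\d{2})-(\d{2})-(\d{4})/g;

// match() with g: full matches only, no groups.
var fullMatchesOnly = text.match(datePattern);
// ["07-04-1776", "09-17-1787"]

// exec() in a loop: one result per match, groups included.
var years = [];
var m;
while ((m = datePattern.exec(text)) !== null) {
  years.push(m[3]); // third group = the year
}
// years is ["1776", "1787"]

// replace() with g: $1, $2, $3 refer to the groups of each match.
var swapped = text.replace(datePattern, '$2-$1-$3');
// "Born 04-07-1776, signed 17-09-1787"
```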

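Backreferences let us refer to a group numerically inside the pattern itself when there is a repetitive pattern: \1 matches the exact text that group 1 matched. '07-07-1776'.match(/(\d{2})-\1-(\d{4})/); ["07-07-1776", "07", "1776"] '07-04-1776'.match(/(\d{2})-\1-(\d{4})/); null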
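Because \1 only matches the same text the first group matched, it is handy for spotting repeats. Here is a small sketch that collapses accidentally doubled words (the sample sentences are made up):

```javascript
// (\w+) captures a word; " \1" requires a space and then that same word again.
var doubledWord = /\b(\w+) \1\b/g;

var fixedOne = 'the the cat'.replace(doubledWord, '$1');
// "the cat"

var fixedTwo = 'this is is a test'.replace(doubledWord, '$1');
// "this is a test"

var untouched = 'no repeats here'.replace(doubledWord, '$1');
// "no repeats here"
```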

Advanced

Regex is great for finding any part of a string that matches a pattern, but we can also use anchors to specify which part of the string should match a pattern.

We can make sure a string starts a certain way by using ^ (it has a different meaning than when it’s inside a bracket): 'unbearable'.match(/^bear\w*/i); null 'bearable'.match(/^bear\w*/i); ["bearable"] 'forebear'.match(/^bear\w*/i); null 'teddy bear'.match(/^bear\w*/i); null 'Bear'.match(/^bear\w*/i); ["Bear"]

We can make sure a string ends a certain way by using $: 'unbearable'.match(/\w*bear$/i); null 'bearable'.match(/\w*bear$/i); null 'forebear'.match(/\w*bear$/i); ["forebear"] 'teddy bear'.match(/\w*bear$/i); ["bear"] 'Bear'.match(/\w*bear$/i); ["Bear"]

Using both, we can make sure that something matches a pattern and ONLY that pattern: 'unbearable'.match(/^\w*bear$/i); null 'bearable'.match(/^\w*bear$/i); null 'forebear'.match(/^\w*bear$/i); ["forebear"] 'teddy bear'.match(/^\w*bear$/i); null 'Bear'.match(/^\w*bear$/i); ["Bear"]
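The pattern at the very top of this post uses both anchors, plus most of what we have covered so far. Here is a minimal sketch of it in use, assuming it is meant for "City, ST 12345" style strings; the function name parseLocation and the sample inputs are made up for this example:

```javascript
// ^([^,]+)        one or more non-comma characters from the start: the city
// ,\s([A-Z]{2})   a comma, one whitespace, two capital letters: the state
// (?:\s(\d{5}))?  optionally, one whitespace and five digits: the zip
// $               and nothing else after that
var cityStateZip = /^([^,]+),\s([A-Z]{2})(?:\s(\d{5}))?$/;

function parseLocation(input) {
  var match = input.match(cityStateZip);
  if (match === null) {
    return null;
  }
  // match[3] is undefined when the optional zip is absent.
  return { city: match[1], state: match[2], zip: match[3] };
}

var withZip = parseLocation('San Francisco, CA 94103');
// { city: "San Francisco", state: "CA", zip: "94103" }
var withoutZip = parseLocation('Austin, TX');
// { city: "Austin", state: "TX", zip: undefined }
var notALocation = parseLocation('austin tx');
// null
```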

What if a text contains more than one line?

var multiLine = 'Grizzly Bear\n' +
                'Mama Bear\n' +
                'Mama Whale\n' +
                'Killer Whale\n' +
                'forebear\n' +
                'teddy bear\n' +
                'Bear Cub';

multiLine.match(/^\w+\sbear$/ig); null

That didn’t work because the ^ and $ anchors treated the whole string as one, rather than doing individual matches for each line.

The m modifier fixes that: multiLine.match(/^\w+\sbear$/mig); ["Grizzly Bear", "Mama Bear", "teddy bear"]
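Another small sketch of the m modifier, on made-up sample data: pulling whole lines out of a block of text. It relies on the dot, which matches any character except a newline, so .* stops at the end of each line:

```javascript
var notes = 'TODO: write tests\n' +
            'done: setup\n' +
            'todo: deploy';

// With m, ^ and $ match at the start and end of every line.
var todoLines = notes.match(/^todo:.*$/gim);
// ["TODO: write tests", "todo: deploy"]

// Without m, ^ and $ only match at the start and end of the whole string.
var withoutM = notes.match(/^todo:.*$/gi);
// null
```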

Expert

At this point, we’ve covered all the most common rules for regex. Let’s discuss lookarounds, though they are not very commonly used.

Lookarounds let you insert additional logic into your regex. Basically, you look for a pattern only if it follows or is followed by another pattern. The lookaround itself is only a check and is not included in the match.

Want to only include matches if there is another string ahead of it? Use positive lookahead (?=): 'Say hi before you say bye'.match(/hi(?=[\w\s]*bye)/); ["hi"] 'Do not say bye before you say hi'.match(/hi(?=[\w\s]*bye)/); null

Want to do the opposite? Use negative lookahead (?!): 'Say hi before you say bye'.match(/hi(?![\w\s]*bye)/); null 'Do not say bye before you say hi'.match(/hi(?![\w\s]*bye)/); ["hi"]
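Here is a slightly more practical sketch of lookaheads, on a made-up string of CSS-like sizes. It also shows a gotcha: a negative lookahead after a + quantifier can be satisfied by giving back a character, so the quantified part may need guarding too.

```javascript
var sizes = '12px 3em 40px';

// Numbers followed by "px"; the "px" itself is not part of the match.
var pixelValues = sizes.match(/\d+(?=px)/g);
// ["12", "40"]

// Naive negation: "12" fails the check, so the engine backs off to "1",
// which is followed by "2" rather than "px", and that counts as a match.
var naive = sizes.match(/\d+(?!px)/g);
// ["1", "3", "4"]

// Also forbid a following digit, so the number cannot be cut short.
var nonPixelValues = sizes.match(/\d+(?!\d|px)/g);
// ["3"]
```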

JavaScript gained lookbehind in ES2018, so current engines accept it. Atomic groups are still not supported in JavaScript, but let’s discuss them for the sake of completeness to see what they might do.

To include a match only if another string is behind it, use positive lookbehind (?<=): 'Look behind before you look ahead'.match(/ahead(?<=behind[\w\s]*)/); ["ahead"] 'Look ahead before you look behind'.match(/ahead(?<=behind[\w\s]*)/); null

Want to do the opposite? Use negative lookbehind (?<!): 'Do not look behind before you look ahead'.match(/ahead(?<!behind[\w\s]*)/); null 'Do not look ahead before you look behind'.match(/ahead(?<!behind[\w\s]*)/); ["ahead"]
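JavaScript engines that implement ES2018 or later accept lookbehind, so a sketch like the following should run in a current browser or Node. The sample string and variable names are made up for illustration:

```javascript
var amounts = 'USD $34, EUR 50';

// Digits that come right after a dollar sign.
var dollarAmounts = amounts.match(/(?<=\$)\d+/g);
// ["34"]

// Digits NOT right after a dollar sign. The \d inside the lookbehind
// stops the match from starting in the middle of "34".
var otherAmounts = amounts.match(/(?<![$\d])\d+/g);
// ["50"]
```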

An atomic group is a non-capturing group that, once the pattern inside it has matched, throws away all the remaining alternatives inside the group, so the engine cannot backtrack into it. This can make matching faster, because the engine does not go back and retry those alternatives.

By default, groups are non-atomic, so they allow backtracking. The regex engine takes the first alternative that matches, then if the rest of the pattern fails, it backtracks and tries the next alternative, until a match for the entire expression is found or all possibilities are exhausted.

'carts'.match(/(car|cart)s/); ["carts", "cart"]

In the example above, the regex engine first matches car and then expects an s, but finds a t instead. It then backtracks and tries cart, after which the s matches.

'carts'.match(/(?>car|cart)s/); null

With an atomic group (a syntax that JavaScript itself rejects, but that flavors such as PCRE, Java and Ruby accept), the engine never backtracks to try the second alternative, and just fails to find a match.

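'carts'.match(/(?>fart|cart)s/); ["carts"]

Here the first alternative fails on the very first character, before anything inside the group has matched, so the engine moves on to the second alternative and never needs to backtrack into the group to find a match. Since an atomic group is non-capturing, only the full match is returned.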
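Since JavaScript rejects the (?>...) syntax, the atomic examples above can't be run as written. A commonly cited workaround is to capture inside a lookahead and then consume the capture with a backreference; this works because the engine does not backtrack into a lookahead once it has succeeded. Here is a sketch of that emulation, not a built-in feature:

```javascript
// Ordinary group: the engine backtracks from "car" to "cart".
var ordinary = 'carts'.match(/(car|cart)s/);
// ["carts", "cart"]

// Emulated atomic group: the lookahead settles on "car", \1 consumes it,
// "s" fails against "t", and the lookahead is not retried with "cart".
var atomicFails = 'carts'.match(/(?=(car|cart))\1s/);
// null

// Here "fart" fails inside the lookahead, so "cart" is what gets captured.
var atomicMatches = 'carts'.match(/(?=(fart|cart))\1s/);
// ["carts", "cart"]
```

Unlike a true atomic group, this emulation needs a capturing group, so the captured text does show up in the result.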

Cheatsheet

Now that you know the rules, this cross-lingual cheat sheet should be enough to fill any gaps:
