More Fun With Regular Expressions

Learning how to use regular expressions gives developers power over data when solving problems that involve pattern matching, searching, format validation and conversion.

If you haven't done so, learn the basics by going through my previous article: Fun With Regular Expressions. If you think that you already have a good grasp of how the basics can be used together, it is time to learn more features.

In this article, we will focus on lists and groups. As with my previous article, we will use the same convention where regular expressions are enclosed in forward slashes "//".

Lists

A list is like an array of characters, and it matches exactly one character from that array. (Lists are also commonly called character classes.) In regular expressions, you enclose lists in square brackets "[]". For example, if you want to match "a" or "b" or "c", you can write /^[abc]$/. You can also define a range using the dash "-". So, if you want to match a capital letter, you can say /[A-Z]/. If you want to match either a capital or a small letter, you can say /[A-Za-z]/. You can also do this for digits, as in /[0-9]/.
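To try the list patterns above, here is a minimal sketch in Python. The choice of Python and the helper name is_match are my own assumptions for illustration; the patterns come from the text, written without the enclosing slashes because Python takes the pattern as a plain string.

```python
import re

# Patterns from the text, without the surrounding slashes.
ONE_OF_ABC = re.compile(r"^[abc]$")  # the whole string is one of a, b, c
CAPITAL = re.compile(r"[A-Z]")       # a capital letter somewhere in the string
LETTER = re.compile(r"[A-Za-z]")     # a capital or small letter somewhere
DIGIT = re.compile(r"[0-9]")         # a digit somewhere in the string


def is_match(pattern, text):
    """Hypothetical helper: True if the pattern is found anywhere in text."""
    return pattern.search(text) is not None


print(is_match(ONE_OF_ABC, "b"))    # True
print(is_match(ONE_OF_ABC, "ab"))   # False: the list matches one character only
print(is_match(CAPITAL, "hello"))   # False: no capital letter
print(is_match(CAPITAL, "Hello"))   # True
print(is_match(DIGIT, "room 42"))   # True
```

Note that /[A-Z]/ has no anchors, so it succeeds as soon as any capital letter appears anywhere in the input.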

If you want to create a negative list, that is, to match any character NOT in the list, you start the list with a caret "^". For example, to match any single character that is NOT "a" or "b" or "c", you can write /^[^abc]$/. One example where this is useful is when you want to match an HTML or XML tag. A simple pattern like /<[^>]+>/ matches a tag, including all attributes in it. Because "/" is just another character that is not ">", this pattern already matches closing tags as well; writing it as /<\/?[^>]+>/ only makes the optional forward slash explicit.
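The negative list and the tag pattern can be checked with a short Python sketch. The sample HTML string is made up for illustration, and the forward slash is left unescaped because Python patterns are not delimited by slashes.

```python
import re

NOT_ABC = re.compile(r"^[^abc]$")  # one character that is not a, b or c
TAG = re.compile(r"</?[^>]+>")     # same as /<\/?[^>]+>/ from the text

print(NOT_ABC.search("d") is not None)  # True
print(NOT_ABC.search("a") is not None)  # False

# Hypothetical sample input, not taken from the article.
html = '<a href="index.html">Home</a>'
tags = TAG.findall(html)
print(tags)  # ['<a href="index.html">', '</a>']
```

This is only a quick way to pick tags out of simple markup. An attribute value that itself contains ">" would cut the match short.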

Groups

A group is like a list of strings. More correctly, a group is a list of regular expressions. Groups are enclosed in parentheses "()". The items in a group are separated by a pipe "|". For example, if you want to match "abc" or "xyz", you can say /^(abc|xyz)$/.
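As a quick check of the group example, here is a small Python sketch. The function name matches_group is hypothetical.

```python
import re

ABC_OR_XYZ = re.compile(r"^(abc|xyz)$")  # the whole string is "abc" or "xyz"


def matches_group(text):
    """Hypothetical helper: True if text is exactly one of the alternatives."""
    return ABC_OR_XYZ.search(text) is not None


print(matches_group("abc"))     # True
print(matches_group("xyz"))     # True
print(matches_group("abx"))     # False: alternatives are whole strings, not characters
print(matches_group("abcxyz"))  # False: the anchors allow only one alternative
```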

Note that the items in a group can be regular expression patterns themselves. For example, when matching a time format, you can write /([1-9]|1[0-2]):[0-5][0-9] ?([AP]M)?/, which matches "9:30AM" or "10:45 PM" or "12:00". The group "([1-9]|1[0-2])" allows 1 through 9, or 10, 11 and 12, in the hour part. Note that the "([AP]M)?" syntax is a technique used to let the "?" affect the whole group. If it were expressed as [AP]M?, the "?" would affect only "M". Also note that this pattern has no anchors, so it matches a time anywhere inside a longer string; to validate that the whole input is a time, add "^" and "$" around it.
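The time pattern can be exercised with the Python sketch below. Using re.fullmatch to test the whole input is my own choice, not something the pattern itself requires; it behaves as if the pattern were wrapped in anchors. The helper name is_time is hypothetical.

```python
import re

TIME = re.compile(r"([1-9]|1[0-2]):[0-5][0-9] ?([AP]M)?")


def is_time(text):
    """Hypothetical helper: True if the whole text is a time in this format."""
    return TIME.fullmatch(text) is not None


for sample in ["9:30AM", "10:45 PM", "12:00", "13:00", "9:60"]:
    print(sample, is_time(sample))

# Without anchors, the pattern finds a time inside a longer string.
print(TIME.search("99:30").group(0))  # 9:30

# The "?" applies to the whole group in ([AP]M)? but only to "M" in [AP]M?.
print(re.fullmatch(r"[AP]M?", "A") is not None)    # True: "M" alone is optional
print(re.fullmatch(r"([AP]M)?", "A") is not None)  # False: "AM"/"PM" or nothing
```

The first three samples are accepted and the last two are rejected, since 13 is not a valid hour here and 60 is not a valid minute.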

I'll let you absorb these for now. For beginners, they can take a while to get used to. Meanwhile, practice what you have learned so far using https://regex101.com/.

Have fun!
