Fun With Regular Expressions

In programming, if there's a "language" that looks as crazy as speaking in a cryptic sequence of symbols, it's regular expressions. Learning how to use regular expressions can be intimidating. However, once you get past the scrambled-looking syntax, you'll be amazed at how powerful they can be for formatting, matching, processing, searching and validating data.

Understand that regular expressions are, for the most part, an application of automata theory: a pattern describes a set of strings that a finite automaton can recognize. Regular expressions gained popularity through Unix tools such as grep and sed, which influenced AWK and, later, Perl.

I first used regular expressions through AWK while working for a data conversion company that specialized in audio/print-to-digital processing services. It was also my first job as a programmer. From my experience, I think that support for regular expressions is a must in any programming language. It's one of the first features I look for. Modern programming languages like C#, Java, JavaScript and Python, for example, all support regular expressions. It's like having a language within a language!

Now, let's start with the basics, shall we? Regular expressions, as a language, define specific rules. Just as you have to know which characters, keywords and phrases are reserved in a programming language, you have to know which ones are special in regular expressions. To begin with, there are special characters. It is important that you know them by heart. Most programming languages support largely the same core regular expression syntax, although the details vary from one flavor to another. This article focuses on the basics, specifically pattern matching, which is enough to get a novice started. As a convention, to differentiate them from strings, the regular expressions in this article will be enclosed in forward slashes "/".

Every character on your keyboard can be matched. However, some of these characters have a special meaning in regular expressions. If you want to match a special character as itself, you can escape it with a backslash "\". That said, "\" is itself a special character. If you want to match a backslash, you write a double backslash /\\/, which you can use, for example, to match the backslash in "C:\Temp".
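
As a quick illustration of escaping, here is a minimal sketch in Python (one of the languages mentioned earlier; any language with regular expressions would do). The variable names and the "notes.txt" sample are made up for this example. It uses Python raw strings (the r"..." prefix) so that each backslash reaches the regular expression engine unchanged.

```python
import re

path = r"C:\Temp"  # raw string: the backslash is kept as-is

# The regular expression /\\/ matches one literal backslash.
backslash = re.compile(r"\\")
has_backslash = backslash.search(path) is not None

# An escaped period /\./ matches only a literal ".".
literal_dot = re.compile(r"\.")
dot_in_filename = literal_dot.search("notes.txt") is not None
dot_in_word = literal_dot.search("notes") is not None

print(has_backslash, dot_in_filename, dot_in_word)  # True True False
```

Without the raw-string prefix, Python would process the backslashes first, and you would have to write "\\\\" to get the same pattern.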

The period "." is a special character that matches any single character (in most flavors, any character except a line break). By itself, /./ can be used, for example, to check whether a string is empty or not. The regular expression /h.t/, for example, matches any string that contains "h", followed by any single character, and then "t", such as "hat", "hit", "hot", "hut", "hate", "that", "shut", "blah toot", etc.
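
To see /h.t/ in action, here is a small sketch in Python. The word list reuses the examples above and adds two strings of my own, "ht" and "heat", that should not match. re.search is used because it looks for the pattern anywhere in the string.

```python
import re

pattern = re.compile(r"h.t")  # the article's /h.t/

words = ["hat", "hit", "hot", "hut", "hate", "that", "shut",
         "blah toot", "ht", "heat"]

# Keep every word in which "h", any one character, then "t" appears.
matched = [w for w in words if pattern.search(w)]
not_matched = [w for w in words if not pattern.search(w)]

print(matched)      # the first eight words
print(not_matched)  # ['ht', 'heat']

# /./ as an "is it empty?" check
print(re.search(r".", "") is None)   # True: nothing to match
print(re.search(r".", "x") is None)  # False: one character is enough
```

"ht" fails because there is no character between "h" and "t", and "heat" fails because there are two.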

Although /./ can be used to check whether a string contains a character, it does not require the string to contain only a single character. /./ matches both "x" and "xyz". If you want to make sure that you are matching from the beginning of a string, you start the regular expression with a caret "^". Thus, /^h/ means that the string must start with "h". /^h/ will match "hello world", but not "why hello world". If you want to make sure that you are matching up to the end of a string, you end the regular expression with a dollar sign "$". Thus, /d$/ will match "hello world", but not "hello worlds". So, if you want to make sure that a string contains only a single character, you can use /^.$/, which matches "x", "y" or "z", but definitely not "xyz".
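
The anchor examples above can be checked with a short Python sketch. The helper function matches is a name I made up for this example; it simply reports whether re.search found anything.

```python
import re

starts_with_h = re.compile(r"^h")  # /^h/
ends_with_d = re.compile(r"d$")    # /d$/
single_char = re.compile(r"^.$")   # /^.$/

def matches(pattern, text):
    """Return True if the pattern is found in text."""
    return pattern.search(text) is not None

print(matches(starts_with_h, "hello world"))      # True
print(matches(starts_with_h, "why hello world"))  # False
print(matches(ends_with_d, "hello world"))        # True
print(matches(ends_with_d, "hello worlds"))       # False
print(matches(single_char, "x"))                  # True
print(matches(single_char, "xyz"))                # False
```

One Python-specific detail: "$" also matches just before a trailing line break, so "x\n" passes /^.$/ here. If that matters, re.fullmatch(r".", text) is the stricter check.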

So far, it is clear that regular expressions can be used to match specific characters and/or sequences of characters in a string. Let's take that knowledge up a notch. The "?", "+" and "*" are special characters known as quantifiers. Understanding what these characters do shows the true power of regular expressions. Let's tackle them one by one. Before I begin, understand that a quantifier affects the expression immediately to its left in the regular expression. For example, in /th.?s/, the "?" affects how "." is evaluated.
  • The "?" means "0 or 1". Thus, if you say /th.?s/, it can match "this" and "maths", but not "mathematics".
  • The "+" means "1 or more". Thus, if you say /th.+s/, it can match "this" and "mathematics", but not "maths".
  • The "*" means "0 or more". Thus, if you say /th.*s/, it can match "this", "maths" and "mathematics".
I'll let you absorb these for now. For beginners, they can take a while to get used to. Meanwhile, practice what you have learned so far using https://regex101.com/.
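
If you would rather check the three quantifier examples in code, here is a minimal sketch in Python. The dictionary and variable names are my own; the patterns and words are the ones from the list above.

```python
import re

words = ["this", "maths", "mathematics"]

# One pattern per quantifier, each applied to the "." before it.
patterns = {
    "?": r"th.?s",  # "th", 0 or 1 character, then "s"
    "+": r"th.+s",  # "th", 1 or more characters, then "s"
    "*": r"th.*s",  # "th", 0 or more characters, then "s"
}

results = {
    quantifier: [w for w in words if re.search(regex, w)]
    for quantifier, regex in patterns.items()
}

for quantifier, matched in results.items():
    print(quantifier, matched)
# ? ['this', 'maths']
# + ['this', 'mathematics']
# * ['this', 'maths', 'mathematics']
```

Notice that "maths" fails with "+" because nothing sits between "th" and "s", while "mathematics" fails with "?" because more than one character does.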

Have fun!
