Regular expressions
A regular expression (or regex) is a string of characters, some of which are reserved control characters, that represents a pattern [1], i.e. a description designed to match particular sequences of characters. Regular expressions are a basic tool for searching text and are ubiquitous in the electronic world.
Getting started
There are many editors with regex functionality. Here are a few examples. (Please feel free to add or remove entries if you find better ones.)
- Regex tester - try your hand at regex here
- Regex101 - compose and test your regex
- meta:User:Pathoschild/Scripts/Regex menu framework - a simple and useful wiki-editing JavaScript tool
- Codeproject
- [1] - a useful editor with regex functionality
- Geany editor - a flexible, extensible free and open source editor that supports regex for search and replace operations
- Regexps manual - Emacs regular expression manual
- Regex tester - a Firefox add-on
Learning materials
A lightning introduction
There are several "dialects" of regular expressions (e.g. JavaScript, Perl, PHP, Python), which differ slightly in grammar. Let us focus on Python regex for the moment (because I happen to have a reference [2] for it).
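As a minimal sketch of what using a regex in Python looks like, the standard library's re module can search a string for a pattern. The sample text and the variable names below are illustrative assumptions, not taken from the reference.

```python
import re

# Illustrative sample text (an assumption made for this sketch).
text = "The meeting is in room 2041."

# re.search scans the string and returns a match object for the first
# place where the pattern matches, or None if it matches nowhere.
# The pattern asks for four consecutive digits.
match = re.search(r"[0-9][0-9][0-9][0-9]", text)

found = match.group(0) if match else None
print(found)  # 2041
```

The r"..." prefix makes a raw string, so that backslashes reach the regex engine unchanged instead of being interpreted by Python first.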
Control characters
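- Python regex has the following control characters:
\-.*+?$<!=|()[]^:#
Some of these (such as - < ! = : #) are special only in certain contexts, for example inside a character class [...] or in a group that begins with (?. Curly braces { } are also special in Python regex, where they delimit repetition counts.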
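To see why control characters matter, the sketch below compares a string used directly as a pattern with the same string after escaping. It relies on re.escape and re.fullmatch from Python's standard library; the sample strings are assumptions chosen for illustration.

```python
import re

# A literal string that happens to contain two control characters.
literal = "a.b*c"

# re.escape returns a copy of the string with special characters
# backslash-escaped, so that it can be used as a literal pattern.
escaped = re.escape(literal)

# Unescaped, "." matches any character and "b*" matches zero or more "b",
# so the raw string used as a pattern also matches "aXc".
raw_matches_other = re.fullmatch(literal, "aXc") is not None

# Escaped, the pattern matches only the literal text itself.
escaped_matches_other = re.fullmatch(escaped, "aXc") is not None
escaped_matches_self = re.fullmatch(escaped, literal) is not None

print(raw_matches_other, escaped_matches_other, escaped_matches_self)
# True False True
```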
First examples
- Any string (e.g. abcdefg) which does not contain any control characters is trivially a regular expression ("regex") pattern. It matches only itself.
- The pattern [A-Z] matches a single character between A and Z (in the ASCII table).
- A backslash (\) followed by any control character, such as \. or even the backslash itself \\, matches the character itself (this pattern is called an "escape"). In our examples, \. matches the single dot . and \\ matches the backslash.
- Combining the two examples above, the pattern [A-Za-z0-9\-] matches any single alphanumeric character or the dash "-".
- The pattern \n matches a newline.
- The pattern abc.xyz matches a string that starts with abc, then contains any single character except a newline, then ends with xyz.
- The pattern a* matches zero or more consecutive "a" characters, taking as many as possible; it therefore also matches the empty string "".
- Combining the previous two examples, we get a very common pattern: abc.*xyz matches a string which starts with "abc" and ends with "xyz", and between which is the longest available string (which could be empty) of any characters except the newline.
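The examples above can be checked with a short script. This is a sketch, assuming Python's re module: re.fullmatch succeeds only if the whole string matches the pattern, which corresponds to the "matches a string" wording used in the list. The test strings are illustrative choices.

```python
import re

# Each entry: (pattern, input string, whether the whole string should match).
cases = [
    (r"abcdefg", "abcdefg", True),      # a plain string matches itself
    (r"[A-Z]", "Q", True),              # one uppercase letter
    (r"[A-Z]", "q", False),             # lowercase is outside the range
    (r"\.", ".", True),                 # escaped dot matches a literal dot
    (r"\.", "x", False),                # and nothing else
    (r"\\", "\\", True),                # escaped backslash
    (r"[A-Za-z0-9\-]", "-", True),      # the dash is in the class
    (r"[A-Za-z0-9\-]", "_", False),     # the underscore is not
    (r"\n", "\n", True),                # newline
    (r"abc.xyz", "abc-xyz", True),      # "." matches one character
    (r"abc.xyz", "abc\nxyz", False),    # but not a newline
    (r"a*", "", True),                  # zero repetitions
    (r"a*", "aaaa", True),              # many repetitions
    (r"abc.*xyz", "abcxyz", True),      # ".*" may match the empty string
]

results = [re.fullmatch(pattern, s) is not None for pattern, s, _ in cases]
for (pattern, s, expected), got in zip(cases, results):
    print(repr(pattern), repr(s), got)

# ".*" takes the longest available string: the match runs to the last "xyz".
greedy = re.search(r"abc.*xyz", "abc 1 xyz 2 xyz").group(0)
print(greedy)  # abc 1 xyz 2 xyz
```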
Exercises
- Question: What is [A-Za-z0-9\-]?
- Write a regular expression to match (a) the URL of any Wikiversity page; (b) the URL of any page on any Wikimedia site; and (c) the email addresses of all your friends. Check with a regex editor that your regex actually works.
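As an alternative to a regex editor, a proposed answer can also be checked with a small script. The sketch below is only a testing aid: the helper name check, the placeholder pattern, and the sample strings are all hypothetical, and the pattern is deliberately not a solution to the exercise.

```python
import re

def check(pattern, should_match, should_not_match):
    """Return the sample strings on which the pattern behaves unexpectedly."""
    compiled = re.compile(pattern)
    failures = []
    for s in should_match:
        if compiled.fullmatch(s) is None:
            failures.append(s)
    for s in should_not_match:
        if compiled.fullmatch(s) is not None:
            failures.append(s)
    return failures

# Hypothetical placeholder pattern: a digit followed by zero or more digits.
failures = check(r"[0-9][0-9]*", ["7", "2024"], ["", "12a"])
print(failures)  # []
```

An empty list means the pattern accepted every string it should and rejected every string it should not; replace the placeholder pattern and samples with your own.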
Write your proposed solutions below
Further lessons
[proposals]
- /Basics - the bare minimum needed to get started
- /Groups
- /How a regex engine works
- /Lookahead and lookbehind
- /Regex objects in python
- /The good and the bad
- /Cookbook
Wikimedia links
- b:regular expressions
- w:regular expressions
- mediawiki:titleblacklist - an application on wikiversity
External links
- Using regular expressions in Notepad++
- regex library cheatSheet
- regex tutorials
- Regex helpsheet at etext.lib.virginia.edu
- Regular Expressions Lesson for Java
- Princeton: Regular Expressions - The Complete Tutorial
Notes
- [1] Martelli, Python in a Nutshell, p. 203
- [2] Alex Martelli, Python in a Nutshell, ISBN 0596100469