Python Programming/RegEx
This lesson introduces Python regular expression processing.
Objectives and Skills
Objectives and skills for this lesson include:
- Standard Library
- Regular expression operations
Readings
Multimedia
- YouTube: Python for Informatics - Chapter 11 - Regular Expressions
- YouTube: Python - Regular Expressions
- YouTube: Python3 - Regular Expressions
Examples
The match() Method
The match() method looks for zero or more characters at the beginning of the given string that match the given regular expression and returns a match object if found, or None if there is no match.[1]
import re

string = "<p>HTML text.</p>"
match = re.match("<p>.*</p>", string)
if match:
    print("start:", match.start(0))
    print("end:", match.end(0))
    print("group:", match.group(0))

Output:
start: 0
end: 17
group: <p>HTML text.</p>
The search() Method
The search() method scans for the first match of the given regular expression in the given string and returns a match object if found, or None if there is no match.[2]
import re

string = "<h1>Heading</h1><p>HTML text.</p>"
match = re.search("<p>.*</p>", string)
if match:
    print("start:", match.start(0))
    print("end:", match.end(0))
    print("group:", match.group(0))

Output:
start: 16
end: 33
group: <p>HTML text.</p>
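The difference between the two methods is easiest to see by running both against the same input. The following is a minimal sketch that reuses the string from the example above; the variable names are illustrative only.

```python
import re

text = "<h1>Heading</h1><p>HTML text.</p>"
pattern = "<p>.*</p>"

# match() only succeeds when the pattern matches starting at index 0.
at_start = re.match(pattern, text)
print("match: ", at_start)

# search() scans forward and finds the paragraph further along.
anywhere = re.search(pattern, text)
print("search:", anywhere.group(0) if anywhere else None)
```

Because the string begins with the heading rather than the paragraph, match() returns None while search() still finds the paragraph.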
Greedy vs. Non-greedy
The '*', '+', and '?' quantifiers are all greedy; they match as much text as possible. Adding ? after the quantifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.[3]
import re

string = "<h1>Heading</h1><p>HTML text.</p>"

match = re.search("<.*>", string)
if match:
    print("Greedy")
    print("start:", match.start(0))
    print("end:", match.end(0))
    print("group:", match.group(0))

match = re.search("<.*?>", string)
if match:
    print("\nNon-greedy")
    print("start:", match.start(0))
    print("end:", match.end(0))
    print("group:", match.group(0))

Output:
Greedy
start: 0
end: 33
group: <h1>Heading</h1><p>HTML text.</p>

Non-greedy
start: 0
end: 4
group: <h1>
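The same rule applies to the '+' and '?' quantifiers. As a small sketch, using a made-up sample string of four letters, each quantifier can be compared with its non-greedy form:

```python
import re

sample = "aaaa"  # illustrative input, not from the lesson

greedy_plus = re.match("a+", sample).group(0)   # as many as possible
lazy_plus = re.match("a+?", sample).group(0)    # as few as possible (at least one)
greedy_opt = re.match("a?", sample).group(0)    # takes the optional character
lazy_opt = re.match("a??", sample).group(0)     # prefers to take nothing

print(repr(greedy_plus), repr(lazy_plus), repr(greedy_opt), repr(lazy_opt))
```

The non-greedy '??' form matches the empty string here, because zero characters is the smallest match that satisfies the pattern.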
The findall() Method
The findall() method matches all occurrences of the given regular expression in the string and returns a list of matching strings.[4]
import re
string = "<h1>Heading</h1><p>HTML text.</p>"
matches = re.findall("<.*?>", string)
print("matches:", matches)
Output:
matches: ['<h1>', '</h1>', '<p>', '</p>']
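When the pattern contains a single pair of parentheses, findall() returns only the text captured inside them rather than the whole match. The sketch below assumes we want the tag names without the angle brackets; the pattern is an illustration, not a general HTML parser.

```python
import re

text = "<h1>Heading</h1><p>HTML text.</p>"

# The parentheses capture just the tag name; "/?" allows closing tags.
tag_names = re.findall(r"</?(\w+)>", text)
print("tag names:", tag_names)
```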
The sub() Method
The sub() method replaces every occurrence of a pattern with a string.[5]
import re
string = "<h1>Heading</h1><p>HTML text.</p>"
string = re.sub("<.*?>", "", string)
print("string:", string)
Output:
string: HeadingHTML text.
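Two common variations are worth a quick sketch: the optional count argument limits how many occurrences are replaced, and a backreference such as \1 in the replacement reuses text captured by a group. The replacement tags chosen here are only an example.

```python
import re

text = "<h1>Heading</h1><p>HTML text.</p>"

# count=1 replaces only the first occurrence of the pattern.
first_only = re.sub("<.*?>", "", text, count=1)
print("first_only:", first_only)

# \1 in the replacement refers to the text captured by the first group.
retagged = re.sub(r"<h1>(.*?)</h1>", r"<h2>\1</h2>", text)
print("retagged:", retagged)
```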
The split() Method
The split() method splits string by the occurrences of pattern.[6]
import re

string = "cat: Frisky, dog: Spot, fish: Bubbles"
# Raw strings (r"...") keep the backslash in \w from being read as a string escape.
keys = re.split(r": ?\w*,? ?", string)
values = re.split(r",? ?\w*: ?", string)
print("string:", string)
print("keys:", keys)
print("values:", values)

Output:
string: cat: Frisky, dog: Spot, fish: Bubbles
keys: ['cat', 'dog', 'fish', '']
values: ['', 'Frisky', 'Spot', 'Bubbles']
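The empty strings in the output appear because the pattern matches at the very end (for keys) or the very start (for values) of the string. One possible way to tidy the result, sketched here with a hypothetical pets dictionary, is to filter out the empty items and pair the two lists:

```python
import re

text = "cat: Frisky, dog: Spot, fish: Bubbles"

# Drop the empty strings that split() leaves at either end.
keys = [k for k in re.split(r": ?\w*,? ?", text) if k]
values = [v for v in re.split(r",? ?\w*: ?", text) if v]

# Pair each key with the value in the same position.
pets = dict(zip(keys, values))
print(pets)
```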
The compile() Method
The compile() method compiles a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods. The expression’s behaviour can be modified by specifying a flags value.[7]
import re

string = "<p>Lines of<br>HTML text</p>"
regex = re.compile("<br>", re.IGNORECASE)
match = regex.search(string)
if match:
    print("start:", match.start(0))
    print("end:", match.end(0))
    print("group:", match.group(0))

Output:
start: 11
end: 15
group: <br>
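In the example above the flag makes no visible difference, because the input is already lowercase. The sketch below uses an invented input with uppercase tags and a line break to show what re.IGNORECASE, re.DOTALL, and re.MULTILINE each change; flags are combined with the | operator.

```python
import re

text = "<P>Lines of<BR>\nHTML text</P>"  # illustrative input

# IGNORECASE: the lowercase pattern also matches the uppercase tag.
br_match = re.compile("<br>", re.IGNORECASE).search(text)
print(br_match.group(0))

# Without DOTALL, "." stops at the newline, so the paragraph is not matched.
no_dotall = re.compile("<p>.*</p>", re.IGNORECASE).search(text)
print(no_dotall)

# With DOTALL, "." also matches the newline.
with_dotall = re.compile("<p>.*</p>", re.IGNORECASE | re.DOTALL).search(text)
print(repr(with_dotall.group(0)))

# MULTILINE: "^" matches at the start of every line, not just the string.
line_starts = re.compile(r"^\w+", re.MULTILINE).findall("one two\nthree four")
print(line_starts)
```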
Match Groups
Match groups match whatever regular expression is inside parentheses and indicate the start and end of a group; the contents of a group can be retrieved after a match has been performed.[8]
import re

string = "<p>HTML text.</p>"
match = re.match("<p>(.*)</p>", string)
if match:
    print("start:", match.start(1))
    print("end:", match.end(1))
    print("group:", match.group(1))

string = "'cat': 'Frisky', 'dog': 'Spot', 'fish': 'Bubbles'"
match = re.search("'cat': '(.*?)', 'dog': '(.*?)', 'fish': '(.*?)'", string)
if match:
    print("groups:", match.group(1), match.group(2), match.group(3))

lst = re.findall(r"'(.*?)': '(.*?)',?\s*", string)
for key, value in lst:
    print("%s: %s" % (key, value))

Output:
start: 3
end: 13
group: HTML text.
groups: Frisky Spot Bubbles
cat: Frisky
dog: Spot
fish: Bubbles
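A match object can also return all of its groups at once, and the pairs produced by findall() convert directly into a dictionary. This is a brief sketch under the assumption that every key and value is wrapped in single quotes, as in the example string.

```python
import re

text = "'cat': 'Frisky', 'dog': 'Spot', 'fish': 'Bubbles'"

# groups() returns every captured group as a tuple; span(1) gives the
# start and end index of group 1.
first = re.search(r"'(.*?)': '(.*?)'", text)
print("groups:", first.groups())
print("span of group 1:", first.span(1))

# With two groups, findall() returns a list of (key, value) tuples.
pairs = re.findall(r"'(.*?)': '(.*?)'", text)
pets = dict(pairs)
print(pets)
```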
Activities
Tutorials
- Complete one or more of the following tutorials:
- LearnPython
- TutorialsPoint
- RegexOne
Practice
- Create a Python program that asks the user to enter a line of comma-separated grade scores. Use RegEx methods to parse the line and add each item to a list. Display the list of entered scores sorted in descending order and then calculate and display the high, low, and average for the entered scores. Include try and except to handle input errors.
- Create a Python program that asks the user for a line of text that contains HTML tags, such as:
<p><strong>This is a bold paragraph.</strong></p>
Use RegEx methods to search for and remove all HTML tags from the text, saving each removed tag in a list. Print the untagged text and then display the list of removed tags sorted in alphabetical order with duplicate tags removed. Include error handling in case an HTML tag isn't entered correctly (an unmatched < or >). Use a user-defined function for the actual string processing, separate from input and output.
- Create a Python program that asks the user to enter a line of dictionary keys and values in the form Key-1: Value 1, Key-2: Value 2, Key-3: Value 3. You may assume that keys will never contain spaces, but may contain hyphens. Values may contain spaces, but a comma will always separate one key-value pair from the next key-value pair. Use RegEx functions to parse the string and build a dictionary of key-value pairs. Then display the dictionary sorted in alphabetical order by key. Include input validation and error handling in case a user accidentally enters the same key more than once.
Lesson Summary
RegEx Concepts
- A regular expression (abbreviated regex) is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings.[9]
- Each character in a regular expression is either understood to be a metacharacter with its special meaning, or a regular character with its literal meaning.[10]
- In regex, | indicates either|or.[11]
- In regex, ? indicates there is zero or one of the preceding element.[12]
- In regex, * indicates there is zero or more of the preceding element.[13]
- In regex, + indicates there is one or more of the preceding element.[14]
- In regex, () is used to group elements.[15]
- In regex, . matches any single character.[16]
- In regex, [] matches any single character contained within the brackets.[17]
- In regex, [^] matches any single character not contained within the brackets.[18]
- In regex, ^ matches the start of the string.[19]
- In regex, $ matches the end of the string.[20]
- In regex, \w matches a word character (a letter, digit, or underscore).[21]
- In regex, \d matches a digit.[22]
- In regex, \s matches whitespace.[23]
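The metacharacters listed above can be tried out one at a time with findall(). The following sketch uses short made-up strings chosen so that each result is easy to check by eye; none of the sample inputs come from the lesson.

```python
import re

either = re.findall(r"cat|dog", "cat dog fish")       # | : either/or
optional = re.findall(r"colou?r", "color colour")     # ? : zero or one
zero_plus = re.findall(r"ab*", "a ab abbb")           # * : zero or more
one_plus = re.findall(r"ab+", "a ab abbb")            # + : one or more
grouped = re.search(r"(ab)+", "ababab").group(0)      # (): group repeated as a unit
any_char = re.findall(r"c.t", "cat cot c t")          # . : any single character
in_set = re.findall(r"[aeiou]", "regex")              # []: any listed character
not_in_set = re.findall(r"[^aeiou]", "regex")         # [^]: any unlisted character
at_start = re.findall(r"^\w+", "first second")        # ^ : start of the string
at_end = re.findall(r"\w+$", "first second")          # $ : end of the string
digits = re.findall(r"\d", "a1b22")                   # \d: a digit
spaces = re.findall(r"\s", "a b\tc")                  # \s: whitespace

for name, value in [("either", either), ("optional", optional),
                    ("zero_plus", zero_plus), ("one_plus", one_plus),
                    ("grouped", grouped), ("any_char", any_char),
                    ("in_set", in_set), ("not_in_set", not_in_set),
                    ("at_start", at_start), ("at_end", at_end),
                    ("digits", digits), ("spaces", spaces)]:
    print(name, "->", value)
```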
Python RegEx
- The Python regular expression library is the re module, accessed using import re.[24]
- The match() method looks for zero or more characters at the beginning of the given string that match the given regular expression and returns a match object if found, or None if there is no match.[25]
- The search() method scans for the first match of the given regular expression in the given string and returns a match object if found, or None if there is no match.[26]
- The '*', '+', and '?' quantifiers are all greedy; they match as much text as possible. Adding ? after the quantifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.[27]
- The findall() method matches all occurrences of the given regular expression in the string and returns a list of matching strings.[28]
- The sub() method replaces every occurrence of a pattern with a string.[29]
- The split() method splits string by the occurrences of pattern.[30]
- The compile() method compiles a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods. The expression’s behaviour can be modified by specifying a flags value.[31]
- The compile() method flags include re.IGNORECASE, re.MULTILINE, and re.DOTALL for case insensitivity and processing more than one line at a time.[32]
- Match groups match whatever regular expression is inside parentheses and indicate the start and end of a group; the contents of a group can be retrieved after a match has been performed.[33]
Key Terms
- brittle code
- Code that works when the input data is in a particular format but is prone to breakage if there is some deviation from the correct format. We call this “brittle code” because it is easily broken.[34]
- greedy matching
- The notion that the “+” and “*” characters in a regular expression expand outward to match the largest possible string.[35]
- grep
- A command available in most Unix systems that searches through text files looking for lines that match regular expressions. The command name comes from the ed editor command g/re/p, meaning "globally search for a regular expression and print".[36]
- regular expression
- A language for expressing more complex search strings. A regular expression may contain special characters that indicate that a search only matches at the beginning or end of a line or many other similar capabilities.[37]
- wild card
- A special character that matches any character. In regular expressions the wild-card character is the period.[38]
Review Questions
- A regular expression (abbreviated regex) is _____.
  A regular expression (abbreviated regex) is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings.
- Each character in a regular expression is either _____, or _____.
  Each character in a regular expression is either understood to be a metacharacter with its special meaning, or a regular character with its literal meaning.
- In regex, | indicates _____.
  In regex, | indicates either|or.
- In regex, ? indicates _____.
  In regex, ? indicates there is zero or one of the preceding element.
- In regex, * indicates _____.
  In regex, * indicates there is zero or more of the preceding element.
- In regex, + indicates _____.
  In regex, + indicates there is one or more of the preceding element.
- In regex, () is used to _____.
  In regex, () is used to group elements.
- In regex, . matches _____.
  In regex, . matches any single character.
- In regex, [] matches _____.
  In regex, [] matches any single character contained within the brackets.
- In regex, [^] matches _____.
  In regex, [^] matches any single character not contained within the brackets.
- In regex, ^ matches _____.
  In regex, ^ matches the start of the string.
- In regex, $ matches _____.
  In regex, $ matches the end of the string.
- In regex, \w matches _____.
  In regex, \w matches a word character (a letter, digit, or underscore).
- In regex, \d matches _____.
  In regex, \d matches a digit.
- In regex, \s matches _____.
  In regex, \s matches whitespace.
- The match() method _____.
  The match() method looks for zero or more characters at the beginning of the given string that match the given regular expression and returns a match object if found, or None if there is no match.
- The search() method _____.
  The search() method scans for the first match of the given regular expression in the given string and returns a match object if found, or None if there is no match.
- The '*', '+', and '?' quantifiers are all _____; they match _____. Adding ? after the quantifier makes it _____.
  The '*', '+', and '?' quantifiers are all greedy; they match as much text as possible. Adding ? after the quantifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.
- The findall() method _____.
  The findall() method matches all occurrences of the given regular expression in the string and returns a list of matching strings.
- The sub() method _____.
  The sub() method replaces every occurrence of a pattern with a string.
- The split() method _____.
  The split() method splits string by the occurrences of pattern.
- The compile() method _____.
  The compile() method compiles a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods. The expression’s behaviour can be modified by specifying a flags value.
- The compile() method flags include _____.
  The compile() method flags include re.IGNORECASE, re.MULTILINE, and re.DOTALL for case insensitivity and processing more than one line at a time.
- Match groups match _____.
  Match groups match whatever regular expression is inside parentheses and indicate the start and end of a group; the contents of a group can be retrieved after a match has been performed.
Assessments
- Flashcards: Quizlet: Python Regular Expressions
- Quiz: Quizlet: Python Regular Expressions
See Also
- Regular expressions
- Python.org: String Pattern Matching
- SoloLearn: Python
- Regex101.com: Online Regex Tester
- PyRegex.com: Python Regex Tester
- Princeton: Regular Expressions - The Complete Tutorial
References
- [1] Python.org: Regular expression operations
- [2] Python.org: Regular expression operations
- [3] Python.org: Regular expression operations
- [4] Python.org: Regular expression operations
- [5] Python.org: Regular expression operations
- [6] Python.org: Regular expression operations
- [7] Python.org: Regular expression operations
- [8] Python.org: Regular expression operations
- [9] Wikipedia: Regular expression
- [10] Wikipedia: Regular expression
- [11] Wikipedia: Regular expression
- [12] Wikipedia: Regular expression
- [13] Wikipedia: Regular expression
- [14] Wikipedia: Regular expression
- [15] Wikipedia: Regular expression
- [16] Wikipedia: Regular expression
- [17] Wikipedia: Regular expression
- [18] Wikipedia: Regular expression
- [19] Wikipedia: Regular expression
- [20] Wikipedia: Regular expression
- [21] Wikipedia: Regular expression
- [22] Wikipedia: Regular expression
- [23] Wikipedia: Regular expression
- [24] Python.org: Regular expression operations
- [25] Python.org: Regular expression operations
- [26] Python.org: Regular expression operations
- [27] Python.org: Regular expression operations
- [28] Python.org: Regular expression operations
- [29] Python.org: Regular expression operations
- [30] Python.org: Regular expression operations
- [31] Python.org: Regular expression operations
- [32] Python.org: Regular expression operations
- [33] Python.org: Regular expression operations
- [34] PythonLearn: Regular expressions
- [35] PythonLearn: Regular expressions
- [36] PythonLearn: Regular expressions
- [37] PythonLearn: Regular expressions
- [38] PythonLearn: Regular expressions