Python Programming/RegEx

This lesson introduces Python regular expression processing.

Objectives and Skills

Objectives and skills for this lesson include:

  • Standard Library
    • Regular expression operations

Readings

  1. Wikipedia: Regular expression
  2. Python for Everyone: Regular expressions

Multimedia

  1. YouTube: Python for Informatics - Chapter 11 - Regular Expressions
  2. YouTube: Python - Regular Expressions
  3. YouTube: Python3 - Regular Expressions

Examples

The match() Method

The match() method looks for zero or more characters at the beginning of the given string that match the given regular expression and returns a match object if found, or None if there is no match.[1]

import re

string = "<p>HTML text.</p>"
match = re.match("<p>.*</p>", string)
if match:
    print("start:", match.start(0))
    print("end:", match.end(0))
    print("group:", match.group(0))

Output:

start: 0
end: 17
group: <p>HTML text.</p>
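Because match() only tries the pattern at the start of the string, it returns None when the text of interest appears later. The following is a minimal sketch of that case; the sample string is an illustrative assumption, not part of the original example.

```python
import re

# Hypothetical sample: the <p> element is not at the start of the string.
string = "<h1>Heading</h1><p>HTML text.</p>"

# match() anchors at position 0, so the later <p> element is not found.
result = re.match("<p>.*</p>", string)
if result is None:
    print("No match at the beginning of the string.")
```

Checking the return value before calling start(), end(), or group() avoids an AttributeError on None.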

The search() Method

The search() method scans for the first match of the given regular expression in the given string and returns a match object if found, or None if there is no match.[2]

import re

string = "<h1>Heading</h1><p>HTML text.</p>"
match = re.search("<p>.*</p>", string)
if match:
    print("start:", match.start(0))
    print("end:", match.end(0))
    print("group:", match.group(0))

Output:

start: 16
end: 33
group: <p>HTML text.</p>
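As a further illustration, search() reports only the first occurrence even when the pattern appears more than once, and returns None when the pattern does not occur at all. This sketch uses a made-up sample string and the span() method of the match object, which returns the start and end positions as a tuple.

```python
import re

# Hypothetical sample text with two occurrences of "fish".
string = "one fish, two fish"

match = re.search("fish", string)
print("first occurrence:", match.span())   # (4, 8)

# A pattern that does not occur anywhere yields None.
missing = re.search("bird", string)
print("missing:", missing)                 # None
```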

Greedy vs. Non-greedy

The '*', '+', and '?' quantifiers are all greedy; they match as much text as possible. Adding ? after the quantifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.[3]

import re

string = "<h1>Heading</h1><p>HTML text.</p>"

match = re.search("<.*>", string)
if match:
    print("Greedy")
    print("start:", match.start(0))
    print("end:", match.end(0))
    print("group:", match.group(0))

match = re.search("<.*?>", string)
if match:
    print("\nNon-greedy")
    print("start:", match.start(0))
    print("end:", match.end(0))
    print("group:", match.group(0))

Output:

Greedy
start: 0
end: 33
group: <h1>Heading</h1><p>HTML text.</p>

Non-greedy
start: 0
end: 4
group: <h1>
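The same behavior applies to the '+' and '?' quantifiers. The sketch below compares each greedy form with its non-greedy counterpart; the sample string and variable names are chosen only for illustration.

```python
import re

string = "<h1>Heading</h1>"

# '+' is greedy: it extends to the last '>' in the string.
greedy_plus = re.search("<.+>", string).group(0)    # '<h1>Heading</h1>'
# '+?' stops at the first '>' that completes a match.
lazy_plus = re.search("<.+?>", string).group(0)     # '<h1>'

# '?' is greedy: it takes the optional '1' when it is present.
greedy_opt = re.search("h1?", string).group(0)      # 'h1'
# '??' prefers to match the optional element zero times.
lazy_opt = re.search("h1??", string).group(0)       # 'h'

print(greedy_plus, lazy_plus, greedy_opt, lazy_opt)
```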

The findall() Method

The findall() method matches all occurrences of the given regular expression in the string and returns a list of matching strings.[4]

import re

string = "<h1>Heading</h1><p>HTML text.</p>"
matches = re.findall("<.*?>", string)
print("matches:", matches)

Output:

matches: ['<h1>', '</h1>', '<p>', '</p>']
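A narrower pattern returns a shorter list, and a pattern that never matches returns an empty list rather than None. This sketch assumes we only want opening tags made of word characters; the variable names are illustrative.

```python
import re

string = "<h1>Heading</h1><p>HTML text.</p>"

# '<', one or more word characters, then '>'. Closing tags contain '/',
# which is not a word character, so they are skipped.
opening = re.findall(r"<\w+>", string)
print("opening:", opening)    # ['<h1>', '<p>']

# No <br> tag is present, so the result is an empty list.
breaks = re.findall("<br>", string)
print("breaks:", breaks)      # []
```

Because an empty list is falsy, the result can be tested directly with if breaks: before processing.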

The sub() Method

The sub() method replaces every occurrence of a pattern with a string.[5]

import re

string = "<h1>Heading</h1><p>HTML text.</p>"
string = re.sub("<.*?>", "", string)
print("string:", string)

Output:

string: HeadingHTML text.
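The output above runs the heading and paragraph text together. One possible variation, sketched here under the assumption that a single space between text runs is wanted, replaces each run of adjacent tags with a space. The optional count keyword argument of re.sub() limits how many replacements are made.

```python
import re

string = "<h1>Heading</h1><p>HTML text.</p>"

# Replace each run of one or more adjacent tags with a single space,
# then trim the spaces left at both ends.
spaced = re.sub("(<.*?>)+", " ", string).strip()
print("spaced:", spaced)          # Heading HTML text.

# count=1 replaces only the first occurrence.
first_only = re.sub("<.*?>", "", string, count=1)
print("first_only:", first_only)  # Heading</h1><p>HTML text.</p>
```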

The split() Method

The split() method splits a string at each occurrence of the pattern and returns the resulting pieces as a list.[6]

import re

string = "cat: Frisky, dog: Spot, fish: Bubbles"
# Raw strings keep backslashes such as \w intact for the regex engine.
keys = re.split(r": ?\w*,? ?", string)
values = re.split(r",? ?\w*: ?", string)
print("string:", string)
print("keys:", keys)
print("values:", values)

Output:

string: cat: Frisky, dog: Spot, fish: Bubbles
keys: ['cat', 'dog', 'fish', '']
values: ['', 'Frisky', 'Spot', 'Bubbles']
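The empty strings in the output above appear because the pattern matches at the very end or very start of the string. An alternative approach, sketched here as one option rather than the lesson's method, splits on the separators only: first on commas to get pairs, then on the colon within each pair. The maxsplit keyword argument limits the number of splits.

```python
import re

string = "cat: Frisky, dog: Spot, fish: Bubbles"

# Split on a comma followed by an optional space.
pairs = re.split(r", ?", string)
print("pairs:", pairs)    # ['cat: Frisky', 'dog: Spot', 'fish: Bubbles']

# maxsplit=1 stops after the first split.
first, rest = re.split(r", ?", string, maxsplit=1)
print("first:", first)    # cat: Frisky
print("rest:", rest)      # dog: Spot, fish: Bubbles

# Split each pair on the colon to build a dictionary.
pets = dict(re.split(r": ?", pair) for pair in pairs)
print("pets:", pets)
```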

The compile() Method

The compile() method compiles a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods. The expression’s behaviour can be modified by specifying a flags value.[7]

import re

string = "<p>Lines of<br>HTML text</p>"
regex = re.compile("<br>", re.IGNORECASE)
match = regex.search(string)
if match:
    print("start:", match.start(0))
    print("end:", match.end(0))
    print("group:", match.group(0))

Output:

start: 11
end: 15
group: <br>
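The sample string above contains only a lowercase tag, so the effect of re.IGNORECASE is not visible. The sketch below reuses one compiled pattern against several made-up strings to show both the flag and the benefit of compiling once.

```python
import re

# Compile once, then reuse the pattern object for every string.
regex = re.compile("<br>", re.IGNORECASE)

# Hypothetical inputs: lowercase tag, uppercase tag, and no tag.
strings = ["Lines of<br>text", "Lines of<BR>text", "No break here"]

found = []
for s in strings:
    match = regex.search(s)
    # Record the matched text, or None when the tag is absent.
    found.append(match.group(0) if match else None)

print("found:", found)    # ['<br>', '<BR>', None]
```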

Match Groups

A match group matches whatever regular expression is inside its parentheses, which mark the start and end of the group; the contents of each group can be retrieved after a match has been performed.[8]

import re

string = "<p>HTML text.</p>"
match = re.match("<p>(.*)</p>", string)
if match:
    print("start:", match.start(1))
    print("end:", match.end(1))
    print("group:", match.group(1))

string = "'cat': 'Frisky', 'dog': 'Spot', 'fish': 'Bubbles'"

match = re.search("'cat': '(.*?)', 'dog': '(.*?)', 'fish': '(.*?)'", string)
if match:
    print("groups:", match.group(1), match.group(2), match.group(3))
    
lst = re.findall(r"'(.*?)': '(.*?)',?\s*", string)
for key, value in lst:
    print("%s: %s" % (key, value))

Output:

start: 3
end: 13
group: HTML text.
groups: Frisky Spot Bubbles
cat: Frisky
dog: Spot
fish: Bubbles
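Groups can also be given names with the (?P<name>...) syntax, which can make longer patterns easier to read than numbered groups. This is a small sketch; the group names key and value are chosen for illustration.

```python
import re

string = "'cat': 'Frisky'"

# Two named groups: one for the key, one for the value.
match = re.search(r"'(?P<key>.*?)': '(?P<value>.*?)'", string)
if match:
    print("key:", match.group("key"))        # cat
    print("value:", match.group("value"))    # Frisky
    print("groups:", match.groups())         # ('cat', 'Frisky')
    print("groupdict:", match.groupdict())   # {'key': 'cat', 'value': 'Frisky'}
```

Named groups remain accessible by number as well, so match.group(1) returns the same text as match.group("key").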

Activities

Tutorials

  1. Complete one or more of the following tutorials:

Practice

  1. Create a Python program that asks the user to enter a line of comma-separated grade scores. Use RegEx methods to parse the line and add each item to a list. Display the list of entered scores sorted in descending order and then calculate and display the high, low, and average for the entered scores. Include try and except to handle input errors.
  2. Create a Python program that asks the user for a line of text that contains HTML tags, such as:
        <p><strong>This is a bold paragraph.</strong></p>
    Use RegEx methods to search for and remove all HTML tags from the text, saving each removed tag in a list. Print the untagged text and then display the list of removed tags sorted in alphabetical order with duplicate tags removed. Include error handling in case an HTML tag isn't entered correctly (an unmatched < or >). Use a user-defined function for the actual string processing, separate from input and output.
  3. Create a Python program that asks the user to enter a line of dictionary keys and values in the form Key-1: Value 1, Key-2: Value 2, Key-3: Value 3. You may assume that keys will never contain spaces, but may contain hyphens. Values may contain spaces, but a comma will always separate one key-value pair from the next key-value pair. Use RegEx functions to parse the string and build a dictionary of key-value pairs. Then display the dictionary sorted in alphabetical order by key. Include input validation and error handling in case a user accidentally enters the same key more than once.

Lesson Summary

RegEx Concepts

  • A regular expression (abbreviated regex) is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings.[9]
  • Each character in a regular expression is either understood to be a metacharacter with its special meaning, or a regular character with its literal meaning.[10]
  • In regex, | indicates alternation: the expression matches either the pattern on its left or the pattern on its right.[11]
  • In regex, ? indicates there is zero or one of the preceding element.[12]
  • In regex, * indicates there is zero or more of the preceding element.[13]
  • In regex, + indicates there is one or more of the preceding element.[14]
  • In regex, () is used to group elements.[15]
  • In regex, . matches any single character.[16]
  • In regex, [] matches any single character contained within the brackets.[17]
  • In regex, [^] matches any single character not contained within the brackets.[18]
  • In regex, ^ matches the start of the string.[19]
  • In regex, $ matches the end of the string.[20]
  • In regex, \w matches a single word character (a letter, digit, or underscore).[21]
  • In regex, \d matches a digit.[22]
  • In regex, \s matches whitespace.[23]
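The metacharacters listed above can be tried out with a short table of patterns. The sketch below pairs each pattern with a made-up sample string and the text that re.search() is expected to find first; all sample strings are illustrative assumptions.

```python
import re

# (pattern, sample text, expected first match)
tests = [
    (r"cat|dog", "hotdog", "dog"),             # | either/or
    (r"colou?r", "color", "color"),            # ? zero or one
    (r"ab*c", "ac", "ac"),                     # * zero or more
    (r"ab+c", "abbc", "abbc"),                 # + one or more
    (r"(ab)+", "ababab", "ababab"),            # () grouping
    (r"c.t", "cut", "cut"),                    # . any single character
    (r"[aeiou]", "rhythm and blues", "a"),     # [] one of the listed characters
    (r"[^aeiou]", "aeiox", "x"),               # [^] none of the listed characters
    (r"^\d+", "42 apples", "42"),              # ^ start of string, \d digit
    (r"\w+$", "hello world", "world"),         # \w word character, $ end of string
    (r"\s", "a b", " "),                       # \s whitespace
]

results = []
for pattern, text, expected in tests:
    found = re.search(pattern, text).group(0)
    results.append(found == expected)
    print(pattern, repr(text), "->", repr(found))
```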

Python RegEx

  • The Python regular expression module is re, which is made available with import re.[24]
  • The match() method looks for zero or more characters at the beginning of the given string that match the given regular expression and returns a match object if found, or None if there is no match.[25]
  • The search() method scans for the first match of the given regular expression in the given string and returns a match object if found, or None if there is no match.[26]
  • The '*', '+', and '?' quantifiers are all greedy; they match as much text as possible. Adding ? after the quantifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.[27]
  • The findall() method matches all occurrences of the given regular expression in the string and returns a list of matching strings.[28]
  • The sub() method replaces every occurrence of a pattern with a string.[29]
  • The split() method splits a string at each occurrence of the pattern and returns the resulting pieces as a list.[30]
  • The compile() method compiles a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods. The expression’s behaviour can be modified by specifying a flags value.[31]
  • Flags accepted by compile() include re.IGNORECASE for case-insensitive matching, re.MULTILINE so that ^ and $ match at the start and end of each line, and re.DOTALL so that . also matches a newline.[32]
  • A match group matches whatever regular expression is inside its parentheses, which mark the start and end of the group; the contents of each group can be retrieved after a match has been performed.[33]
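The re.MULTILINE and re.DOTALL flags mentioned in the summary can be illustrated with a two-line string. This is a minimal sketch with a made-up sample; the flags are passed directly to findall() and search(), which accept the same flags argument as compile().

```python
import re

text = "first line\nsecond line"

# Without flags, ^ matches only at the start of the whole string.
default_starts = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ also matches right after each newline.
multiline_starts = re.findall(r"^\w+", text, re.MULTILINE)
print(default_starts)      # ['first']
print(multiline_starts)    # ['first', 'second']

# Without flags, . does not match a newline, so this search fails.
default_span = re.search(r"first.*second", text)
# With re.DOTALL, . matches the newline and the search succeeds.
dotall_span = re.search(r"first.*second", text, re.DOTALL)
print(default_span)                 # None
print(repr(dotall_span.group(0)))   # 'first line\nsecond'
```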

Key Terms

brittle code
Code that works when the input data is in a particular format but is prone to breakage if there is some deviation from the correct format. We call this “brittle code” because it is easily broken.[34]
greedy matching
The notion that the “+” and “*” characters in a regular expression expand outward to match the largest possible string.[35]
grep
A command available in most Unix systems that searches through text files looking for lines that match regular expressions. The command name comes from the ed editor command g/re/p, short for "globally search for a regular expression and print".[36]
regular expression
A language for expressing more complex search strings. A regular expression may contain special characters that indicate that a search only matches at the beginning or end of a line or many other similar capabilities.[37]
wild card
A special character that matches any character. In regular expressions the wild-card character is the period.[38]

Review Questions

  1. A regular expression (abbreviated regex) is _____.
    A regular expression (abbreviated regex) is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings.
  2. Each character in a regular expression is either _____, or _____.
    Each character in a regular expression is either understood to be a metacharacter with its special meaning, or a regular character with its literal meaning.
  3. In regex, | indicates _____.
    In regex, | indicates either|or.
  4. In regex, ? indicates _____.
    In regex, ? indicates there is zero or one of the preceding element.
  5. In regex, * indicates _____.
    In regex, * indicates there is zero or more of the preceding element.
  6. In regex, + indicates _____.
    In regex, + indicates there is one or more of the preceding element.
  7. In regex, () is used to _____.
    In regex, () is used to group elements.
  8. In regex, . matches _____.
    In regex, . matches any single character.
  9. In regex, [] matches _____.
    In regex, [] matches any single character contained within the brackets.
  10. In regex, [^] matches _____.
    In regex, [^] matches any single character not contained within the brackets.
  11. In regex, ^ matches _____.
    In regex, ^ matches the start of the string.
  12. In regex, $ matches _____.
    In regex, $ matches the end of the string.
  13. In regex, \w matches _____.
    In regex, \w matches a single word character (a letter, digit, or underscore).
  14. In regex, \d matches _____.
    In regex, \d matches a digit.
  15. In regex, \s matches _____.
    In regex, \s matches whitespace.
  16. The match() method _____.
    The match() method looks for zero or more characters at the beginning of the given string that match the given regular expression and returns a match object if found, or None if there is no match.
  17. The search() method _____.
    The search() method scans for the first match of the given regular expression in the given string and returns a match object if found, or None if there is no match.
  18. The '*', '+', and '?' quantifiers are all _____; they match _____. Adding ? after the quantifier makes it _____.
    The '*', '+', and '?' quantifiers are all greedy; they match as much text as possible. Adding ? after the quantifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.
  19. The findall() method _____.
    The findall() method matches all occurrences of the given regular expression in the string and returns a list of matching strings.
  20. The sub() method _____.
    The sub() method replaces every occurrence of a pattern with a string.
  21. The split() method _____.
    The split() method splits a string at each occurrence of the pattern and returns the resulting pieces as a list.
  22. The compile() method _____.
    The compile() method compiles a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods. The expression’s behaviour can be modified by specifying a flags value.
  23. The compile() method flags include _____.
    The compile() method flags include re.IGNORECASE for case-insensitive matching, re.MULTILINE so that ^ and $ match at the start and end of each line, and re.DOTALL so that . also matches a newline.
  24. Match groups match _____.
    Match groups match whatever regular expression is inside their parentheses, which mark the start and end of each group; the contents of a group can be retrieved after a match has been performed.

Assessments

See Also

References

  1. Python.org: Regular expression operations
  2. Python.org: Regular expression operations
  3. Python.org: Regular expression operations
  4. Python.org: Regular expression operations
  5. Python.org: Regular expression operations
  6. Python.org: Regular expression operations
  7. Python.org: Regular expression operations
  8. Python.org: Regular expression operations
  9. Wikipedia: Regular expression
  10. Wikipedia: Regular expression
  11. Wikipedia: Regular expression
  12. Wikipedia: Regular expression
  13. Wikipedia: Regular expression
  14. Wikipedia: Regular expression
  15. Wikipedia: Regular expression
  16. Wikipedia: Regular expression
  17. Wikipedia: Regular expression
  18. Wikipedia: Regular expression
  19. Wikipedia: Regular expression
  20. Wikipedia: Regular expression
  21. Wikipedia: Regular expression
  22. Wikipedia: Regular expression
  23. Wikipedia: Regular expression
  24. Python.org: Regular expression operations
  25. Python.org: Regular expression operations
  26. Python.org: Regular expression operations
  27. Python.org: Regular expression operations
  28. Python.org: Regular expression operations
  29. Python.org: Regular expression operations
  30. Python.org: Regular expression operations
  31. Python.org: Regular expression operations
  32. Python.org: Regular expression operations
  33. Python.org: Regular expression operations
  34. PythonLearn: Regular expressions
  35. PythonLearn: Regular expressions
  36. PythonLearn: Regular expressions
  37. PythonLearn: Regular expressions
  38. PythonLearn: Regular expressions