Python Concepts/Regular Expressions

Objective

  • What is a regular expression?
  • How to test a string for content that matches a regular expression?
  • How to retrieve content that matches a regular expression?
  • How to split a string at points in the string that match a given regular expression?
  • How to replace parts of a string that match a given regular expression?

Lesson

A regular expression is a pattern, written as a string, that describes the text to look for. Python's re (regular expression) methods scan a supplied string for text that matches the pattern. If a match is found, the method may report it, split the string at the point where the match was found, or replace the matched text.

A regular expression may be as simple as a few characters to be interpreted literally, e.g., 'abc'. A regular expression may also contain special characters that tell the method how to interpret the literal characters, e.g., the expression 'abc*' matches 'a' + 'b' + any number of 'c', including none.

>>> import re
>>> 
>>> re.search(r'abc*', '123abDEF')
<_sre.SRE_Match object; span=(3, 5), match='ab'>
>>> 
>>> re.search(r'abc*', '123abcDEF')
<_sre.SRE_Match object; span=(3, 6), match='abc'>
>>> 
>>> re.search(r'abc*', '123abcccccccDEF')
<_sre.SRE_Match object; span=(3, 12), match='abccccccc'>
>>> 
>>> re.search(r'abc*', '123acccccccDEF')
>>>

Matching literal characters

A regular expression may be as simple as one character. Search for 'e' within the string 'jumped':

>>> import re
>>> re.search('e', 'jumped')
<_sre.SRE_Match object; span=(4, 5), match='e'>
>>> 
>>> 'jumped'[4:5] == 'e'
True
>>>

Search for 'e' within the string 'jumped over everything':

>>> s1 = 'jumped over everything'
>>> re.search('e', s1)
<_sre.SRE_Match object; span=(4, 5), match='e'> # 1st occurrence
>>> s1[4:5] == 'e'
True
>>> 
>>> s2 = s1[5:] ; s2
'd over everything'
>>> re.search('e', s2)
<_sre.SRE_Match object; span=(4, 5), match='e'> # 2nd occurrence
>>> s2[4:5] == 'e'
True
>>> 
>>> s3 = s2[5:] ; s3
'r everything'
>>> re.search('e', s3)
<_sre.SRE_Match object; span=(2, 3), match='e'> # 3rd occurrence
>>> s3[2:3] == 'e'
True
>>> 
>>> s4 = s3[3:] ; s4
'verything'
>>> re.search('e', s4)
<_sre.SRE_Match object; span=(1, 2), match='e'> # 4th occurrence
>>> s4[1:2] == 'e'
True
>>> 
>>> s5 = s4[2:] ; s5
'rything'
>>> re.search('e', s5)
>>>

Method re.findall(....) produces a list of all matches found:

>>> L1 = re.findall('e', s1) ; L1
['e', 'e', 'e', 'e']
>>>

Iterating over matches found

Method re.finditer(....) returns an iterator that yields one match object for each match found:

>>> print ('\n'.join([ str(p) for p in re.finditer('e', s1 ) ]))
<_sre.SRE_Match object; span=(4, 5), match='e'>
<_sre.SRE_Match object; span=(9, 10), match='e'>
<_sre.SRE_Match object; span=(12, 13), match='e'>
<_sre.SRE_Match object; span=(14, 15), match='e'>
>>>
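
Each item produced by re.finditer is a match object. As a small sketch (the variable names are illustrative), the position and text of each match can be read with the start(), end() and group() methods:

```python
import re

s1 = 'jumped over everything'

# Collect (start, end, text) for every 'e' in s1.
found = [(m.start(), m.end(), m.group()) for m in re.finditer('e', s1)]

for start, end, text in found:
    # Slicing s1 with the reported span gives back the matched text.
    assert s1[start:end] == text
    print(start, end, text)
```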

Modifying the search

Flags supplied as an extra argument modify the search. Flag re.IGNORECASE makes the search ignore case:

>>> print ('\n'.join([ str(p) for p in re.finditer('R', s1, re.IGNORECASE ) ]))
<_sre.SRE_Match object; span=(10, 11), match='r'>
<_sre.SRE_Match object; span=(15, 16), match='r'>
>>>

Flag re.VERBOSE permits white space and comments in the regular expression. Flags are combined with the bitwise OR operator '|': 're.IGNORECASE|re.VERBOSE' turns on both flags. re.I and re.X are short names for re.IGNORECASE and re.VERBOSE.

>>> print ('\n'.join([ str(p) for p in re.finditer('R # looking for r or R', s1, re.IGNORECASE|re.VERBOSE ) ]))
<_sre.SRE_Match object; span=(10, 11), match='r'>
<_sre.SRE_Match object; span=(15, 16), match='r'>
>>> 
>>> print ('\n'.join([ str(p) for p in re.finditer('v # looking for v or V', s1.upper(), re.I|re.X ) ]))
<_sre.SRE_Match object; span=(8, 9), match='V'>
<_sre.SRE_Match object; span=(13, 14), match='V'>
>>>
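
A pattern that is used repeatedly can be compiled once, together with its flags, by re.compile. The following is a minimal sketch using the same string as above; the name r_pattern is illustrative:

```python
import re

s1 = 'jumped over everything'

# Compile once; the flags are stored with the compiled pattern.
r_pattern = re.compile(r'''
    R   # looking for r or R
''', re.I | re.X)

spans = [m.span() for m in r_pattern.finditer(s1)]
print(spans)
```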

Matching groups of characters

Regular expressions can become complicated and unintelligible quickly. It may help to name the more common expressions. By naming expressions you can specify exactly what you want.

To match 'ee':

>>> pattern = 'e' * 2 ; pattern
'ee'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, 'Beets are sweet.') ]))
<_sre.SRE_Match object; span=(1, 3), match='ee'>
<_sre.SRE_Match object; span=(12, 14), match='ee'>
>>>

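The special characters '{m,n}' cause the resulting RE to match from m to n repetitions of the preceding RE. Omitting n means no upper limit. Common matchings are named below (note that the name any replaces Python's built-in function any() for the rest of the session):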

>>> any = r'{0,}' # Match any number of the preceding RE.
>>> one_or_more = r'{1,}'  # Match one or more of the preceding RE.
>>> zero_or_one = r'{0,1}'  # Match zero or one of the preceding RE.
>>> 
>>> 'e' + any
'e{0,}' # Match any number of 'e'.
>>> 'e' + one_or_more
'e{1,}' # Match one or more of 'e'.
>>> 'e' + zero_or_one 
'e{0,1}' # Match zero or one of 'e'.
>>>

To match one or more of 'e':

>>> pattern = 'e' + one_or_more ; pattern
'e{1,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, 'Beets are sweet.' ) ]))
<_sre.SRE_Match object; span=(1, 3), match='ee'>
<_sre.SRE_Match object; span=(8, 9), match='e'>
<_sre.SRE_Match object; span=(12, 14), match='ee'>
>>>


Matching members of a set

The pattern 'abc' matches the three characters 'abc' in that order. When the characters are members of a set (within brackets '[]'), the expression '[abc]' matches one character: 'a' or 'b' or 'c'.

Alpha-numeric

>>> pattern = 'abcdefghijklmnopqrstuvwxyz';len(pattern)
26
>>> 
>>> lower = r'[' + pattern + r']' ; lower
'[abcdefghijklmnopqrstuvwxyz]'
>>> upper = r'[' + pattern.upper() + r']' ; upper
'[ABCDEFGHIJKLMNOPQRSTUVWXYZ]'
>>> alpha = r'[' + pattern + pattern.upper() + r']' ; alpha
'[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]'
>>> 
>>> numeric = r'[0123456789]' ; numeric
'[0123456789]'
>>> 
>>> alpha_numeric = alpha[:-1] + numeric[1:] ; alpha_numeric 
'[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]'
>>> word = r'[_' + alpha_numeric[1:] ; word
'[_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]'
>>>

Find all groups of alpha characters:

>>> pattern = alpha + one_or_more ; pattern
'[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]{1,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(6, 9), match='are'>
<_sre.SRE_Match object; span=(10, 17), match='numeric'>
>>>

Find all groups of numeric characters:

>>> pattern = numeric + one_or_more ; pattern
'[0123456789]{1,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(0, 1), match='1'>
<_sre.SRE_Match object; span=(2, 3), match='2'>
<_sre.SRE_Match object; span=(4, 5), match='3'>
>>>

Find all words in the string that contain the letters 'ee':

>>> pattern = alpha + any + 'ee' + alpha + any  ; pattern
'[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]{0,}ee[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]{0,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, 'Beets are sweet.' ) ]))
<_sre.SRE_Match object; span=(0, 5), match='Beets'>
<_sre.SRE_Match object; span=(10, 15), match='sweet'>
>>>

Find all words in the string that contain at least 5 letters:

>>> pattern = alpha*5 + alpha + any ; pattern
'[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]{0,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(10, 17), match='numeric'>
>>>

It's OK to be lazy. The important thing is to define the pattern accurately and then let the re method make sense of it. However, with a little practice you will probably write the above search as:

>>> pattern = alpha + r'{5,}' ; pattern
'[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]{5,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(10, 17), match='numeric'>
>>>
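
Sets also accept ranges written with a hyphen, which this lesson spells out in full for clarity. As a hedged sketch (the names alpha_long and alpha_short are illustrative), the range form finds the same matches as the spelled-out set:

```python
import re
import string

# The spelled-out set used in this lesson, built from the standard library.
alpha_long = '[' + string.ascii_lowercase + string.ascii_uppercase + ']'
# The same set written with ranges.
alpha_short = '[a-zA-Z]'

text = '1,2,3 are numeric.'
long_result = re.findall(alpha_long + '{5,}', text)
short_result = re.findall(alpha_short + '{5,}', text)
print(long_result, short_result)
```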

Non alpha-numeric

The caret '^' at the beginning of a set negates all the members of the set. '[^abc]' means any character that is not ('a' or 'b' or 'c').

>>> non_lower = r'[^' + lower[1:] ; non_lower
'[^abcdefghijklmnopqrstuvwxyz]'
>>> non_upper = r'[^' + upper[1:] ; non_upper
'[^ABCDEFGHIJKLMNOPQRSTUVWXYZ]'
>>> 
>>> non_alpha = r'[^' + alpha[1:] ; non_alpha
'[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]'
>>> 
>>> non_numeric = r'[^' + numeric[1:] ; non_numeric
'[^0123456789]'
>>> 
>>> non_alpha_numeric = r'[^' + alpha_numeric[1:] ; non_alpha_numeric 
'[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]'
>>> 
>>> non_word = r'[^' + word[1:] ; non_word
'[^_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]'
>>>

Find all groups that contain non numeric characters:

>>> pattern = non_numeric + one_or_more ; pattern
'[^0123456789]{1,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(1, 2), match=','>
<_sre.SRE_Match object; span=(3, 4), match=','>
<_sre.SRE_Match object; span=(5, 18), match=' are numeric.'>
>>>

Find all groups containing non alpha characters:

>>> pattern = non_alpha + one_or_more ; pattern
'[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]{1,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(0, 6), match='1,2,3 '>
<_sre.SRE_Match object; span=(9, 10), match=' '>
<_sre.SRE_Match object; span=(17, 18), match='.'>
>>>

White space

>>> white = '[ \t\n\r\f\v]' ; white
'[ \t\n\r\x0c\x0b]'
>>> pattern = white + one_or_more ; pattern
'[ \t\n\r\x0c\x0b]{1,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(5, 6), match=' '>
<_sre.SRE_Match object; span=(9, 10), match=' '>
>>>

Non white space

>>> non_white = r'[^' + white[1:] ; non_white
'[^ \t\n\r\x0c\x0b]'
>>>

Find all blocks of non white space:

>>> pattern = non_white + one_or_more ; pattern
'[^ \t\n\r\x0c\x0b]{1,}'
>>> 
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(0, 5), match='1,2,3'>
<_sre.SRE_Match object; span=(6, 9), match='are'>
<_sre.SRE_Match object; span=(10, 18), match='numeric.'>
>>>

Find all blocks of non white space that contain at least 4 characters:

>>> pattern = non_white*3 + non_white + one_or_more ; pattern
'[^ \t\n\r\x0c\x0b][^ \t\n\r\x0c\x0b][^ \t\n\r\x0c\x0b][^ \t\n\r\x0c\x0b]{1,}'
>>> 
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, '1,2,3 are numeric.' ) ]))
<_sre.SRE_Match object; span=(0, 5), match='1,2,3'>
<_sre.SRE_Match object; span=(10, 18), match='numeric.'>
>>>

International characters

The methods work with international characters:

>>> pattern = white + any + 'στο' + white + one_or_more  ; pattern
'[ \t\n\r\x0c\x0b]{0,}στο[ \t\n\r\x0c\x0b]{1,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, 'Καλώς ήρθατε στο Βικιεπιστήμιο' ) ]))
<_sre.SRE_Match object; span=(12, 17), match=' στο '>
>>>

Find all words that contain the letter 'α' (Greek alpha):

>>> pattern = non_white + any + 'α' + non_white + any  ; pattern
'[^ \t\n\r\x0c\x0b]{0,}α[^ \t\n\r\x0c\x0b]{0,}'
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, 'Καλώς ήρθατε στο Βικιεπιστήμιο' ) ]))
<_sre.SRE_Match object; span=(0, 5), match='Καλώς'>
<_sre.SRE_Match object; span=(6, 12), match='ήρθατε'>
>>>

List all the words in the string:

>>> print ('\n'.join([ str(p) for p in re.finditer(r'\w+', 'Καλώς ήρθατε στο Βικιεπιστήμιο' ) ]))
<_sre.SRE_Match object; span=(0, 5), match='Καλώς'>
<_sre.SRE_Match object; span=(6, 12), match='ήρθατε'>
<_sre.SRE_Match object; span=(13, 16), match='στο'>
<_sre.SRE_Match object; span=(17, 30), match='Βικιεπιστήμιο'>
>>>

The special character '\w' matches any Unicode word character when the pattern is a str, so it works for both English and Greek.
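
As a small sketch of the difference, the re.ASCII flag restricts '\w' to [a-zA-Z0-9_], so the Greek words are no longer matched:

```python
import re

greeting = 'Καλώς ήρθατε στο Βικιεπιστήμιο'

# Default for str patterns: '\w' matches Unicode word characters.
unicode_words = re.findall(r'\w+', greeting)
# With re.ASCII, '\w' matches only [a-zA-Z0-9_].
ascii_words = re.findall(r'\w+', greeting, re.ASCII)

print(unicode_words)
print(ascii_words)
```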

Matching white space

In this section, white space is any one of '\n', '\t', '\v', '\f', ' '. (The carriage return '\r', included in the set defined earlier, is left out here.)

The regular expression that means 'any white character' is '[\n\t\v\f ]'. It may help to name the most common regular expressions:

>>> new_line = '''
... '''
>>> white = '[' + new_line + '\t\v\f ]' ; white
'[\n\t\x0b\x0c ]'
>>>

Some special characters that tell the methods how to interpret the other characters in the regular expression are:

>>> any = r'*' # any number of
>>> one_or_more = r'+' # one or more of
>>> zero_or_one = r'?' # zero or one of
>>> 
>>> white + any # any number of white characters
'[\n\t\x0b\x0c ]*'
>>> white + one_or_more # one or more white characters
'[\n\t\x0b\x0c ]+'
>>> white + zero_or_one # zero or one white characters
'[\n\t\x0b\x0c ]?'
>>>

Searching for white space:

>>> s1 = '\v\n \t abcd          EFG \v\t   \n\n  234  \f\f\n' # 4 blocks of white space.
>>> 
>>> re.search(white + one_or_more, s1)
<_sre.SRE_Match object; span=(0, 5), match='\x0b\n \t '> # 1st block
>>> 
>>> re.search(white + one_or_more, s1[5:])
<_sre.SRE_Match object; span=(4, 14), match='          '> # 2nd block.
>>> 
>>> re.search(white + one_or_more, s1[5:][14:])
<_sre.SRE_Match object; span=(3, 13), match=' \x0b\t   \n\n  '> # 3rd block.
>>> 
>>> re.search(white + one_or_more, s1[5:][14:][13:])
<_sre.SRE_Match object; span=(3, 8), match='  \x0c\x0c\n'> # 4th block
>>> 
>>> re.search(white + one_or_more, s1[5:][14:][13:][8:])
>>> # no more.
>>> 5+14+13+8 == len(s1)
True
>>> L1 = re.findall(white + one_or_more, s1) ; L1
['\x0b\n \t ', '          ', ' \x0b\t   \n\n  ', '  \x0c\x0c\n'] # 4 blocks of white space.
>>>

Iterating over matches found:

>>> for p in re.finditer(white + one_or_more, s1 ) :
...     print (p)
... 
<_sre.SRE_Match object; span=(0, 5), match='\x0b\n \t '>
<_sre.SRE_Match object; span=(9, 19), match='          '>
<_sre.SRE_Match object; span=(22, 32), match=' \x0b\t   \n\n  '>
<_sre.SRE_Match object; span=(35, 40), match='  \x0c\x0c\n'>
>>>

Anchoring the pattern:

>>> beginning = r'^' # Anchor pattern at beginning of string.
>>> end = r'$' # Anchor pattern at end of string.
>>> 
>>> beginning + white + one_or_more # 1 or more white characters at beginning of string.
'^[\n\t\x0b\x0c ]+'
>>> 
>>> white + one_or_more + end # 1 or more white characters at end of string.
'[\n\t\x0b\x0c ]+$'
>>>

Searching for white space at extremities of string:

>>> L2 = re.findall(white + one_or_more + end, s1) ; L2
['  \x0c\x0c\n']
>>> L2[0] == L1[-1]
True
>>> L3 = re.findall(beginning + white + one_or_more, s1) ; L3
['\x0b\n \t ']
>>> L3[0] == L1[0]
True
>>>

Splitting on white space

>>> s1 = '  \n \t  \n   line 1a\n  line 1b\n\n\t  \n  line 2a\n    line 2b   \n  \t\t\n'
>>> print (s1)
  
 	  
   line 1a
  line 1b

	  
  line 2a
    line 2b   
  		

>>>

Remove white space from beginning of s1, but preserve white space at beginning of line 1a:

>>> pattern = beginning + white + any + new_line ; pattern
'^[\n\t\x0b\x0c ]*\n'
>>> re.split(pattern, s1)
['', '   line 1a\n  line 1b\n\n\t  \n  line 2a\n    line 2b   \n  \t\t\n']
>>> s2 = re.split(pattern, s1)[1] ; s2
'   line 1a\n  line 1b\n\n\t  \n  line 2a\n    line 2b   \n  \t\t\n'

Remove white space from end of s2, but preserve white space at end of line 2b:

>>> pattern = new_line + white + any + end ; pattern
'\n[\n\t\x0b\x0c ]*$'
>>> re.split(pattern, s2)
['   line 1a\n  line 1b\n\n\t  \n  line 2a\n    line 2b   ', '']
>>> s3 = re.split(pattern, s2)[0] ; s3
'   line 1a\n  line 1b\n\n\t  \n  line 2a\n    line 2b   '

Split s3 into paragraphs:

>>> pattern = new_line + white + any + new_line ; pattern
'\n[\n\t\x0b\x0c ]*\n'
>>> re.split(pattern, s3)
['   line 1a\n  line 1b', '  line 2a\n    line 2b   ']
>>> paragraphs = re.split(pattern, s3) ; paragraphs
['   line 1a\n  line 1b', '  line 2a\n    line 2b   ']

Produce s4, equivalent to s1 without extraneous white space:

>>> s4 = '\n\n'.join(paragraphs) + new_line ; s4
'   line 1a\n  line 1b\n\n  line 2a\n    line 2b   \n'
>>> print (s4,end='')
   line 1a
  line 1b

  line 2a
    line 2b   
>>>
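
The steps above can be collected into one function. This is a minimal sketch, not the only way to do it: the function name tidy is hypothetical, the constant WHITE mirrors the set white defined above, and re.sub is used in place of re.split for the two trimming steps.

```python
import re

WHITE = '[\n\t\v\f ]'

def tidy(text):
    """Remove blank lines at both ends and reduce the gap between paragraphs to one blank line."""
    # Drop white space up to and including the last newline before the first text.
    text = re.sub('^' + WHITE + '*\n', '', text)
    # Drop the newline that ends the last line of text, and all white space after it.
    text = re.sub('\n' + WHITE + '*$', '', text)
    # Split into paragraphs wherever a newline, optional white space and a newline occur.
    paragraphs = re.split('\n' + WHITE + '*\n', text)
    return '\n\n'.join(paragraphs) + '\n'

s1 = '  \n \t  \n   line 1a\n  line 1b\n\n\t  \n  line 2a\n    line 2b   \n  \t\t\n'
s4 = tidy(s1)
print(s4, end='')
```

White space at the beginning of line 1a and at the end of line 2b is preserved, as in the step-by-step version.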

Special characters

Special characters are sometimes called metacharacters:

. ^ $ * + ? { } [ ] \ | ( )

Special characters '[]'

Brackets contain the members of a set (also called a character class):

>>> alpha
'[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]' # Any character found in the English alphabet.
>>>

Special characters '{}'

Braces specify how many times the preceding expression may repeat:

e{17} # Match exactly 'e' * 17

[0123456789]{3,} # Match 3 or more numeric characters.

[abc]{3,5} # Match 3 or 4 or 5 of ('a' or 'b' or 'c')

p{,3} # Match 0 or 1 or 2 or 3 of 'p'.
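
As a quick check of the forms above, re.fullmatch succeeds only when the whole string matches the pattern. The helper name matches is an illustrative choice:

```python
import re

def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

assert matches(r'e{17}', 'e' * 17)
assert not matches(r'e{17}', 'e' * 16)
assert matches(r'[0123456789]{3,}', '2017')
assert not matches(r'[0123456789]{3,}', '20')
assert matches(r'[abc]{3,5}', 'abcab')
assert not matches(r'[abc]{3,5}', 'abcabc')
assert matches(r'p{,3}', '')
assert not matches(r'p{,3}', 'pppp')
```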

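Special characters '()'

Parentheses indicate that the method is to retain the value that matches the expression within '()'. Each pair of parentheses forms a group, numbered from 1 in order of its opening parenthesis. Group 0 is the whole match.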

>>> print (pattern1)
                                              
[5432]{3} # 3 of ('5' or '4' or '3' or '2')                  
\ {1,}    # 1 or more spaces                                 
[6789]{1,}# 1 or more of ('6' or '7' or '8' or '9')          

>>> 
>>> m = re.search(pattern1, '        2345      9876    ', re.VERBOSE) ; m
<_sre.SRE_Match object; span=(9, 22), match='345      9876'>
>>> m.lastindex # None: pattern1 contains no '()', so nothing is displayed.
>>> m.group(0)
'345      9876'
>>> 
>>> print (pattern2)
                                              
([5432]{3}) # 3 of ('5' or '4' or '3' or '2'). Note the '()'
\ {1,}      # 1 or more spaces                               
([6789]{1,})# 1 or more of ('6' or '7' or '8' or '9'). Note the '()'

>>> 
>>> m = re.search(pattern2, '        2345      9876    ', re.VERBOSE) ; m
<_sre.SRE_Match object; span=(9, 22), match='345      9876'>
>>> m.lastindex
2
>>> m.group(0)
'345      9876'
>>> m.group(1)
'345'
>>> m.group(2)
'9876'
>>>
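
Two related features may be convenient here: m.groups() returns all retained values as a tuple, and the (?P<name>...) syntax gives a group a name. In the sketch below the pattern name pattern3 and the group names low and high are illustrative:

```python
import re

pattern3 = r'''
(?P<low>[5432]{3})   # 3 of ('5' or '4' or '3' or '2'), retained as 'low'
\ {1,}               # 1 or more spaces
(?P<high>[6789]{1,}) # 1 or more of ('6' or '7' or '8' or '9'), retained as 'high'
'''

m = re.search(pattern3, '        2345      9876    ', re.VERBOSE)
print(m.groups())                       # all retained values
print(m.group('low'), m.group('high'))  # retained values by name
print(m.groupdict())                    # names mapped to values
```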

Special characters '*', '+', '?'

Special character '*' means 'any number of'. The following are equivalent:

p*    p{0,}                    # Any number of 'p'.
[0123456789]* [0123456789]{0,} # Any number of numeric.

Special character '+' means '1 or more of'. The following are equivalent:

p+    p{1,}                    # 1 or more of 'p'.
[0123456789]+ [0123456789]{1,} # 1 or more of numeric.

Special character '?' means '0 or 1 of'. The following are equivalent:

p?    p{0,1}                    # 0 or 1 of 'p'.
[0123456789]? [0123456789]{0,1} # 0 or 1 of numeric.
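
The equivalences above can be checked by running both forms over the same text. This sketch uses an arbitrary sample string:

```python
import re

sample = 'p pp ppp 7 42 2017 x'

# Each pair holds a short form and the equivalent brace form.
pairs = [
    (r'p*', r'p{0,}'),
    (r'p+', r'p{1,}'),
    (r'p?', r'p{0,1}'),
    (r'[0123456789]+', r'[0123456789]{1,}'),
]

for short, braced in pairs:
    assert re.findall(short, sample) == re.findall(braced, sample)

print(re.findall(r'p+', sample))
print(re.findall(r'[0123456789]+', sample))
```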

Special characters '^', '$'

Special character '^' anchors the search at the beginning of the string.

>>> m = re.search(r'234', '        2345 9876    ') ; m
<_sre.SRE_Match object; span=(8, 11), match='234'>
>>> m = re.search(r'^234', '        2345 9876    ') ; m
>>> # No match. '234' not at beginning of string.
>>> m = re.search(r'^\ {1,}234', '        2345 9876    ') ; m # '\ {1,}' 1 or more spaces allowed at beginning of string.
<_sre.SRE_Match object; span=(0, 11), match='        234'>
>>> m = re.search(r'^\ +234', '        2345 9876    ') ; m # Same as above.
<_sre.SRE_Match object; span=(0, 11), match='        234'>
>>>

Special character '$' anchors the search at the end of the string.

>>> m = re.search(r'876', '        2345 9876    ') ; m
<_sre.SRE_Match object; span=(14, 17), match='876'>
>>> m = re.search(r'876$', '        2345 9876    ') ; m
>>> # No match. '876' not at end of string.
>>> m = re.search(r'876\ +$', '        2345 9876    ') ; m # '\ +' 1 or more spaces allowed at end of string.
<_sre.SRE_Match object; span=(14, 21), match='876    '>
>>>

When both '^' and '$' are used, the regular expression must match the whole string.

>>> m = re.search(r'2345 9876', '        2345 9876    ') ; m
<_sre.SRE_Match object; span=(8, 17), match='2345 9876'>
>>> m = re.search(r'^2345 9876$', '        2345 9876    ') ; m
>>> # No match.
>>> m = re.search(r'^\ *2345 9876\ *$', '        2345 9876    ') ; m # Regular expression permits spaces at beginning and end of string.
<_sre.SRE_Match object; span=(0, 21), match='        2345 9876    '>
>>>
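
Python also provides re.fullmatch, which requires the pattern to match the whole string without writing the anchors. One difference worth knowing: '$' also matches just before a newline at the end of the string, while re.fullmatch makes no such allowance. A small sketch:

```python
import re

text = '        2345 9876    '

anchored = re.search(r'^\ *2345 9876\ *$', text)
whole = re.fullmatch(r'\ *2345 9876\ *', text)
print(anchored.span(), whole.span())

# '$' tolerates one trailing newline; re.fullmatch does not.
with_newline = re.search(r'^2345$', '2345\n')
strict = re.fullmatch(r'2345', '2345\n')
print(with_newline is not None, strict is None)
```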

Special character '^' within a set

When the caret is the first character in a set, it negates the whole set.

[0123456789] # Any numeric character.
[^0123456789] # Any non-numeric character.

Special character '.'

In the default mode, '.' matches any character except a newline. It is equivalent to:

>>> not_new_line = r'[^' + '\n' + r']' ; not_new_line 
'[^\n]'
>>>

Display all non-empty lines in the string s1:

>>> s1 = '  \n \t  \n   line 1a\n  line 1b\n\n\t  \n  line 2a\n    line 2b   \n  \t\t\n'
>>> 
>>> pattern = not_new_line + one_or_more ; pattern
'[^\n]{1,}'
>>> 
>>> print ('\n'.join([ str(p) for p in re.finditer(pattern, s1 ) ]))
<_sre.SRE_Match object; span=(0, 2), match='  '>
<_sre.SRE_Match object; span=(3, 7), match=' \t  '>
<_sre.SRE_Match object; span=(8, 18), match='   line 1a'>
<_sre.SRE_Match object; span=(19, 28), match='  line 1b'>
<_sre.SRE_Match object; span=(30, 33), match='\t  '>
<_sre.SRE_Match object; span=(34, 43), match='  line 2a'>
<_sre.SRE_Match object; span=(44, 58), match='    line 2b   '>
<_sre.SRE_Match object; span=(59, 63), match='  \t\t'>
>>> 
>>> print ('\n'.join([ str(p.span()) for p in re.finditer(pattern, s1 ) ]))
(0, 2)
(3, 7)
(8, 18)
(19, 28)
(30, 33)
(34, 43)
(44, 58)
(59, 63)
>>> 
>>> print ('\n'.join([ p.group() for p in re.finditer(pattern, s1 ) ]))
  
 	  
   line 1a
  line 1b
	  
  line 2a
    line 2b   
  		
>>>
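
The dot itself can be used in place of not_new_line, and the re.DOTALL flag makes it match a newline as well. A brief sketch with a short sample string:

```python
import re

text = 'line 1a\nline 1b\n'

# Default mode: '.' stops at each newline.
per_line = re.findall(r'.+', text)
# Same result with the explicit set.
explicit = re.findall('[^\n]+', text)
# re.DOTALL: '.' also matches '\n', so one match covers everything.
everything = re.findall(r'.+', text, re.DOTALL)

print(per_line)
print(everything)
```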

Escaped special characters

\s and \S

Special character \s means any white space character. Special character \S means any non white space character. Of the first 256 characters, the following match \s:

>>> s1 = ''.join([chr(p) for p in range(256)])
>>> print ('\n'.join([ str(p) for p in re.finditer(r'\s+', s1 ) ]))
<_sre.SRE_Match object; span=(9, 14), match='\t\n\x0b\x0c\r'>
<_sre.SRE_Match object; span=(28, 33), match='\x1c\x1d\x1e\x1f '>
<_sre.SRE_Match object; span=(133, 134), match='\x85'>
<_sre.SRE_Match object; span=(160, 161), match='\xa0'>
>>>

\d and \D

Special character \d means any decimal digit. Special character \D means any character that is not a decimal digit. Of the first 256 characters, the following match \d:

>>> s1 = ''.join([chr(p) for p in range(256)])
>>> 
>>> print ('\n'.join([ str(p) for p in re.finditer(r'\d+', s1 ) ]))
<_sre.SRE_Match object; span=(48, 58), match='0123456789'>
>>>
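
The listing above covers only the first 256 characters. For str patterns, '\d' also matches decimal digits from other scripts, and the re.ASCII flag limits it to [0-9]. A small sketch, using two Arabic-Indic digits as the sample:

```python
import re

# The last two characters are ARABIC-INDIC DIGIT THREE and ARABIC-INDIC DIGIT FOUR.
text = 'room 42, \u0663\u0664'

unicode_digits = re.findall(r'\d+', text)
ascii_digits = re.findall(r'\d+', text, re.ASCII)

print(unicode_digits)
print(ascii_digits)
```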

\w and \W

Special character \w means any word character: letters, digits and the underscore, including letters and digits outside ASCII. Special character \W means any character that is not a word character. Of the first 256 characters, the following 134 match \w:

>>> s1 = ''.join([chr(p) for p in range(256)])
>>> print ('\n'.join([ str(p) for p in re.finditer(r'\w+', s1 ) ]))
<_sre.SRE_Match object; span=(48, 58), match='0123456789'>
<_sre.SRE_Match object; span=(65, 91), match='ABCDEFGHIJKLMNOPQRSTUVWXYZ'>
<_sre.SRE_Match object; span=(95, 96), match='_'>
<_sre.SRE_Match object; span=(97, 123), match='abcdefghijklmnopqrstuvwxyz'>
<_sre.SRE_Match object; span=(170, 171), match='ª'>
<_sre.SRE_Match object; span=(178, 180), match='²³'>
<_sre.SRE_Match object; span=(181, 182), match='µ'>
<_sre.SRE_Match object; span=(185, 187), match='¹º'>
<_sre.SRE_Match object; span=(188, 191), match='¼½¾'>
<_sre.SRE_Match object; span=(192, 215), match='ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ'>
<_sre.SRE_Match object; span=(216, 247), match='ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö'>
<_sre.SRE_Match object; span=(248, 256), match='øùúûüýþÿ'>
>>>

Some words in English carry an accent: 'fiancée', 'café', 'naïve'. Special character '\w' matches all letters in these words.

>>> [p for p in ('fiancée', 'café', 'naïve') if re.search(r'^\w+$', p) ]
['fiancée', 'café', 'naïve']
>>>

Matching '^', '$', '*', '+', '?' literally

Within a set (within brackets '[]') most special characters lose their special significance. To search for a '$' literally, search for r'[$]' (outside a set, a backslash has the same effect: r'\$'):

>>> pattern = r'[$]' + one_or_more ; pattern 
'[$]+' # One or more of '$' literally.
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$?????' )
['$$$$$']
>>> 
>>> pattern = r'[*]' + one_or_more + r'[$]' + any; pattern
'[*]+[$]*' # One or more of '*' and any number of '$'.
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$?????' )
['***', '**', '**$$$$$']
>>>

Characters listed individually within brackets '[]':

>>> pattern = r'[2aX?*$]' ; pattern
'[2aX?*$]' # '2' or 'a' or 'X' or '?' or '*' or '$'.
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$?????' )
['a', '*', '*', '*', '2', '*', '*', 'X', '?', '?', '*', '*', '$', '$', '$', '$', '$', '?', '?', '?', '?', '?']
>>> 
>>> pattern = r'[2aX?*$]' + one_or_more ; pattern
'[2aX?*$]+' # One or more of '2' or 'a' or 'X' or '?' or '*' or '$'.
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$?????' )
['a', '***', '2', '**X', '??', '**$$$$$?????']
>>>

The caret '^' has a special meaning when it is the first character in a set: it means match any character not in the set. To match a caret literally, escape it:

>>> pattern = r'[\^]' + one_or_more ; pattern
'[\\^]+'
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$??^^???' )
['^^']
>>>

Or put it anywhere but first in the set:

>>> pattern = r'[$?^]' + one_or_more ; pattern
'[$?^]+'
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$??^^???' )
['??', '$$$$$??^^???']
>>>

Characters that may have a special meaning within a set include '^', ']', '-', '|', '\'. For consistent results every time, define the characters plainly and let re.escape(....) escape them:

>>> pattern = r'123^]?*\ '[:-1] ; pattern
'123^]?*\\' # Backslash at end.
>>> 
>>> pattern = r'[' + re.escape(pattern) + r']' ; pattern
'[123\\^\\]\\?\\*\\\\]'
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$??^^??? [[[ ]]] ((( ))) {{{ }}}} ||| \\\ ' )
['*', '*', '*', '1', '2', '3', '*', '*', '?', '?', '*', '*', '?', '?', '^', '^', '?', '?', '?', ']', ']', ']', '\\', '\\', '\\']
>>> 
>>> pattern = pattern + one_or_more ; pattern
'[123\\^\\]\\?\\*\\\\]+'
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$??^^??? [[[ ]]] ((( ))) {{{ }}}} ||| \\\ ' )
['***123**', '??', '**', '??^^???', ']]]', '\\\\\\']
>>> 
>>> pattern = r'^3]?*}{)(\ '[:-1] ; pattern
'^3]?*}{)(\\'
>>> pattern = r'[' + re.escape(pattern) + r']' + one_or_more ; pattern
'[\\^3\\]\\?\\*\\}\\{\\)\\(\\\\]+'
>>> re.findall (pattern, r'abc***123**XYZ??q**$$$$$??^^??? [[[ ]]] ((( ))) {{{ }}}} ||| \\\ ' )
['***', '3**', '??', '**', '??^^???', ']]]', '(((', ')))', '{{{', '}}}}', '\\\\\\']
>>> 
>>> pattern = r""" '" """[1:3] + r'^3]?}\ '[:-1]  ; pattern
'\'"^3]?}\\'
>>> pattern = r'[' + re.escape(pattern) + r']' + one_or_more ; pattern
'[\\\'\\"\\^3\\]\\?\\}\\\\]+' # One or more of "'" or '"' or '^' or '3' or ']' or '?' or '}' or backslash.
>>> re.findall (pattern, r"""abc***123**XYZ??q**$$$$$??^^??? [[[ ]]] ((( )))'''' {{{ }}}} ||| \\\ """ )
['3', '??', '??^^???', ']]]', "''''", '}}}}', '\\\\\\']
>>> 
>>> pattern = r""" '" """[1:3] + r'^3]?}\ '[:-1]  ; pattern # Carefully define the pattern.
'\'"^3]?}\\'
>>> pattern = r'[^' + re.escape(pattern) + r']' + one_or_more ; pattern # Build the regular expression.
'[^\\\'\\"\\^3\\]\\?\\}\\\\]+' # One or more of any character that is not ("'" or '"' or '^' or '3' or ']' or '?' or '}' or backslash).
>>> re.findall (pattern, r"""abc***123**XYZ??q**$$$$$??^^??? [[[ ]]] ((( )))'''' {{{ }}}} ||| \\\ """ )
['abc***12', '**XYZ', 'q**$$$$$', ' [[[ ', ' ((( )))', ' {{{ ', ' ||| ', ' ']
>>>

You can see that regular expressions can become complicated and unintelligible quickly.


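Pattern escaped (output from Python 3.6 or earlier; since Python 3.7, re.escape escapes only characters that can be special in a regular expression, so the two quote characters are left unescaped):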

>>> pattern = r""" '" """[1:3] + r'^3]?}\ '[:-1]  ; pattern
'\'"^3]?}\\'
>>> L1 = list(pattern) ; L1
["'", '"', '^', '3', ']', '?', '}', '\\'] # Each member of L1 is one character.
>>> 
>>> pattern_escaped = re.escape(pattern) ; pattern_escaped 
'\\\'\\"\\^3\\]\\?\\}\\\\'
>>> r'''\'\"\^3\]\?\}\\''' == pattern_escaped == r"\'" + r'\"' + r'\^' + '3' + r'\]' + r'\?' + r'\}' + r'\\'
True # In Python 3.6 and earlier, all characters in pattern except A-Za-z0-9_ are escaped.
>>>
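
The approach above (define the characters plainly, then let re.escape produce a safe set) can be wrapped in a small helper. The function name char_set and its negate parameter are hypothetical, introduced only for this sketch, and chars is assumed to be non-empty:

```python
import re

def char_set(chars, negate=False):
    """Build a set that matches (or, with negate=True, excludes) each character in chars."""
    prefix = '[^' if negate else '['
    return prefix + re.escape(chars) + ']'

text = r'abc***123**XYZ??q**$$$$$??^^??? [[[ ]]]'

# One or more of '^' or ']'.
brackets_and_carets = char_set('^]') + '+'
print(re.findall(brackets_and_carets, text))

# One or more of any character that is not listed.
not_listed = char_set('^]?*$ [', negate=True) + '+'
print(re.findall(not_listed, text))
```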

Advanced Regular Expressions

Matching dates

A date has format 7/4/1776 or July 4, 1776. Liberal use of white space is acceptable, as is a month name shortened to 3 or more letters. The following are acceptable dates:

3 /9   / 1923
11/ 22/  1987
Aug23,2017
Septe 4  ,  2001

The complete regular expression will be pattern1 | pattern2. The names upper and lower are the sets defined under Alpha-numeric.

pattern1 = r'''             
\b        # word boundary   
\d{1,2}   # 1 or 2 numeric  
\s*       # any white       
/                           
\s*       # any white       
\d{1,2}   # 1 or 2 numeric  
\s*       # any white       
/                           
\s*       # any white       
\d{4}     # 4 numeric       
\b        # word boundary   
'''

pattern2 = r'''                     
\b        # word boundary           
''' + upper + lower + r'''{2,} # upper + 2 or more lower               
\s*       # any white               
\d{1,2}   # 1 or 2 numeric          
\s*       # any white               
,                                   
\s*       # any white               
\d{4}     # 4 numeric               
\b        # word boundary           
'''

pattern = pattern1 + '|' + pattern2

print (pattern)
\b        # word boundary
\d{1,2}   # 1 or 2 numeric
\s*       # any white
/
\s*       # any white
\d{1,2}   # 1 or 2 numeric
\s*       # any white
/
\s*       # any white
\d{4}     # 4 numeric
\b        # word boundary
|
\b        # word boundary
[ABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyz]{2,} # upper + 2 or more lower
\s*       # any white
\d{1,2}   # 1 or 2 numeric
\s*       # any white
,
\s*       # any white
\d{4}     # 4 numeric
\b        # word boundary

The above verbose format is much more readable than:

r'''\b\d{1,2}\s*/\s*\d{1,2}\s*/\s*\d{4}\b|\b[ABCDEFGHIJKLMNOPQRSTUVWXYZ][abcdefghijklmnopqrstuvwxyz]{2,}\s*\d{1,2}\s*,\s*\d{4}\b'''

Searching the string s3 for dates:

s3 = '''   7/4 / 1776   3/2/2001     12  / 19
 / 2007
  Jul4,1776  July 4 , 1776    xbcvgdf  ,,
 vnhgb   August13  ,2003...  Nove 22,  2007,,,February14,1776  '''

print ('\n\n', '\n'.join([ str(p.group()) for p in re.finditer(pattern, s3 , re.VERBOSE) ]), sep='')
7/4 / 1776
3/2/2001
12  / 19
 / 2007
Jul4,1776
July 4 , 1776
August13  ,2003
Nove 22,  2007
February14,1776

Matching integers and floats

Integers

Examples of integers are: 123, +123, -123. Python's regular expressions scan strings, therefore int in this context means string representing int. Python's eval function tolerates some white space, therefore the following are examples of int: ' 123 ', ' +123', '-123 ', ' + 123 '.

Do not rely on Python's eval function to determine what a string represents:

>>> date = '12/3/4' ; eval(date) ; isinstance(eval(date), float)
1.0
True
>>>

Searching for integers:

>>> print (pattern)
                        
^         # anchor at beginning       
\s*       # any white                 
[+-]?     # 0 or 1 of ('+' or '-')    
\s*       # any white                 
\d+       # 1 or more numeric         
\s*       # any white                 
$         # anchor at end             

>>> re.search (pattern, '          123           ', re.VERBOSE)
<_sre.SRE_Match object; span=(0, 24), match='          123           '>
>>> re.search (pattern, '       -   123           ', re.VERBOSE)
<_sre.SRE_Match object; span=(0, 25), match='       -   123           '>
>>> re.search (pattern, '       -   1 23           ', re.VERBOSE)
>>> # No match.

Method str.strip() produces (almost) a clean int:

>>> '  +13   '.strip()
'+13'
>>> '  +     13   '.strip()
'+     13'
>>>

Method str.replace() hides errors:

>>> ' + 12 34   '.replace(' ', '') # Error in input.
'+1234'                            # Good output.
>>>

To produce a clean int:

>>> print (pattern)
                                               
^         # anchor at beginning                              
\s*       # any white                                        
([+-]?)   # 0 or 1 of ('+' or '-'). Notice the '()' around the '[+-]?'.         
\s*       # any white                                        
(\d+)     # 1 or more numeric. Notice the '()' around the '\d+'.              
\s*       # any white                                        
$         # anchor at end                                    

>>> re.search (pattern, '          123           ', re.VERBOSE)
<_sre.SRE_Match object; span=(0, 24), match='          123           '>
>>> m = re.search (pattern, '        -  123           ', re.VERBOSE) ; m
<_sre.SRE_Match object; span=(0, 25), match='        -  123           '>
>>> m.group()
'        -  123           '
>>> m.group(0)
'        -  123           '
>>> m.group(1,2)
('-', '123') # Values that match the expressions in '()' above.
>>> ''.join(m.group(1,2))
'-123'
>>>
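
The steps above can be gathered into a small helper. This is only a sketch: the name clean_int is hypothetical, and the pattern is the one shown above written on a single line.

```python
import re

# Same pattern as above, without re.VERBOSE.
CLEAN_INT = re.compile(r'^\s*([+-]?)\s*(\d+)\s*$')

def clean_int(text):
    """Return the sign and digits joined together, or None if text is not an int."""
    m = CLEAN_INT.search(text)
    if m is None:
        return None
    return ''.join(m.group(1, 2))

print(clean_int('        -  123           '))   # -123
print(clean_int(' + 12 34   '))                 # None, the input error is not hidden
print(int(clean_int('  +     13   ')))          # 13
```

Unlike str.replace(), the helper rejects ' + 12 34 ' instead of silently turning it into '+1234'.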

Floats

Examples of point floats: '3.', '.3', '3.3', ' - .3 ', ' + 4.4 '

Examples of exponent floats: ' 3e4 ', '3.E3', '.3e-3', '3.3E-3', ' - .3e+2 ', ' + 4.4E+11 '

An exponent float can contain an int as mantissa: '3e4'.

If a float is not an exponent float, it must be a point float. A point float contains exactly one '.' and at least one digit.

Matching a point float:
>>> print (pattern)
                                              
# for point float                                           
^         # anchor at beginning                             
\s*       # any white                                       
([+-]?)   # 0 or 1 of ('+' or '-'). Notice the '()' around '[+-]?'.        
\s*       # any white                                       
(\.\d+|\d+\.|\d+\.\d+)     # .3 or 3. or 3.3                
\s*       # any white                                       
$         # anchor at end                                   

>>>
>>> m = re.search (pattern, '          .123           ', re.VERBOSE) ; m
<_sre.SRE_Match object; span=(0, 25), match='          .123           '>
>>> m.group(1,2)
('', '.123')
>>> m = re.search (pattern, '      -    0.123           ', re.VERBOSE) ; m
<_sre.SRE_Match object; span=(0, 27), match='      -    0.123           '>
>>> m.group(1,2)
('-', '0.123')
>>>
Matching an exponent float:
>>> print (patternE)
                                        
# for exponent float                                   
^         # anchor at beginning                        
\s*       # any white                                  
([+-]?)   # 0 or 1 of ('+' or '-'). Notice the '()' around '[+-]?'.   
\s*       # any white                                  
(\.?\d+|\d+\.|\d+\.\d+)     # 3 or .3 or 3. or 3.3     
[eE]                                                   
([+-]?\d+) # exponent                                  
\s*       # any white                                  
$         # anchor at end                              

>>>
>>> m = re.search (patternE, '          . 123           ', re.VERBOSE) ; m
>>> # No match.
>>> m = re.search (patternE, '      -    0.123e+2           ', re.VERBOSE) ; m
<_sre.SRE_Match object; span=(0, 30), match='      -    0.123e+2           '>
>>> m.group(1,2,3)
('-', '0.123', '+2')
>>> 
>>> m = re.search (patternE, '      -    3.3E-12           ', re.VERBOSE) ; m
<_sre.SRE_Match object; span=(0, 29), match='      -    3.3E-12           '>
>>> m.group(1,2,3)
('-', '3.3', '-12')
>>> m.group(1) + m.group(2) + 'e' + m.group(3)
'-3.3e-12'
>>> 
>>> [ m.group(p) for p in range(1, m.lastindex+1) ]
['-', '3.3', '-12']
>>>
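
The two float patterns can be tried one after the other. The sketch below is an illustration under assumptions: the names POINT, EXPONENT and clean_float are hypothetical, and the patterns are the ones shown above written on a single line.

```python
import re

POINT = re.compile(r'^\s*([+-]?)\s*(\.\d+|\d+\.|\d+\.\d+)\s*$')
EXPONENT = re.compile(r'^\s*([+-]?)\s*(\.?\d+|\d+\.|\d+\.\d+)[eE]([+-]?\d+)\s*$')

def clean_float(text):
    """Return a clean float string, or None if text is neither kind of float."""
    m = EXPONENT.search(text)
    if m:
        sign, mantissa, exponent = m.group(1, 2, 3)
        return sign + mantissa + 'e' + exponent
    m = POINT.search(text)
    if m:
        return ''.join(m.group(1, 2))
    return None

print(clean_float(' - .3e+2 '))        # -.3e+2
print(float(clean_float(' + 4.4 ')))   # 4.4
print(clean_float('3e4'))              # 3e4, int as mantissa
print(clean_float(' 123 '))            # None, an int is not a float here
```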

Decoding a bytes object

L2 contains the contents of a bytes object presented in binary format:

L2 = (
['11001110', '10010010', '11001110', '10111001', '11001110', '10111010', '11001110'] +
['10111001', '00100000', '11101100', '10011100', '10000100', '11101101', '10000010'] +
['10100100', '11101011', '10110000', '10110000', '11101100', '10011011', '10000000'] +
['00100000', '01010111', '01101001', '01101011', '01101001'] )

Produce list L4 that contains the contents of L2 grouped into sequences that conform to standard UTF-8:

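import re

L3 = []

# Scan L2 from the last byte to the first. Each lead byte collects the
# continuation bytes that follow it; continuation bytes themselves are skipped.
for p in range (len(L2)-1,-1,-1) :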
    if re.search(r'^0[01]{7}$', L2[p]) :
        L3 += [L2[p]]
        continue

    if re.search(r'^110[01]{5}$', L2[p]) :
        if p+1 >= len(L2) : exit (99)
        if re.search(r'^10[01]{6}$', L2[p+1]) :
            L3 += [L2[p] + L2[p+1]]
            continue
        exit (98)

    if re.search(r'^1110[01]{4}$', L2[p]) :
        if p+2 >= len(L2) : exit (97)
        if re.search(r'^10[01]{6}$', L2[p+1]) and re.search(r'^10[01]{6}$', L2[p+2]) :
            L3 += [L2[p] + L2[p+1] + L2[p+2]]
            continue
        exit (96)

    if re.search(r'^10[01]{6}$', L2[p]) :
        if p == 0 : exit (95)
        continue

    exit (94)

L4 = L3[::-1]

print (
'''
L4 = (
{} + # Greek
{} + # '\\x20' is a space.
{} + # Korean
{} + # '\\x20' is a space.
{} ) # English
'''.format(L4[0:4], L4[4:5], L4[5:9], L4[9:10], L4[10:])
)

L4 = (
['1100111010010010', '1100111010111001', '1100111010111010', '1100111010111001'] + # Greek
['00100000'] + # '\x20' is a space.
['111011001001110010000100', '111011011000001010100100', '111010111011000010110000', '111011001001101110000000'] + # Korean
['00100000'] + # '\x20' is a space.
['01010111', '01101001', '01101011', '01101001'] ) # English

Decode L4:

L5 = []

for p in range (0, len(L4)) :
    if (len(L4[p]) == 8) :
        m = re.search (r'^0[01]{7}$', L4[p])
        if not m : exit (89)
        I1 = int(L4[p], base=2) ; L5 += chr(I1)
        continue

    if (len(L4[p]) == 16) :
        m = re.search (r'^110([01]{5})10([01]{6})$', L4[p])
        if not m : exit (88)
        if m.lastindex != 2 : exit (87)
        I1 = int(m.group(1) + m.group(2), 2) ; L5 += chr(I1)
        continue

    if (len(L4[p]) == 24) :
        m = re.search (r'^  1110  ([01]{4})  10  ([01]{6})  10  ([01]{6})  $', L4[p], re.VERBOSE)
        if not m : exit (86)
        if m.lastindex != 3 : exit (85)
        I1 = int(m.group(1) + m.group(2) + m.group(3), 2) ; L5 += chr(I1)
        continue

    exit (84)

print ('L5 =', L5)

exit (0)

L5 = ['Β', 'ι', 'κ', 'ι', ' ', '위', '키', '배', '움', ' ', 'W', 'i', 'k', 'i']
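
As a cross-check, the grouping can also be sketched with a single regular expression applied to the joined bit string, and the result compared with Python's built-in UTF-8 decoder. The names bits, utf8_sequence, sequences, raw and text are hypothetical. Like the code above, this sketch handles only 1-, 2- and 3-byte sequences.

```python
import re

L2 = (
['11001110', '10010010', '11001110', '10111001', '11001110', '10111010', '11001110'] +
['10111001', '00100000', '11101100', '10011100', '10000100', '11101101', '10000010'] +
['10100100', '11101011', '10110000', '10110000', '11101100', '10011011', '10000000'] +
['00100000', '01010111', '01101001', '01101011', '01101001'] )

bits = ''.join(L2)

utf8_sequence = re.compile(r'''
  0[01]{7}                        # 1-byte sequence
| 110[01]{5} 10[01]{6}            # 2-byte sequence
| 1110[01]{4} (?:10[01]{6}){2}    # 3-byte sequence
''', re.VERBOSE)

# findall() returns whole matches because the only group is non-capturing.
sequences = utf8_sequence.findall(bits)

# The matches must cover every bit, otherwise the input is malformed.
if sum(len(s) for s in sequences) != len(bits):
    raise ValueError('malformed input')

# Let the built-in decoder produce the text for comparison.
raw = bytes(int(byte, 2) for byte in L2)
text = raw.decode('utf-8')

print(len(sequences))
print(text)
```

Because every alternative consumes a whole number of bytes and the search starts at position 0, the matches stay aligned on byte boundaries as long as they are contiguous, which the length check confirms.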


Compiling regular expressions

If a regular expression is complicated or is to be used frequently, it can be compiled to produce a pattern object.

>>> print (pattern)
                                  
([+-]{1}   # 1 of ('+' or '-').
\s*        # any white
\d+)       # 1 or more numeric.
|
(\d+)      # 1 or more numeric.

>>>

The regular expression pattern represents an integer. Produce a pattern object called 'integer'.

>>> integer = re.compile(pattern, re.VERBOSE)

The compiled pattern called 'integer' has methods similar to re.search(), re.finditer() and re.split():

>>> s1 = '    123       -  456((     !!+++    2345 !! -2##'

Displaying all matches

Displaying all matches manually, one after the other.

>>> integer.search(s1)
<_sre.SRE_Match object; span=(4, 7), match='123'>
>>> integer.search(s1[7:])
<_sre.SRE_Match object; span=(7, 13), match='-  456'>
>>> integer.search(s1[7:][13:])
<_sre.SRE_Match object; span=(11, 20), match='+    2345'>
>>> integer.search(s1[7:][13:][20:])
<_sre.SRE_Match object; span=(4, 6), match='-2'>
>>> integer.search(s1[7:][13:][20:][6:])
>>>

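The method integer.search(...) accepts the optional positional parameters pos and endpos, which limit the part of the string that is searched. Unlike the slices above, spans are reported relative to the whole string: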

>>> m = integer.search(s1) ; m
<_sre.SRE_Match object; span=(4, 7), match='123'>
>>> m = integer.search(s1, 7) ; m
<_sre.SRE_Match object; span=(14, 20), match='-  456'>
>>> m = integer.search(s1, 20) ; m
<_sre.SRE_Match object; span=(31, 40), match='+    2345'>
>>> m = integer.search(s1, m.span()[1]) ; m
<_sre.SRE_Match object; span=(44, 46), match='-2'>
>>> m = integer.search(s1, m.span()[1]) ; m
>>>
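
A minimal sketch of both optional parameters follows. The definition of pattern is assumed to be the verbose string printed above.

```python
import re

# Assumed definition of the pattern printed above.
pattern = r'''
([+-]{1}   # 1 of ('+' or '-').
\s*        # any white
\d+)       # 1 or more numeric.
|
(\d+)      # 1 or more numeric.
'''
integer = re.compile(pattern, re.VERBOSE)

s1 = '    123       -  456((     !!+++    2345 !! -2##'

# pos=7: the search starts at index 7, the span still refers to s1.
m = integer.search(s1, 7)
print(m.span(), repr(s1[m.start():m.end()]))

# endpos=6: only s1[0:6] is searched, so the match stops after '12'.
print(integer.search(s1, 0, 6).group())

# pos=7, endpos=17: the region holds a '-' but no digit, so there is no match.
print(integer.search(s1, 7, 17))
```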

Iterating through all matches

>>> print ( '\n'.join([str(p) for p in integer.finditer(s1)]) )
<_sre.SRE_Match object; span=(4, 7), match='123'>
<_sre.SRE_Match object; span=(14, 20), match='-  456'>
<_sre.SRE_Match object; span=(31, 40), match='+    2345'>
<_sre.SRE_Match object; span=(44, 46), match='-2'>
>>>

or:

v = 0

while True :
    m = integer.search(s1, v)
    if not m : break
    print (m)
    v = m.span()[1]

Output is same as above.

Splitting the string

Splitting the string s1:

Preserving the substrings that match

>>> L1 = integer.split(s1) ; L1
['    ', None, '123', '       ', '-  456', None, '((     !!++', '+    2345', None, ' !! ', '-2', None, '##']
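>>> # The pattern contains two groups in '()'. For each match, split() inserts the text of both groups;
>>> # the group that did not take part in the match is reported as None.
>>> L2 = [p for p in L1 if p is not None] 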
>>> print ('L2 =', L2)
L2 = ['    ', '123', '       ', '-  456', '((     !!++', '+    2345', ' !! ', '-2', '##']
>>> 
>>> s2 = ''.join(L2) ; s2
'    123       -  456((     !!+++    2345 !! -2##'
>>> s2 == s1
True
>>>
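
One way to avoid filtering the list, sketched here as an alternative and not as the lesson's method, is to put the whole alternation inside a single group. Every match then fills that group and split() returns no None. The name single_group is hypothetical.

```python
import re

s1 = '    123       -  456((     !!+++    2345 !! -2##'

# One pair of '()' around both alternatives instead of one pair around each.
single_group = re.compile(r'([+-]\s*\d+|\d+)')

parts = single_group.split(s1)
print(parts)
print(None in parts)          # False
print(''.join(parts) == s1)   # True
```

The substrings that match sit at the odd positions of parts, and the text between them at the even positions.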

Without preserving the substrings that match

In pattern_ below note that parentheses have been removed from the expressions r'[+-]{1}\s*\d+' and r'\d+'.

>>> print (pattern_)
                    
[+-]{1}   # 1 of ('+' or '-').    
\s*        # any white            
\d+       # 1 or more numeric.    
|                                 
\d+      # 1 or more numeric.     

>>> integer_ = re.compile(pattern_, re.VERBOSE)
>>> s1 = '    123       -  456((     !!+++    2345 !! -2##'
>>> L1 = integer_.split(s1) ; L1
['    ', '       ', '((     !!++', ' !! ', '##'] # L1 does not contain the substrings that match.
>>>

Replacing all substrings that match

Replacing all integers in string s1:

After splitting the string

>>> L2
['    ', '123', '       ', '-  456', '((     !!++', '+    2345', ' !! ', '-2', '##']
>>> L4 = ['INT_1', 'INT_2', 'INT_3', 'INT_4']
>>>

'123' is to be replaced by 'INT_1'.

'-  456' is to be replaced by 'INT_2'.

'+    2345' is to be replaced by 'INT_3'.

'-2' is to be replaced by 'INT_4'.

>>> L5 = [ (L2[p], L4[(p-1)>>1])[p & 1] for p in range (len(L2)) ] ; L5
['    ', 'INT_1', '       ', 'INT_2', '((     !!++', 'INT_3', ' !! ', 'INT_4', '##']
>>> 
>>> s3 = ''.join(L5) ; s3
'    INT_1       INT_2((     !!++INT_3 !! INT_4##'
>>>

Without splitting the string

print ("s2 =", "'"+s2+"'",'\n')

L1 = [m for m in integer.finditer(s2)]
print ( '\n'.join(['4 matches found:'] + [str(p) for p in L1]),'\n' )

print ("L4 =", L4,'\n')

for p in range (3,-1,-1) :
    m = L1[p]
    repl = L4[p]
    start,end = m.span()
    s2 = s2[:start] + repl + s2[end:]
    print (
'''s2 = '{}' after replacing span {}
'''.format(s2, m.span()),
end=''
)

s2 = '    123       -  456((     !!+++    2345 !! -2##'

4 matches found:
<_sre.SRE_Match object; span=(4, 7), match='123'>
<_sre.SRE_Match object; span=(14, 20), match='-  456'>
<_sre.SRE_Match object; span=(31, 40), match='+    2345'>
<_sre.SRE_Match object; span=(44, 46), match='-2'>

L4 = ['INT_1', 'INT_2', 'INT_3', 'INT_4']

s2 = '    123       -  456((     !!+++    2345 !! INT_4##' after replacing span (44, 46)
s2 = '    123       -  456((     !!++INT_3 !! INT_4##' after replacing span (31, 40)
s2 = '    123       INT_2((     !!++INT_3 !! INT_4##' after replacing span (14, 20)
s2 = '    INT_1       INT_2((     !!++INT_3 !! INT_4##' after replacing span (4, 7)
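
The same replacement can also be sketched with the sub() method of a compiled pattern, which accepts a function that is called once for each match, from left to right. The helper name number_integers is hypothetical, and the pattern is written without groups because none are needed here.

```python
import re
from itertools import count

integer = re.compile(r'[+-]\s*\d+|\d+')

def number_integers(text):
    """Replace each integer in text by 'INT_1', 'INT_2', and so on."""
    counter = count(1)   # a fresh counter for every call
    return integer.sub(lambda m: 'INT_{}'.format(next(counter)), text)

s1 = '    123       -  456((     !!+++    2345 !! -2##'
s3 = number_integers(s1)
print(repr(s3))
```

Because sub() builds a new string, there is no need to work from the last match to the first to keep the spans valid.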

Assignments

Simplify the pattern?

Under "Compiling regular expressions" above the expression for integer is:

>>> print (pattern)
                                  
([+-]{1}   # 1 of ('+' or '-').
\s*        # any white
\d+)       # 1 or more numeric.
|
(\d+)      # 1 or more numeric.

>>>

Why not simplify the expression and use:

>>> print (pattern)
                                  
([+-]{0,1}   # 0 or 1 of ('+' or '-').
\s*          # any white
\d+)         # 1 or more numeric.

>>>

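Because this expression, compiled again as 'integer', produces the following matches. With the sign optional, the white space in front of an unsigned integer becomes part of the match: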

>>> print ( '\n'.join([str(p) for p in integer.finditer(s1)]) )
<_sre.SRE_Match object; span=(0, 7), match='    123'> # This match is not considered accurate.
<_sre.SRE_Match object; span=(14, 20), match='-  456'>
<_sre.SRE_Match object; span=(31, 40), match='+    2345'>
<_sre.SRE_Match object; span=(44, 46), match='-2'>
>>>
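
The comparison can be reproduced with the short sketch below. The names original and simplified are hypothetical, and both patterns are written on a single line.

```python
import re

s1 = '    123       -  456((     !!+++    2345 !! -2##'

original = re.compile(r'([+-]\s*\d+)|(\d+)')   # sign required before white space
simplified = re.compile(r'([+-]?\s*\d+)')      # sign optional

print([m.group() for m in original.finditer(s1)])
print([m.group() for m in simplified.finditer(s1)])
```

Only the first match differs: the simplified pattern returns '    123' with its leading white space, because '\s*' may match even when no sign precedes it.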

Further Reading or Review

References

1. Python's documentation:

"6.2. re — Regular expression operations," "Regular Expression HOWTO"


2. Python's methods:


3. Python's built-in functions: