
Python Concepts/Strings

From Wikiversity

Objective


Lesson


Python Strings


The string is one of the simplest data types in Python. A string is created by putting either single quotation marks (') or double quotation marks (") at the beginning and end of a sequence of characters.


A simple string with single quotations:

>>> 'Hello!'
'Hello!'


A string can also use double quotations, which do not affect the string in any way.

>>> "Hello!"
'Hello!'


You can concatenate (join together in sequence) strings by using the plus sign (+).

>>> "Hello," + " world!"
'Hello, world!'


Adjacent string literals (strings written directly in the code, not held in variables) are also concatenated automatically, without the plus sign.

>>> "Wiki" "versity" "!"
'Wikiversity!'



Now, let's say you need to type a very long string that repeats itself. You can repeat words by using the multiplication operator (*).

>>> print("hey" * 3)
heyheyhey
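
Concatenation and repetition can be combined. The following is a small sketch (the variable names title, rule, and heading are only illustrative) that builds an underlined heading:

```python
# len() gives the number of characters in a string.
title = "Python Strings"
rule = "-" * len(title)          # repeat one character to match the title's length
heading = title + "\n" + rule    # "\n" starts a new line (escape characters are covered below)
print(heading)
```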


Examples of strings with "'" and '"' mixed:

>>> s = "This is John's shoe." ; s
"This is John's shoe."
>>> 
>>> 'He said "I will come."'
'He said "I will come."'
>>> 
>>> 'He said "I'll come."'
  File "<stdin>", line 1
    'He said "I'll come."'
                 ^
SyntaxError: invalid syntax
>>> 
>>> s = 'He said "I'    "'"    'll come."' ; s
'He said "I\'ll come."'    # more about escaped characters below.

Escape Characters


Some characters cannot be typed directly inside a string. They are written as escape sequences: two or more characters that together stand for a single character. In Python, an escape sequence begins with a backslash (\). For example, to put a new line in a string we can add a linefeed (\n).

>>> "Hello, world!\n"
'Hello, world!\n'


That's not really impressive, is it? To actually see that new line in action, use the built-in function print().

>>> print("Hello, world!")
Hello, world!
>>> print("Hello, world!\n")
Hello, world!

>>>


Here is a table of other escape characters (no need to memorize them, the most important one you'll use is \n).[1]

Escape Sequence   Meaning
\\                Backslash (\)
\'                Single quote (')
\"                Double quote (")
\a                ASCII Bell (BEL)
\b                ASCII Backspace (BS)
\f                ASCII Form-feed (FF)
\n                ASCII Linefeed (LF)
\r                ASCII Carriage Return (CR)
\t                ASCII Horizontal Tab (TAB)
\v                ASCII Vertical Tab (VT)
\ooo              Character with octal value ooo.
\xhh              Character with hex value hh.
\N{name}          Character named name in the Unicode database.
\uxxxx            Character with 16-bit hex value xxxx.
\Uxxxxxxxx        Character with 32-bit hex value xxxxxxxx.


Now you might start to see a problem with using \ in your string. Let's print a Windows directory name.

>>> print("C:\new folder")
C:
ew folder


See how \n was interpreted as a linefeed? To correct this, use the backslash escape character. Be careful when using backslashes; remember that two of them will only output one backslash.

>>> print("C:\\new folder")
C:\new folder


It could get tiresome to do that with very long directory strings, so there is a simpler way than doubling every backslash: use the prefix r or R. Putting this prefix immediately before the opening quotation mark tells Python that the string is a raw string ('r' stands for raw). In a raw string, backslashes are kept as ordinary characters and escape sequences are not interpreted.

>>> print(r"C:\new folder")
C:\new folder
>>>

You can easily assign strings to variables.

>>> spam = r"C:\new folder"
>>> print(spam)
C:\new folder
>>> s = 'He said "I\'ll come."\n' ; s ; print (s)    # escaping the single quote
'He said "I\'ll come."\n'
He said "I'll come."

>>> s = "He said \"I'll come.\"\n" ; s ; print (s)    # escaping the double quote
'He said "I\'ll come."\n'
He said "I'll come."

>>> s = r"He said \"I'll come.\"\n" ; s ; print (s)    # a raw string
'He said \\"I\'ll come.\\"\\n'
He said \"I'll come.\"\n
>>>

The difference between displaying a string and printing a string:

>>> s3 = r'\:' ; s3
'\\:'
>>> print("'{}'".format(s3))
'\:'
>>>
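
The interactive prompt displays the result of repr(), while print() uses str(). A short sketch makes the difference explicit; the variable name path is just an example:

```python
path = r"C:\new folder"   # raw string: the backslash is kept as typed

print(repr(path))         # what the interactive prompt would display
print(path)               # what print() displays
print(len(path))          # the backslash counts as a single character
```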

Newlines


Now, let's say you want to print some multi-line text. You could do it like this.

>>> print("Heya!\nHi!\nHello!\nWelcome!")
Heya!
Hi!
Hello!
Welcome!


A string like that could grow really long, but there is an easy trick that lets text span multiple lines without cramming it all onto one line. To do this we use three quotation marks (""" or ''') to start and end a string:

>>> print("""
... Heya!
... Hi!
... Hello!
... Welcome!
... """)

Heya!
Hi!
Hello!
Welcome!

>>>


That made things a lot easier. But we can still do better. By adding a backslash (\) immediately after the opening quotation marks we can remove the first linefeed.

>>> print("""\
... Heya!
... Hi!
... Hello!
... Welcome!""")
Heya!
Hi!
Hello!
Welcome!
>>>


Some of you may have noticed that print() automatically ends with an extra linefeed (\n). There is a way to bypass this.

>>> print("I love Wikiversity!", end="")
I love Wikiversity!>>>


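A string can also be continued over multiple physical lines, without inserting linefeeds, by ending each line inside the string with a backslash (the parentheses in this example are optional). Note that any leading spaces on the continuation line become part of the string: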

>>> spam = ("Hello,\
...  world!")
>>> print(spam)
Hello, world!


Use parentheses and concatenation for long strings:

>>> spam = ("hello, hello, hello, hello, hello, hello, hello, hello, "
...         "world world world world world world world world world world.")
>>> print (spam)
hello, hello, hello, hello, hello, hello, hello, hello, world world world world world world world world world world.
>>> spam = ("hello, hello, hello, hello, hello, hello, hello, hello, "    '\n'
...         "world world world world world world world world world world.")
>>> print (spam)
hello, hello, hello, hello, hello, hello, hello, hello, 
world world world world world world world world world world.
>>>

Formatting


Strings in Python can be formatted, much like strings in C. Formatting makes it easier to produce well-formatted output. You can format a string using a percent sign (%) or the newer curly-brace ({}) formatting. A simple example is given below.

>>> print("The number three (%d)." % 3)
The number three (3).


The above simple code uses special format characters (%d), which are interpreted as and replaced with a decimal integer. The percent sign (%) after the format string indicates that following the "%" is the data to be printed according to the format string. That can be a lot to take in. Let's demonstrate this a couple more times.

>>> name = "I8086"
>>> print("Copyright (c) %s 2014" % name)
Copyright (c) I8086 2014


This time, we used a different format code, %s, which inserts a string. You'll need to do some extra work if more than one value is to be inserted.

>>> name = "I8086"
>>> date = 2014
>>> print("Copyright (c) %s %d" % (name, date))
Copyright (c) I8086 2014


Notice the need for parentheses and the comma when there are two or more values to be inserted: together they form a tuple. If we don't add the parentheses around the format arguments, only the first value is passed to the format string and we get an error.

>>> name = "I8086"
>>> date = 2014
>>> print("Copyright (c) %s %d" % name, date)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: not enough arguments for format string


If you wish to print %d literally, this will work:

>>> print ("The characters %s are used as a numeric specification thus: %d" % ('%d', 1234))
The characters %d are used as a numeric specification thus: 1234
>>>

or

>>> print ("The characters %d are used as a numeric specification thus: {}".format(1234))
The characters %d are used as a numeric specification thus: 1234
>>>
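
Another option, when using the % operator, is to double the percent sign: %% in a format string produces a single literal %. A minimal sketch (the variable names line and rate are only illustrative):

```python
# "%%" is the escape for a literal percent sign in %-style formatting.
line = "The characters %%d are used as a numeric specification thus: %d" % 1234
print(line)

rate = "Completed: %d%%" % 75
print(rate)
```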


To keep you from guessing what is what, here is a table of the format types accepted by str.format(), with a little information about each. The % operator supports most of them, but not b or n.

Type  Meaning
s     String format. Default for strings.
b     Binary format.
c     Converts an integer to a Unicode character before it is formatted.
d     Decimal format.
o     Octal format.
x     Hexadecimal format. Uses lowercase letters a-f.
X     Hexadecimal format. Uses uppercase letters A-F.
n     Number format. This is the same as 'd', except that it uses the current locale setting to insert the appropriate number separator characters.[2]
e     Exponent notation. Prints a number in scientific notation. Default precision is 6.
E     Exponent notation. Same as 'e', except it prints 'E' in the notation.
f     Fixed point. Displays a fixed-point number. Default precision is 6.
F     Fixed point. Same as 'f', but converts nan to NAN and inf to INF.[3]
g     General format.
G     General format. Same as 'g', except it switches to 'E' if numbers are too large.
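
The type codes in the table can be tried out with the built-in format() function, which applies a single format specification to a single value. The values below are arbitrary examples:

```python
n = 255
x = 1234.5678

print(format(n, "d"))    # decimal
print(format(n, "b"))    # binary
print(format(n, "o"))    # octal
print(format(n, "x"))    # hexadecimal, lowercase
print(format(n, "X"))    # hexadecimal, uppercase
print(format(65, "c"))   # integer to character
print(format(x, "e"))    # exponent notation, default precision 6
print(format(x, ".2f"))  # fixed point with two decimal places
```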



Examples of formatted strings:

>>> name = 'Fred'
>>> 'He said his name is {0}.'.format(name) # name replaces {0}
'He said his name is Fred.'
>>> 'He said his name is {{{0}}}.'.format(name) # '{{}}' to print literal '{}'
'He said his name is {Fred}.'
>>> 
>>> '{0}, {1}, {2}'.format('a', 'b', 'c')
'a, b, c'
>>> '{0}, {1}, {1}, {0}, {2}'.format('a', 'b', 'c')
'a, b, b, a, c'
>>> 
>>> ('The complex number {0} contains a real part {0.real} '
...  'and an imaginary part {0.imag}.').format(11+3j)
'The complex number (11+3j) contains a real part 11.0 and an imaginary part 3.0.'
>>> 
>>> coordinates1 = (3, -2) ; coordinates2 = (-17, 13)
>>> 'X1 = {0[0]};  Y1 = {0[1]}; X2 = {1[0]}; Y2 = {1[1]}.'.format(coordinates1, coordinates2)
'X1 = 3;  Y1 = -2; X2 = -17; Y2 = 13.'
>>> 
>>> '{:<30}||'.format('left aligned')
'left aligned                  ||'
>>> '{:>30}||'.format('right aligned')
'                 right aligned||'
>>> '{:.>30}||'.format(' right aligned')  # right aligned with '.' fill
'................ right aligned||'
>>> '{:@^30}||'.format(' centered ')    # centered with '@' fill
'@@@@@@@@@@ centered @@@@@@@@@@||'
>>>
>>> "int: {0:d};  hex: {0:x};  oct: {0:o};  bin: {0:b}".format(23)  # conversions to different bases
'int: 23;  hex: 17;  oct: 27;  bin: 10111'
>>>
>>> d = '${0:03.2f}'.format(123.456) ; d
'$123.46'
>>> 'Gross receipts are {0:.>15}'.format( ' ' + d )
'Gross receipts are ....... $123.46'
>>> 
>>> 'Gross receipts for {}, {:d} {:.>15}'.format(    # sequence {0} {1} {2} is the default. 
...                                              'July', 2017,
...                                              ' ' + '${:03.2f}'.format(123.456)
...                                             )
'Gross receipts for July, 2017 ....... $123.46'
>>> 
>>> d = 1234567.5678 ; d
1234567.5678    # d is float
>>> d = '{:03.2f}'.format(d) ; d
'1234567.57'    # d is str
>>> d = '${:,}'.format( float(d) ) ; d    # input to this format statement is float
'$1,234,567.57'    # d is str formatted with '$' and commas
>>> 'Gross receipts for {}, {:d} {:.>20}'.format(
...                                              'July', 2017,
...                                              ' ' + d
...                                              )
'Gross receipts for July, 2017 ...... $1,234,567.57'
>>>
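
Alignment, grouping, and precision are often combined to print simple reports. A sketch with made-up data (the list receipts and its values are invented for illustration):

```python
receipts = [('July', 1234567.5678), ('August', 98765.4), ('September', 1200300.0)]

for month, amount in receipts:
    # Month left-aligned in 10 columns; amount right-aligned in 15 columns,
    # with thousands separators and two decimal places.
    line = '{:<10}{:>15,.2f}'.format(month, amount)
    print(line)
```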

Indexing


Strings in Python support indexing, which allows you to retrieve part of the string. It would be better to show you some indexing before we actually tell you how it's done, since you'll grasp the concept more easily.

>>> "Hello, world!"[1]
'e'
>>> spam = "Hello, world!"
>>> spam[1]
'e'


By putting the index number inside brackets ([]), you can extract a character from a string. But what magic numbers correspond to the characters? Indexing in Python starts at 0, so the maximum index of a string is one less than its length. Let's try and index a string beyond its limits.

>>> spam = "abc"
>>> spam[3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range


Here's a little chart of "Hello, world!"'s character positions.

Index        0   1   2   3   4   5   6   7   8   9  10  11  12
Character    H   e   l   l   o   , ' '   w   o   r   l   d   !


Hopefully that chart above helped to visually clarify some things about indexing. Now that we know the formula for the last character in a string, we should be able to get that character.

>>> eggs = "Hello, world!"
>>> eggs[len(eggs)-1]
'!'



In the above code, we used the formula, string length minus one, to get the last character of a string. By using the built-in function len(), we can get the length of a string. In this instance, len() returns 13, which we reduce by 1, resulting in 12. This can be tedious when you need to do it over and over again. Luckily, Python has a special indexing method that allows you to get the last character of a string without needing to know the string's length. By using negative numbers, we can index from right to left instead of left to right.

>>> spam = "I love Wikiversity!"
>>> spam[-1]
'!'
>>> spam[-2]
'y'



There is a table below showing the indexing number corresponding to the character. Take some time to study the table.

Index      -19 -18 -17 -16 -15 -14 -13 -12 -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1
Character    I ' '   l   o   v   e ' '   W   i   k   i   v   e   r   s   i   t   y   !


It is important to understand that strings are immutable, which means that their content cannot be changed in place. An immutable object has a fixed value. The only way to give a variable a different string is to assign a new string to it.

>>> spam = "Hello,"
>>> spam = spam + " world!"
>>> spam
'Hello, world!'


In the above example, spam is re-assigned to a new string. So what does this have to do with indexing? The same rule applies: a character at a given index cannot be assigned a new value. The example below will help clarify this concept.

>>> spam = "Hello, world!"
>>> spam[2] = "y"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> spam[6] = " Py-"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment


Replacing part of a string therefore takes a little extra work: build a new string from slices of the old one and assign it to the variable. If you aren't familiar with slicing, it is taught in the next section. You'll probably want to come back to this after you have read that section.

>>> spam = "Hello, world!"
>>> spam = spam[:2] + "y" + spam[3:]
>>> spam
'Heylo, world!'
>>> spam = "Hello, world!"
>>> spam = spam[:6] + " Py-" + spam[7:]
>>> spam
'Hello, Py-world!'
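
The same pattern can be wrapped in a small helper. The function name replace_at is hypothetical, not part of Python; it is only a sketch of building a new string from slices:

```python
def replace_at(text, index, replacement):
    """Return a new string in which the character at index is replaced.

    The original string is not modified, because strings are immutable.
    """
    return text[:index] + replacement + text[index + 1:]

spam = "Hello, world!"
print(replace_at(spam, 2, "y"))      # Heylo, world!
print(replace_at(spam, 6, " Py-"))   # Hello, Py-world!
print(spam)                          # Hello, world! (unchanged)
```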

Slicing


Slicing is an important concept that you'll be using in Python. Slicing allows you to extract a substring from a string. A substring is part of a string or a string within a string, so "I", "love", and "Python" are all substrings of "I love Python.". When you slice in Python, you'll need to remember that the colon (:) is important. It would be better to show you than to tell you right away how to slice strings.

>>> spam = "I love Python."
>>> spam[0:1]
'I'
>>> spam[2:6]
'love'
>>> spam[7:13]
'Python'


As you can see, slicing builds on Python's indexing concepts, which were taught in the previous section. spam[0:1] gets the substring starting with the character at 0 and ending with the character immediately before 1. So the first number is where you start your slice and the number after the colon (:) is where you end your slice. (The character at the number after the colon is not included.)

>>> spam[0:3] == ( spam[0] + spam[1] + spam[2] )
True
>>>

Slicing like this can be helpful, but what if you'd like to get the first few characters of a string, or everything after them? We could use the len() function to help us, but there is an easier way. If one of the numbers in the slice is omitted, the slice runs from the beginning or to the end of the string, depending on which number was omitted.

>>> eggs = "Hello, world!"
>>> eggs[:6]
'Hello,'
>>> eggs[6:]
' world!'


By slicing like this, we can remove or get part of a string without needing to know its length. As the example below confirms, eggs[:6] + eggs[6:] is equal to eggs. Because the end position of a slice is excluded, the character at position 6 appears only in the second slice, never in both.

>>> eggs = "Hello, world!"
>>> eggs[:6]+eggs[6:]
'Hello, world!'
>>> eggs[:6] + eggs[6:] == eggs
True

Indexing and slicing treat out-of-range positions differently. An attempt to index a string with a number larger than (or equal to) its length will produce an IndexError.

>>> "Hiya!"[10]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range


Slicing does not raise this error: out-of-range positions are clipped to the ends of the string, so the result may simply be shorter than requested, or empty ('').

>>> "Hiya!"[10:]
''
>>> "Hiya!"[10:11]
''
>>> "Hiya!"[:10]
'Hiya!'


The table below shows the indexing number corresponding to each character in the string "I love Wikiversity!"

Line 1 (indexing left to right):  0     1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16    17    18
Line 2 (indexing right to left):  -19   -18   -17   -16   -15   -14   -13   -12   -11   -10   -9    -8    -7    -6    -5    -4    -3    -2    -1
Line 3 ("I love Wikiversity!"):   I     ' '   l     o     v     e     ' '   W     i     k     i     v     e     r     s     i     t     y     !
Line 4 (slices s1,s2,s3,s4,s5):   s1[0] s1[1] s1[2] s2[0] s2[1] s2[2] s2[3] s2[4] s2[5] s3    s4[0] s4[1] s4[2] s4[3] s4[4] s4[5] s4[6] s4[7] s5


The following examples illustrate the use of slicing to extract a substring from the string or to compare substrings within the string:

>>> spam = "I love Wikiversity!"
>>> len(spam)
19
>>> spam[18]
'!'
>>> spam == spam[0:19]
True
>>> 
>>> s1 = spam[-19:3] ; s2 = spam[3:9] ; s3 = spam[-10:10] ; s4 = spam[10:-1] ; s5 = spam[18] # Line 4 above
>>> spam == s1 + s2 + s3 + s4 + s5
True
>>> 
>>> spam == s1[:] + s2[0:] + s3[-1:] + s4[0:8] + s5[-1:1]
True
>>> s4 == spam[10:18] == spam[10:-1] == spam[-9:18] == spam[-9:-1]
True
>>>

The expression s5 = spam[-1:-0] returns an empty string for s5 without error, because -0 is the same as 0 and the slice spam[-1:0] contains nothing. To capture the '!' for s5:

>>> s5 = spam[-1:] ; s5 == '!'
True
>>> s5 = spam[-1] ; s5 == '!'
True
>>> s5 = spam[18:] ; s5 == '!'
True
>>> s5 = spam[18:19] ; s5 == '!'
True
>>> s5 = spam[18] ; s5 == '!'
True
>>> 
>>> s5[-9:13] == '!' # slicing doesn't produce an error.
True
>>> s5[-9] # indexing produces an error.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range
>>> 
>>> s5[12] # indexing produces an error.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range
>>>


Suppose you have a string like '0b110001010110001111001000' and you want to format it to make it more readable: '0b_1100_0101_0110_0011_1100_1000'. The following code illustrates the use of slicing to do the job:

a = 0b110001010110001111001000  # a is int
b = bin(a)                      # b is string
print ('a = ', b)
c = b[2:]                       # c is substring of b
print ('c =   ', c)

insertPosition = len(c) -  4
print ('1)insertPosition = ' , insertPosition)

while insertPosition > 0 :
    print ('2)insertPosition = ' , insertPosition)
    c = c[:insertPosition] + '_' + c[insertPosition:]
    insertPosition -= 4

c = '0b_' + c
print ('c =', c)
(int(c,0) == a) or print ('error in conversion')    # check result

Execute the python code above and the result is:

a =  0b110001010110001111001000
c =    110001010110001111001000
1)insertPosition =  20
2)insertPosition =  20
2)insertPosition =  16
2)insertPosition =  12
2)insertPosition =  8
2)insertPosition =  4
c = 0b_1100_0101_0110_0011_1100_1000
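
Assuming Python 3.6 or later, the format specification mini-language offers a shortcut for this kind of grouping: the '_' option inserts an underscore every four digits for the binary, octal, and hexadecimal types. A brief sketch that reproduces the result above (the variable name grouped is only illustrative):

```python
a = 0b110001010110001111001000

grouped = format(a, '_b')        # '_' groups binary digits in fours
c = '0b_' + grouped
print(c)

# int() with base 0 accepts the prefix and the underscores, so the value round-trips.
assert int(c, 0) == a
```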

Encoding


So we know what a string is and how it works, but what really is a string? Underneath, every character is stored as a number, and a character set defines which number stands for which character. The most prominent character sets are ASCII and Unicode. ASCII is a simple set covering the unaccented Latin letters, digits, punctuation, and a few other signs. Unicode is far larger, with well over 100,000 characters; its purpose is to provide one character set that covers all of the world's alphabets, characters, and scripts. In Python 3 every string is a Unicode string, so we can put almost any character into a string and have it print correctly. This is great news for non-English text, because ASCII doesn't permit many types of characters. In fact, ASCII allows only 128 characters! (0x00 .. 0x7F, generally the characters and symbols which you see on an American-English keyboard plus control (non-printing) characters.) Here are some examples using different languages, some with non-Latin characters.

>>> print("Witaj świecie!")
Witaj świecie!
>>> print("Hola mundo!")
Hola mundo!
>>> print("Привет мир!")
Привет мир!
>>> print("שלום עולם!")
שלום עולם!


A brief review of ASCII


Each ASCII character fits into one byte, specifically the least significant 7 bits of the byte. Therefore each ASCII character has the value 0x00 .. 0x7F. The ord() and chr() built-in functions do the conversion:

>>> chr(65); chr(0x41)
'A'
'A'
>>> ord('a'); hex(ord('a'))
97
'0x61'
>>>

The printable characters have values 0x20 .. 0x7E. The remaining characters are control or non-printing characters.

The numbers '0' .. '9' have values 0x30 .. 0x39.

The letters 'A' .. 'Z' have values 0x41 .. 0x5A.

The letters 'a' .. 'z' have values 0x61 .. 0x7A.

Control characters have values 0x00 .. 0x1F and 0x7F.

Control character 0x01 = 'A' & 0x1F, called '^A' read as 'control A'. Control character 0x02 is called '^B', control character 0x03 '^C', etc. Control character with value 0x0 is '^@' or NULL. To be specific:

>>> chr(0x01) ==  chr( ord('a') & 0x1F ) ==  chr( ord('A') & 0x1F )
True
>>> chr(0x09) ==  chr( ord('i') & 0x1F ) ==  chr( ord('I') & 0x1F ) == '\t'
True
>>> chr(0x0C) ==  chr( ord('l') & 0x1F ) ==  chr( ord('L') & 0x1F ) == '\f'
True
>>>
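
The masking rule can be written as a small function. The name ctrl is hypothetical and used only for this sketch:

```python
def ctrl(letter):
    """Return the control character for a letter, e.g. ctrl('G') is BEL."""
    return chr(ord(letter) & 0x1F)   # keep only the low five bits

print(repr(ctrl('G')))   # '\x07', the bell
print(repr(ctrl('J')))   # '\n', line feed
print(repr(ctrl('I')))   # '\t', horizontal tab
```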

Named control characters are:

\a Bell            = ^G = 0x7
\b Backspace       = ^H = 0x8
\f Form Feed       = ^L = 0xc
\n Line Feed       = ^J = 0xa

\r Carriage Return = ^M = 0xd
\t Horizontal Tab  = ^I = 0x9
\v Vertical Tab    = ^K = 0xb

See the table of Escape Sequences under "Escape Characters" above. To make sense of the named control characters, think of the 1970s and a 'modern' typewriter of the period. A proficient typist who did not look at the keyboard needed an audible alarm when the carriage approached the end of the line, hence "Bell". A Line Feed advanced the paper one line. A Form Feed advanced the paper to the next page. Perhaps the only control characters relevant today are Tab, Backspace and Return (which is interpreted as '^J').


The following Python code prints characters 0x00 .. 0x7F, the ASCII character set.

for num in range(0x80):               # 0x00 .. 0x7F inclusive
    print(hex(num), '=', chr(num))

For example 0x23 = '#'; 0x24 = '$'; 0x25 = '%'.

If you send the output to a file and open the file with emacs, you will see that control character '^A' is one byte. If a character is not recognized, emacs prints it using octal notation:

>>> chr(0x80) == '\200'
True
>>>

Modern character sets


In times past when hardware was expensive and the English speaking world dominated computing, one character occupied seven bits (0-6) of one byte. Then bit 7 was used to provide 128 extra characters.

For example 0xA3 = '£' (English pound); 0xB1 = '±'; 0xBD = '½'.


Today, with cheap computers a world-wide phenomenon, the Unicode number of a character (its code point) may need more than 16 bits. Python's longest escape sequence, \U, provides 32 bits (eight hexadecimal digits) for it, although only values up to 0x10FFFF are valid. Hence:

'\xhh' = chr(0xhh) = '\u00hh' = '\U000000hh'
chr(0xhhhh) = '\uhhhh' = '\U0000hhhh'
chr(0x1hhhh) = '\U0001hhhh'

where h is a hexadecimal digit. Examples of modern characters in action:

>>> '\x10be' == '\x10' + 'be'
True
>>> 
>>> '\u004110be' == '\u0041' + '10be' == 'A10be'
True
>>> 
>>> '\x61' == 'a' == '\u0061' == '\U00000061' == chr(0x61)
True
>>> 
>>> ' \U000010be ' == ' Ⴞ ' == ' \u10be ' == ' ' + chr(0x10BE) + ' '
True
>>>
>>> ' \u004110be '
' A10be '
>>> ' \U004110BE '
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 1-10: illegal Unicode character
>>> 
>>> '\u0041\u0042\u0043\u0044'
'ABCD'
>>>
>>> '\U0041\U0042\U0043\U0044'
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-5: truncated \UXXXXXXXX escape
>>>
>>> '\U00000041\U00000042\U00000043\U00000044'
'ABCD'
>>>
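
The \N{name} escape from the table of escape sequences is a more readable alternative to numeric escapes. The following checks are a sketch using two well-known Unicode character names:

```python
delta = '\N{GREEK CAPITAL LETTER DELTA}'
print(delta, hex(ord(delta)))

# The named escape, the numeric escapes, and chr() all produce the same character.
assert delta == '\u0394' == '\U00000394' == chr(0x394)
assert '\N{LATIN SMALL LETTER A}' == 'a' == '\x61'
```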

String as bytes object


The method str.encode() returns an encoded version of the string as a bytes object. The method bytes.decode() returns a string decoded from the given bytes.

>>> a = ('deヿfㇿg').encode() ; a ; isinstance(a, bytes) ; len(a)
b'de\xe3\x83\xbff\xe3\x87\xbfg'
True
10 # bytes
>>> 
>>> a ==  ( b'de'      # Ascii characters 'de'
... + b'\xe3\x83\xbf'  # Character 'ヿ'
... + b'f'             # Ascii character 'f'
... + b'\xe3\x87\xbf'  # Character 'ㇿ'
... + b'g')            # Ascii character 'g'
True
>>> b'\xe3\x83\xbf'.decode()
'ヿ' # 3 bytes for this character.
>>> b'\xe3\x87\xbf'.decode()
'ㇿ' # 3 bytes for this character.
>>> 
>>> s = a.decode() ; s ; isinstance(s, str) ; len(s)
'deヿfㇿg'
True
6 # characters
>>> 
>>> a = ('deヿfㇿg').encode(encoding='ascii') ; a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u30ff' in position 2: ordinal not in range(128)
>>>
>>> a = ('defg').encode(encoding='ascii') ; a
b'defg'
>>> 
>>> a.decode()
'defg'
>>> 
>>> s = ' \u0391 \u03b1 \u0392 \u03b2 \u0393 \u03b3 \u0394 \u03b4 '; isinstance(s, str)
True
>>> 
>>> s == ' Α α Β β Γ γ Δ δ ' # Greek
True
>>> a = (s).encode() ; a ; isinstance(a, bytes)
b' \xce\x91 \xce\xb1 \xce\x92 \xce\xb2 \xce\x93 \xce\xb3 \xce\x94 \xce\xb4 ' # Greek encoded
True
>>> a.decode()
' Α α Β β Γ γ Δ δ ' # Greek decoded
>>>
>>> s = ' \u0426 \u0446 \u0427 \u0447 \u0428 \u0448 \u0429 \u0449 ' ; s
' Ц ц Ч ч Ш ш Щ щ ' # Cyrillic
>>> a = (s).encode() ; a
b' \xd0\xa6 \xd1\x86 \xd0\xa7 \xd1\x87 \xd0\xa8 \xd1\x88 \xd0\xa9 \xd1\x89 ' # Cyrillic encoded
>>> a.decode()
' Ц ц Ч ч Ш ш Щ щ ' # Cyrillic decoded
>>>
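
The number of bytes per character depends on the character and on the encoding. The sketch below compares UTF-8 lengths for a few arbitrarily chosen characters and shows the errors parameter of str.encode(), which controls what happens when a character cannot be encoded:

```python
for ch in ['a', '\u00e9', '\u0394', '\u30ff', '\U00010022']:
    encoded = ch.encode('utf-8')
    # Print the character, its code point, and the size and content of its UTF-8 form.
    print(repr(ch), hex(ord(ch)), len(encoded), encoded)

text = 'de\u30fff\u31ffg'
print(text.encode('ascii', errors='replace'))   # unencodable characters become b'?'
print(text.encode('ascii', errors='ignore'))    # unencodable characters are dropped
```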

Common String methods


In the lesson above we've seen some methods in action: str.format(), str.encode(), bytes.decode(). Many more methods are available for processing strings. Some are shown below:

Information about string

>>> '123456'.isnumeric()
True
>>> 
>>> '   abcd efg abc '.isalnum()
False # not alpha-numeric
>>>
>>> 'abcd123efg789abc'.isalnum()
True # all alpha-numeric
>>>
>>> 'abcdABCefgZYXabc'.isalpha()
True # all alphabetic
>>>
>>> '01234567'.isdecimal()
True # all decimal
>>>
>>> '           '.isspace()
True
>>> ''.isspace()
False
>>>
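
These predicates are often used to check input before converting it. A small sketch; the function name parse_age is hypothetical:

```python
def parse_age(text):
    """Return text as an int if it consists only of decimal digits, else None."""
    cleaned = text.strip()
    if cleaned.isdecimal():
        return int(cleaned)
    return None

print(parse_age(' 42 '))    # 42
print(parse_age('4.2'))     # None
print(parse_age(''))        # None
```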

Substring within string


Existence of substring

>>> 'abc' in ' abc 123 '
True
>>> 'abcd' in ' abc 123 '
False
>>>

Position of substring

>>> '   abcd efg abc '.find('c')
5 # found 'c' in position 5
>>>
>>> '   abcd efg abc '.find('x')
-1 # did not find 'x'
>>>
>>> '   abcd efg abc '.find('c',7)
14 # found 'c' at or after position 7 at position 14.
>>>
>>> '   abcd efg abc '.find('c',7,12)
-1 # did not find 'c' in range specified
>>>
>>> '   abcd efg abc '.index('c',7)
14 # same as 'find()' above unless not found.
>>>
>>> '   abcd efg abc '.index('c',7,12)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: substring not found
>>>
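
The optional start argument makes it possible to locate every occurrence of a substring by searching again just past the previous match. A minimal sketch (the function name find_all is an assumption for illustration; it reports overlapping matches too):

```python
def find_all(text, sub):
    """Return a list with the position of every occurrence of sub in text."""
    positions = []
    start = text.find(sub)
    while start != -1:
        positions.append(start)
        start = text.find(sub, start + 1)   # continue just after the last match
    return positions

print(find_all('   abcd efg abc ', 'c'))   # [5, 14]
print(find_all('   abcd efg abc ', 'x'))   # []
```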

Formatting the string

>>> '      1234          '.strip()
'1234' # remove leading and trailing whitespace
>>>
>>> '\t2345\t56\t678'.expandtabs()
'        2345    56      678' # default is 8.
>>>
>>> '\t2345\t56\t678'.expandtabs(12)
'            2345        56          678'
>>>
>>> 'abc def'.zfill(30)
'00000000000000000000000abc def' # left fill with zeroes.
>>>
>>> 'abcd'.center(30) 
'             abcd             ' # center given string in a length of 30
>>>
>>> ' abcd '.center(30,'+') 
'++++++++++++ abcd ++++++++++++' # and fill with '+'
>>>
>>> 'value = {}'.format(6)
'value = 6'
>>>

Splitting the string

>>> '      1234     456  23456     '.split()
['1234', '456', '23456'] # retain non-whitespace in a list
>>> 
>>> '     1234.567e-45     '.partition('e')
('     1234.567', 'e', '-45     ') # split into 3 around string supplied
>>> 
>>> '\n'.join(['abc\n','def','012\n\n']).splitlines(keepends=True)
['abc\n', '\n', 'def\n', '012\n', '\n']
>>>
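
str.partition() is handy for splitting a line at the first separator only. A sketch with made-up configuration text (the names config and settings, and the text itself, are invented for illustration):

```python
config = 'name = Fred\nmotto = a = b\n\ncity=Paris'

settings = {}
for line in config.splitlines():
    key, sep, value = line.partition('=')   # split at the first '=' only
    if sep:                                  # sep is '' when the line has no '='
        settings[key.strip()] = value.strip()

print(settings)
```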

Miscellaneous

>>> '   The  quick brown fox ....   '.replace('quick','lazy')
'   The  lazy brown fox ....   '
>>>
>>> '_'.join(['abc','def','012'])
'abc_def_012'
>>>

Methods may be chained

>>> '     1234.567E-45     '.strip().lower().partition('e')
('1234.567', 'e', '-45')
>>> 
>>> '   The  quick brown fox ....   '.replace('quick','lazy').upper().split()
['THE', 'LAZY', 'BROWN', 'FOX', '....'] 
>>>

Methods recognize international text

>>> 'Βικιεπιστήμιο'.isupper()
False
>>> 'Βικιεπιστήμιο'.upper()
'ΒΙΚΙΕΠΙΣΤΉΜΙΟ'
>>> 'Βικιεπιστήμιο'.lower()
'βικιεπιστήμιο'
>>> 'Βικιεπιστήμιο'.isalpha()
True
>>>

More operations on strings


At this point we're familiar with strings as perhaps a single line. But strings can be much more than a single line. The whole of "War and Peace" could be a single string. In this part of the lesson we'll look at "paragraphs", where a paragraph contains one or more non-blank lines. Consider this string:

>>> a = '\n\n     line 1   \n        line 2     \n    line 3  \n\n'
>>> print (a)


     line 1   
        line 2     
    line 3  


>>>

This string contains a paragraph surrounded by messy white space. We'll improve the appearance of the string by removing insignificant white space. First:

>>> a = a.strip() ; a
'line 1   \n        line 2     \n    line 3'
>>>
>>> print (a)
line 1   
        line 2     
    line 3

Remove insignificant white space around each line:

>>> L1 = a.splitlines() ; L1
['line 1   ', '        line 2     ', '    line 3'] # a list containing 3 lines.
>>> for p in range(len(L1)-1, -1, -1) : L1[p] = L1[p].strip()
... 
>>> L1
['line 1', 'line 2', 'line 3'] # each line has had beginning and ending white space removed, including '\n'.
>>>
>>> s = ''.join(L1) ; s # lines joined with ''.
'line 1line 2line 3'
>>>
>>> print (s)
line 1line 2line 3
>>>
>>> s = '\n'.join(L1) ; s # lines joined with '\n'.
'line 1\nline 2\nline 3'
>>> print (s)
line 1
line 2
line 3 # a clean paragraph

The next string is a "page" where a page contains two or more paragraphs.

>>> a = '''
... 
... 
...     paragraph 1, line 1
...   paragraph 1, line 2
... 
... 
... 
...     paragraph 2, line 1
...     paragraph 2, line 2
...            paragraph 2, line 3         
... 
... 
...     paragraph 3, line 1
... 
... 
... '''
>>>

With this page we'll do the same as above, that is, remove insignificant white space.

>>> b = a.strip() ; print (b)
paragraph 1, line 1
  paragraph 1, line 2



    paragraph 2, line 1
    paragraph 2, line 2
           paragraph 2, line 3         


    paragraph 3, line 1
>>>

Remove white space around each line (including blank lines):

>>> L1 = b.splitlines() 
>>>
>>> for p in range(len(L1)-1, -1, -1) : L1[p] = L1[p].strip()
... 
>>> print ('\n'.join(L1))
paragraph 1, line 1
paragraph 1, line 2



paragraph 2, line 1
paragraph 2, line 2
paragraph 2, line 3


paragraph 3, line 1
>>>

Remove extraneous lines between paragraphs:

>>> for p in range(len(L1)-1, 0, -1) : # terminator here is 0.
...     if len( L1[p] ) == len( L1[p-1] ) == 0 : del L1[p]
... 
>>> print ('\n'.join(L1))
paragraph 1, line 1
paragraph 1, line 2

paragraph 2, line 1
paragraph 2, line 2
paragraph 2, line 3

paragraph 3, line 1
>>>
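
The three steps above can be combined into one function. This is only a sketch; the name tidy_page is hypothetical, and it builds a new list instead of deleting items in place:

```python
def tidy_page(text):
    """Strip each line and collapse runs of blank lines into a single blank line."""
    lines = [line.strip() for line in text.strip().splitlines()]
    result = []
    for line in lines:
        if line == '' and result and result[-1] == '':
            continue                 # skip a blank line that follows a blank line
        result.append(line)
    return '\n'.join(result)

page = '\n\n   p1, line 1\n p1, line 2\n\n\n\n  p2, line 1   \n\n'
print(tidy_page(page))
```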

Complicated strings simplified


If you are working with strings containing complicated sequences of escaped characters, or if the whole concept of escaped characters is difficult, you might try:

>>> back_slash = r'\ '[:1] ; back_slash ; len(back_slash)    # raw string avoids the invalid escape '\ '
'\\'
1
>>> 
>>> new_line = """
... """ # Between """ and """ there is exactly one return.
>>> new_line ; len(new_line)
'\n'
1
>>> 
>>> tab = '     ' ; tab ; len(tab) # Between ' and ' there is exactly one tab.
'\t'
1
>>> 
>>> hex(ord(back_slash))
'0x5c'
>>> hex(ord(new_line))
'0xa'
>>> hex(ord(tab))
'0x9'
>>>

Then you can build long strings:

>>> 'abc' + back_slash*7 + '123' + (back_slash*5 + new_line*3)*5 + 'xyz' 
'abc\\\\\\\\\\\\\\123\\\\\\\\\\\n\n\n\\\\\\\\\\\n\n\n\\\\\\\\\\\n\n\n\\\\\\\\\\\n\n\n\\\\\\\\\\\n\n\nxyz'
>>>

You can put the back_slash at the end of a string:

>>> a = 'abc' + back_slash ; a ; len (a)
'abc\\'
4
>>>

If you have a long string, splitting it might help to reveal significant parts:

>>> a = 'abc' + back_slash*7 + '123' + (back_slash*5 + new_line*3)*5 + 'xyz'
>>> a.split(back_slash)
['abc', '', '', '', '', '', '', '123', '', '', '', '', '\n\n\n', '', '', '', '', '\n\n\n', '', '', '', '', '\n\n\n', '', '', '', '', '\n\n\n', '', '', '', '', '\n\n\nxyz']
>>> 
>>> a.split(back_slash*5)
['abc', '\\\\123', '\n\n\n', '\n\n\n', '\n\n\n', '\n\n\n', '\n\n\nxyz']
>>> 
>>> a.split(back_slash*5 + new_line*3)
['abc\\\\\\\\\\\\\\123', '', '', '', '', 'xyz']
>>>

Assignments


Characters with numeric values greater than 0xFFFF are part of Unicode like any others, but not every font or program can display them, so the same character may look different from one place to another. Consider character '\U00010022'. Depending on the fonts installed, it may display on the interactive Python command line as a little elephant with three legs, within emacs and on the Unix command line as a Greek delta (almost), and in Wikiversity as, well, that depends.

>>> '\U00010022'
'𐀢' # Copied from Python in interactive mode.
>>> '\U00010022' == chr(0x10022) == '𐀢'
True
>>> 
>>> '\U+10022'
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \UXXXXXXXX escape
>>>

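Within interactive Python, when you move the cursor over the literal character '𐀢', it steps over one character. When you move the cursor over the escape sequence '\U00010022', it steps over ten characters. Copied from emacs, the character appeared as 'ð', which is what the first byte (0xF0) of its UTF-8 encoding looks like when it is misread as a Latin-1 character.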


Experiment with characters with numeric values greater than 0xFFFF and note the apparently inconsistent results. If you are producing text with modern international characters, ensure that the character/s displayed are what you want.
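
The standard unicodedata module can help with this experiment: it reports the official Unicode name of a character independently of how (or whether) a font draws it. A short sketch, using three characters that appear earlier in this lesson:

```python
import unicodedata

for ch in ['A', '\u10be', '\U00010022']:
    # The second argument is returned if the character has no name in the database.
    name = unicodedata.name(ch, '(no name)')
    # Code point, length in characters, length in UTF-8 bytes, official name.
    print(hex(ord(ch)), len(ch), len(ch.encode('utf-8')), name)
```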


References



4. Python's documentation:

"3.1.2. Strings", "4.7.1. String Methods", "String and Bytes literals", "String literal concatenation", "Formatted string literals", "Format String Syntax", "Standard Encodings", "Why are Python strings immutable?", "Why can’t raw strings (r-strings) end with a backslash?", "Unicode HOWTO"


5. Python's methods:

"bytes.decode()", "str.encode()"


6. Python's built-in functions:

"ord()", "chr()"