Python Concepts/Bytes objects and Bytearrays
Objective
Lesson
One byte is a memory location with a size of 8 bits. A bytes object is an immutable sequence of bytes, conceptually similar to a string. Because each byte must fit into 8 bits, each member of a bytes object is an unsigned int x that satisfies 0 <= x <= 255. The bytes object is important because data written to disk is written as a stream of bytes, and because integers and strings are stored as sequences of bytes. How a sequence of bytes is interpreted or displayed determines whether it represents an integer or a string.
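The constraints described above can be checked directly. The following is a minimal sketch using only built-ins; the variable names are illustrative and not part of the lesson.

```python
# Each member of a bytes object is an int in range(256).
data = bytes([72, 105, 0, 255])
members = list(data)

# A value outside 0..255 is rejected with ValueError.
try:
    bytes([256])
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False

# bytes objects are immutable: item assignment raises TypeError.
try:
    data[0] = 0
    mutable = True
except TypeError:
    mutable = False

print(members, out_of_range_ok, mutable)
```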
bytes objects
Conversion to bytes object
from int
The following code illustrates the process for a positive integer:
def int_to_bytes (input_int) :
    isinstance(input_int, int) or exit (99)
    (input_int >= 0) or exit (98)
    if (input_int == 0) : return bytes([0])
    L1 = []
    num_bits = input_int.bit_length()
    while input_int :
        L1[0:0] = [(input_int & 0xFF)]
        input_int >>= 8
    if (num_bits % 8) == 0 :
        L1[0:0] = [0]
    return bytes(L1)

for i1 in [0, 0x8_9a_bc_de, 0xa8_9a_bc_de] :
    b1 = int_to_bytes (i1)
    print ('''i1 = {}, b1 = {}'''.format(hex(i1), b1))

i1 = 0x0, b1 = b'\x00'
i1 = 0x89abcde, b1 = b'\x08\x9a\xbc\xde'
i1 = 0xa89abcde, b1 = b'\x00\xa8\x9a\xbc\xde'
Method int.to_bytes(length, ....)
Method int.to_bytes(length, byteorder, *, signed=False) returns a bytes object representing an integer.
>>> 0x12_84.to_bytes(2, 'big', signed=True)
b'\x12\x84'
>>> (-0xE2_04).to_bytes(3, 'big', signed=True) # Note the parentheses: (-0xE2_04).
b'\xff\x1d\xfc' # 3 bytes for a signed, negative int.
>>>
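The effect of the length and byteorder arguments can be sketched as follows. This is an illustrative example, not part of the original lesson; it relies only on the documented behaviour of int.to_bytes().

```python
n = 0x1284

big = n.to_bytes(2, 'big')        # most significant byte first
little = n.to_bytes(2, 'little')  # least significant byte first
padded = n.to_bytes(4, 'big')     # unused leading bytes are zero

# A length that is too small for the value raises OverflowError.
try:
    (0x1_00_00).to_bytes(2, 'big')
    fits = True
except OverflowError:
    fits = False

print(big, little, padded, fits)
```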
from str
If each character of the string fits into one byte, the process is simple:
>>> s1 = '\011\012\015\016 123 abc \345\346\347' ; s1
'\t\n\r\x0e 123 abc åæç'
>>>
>>> L1 = []
>>> for p in s1 : L1 += [ord(p)]
...
>>> L1
[9, 10, 13, 14, 32, 49, 50, 51, 32, 97, 98, 99, 32, 229, 230, 231]
>>>
>>> bytes( L1 )
b'\t\n\r\x0e 123 abc \xe5\xe6\xe7'
>>>
A list comprehension simplifies the process:
>>> b1 = bytes( [ ord(p) for p in s1 ] ) ; b1
b'\t\n\r\x0e 123 abc \xe5\xe6\xe7'
>>>
The above implements encoding 'Latin-1':
>>> b1a = s1.encode('Latin-1') ; b1a
b'\t\n\r\x0e 123 abc \xe5\xe6\xe7'
>>> b1 == b1a
True
>>>
Method str.encode() returns an encoded version of the string as a bytes object:
>>> s1
'\t\n\r\x0e 123 ሴ abc åæç'
>>> s1.encode() # Each character 'åæç' occupies one byte but is encoded as 2 bytes.
b'\t\n\r\x0e 123 \xe1\x88\xb4 abc \xc3\xa5\xc3\xa6\xc3\xa7' # Character 'ሴ' is encoded as 3 bytes.
>>>
>>> s1.encode() == s1.encode('utf-8') # Encoding 'utf-8' is default.
True
>>>
>>> len(s1.encode('utf-8'))
23
>>> len(s1.encode('utf-16')) # 'utf-16' and 'utf-32' are optional encodings.
38
>>> len(s1.encode('utf-32'))
76
>>>
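When a character cannot be represented in the requested encoding, str.encode() raises UnicodeEncodeError unless an error handler is supplied. The sketch below, with an illustrative sample string, shows the standard 'replace' and 'ignore' handlers.

```python
s = 'abc åæç ሴ'   # 'ሴ' has no 'Latin-1' representation

# The default handler ('strict') raises UnicodeEncodeError.
try:
    s.encode('latin-1')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

replaced = s.encode('latin-1', errors='replace')  # unencodable -> b'?'
ignored = s.encode('latin-1', errors='ignore')    # unencodable dropped

print(strict_ok, replaced, ignored)
```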
from str containing international text
>>> s1 = 'Γ γ Δ δ Ζ ζ Ξ ξ' # Greek
>>> s1.encode()
b'\xce\x93 \xce\xb3 \xce\x94 \xce\xb4 \xce\x96 \xce\xb6 \xce\x9e \xce\xbe' # Greek encoded
>>> len(s1)
15 # 8 Greek characters + 7 spaces
>>> len(s1.encode())
23 # 8*2 + 7
>>>
Each Greek character has a code point that fits into 2 bytes and is encoded as 2 bytes. For example:
>>> hex(ord('ξ'))
'0x3be'
>>> chr(0x3be)
'ξ'
>>> chr(0x3be).encode()
b'\xce\xbe'
>>>
>>> s1 = 'А а Б б В в Г г Щ щ Я я' # Cyrillic
>>> s1.encode()
b'\xd0\x90 \xd0\xb0 \xd0\x91 \xd0\xb1 \xd0\x92 \xd0\xb2 \xd0\x93 \xd0\xb3 \xd0\xa9 \xd1\x89 \xd0\xaf \xd1\x8f' # Cyrillic encoded.
>>>
>>> s1 = 'A Α А' # English 'A', Greek 'Α', Cyrillic 'А'
>>> s1.encode()
b'A \xce\x91 \xd0\x90'
>>>
>>> s1 = 'ウ ィ キ ペ デ ィ ア' # Japanese
>>> s1.encode()
b'\xe3\x82\xa6 \xe3\x82\xa3 \xe3\x82\xad \xe3\x83\x9a \xe3\x83\x87 \xe3\x82\xa3 \xe3\x82\xa2' # Japanese encoded.
>>> len(s1)
13 # 7 Japanese characters + 6 spaces.
>>> len(s1.encode())
27 # 7*3 + 6
>>>
>>> s1
'Ξ ξ ウ ィ Щ щ' # Mixture.
>>> s1.encode()
b'\xce\x9e \xce\xbe \xe3\x82\xa6 \xe3\x82\xa3 \xd0\xa9 \xd1\x89'
>>> len(s1)
11 # characters
>>>
>>> len(s1.encode())
19 # bytes
>>>
from str containing hexadecimal digits
classmethod bytes.fromhex(string) returns a bytes object, decoding the given string object:
>>> bytes.fromhex( ' 12 04 E6 d5 ' )
b'\x12\x04\xe6\xd5' # ASCII whitespace in string is ignored provided that hex digits are grouped as bytes.
>>> bytes.fromhex( ' 1204 E6 d5 ' )
b'\x12\x04\xe6\xd5'
>>> bytes.fromhex( ' 12 04E6d5 ' )
b'\x12\x04\xe6\xd5'
>>> bytes.fromhex( ' 41 42 432044 4546 ' ) # To include a space, add '20' meaning b'\x20' or b' '.
b'ABC DEF'
>>>
>>> bytes.fromhex( ' 12 20 04 20 E6 20 d5 ' )
b'\x12 \x04 \xe6 \xd5'
>>>
>>> bytes.fromhex( ' 12 20 04 20 E6 20 d ' ) # The string must contain exactly two hexadecimal digits per byte.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: non-hexadecimal number found in fromhex() arg at position 24
>>>
classmethod bytes.fromhex(string) can be used to convert from positive int to bytes object:
>>> i1 = 0xF12B4 ; i1
987828
>>> h1 = hex(i1)[2:] ; h1
'f12b4'
>>> h2 = ('0' * ((len(h1))%2)) + h1 ; h2 # Prepend '0' if necessary.
'0f12b4' # Length is even.
>>> b1 = bytes.fromhex( h2 ) ; b1
b'\x0f\x12\xb4'
>>>
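The same conversion can be done without the intermediate hex string. The sketch below computes the length from int.bit_length() and compares the result with the fromhex() route; the names length, b_direct and b_hex are illustrative.

```python
i1 = 0xF12B4

# Number of bytes needed for a non-negative int (at least one byte).
length = max(1, (i1.bit_length() + 7) // 8)
b_direct = i1.to_bytes(length, 'big')

# The hex-string route, padded to an even number of digits.
h = format(i1, 'x')
b_hex = bytes.fromhex(h.zfill(len(h) + len(h) % 2))

print(length, b_direct, b_hex)
```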
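Some technical information about encoding standard 'utf-8'
Strings encoded according to encoding standard 'utf-8' conform to the following table:
Code points            Bytes  Byte 1    Byte 2    Byte 3    Byte 4
U+0000 to U+007F       1      0xxxxxxx
U+0080 to U+07FF       2      110xxxxx  10xxxxxx
U+0800 to U+FFFF       3      1110xxxx  10xxxxxx  10xxxxxx
U+10000 to U+10FFFF    4      11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
The leading bits of each byte are markers; the bits shown as x carry the payload.
Examples of characters encoded with 'utf-8'
>>> bin(ord('Q'))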
'0b101_0001' # 7 bits
>>>
>>> 'Q'.encode()
b'Q' # ASCII fits into 1 byte.
>>>
>>>
>>> bin(ord('Ξ')) # Greek
'0b11_1001_1110'
>>> ord('Ξ').bit_length()
10
>>> 'Ξ'.encode()
b'\xce\x9e'
# 1100_1110_1001_1110 0xCE_9E
# 110_01110,10_011110, markers '110' and '10', payload bits 01110_011110 or 011_1001_1110 0x39E
>>>
>>>
>>> c1 = '先' # Chinese
>>> len(c1)
1
>>> bin(ord(c1))
'0b101_0001_0100_1000'
>>> ord(c1).bit_length()
15
>>> c1.encode()
b'\xe5\x85\x88'
# 1110_0101_1000_0101_1000_1000 0xE5_85_88
# 1110_0101,10_000101,10_001000, 3 markers and payload bits 0101_000101_001000 or 0101_0001_0100_1000 0x5148
>>>
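The byte counts seen in the examples above depend only on the size of the code point. The helper below is a hypothetical function written for illustration; it checks the rule against str.encode() for the sample characters and does not handle surrogate code points, which 'utf-8' refuses to encode.

```python
def utf8_length(code_point):
    """Number of bytes 'utf-8' uses for a code point (surrogates excluded)."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4

samples = ['Q', 'Ξ', '先', chr(0x10006)]
lengths = [utf8_length(ord(c)) for c in samples]
actual = [len(c.encode('utf-8')) for c in samples]
print(lengths, actual)
```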
The following code examines chr(0x10006), encoded in 4 bytes:
c1 = chr(0x10006)
print ('c1 = ', c1, '\nord(c1) = ', hex(ord(c1)), sep='')
c1_encoded = c1.encode()
print ('c1_encoded =', c1_encoded)
print ([hex(p) for p in c1_encoded], '# each byte of c1_encoded')
print (
'''
The marker bits:
c1_encoded[0] & 0b11111_000 == 0b11110_000 : {}
c1_encoded[1] & 0b11_000000 == 0b10_000000 : {}
c1_encoded[2] & 0b11_000000 == 0b10_000000 : {}
c1_encoded[3] & 0b11_000000 == 0b10_000000 : {}
'''.format(
c1_encoded[0] & 0b11111_000 == 0b11110_000 ,
c1_encoded[1] & 0b11_000000 == 0b10_000000 ,
c1_encoded[2] & 0b11_000000 == 0b10_000000 ,
c1_encoded[3] & 0b11_000000 == 0b10_000000
)
)
# Produce the payload bits:
mask0 = 0b111; mask123 = 0b11_1111
payload = [c1_encoded[0] & mask0]
for p in range (1,4) : payload += [c1_encoded[p] & mask123]
print (
'''
The payload bits:
payload[0] = c1_encoded[0] & 0x07 = {} & 0x07 = {}
payload[1] = c1_encoded[1] & 0x3F = {} & 0x3F = {}
payload[2] = c1_encoded[2] & 0x3F = {} & 0x3F = {}
payload[3] = c1_encoded[3] & 0x3F = {} & 0x3F = {}
'''.format(
hex(c1_encoded[0]), hex(payload[0]),
hex(c1_encoded[1]), hex(payload[1]),
hex(c1_encoded[2]), hex(payload[2]),
hex(c1_encoded[3]), hex(payload[3])
)
)
s1 = 'payload[3] + (payload[2] << 6) + (payload[1] << 12) + (payload[0] << 18)'
i1 = eval(s1)
print (
'''
Building c1:
i1 = {} = {}
i1 == ord(c1) : {}
'''.format(
s1, hex(i1),
i1 == ord(c1)
)
)
c1 = 𐀆
ord(c1) = 0x10006
c1_encoded = b'\xf0\x90\x80\x86'
['0xf0', '0x90', '0x80', '0x86'] # each byte of c1_encoded

The marker bits:
c1_encoded[0] & 0b11111_000 == 0b11110_000 : True
c1_encoded[1] & 0b11_000000 == 0b10_000000 : True
c1_encoded[2] & 0b11_000000 == 0b10_000000 : True
c1_encoded[3] & 0b11_000000 == 0b10_000000 : True

The payload bits:
payload[0] = c1_encoded[0] & 0x07 = 0xf0 & 0x07 = 0x0
payload[1] = c1_encoded[1] & 0x3F = 0x90 & 0x3F = 0x10
payload[2] = c1_encoded[2] & 0x3F = 0x80 & 0x3F = 0x0
payload[3] = c1_encoded[3] & 0x3F = 0x86 & 0x3F = 0x6

Building c1:
i1 = payload[3] + (payload[2] << 6) + (payload[1] << 12) + (payload[0] << 18) = 0x10006
i1 == ord(c1) : True

Theoretically 21 payload bits can contain '\U001FFFFF' but the standard stops at '\U0010FFFF':
>>> chr(0x110006)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(0x110000)
>>>
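The marker and payload handling shown above can be generalised to sequences of 1 to 4 bytes. The function below is a hypothetical sketch written for illustration, not how Python's codec is implemented; it does not reject overlong forms or surrogates.

```python
def decode_utf8_char(encoded):
    """Return the code point for the utf-8 bytes of one character."""
    first = encoded[0]
    if first & 0b1000_0000 == 0:                 # 0xxxxxxx
        value, expected = first, 1
    elif first & 0b1110_0000 == 0b1100_0000:     # 110xxxxx
        value, expected = first & 0b0001_1111, 2
    elif first & 0b1111_0000 == 0b1110_0000:     # 1110xxxx
        value, expected = first & 0b0000_1111, 3
    elif first & 0b1111_1000 == 0b1111_0000:     # 11110xxx
        value, expected = first & 0b0000_0111, 4
    else:
        raise ValueError('invalid start byte')
    if len(encoded) != expected:
        raise ValueError('unexpected number of bytes')
    for byte in encoded[1:]:
        if byte & 0b1100_0000 != 0b1000_0000:    # must be 10xxxxxx
            raise ValueError('invalid continuation byte')
        value = (value << 6) | (byte & 0b0011_1111)
    return value

samples = ['Q', 'Ξ', '先', chr(0x10006)]
results = [decode_utf8_char(c.encode('utf-8')) == ord(c) for c in samples]
print(results)
```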
A disadvantage of 'utf-8'
A bytes object produced with encoding 'utf-8' contains the null byte b'\x00' whenever the string contains the null character. This could cause a problem if you are sending a stream of bytes through a filter that interprets b'\x00' as end of data. Standard 'utf-8' never produces b'\xFF'. If your bytes object must not contain b'\x00' after encoding, you could convert the null byte to b'\xFF', then convert b'\xFF' to b'\x00' before decoding:
>>> s1 = 'Ξ ξ a\000bc \000 Я я 建 页' ; s1
'Ξ ξ a\x00bc \x00 Я я 建 页'
>>> b1 = s1.encode() ;b1
b'\xce\x9e \xce\xbe a\x00bc \x00 \xd0\xaf \xd1\x8f \xe5\xbb\xba \xe9\xa1\xb5'
>>> {'Found 0' for p in b1 if p == 0}
{'Found 0'}
>>>
# Convert b'\x00' to b'\xFF'
>>> b2 = bytes([ (p,0xFF)[p == 0] for p in b1 ]) ; b2
b'\xce\x9e \xce\xbe a\xffbc \xff \xd0\xaf \xd1\x8f \xe5\xbb\xba \xe9\xa1\xb5'
>>> {'Found 0' for p in b2 if p == 0}
set()
>>>
>>> # Check conversion from b1 to b2:
>>> {(p == (0,0xFF)) for p in zip(b1,b2) if p[0] != p[1]} # Difference between b1 and b2.
{True}
>>>
>>> b2.decode() # b2 is not standard 'utf-8'.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 7: invalid start byte
>>>
# Before decoding convert b'\xFF' to b'\x00'.
>>> b3 = bytes([ (p,0)[p == 0xFF] for p in b2 ]) ; b3
b'\xce\x9e \xce\xbe a\x00bc \x00 \xd0\xaf \xd1\x8f \xe5\xbb\xba \xe9\xa1\xb5'
>>> s3 = b3.decode() ; s3
'Ξ ξ a\x00bc \x00 Я я 建 页'
>>> s3 == s1
True
>>>
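The same substitution can be written with bytes.replace(), which avoids building a list of ints. This is a minimal sketch with an illustrative sample string; it relies on the fact, stated above, that b'\xFF' never occurs in standard 'utf-8'.

```python
s1 = 'a\x00bc \x00 Я'
b1 = s1.encode('utf-8')

b2 = b1.replace(b'\x00', b'\xff')   # safe to send: no null bytes
b3 = b2.replace(b'\xff', b'\x00')   # restore before decoding

print(0 in b2, b3 == b1, b3.decode('utf-8') == s1)
```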
Conversion from bytes object
to int
The following code illustrates the process for a positive integer:
def bytes_to_int (input_bytes) :
    isinstance(input_bytes, bytes) or exit (99)
    if (len(input_bytes) == 0) : return 0
    (input_bytes[0] < 0x80) or exit (98)
    shift = i1 = 0
    for p in range(1, len(input_bytes)+1) :
        i1 += (input_bytes[-p] << shift)
        shift += 8
    return i1

for b1 in [b'', B"\x00\x00\x00", b'''\x13\xd8''', b"""\x00\xf7\x14"""] :
    i1 = bytes_to_int (b1)
    print ('''b1 = {}, i1 = {}'''.format(b1, hex(i1)))

b1 = b'', i1 = 0x0
b1 = b'\x00\x00\x00', i1 = 0x0
b1 = b'\x13\xd8', i1 = 0x13d8
b1 = b'\x00\xf7\x14', i1 = 0xf714
Class method int.from_bytes(bytes, ....)
Class method int.from_bytes(bytes, byteorder, *, signed=False) simplifies the conversion from bytes to int:
>>> hex( int.from_bytes(b'\x13\xf8', 'big', signed=True) )
'0x13f8'
>>> hex( int.from_bytes(b'\xd3\xf8', 'big', signed=True) )
'-0x2c08'
>>> hex( int.from_bytes(b'\x00\xd3\xf8', 'big', signed=True) )
'0xd3f8'
>>>
The following code ensures that the integer produced after encoding and decoding is the same as the original int:
def int_to_bytes (input) : # input is int.
    num_bits = input.bit_length()
    num_bytes = (num_bits + 7) // 8
    if ((num_bits % 8) == 0) : num_bytes += 1
    return input.to_bytes(num_bytes, byteorder='big', signed=True)

def int_from_bytes (input) : # input is bytes.
    return int.from_bytes(input, byteorder='big', signed=True)
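A quick round-trip check of the two functions above might look like the following sketch. The functions are repeated so that the example is self-contained, with the parameters renamed to value and data (an assumption made here to avoid shadowing the built-in input); the sample values are illustrative.

```python
def int_to_bytes(value):
    num_bits = value.bit_length()
    num_bytes = (num_bits + 7) // 8
    if num_bits % 8 == 0:
        num_bytes += 1          # room for the sign bit
    return value.to_bytes(num_bytes, byteorder='big', signed=True)

def int_from_bytes(data):
    return int.from_bytes(data, byteorder='big', signed=True)

# Values around byte boundaries, positive and negative.
samples = [0, 1, -1, 127, 128, -128, -129, 0xFFFF, -0x10000, 2**64]
round_trip_ok = all(int_from_bytes(int_to_bytes(v)) == v for v in samples)
print(round_trip_ok)
```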
to str
If the bytes object contains only characters that fit into one byte:
>>> b2
b'\t\n\r\x0e 123 abc \xe5\xe6\xe7'
>>> [chr(p) for p in b2]
['\t', '\n', '\r', '\x0e', ' ', '1', '2', '3', ' ', 'a', 'b', 'c', ' ', 'å', 'æ', 'ç']
>>> s2 = ''.join([chr(p) for p in b2]) ; s2
'\t\n\r\x0e 123 abc åæç'
>>>
>>> s2a = b2.decode('Latin-1') ; s2a
'\t\n\r\x0e 123 abc åæç'
>>> s2 == s2a
True
>>>
Method bytes.decode() returns a string decoded from the bytes object:
>>> b1
b'\t\n\r\x0e 123 \xe1\x88\xb4 abc \xc3\xa5\xc3\xa6\xc3\xa7'
>>> s1a = b1.decode() ; s1a
'\t\n\r\x0e 123 ሴ abc åæç'
>>>
It is important to use the correct decoding:
>>> s1b = b1.decode('utf-16') ; s1b
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0xa7 in position 22: truncated data
>>>
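When the encoding of incoming data is uncertain, bytes.decode() accepts an error handler instead of raising. The sketch below uses illustrative data and the standard 'replace' and 'backslashreplace' handlers; which handler suits a given program is a design choice, not something the lesson prescribes.

```python
raw = 'åæç'.encode('latin-1')    # b'\xe5\xe6\xe7' is not valid 'utf-8'

# The default handler ('strict') raises UnicodeDecodeError.
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

replaced = raw.decode('utf-8', errors='replace')           # uses U+FFFD
escaped = raw.decode('utf-8', errors='backslashreplace')   # keeps byte values
correct = raw.decode('latin-1')                            # the right codec

print(strict_ok, ascii(replaced), escaped, correct == 'åæç')
```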
to str containing international text
>>> b1 = b'\xce\x93 \xce\xb3 \xce\x94 \xce\xb4 \xce\x96 \xce\xb6 \xce\x9e \xce\xbe'
>>> s1 = b1.decode() ; s1
'Γ γ Δ δ Ζ ζ Ξ ξ' # Greek decoded
>>>
>>> b1 = b'\xd0\x90 \xd0\xb0 \xd0\x91 \xd0\xb1 \xd0\x92 \xd0\xb2 \xd0\x93 \xd0\xb3 \xd0\xa9 \xd1\x89 \xd0\xaf \xd1\x8f'
>>> s1 = b1.decode() ; s1
'А а Б б В в Г г Щ щ Я я' # Cyrillic decoded
>>>
>>> b1 = b'\xe3\x82\xa6 \xe3\x82\xa3 \xe3\x82\xad \xe3\x83\x9a \xe3\x83\x87 \xe3\x82\xa3 \xe3\x82\xa2'
>>> s1 = b1.decode() ; s1
'ウ ィ キ ペ デ ィ ア' # Japanese decoded.
>>>
>>> len(s1) ; len(b1)
13 # Length of s1 in characters.
27 # Length of s1 in bytes.
>>>
It is possible to produce different results depending on encoding/decoding:
>>> s1 = 'Ð± Ð² Ðµ Ð¶ Ð¿ Ð¢ Ð£ Ð¥ Ð© Ð®' # Every character fits into one byte.
>>> b1 = s1.encode('Latin-1') ; b1
b'\xd0\xb1 \xd0\xb2 \xd0\xb5 \xd0\xb6 \xd0\xbf \xd0\xa2 \xd0\xa3 \xd0\xa5 \xd0\xa9 \xd0\xae'
>>> s1a = b1.decode() ; s1a
'б в е ж п Т У Х Щ Ю' # Cyrillic.
>>>
>>> s3
'ç»´ å¸® å»º é¡µ ä»» è®¡ ä¸ª'
>>> b3 = s3.encode('Latin-1') ; b3
b'\xe7\xbb\xb4 \xe5\xb8\xae \xe5\xbb\xba \xe9\xa1\xb5 \xe4\xbb\xbb \xe8\xae\xa1 \xe4\xb8\xaa'
>>> s3a = b3.decode() ; s3a
'维 帮 建 页 任 计 个' # Chinese
>>>
>>> s3 == s3a
False # As strings s3 and s3a are not equal. However, s3 encoded as 'Latin-1' and s3a encoded as 'utf-8' are the same bytes.
>>>
>>>
to str containing hexadecimal digits
method bytes.hex() returns a string object containing two hexadecimal digits for each byte in the instance.
>>> b'\xf0\xf1\xf2'.hex()
'f0f1f2'
>>> b'\xf0\xf1\xf2'*3.hex()
File "<stdin>", line 1
b'\xf0\xf1\xf2'*3.hex()
^
SyntaxError: invalid syntax
>>> (b'\xf0\xf1\xf2'*3).hex() # Use parentheses to enforce correct syntax.
'f0f1f2f0f1f2f0f1f2'
>>>
>>> (b'\xf0\xf1\xf2'*3)[3].hex() # Individual member is returned as int.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'int' object has no attribute 'hex'
>>> (b'\xf0\xf1\xf2'*3)[3:4].hex() # A slice containing 1 byte.
'f0'
>>>
method bytes.hex() can be used to convert from bytes object to int:
>>> b1 = b'\xF0\x23\x95' ; b1
b'\xf0#\x95'
>>> h1 = b1.hex() ; h1
'f02395'
>>> i1 = int(h1, 16) ; i1
15737749
>>> hex(i1)
'0xf02395'
>>>
Operations with methods on bytes objects
Operations on strings usually require str arguments. Similarly, operations on bytes objects usually require bytes arguments. Occasionally, a suitable int may be substituted. The following methods on bytes are representative of methods described in the reference. All can be used with arbitrary binary data.
bytes.count(sub[, start[, end]])
>>> b'abcd abcd abcd'.count(b' ')
2
>>> b'abcd abcd abcd'.count(b'bcd')
3
>>> b'abcd abcd abcd'[3:].count(b'bcd')
2
>>> b'abcd abcd abcd'[3:10].count(b'bcd')
1
>>> ord('c')
99
>>> b'abcd abcd abcd'.count(99) # chr(99) = 'c'
3
>>>
>>> b'\x0a\x0a\x0A'.count(b'\n')
3
>>> b'\x0a\012\n'.count(10)
3
>>>
Creating and using a translation table:
static bytes.maketrans(from, to) returns a translation table to map a byte in from into the byte at the same position in to; from and to must both be bytes-like objects of the same length.
>>> tt1 = bytes.maketrans(b'', bytes(0)) # A translation table without mapping.
>>> tt2 = bytes([p for p in range(256)]) # Same again.
>>> tt1 == tt2
True
>>> tt1 is tt2
False
>>>
>>> bytes.maketrans(b'aaaaa', b'1234A') == bytes.maketrans(b'a', b'A') # Duplicates are processed without error.
True # Last in sequence wins.
>>>
bytes.translate(table, delete=bytes(0)) returns a copy of the bytes object where all bytes occurring in the optional argument delete are removed, and the remaining bytes have been mapped through the given translation table, which must be a bytes object of length 256, or None when only deletion is required.
To invert the case of all alphabetic characters:
>>> UPPER = b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> lower = b'abcdefghijklmnopqrstuvwxyz'
>>> tt1 = bytes.maketrans(UPPER+lower, lower+UPPER)
>>>
>>> b'The Quick, Brown FOX jumps ....'.translate(tt1)
b'tHE qUICK, bROWN fox JUMPS ....'
>>>
To delete specified bytes:
>>> b'The Quick, Brown FOX jumps ....'.translate(None, b'., oO')
b'TheQuickBrwnFXjumps'
>>>
Deletion is completed before translation:
>>> b'The Quick, Brown FOX jumps ....'.translate(tt1, b'., o')
b'tHEqUICKbRWNfoxJUMPS'
>>>
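Translation and deletion can be combined in one pass. The following sketch, with illustrative data, normalises a run of hexadecimal digits by removing whitespace and converting the letters to upper case.

```python
# Map lower-case hex letters to upper case; other bytes map to themselves.
table = bytes.maketrans(b'abcdef', b'ABCDEF')

raw = b' 12 04\te6 d5\n'
cleaned = raw.translate(table, b' \t\n')   # delete whitespace, then map

print(cleaned, len(table))
```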
bytes objects and disk files
Data is written to disk as a stream of bytes. Therefore the bytes object is ideal for this purpose. The following code writes a stream of bytes to disk and then reads the data on disk as text. Python performs the decoding when reading text; because the default encoding depends on the platform, 'utf-8' is requested explicitly.
$ cat test.py
b1 = b'English (cur | prev)'
b2 = b'Chinese \xef\xbc\x88\xe5\xbd\x93\xe5\x89\x8d | \xe5\x85\x88\xe5\x89\x8d\xef\xbc\x89'
b3 = b'Japanese (\xe6\x9c\x80\xe6\x96\xb0 | \xe5\x89\x8d)'
b4 = b'Greek (\xcf\x80\xce\xb1\xcf\x81\xcf\x8c\xce\xbd | \xcf\x80\xcf\x81\xce\xbf\xce\xb7\xce\xb3.)'
b5 = b'Russian (\xd1\x82\xd0\xb5\xd0\xba\xd1\x83\xd1\x89. | \xd0\xbf\xd1\x80\xd0\xb5\xd0\xb4.)'
number_written = 0
try:
    with open("test.bin", "wb") as ofile: # Write bytes to disk.
        for bx in b1,b2,b3,b4,b5 :
            nw = ofile.write(bx + b'\n') # in bytes
            number_written += nw
        end_of_file = ofile.tell() # in bytes
except:
    print ('Error1 detected in "with" statement.')
(number_written == end_of_file) or exit (99)
try:
    with open("test.bin", "rt", encoding="utf-8") as ifile: # Read characters from disk.
        for line in ifile :
            print (len(line), line, end='')
except:
    print ('Error2 detected in "with" statement.')
exit (0)
$ python3.6 test.py >test.sout 2>test.serr
$
$ od -t x1 test.bin # The contents of disk file test.bin (edited for clarity):
0000000 E n g l i s h ' ' ( c u r ' ' | ' ' p # English
0000016 r e v )'\n' C h i n e s e ' ' ef bc 88 # Chinese
0000032 e5 bd 93 e5 89 8d ' ' | 20 e5 85 88 e5 89 8d ef
0000048 bc 89'\n' J a p a n e s e ' ' ( e6 9c 80 # Japanese
0000064 e6 96 b0 ' ' | ' ' e5 89 8d )'\n' G r e e k # Greek
0000080 ' ' ( cf 80 ce b1 cf 81 cf 8c ce bd ' ' | ' ' cf
0000096 80 cf 81 ce bf ce b7 ce b3 . )'\n' R u s s # Russian
0000112 i a n ' ' ( d1 82 d0 b5 d0 ba d1 83 d1 89 .
0000128 ' ' | ' ' d0 bf d1 80 d0 b5 d0 b4 . )'\n'
0000142 # Values in left hand column are decimal.
$
$ ls -la test.bin
-rw-r--r-- 1 user staff 142 Nov 12 08:46 test.bin
$
$ cat test.bin
English (cur | prev)
Chinese (当前 | 先前)
Japanese (最新 | 前)
Greek (παρόν | προηγ.)
Russian (текущ. | пред.)
$
$ cat test.sout
21 English (cur | prev)
18 Chinese (当前 | 先前) # 18 characters including '\n'
18 Japanese (最新 | 前)
23 Greek (παρόν | προηγ.)
25 Russian (текущ. | пред.)
$
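The difference between the length in bytes and the length in characters can also be checked without leaving files behind. The following sketch uses the standard tempfile module and an illustrative one-line payload; it passes encoding='utf-8' explicitly rather than relying on the platform default.

```python
import os
import tempfile

payload = 'Greek (παρόν | προηγ.)\n'.encode('utf-8')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.bin')
    with open(path, 'wb') as ofile:                    # write bytes
        written = ofile.write(payload)
    with open(path, 'rt', encoding='utf-8') as ifile:  # read characters
        text = ifile.read()
    with open(path, 'rb') as ifile:                    # read bytes
        raw = ifile.read()

print(written, len(raw), len(text))
```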
bytearrays
The bytearray is the mutable counterpart of the bytes object: a sequence of ints in range(256) that can be changed in place.
Conversion to bytearray
from int
The following code illustrates the process for a positive integer:
>>> I1 = 0x18b9e4
>>> bytearray ( [ (I1 >> 16) & 255, (I1 >> 8) & 255, I1 & 255 ] )
bytearray(b'\x18\xb9\xe4')
>>>
Method int.to_bytes(length, ....)
Method int.to_bytes(length, byteorder, *, signed=False) returns a bytes object that can be passed to bytearray():
>>> bytearray ( 0x12_84.to_bytes(2, 'big', signed=True) )
bytearray(b'\x12\x84')
>>>
>>> bytearray ( (-0xE2_04).to_bytes(3, 'big', signed=True) ) # Note the parentheses: (-0xE2_04).
bytearray(b'\xff\x1d\xfc') # 3 bytes for a signed, negative int.
>>>
from str
If each character of the string fits into one byte, the process is simple:
>>> s1 = '\011\012\015\016 123 abc \345\346\347' ; s1
'\t\n\r\x0e 123 abc åæç'
>>>
>>> ba1 = bytearray( [ ord(p) for p in s1 ] ) ; ba1
bytearray(b'\t\n\r\x0e 123 abc \xe5\xe6\xe7')
>>>
>>> s1.encode('Latin-1') == ba1 # Comparing bytes object and bytearray.
True
>>>
Method str.encode() returns a bytes object that can be passed to bytearray():
>>> s1 = '\t\n\r\x0e 123 ሴ abc åæç' ; s1
'\t\n\r\x0e 123 ሴ abc åæç'
>>>
>>> bytearray ( s1.encode() ) # Each character 'åæç' occupies one byte but is encoded as 2 bytes.
bytearray(b'\t\n\r\x0e 123 \xe1\x88\xb4 abc \xc3\xa5\xc3\xa6\xc3\xa7') # Character 'ሴ' is encoded as 3 bytes.
>>>
>>> bytearray ( s1.encode() ) == s1.encode('utf-8')
True
>>>
from str containing international text
>>> s1
'Ξ ξ ウ ィ Щ щ' # Mixture of Greek, Japanese, Russian.
>>>
>>> ba1 = bytearray ( s1.encode() ) ; ba1
bytearray(b'\xce\x9e \xce\xbe \xe3\x82\xa6 \xe3\x82\xa3 \xd0\xa9 \xd1\x89')
>>>
>>> len(s1)
11 # characters
>>>
>>> len(ba1)
19 # bytes
>>>
>>> ba1 == s1.encode('utf-8')
True
>>>
from str containing hexadecimal digits
classmethod bytearray.fromhex(string) returns a bytearray object, decoding the given string object:
>>> bytearray.fromhex( ' 12 04 E6 d5 ' )
bytearray(b'\x12\x04\xe6\xd5') # ASCII whitespace in string is ignored provided that hex digits are grouped as bytes.
>>>
>>> bytearray.fromhex( ' 12 20 04 20 E6 20 d5 ' )
bytearray(b'\x12 \x04 \xe6 \xd5')
>>>
>>> bytearray.fromhex( ' 12 20 04 20 E6 20 d ' ) # The string must contain exactly two hexadecimal digits per byte.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: non-hexadecimal number found in fromhex() arg at position 24
>>>
classmethod bytearray.fromhex(string) can be used to convert from positive int to bytearray:
>>> i1 = 0xF12B4 ; i1
987828
>>> h1 = hex(i1)[2:] ; h1
'f12b4'
>>> h2 = ('0' * ((len(h1)) & 1)) + h1 ; h2 # Prepend '0' if necessary.
'0f12b4' # Length is even.
>>> ba1 = bytearray.fromhex( h2 ) ; ba1
bytearray(b'\x0f\x12\xb4')
>>>
Conversion from bytearray
to int
The following code illustrates the process for a positive integer:
>>> ba1
bytearray(b'\x0f\x12\xb4')
>>> i1 = (ba1[0] << 16) + (ba1[1] << 8) + ba1[2] ; hex(i1)
'0xf12b4'
>>>
Class method int.from_bytes(bytes, ....)
Class method int.from_bytes(bytes, byteorder, *, signed=False) simplifies the conversion from bytearray to int:
>>> hex( int.from_bytes(bytearray(b'\xd3\xf8'), 'big', signed=True) )
'-0x2c08'
>>> hex( int.from_bytes(bytearray(b'\x00\xd3\xf8'), 'big', signed=True) )
'0xd3f8'
>>>
to str
If the bytearray contains only characters that fit into one byte:
>>> ba2 = bytearray(b'\t\n\r\x0e 123 abc \xe5\xe6\xe7') ; ba2
bytearray(b'\t\n\r\x0e 123 abc \xe5\xe6\xe7')
>>> s2 = ''.join([chr(p) for p in ba2]) ; s2
'\t\n\r\x0e 123 abc åæç'
>>> s2a = ba2.decode('Latin-1') ; s2a
'\t\n\r\x0e 123 abc åæç'
>>> s2 == s2a
True
>>>
Method bytearray.decode() returns a string decoded from the bytearray:
>>> ba1 = bytearray(b'\t\n\r\x0e 123 \xe1\x88\xb4 abc \xc3\xa5\xc3\xa6\xc3\xa7') ; ba1
bytearray(b'\t\n\r\x0e 123 \xe1\x88\xb4 abc \xc3\xa5\xc3\xa6\xc3\xa7')
>>> s1 = ba1.decode() ; s1
'\t\n\r\x0e 123 ሴ abc åæç'
>>>
It is important to use the correct decoding:
>>> s1b = ba1.decode('utf-16') ; s1b
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0xa7 in position 22: truncated data
>>>
to str containing international text
>>> ba1 = bytearray(b'\xce\x93 \xce\xb3 \xce\x94 \xce\xb4 \xce\x96 \xce\xb6')
>>> s1 = ba1.decode() ; s1
'Γ γ Δ δ Ζ ζ' # Greek decoded.
>>>
>>> ba2 = bytearray(b'\xd0\x90 \xd0\xb0 \xd0\x91 \xd0\xb1 \xd0\x92 \xd0\xb2 \xd0\x93 \xd0\xb3 \xd0\xa9 \xd1\x89')
>>> s2 = ba2.decode() ; s2
'А а Б б В в Г г Щ щ' # Cyrillic decoded.
>>>
>>> ba3 = bytearray(b'\xe3\x82\xa6 \xe3\x82\xa3 \xe3\x82\xad \xe3\x83\x9a \xe3\x83\x87 \xe3\x82\xa3')
>>> s3 = ba3.decode() ; s3
'ウ ィ キ ペ デ ィ' # Japanese decoded.
>>> len(s3)
11 # Length of s3 in characters.
>>> len(ba3)
23 # Length of s3 in bytes.
>>>
to str containing hexadecimal digits
method bytearray.hex() returns a string object containing two hexadecimal digits for each byte in the instance.
>>> bytearray(b'\xf0\xf1\xf2').hex()
'f0f1f2'
>>> (bytearray(b'\xf0\xf1\xf2') * 2).hex()
'f0f1f2f0f1f2'
>>> (bytearray(b'\xf0\xf1\xf2') * 2)[1:4].hex()
'f1f2f0'
>>>
method bytearray.hex() can be used to convert from bytearray to int:
>>> ba1 = bytearray(b'\xF0\x23\x95') ; ba1
bytearray(b'\xf0#\x95')
>>> h1 = ba1.hex() ; h1
'f02395'
>>> i1 = int(h1, 16) ; i1 ; hex(i1)
15737749
'0xf02395'
>>>
Operations with methods on bytearrays
The following methods on bytearrays are representative of methods described in the reference.
All can be used with arbitrary binary data.
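As an illustration of what mutability adds, the sketch below applies a few standard in-place operations to a bytearray. The selection of operations is an assumption made for this example; the original list of methods is not reproduced here.

```python
ba = bytearray(b'abc')

ba[0] = ord('A')      # item assignment takes an int
ba.append(0x64)       # append one byte, given as an int
ba.extend(b'ef')      # extend with a bytes-like object
ba[1:3] = b'XYZ'      # slice assignment may change the length
del ba[-1]            # delete the last byte

frozen = bytes(ba)    # immutable copy of the current contents
print(ba, frozen)
```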
Assignments
Further Reading or Review
References
1. Python's documentation: "2.4.1. String and Bytes literals," "4.8. Binary Sequence Types — bytes, bytearray, ....," "4.8.3. Bytes and Bytearray Operations," "4.4.2. Additional Methods on Integer Types," "Unicode HOWTO," "7.2.2. Encodings and Unicode"
"bytes.decode()," "str.encode()"
"bytes()," "bytearray()," "ord()," "chr()," "open(file, ....)"