How to Convert String to Bytes in Python

In Python, use the .encode() method on a string to convert it into bytes, optionally specifying the desired encoding (UTF-8 by default).

Jun 2024 · 7 min read

Strings represent human-readable text and are one of the most basic and important data types in computer programming. However, each character in a text string is represented by one or more bytes of binary data. Applications such as input and output operations and data transmission require strings to be converted to bytes using a specific encoding.

This tutorial explores the techniques of converting strings to bytes in Python. If you're interested in the reverse operation, check out my tutorial on how to convert bytes to strings in Python.

Before getting into details, let’s start with a short answer for those of you in a hurry.

Short Answer: How to Convert String to Bytes in Python

Python makes it straightforward to convert a string into bytes using the built-in .encode() method:

my_string = "Hello, world!"
# The encoding argument is optional (UTF-8 is the default)
bytes_representation = my_string.encode(encoding="utf-8")

print(bytes_representation)
# Output: b'Hello, world!'

The .encode() method returns a new bytes object representing the encoded version of the original string. By default, it uses UTF-8 encoding, but you can specify other encodings like 'ascii' or 'latin-1' if needed.

Let's explore this conversion process in more detail.

Understanding Strings and Bytes in Python

Two of the core built-in data types in Python are str and bytes. These data types share common features but have key differences.

Both str and bytes are immutable sequences, meaning we can't modify their elements after creation. A string is an immutable sequence of characters, whereas a bytes object is an immutable sequence of integers between 0 and 255. This range of integers can be represented by 8 bits, which is one byte. Therefore, a bytes object is a sequence of bytes.
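
As a rough illustration of these shared properties, the sketch below indexes a string and a bytes object and then tries to modify each one. The variable names text, data, and immutable_types are chosen for this example only:

```python
text = "Python"
data = b"Python"

# Indexing a str gives a one-character string; indexing bytes gives an integer
print(text[0])
print(data[0])

# Both types are immutable, so item assignment raises a TypeError
immutable_types = []
for sequence in (text, data):
    try:
        sequence[0] = sequence[0]
    except TypeError:
        immutable_types.append(type(sequence).__name__)

print(immutable_types)
```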

ASCII character encoding

Let's consider ASCII characters first. ASCII (American Standard Code for Information Interchange) is a character encoding that contains only 128 characters. Therefore, any ASCII character can be represented by seven bits, so it always fits within a single byte.
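
One way to check the seven-bit claim is to look at the code points of the printable ASCII characters that the standard library lists in string.printable. This is only a sketch, and the names codes, highest_code, and fits_in_seven_bits are illustrative:

```python
import string

# Code points of the printable ASCII characters
codes = [ord(character) for character in string.printable]

highest_code = max(codes)
fits_in_seven_bits = all(code.bit_length() <= 7 for code in codes)
print(highest_code)
print(fits_in_seven_bits)

# str.isascii() (Python 3.7+) reports whether a string contains only ASCII characters
print("Python".isascii())
print("café".isascii())
```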

We can create a bytes object by adding b in front of the single, double, or triple quotes we normally use for strings:

word_as_bytes = b"Python"
print(word_as_bytes)
print(type(word_as_bytes))

b'Python'
<class 'bytes'>

Although the code displays the characters spelling Python, each element in the bytes object is an integer between 0 and 255:

print(word_as_bytes[0])
print(list(word_as_bytes))

80
[80, 121, 116, 104, 111, 110]

The first element of word_as_bytes is the integer 80, which is the ASCII code for uppercase P:

print(chr(80))

P

When casting word_as_bytes to a list, the list contains the integers representing each byte. The integers are the ASCII codes for the letters in the word Python.

However, the ASCII character set is limited.

UTF-8 character encoding

The most common character encoding is UTF-8, a variable-width Unicode encoding built on 8-bit units. The 128 ASCII characters are represented by the same integers in UTF-8, and each of them takes a single byte. All other characters are represented using two, three, or four bytes each.
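
To see how the number of bytes varies, here is a small sketch that encodes four characters picked for illustration (they aren't taken from the examples in this tutorial) and records how many bytes each one needs:

```python
# Sample characters chosen so that each needs a different number of UTF-8 bytes
samples = ["A", "é", "€", "😀"]

byte_lengths = {}
for character in samples:
    encoded = character.encode("utf-8")
    byte_lengths[character] = len(encoded)
    print(character, len(encoded), list(encoded))

# An ASCII character is encoded identically in ASCII and UTF-8
print("A".encode("ascii") == "A".encode("utf-8"))
```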

Let's create a bytes object using non-ASCII characters. A bytes literal can only contain ASCII characters, so we'll use the bytes() constructor with an explicit encoding instead:

word_as_bytes = bytes("café", "utf-8")
print(word_as_bytes)

b'caf\xc3\xa9'

The bytes object displays the first three letters in café directly. However, the accented é is not an ASCII character, and it's represented by two bytes, which are displayed as \xc3 and \xa9. These bytes represent the hexadecimal numbers c3 and a9, which are the integers 195 and 169. These two bytes combined represent é in UTF-8:

print(len(word_as_bytes))
print(list(word_as_bytes))

5
[99, 97, 102, 195, 169]

A five-element bytes object represents the four-letter string.
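
The difference in length can be checked directly, and decoding with the same encoding should give back the original string. A minimal sketch, where word and round_trip are names used only for this example:

```python
word = "café"
word_as_bytes = bytes(word, "utf-8")

# len() counts characters in a str and bytes in a bytes object
print(len(word))
print(len(word_as_bytes))

# Decoding with the same encoding recovers the original string
round_trip = word_as_bytes.decode("utf-8")
print(round_trip == word)
```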

Converting Strings to Bytes: The encode() Method

Earlier, we used the bytes() constructor to convert a text string into a bytes object. A more common way of converting Python strings to bytes is to use the .encode() string method, which gives control over encoding and error handling. This method returns a bytes object that represents the string:

word_as_bytes = "Hello Python!".encode()
print(word_as_bytes)
print(type(word_as_bytes))

b'Hello Python!'
<class 'bytes'>

The .encode() method defaults to the UTF-8 encoding. UTF-8 is the most widely used encoding format, and it supports a much wider range of characters than ASCII. UTF-8 represents each character with a sequence of one to four bytes.
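
The bytes() constructor and the .encode() method should produce the same result when given the same encoding. The sketch below compares them and shows one practical difference: bytes() needs an explicit encoding when it's passed a string. The names from_constructor, from_method, and needs_encoding are illustrative:

```python
text = "café"

from_constructor = bytes(text, "utf-8")
from_method = text.encode("utf-8")
print(from_constructor == from_method)

# .encode() falls back to UTF-8, but bytes() raises a TypeError
# if a string is passed without an encoding
needs_encoding = False
try:
    bytes(text)
except TypeError as error:
    needs_encoding = True
    print(error)

# bytes() also accepts an errors argument as its third parameter
print(bytes(text, "ascii", "replace"))
```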

We can call .encode() with an alternative encoding as an argument:

word_as_bytes = "Hello Python!".encode("utf-16")
print(word_as_bytes)

b'\xff\xfeH\x00e\x00l\x00l\x00o\x00 \x00P\x00y\x00t\x00h\x00o\x00n\x00!\x00'

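The bytes object is different in this case as it represents the UTF-16 encoding of the same text string. It starts with the two-byte byte order mark \xff\xfe, and each character in this string takes two bytes.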
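
The first two bytes of a UTF-16 result form a byte order mark (BOM), and their order depends on the platform. As a sketch, the explicit "utf-16-le" and "utf-16-be" codecs fix the byte order and leave out the BOM. The short string "Hi" is used here only to keep the output readable:

```python
import codecs

text = "Hi"

# "utf-16" prepends a BOM and uses the platform's native byte order
with_bom = text.encode("utf-16")
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# The explicit variants fix the byte order and omit the BOM
little_endian = text.encode("utf-16-le")
big_endian = text.encode("utf-16-be")
print(little_endian)
print(big_endian)
```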

Encoding Errors

Since not all encodings include all characters, errors can occur when encoding a string into a bytes object.

Let's consider the string "Café • £2.20", which has three non-ASCII characters. We can encode this using the default UTF-8 encoding:

word_as_bytes = "Café • £2.20".encode()
print(word_as_bytes)

b'Caf\xc3\xa9 \xe2\x80\xa2 \xc2\xa32.20'

The ASCII characters are displayed directly, while the bytes that encode the non-ASCII characters are displayed as hexadecimal escape sequences. However, .encode() raises an error if the same string is encoded using ASCII since several characters aren't present in the ASCII encoding:

word_as_bytes = "Café • £2.20".encode("ascii")
print(word_as_bytes)

Traceback (most recent call last):
...
word_as_bytes = "Café • £2.20".encode("ascii")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 3: ordinal not in range(128)

The .encode() string method has a parameter errors, which has a default value of "strict". The "strict" argument forces .encode() to raise a UnicodeEncodeError when a character can't be encoded.
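
When strict handling is what we want, the exception can be caught and inspected. The sketch below relies on the standard attributes of UnicodeEncodeError (encoding, object, start, end, and reason). The variable names that store them are chosen for this example:

```python
text = "Café • £2.20"

try:
    text.encode("ascii")  # errors="strict" is the default
except UnicodeEncodeError as error:
    # The exception records which codec failed and where in the string
    failed_encoding = error.encoding
    failed_position = error.start
    failed_character = error.object[error.start:error.end]
    failure_reason = error.reason
    print(failed_encoding, failed_position, repr(failed_character))
    print(failure_reason)
```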

However, there are other options for handling errors. One option is to ignore the errors using the "ignore" argument:

word_as_bytes = "Café • £2.20".encode("ascii", errors="ignore")
print(word_as_bytes)

b'Caf  2.20'

This option doesn't raise an error. Instead, .encode() returns a bytes object, and the accented é, the bullet point, and the pound sign are omitted.

Another option is to replace the characters that can't be encoded with something else. There are several replacement options. One of these is the errors="replace" argument:

word_as_bytes = "Café • £2.20".encode("ascii", errors="replace")
print(word_as_bytes)

b'Caf? ? ?2.20'

The three non-ASCII characters, which can't be encoded in this example, are replaced with a question mark. Therefore, each missing character is replaced with another single character that acts as a placeholder.

We can also replace the missing characters with more informative text:

word_as_bytes = "Café • £2.20".encode("ascii", errors="backslashreplace")
print(word_as_bytes)

b'Caf\\xe9 \\u2022 \\xa32.20'

Calling .encode() with errors="backslashreplace" replaces the characters that can't be encoded with backslash escape sequences based on their Unicode code points. The hexadecimal number e9 represents the accented é. The bullet point is the Unicode character U+2022, and the hexadecimal number a3 represents the pound sign.

We can also use "xmlcharrefreplace" to replace the missing characters with their XML numeric character references:

word_as_bytes = "Café • £2.20".encode("ascii", errors="xmlcharrefreplace")
print(word_as_bytes)

b'Caf&#233; &#8226; &#163;2.20'

Another option is to replace the characters that can't be encoded with their Unicode names using errors="namereplace":

word_as_bytes = "Café • £2.20".encode("ascii", errors="namereplace")
print(word_as_bytes)

b'Caf\\N{LATIN SMALL LETTER E WITH ACUTE} \\N{BULLET} \\N{POUND SIGN}2.20'

Different situations may require tailored error handling, and the .encode() string method provides several options for dealing with characters that can't be encoded.
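
To compare the handlers side by side, the following sketch loops over the options covered in this section and encodes the same string with each one. The handlers list and results dictionary are names introduced for this example:

```python
text = "Café • £2.20"
handlers = [
    "ignore",
    "replace",
    "backslashreplace",
    "xmlcharrefreplace",
    "namereplace",
]

# Encode the same string to ASCII with each error handler
results = {}
for handler in handlers:
    results[handler] = text.encode("ascii", errors=handler)
    print(f"{handler:>18}: {results[handler]}")
```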

Applications of String-to-Byte Conversion in Data Science

String-to-byte conversion is a fundamental operation that finds applications in various data science domains:

Natural language processing (NLP): When working with text data for tasks like sentiment analysis, topic modeling, or machine translation, we often preprocess text by tokenizing it into words or subwords. Some tokenizers, such as byte-level BPE tokenizers, operate on the UTF-8 bytes of the text rather than on its characters, so the string is converted to bytes first.
Data cleaning and preprocessing: Byte-level operations can be useful for cleaning text data, such as removing invalid characters or normalizing text based on specific byte patterns.
Feature engineering: In some cases, byte-level features (e.g., n-grams of bytes) can be extracted from text data and used as input features for machine learning models.
Web scraping and data extraction: When scraping data from websites, we often receive HTML or other text-based content that might need to be parsed and processed at the byte level to extract relevant information.
Data compression: Certain data compression algorithms operate on byte sequences, so converting strings to bytes can be a necessary step before applying compression techniques.

Understanding these applications can help us identify situations where converting strings to bytes can be useful.
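
Two of these applications can be sketched with the standard library alone: counting byte-level bigrams with collections.Counter and compressing the encoded text with zlib. The sample sentence and the variable names are made up for this illustration, and real feature extraction or compression pipelines would likely be more involved:

```python
import zlib
from collections import Counter

# Made-up sample text, repeated so that compression has something to work with
text = "to be or not to be, that is the question. " * 20
data = text.encode("utf-8")

# Byte-level bigrams: pairs of adjacent bytes and how often each pair occurs
bigrams = Counter(zip(data, data[1:]))
most_common_pair, count = bigrams.most_common(1)[0]
print(bytes(most_common_pair), count)

# zlib operates on bytes, so the string must be encoded before compressing
compressed = zlib.compress(data)
print(len(data), len(compressed))

# Decompressing and decoding recovers the original string
restored = zlib.decompress(compressed).decode("utf-8")
print(restored == text)
```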

Conclusion

Strings are sequences of human-readable characters. These characters are encoded as bytes of binary data, which can be stored in a bytes object. A bytes object is a sequence of integers, and each integer represents a byte.

Many applications, such as input and output operations and data transmission, require strings to be converted to bytes objects. We can perform this conversion using either the bytes() constructor or the string .encode() method. Mastering the conversion between strings and bytes enables more flexible data manipulation.

Author

Stephen Gruppetta
