Fast access to item data of str and bytes objects #1037

JukkaL · 2023-11-26T13:45:46Z

Right now, accessing individual items of str or bytes objects is not very efficient. One way to make this faster would be to support primitive types that allow direct access to item data via view objects. Fast access to individual bytes/str items is very useful in various libraries and low-level use cases, such as parsers and decoders.

For example, BytesView could allow direct access to the data within a bytes object. It would act like a sequence of integers, and wouldn't support most bytes methods. We'd represent it as a stack-allocated, immutable value object with three attributes:

Pointer to the beginning of the data view (char *)
Length of data (size_t)
Object (PyObject *)

We'd also support slicing, which would return a smaller view, and wouldn't copy any data. It would be a very fast operation in compiled code.

Example:

v = BytesView(b'foo')
v[0]  # 102 (ord('f'))
len(v)  # 3
v[0:2]  # BytesView(b'fo')
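
The semantics above can be modeled in pure Python. This is only a minimal sketch, not the proposed implementation: the class name `BytesViewModel` and the `tobytes()` helper are hypothetical, and an integer offset stands in for the `char *` pointer.

```python
from typing import Optional, Union


class BytesViewModel:
    """Pure-Python model of the proposed view (hypothetical name, not a real API)."""

    __slots__ = ("_obj", "_start", "_length")

    def __init__(self, obj: bytes, start: int = 0, length: Optional[int] = None) -> None:
        self._obj = obj        # keeps the underlying bytes object alive
        self._start = start    # stands in for the pointer to the start of the data
        self._length = len(obj) - start if length is None else length

    def __len__(self) -> int:
        return self._length

    def __getitem__(self, index: Union[int, slice]) -> Union[int, "BytesViewModel"]:
        if isinstance(index, slice):
            start, stop, step = index.indices(self._length)
            if step != 1:
                raise ValueError("only contiguous slices are modeled here")
            # A smaller view over the same object; no item data is copied.
            return BytesViewModel(self._obj, self._start + start, max(stop - start, 0))
        if index < 0:
            index += self._length
        if not 0 <= index < self._length:
            raise IndexError("BytesViewModel index out of range")
        return self._obj[self._start + index]

    def tobytes(self) -> bytes:
        # Hypothetical helper that copies the viewed range out, for inspection only.
        return self._obj[self._start:self._start + self._length]
```

In compiled code the three fields would live in a stack-allocated value rather than a heap object, so this model only illustrates behavior, not performance.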

This approach has some benefits over adding primitives to operate directly on bytes values:

Slicing would be constant-time (and very fast), unlike bytes slicing, which allocates a new object.
We can support subclasses of bytes and bytearray universally without slowing down the bytes case, which is the common case. We'd construct a temporary bytes object behind the scenes if the target is mutable.
All operations on BytesView can be fast, so performance will be more predictable compared to dealing with bytes objects directly, as the latter have many unoptimized methods.
We can provide a very similar interface for direct access to the contents of str objects.

Similarly, StrView would provide direct, read-only access to the code point array backing a str. Here the performance benefit is more obvious than for BytesView, since indexing a string produces a string of length 1, which is clearly not as efficient as a (native) integer. Example:

v = StrView('foo')
v[0]  # 102 (ord('f'))
len(v)  # 3
v[0:2]  # StrView('fo')
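
To illustrate why integer items help in parser-style code, here is a small sketch of a digit scanner written against plain `str`. The function name `parse_uint` is hypothetical; with the proposed StrView, the `ord(text[pos])` step would presumably become a direct `view[pos]` read.

```python
from typing import Tuple

ZERO = ord("0")
NINE = ord("9")


def parse_uint(text: str, pos: int = 0) -> Tuple[int, int]:
    """Parse decimal digits starting at pos; return (value, next position)."""
    value = 0
    start = pos
    while pos < len(text):
        # text[pos] yields a length-1 str, which ord() then converts to an int.
        # A view that yields integers directly would skip the intermediate str.
        cp = ord(text[pos])
        if not ZERO <= cp <= NINE:
            break
        value = value * 10 + (cp - ZERO)
        pos += 1
    if pos == start:
        raise ValueError("expected a digit")
    return value, pos
```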

All views could support some additional operations for convenience, beyond basic sequence operations:

Equality with bytes / str objects
startswith() and endswith()
Others that turn out to be useful
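
As a rough illustration of these convenience operations, the sketch below uses Python's built-in `memoryview` as a stand-in for the proposed BytesView, since it also slices without copying and compares equal to bytes with the same contents. The helper names `view_startswith` and `view_endswith` are hypothetical.

```python
def view_startswith(view: memoryview, prefix: bytes) -> bool:
    """Return True if the viewed data begins with prefix."""
    n = len(prefix)
    # Slicing a memoryview creates another view; the item data is not copied.
    return len(view) >= n and view[:n] == prefix


def view_endswith(view: memoryview, suffix: bytes) -> bool:
    """Return True if the viewed data ends with suffix."""
    n = len(suffix)
    return len(view) >= n and view[len(view) - n:] == suffix
```

For example, `view_startswith(memoryview(b'foobar'), b'foo')` returns `True`.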

Additionally, StrView could support some operations for querying the internal representation (whether 1, 2 or 4 bytes is used per code point; maximum code point value).

Related issue:

Fast str and bytes builders #1036

JukkaL added the speed label Nov 26, 2023
