Fast access to item data of str and bytes objects #1037

Open
JukkaL opened this issue Nov 26, 2023 · 0 comments
JukkaL commented Nov 26, 2023

Right now, accessing individual items of str or bytes objects is not very efficient. One way to make this faster would be to support primitive view types that give direct access to the underlying item data. Fast access to individual bytes/str items is useful in many libraries and low-level use cases, such as parsers and decoders.

For example, BytesView could allow direct access to the data within a bytes object. It would act like a sequence of integers, and wouldn't support most bytes methods. We'd represent it as a stack-allocated, immutable value object with three attributes:

  • Pointer to the beginning of the data view (char *)
  • Length of data (size_t)
  • Object (PyObject *)

We'd also support slicing, which would return a smaller view, and wouldn't copy any data. It would be a very fast operation in compiled code.
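To make the three-field layout concrete, here is a rough pure-Python model of it. This is only a sketch of the semantics, not the proposed implementation: the names `BytesViewModel` and `view` are hypothetical, an integer offset stands in for the `char *` data pointer, and a real version would be a stack-allocated value in compiled code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BytesViewModel:
    """Pure-Python stand-in for the proposed value object (hypothetical name)."""

    obj: bytes   # owning object; stands in for the PyObject * field
    start: int   # offset into obj; stands in for the char * data pointer
    length: int  # number of visible bytes; stands in for the size_t field

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, index):
        if isinstance(index, slice):
            begin, end, step = index.indices(self.length)
            if step != 1:
                raise ValueError("this sketch only supports step 1")
            # Only the offset and length change; obj is shared, not copied.
            return BytesViewModel(self.obj, self.start + begin, max(0, end - begin))
        if index < 0:
            index += self.length
        if not 0 <= index < self.length:
            raise IndexError("view index out of range")
        return self.obj[self.start + index]


def view(data: bytes) -> BytesViewModel:
    """Create a view covering all of data (hypothetical helper)."""
    return BytesViewModel(data, 0, len(data))
```

In this model, slicing creates a new small value that shares `obj` with the original view, so its cost does not depend on the slice length.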

Example:

v = BytesView(b'foo')
v[0]  # 102 (ord('f'))
len(v)  # 3
v[0:2]  # BytesView(b'fo')
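For comparison, the built-in `memoryview` already has similar indexing and slicing semantics today, although it is a regular heap-allocated object rather than the stack-allocated value proposed here. The snippet below only illustrates the intended behavior; it is not the proposed API.

```python
data = b"foo"
mv = memoryview(data)

first = mv[0]     # an int, not a bytes object
size = len(mv)    # number of bytes visible through the view
prefix = mv[0:2]  # another memoryview over the same buffer, no data copied

# Converting back to bytes is the only step here that copies data.
prefix_bytes = bytes(prefix)
```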

This approach has some benefits over adding primitives to operate directly on bytes values:

  • Slicing would be constant-time (and very fast), unlike bytes slicing, which allocates a new object.
  • We can support subclasses of bytes and bytearray universally without slowing down the plain bytes case, which is the common one. If the target is mutable, we'd construct a temporary bytes object behind the scenes.
  • All operations on BytesView can be fast, so performance will be more predictable compared to dealing with bytes objects directly, as the latter have many unoptimized methods.
  • We can provide a very similar interface for direct access to the contents of str objects.

Similarly, StrView would provide direct, read-only access to the code point array backing a str. Here the performance benefit is more obvious than with BytesView, since indexing a str produces a new str of length 1, which is clearly less efficient than a (native) integer. Example:

v = StrView('foo')
v[0]  # 102 (ord('f'))
len(v)  # 3
v[0:2]  # StrView('fo')
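The sketch below contrasts what indexing a str does today with what a view could offer, using a pure-Python model. `StrViewModel` and `count_digits` are hypothetical names used only for illustration, and the model says nothing about how a compiled implementation would read the code point array.

```python
text = "foo"
ch = text[0]    # today: a new str of length 1
code = ord(ch)  # extra step to get the integer code point


class StrViewModel:
    """Pure-Python stand-in for the proposed StrView (hypothetical name)."""

    def __init__(self, obj, start=0, length=None):
        self.obj = obj
        self.start = start
        self.length = len(obj) - start if length is None else length

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if isinstance(index, slice):
            begin, end, step = index.indices(self.length)
            if step != 1:
                raise ValueError("this sketch only supports step 1")
            # The new view shares obj; only the offset and length differ.
            return StrViewModel(self.obj, self.start + begin, max(0, end - begin))
        if index < 0:
            index += self.length
        if not 0 <= index < self.length:
            raise IndexError("view index out of range")
        return ord(self.obj[self.start + index])


def count_digits(v):
    """Parser-style loop that compares integer code points directly."""
    count = 0
    for i in range(len(v)):
        if 48 <= v[i] <= 57:  # code points of '0' through '9'
            count += 1
    return count
```

A loop like `count_digits` is the kind of code that would benefit most, since each item access would be a plain integer read in compiled code instead of a str allocation.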

All views could support some additional operations for convenience, beyond basic sequence operations:

  • Equality with bytes / str objects
  • startswith() and endswith()
  • Others that turn out to be useful
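As a rough illustration of how these convenience operations could avoid copying, the helpers below compare a view's window against a bytes object using the existing `start`/`end` arguments of `bytes.startswith()` and `bytes.endswith()`. The function names and the `(obj, start, length)` parameters are hypothetical stand-ins for a view's fields.

```python
def view_startswith(obj: bytes, start: int, length: int, prefix: bytes) -> bool:
    """True if the window obj[start:start + length] begins with prefix."""
    return obj.startswith(prefix, start, start + length)


def view_endswith(obj: bytes, start: int, length: int, suffix: bytes) -> bool:
    """True if the window obj[start:start + length] ends with suffix."""
    return obj.endswith(suffix, start, start + length)


def view_equals(obj: bytes, start: int, length: int, other: bytes) -> bool:
    """True if the window has exactly the same contents as other."""
    # Equal length plus a matching prefix means the contents are identical.
    return length == len(other) and obj.startswith(other, start, start + length)
```

None of these helpers slices `obj`, so no intermediate bytes object is created for the comparison.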

Additionally, StrView could support some operations for querying the internal representation (whether 1, 2 or 4 bytes is used per code point; maximum code point value).
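For reference, the same information can be derived in pure Python today, at the cost of scanning the whole string. The sketch below assumes CPython's PEP 393 flexible string representation, where the storage width follows from the largest code point; the function names are hypothetical, and a StrView query would presumably read this from the object header instead of scanning.

```python
def max_code_point(text: str) -> int:
    """Largest code point in text, or 0 for an empty string."""
    return max(map(ord, text), default=0)


def bytes_per_code_point(text: str) -> int:
    """Storage width CPython would pick for text under PEP 393 (1, 2 or 4)."""
    largest = max_code_point(text)
    if largest < 0x100:    # ASCII and Latin-1 range
        return 1
    if largest < 0x10000:  # Basic Multilingual Plane
        return 2
    return 4               # anything beyond the BMP
```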

Related issue:

@JukkaL JukkaL added the speed label Nov 26, 2023