Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hash performance #56

Open
chrstphrbrns opened this issue Jul 13, 2023 · 6 comments
Open

hash performance #56

chrstphrbrns opened this issue Jul 13, 2023 · 6 comments

Comments

@chrstphrbrns
Copy link

julia> a_string="a"^32;

julia> a_asciistring = ASCIIString(a_string);

julia> a_utf8string = UTF8String(a_string);

julia> @btime hash(a_string)
  24.000 ns (1 allocation: 16 bytes)
0xd44c9a12c524b43a

julia> @btime hash(a_asciistring)
  354.739 ns (5 allocations: 240 bytes)
0xd44c9a12c524b43a

julia> @btime hash(a_utf8string)
  460.051 ns (5 allocations: 240 bytes)
0xd44c9a12c524b43a
@stevengj
Copy link
Member

stevengj commented Jul 13, 2023

That's because these are calling the fallback hashing routine that converts to a String: https://github.com/JuliaLang/julia/blob/9091cb0ac667049a1095955b4a134f6eeb70d403/base/strings/basic.jl#L355

We could add specialized hashing methods to LegacyStrings.jl, of course.

It's not completely clear who is using this package, however, and whether its performance matters — it was mainly introduced for backwards compatibility when these types were kicked out of Base, but for most purposes we recommend using the built-in String type (or more specialized string types like StringViews, StaticStrings, etcetera).

@chrstphrbrns
Copy link
Author

The idea of an ASCIIString type is useful as it allows certain optimizations, like O(1) length and getindex, so I think it's worth maintaining

I was actually going to suggest such a type be added to Base before learning of this package (in fact, I still think it belongs there)

@stevengj
Copy link
Member

stevengj commented Jul 13, 2023

The idea of an ASCIIString type is useful as it allows certain optimizations, like O(1) length and getindex, so I think it's worth maintaining

Can you give an example where O(1) length and character-based indexing is useful? (String already has O(1) getindex, just based on codeunit indices rather than character indices — is normally all you need since indices don't simply fall out of the air … they almost invariably come from a previous loop over the string, e.g. a search, in which case opaque references like codeunit indices are fine.)

See also myth 8.3 at utf8everywhere.org … we're quite aware of the instinctive desire for O(1) length etcetera and generally feel that it has been convincingly refuted for the vast majority of applications.

But maybe this isn't the best place to debate this. We've discussed the UTF-8 encoding and string indices ad nauseam on Julia discourse (e.g. see here), and it is also discussed in many places online e.g. utf8everywhere.org.

@chrstphrbrns
Copy link
Author

I don't have a super compelling use case in mind, but the possibilities aren't limited by my ability to imagine them

Anyways, I trust you've given this subject more thought than I have. I'm okay with closing this if you think it isn't worth the effort

@stevengj
Copy link
Member

stevengj commented Jul 14, 2023

It would be easy to at least add fast hashing for ASCIIString, since that's literally the same bytes as UTF-8 and so it can use the same code as in base, e.g. it should be as simple as:

function Base.hash(s::ASCIIString, h::UInt)
    h += Base.memhash_seed
    ccall(Base.memhash, UInt, (Ptr{UInt8}, Csize_t, UInt32), s, sizeof(s), h % UInt32) + h
end

Fast hashing for the other legacy string types is harder because we want them to hash to the same value as String.

@chrstphrbrns
Copy link
Author

chrstphrbrns commented Jul 14, 2023

Sounds good

BTW, isequal is also much slower for ASCIIString

julia> @btime isequal(a,b)
  17.790 μs (0 allocations: 0 bytes)          # ::String
true

julia> @btime isequal(x,y)
  1.333 ms (0 allocations: 0 bytes)          # ::ASCIIString
true

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants