String expression operators operate on bytes rather than characters, with unexpected results for any input that isn’t pure ASCII. The style specification unhelpfully only mentions lengths and indices but doesn’t define them in terms of anything. Even so, a typical developer is very likely to assume it counts the number of characters. A byte count would be the least expected value.
Example
Create a new iOS project with the following view controller code:
Expected and actual behavior
The label should indicate the length of the name constant in characters:
| name | UTF-8 | Expected label | Actual label |
|---|---|---|---|
| A | 41 | 1 | 1 |
| ñ | C3 B1 | 1 | 2 |
| 丐 | E4 B8 90 | 1 | 3 |
| 𦨭 | F0 A6 A8 AD | 1 | 4 |
| 🇺🇳 | F0 9F 87 BA F0 9F 87 B3 | 1 or 2 or 4 | 8 |
| 丐𦨭市镇 | E4 B8 90 F0 A6 A8 AD E5 B8 82 E9 95 87 | 4 | 13 |
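The values in the "Actual label" column are exactly the UTF-8 byte counts of each string. The following is a minimal sketch, not code from mbgl, of what a byte-based length reports; `byteLength` and the sample constants are hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper mimicking a byte-based `length` operator.
// std::string::size() counts char elements, which for UTF-8 means bytes.
std::size_t byteLength(const std::string& utf8) {
    return utf8.size();
}

// Samples from the table, spelled as explicit UTF-8 bytes so the result
// does not depend on the encoding of the source file.
const std::string kEnye = "\xC3\xB1";        // ñ
const std::string kCjk = "\xE4\xB8\x90";     // 丐
const std::string kPlaceName =               // 丐𦨭市镇
    "\xE4\xB8\x90\xF0\xA6\xA8\xAD\xE5\xB8\x82\xE9\x95\x87";
```

With these inputs, `byteLength` returns 2, 3, and 13 respectively, matching the table rather than the character counts a developer would expect.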
Here’s what the map looks like when name is 丐𦨭市镇:
Impact
This would be especially surprising to an iOS/macOS developer, since Objective-C and Swift are both very opinionated about how strings are stored and measured:
| String | UTF-8 | NSString.length (Objective-C) | String.count (Swift) |
|---|---|---|---|
| A | 41 | 1 | 1 |
| ñ | C3 B1 | 1 | 1 |
| 丐 | E4 B8 90 | 1 | 1 |
| 𦨭 | F0 A6 A8 AD | 2 | 1 |
| 🇺🇳 | F0 9F 87 BA F0 9F 87 B3 | 4 | 1 |
| 丐𦨭市镇 | E4 B8 90 F0 A6 A8 AD E5 B8 82 E9 95 87 | 5 | 4 |
The gold standard is to count graphemes, as in Swift, but at least counting UTF-16 code units would be a little more reasonable and consistent with GL JS.
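For comparison, UTF-16 lengths like those in the NSString.length column can be derived directly from UTF-8 bytes without converting the string. This is a sketch that assumes well-formed UTF-8 input; `codePointCount` and `utf16Length` are hypothetical helper names, not existing mbgl functions. Grapheme counting, as in Swift, needs Unicode segmentation data and is not attempted here.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points by skipping UTF-8 continuation bytes (10xxxxxx).
// Assumes the input is well-formed UTF-8.
std::size_t codePointCount(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Counts UTF-16 code units. Every code point contributes one unit, and
// code points above U+FFFF (4-byte UTF-8 sequences, lead byte 11110xxx)
// contribute a second unit for the surrogate pair.
std::size_t utf16Length(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
        if ((byte & 0xF8) == 0xF0) {
            ++count;
        }
    }
    return count;
}
```

For the UTF-8 bytes of 丐𦨭市镇, `utf16Length` gives 5 and `codePointCount` gives 4. For the flag 🇺🇳 they give 4 and 2, which is why the expected label for that row is ambiguous.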
Platform information
OS: iOS
Platform: iOS/macOS (but this probably affects every native platform)
Version: 6.5.4 (but this has probably been present since expressions were first implemented)
Diagnosis
As detailed in maplibre/maplibre-gl-js#4550 (comment), MVT-compliant tiles encode strings as UTF-8. Each implementation is free to store the string however it pleases; apparently mbgl is storing it as an std::string (aka std::basic_string<char>). It isn’t necessarily a problem that mbgl stores the string as raw bytes, but this implementation detail should not be exposed to the developer. Unfortunately, the length operator simply calls size() on the raw byte string:
maplibre-native/src/mbgl/style/expression/length.cpp
Line 16 in ac606a1
Some other string operators also appear to operate on raw bytes, even expecting a raw byte offset as input:
maplibre-native/src/mbgl/style/expression/index_of.cpp
Line 80 in ac606a1
maplibre-native/src/mbgl/style/expression/slice.cpp
Line 92 in ac606a1
At least std::string should be replaced by a multibyte container such as std::u8string or std::u16string to handle extremely common cases like accented Latin text and Arabic text. But really this implementation should be using ICU, which is already a dependency or available from the platform on every supported platform, as far as I can tell.
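If the string stays stored as UTF-8 bytes, index-based operators could translate indices at the boundary instead of exposing byte offsets. The sketch below shows one way to slice by UTF-16 indices, the unit the issue suggests for consistency with GL JS. It assumes well-formed UTF-8; `byteOffsetForUtf16Index` and `sliceUtf16` are hypothetical names, not mbgl functions, and this is not the fix adopted by the project.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts a UTF-16 code unit index into a byte offset
// in a well-formed UTF-8 string. An index that falls inside a surrogate pair
// rounds up to the end of that code point; indices past the end clamp to size().
std::size_t byteOffsetForUtf16Index(const std::string& utf8, std::size_t index) {
    std::size_t units = 0;
    std::size_t offset = 0;
    while (offset < utf8.size() && units < index) {
        const unsigned char lead = static_cast<unsigned char>(utf8[offset]);
        // Sequence length in bytes, derived from the lead byte.
        const std::size_t length = lead < 0x80            ? 1
                                   : (lead & 0xE0) == 0xC0 ? 2
                                   : (lead & 0xF0) == 0xE0 ? 3
                                                           : 4;
        units += (length == 4) ? 2 : 1;  // 4-byte sequences need a surrogate pair
        offset += length;
    }
    return offset < utf8.size() ? offset : utf8.size();
}

// Hypothetical slice taking UTF-16 indices [begin, end) and returning UTF-8.
std::string sliceUtf16(const std::string& utf8, std::size_t begin, std::size_t end) {
    const std::size_t from = byteOffsetForUtf16Index(utf8, begin);
    const std::size_t to = byteOffsetForUtf16Index(utf8, end);
    return to > from ? utf8.substr(from, to - from) : std::string();
}
```

With the bytes of 丐𦨭市镇, `sliceUtf16(name, 3, 5)` returns the bytes of 市镇, whereas a byte-based slice with the same arguments would cut through the middle of 𦨭. A production implementation would more likely delegate to ICU, as suggested above, and would also need to handle malformed input.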