Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String expressions operate on bytes instead of characters #2730

Open
1ec5 opened this issue Aug 16, 2024 · 2 comments
Open

String expressions operate on bytes instead of characters #2730

1ec5 opened this issue Aug 16, 2024 · 2 comments

Comments

@1ec5
Copy link
Contributor

1ec5 commented Aug 16, 2024

String expression operators operate on bytes rather than characters, with unexpected results for any input that isn’t pure ASCII. The style specification unhelpfully only mentions lengths and indices but doesn’t define them in terms of anything. Even so, a typical developer is very likely to assume it counts the number of characters. A byte count would be the least expected value.

Example

Create a new iOS project with the following view controller code:

import UIKit
import MapLibre

// Set this constant to any non-ASCII string.
let name = "ñ"

class ViewController: UIViewController, MLNMapViewDelegate {
	var mapView: MLNMapView!

	override func viewDidLoad() {
		super.viewDidLoad()
		
		mapView = MLNMapView(frame: view.bounds)
		mapView.delegate = self
		mapView.autoresizingMask = [.flexibleWidth, .flexibleHeight]
		mapView.styleURL = MLNStyle.predefinedStyle("Streets")?.url
		view.addSubview(mapView)
	}
	
	func mapView(_ mapView: MLNMapView, didFinishLoading style: MLNStyle) {
		let feature = MLNPointFeature()
		feature.coordinate = CLLocationCoordinate2D(latitude: 0, longitude: 0)
		feature.attributes = [
			"name": name,
		]
		let source = MLNShapeSource(identifier: "test", shape: feature)
		style.addSource(source)
		
		let lengthLayer = MLNSymbolStyleLayer(identifier: "length", source: source)
		lengthLayer.text = NSExpression(mglJSONObject: ["to-string", ["length", ["get", "name"]]])
		lengthLayer.textFontNames = NSExpression(forConstantValue: ["Open Sans Semibold"])
		lengthLayer.textFontSize = NSExpression(forConstantValue: 30)
		lengthLayer.textOffset = NSExpression(forConstantValue: CGPoint(x: 0, y: 0))
		style.addLayer(lengthLayer)
	}
}

Expected and actual behavior

The label should indicate the length of the name constant in characters:

name UTF-8 Expected label Actual label
A 41 1 1
ñ C3 B1 1 2
E4 B8 90 1 3
𦨭 F0 A6 A8 AD 1 4
🇺🇳 F0 9F 87 BA F0 9F 87 B3 1 or 2 or 4 8
丐𦨭市镇 E4 B8 90 F0 A6 A8 AD E5 B8 82 E9 95 87 4 13

Here’s what the map looks like when name is 丐𦨭市镇:

The number 13 centered over Null Island, over a map of the world.

Impact

This would be especially surprising to an iOS/macOS developer, since Objective-C and Swift are both very opinionated about how strings are stored and measured:

String UTF-8 NSString.length
(Objective-C)
String.count
(Swift)
A 41 1 1
ñ C3 B1 1 1
E4 B8 90 1 1
𦨭 F0 A6 A8 AD 2 1
🇺🇳 F0 9F 87 BA F0 9F 87 B3 4 1
丐𦨭市镇 E4 B8 90 F0 A6 A8 AD E5 B8 82 E9 95 87 5 4

The gold standard is to count graphemes, as in Swift, but at least counting UTF-16 characters would be a little more reasonable and consistent with GL JS.

Platform information

  • OS: iOS
  • Platform: iOS/macOS (but this probably affects every native platform)
  • Version: 6.5.4 (but this has probably been present since expressions were first implemented)

Diagnosis

As detailed in maplibre/maplibre-gl-js#4550 (comment), MVT-compliant tiles encode strings as UTF-8. Each implementation is free to store the string however it pleases; apparently mbgl is storing it as an std::string (aka std::basic_string<char>). It isn’t necessarily a problem that mbgl stores the string as raw bytes, but this implementation detail should not be exposed to the developer. Unfortunately, the length operator simply calls size() on the raw byte string:

return value->match([](const std::string& s) { return EvaluationResult{static_cast<double>(s.size())}; },

Some other string operators also appear to operate on raw bytes, even expecting a raw byte offset as input:

size_t index = string.find(keywordString, fromIndexValue);
int length = static_cast<int>(string.size());

At least std::string should be replaced by a multibyte container such as std::u8string or std::u16string to handle extremely common cases like accented Latin text and Arabic text. But really this implementation should be using ICU, which is already a dependency or available from the platform on every supported platform, as far as I can tell.

@1ec5 1ec5 added the bug Something isn't working label Aug 16, 2024
@louwers
Copy link
Collaborator

louwers commented Aug 16, 2024

@1ec5 Thanks for the detailed bug report.

Do you happen to know the behavior of the length expression with MapLibre GL JS?

@1ec5
Copy link
Contributor Author

1ec5 commented Aug 16, 2024

Do you happen to know the behavior of the length expression with MapLibre GL JS?

Not as bad, but still suboptimal and very different than the native implementation: maplibre/maplibre-style-spec#778.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants