Add IANA-to-BCP47 and reverse mappings #4024

sffc · 2023-09-11T10:18:26Z

Depends on #4021
Depends on #4022
Depends on #4023

Replaces #3499

sffc · 2023-09-11T10:22:15Z

Design Question:

The data store I'm using for IANA-to-BCP47 is

pub struct IanaToBcp47MapV1<'data> {
    /// A map from IANA time zone identifiers to indexes of BCP-47 time zone identifiers.
    #[cfg_attr(feature = "serde", serde(borrow))]
    pub map: ZeroTrie<ZeroVec<'data, u8>>,
    /// A sorted list of BCP-47 time zone identifiers.
    #[cfg_attr(feature = "serde", serde(borrow))]
    // Note: this is 9739B as ZeroVec<TinyStr8> and 9335B as VarZeroVec<str>
    pub bcp47_ids: ZeroVec<'data, TimeZoneBcp47Id>,
}

For BCP47-to-IANA, I could simply do ZeroMap<TimeZoneBcp47Id, String>. However, I could save a bit of space (probably on the order of 1 kB) by instead storing a VarZeroVec<String> whose indices correspond to the indices of bcp47_ids in the other data key.

We have examples of multi-key dependencies, but I don't think they are the type that have invariants between them. In this case, the invariant would be that the lists correspond to each other (essentially the keys of the map are in one DataKey and the values of the map are in a different DataKey).
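
To make the index-correspondence idea concrete, here is a minimal sketch. It uses std collections rather than ZeroTrie, ZeroVec, or VarZeroVec, and the names ForwardKey, ReverseKey, iana_to_bcp47, bcp47_to_iana, and sample are hypothetical, not the API in this PR. The three sample IDs are only for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the IANA-to-BCP47 key (the real struct uses ZeroTrie/ZeroVec).
struct ForwardKey {
    /// Lowercased IANA name -> index into `bcp47_ids`.
    map: BTreeMap<String, usize>,
    /// Sorted list of BCP-47 time zone IDs.
    bcp47_ids: Vec<&'static str>,
}

/// Hypothetical stand-in for the BCP47-to-IANA key: values only, no keys.
struct ReverseKey {
    /// `canonical_iana[i]` belongs to `bcp47_ids[i]` in the forward key.
    canonical_iana: Vec<&'static str>,
}

fn iana_to_bcp47(fwd: &ForwardKey, iana: &str) -> Option<&'static str> {
    let idx = *fwd.map.get(&iana.to_ascii_lowercase())?;
    fwd.bcp47_ids.get(idx).copied()
}

fn bcp47_to_iana(fwd: &ForwardKey, rev: &ReverseKey, bcp47: &str) -> Option<&'static str> {
    // Binary search relies on `bcp47_ids` being sorted.
    let idx = fwd.bcp47_ids.binary_search(&bcp47).ok()?;
    rev.canonical_iana.get(idx).copied()
}

/// Small sample data set, for illustration only.
fn sample() -> (ForwardKey, ReverseKey) {
    let bcp47_ids = vec!["gblon", "uschi", "usnyc"];
    let canonical_iana = vec!["Europe/London", "America/Chicago", "America/New_York"];
    let map: BTreeMap<String, usize> = canonical_iana
        .iter()
        .enumerate()
        .map(|(i, name)| (name.to_ascii_lowercase(), i))
        .collect();
    (ForwardKey { map, bcp47_ids }, ReverseKey { canonical_iana })
}
```

Nothing in this sketch detects the case where the two lists come from different data builds; the reverse lookup would simply return whatever name sits at that index.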

Thoughts? @robertbastian @Manishearth

robertbastian · 2023-09-11T15:59:15Z

Seems brittle.

Manishearth · 2023-09-11T17:48:19Z

Not a huge fan of interkey dependencies where a broken dependency is not detectable (in this case, 99% of the time a mismatch in versions would lead to buggy data in an undetectable way).

Now, there is a way to solve that problem: we can store a hash of the index value with the map, and if they mismatch we throw an error. If we want to be really fancy we can even have the map store an Option<Indices> and you can configure datagen to emit a full index map when you know there is going to be a discrepancy in the two data keys.

This is in line with previous decisions that datagen configurability should not be used to add or omit data in a user-visible way but can be used for optimizations.

sffc · 2023-09-12T08:33:35Z

> we can store a hash of the index value with the map

Yeah, or how about this: we take the hash of the whole indices zerovec and store just that in both keys, like this:

struct Foo<'data> {
    bcp47_ids: ZeroVec<'data, TinyStr8>,
    bcp47_ids_hash: u32,
    // ... other data ...
}

struct Bar<'data> {
    bcp47_ids: Option<ZeroVec<'data, TinyStr8>>,
    bcp47_ids_hash: u32,
    // ... other data ...
}

The behavior would be:

If the bcp47_ids_hash are the same in both data structs, use Foo::bcp47_ids
If the bcp47_ids_hash are not the same, use Bar::bcp47_ids if it is Some
If the hashes are not the same and Bar::bcp47_ids is None, return an error from the constructor
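
The three rules above could be sketched roughly as follows. This is a minimal illustration with std types: Vec<String> stands in for ZeroVec<TinyStr8>, and resolve_ids and ResolveError are hypothetical names that do not appear in the PR.

```rust
/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq)]
enum ResolveError {
    ChecksumMismatch,
}

struct Foo {
    bcp47_ids: Vec<String>,
    bcp47_ids_hash: u32,
}

struct Bar {
    /// Only present when datagen was told to emit a full copy.
    bcp47_ids: Option<Vec<String>>,
    bcp47_ids_hash: u32,
}

/// Picks the ID list that `Bar`'s other data should be interpreted against.
fn resolve_ids<'a>(foo: &'a Foo, bar: &'a Bar) -> Result<&'a [String], ResolveError> {
    if foo.bcp47_ids_hash == bar.bcp47_ids_hash {
        // Rule 1: hashes agree, so the shared list in Foo is safe to use.
        return Ok(foo.bcp47_ids.as_slice());
    }
    match &bar.bcp47_ids {
        // Rule 2: hashes differ, fall back to Bar's own copy.
        Some(ids) => Ok(ids.as_slice()),
        // Rule 3: hashes differ and there is no fallback copy.
        None => Err(ResolveError::ChecksumMismatch),
    }
}
```

A caller would run resolve_ids once in the constructor and surface the error there, so a mismatch is caught before any lookup happens.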

Is this a great idea to save a kilobyte or two, or is it overengineering?

dpulls · 2023-09-12T10:30:59Z

🎉 All dependencies have been resolved !

sffc · 2023-09-13T07:14:46Z

Okay, here is what I ended up implementing.

I made a checksum for the ZeroVec that was duplicated between the two keys, but I did not wire up the option to duplicate the list itself in the second key. The list is always absent there, and if the checksum is inconsistent between the two keys, the constructor fails. I didn't want to fiddle with datagen options for a situation that is unlikely to happen in the real world. If we ever do encounter this, we can add a new key or a V2 of the current key.

The ZeroVec in question is about 3.5 KB which seems about the size where this type of thing could be justifiable.

The data in postcard is 9749B (primary direction) and 7569B (reverse direction) which I'm quite happy with. For comparison, the non-ZeroTrie version was 14475B (primary direction) and 11249B (reverse with the deduplicated ZeroVec which ZeroTrie enabled).

sffc · 2023-09-13T16:59:21Z

Are all the following statements true?

The only blocking issue for this PR is the hash choice
The only hash crates still in contention are twox_hash and crc32fast
We consider both twox_hash and crc32fast to be robust, widely used crates that contain algorithms suitable for a checksum hash

robertbastian · 2023-09-14T08:57:06Z

> The only blocking issue for this PR is the hash choice

✅

> The only hash crates still in contention are twox_hash and crc32fast

I also still think SipHasher is a possible way forward, as we do not need any cryptographic properties. The worst that can happen is GIGO.

> We consider both twox_hash and crc32fast to be robust, widely used crates that contain algorithms suitable for a checksum hash

✅

robertbastian · 2023-09-14T09:01:52Z

provider/datagen/src/transform/cldr/time_zones/names.rs

+ let mut hasher = twox_hash::XxHash64::with_seed(0);
+ for bcp47 in bcp47_ids.iter() {
+     hasher.write(bcp47.0.all_bytes());
+ }
+ let checksum2 = hasher.finish();


Uhm, this basically tests that the implementation of as_bytes is the same as all_bytes for each entry. While this is useful to test, it should be a test in zeroslice, not here.

This should test things like:

If the order changes the checksums are different

The checksum for "abc", "def" is different from the one of "abcd", "ef"

The checksum for "abc", "def", "" is different from the one for "abc", "def", ""

The checksum for the hardcoded list is equal to a hardcoded checksum (stability)
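
The properties listed above could be exercised with something like the following sketch. It assumes each ID is hashed as a fixed 8-byte, zero-padded field (similar in spirit to hashing all_bytes() of an 8-byte string), and it substitutes a hand-written 64-bit FNV-1a for XxHash64 so that the example has no dependencies. The function names fnv1a_64, padded, and checksum are hypothetical.

```rust
/// 64-bit FNV-1a, used only as a dependency-free stand-in for the hash under discussion.
fn fnv1a_64(bytes: impl IntoIterator<Item = u8>) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Pads an ID to 8 bytes with trailing zeros, mimicking a fixed-width string's full byte view.
fn padded(id: &str) -> [u8; 8] {
    assert!(id.len() <= 8, "ID longer than 8 bytes");
    let mut out = [0u8; 8];
    out[..id.len()].copy_from_slice(id.as_bytes());
    out
}

/// Checksum over the whole list, in order, 8 bytes per entry.
fn checksum(ids: &[&str]) -> u64 {
    fnv1a_64(ids.iter().flat_map(|id| padded(id)))
}

/// Returns true if the order, boundary, and length checks all hold.
fn checksum_properties_hold() -> bool {
    let base = checksum(&["abc", "def"]);
    let reordered = checksum(&["def", "abc"]);
    let shifted_boundary = checksum(&["abcd", "ef"]);
    let extra_empty = checksum(&["abc", "def", ""]);
    base != reordered && base != shifted_boundary && base != extra_empty
}
```

Because every entry contributes exactly 8 bytes, entry boundaries are unambiguous in the hashed stream. A stability test would additionally compare checksum(...) for a hardcoded list against a hardcoded expected value recorded from a known-good run.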

robertbastian · 2023-09-14T17:56:49Z

Discussion:

@zbraniecki and @robertbastian advocate for SipHasher
@sffc agrees if SipHasher is platform independent, will investigate

sffc · 2023-09-14T22:26:21Z

CC @Manishearth to weigh in on the hash choice (current decision is to use std::hash::SipHasher with an #[allow(deprecated)])

justingrant · 2023-09-14T22:47:18Z

provider/datagen/src/transform/cldr/time_zones/names.rs

+ fn load(&self, _: DataRequest) -> Result<DataResponse<Bcp47ToIanaMapV1Marker>, DataError> {
+     let resource: &cldr_serde::time_zones::bcp47_tzid::Resource =
+         self.cldr()?.bcp47().read_and_parse("timezone.json")?;
+     // Note: The BTreeMap retains the order of the aliases, which is important for establishing


Note that the next CLDR release will include a new iana attribute that, if present, overrides the alias order. See unicode-org/cldr#3105.

Does this PR handle that attribute?

No, CLDR 44 Alpha is not in scope for the 1.3 release, but this will be a priority for the 1.4 release. Filed #4044

sffc · 2023-09-14T23:06:28Z

@Manishearth said in #4030 (comment):

> (we should not use SipHash, it's deprecated and nobody on the libs team wants people to use it for stuff like this. Since it is deprecated it's really not something that I would consider "maintained", even if it is in the stdlib.)

sffc · 2023-09-14T23:23:56Z

Based on @Manishearth's comment, I pre-emptively pushed another commit reverting SipHash back to XxHash, pending additional feedback from @robertbastian or @zbraniecki

sffc · 2023-09-15T17:42:12Z

I think we should rule out crc32fast because it only generates a 32-bit hash, and it mainly exists because of compiler intrinsics. I think we should choose between XxHash and SipHash.

sffc · 2023-09-20T05:24:42Z

I'm merging this and leaving the normalization follow-up for #4031.

sffc force-pushed the new-iana branch 3 times, most recently from a290442 to 8ff4151, September 12, 2023 08:00

sffc mentioned this pull request Sep 12, 2023

Add icu_provider::fxhash_32 #4028

Closed

sffc added 19 commits September 13, 2023 00:09

Add data structs for time zone names, based on properties name structs

7176117

Add datagen code (not wired in yet)

2b1d2b1

Add public-facing APIs

af7b806

FFI for the IANA name mappers

4c15a57

Add to datagen registry

61c7a37

cargo make testdata

1f2b33b

Change data model for IANA-to-BCP47 to use ZeroTrie

eb90683

lockfile

7eb8e67

cargo make testdata

8e1aa22

Use new function name

57762fd

Initial work for new Bcp47ToIanaMapper

c4cc8da

Renames and TODOs

5c23615

Remove the list and retain only the checksum. Validate in constructor.

0d40da5

cargo make testdata

3444868

Add the two new keys to bakeddata

50667eb

Wire the baked data into the APIs

94393f2

Change to all case insensitive lookup

4b57cf9

Update diplomat bindings

88c0666

diplomat-gen

49c9783

sffc force-pushed the new-iana branch from c4faba3 to 49c9783, September 13, 2023 07:09

sffc requested a review from robertbastian September 13, 2023 16:52

sffc added 2 commits September 13, 2023 13:26

Add zerotrie to testdata deps

5c62ec8

generate-readmes

6bcd8ea

robertbastian reviewed Sep 14, 2023

sffc added 3 commits September 14, 2023 13:50

Switch XxHash to SipHash

057dd0e

More testing

dd54fe3

Delete the unused dependency

5fea5f3

sffc force-pushed the new-iana branch from fc9de8f to 5fea5f3, September 14, 2023 22:19

add zerotrie to depcheck

44af79f

sffc requested a review from robertbastian September 14, 2023 22:25

justingrant reviewed Sep 14, 2023

Switch SipHash to XxHash

b33aed9

robertbastian added the discuss Discuss at a future ICU4X-SC meeting label Sep 15, 2023

twox-hash deps

6b8690b

Merge branch 'main' into new-iana

01cd5f2

robertbastian removed the discuss Discuss at a future ICU4X-SC meeting label Sep 19, 2023

robertbastian previously approved these changes Sep 19, 2023

datagen

4a6cb1b

robertbastian dismissed their stale review via 4a6cb1b September 19, 2023 12:36

robertbastian approved these changes Sep 19, 2023

sffc merged commit 95350a4 into unicode-org:main Sep 20, 2023
26 checks passed

sffc deleted the new-iana branch September 20, 2023 05:24
