Add support for large, multi-part config messages #11

jagerman · 2024-12-11T00:08:08Z

This implements multi-part config messages, which have been stubbed in for a while but still needed an implementation.

Such messages are split after post-compression, post-padding, but before encryption (i.e. each piece gets encrypted separately), and all but the last piece will be a maximum size message (76800 bytes).

The Config object itself keeps track of partially received messages, joining them up as they arrive and processing them once it has a full set (or else expiring them if they are too old). It also tracks (without storing) recently fully processed multipart messages so that they can be discarded without pointlessly storing them if they happen to arrive again for some reason.

All of this requires some (breaking) external API changes to implement:

merge(...) now returns an ordered_set of config messages that failed to merge rather than a vector; we couldn't keep track of the order anymore, because of the next point.
merge(...) now returns all the hashes that make up a multipart set when one gets merged, even if they weren't passed in in the current merge call. For instance if you have a three-chunk multipart config and merge parts 1 and 2, but not 3, then 1 and 2 will be preserved internally. When you later provide 3, then all of hash1, hash2, and hash3 get returned to indicate that all of those were processed as a part of this merge.
confirm_pushed() previously took a vector and implied the order matters (although we never previously returned more than one). It now takes an unordered_set to be consistent with merge() and current_hashes(), and because the order is irrelevant.
The C version of confirm_pushed (config_confirm_pushed) previously only accepted a single message hash + size (and would convert this into a one-element vector for confirm_pushed()). Its API changes quite considerably as it now has to pass an array of hashes and sizes and the size of those arrays. As with the C++ version, the order of elements here is unimportant.
The C++ current_hashes() previously only ever returned a 0- or 1-length vector; it now returns an unordered_set, and can easily return more than one (i.e. when the current config is a multipart config message).
Added a new active_hashes() method that returns the current hashes plus the hashes of any partially loaded multi-part configs for which we are awaiting remaining parts. When used for "do we want to keep this alive?" this is the more appropriate method to use compared to current_hashes().

Other multi-part related changes (which don't affect the API, but are worth noting):

multipart messages do not support (and must not be) protobuf wrapped, because multipart messages already will not be understood by clients not using this new code, and so anything supporting multipart should also have been fixed to not apply extra protobuf wrapping. We also are returning maximum size message parts, and the protobuf overhead will push it over the storage limit in a way that isn't trivial for libsession-util to determine and compensate for.

Other changes not directly related to multipart, but included as part of this:

Started converting some use of ustring_view to span<const unsigned char> (alias: ucspan), adding some helper functions and changing some functions to take spans instead. For the most part this conversion can happen piecemeal as, luckily, ustring_views are implicitly convertible to ucspan's, and so most places passing a ustring_view (or ustring) will not break. (The other way, i.e. ucspan to ustring_view, does not work, and so the conversion needs to happen from innermost functions out; this initial change is focussed on some of those inner functions).
fixed a bug where a duplicate config (multipart or otherwise) that gets merge()d would result in that config's hash ending up in both current hashes and old hashes, but it should not have gone into old hashes.
started converting some string concatenation and std::to_string usage to fmt and/or "..."_format instead now that we have those available (via oxen-logging).

include/session/config/base.hpp

tests/test_config_contacts.cpp

include/session/config.hpp

include/session/config/base.hpp

tests/CMakeLists.txt

Bilb · 2024-12-19T03:25:09Z

LGTM. I made a fix for the clang-16 build issues at #15

mpretty-cyro

So it looks like this PR doesn't play nicely with the libQuic changes due to the updated oxen-encoding version (oxenc::bt_list_consumer btlc(data); is no longer valid) - it seems we might need to wait until oxen-io/oxen-libquic#152 has been merged?

I worked around this for my testing by modifying btstream.cpp to do this instead:

auto val = oxenc::const_span<std::byte>{req.data(), req.size()};
oxenc::bt_list_consumer btlc(val);

but that resulted in Cannot create a bt_list_consumer with no data errors in the longs (and then I wasn't able to continue testing due to the config_push assertion but I assume I would have been prevented from testing properly due to this hack)

src/config/base.cpp

- Update oxen-encoding to latest dev for new span support in bt code. - bring in oxen-logging so that we can use fmt; for now this only for fmt in the test suite, but we will definitely start using it in the future in libsession-util (the libquic PR already does). - added an overload of the test suite "printable" functions that takes format+args. - replace some `"..."_hex` string concatenations that broke when oxenc switched to constexpr _hex no longer return strings. - Fixed some oxen-related test code that broke where it was touching oxenc::detail functions.

This implements multi-part config messages, which have been stubbed in for a while but still needed an implementation. Such messages are split after post-compression, post-padding, but before encryption (i.e. each piece gets encrypted separately), and all but the last piece will be a maximum size message (76800 bytes). The Config object itself keeps track of partially received messages, joining them up as they arrive and processing them once it has a full set (or else expiring them if they are too old). It also tracks (without storing) recently fully processed multipart messages so that they can be discarded without pointlessly storing them if they happen to arrive again for some reason. All of this requires some (breaking) external API changes to implement: - `merge(...)` now returns an ordered_set of config messages that failed to merge rather than a vector; we couldn't keep track of the order anymore, because of the next point. - `merge(...)` now returns *all* the hashes that make up a multipart set when one gets merged, even if they weren't passed in in the current merge call. For instance if you have a three-chunk multipart config and merge parts 1 and 2, but not 3, then 1 and 2 will be preserved internally. When you later provide 3, then all of hash1, hash2, and hash3 get returned to indicate that all of those were processed as a part of this merge. - confirm_pushed() previously took a vector and implied the order matters (although we never previously returned more than one). It now takes an unordered_set to be consistent with merge() and current_hashes(), and because the order is irrelevant. - The C version of confirm_pushed (config_confirm_pushed) previously only accepted a single message hash + size (and would convert this into a one-element vector for confirm_pushed()). Its API changes quite considerably as it now has to pass an array of hashes and sizes and the size of those arrays. As with the C++ version, the order of elements here is unimportant. - The C++ `current_hashes()` previously only ever returned a 0- or 1-length vector; it now returns an unordered_set, and can easily return more than one (i.e. when the current config is a multipart config message). - Added a new `active_hashes()` method that returns the current hashes *plus* the hashes of any partially loaded multi-part configs for which we are awaiting remaining parts. When used for "do we want to keep this alive?" this is the more appropriate method to use compared to `current_hashes()`. Other multi-part related changes (which don't affect the API, but are worth noting): - multipart messages *do not* support (and must not be) protobuf wrapped, because multipart messages *already* will not be understood by clients not using this new code, and so anything supporting multipart should also have been fixed to not apply extra protobuf wrapping. We also are returning maximum size message parts, and the protobuf overhead will push it over the storage limit in a way that isn't trivial for libsession-util to determine and compensate for. Other changes not directly related to multipart, but included as part of this: - Started converting some use of ustring_view to `span<const unsigned char>` (alias: `ucspan`), adding some helper functions and changing some functions to take spans instead. For the most part this conversion can happen piecemeal as, luckily, ustring_views are implicitly convertible to `ucspan`'s, and so most places passing a ustring_view (or ustring) will not break. (The other way, i.e. ucspan to ustring_view, does not work, and so the conversion needs to happen from innermost functions out; this initial change is focussed on some of those inner functions). - fixed a bug where a duplicate config (multipart or otherwise) that gets merge()d would result in that config's hash ending up in *both* current hashes and old hashes, but it should not have gone into old hashes. - started converting some string concatenation and std::to_string usage to fmt and/or "..."_format instead now that we have those available (via oxen-logging).

mpretty-cyro · 2025-01-14T04:06:22Z

src/config/base.cpp

-            _curr_hash = all_hashes[_config->unmerged_index()];
+            _curr_hashes.clear();
+            auto& hashes = all_hashes[_config->unmerged_index()];
+            _curr_hashes.insert(hashes.begin(), hashes.end());
        }
    } else {
        // the merging affect nothing (if it had seqno would have been incremented), so don't
        // pointlessly replace the inner config object.
        assert(new_conf->unmerged_index() == 0);


Just wondering whether this assert should be removed - just thinking that there are situations where a merge can occur with a config which is not going to result in a change

Eg. if you link a new device and, due to network conditions, it wasn't able to retrieve the existing configs then it would push up it's "new user" config, which would then be pulled down on your existing device, merged, and likely result in no changes right?

Actually I just realised that at the moment we consistently merge in the current config because when you push up a new config message we don't update the lastHash value (so we don't unintentionally miss a config that was pushed up by another device at the same time)

Which means on the poll after pushing the current config we will always receive the current config back and try to merge it 🙃

jagerman requested a review from mpretty-cyro December 11, 2024 00:08

mpretty-cyro reviewed Dec 12, 2024

View reviewed changes

include/session/config/base.hpp Show resolved Hide resolved

mpretty-cyro reviewed Dec 12, 2024

View reviewed changes

include/session/config/base.hpp Show resolved Hide resolved

mpretty-cyro reviewed Dec 12, 2024

View reviewed changes

include/session/config/base.hpp Show resolved Hide resolved

mpretty-cyro reviewed Dec 12, 2024

View reviewed changes

tests/test_config_contacts.cpp Outdated Show resolved Hide resolved

Bilb reviewed Dec 19, 2024

View reviewed changes

include/session/config.hpp Outdated Show resolved Hide resolved

include/session/config/base.hpp Outdated Show resolved Hide resolved

tests/CMakeLists.txt Show resolved Hide resolved

jagerman force-pushed the multipart branch from e9ff336 to 3edd072 Compare December 19, 2024 03:15

Bilb approved these changes Dec 19, 2024

View reviewed changes

mpretty-cyro requested changes Jan 6, 2025

View reviewed changes

src/config/base.cpp Outdated Show resolved Hide resolved

src/config/base.cpp Show resolved Hide resolved

jagerman added 2 commits January 7, 2025 14:10

jagerman force-pushed the multipart branch from 4b00e5e to 0979cde Compare January 7, 2025 18:10

mpretty-cyro reviewed Jan 14, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for large, multi-part config messages #11

Add support for large, multi-part config messages #11

jagerman commented Dec 11, 2024

Bilb commented Dec 19, 2024

mpretty-cyro left a comment

mpretty-cyro Jan 14, 2025

mpretty-cyro Jan 18, 2025

Add support for large, multi-part config messages #11

Are you sure you want to change the base?

Add support for large, multi-part config messages #11

Conversation

jagerman commented Dec 11, 2024

Bilb commented Dec 19, 2024

mpretty-cyro left a comment

Choose a reason for hiding this comment

mpretty-cyro Jan 14, 2025

Choose a reason for hiding this comment

mpretty-cyro Jan 18, 2025

Choose a reason for hiding this comment