Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prometheus: sanitize label value for text protocol #2400

Closed
wants to merge 3 commits into from

Conversation

pgellert
Copy link
Contributor

This PR fixes a bug in the prometheus text-based endpoint where a label value with a special character (backslash (), double-quote (") or line feed (\n)) makes the whole metrics output unparseable by prometheus.

The prometheus text protocol requires that the label value to be escaped. Note that this is only needed for the text protocol and not for the protobuf-based prometheus protocol.

label_value can be any sequence of UTF-8 characters, but the backslash (), double-quote ("), and line feed (\n) characters have to be escaped as \, ", and \n, respectively.

From https://github.com/prometheus/docs/blob/main/content/docs/instrumenting/exposition_formats.md

This also adds a new test (prometheus_http_test.cc) to validate the output of the text-based prometheus response. I've introduced this because it was simpler to make explicit checks on the text output than modifying prometheus_test.py and metrics_tester.cc to do that. I am open to give a go at working this into prometheus_test.py if you prefer that.

Finally, we switch to using fmt::print to format the metric value and the metric type instead of the originally used std::to_string. std::to_string is going to change its formatting behaviour in C++26 and so we prefer a more deterministic formatter here. The formatting matters because if the floating point numbers are converted to scientific notation, they can lose precision.

As an added benefit, using fmt::print writes these values to the output stream directly which means that we avoid a string copy when writing these values to the prometheus text-based response.

Copy link
Contributor

@tchaikov tchaikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just some nits, lgtm.

output.flush().get();
auto resp = input.read().get();
auto resp_str = std::string(resp.get(), resp.size());
BOOST_REQUIRE_NE(resp_str.find("200 OK"), std::string::npos);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could also use

using namespace std::literals;
// ...
BOOST_REQUIRE(std::ranges::search(resp, "200 OK"sv));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, done


future<> client = seastar::async([&lsi] {
connected_socket c_socket = lsi.connect(socket_address(ipv4_addr()), socket_address(ipv4_addr())).get();
input_stream<char> input(c_socket.input());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, could use

auto close_input = deferred_close(input);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


co_await seastar::async([] {
loopback_connection_factory lcf(1);
http_server server("test");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, could use

auto stop_server = deferred_stop(server);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't use this one because the stop() method of http_server is not noexcept so it doesn't satisfy the stoppable concept required for deferred_stop.

@tchaikov
Copy link
Contributor

@amnonh hey Amnon, would you please take a look?

@amnonh
Copy link
Contributor

amnonh commented Aug 24, 2024

@pgellert I like the idea and the optimization. It will be better to perform the label clean up on creation, so it can be done once. I would even convider throwing an exception in that case

Adds a test to verify that the metrics returned with the prometheus text
format are formatted as expected.
The prometheus text protocol requires that the label value to be
escaped. Note that this is only needed for the text protocol and not for
the protobuf-based prometheus protocol.

Without this sanitization, a single label with one of these special
characters makes the whole metrics output unparseable by prometheus.

> label_value can be any sequence of UTF-8 characters, but the backslash
> (\), double-quote ("), and line feed (\n) characters have to be
> escaped as \\, \", and \n, respectively.

From https://github.com/prometheus/docs/blob/main/content/docs/instrumenting/exposition_formats.md
@pgellert
Copy link
Contributor Author

Force-push: improve test code based on code review feedback

@pgellert
Copy link
Contributor Author

@tchaikov @amnonh thank you for the reviews.

It will be better to perform the label clean up on creation, so it can be done once. I would even convider throwing an exception in that case

This one is trickier to implement than it first seems. The main complication is that the escaping must apply to the label values only when serialized with the text protocol, but not when serialized with protobuf. I wanted to add a new field _escaped_value to label_instance but it's not that simple, the label instances associated with a metric are converted into a labels_type and it's difficult to trace which label representations are used at what points on the text-based serialization path and the protobuf-based serialization path. I am also not sure the tests would catch if I made a mistake during the refactoring necessary to make that change, so I am tempted to keep the change simpler and hold off with this optimization. Let me know what you think.

@pgellert pgellert requested a review from tchaikov August 27, 2024 16:59
@xemul xemul requested a review from amnonh August 30, 2024 12:20
@amnonh
Copy link
Contributor

amnonh commented Aug 30, 2024 via email

@mykaul
Copy link

mykaul commented Sep 1, 2024

I think we should do it for both - we do not want to be in the situation where one decides to switch from one to another and then be surprised by the result.

switch (v.type()) {
case seastar::metrics::impl::data_type::GAUGE:
case seastar::metrics::impl::data_type::REAL_COUNTER:
return std::to_string(v.d());
fmt::print(os, "{:.6f}", v.d());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think .6f is a good format. For small values, it loses precision. For large values, (beyond 10^(std::numeric_limit::digits10-6)), it has precision without accuracy.

I expect '{g}' will work and provide maximum accuracy, but haven't tested it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should changing the value's precision be part of a patch that's supposedly about the labels?

I had in the past some bad experience (scylladb/scylladb#19081 (comment)) with how Seastar's metrics convert double-precision to 6 digits of precision in obscure places, causing surprising effects. We need to be very careful about this. Note that using "to_string()" is even worse, because nobody knows what it really does... If I remember correctly (but I can't find a reference now), we even had a case where in the past some to-string function used to print 15 digits of precision but one day when we changed the way we use fmt suddenly it because 6 digits.

Ideally, we'd want to have regression tests for these precision issues, so we won't be surprised if one day the precision of some metric format regresses.

P.S. @avikivity , actually the default precision of printf's %g is 6 digits, similar to %f - the "g" is about scientific notation for large exponents - not about precision. But I don't know if fmt inherited printf's idea of what "g" is or invented its own - it's worth checking.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This patch is already changing the precision, by changing to_string() to fmt::print("{:.6f").

If plain {g} doesn't work, then {:19g} (or whatever) can be substituted. Or even start with {} to maintain the status quo and improve it later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@avikivity this is exactly my question - why does a patch for "sanitize label value for text protocol" need to change the formatting of numbers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This patch is already changing the precision

My intention with this PR is to keep the existing formatting of floating point numbers, keeping this PR purely a refactoring when it comes to the formatting of the metric values. The goal is to avoid the coming changes to std::to_string. std::to_string effectively used {:.6f} for the formatting of doubles and this is what I intend to preserve with this change.

From https://en.cppreference.com/w/cpp/string/basic_string/to_string:

7,8) Converts a floating point value to a string as if by std::sprintf(buf, "%f", value). (until C++26)

Then from https://en.cppreference.com/w/cpp/io/c/fprintf:

f/F: Converts floating-point number to the decimal notation in the style [-]ddd.ddd.
Precision specifies the exact number of digits to appear after the decimal point character. The default precision is 6. In the alternative implementation decimal point character is written even if no digits follow it. For infinity and not-a-number conversion style see notes.

I'm happy to move this commit to another PR if you prefer.

I think keeping the metric formatting as the original std::to_string-like format ({:.6f}) would be best. An important edge case for which {:.6f} works but {g} or {} would be problematic is gauges that have large integer values (eg. a gauge with a unix timestamp as its value). Changing the formatting to scientific notation could make these integer values less accurate which would be confusing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This patch is already changing the precision

My intention with this PR is to keep the existing formatting of floating point numbers, keeping this PR purely a refactoring when it comes to the formatting of the metric values. The goal is to avoid the coming changes to std::to_string. std::to_string effectively used {:.6f} for the formatting of doubles and this is what I intend to preserve with this change.

Absolutely the correct approach.

From https://en.cppreference.com/w/cpp/string/basic_string/to_string:

7,8) Converts a floating point value to a string as if by std::sprintf(buf, "%f", value). (until C++26)

Then from https://en.cppreference.com/w/cpp/io/c/fprintf:

f/F: Converts floating-point number to the decimal notation in the style [-]ddd.ddd.
Precision specifies the exact number of digits to appear after the decimal point character. The default precision is 6. In the alternative implementation decimal point character is written even if no digits follow it. For infinity and not-a-number conversion style see notes.

I'm happy to move this commit to another PR if you prefer.

I think keeping the metric formatting as the original std::to_string-like format ({:.6f}) would be best. An important edge case for which {:.6f} works but {g} or {} would be problematic is gauges that have large integer values (eg. a gauge with a unix timestamp as its value). Changing the formatting to scientific notation could make these integer values less accurate which would be confusing.

Let's plan a follow-up to improve floating-point formatting. .6f is pretty bad. It's okay for a gauge, but as soon as you take a derivative of a small slowly-changing metric, you get poor results.

case seastar::metrics::impl::data_type::HISTOGRAM:
case seastar::metrics::impl::data_type::SUMMARY:
break;
}
return ""; // we should never get here but it makes the compiler happy
return os; // we should never get here but it makes the compiler happy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment is now incorrect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@pgellert
Copy link
Contributor Author

pgellert commented Sep 2, 2024

@amnonh @mykaul:

I disagree, protobuf and text should be the same, I would seriously concider throwing on non standard labels
Protobuf and text should be the same

Could you please elaborate in what sense they should be the same? Just to clarify, if a user of the library intends to create a label value with the string X, prometheus expects this label value to be serialised as escaped(X) on the text protocol whereas it expects it to be serialized as the string field set to X for protobuf. These are the rules of the two independent serialisation protocols. If we set the protobuf field to escaped(X), prometheus would think that the user set this field to escaped(X) and not X. So to show the user-intended label value on the prometheus UI and other metrics frontends the label value needs to be correctly serialized according to the protocol, escaped on the text output but not escaped on the protobuf output.

Prometheus supports label values with special characters so I don't see why seastar should restrict the range of label values that it supports.

This avoids a string copy when writing the metric value and the metric
type to the prometheus text-based response.

Note that now `fmt::print` is used to format the values instead of the
originally used `std::to_string`. This is because `std::to_string` is
going to change its formatting behaviour in C++26, so we prefer to use a
more deterministic formatter here.

The formatting of the numbers matters because if the floating point
numbers are converted to scientific notation, they can lose precision.

Ref: https://en.cppreference.com/w/cpp/string/basic_string/to_string
@pgellert
Copy link
Contributor Author

pgellert commented Sep 2, 2024

Force-push: remove outdated comment

@mykaul
Copy link

mykaul commented Sep 3, 2024

@amnonh @mykaul:

I disagree, protobuf and text should be the same, I would seriously concider throwing on non standard labels
Protobuf and text should be the same

Could you please elaborate in what sense they should be the same? Just to clarify, if a user of the library intends to create a label value with the string X, prometheus expects this label value to be serialised as escaped(X) on the text protocol whereas it expects it to be serialized as the string field set to X for protobuf. These are the rules of the two independent serialisation protocols. If we set the protobuf field to escaped(X), prometheus would think that the user set this field to escaped(X) and not X. So to show the user-intended label value on the prometheus UI and other metrics frontends the label value needs to be correctly serialized according to the protocol, escaped on the text output but not escaped on the protobuf output.

Prometheus supports label values with special characters so I don't see why seastar should restrict the range of label values that it supports.

Since users are free to switch between the protocols, we do not want the end user experience to be different when changing protocol. Unless I misunderstand the end result - would the user see the same label value eventually (say, in a query in Grafana) when switching from native to text?

I'll also add that I think protobuf is regretfully still not GA (I think?).

@pgellert
Copy link
Contributor Author

pgellert commented Sep 3, 2024

would the user see the same label value eventually (say, in a query in Grafana) when switching from native to text?

Yes, they would see the same label value with this PR when they switch between native and text. Without this PR, they would see different results or be unable to switch from protobuf to text if they had label values with special characters (backslash \, double-quote ", or line feed \n).

@amnonh
Copy link
Contributor

amnonh commented Sep 4, 2024

@amnonh @mykaul:

Prometheus supports label values with special characters so I don't see why seastar should restrict the range of label values that it supports.

Label and their values are static on creation, now, we'll need to call escape each time we send them, it's a shame. Was there an actual demand for special characters?

@pgellert
Copy link
Contributor Author

pgellert commented Sep 5, 2024

I've changed the label escaping now to only happen once, on creation. Let me know if the changes look good, and if the changes in the header files are within the source compatibility requirements of seastar.

Was there an actual demand for special characters?

Yes, in redpanda we have labels that could contain these special characters (eg. consumer group names).

@@ -41,7 +41,7 @@ namespace seastar {
namespace metrics {
namespace impl {

using labels_type = std::map<sstring, sstring>;
using labels_type = std::map<sstring, label_value>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that's exposed as an API and will break scylladb unit tests that verify metrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I attempted to avoid changing this in place now: #2400 (comment)

@pgellert
Copy link
Contributor Author

pgellert commented Sep 10, 2024

I've pushed some more changes to try to keep the type alias labels_type around in its original form.

Note that there are still API-incompatible changes in the last commit, namely:

  • The public field metric_definition_impl::labels changes its type. I intentionally didn't duplicate this field because it is public and mutable. I would rather have users fail to compile their code than silently break their code if the user only modifies metric_definition_impl::labels but doesn't modify a new field metric_definition_impl::full_labels.
  • The same reasoning applies for metric_id::labels() which exposes metric_id::_labels and allows its mutation externally.

@amnonh @avikivity could you please re-review and advise on whether the optimization in the last commit is worth doing or if I should drop it since the blast radius of the change is quite large and there are still API-breaking changes in it? I would like to get this PR in if possible.

@avikivity
Copy link
Member

@pgellert let's split the last patch off (I haven't reviewed it yet). The fixes are more important and can be queued immediately, we can consider the optimization after that.

@pgellert
Copy link
Contributor Author

Force-push: dropped the last commit based on @avikivity's feedback above

@avikivity avikivity closed this in 20cf440 Sep 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants