Unexpected behavior with a bullet point character `•` #243

ivankp · 2022-03-15T15:35:12Z

Say I have a string:

    char8_t text[] = u8"• test\n - two\n ••    three\n-• four\n";

I would like to substitute any number of consecutive blank characters, -, or • with just a single space. I tried the following:

    char8_t* b = text + std::size(text)-1;
    for (char8_t* r = text;;) {
      auto m = ctre::search<u8R"([\s\-•]+)">(r,b);
      if (!m) break;
      char8_t* w = m.begin();
      r = m.end();
      if (r==b) {
        b = w;
        break;
      }
      *w++ = ' ';
      if (w!=r) {
        memmove(w,r,b-r);
        b -= r-w;
        r = w;
      }
    }
    *b = '\0';

    cout << ((char*)text) << endl;

But this results in

    • test two •• three • four

I'm including <ctre-unicode.hpp>.

Is this a bug or the intended behavior?

At first, I thought the problem might be with putting a • inside [], since [] might only accept single-byte characters and escape sequences. But (?:[\s\-]|•)+ gives the same output as the original [\s\-•]+. And \P{L}+ appears to remove only some of the bytes that make up each • character:

    � test two � � three � four
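To make the "only some of the bytes" guess concrete: U+2022 BULLET is encoded in UTF-8 as the three bytes E2 80 A2, so a matcher that walks the input byte by byte sees three units where a code-point-aware matcher sees one. The following is a minimal sketch, not part of CTRE; `count_code_points` is a hypothetical helper, and plain `char` is used instead of `char8_t` only to keep the example short.

```cpp
#include <cstddef>
#include <string_view>

// U+2022 BULLET spelled out as its three UTF-8 bytes.
constexpr std::string_view bullet = "\xE2\x80\xA2";

// Hypothetical helper: count code points by skipping continuation
// bytes (bit pattern 10xxxxxx). Assumes well-formed UTF-8.
constexpr std::size_t count_code_points(std::string_view s) {
    std::size_t n = 0;
    for (char ch : s) {
        if ((static_cast<unsigned char>(ch) & 0xC0) != 0x80) ++n;
    }
    return n;
}

static_assert(bullet.size() == 3);             // three code units (bytes)
static_assert(count_code_points(bullet) == 1); // one code point
```

If a byte-wise pattern consumes only one or two of those three bytes, the remainder is no longer valid UTF-8, and a terminal typically renders it as �.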

Here's a godbolt link.

hanickadot · 2022-03-15T16:53:23Z

Currently, when you pass two iterators, you can't trigger the special UTF-8 iterators.

This is a workaround: https://godbolt.org/z/Kz5arc1qE

I'm not sure how to do it nicely; your other options are in wrapper.hpp, lines 156-184.

Keeping this open in case I find a better solution.
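For readers who want to see what code-point-aware handling changes, here is a rough, CTRE-free sketch of the substitution from the original post. It is not the workaround from the godbolt link above; `utf8_length`, `is_separator`, and `collapse_separators` are hypothetical names, the input is assumed to be well-formed UTF-8 held in plain `char`, and only ASCII whitespace is treated as blank. Like the loop in the question, it drops a trailing run instead of replacing it with a space.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Byte length of the UTF-8 sequence that starts with lead byte `c`.
// An unexpected byte is treated as a one-byte sequence.
inline std::size_t utf8_length(unsigned char c) {
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;
}

// True for ASCII whitespace, '-', or U+2022 BULLET (bytes E2 80 A2).
inline bool is_separator(std::string_view seq) {
    if (seq.size() == 1) {
        const char c = seq[0];
        return c == ' ' || c == '\t' || c == '\n' || c == '\r' ||
               c == '\f' || c == '\v' || c == '-';
    }
    return seq == "\xE2\x80\xA2";
}

// Replace each run of separators with one space, one code point at a time.
inline std::string collapse_separators(std::string_view in) {
    std::string out;
    out.reserve(in.size());
    bool in_run = false;
    for (std::size_t i = 0; i < in.size();) {
        std::size_t len = utf8_length(static_cast<unsigned char>(in[i]));
        if (len > in.size() - i) len = in.size() - i;  // truncated tail
        const std::string_view seq = in.substr(i, len);
        if (is_separator(seq)) {
            if (!in_run) out.push_back(' ');
            in_run = true;
        } else {
            out.append(seq.data(), seq.size());
            in_run = false;
        }
        i += len;
    }
    if (in_run) out.pop_back();  // drop the space added for a trailing run
    return out;
}
```

Because the loop always advances by a whole sequence, a • is either kept intact or removed entirely, never split.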

ivankp · 2022-03-15T17:43:22Z

Thank you for the quick response!
So, ctre::search decides whether to treat the input as UTF-8 or as bytes based on the type of the argument (right now only when it's a single argument, i.e. std::u8string_view vs. std::string_view).
May I suggest making this decision based on the type of the template parameter string, or on a tag type passed as another template parameter, or by defining separate ctre::search and ctre::search_u8 functions? I think this would (1) remove the ambiguity about whether the string is treated as Unicode, (2) make it more convenient to work with UTF-8 strings stored as regular old char*, and (3) avoid the back-and-forth casting.
More on point (2): char8_t and u8string_view are very new, so most codebases don't return these types as-is. Plus, correct me if I'm wrong, but unless one is working in some specific domain, wouldn't one expect strings to be encoded in UTF-8 by default? All I'm suggesting is that relying on the argument's character type being char vs. char8_t seems a bit more awkward than having ctre::search and ctre::search_u8.
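As an illustration of the tag-type variant of this suggestion, here is a small sketch of how a tag passed as a template parameter could select the unit of iteration. This is not CTRE's API; `bytes_tag`, `utf8_tag`, and `count_units_as` are hypothetical names, and counting units stands in for a real matching algorithm.

```cpp
#include <cstddef>
#include <string_view>
#include <type_traits>

// Hypothetical tags naming how the input should be interpreted.
struct bytes_tag {};
struct utf8_tag {};

// The tag, not the character type, decides what one "unit" is.
template <typename Encoding = bytes_tag>
std::size_t count_units_as(std::string_view s) {
    if constexpr (std::is_same_v<Encoding, utf8_tag>) {
        std::size_t n = 0;
        for (char ch : s) {
            // Skip UTF-8 continuation bytes (10xxxxxx).
            if ((static_cast<unsigned char>(ch) & 0xC0) != 0x80) ++n;
        }
        return n;
    } else {
        return s.size();  // every byte is its own unit
    }
}
```

With this shape, `count_units_as<utf8_tag>(s)` and `count_units_as<bytes_tag>(s)` accept the same `char`-based string and differ only in the tag, which is the property point (2) asks for.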

hanickadot · 2022-03-15T17:50:55Z

It's actually based on the type of the argument's iterator. You can always take a std::string_view and mark it as a ctre::utf8_range.
The name of the "function" only names the algorithm; the type of the arguments marks the semantics (code units vs. code points). Adding a _u8 function would lead to adding _u16 and _u32 functions as well, which is not something I want to do.
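The design described here, one algorithm name with the argument type carrying the semantics, can be sketched with a plain overload set. This is only an analogy and not CTRE's implementation; `utf8_marked` and `count_units` are hypothetical names standing in for a marker such as ctre::utf8_range and for an algorithm such as ctre::search.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical marker: wraps bytes the caller promises are UTF-8.
struct utf8_marked {
    std::string_view bytes;
};

// Unmarked input: every byte is one unit.
inline std::size_t count_units(std::string_view s) {
    return s.size();
}

// Marked input: same function name, but units are code points.
inline std::size_t count_units(utf8_marked s) {
    std::size_t n = 0;
    for (char ch : s.bytes) {
        // Skip UTF-8 continuation bytes (10xxxxxx).
        if ((static_cast<unsigned char>(ch) & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

A caller holding UTF-8 in a `std::string_view` wraps it at the call site, for example `count_units(utf8_marked{sv})`, so no per-encoding function names are needed.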
