Unexpected behavior with a bullet point character #243

Open
ivankp opened this issue Mar 15, 2022 · 3 comments

Comments

@ivankp

ivankp commented Mar 15, 2022

Say, I have a string

    char8_t text[] = u8"• test\n - two\n ••    three\n-• four\n";

I would like to replace any run of consecutive blank characters, -, or • with a single space. I tried the following:

    char8_t* b = text + std::size(text)-1;
    for (char8_t* r = text;;) {
      auto m = ctre::search<u8R"([\s\-•]+)">(r,b);
      if (!m) break;
      char8_t* w = m.begin();
      r = m.end();
      if (r==b) {
        b = w;
        break;
      }
      *w++ = ' ';
      if (w!=r) {
        memmove(w,r,b-r);
        b -= r-w;
        r = w;
      }
    }
    *b = '\0';

    cout << ((char*)text) << endl;

But this results in

    • test two •• three • four

I'm including <ctre-unicode.hpp>.

Is this a bug or the intended behavior?

At first, I thought the problem might be with putting a • inside [], because maybe [] only accepts single-byte characters and escape sequences, but I get the same output with (?:[\s\-]|•)+ as with the original [\s\-•]+. And \P{L}+ results in what I assume is the removal of only some of the bytes comprising the • characters:

    � test two � � three � four

Here's a godbolt link.

@hanickadot
Owner

Currently with two iterators you can't trigger special utf8 iterators.

This is a workaround: https://godbolt.org/z/Kz5arc1qE

Not sure how to do it nicely, your other options are in wrapper.hpp lines 156-184

Keeping this open in case I find a better solution.
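The output reported above is consistent with the pattern being compared against individual UTF-8 code units (bytes) rather than decoded code points, which is what the comment about the two-iterator overload suggests. The following is a minimal sketch, independent of CTRE, that shows the mismatch: the bullet U+2022 is one code point but three bytes (E2 80 A2), so no single byte can ever compare equal to it. The helper name `count_code_points` is hypothetical and assumes valid UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of CTRE.
// Counts code points in a UTF-8 string by counting every byte that is
// not a continuation byte (continuation bytes look like 10xxxxxx).
// Assumes the input is valid UTF-8.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;  // lead byte or ASCII byte: starts a new code point
        }
    }
    return n;
}
```

For the string "\xE2\x80\xA2" (the bullet), `size()` is 3 while `count_code_points` returns 1. A matcher that walks the input one byte at a time sees 0xE2, 0x80, and 0xA2 separately and never sees the value 0x2022.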

@ivankp
Author

ivankp commented Mar 15, 2022

Thank you for the quick response!
So, ctre::search decides whether to treat the input as utf8 or bytes based on the type of the argument (right now only if it's a single argument, i.e. std::u8string_view vs. std::string_view).
May I suggest making this decision based on the type of the template parameter string, on a tag type passed as another template parameter, or by defining separate ctre::search and ctre::search_u8 functions? I think this would (1) avoid the ambiguity of whether the string is treated as Unicode or not, (2) make it more convenient to work with utf8 strings represented as regular old char*, and (3) avoid the back-and-forth casting. More on point (2): char8_t and u8string_view are very new, so most codebases don't return these types as is. Plus, correct me if I'm wrong, but unless one is working in some specific domain, wouldn't one expect strings to be encoded in utf8 by default? All I'm trying to suggest is that relying on the argument character type being char vs. char8_t seems more awkward than having ctre::search and ctre::search_u8.

@hanickadot
Owner

It's actually based on the type of the argument's iterator. You can always take a std::string_view and mark it as ctre::utf8_range.
The name of the "function" just names the algorithm; the type of the arguments determines the code-unit/code-point semantics. Adding a _u8 function would lead to adding _u16 and _u32 functions, which is not something I want to do.
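To make the code-unit versus code-point distinction concrete, here is a rough sketch of the substitution from the original post written against decoded code points, without CTRE. It is not the library's implementation and not the linked workaround. The names `decode_utf8`, `is_separator`, and `collapse_separators` are hypothetical, the input is assumed to be valid UTF-8, and only ASCII whitespace is treated as blank. As in the loop from the original post, a leading run becomes one space and a trailing run is dropped.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes the code point starting at s[i] and
// advances i past it. Assumes valid UTF-8 (no error handling).
char32_t decode_utf8(const std::string& s, std::size_t& i) {
    unsigned char c = static_cast<unsigned char>(s[i++]);
    if (c < 0x80) return c;  // single-byte (ASCII) code point
    // Number of continuation bytes implied by the lead byte.
    int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
    // Keep the payload bits of the lead byte: 5, 4, or 3 bits.
    char32_t cp = c & (0x3F >> extra);
    while (extra-- > 0 && i < s.size()) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    }
    return cp;
}

// Hypothetical predicate: ASCII whitespace, '-', or the bullet U+2022.
bool is_separator(char32_t cp) {
    return cp == U' ' || cp == U'\t' || cp == U'\n' || cp == U'\r' ||
           cp == U'\f' || cp == U'\v' || cp == U'-' || cp == 0x2022;
}

// Replaces each run of separators with a single space.
// A run at the very end of the input is dropped.
std::string collapse_separators(const std::string& in) {
    std::string out;
    bool pending = false;  // true while inside a run of separators
    for (std::size_t i = 0; i < in.size();) {
        std::size_t start = i;
        char32_t cp = decode_utf8(in, i);
        if (is_separator(cp)) {
            pending = true;
        } else {
            if (pending) out += ' ';
            pending = false;
            out.append(in, start, i - start);  // copy the original bytes
        }
    }
    return out;
}
```

Because the comparison happens after decoding, the three bytes of each bullet are consumed as a unit, so no stray lead byte is left behind in the output.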
