Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement grapheme clusters #16916

Merged
merged 19 commits into from
Jun 26, 2024
Merged

Implement grapheme clusters #16916

merged 19 commits into from
Jun 26, 2024

Conversation

lhecker
Copy link
Member

@lhecker lhecker commented Mar 21, 2024

First, this adds GraphemeTableGen which

  • parses ucd.nounihan.grouped.xml
  • computes the cluster break property for each codepoint
  • computes the East Asian Width property for each codepoint
  • compresses everything into a 4-stage trie
  • computes a LUT of cluster break rules between 2 codepoints
  • and serializes everything to C++ tables and helper functions

Next, this adds GraphemeTestTableGen which

  • parses GraphemeBreakTest.txt
  • splits each test into graphemes and break opportunities
  • and serializes everything to a C++ table for use as unit tests

CodepointWidthDetector.cpp was rewritten from scratch to

  • use an iterator struct (GraphemeState) to maintain state
  • accumulate codepoints until a break opportunity arises
  • accumulate the total width of a grapheme
  • support 3 different measurement modes: Grapheme clusters,
    wcswidth-style, and a mode identical to the old conhost

With this in place the following changes were made:

  • ROW::WriteHelper::_replaceTextUnicode now uses the new
    grapheme cluster text iterators
  • The same function was modified to join new text with existing
    contents of the current cell if they join to form a cluster
  • Otherwise, a ton of places were modified to funnel the selection
    of the measurement mode over from WT's settings to ConPTY

This is part of #1472

Validation Steps Performed

@lhecker lhecker added Issue-Feature Complex enough to require an in depth planning process and actual budgeted, scheduled work. Product-Conhost For issues in the Console codebase Area-Rendering Text rendering, emoji, complex glyph & font-fallback issues Product-Terminal The new Windows Terminal. labels Mar 21, 2024
@lhecker lhecker force-pushed the dev/lhecker/8000-graphemes branch from 5cb95f7 to de50835 Compare March 21, 2024 17:02
@lhecker lhecker force-pushed the dev/lhecker/8000-graphemes branch from de50835 to 62b2cdb Compare March 21, 2024 17:02
Comment on lines 18 to 19
fmt.Println(err)
os.Exit(1)
Copy link
Member

@DHowett DHowett Mar 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
fmt.Println(err)
os.Exit(1)
log.Fatalln(err)

src/types/CodepointWidthDetector.cpp Outdated Show resolved Hide resolved
{
char32_t codepoint = 0;
const auto l = lead & 15;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: personally i prefer masking constants to be hex, but if you prefer decimal that is OK too

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case I'm using 15 because that way I find it easier to see that it makes indexing into a [16] element large array safe.

src/types/CodepointWidthDetector_gen.go Outdated Show resolved Hide resolved
@lhecker
Copy link
Member Author

lhecker commented Mar 21, 2024

@j4james If you ever get a chance to test this branch, I'd love to hear your thoughts on it! 🙂

In particular, Dustin and me have been talking about whether it's necessary for us to allow users to disable grapheme support. It would definitely be beneficial if it were (restores compatibility with conhost v2) but also disadvantageous (limits our freedom in designing the text storage).

@j4james
Copy link
Collaborator

j4james commented Mar 22, 2024

First off, I have to say this is very cool. However, I do think it needs to be disabled by default, because the majority of Linux apps will expect terminals to be compatible with wcwidth, and this is not.

As a simple example, trying pasting something like this into a WSL bash shell and then arrow back to the start of the line:

🏳‍🌈🏳‍🌈🏳‍🌈

The first thing you'll notice is an extra flag getting rendered at the end of the line, and then when you arrow back, the cursor ends up part way through the prompt.

image

Those flags consist of a flag emoji, a zero width joiner, and a rainbow emoji. The flag has a width of 1, ZWJ a width of 0, and the rainbow a width of 2, so most apps will expect the combination to be 3. But since we're treating it as a width of 2, we're out of sync with what bash is expecting, and that's how things end up breaking.

The good news, though, is that there's a mode we can support (?2027), that lets applications decide whether they want the old wcwidth behavior, or they're smart enough to handle all the fancy Unicode stuff that wcwidth doesn't cover. This is something that was originally developed by Contour, but a number of other terminals have adopted it, and I'm sure I came across at least one application that was already using it.

I don't know the details well enough to know whether your implementation is compatible with what is required for mode ?2027, but if it is, I would encourage you to consider supporting it. It's still marked as being in active development, so if there's something you disagree with, I'm sure Christian will be happy to discuss it. The spec is tracked in the repo here: https://github.com/contour-terminal/terminal-unicode-core

@j4james
Copy link
Collaborator

j4james commented Mar 22, 2024

One other thing to bear in mind is how this will work with third-party terminals. Assumedly for conpty to work correctly, the conhost width calculations and the terminal width calculations need to be in sync. So there needs to be a way for the terminal to tell conhost what algorithm it is using, or possibly what algorithms it can support (if, like Contour, the mode is switchable).

Even ignoring the third party terminals, this would assumedly be necessary if you want to make it a user-controllable option in the Windows Terminal settings. I guess this is another use case for the conpty ioctl thing.

@lhecker
Copy link
Member Author

lhecker commented Mar 22, 2024

First off, I have to say this is very cool.

I'm super happy to hear you feel this way! 😌

I apologize for the very long text below. I tried to delete as much as I could, but... oh well:


However, I do think it needs to be disabled by default, because the majority of Linux apps will expect terminals to be compatible with wcwidth, and this is not.

I don't wish to do this, but I am open to changing my mind. In short, I believe that the default mode of operation should be "forward looking" / progressive, with the optimistic mindset that we'll have more decades of terminal usage in front of us than behind us.

I'm otherwise afraid that it puts less pressure on the general terminal landscape to be progressive. As far as I know the other modern terminals that support graphemes have the same stance.
I've recently triaged a couple PSReadLine issues for instance which made me realize it doesn't even support surrogate pairs to begin with, so the wcwidth situation there is even worse. Previously we've supported this implicitly because we were lenient on our own Unicode support as well, but if we now make this feature disabled by default, we turn it into an explicit support for leniency. I think it'd be better if everyone was nudged into supporting modern Unicode instead, ourselves included.


As a simple example, trying pasting something like this into a WSL bash shell and then arrow back to the start of the line:

🏳‍🌈🏳‍🌈🏳‍🌈

(FYI your flag emoji is missing U+FE0F which is why it doesn't properly render. This works: 🏳️‍🌈🏳️‍🌈🏳️‍🌈.)
((FFYYII This is a neat tool I found: https://unicode-explorer.com/tools/decode/))

I've tested this in Contour, WezTerm, kitty, foot, and Konsole and in all of them this failed to work in bash out of the box. I'm not sure whether there's something I need to configure to make it work in any of those terminals, but I believe many of them don't make this configurable. However, I believe this also somewhat validates my position on the choice of the default behavior above.

In my personal opinion this is fine though, because most text is in the BMP and in its NFC form (= no combining marks). With such inputs, bash and other wcwidth based applications will continue to work as before.


The good news, though, is that there's a mode we can support (?2027), that lets applications decide whether they want the old wcwidth behavior, or they're smart enough to handle all the fancy Unicode stuff that wcwidth doesn't cover. This is something that was originally developed by Contour, but a number of other terminals have adopted it, and I'm sure I came across at least one application that was already using it.

Yes, I'm definitely intending to implement ?2027.

However, I don't believe Contour allows grapheme support to be turned off via this sequence. Contour calls it DECMode::Unicode and the setter is not implemented. WezTerm calls it DecPrivateModeCode::GraphemeClustering and they leave it permanently enabled as well. It is possible though that I've missed something for the others.

Edit: I found a support table! https://mitchellh.com/writing/grapheme-clusters-in-terminals#terminal-comparison


One other thing to bear in mind is how this will work with third-party terminals. Assumedly for conpty to work correctly, the conhost width calculations and the terminal width calculations need to be in sync. So there needs to be a way for the terminal to tell conhost what algorithm it is using, or possibly what algorithms it can support (if, like Contour, the mode is switchable).

FWIW this is why I strongly believe that a conhost.lib is the right path forward. Even with an ioctl thingy, we'd not be able to ship every variant that people would need. fish-shell's widecharwidth for instance assigns unassigned and noncharacter codepoints a width of 0, but many other implementations, including the wcwidth implementations I'm aware of, as well as Windows Terminal so far, assign them a width corresponding to the EastAsian attribute. So even if we had a toggle to disable grapheme support, it would still have broken edge cases. With a in-process .lib it would at least work consistently correct according to the hosting terminal, as long as one doesn't use SSH (which requires something like the current ConPTY architecture).


All of the above aside, I'm not against adding such a toggle per se. I'm just worried it'll either:

  • not provide a significant reduction in edge cases despite increase in complexity
  • block us from making architectural improvements because it locks us into 1 specific way of representing text

If you have any arguments in favor of adding a toggle, or know about any applications that this change would significantly regress, I'd be 110% open to add it. In fact the mere circumstance that you argued in favor it already made me heavily consider it. 😅 I hope the above gives a few good arguments though, as to why we may want to consider not having such a toggle.

@lhecker
Copy link
Member Author

lhecker commented Mar 22, 2024

Out of curiosity I've tried to implement a wcwidth mode that is faithful to how my local WSL bash interprets it, and in a way that works with "🏳️‍🌈". I can't figure it out.
It definitely has zero-width character support because the old Windows Terminal approach clearly doesn't work. I then tried to split up graphemes whenever we encounter a non-zero-width character after we've already encountered one before (= split 🏳️‍🌈 into 🏳 and 🌈) and that almost works, but if I paste a lot of 🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈... in the prompt, backspacing breaks it again.

If you have any tips on how to implement a wcwidth-like mode, I'll try to do so.

@j4james
Copy link
Collaborator

j4james commented Mar 23, 2024

I'm otherwise afraid that it puts less pressure on the general terminal landscape to be progressive.

Are you hoping that making bash not work properly in Windows Terminal will force them to change, and instead break in terminals like XTerm and Gnome Terminal? Because that seems very unlikely. And I can understand the "little guys" like Contour or WezTerm choosing to do their own thing - nobody is going to give a damn - but if Microsoft tries to use its "monopoly" position to force through breaking changes in Linux, that's probably not going to go down well.

Now if XTerm and VTE were in on this too, I think it would be fine. But I can't imagine XTerm breaking decades of backwards compatibility, and recent discussions on the VTE issue tracker gave me the impression that they weren't keen on breaking things either (which surprised me actually). It's possible their views might change at some point, but until there is broader consensus amongst the main terminals, my recommendation would be to avoid deliberately breaking things.

And just to be clear, I personally don't care about the defaults either way. I just don't want Windows Terminal to be seen as another one of Microsoft's EEE attempts.

Yes, I'm definitely intending to implement ?2027. However, I don't believe Contour allows grapheme support to be turned off via this sequence.

I haven't been following this development very closely, but I suspect that is just because they haven't fully implemented the spec yet. The commit message "Starts adapting Terminal Unicode Core's DEC mode 2027" suggests it's still a work in progress. Certainly the spec is very clear about backwards compatibility: "Everything is disabled by default, so legacy apps don’t break more than they used to break already."

That was the whole point of introducing a mode for this in the first place. If you don't care about backwards compatibility, there's really no need for this mode.

If you have any arguments in favor of adding a toggle

Well if you decide not be backwards compatible by default, then when you get a bug report about something like bash breaking, you can at least tell people there's an option they can toggle to fix it, rather than saying you're deliberately breaking apps to force them to change.

If you have any tips on how to implement a wcwidth-like mode, I'll try to do so.

I thought it would just be a matter of dropping any zero width characters, and treating the other characters as either single or double width, as specified in EastAsianWidth. So in the rainbow flag case, any ZWJ and VS16 characters would be ignored, the flag would occupy one cell, and the rainbow would occupy two.

Looking a the script output in bash, when it redraws after you've pasted six flags, you can see it output 18 BS characters to try and position the cursor at the end of the prompt, and then output the flags again without reverse video. This is obviously assuming a flag is three cells wide (hence the 6*3 BS), so for us it's redrawn in the wrong place.

Then as you backspace, you can see it send BS BS EL for the rainbow, and BS EL for the flag, which is again exactly what you'd expect assuming a two cell width for the rainbow, and a one cell width for the flag.

Btw if you want to test with a wcwidth-compatible terminal on Windows, this works as expected in mintty (assuming you're using the wsltty version).

@j4james
Copy link
Collaborator

j4james commented Mar 24, 2024

This is going to seem like a stupid edge case, but I think it might be an indication that something is not quite right in the architecture. Try this in a WSL shell:

printf "\U01f3f3\ufe0f\e7 \U01f308 DON'T MOVE"; sleep 1; printf "\e8\u200d\n"

Initially it displays a flag and a rainbow with a space between them, but after a second they join together. This shrinks the combination down to two cells, but the "DON'T MOVE" text remains where it started.

Now if you try the same thing without the sleep, the "DON'T MOVE" text appears three columns to the left. I'm assuming that's because the join happens before the conpty renderer has had a chance to refresh.

That seems wrong to me. It should not be possible for text to end up in a different position just based on timing. And I think the underlying issue is that it shouldn't be possible to "join" two characters which have been output separately, just by inserting a ZWJ between them.

@lhecker
Copy link
Member Author

lhecker commented Mar 24, 2024

I haven't fully debugged it yet, but I believe this shows two highly related issues:

  • The text renderer joins all cells to a single string and then passes that to DirectWrite. This causes it to draw 🏳️‍🌈 even though the text buffer contains 🏳️🌈 in separate cells. This can be easily observed in OpenConsole with AtlasEngine, since the ConPTY issue doesn't occur there. I'm not 100% sure how urgent fixing this issue is though.
  • ConPTY does the same thing, but it causes Windows Terminal to never see the cursor movement that happened in between the 2 writes in the first place. That's a lot more problematic.

We can solve both issues in two ways:

  • Assemble rows to strings. Then iterate over graphemes and check whether 1 grapheme consists of >1 cell. If we encounter this, then we know that we need to insert a control character, escape sequence, or similar, to break up the grapheme.
  • After we encounter a cursor movement or similar, and if the first write starts with a combining character of any kind, we insert a single 0x20 space first.

Personally speaking, I highly prefer the latter as it makes the "what you see" when we render on the screen consistent with "what you get" when you copy the text to your clipboard. It also simplifies our implementation, because the former option requires us to implement the workaround both, in ConPTY and in our various text renderers, whereas the latter is a simple if condition in Replace() / Insert() with a couple lines of code.
Dustin has previously mentioned that this is not a good idea, because it means that after writing such text to the console buffer you can't read the exact same text back. Also, it sort of lies to the user what the terminal application actually wrote. I agree in both accounts and so I don't particularly like my preference for the latter option either. I do still prefer it because I think the architectural simplifications are worth it.

Thoughts?

@j4james
Copy link
Collaborator

j4james commented Mar 24, 2024

The text renderer joins all cells to a single string and then passes that to DirectWrite. This causes it to draw 🏳️‍🌈 even though the text buffer contains 🏳️🌈 in separate cells. I'm not 100% sure how urgent fixing this issue is though.

That makes sense. And I don't think that's urgent to fix at this stage. The important thing is getting the width calculations correct first. The rendering issues can be dealt with later.

After we encounter a cursor movement or similar, and if the first write starts with a combining character of any kind, we insert a single 0x20 space first.

I think this would be my preference. My expectation was that these combining characters would be interpreted at the time the stream of text was received, rather than when rendering the buffer, and this seems like a neat way to accomplish that.

Dustin has previously mentioned that this is not a good idea, because it means that after writing such text to the console buffer you can't read the exact same text back.

I don't think that's a big deal though. Because the minute you've moved the cursor, you're effectively in an edit mode - you're now changing what's in the buffer. You can't expect to read back text that've you overwritten. It's like writing a\x08b and expecting the buffer to somehow contain both a and b.

@lhecker

This comment was marked as off-topic.

@lhecker lhecker marked this pull request as draft March 24, 2024 16:41
@j4james
Copy link
Collaborator

j4james commented Mar 24, 2024

Right now your test with 🏳️‍🌈 doesn't work either and it hasn't forced them to change to match our behavior.

The difference is that the current behavior is viewed as a bug. When someone complains about the widths being off somewhere, we point them to the ZWJ issue and say we're still working on it.

I believe my comment about progressiveness shouldn't have conveyed that I'm trying to "force" others unless read with an "us vs them" angle. That's certainly not what I had in mind when I wrote it. It's quite saddening to imagine that it may be read that way. Does this mean I'm not part of "the team"?

I apologize if I've misinterpreted your wording, and if there are teams, I am definitely on your side. I love what you're doing here, but I want to avoid giving ammunition to the groups that are looking for an excuse to attack MS.

To be clear, and as I've said in a previous comment, I don't mind making ?2027l a switch that puts the terminal into a wcwidth-compatible mode and users could just put that into their .bashrc.

If we are going to have this functionality enabled by default, then I think a settings toggle would be preferable to mode ?2027, just because it's more user-friendly. And I'm not sure it's even worth bothering with the mode if everyone is just going to default it on - that seems pointless to me

That said, if we don't yet have an easy way to implement a toggle, then a mode is definitely better than nothing. I just want to make sure we have something we can tell people if anyone does complain.

And again I must apologise if my earlier message has caused you a lot of stress. That's precisely what I was trying to avoid, and I just did a very poor job of it.

@j4james
Copy link
Collaborator

j4james commented Mar 24, 2024

And you know what, if the wcwidth-compatible algorithm is not a simple fix, maybe we should just leave it for a followup issue. If nobody complains, we can ignore it indefinitely. And if anyone does complain, we can at least say it's a known issue, and gather some real-world test cases before trying to address it.

I'd hate for you to give up on this PR just because I'm being paranoid about hypothetical complaints.

@o-sdn-o
Copy link

o-sdn-o commented Mar 25, 2024

I was happy when I saw that you began implementing text segmentation into grapheme clusters. I think this is the right step in improving the console subsystem, despite the fact that this may go against the existing code base and de facto “standards” in the terminal world, up to changing the terminal architecture (internal text buffer) and rewriting a large amount of functionality literally from scratch.

A console application may have its own interpretation of the Unicode standard, burdened by national peculiarities that the terminal cannot predict in any way, and it is necessary to somehow allow the console application to set the boundaries of grapheme clusters itself. If the application does not specify the boundaries of grapheme clusters and does not even make hints about this, then the terminal has the right to set the boundaries of the clusters as it pleases, either codepoint-wise or grapheme-wise.

The need for this becomes obvious in any attempt to create a more or less competitive console text editor or a panel file manager like TotalCommander, in which there is no point in rendering text in codepoint, since there it is necessary to display grapheme clusters in the same way as GUI editors/file managers display them . The point here is not only in the plain text of the edited document, which can still be somehow perceived in a codepoint manner, but in the file names that the user sees in GUI mode exclusively in grapheme representation. This applies to both the Unix world and the Windows world.

The fact that in the Unix world in terminals, text is displayed in codepoint, in my opinion this is a historical flaw associated with technical limitations, and sooner or later this flaw will be fixed. And it would be an even bigger mistake to rely on this codepoint behavior when developing a terminal emulator or console application.

Edit:
Due to the fact that GUI terminals do not produce grapheme-based output (even without support for grapheme halves), I have to write my own GUI frontend for my console applications to get around this. Perhaps this is my personal wish, and no one should rely on it when making decisions about the architecture of global systems that affect many people. Perhaps this is generally talking about some kind of dichotomous subsystem, which is neither a GUI nor a TUI, which should go its own way with grapheme clusters and support for keyboards with national layouts.

@DHowett
Copy link
Member

DHowett commented Mar 25, 2024

apparebit/demicode works fantastic ✅

/cc @apparebit 😄

@lhecker
Copy link
Member Author

lhecker commented Mar 25, 2024

image

There are 3 errors:

  • 2 times we fail to make a zero width character zero width. Unfortunately our architecture doesn't support storing such characters independently and a width of 1 as a minimum is enforced.
  • When encountering a VS16 we make any character wide, because this greatly simplifies the width measurement. I don't expect this to be problematic in practice.

@microsoft-github-policy-service microsoft-github-policy-service bot removed the Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something label Jun 5, 2024
Copy link
Member

@DHowett DHowett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

51/54

src/cascadia/TerminalApp/TerminalPage.cpp Outdated Show resolved Hide resolved
}

THROW_IF_FAILED(_CreatePseudoConsoleAndPipes(til::unwrap_coord_size(dimensions), flags, &_inPipe, &_outPipe, &_hPC));
THROW_IF_FAILED(_CreatePseudoConsoleAndPipes(til::unwrap_coord_size(dimensions), _flags, &_inPipe, &_outPipe, &_hPC));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah uh defterm handoff will always use the "default" value

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea this is definitely rough.

  • We could add it as another property on TERMINAL_STARTUP_INFO
    • though, we weren't smart enough to add a dwSize, so we'd have to rev the ITerminalHandoff2, s.t. we can make sure that an old conhost
    • Wait, no, we don't. TerminalHandoff is from a OpenConsole to a Terminal in the same package. We can just bump that.
  • We could do the closeOnExit: graceful thing, where we know that because it's a defterm, it's going to always use the version from the registry
    • we'd need to cache that value when it's created
    • Wait, would it use the value from the registry? Or fall into the category of "because conpty, we didn't even load HKCU/Console"

default:
break;
}
CodepointWidthDetector::Singleton().Reset(mode);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh double f-ck, conpty and terminal will disagree on the measurement mode during defterm handoff!!!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup! It's intentionally broken at the moment until I sort out IsGlyphFullWidth.

@@ -213,6 +222,7 @@ class Settings
std::wstring _LaunchFaceName;
bool _fAllowAltF4Close;
DWORD _dwVirtTermLevel;
SettingsTextMeasurementMode _textMeasurement = SettingsTextMeasurementMode::Graphemes;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider having a different default for inbox (in 3 years?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a good idea to me. I'm sure in the 5 years before we get this inbox, we'll have any long tail of bugs sorted

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My opinion: I consider it unlikely that this will break any applications. After all, what use would printing a zero-width character have in conhost at the moment? In particular if we consider for how long our ExtTextOutW support was broken... I think this will be a non-issue.

If it does become an issue, we do have the registry key someone could set.

src/terminal/adapter/adaptDispatch.cpp Outdated Show resolved Hide resolved
src/tools/GraphemeTableGen/Program.cs Outdated Show resolved Hide resolved
src/tools/GraphemeTableGen/Program.cs Show resolved Hide resolved
src/types/CodepointWidthDetector.cpp Show resolved Hide resolved
src/types/GlyphWidth.cpp Show resolved Hide resolved
src/types/inc/CodepointWidthDetector.hpp Outdated Show resolved Hide resolved
@microsoft-github-policy-service microsoft-github-policy-service bot added Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something and removed Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something labels Jun 12, 2024
Copy link
Member

@zadjii-msft zadjii-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

49/54 but i see you're reviewing feedback now so here's basically me +1'ing dustin's thoughts

#define PSEUDOCONSOLE_GLYPH_WIDTH__MASK 0x18
#define PSEUDOCONSOLE_GLYPH_WIDTH_GRAPHEMES 0x08
#define PSEUDOCONSOLE_GLYPH_WIDTH_WCSWIDTH 0x10
#define PSEUDOCONSOLE_GLYPH_WIDTH_CONSOLE 0x18
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝: 0x18 here is "both the PSEUDOCONSOLE_GLYPH_WIDTH bits", which means console. Clever way to pack three enum values into two flags

}

THROW_IF_FAILED(_CreatePseudoConsoleAndPipes(til::unwrap_coord_size(dimensions), flags, &_inPipe, &_outPipe, &_hPC));
THROW_IF_FAILED(_CreatePseudoConsoleAndPipes(til::unwrap_coord_size(dimensions), _flags, &_inPipe, &_outPipe, &_hPC));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea this is definitely rough.

  • We could add it as another property on TERMINAL_STARTUP_INFO
    • though, we weren't smart enough to add a dwSize, so we'd have to rev the ITerminalHandoff2, s.t. we can make sure that an old conhost
    • Wait, no, we don't. TerminalHandoff is from a OpenConsole to a Terminal in the same package. We can just bump that.
  • We could do the closeOnExit: graceful thing, where we know that because it's a defterm, it's going to always use the version from the registry
    • we'd need to cache that value when it's created
    • Wait, would it use the value from the registry? Or fall into the category of "because conpty, we didn't even load HKCU/Console"

@@ -213,6 +222,7 @@ class Settings
std::wstring _LaunchFaceName;
bool _fAllowAltF4Close;
DWORD _dwVirtTermLevel;
SettingsTextMeasurementMode _textMeasurement = SettingsTextMeasurementMode::Graphemes;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a good idea to me. I'm sure in the 5 years before we get this inbox, we'll have any long tail of bugs sorted

src/cascadia/TerminalApp/TerminalPage.cpp Outdated Show resolved Hide resolved
break;
case DispatchTypes::ModeParams::XTERM_BracketedPasteMode:
enabled = _api.GetSystemMode(ITerminalApi::Mode::BracketedPaste);
state = _api.GetSystemMode(ITerminalApi::Mode::BracketedPaste) ? DispatchTypes::DECRPM_Enabled : DispatchTypes::DECRPM_Disabled;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Seems like we could just special case GCM_GraphemeClusterMode at the bottom of the switch.

Copy link
Member

@DHowett DHowett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I think I get it. Let's go wild.

src/terminal/adapter/adaptDispatch.cpp Show resolved Hide resolved

switch (glyph.size())
// The _state is stored ~flipped, so that we can differentiate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sounds like a boolean with more steps?

Copy link
Member Author

@lhecker lhecker Jun 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I wanted the struct to be zero-initializable with minimal overhead so that the calls are cheap. Initializing the member to ~0 = 255 by default would break that, as would adding a bool member (adds padding = bad).

There are ways to make all of this categorically better but I don't think it's time for that yet. First we need to get rid of the IsGlyphWide and ROW::_charOffsets glue. :)
Afterwards we can make it so that we segment entire grapheme runs at a time (instead of 1 grapheme at a time) which amortizes the call overhead to 0 and then it doesn't matter anymore how the struct looks like.

src/types/CodepointWidthDetector.cpp Outdated Show resolved Hide resolved

This comment has been minimized.

@lhecker lhecker added this pull request to the merge queue Jun 26, 2024
Merged via the queue into main with commit cb48bab Jun 26, 2024
20 checks passed
@lhecker lhecker deleted the dev/lhecker/8000-graphemes branch June 26, 2024 19:15
DHowett pushed a commit that referenced this pull request Dec 5, 2024
DHowett pushed a commit that referenced this pull request Dec 13, 2024
It seems I forgot to add this in #16916.

(cherry picked from commit c0d40c9)
Service-Card-Id: PVTI_lADOAF3p4s4AmhmQzgVUrMM
Service-Version: 1.22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Rendering Text rendering, emoji, complex glyph & font-fallback issues Issue-Feature Complex enough to require an in depth planning process and actual budgeted, scheduled work. Product-Conhost For issues in the Console codebase Product-Terminal The new Windows Terminal.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants