Display of Indic text (esp. Hindi in Devanagari script) #453
Comments
I tried it also with the beta version ConsoleZ.x64.1.18.2.17256, with similar results. |
I selected the text in the window and copy-pasted it into Notepad++. |
Hey. Is the text being chopped up into buffers before being fed to the underlying rendering layer? |
ConsoleZ displays chars from console buffer. Here, it seems cmd.exe and 'type' command have a curious behaviour. |
Hi, Christophe,
OK it may be some environment problem. I still don't know what.
I can view the file with PowerShell, but I have to specify the encoding:
Get-Content -Encoding UTF8 udhr_hin.txt
otherwise it produces 8-bit garbage. This means something is different
between your setup and mine.
With it, the results look much better, but still not perfect.
And it's strange: the display has a glitch, *only* in the very last line of
the file! However, when I select the console contents and copy-paste it
into the text editor, the glitch is gone.
See attached -- there are dotted circles on the right-hand side.
What could this be? (Bear in mind, I have never had any luck displaying
complex scripts in Windows consoles, so I have no -- good -- experience to
draw on.)
A colleague in India is telling me that on her system, it looks good. I'm
looking into that.
|
My colleague gets perfect display from the DOS 'type' command. We are both using Windows 10, but mine is a German system, and hers is English, as near as I can tell. Can you recommend a command that would list the relevant environment settings? |
dotted circles appear when you cut a word (line return)
So you have more knowledge than me to resolve this issue. Check the Windows Console font used by your colleague
What I can see by googling:
|
Hi Christophe,
I am working on this, and I have more to report, but a couple of things aren't clear to me. I'll write about that separately -- here I'll just answer your remarks and questions.
"dotted circles appear when you cut a word (line return)"
Not exactly. They happen if you cut a word before a mark character that must be applied to the character preceding it. And this is the root of the remaining problem: some software layer is incorrectly breaking up the text before the font rendering layer gets it. What is unclear to me is which software layer is responsible for the problem.
"Your colleague use ConsoleZ or not?"
We had a miscommunication. When she saw perfect Hindi output, she was looking at the Windows 10 bash console -- which works perfectly.
"What I can see by googling:" "you cannot type Hindi without Unicode" "Hindi has no ANSI codepage" "there is no monospace font"
But note: the main problem that we see in ConsoleZ is independent of the font. This is a software bug, either in ConsoleZ or in Windows software. |
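To illustrate the point about mark characters, here is a minimal Python sketch. It assumes a simplified console model in which every code point occupies one cell and a row is cut after a fixed number of cells; the function names are hypothetical and this is not ConsoleZ code. It shows how a hard cut can leave a combining mark at the start of a row with no base character, which is the situation in which a renderer typically draws a dotted circle.

```python
import unicodedata

def hard_wrap(text, columns):
    """Split text every `columns` code points, as a fixed-width cell grid would."""
    return [text[i:i + columns] for i in range(0, len(text), columns)]

def starts_with_mark(row):
    """True if the row begins with a combining mark (general category M*)."""
    return bool(row) and unicodedata.category(row[0]).startswith("M")

# The word from the report: KA, VOWEL SIGN I, SA, VOWEL SIGN II
word = "\u0915\u093f\u0938\u0940"

rows = hard_wrap(word, 3)
orphaned = [starts_with_mark(r) for r in rows]

for row, flag in zip(rows, orphaned):
    print([f"U+{ord(ch):04X}" for ch in row], "starts with a mark" if flag else "ok")
```

With a width of 3 the last vowel sign lands alone on the second row, separated from the consonant it belongs to.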
Here is a shorter text page, just the last section of that UDHR file. Here is how it looks using the Windows bash console on the same system (same font). But notice that the text output routines are automatically wrapping this text, so that no word is broken by the line endings. What layer is responsible for this? |
"We had a miscommunication. When she saw perfect Hindi output, she was looking at the Windows 10 bash console -- which works perfectly."
Clarification:
When I run cmd in ConsoleZ and 'type' the text, some characters come out wrong.
When I run bash from WSL in ConsoleZ and 'cat' the text, it displays OK.
So the difference could also be between 'type' and 'cat'. I will check whether the error happens in exactly the same place, and check the Unicode code points of those words. |
I tried copying the words that show the error and looking at their code points in https://r12a.github.io/apps/conversion/
I also looked at the same text file in the Windows console directly in Windows 10, and then via ConsoleZ in Windows 10. Here are the results, with images attached.
Windows console, Courier New: Indic text shown as boxes
Windows console, with Steve's monospace font: input text rendered in Devanagari, but the positioning of combining marks is incorrect
Windows console run via ConsoleZ: same Indic text rendered correctly, but certain characters show up as ?, diamonds, etc.
PowerShell run via ConsoleZ: Indic text displays correctly
WSL bash under Windows, run via ConsoleZ: Indic text displays correctly |
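The code point check described above can also be done locally. The following is a small Python sketch using only the standard library; the helper name `describe` is hypothetical. It lists each code point of a word with its general category and Unicode name, which makes it easy to see which characters are combining marks.

```python
import unicodedata

def describe(text):
    """Return (code point, general category, Unicode name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# One of the words from the report: KA followed by VOWEL SIGN AA
for row in describe("\u0915\u093e"):
    print(*row)
```

Categories beginning with "M" are marks that attach to the preceding character; "Lo" is an ordinary letter.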
In the above post, where I said that Indic text is rendered correctly, I did not count the 'dotted circle' issue -- that can probably be avoided by adding line breaks at word boundaries rather than in the middle of a word. |
Hi, Shree,
|
Christophe, I want to say, you are almost there, as far as display of Indic text is concerned.
The default behavior of Windows consoles regarding encoding is widely considered to be a bug. For instance: Our systems are already set for UTF-8 locale, we have set chcp 65001, and yet still in PowerShell, I had to explicitly set the encoding of Get-Content in order to obtain the right output. Once the encoding is set, as with the default system locale, it should just work. There should be no need to re-state the encoding, as we have done. If you can figure out what is the right thing to do, you will have a terminal that is very superior to anything else on Windows.
For the purpose of reading text, it is very bad to show the normally-hidden characters used to compose Indic words, or to misplace vowel marks. Both of these happen when you split a word at the edge of the screen. There are a couple of options.
i) Break lines only at white-space. This is an easy solution. For display of normal text, it is maybe the best-looking option. It wreaks havoc in applications that don't expect this behavior (e.g. text editors such as vim). I would suggest an application option for white-space line wrapping. That would settle the problem for most users.
ii) In a monospace environment, break after the last rendered character. I haven't seen this done. If it's possible at all, this would be the preferred solution for some purposes. There may be special font-rendering API calls that facilitate this. If not, there may be an algorithm for finding the best place to break a word. It would take some thought -- although I can imagine a couple of ways this could be done.
Cheers! |
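The two options above can be sketched in Python. This is only an illustration under loud assumptions: the cluster rule here (a base character, any following combining marks, and consonants joined by a virama) is a rough approximation, not the full Unicode segmentation algorithm and not what a font shaping engine does; each cluster is counted as one cell; and all function names are hypothetical.

```python
import unicodedata

VIRAMA = "\u094d"  # Devanagari sign virama, joins consonants into conjuncts

def clusters(text):
    """Approximate display clusters (assumption: not a full UAX #29 implementation)."""
    out = []
    for ch in text:
        cat = unicodedata.category(ch)
        is_mark = cat.startswith("M")
        joins_conjunct = bool(out) and out[-1].endswith(VIRAMA) and cat.startswith("L")
        if out and (is_mark or joins_conjunct):
            out[-1] += ch          # attach to the previous cluster
        else:
            out.append(ch)         # start a new cluster
    return out

def wrap_words(text, columns):
    """Option i): break only at white-space, measuring width in clusters."""
    rows, row = [], ""
    for word in text.split():
        candidate = word if not row else row + " " + word
        if row and len(clusters(candidate)) > columns:
            rows.append(row)
            row = word
        else:
            row = candidate
    if row:
        rows.append(row)
    return rows

def wrap_clusters(text, columns):
    """Option ii): break anywhere, but never inside a cluster."""
    rows, row = [], []
    for c in clusters(text):
        if len(row) == columns:
            rows.append("".join(row))
            row = []
        row.append(c)
    if row:
        rows.append("".join(row))
    return rows

sample = "\u0915\u093f\u0938\u0940 \u0915\u093e"  # two words from the report
print(wrap_words(sample, 3))
print(wrap_clusters(sample, 3))
```

In both variants no row can begin with a combining mark, because marks always stay attached to their base character. A word longer than the row width is left unbroken by `wrap_words`, which is one reason a real implementation would need a fallback.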
Your file has no BOM, then there is no indication concerning the encoding.
You should contact Microsoft to debate this. ConsoleZ reads from the Win32 console buffer. The Win32 console buffer is an array (column/line) of characters (with attributes such as color): there is no indication of line breaking. |
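The BOM remark can be made concrete with a short Python sketch. It assumes a reader that looks for a byte order mark and, when none is present, falls back to a caller-supplied default; the function names and the "utf-8" default are assumptions for illustration, not the behaviour of cmd.exe, PowerShell or ConsoleZ.

```python
import codecs

def sniff_bom(data: bytes):
    """Return a codec name implied by a leading BOM, or None if there is no BOM."""
    # UTF-32 is tested before UTF-16 because the UTF-32 LE BOM begins with the UTF-16 LE BOM.
    candidates = (
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    )
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    return None

def decode_text(data: bytes, default="utf-8"):
    """Decode using the BOM if present, else the (assumed) default encoding."""
    return data.decode(sniff_bom(data) or default)

plain = "\u0915\u093e".encode("utf-8")   # no BOM, as in the attached file
print(sniff_bom(plain))                   # nothing in the bytes names the encoding
print(decode_text(plain) == "\u0915\u093e")
```

Without a BOM, the result depends entirely on which default the reading program chooses.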
Hi Christophe,
I understand your objections. But let's distinguish between implementation details and how the thing should work.
We are aware of the BOM issue. That is a non-standard Microsoft-only convention. The indication of the encoding ought to be taken from a default, which should be somewhere in the environment. This is failing to happen. Whose fault it is -- is immaterial to the user. The question is, can you do anything to improve the situation? That is still not clear to me.
The encoding functionality is now broken -- it is not just my opinion. It causes a lot of trouble for console users generally, and it impacts your users too. How it should work is, the default encoding should be set when the console app is launched, and cmdlets, or programs, launched within that should inherit the encoding.
Maybe it is really impossible to do anything about it at the console level. If so, it would be helpful to document that somewhere, together with recommended work-arounds. (Documentation alone would set you apart from most of the competition!)
The stuff about the Win32 buffer is an implementation detail, which is important for you -- not so much for your users.
Tell me: what software is responsible for breaking the line at the window edge? The "writer"? The font rendering layer? Or ConsoleZ? That software layer is the only one that can be responsible for breaking the text at a reasonable point. |
Microsoft did not invent the BOM. The Unicode standard did.
As I explained : nothing. ConsoleZ doesn't read the file. ConsoleZ reads the Unicode characters from WIN32 console buffer. These characters are written by shell commands or console applications. If you want an overview of how Windows console works, you can consult MSDN WIN32 console functions documentation.
No this is not a detail. This is how ConsoleZ works and this is the only way to interact with Windows console.
A console has a fixed number of columns. When you have filled a line, the next character is written at the beginning of the next line. This is the same rule for every language.
Yes, I would like to have the name or a link to a monospace font that supports Devanagari. |
Hi again Christophe, I'm afraid we're misunderstanding one another on several points.
I didn't say Microsoft invented the BOM. I meant, only Microsoft products use the initial BOM to indicate encoding. It is not a practice recommended by the standards.
It does not. But we aren't talking about guarantees. We were talking about the "default" encoding. The effect, for example in unix-y systems, is that plain text with no other indication of encoding is treated as if it were of the default encoding. For example, most unix-y systems these days use UTF-8 as the default encoding. If I open most text files on my system with a text editor, they assume that encoding, and display the text well. Of course, sometimes I get a file that isn't UTF-8, and I have to figure out the encoding, and tell the editor explicitly what encoding to use. But that is beyond the scope of a console -- rather, what you call a "writer" would have to be told to use a non-default encoding.
I never said or thought that ConsoleZ reads the file. I'm afraid you've missed my point. The idea is to somehow instruct the shell commands/applications to use the default encoding, before they write. This has nothing to do with the buffer -- by that time, it's too late. In unix-y systems, the default encoding is set in the environment, and it works right. I understand that Windows is much more complicated in this regard -- and maybe it is simply broken, from what I read. But that doesn't mean there is no way to get around it. I only ask that you try to understand the problem, and keep your mind open to any solution that you come across.
Well maybe this is the root of our misunderstanding. All the Microsoft consoles are broken regarding encoding. That does not mean that all console programs that run in Windows must be broken. Depending on how much of the Windows API's your software uses, you might inherit the same problems that the Microsoft programs have. That is a matter of the programmer's choice.
I'm afraid I have failed to communicate what I mean by "implementation detail". I'll try again. Your user wants to use your product to look at an Indic-language file. Once the issue of encoding is settled, they do see some Indic text, but there's junk in it. They know how it should look, but that's not what they see. There are two questions:
When you talk about "Win32 buffer" and "Windows Console", you're talking about the particular APIs and services on the system that you employ to display the text. Of course, as a programmer, you could choose to use those APIs and services, or you could write the whole thing from scratch. These are programming decisions, and that's what I mean by "implementation detail". You write as though you identify "ConsoleZ" with this particular mechanism for displaying text. Well, it is your product, and you can define it as you like. But this does not mean it is impossible to display the text some other way. If you insist that ConsoleZ is a program that displays text on the screen using certain system APIs and services, you may indeed be stuck. On the other hand, if you shoot for the goal of displaying text as the user would like it, you may have to abandon some ways of doing things. Some people would love a challenge like that, others would prefer to stick with what they have. It's a choice. You might also double your number of users. (Which brings up another question: do you want more users?)
I suggested that your product could be given a mode, in which it does something smarter than that, for example, to optionally break the incoming text at the last white space, and shove the remaining text on the next line. (There are other possibilities, but as I said, this is the simplest.) This isn't a matter of possibility -- of course a programmer can find a way to achieve that effect -- it's a matter of your choice. If you want to do it, and if you are a programmer, you can do it.
I will send you a link. Thanks again! |
false
This is not a reason to presume that all files are UTF-8. Windows is a Unicode OS (strings are encoded in UCS-2, a subset of UTF-16 where each character takes 2 bytes in memory). When a text file is read by an application, the file's content must be converted to UCS-2. Unix-y systems are mostly
I think you presume Unix-y systems are best, and you criticize Windows without understanding how it works. I explain only facts. There are many reasons to criticize Windows and Microsoft. But there are reasons to criticize Unix-y systems too 😉.
I have contributed to this project because I have used it for a long time and I needed more features. |
I'm afraid you have misunderstood me on almost every point. I am sorry that I have been unable to explain the basic details to you. I do not mean to criticise you personally, and I only point out that the handling of encoding in Windows console environments is widely considered to be buggy. I provided references -- this is not only my idea.
I too work with Windows professionally, as well as Linux, and I've been using both almost since they first came out. I'm a long-time programmer, too. Thirty years ago it was C, then C++, then Java, and meanwhile a lot of Perl and Python and so on and so on. I do know something about these things.
Most files on unix are "C"-encoded -- well, that isn't really an encoding -- it basically means the file is treated, as you said, as binary. That is surely true if you're talking about system files. But users whose language can be encoded only with Unicode will have their default encoding set to Unicode, and most of their own text files will be Unicode-encoded, so that the textual content can be interpreted as text in their spoken language. This is the situation I've been talking about.
As we have demonstrated to you in images, Hindi text is mangled in one way or another by your program or by any other Windows terminal emulator that we can find. It is not right, and sometimes so bad it's unreadable. It is possible to make a terminal emulator that does work well for Hindi and other complex scripts. It would take some thought and programming effort, however.
I applaud you for your time and unpaid effort on this free project. And I respect your decision not to spend time on issues that do not interest you. Thank you for your time! |
(Unicode is not a text encoding; it's a consortium providing a computing industry standard.) I have seen only one real bug in all your comments. (I guess text conversion is made per block. In UTF-8, characters don't have a fixed size, so characters overlapping two blocks could be broken.) I don't understand why there is so much mystery about where to find a monospace font that supports Devanagari... |
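The per-block guess above can be reproduced in a few lines of Python. This is only a sketch of the hypothesized mechanism, with an arbitrary block size chosen for illustration; it is not confirmed to be what the 'type' command actually does. It contrasts decoding each block independently with an incremental decoder that carries partial byte sequences over to the next block.

```python
import codecs

text = "\u0915\u093e"          # KA + VOWEL SIGN AA, three UTF-8 bytes each
data = text.encode("utf-8")

def blocks(data, block_size):
    """Yield successive fixed-size byte blocks."""
    for i in range(0, len(data), block_size):
        yield data[i:i + block_size]

def decode_per_block(data, block_size):
    """Decode every block on its own; a character straddling two blocks is damaged."""
    return "".join(b.decode("utf-8", errors="replace") for b in blocks(data, block_size))

def decode_incremental(data, block_size):
    """Decode block by block, but keep incomplete sequences until the next block arrives."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    parts = [decoder.decode(b) for b in blocks(data, block_size)]
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)

broken = decode_per_block(data, 4)      # block boundary falls inside the second character
intact = decode_incremental(data, 4)

print([f"U+{ord(ch):04X}" for ch in broken])
print([f"U+{ord(ch):04X}" for ch in intact])
```

The independently decoded version contains U+FFFD replacement characters where the second character was split, while the incremental version round-trips the text unchanged.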
Hi angtany,
First, the Windows console is an old thing, just not set up for display of
Indic scripts. Other console applications exist, which do a better job.
None that I have found is perfect. (All I have tried screw up when a word
is wrapped at the wrong point.)
I found that the KDE program Konsole works pretty well on Linux, and a Windows
program called ConsoleZ works pretty well on Windows.
(Also, if you want a text editor, Notepad ++ works pretty well.)
Second, there are several things that have to be right.
You need a monospaced Devanagari font. There aren't many. FreeMono, as in
the GNU FreeFont SVN, will work.
Install the font, and arrange for the console application to use it.
In the console, you have to change the code page
chcp 65001
In PowerShell, you have to do other things to get Unicode to properly
output to the console,
Get-Content -Encoding UTF8 filename.txt
Let me know if that helps!
…On Fri, Jul 6, 2018 at 10:34 AM, angtany ***@***.***> wrote:
Hi StevanWhite,
I can't be able to open up a hindi text file in command prompt in windows
10 .
Even after running a command: chcp 65001(for utf-8), the rectangular boxes
(as shown in the attached screenshot) are displayed instead of the Hindi
text.
[image: screenshot 36]
<https://user-images.githubusercontent.com/39870291/42368376-80d7a4f0-8124-11e8-81a1-6485d1f63eed.png>
Please suggest a solution.
Thanks in advance.
|
In the display of Hindi text (in Devanagari script), I'm encouraged to see that much of the complex reordering of letters (Indic font shaping) is being carried out.
There are a couple of ugly bugs however. They both seem to occur randomly in the text. The effect is independent of the font being used for display (provided a font that supports Devanagari is installed!). I have been unable to identify anything in the text itself that consistently triggers the problem -- the same word will look fine here, but show trash there.
This is not a matter of simple encoding -- otherwise no Hindi would appear at all. It appears to be some glitch in the software.
udhr_hin.txt
A lot of people are looking for a good replacement for the system consoles that displays Indic text correctly. And there aren't many options out there.
Expected Behavior
Nice clean Hindi text to be displayed
Actual Behavior
Most of the text looks pretty good. But there are glitches throughout. I see two kinds:
a sequence of two or three U+FFFD ("replacement character") glyphs appear, sometimes between words. I can only imagine this is some sort of bug.
a U+25CC ("dotted circle") appears, often within a word. Sometimes it appears to replace a letter... a letter which is displayed just fine in the next word.
The attached image is a screen shot of the end of the attached text file
This word appears several times under the heading अनुच्छेद ३०
किसी
In its second occurrence, there is a dotted circle before the i-glyph,
but the i-glyph is in the right place.
Under अनुच्छेद २९.
in part (१) the word
का
gets a couple of unknown characters before it. Don't see anything else broken.
In part (२) the word
प्रजातन्त्रात्मक
is broken: replaced by unknown characters, looks like the letter त is lost.
Steps to reproduce
Make sure some fonts are installed that support Devanagari, copy the attached file udhr_hin.txt to the main user directory, open ConsoleZ.
chcp 65001
type udhr_hin.txt
-- as above
Diagnostic Report
When reporting a bug you must provide a diagnostic report.
If you are not able to create a diagnostic report, explain why.
Privacy is not a valid explanation! The report is human readable and private data can be masked.