force use utf-8 open README.md #76

Open
Aqaao opened this issue Dec 22, 2022 · 5 comments

Comments

@Aqaao

Aqaao commented Dec 22, 2022

setup.py should open README.md with UTF-8 explicitly. Otherwise, installing with pip fails with this error:

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\Aqaao\AppData\Local\Temp\pip-req-build-2dcr43hl\setup.py", line 8, in <module>
        README = fh.read()
    UnicodeDecodeError: 'gbk' codec can't decode byte 0x90 in position 757: illegal multibyte sequence
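
The traceback suggests that `setup.py` reads README.md with the platform default encoding, which is GBK on this Windows machine. A minimal sketch of the requested change, assuming README.md is stored as UTF-8; the helper name `read_readme` is hypothetical and not taken from the project:

```python
def read_readme(path="README.md"):
    """Read the README as UTF-8 regardless of the locale default."""
    # Without encoding=..., open() falls back to the locale's preferred
    # encoding (for example 'gbk' on a Chinese-locale Windows install).
    with open(path, "r", encoding="utf-8") as fh:
        return fh.read()
```

In `setup.py` the result would then be assigned to `README`, as on line 8 of the traceback.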
@Aqaao
Author

Aqaao commented Dec 22, 2022

And here, a codec error when the output file is opened with a non-UTF-8 default encoding ('gbk'):

    Traceback (most recent call last):
      File "autosub/main.py", line 170, in <module>
        main()
      File "autosub/main.py", line 161, in main
        ds_process_audio(ds, audio_segment_path, output_file_handle_dict, split_duration=args.split_duration)
      File "autosub/main.py", line 69, in ds_process_audio
        write_to_file(output_file_handle_dict, split_inferred_text, line_count, split_limits, cues)
      File "C:\env\python-venv\deepspeech\lib\site-packages\autosub\writeToFile.py", line 43, in write_to_file
        file_handle.write(inferred_text + "\n\n")
    UnicodeEncodeError: 'gbk' codec can't encode character '\udce9' in position 0: illegal multibyte sequence

——————————
Edit: the "utf-8" codec raises an error too, and I don't know why.

    Traceback (most recent call last):
      File "autosub/main.py", line 170, in <module>
        main()
      File "autosub/main.py", line 161, in main
        ds_process_audio(ds, audio_segment_path, output_file_handle_dict, split_duration=args.split_duration)
      File "autosub/main.py", line 69, in ds_process_audio
        write_to_file(output_file_handle_dict, split_inferred_text, line_count, split_limits, cues)
      File "C:\env\python-venv\deepspeech\lib\site-packages\autosub\writeToFile.py", line 43, in write_to_file
        file_handle.write(inferred_text + "\n\n")
    UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-262: surrogates not allowed
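
Both errors are consistent with `inferred_text` containing lone surrogate code points such as `'\udce9'`. One plausible source, stated here as an assumption rather than a confirmed diagnosis, is that UTF-8 bytes were decoded upstream with the `surrogateescape` error handler. A small sketch of how such a string arises and why neither codec can write it:

```python
# Bytes that are valid UTF-8 for the Mandarin text "你好".
raw = "你好".encode("utf-8")

# Assumption: the bytes were decoded with a codec that cannot represent
# them, using errors="surrogateescape". Each undecodable byte 0xNN then
# becomes the lone surrogate U+DCNN.
mangled = raw.decode("ascii", errors="surrogateescape")


def try_encode(text, codec):
    """Return None on success, or the error message on failure."""
    try:
        text.encode(codec)
        return None
    except UnicodeEncodeError as exc:
        return str(exc)


# Strict encoders reject lone surrogates, whatever the target codec is.
gbk_error = try_encode(mangled, "gbk")
utf8_error = try_encode(mangled, "utf-8")
```

Under that assumption, switching the file from 'gbk' to 'utf-8' alone does not help, because the problem is in the string rather than in the target encoding.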

@abhirooptalasila
Owner

Weird. Which language is your audio in?

@Aqaao
Author

Aqaao commented Dec 23, 2022

> Weird. Which language is your audio in?

Mandarin. I found that many people hit the same problem in Python, for example in mozilla/DeepSpeech#3557, but I didn't find a solution.

@abhirooptalasila
Owner

Aah yes. You'll need to add .decode('utf-8', 'ignore') and .encode(...) while writing to file/saving

@Aqaao
Author

Aqaao commented Dec 23, 2022

> Aah yes. You'll need to add .decode('utf-8', 'ignore') and .encode(...) while writing to file/saving

Thanks, it worked. I changed the write call from

    file_handle.write(inferred_text + "\n\n")

to

    file_handle.write(inferred_text.decode('utf-8', 'ignore').encode('utf-8') + "\n\n")

and the open call from

    output_file_handle_dict[format] = open(output_filename, "w")

to

    output_file_handle_dict[format] = open(output_filename, "w", encoding='utf-8', errors='surrogateescape')
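
On Python 3 a `str` has no `.decode` method, so a write-side cleanup would likely need the two calls in the opposite order: encode first, then decode. A minimal sketch under that assumption; `repair_surrogates` and `write_cue` are hypothetical helpers, not functions from autosub:

```python
def repair_surrogates(text):
    """Recover text whose UTF-8 bytes were smuggled in as U+DCxx surrogates."""
    # encode(..., "surrogateescape") turns each U+DC80..U+DCFF surrogate
    # back into its original byte; other lone surrogates would still raise.
    raw = text.encode("utf-8", "surrogateescape")
    # Decode the recovered bytes as UTF-8, dropping any invalid sequences.
    return raw.decode("utf-8", "ignore")


def write_cue(file_handle, inferred_text):
    """Write one cue's text followed by a blank line."""
    file_handle.write(repair_surrogates(inferred_text) + "\n\n")
```

With the file opened using `encoding='utf-8', errors='surrogateescape'` as above, the escaped bytes are written back unchanged even without this helper; the helper matters mainly when the handle uses the default strict error handling.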
