Translation of the lesson "OCR e Tradução Automática" (OCR and Machine Translation) #371

Open
joanacvp opened this issue May 25, 2021 · 32 comments

@joanacvp
Collaborator

joanacvp commented May 25, 2021

Programming Historian em português has received the following proposal, from @felipelmc, to translate the lesson 'OCR e Tradução Automática'.

The lesson to be translated is at: https://github.com/programminghistorian/ph-submissions/blob/gh-pages/pt/esbocos/traducoes/ocr-e-traducao-automatica.md

This lesson is now under review and can be read at: http://programminghistorian.github.io/ph-submissions/pt/esbocos/traducoes/ocr-e-traducao-automatica

To promote rapid publication of this translated lesson, I, @joanacvp, will seek reviews from two reviewers as soon as possible.

If the translators have any concerns, they can contact the mediator of PH em português (Luís Ferla).

@joanacvp
Collaborator Author

Hello @svmelton. Hopefully you can help us.
@felipelmc, who translated this lesson, had some problems with it. He will describe what happened in this issue.
I am also having problems loading the lesson. It always says there was a failure and I can't find out why.
Thank you very much in advance for your support.

@felipelmc
Collaborator

Hello @svmelton!
As @joanacvp just said, I had some problems throughout the translation. For some reason, I could not make the translation API used in the lesson (Yandex) work. When I ran the code, I received several error messages as output ("Session is invalid", "Null response" or "Oops! Something went wrong and I can't translate it for you :("). If I change the API to Google or Bing, as in the commands below, the translation runs normally:

trans -e google :eng file://INPUT_FILENAME > OUTPUT_FILENAME
trans -e bing :eng file://INPUT_FILENAME > OUTPUT_FILENAME

The problem is that, by changing the API, the accuracy level of the translation gets reduced, and the translation itself suffers substantial changes. Also, some problems the author tries to fix during the lesson go away, which would make it impossible to change the API without changing the content of the lesson.
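One way to soften the Yandex failure, offered as a minimal sketch rather than a fix to the lesson, is to try each engine in turn and keep the first result. The `-e` flag and the `:eng` target come from the commands above; the function name `translate_with_fallback` and the default engine order are assumptions. Whether `trans` reports those errors through its exit status is not confirmed here, so the sketch also rejects an empty output file, and the output will still differ between engines.

```shell
#!/bin/bash
# Hypothetical helper: try translation engines in order and stop at the first success.
# Usage: translate_with_fallback INPUT_FILENAME OUTPUT_FILENAME [engine ...]
translate_with_fallback() {
  input=$1
  output=$2
  shift 2
  # Default engine order is an assumption; pass your own list to override it.
  if [ "$#" -eq 0 ]; then
    set -- yandex google bing
  fi
  for engine in "$@"; do
    # Treat a non-zero exit status or an empty output file as a failure.
    if trans -e "$engine" :eng "file://$input" > "$output" 2>/dev/null && [ -s "$output" ]; then
      echo "translated with $engine" >&2
      return 0
    fi
  done
  echo "no engine produced a translation" >&2
  return 1
}
```

Called as `translate_with_fallback INPUT_FILENAME OUTPUT_FILENAME`, it reports on standard error which engine was used, which makes it easier to note when the result will not match the lesson's Yandex output.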

Other problems I ran into during the translation came from my impression that some commands were not doing what they were supposed to do. In the case of the Nano script below, I could not see any difference between the input image and the output image:

#!/bin/bash
read -p "enter file name: " fl;
convert $fl -despeckle -despeckle -despeckle -despeckle -despeckle $fl
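A hedged variant of the script above may help check whether the filter does anything at all. It writes to a separate file, so the original survives for comparison, and it takes the number of passes as a parameter (later in this thread, changes only became visible at around 30 passes). The function name `despeckle_n` is an assumption; `convert` and `-despeckle` are the ones the script already uses.

```shell
#!/bin/bash
# Hypothetical helper: run ImageMagick's -despeckle N times, writing to a new file.
# Usage: despeckle_n INPUT OUTPUT [PASSES]
despeckle_n() {
  input=$1
  output=$2
  passes=${3:-5}   # 5 matches the script above
  # Build the argument list: INPUT -despeckle ... -despeckle
  set -- "$input"
  i=0
  while [ "$i" -lt "$passes" ]; do
    set -- "$@" -despeckle
    i=$((i + 1))
  done
  # Quoting keeps file names with spaces intact.
  convert "$@" "$output"
}
```

For example, `despeckle_n scan.png scan_clean.png 30` (file names hypothetical) leaves `scan.png` untouched. ImageMagick's `compare -metric AE scan.png scan_clean.png null:` then reports how many pixels differ, which is a quicker check than comparing the images by eye.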

Finally, I could not make the following two code blocks work:

mkdir $folder"_ocr"
mkdir $folder"_translation"
mv *_ocr.txt *_ocr
mv *_trans.txt *_translation

and

#!/bin/bash 
read -p "enter archive name: " $archive_name;
read -p "enter  date of visit: " $visit;

ls -lt | awk '{if ($6$7==$visit) print $9}' >> list.txt
mkdir $archive_name

for i in $(cat list.txt);
do 
  mv $i $archive_name/$archive_name${i:3}; 
done

The first block of commands should move the transcribed/translated/edited files to different folders on my computer. The second one should rename the files according to the date of a visit to an archive. Unfortunately, I could not make either of them work.
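For the first block, one possible cause is the working directory: `mkdir` and `mv` act on whatever folder the script is started from. The sketch below is an assumption-laden rework, not the lesson's code. It takes the folder as an argument, changes into it first, and names the destinations explicitly instead of relying on the `*_ocr` glob. The function name `sort_outputs`, and deriving `folder` from the directory name, are both assumptions.

```shell
#!/bin/bash
# Hypothetical helper: sort OCR and translation text files into their own folders.
# Usage: sort_outputs /path/to/folder/with/results
sort_outputs() {
  target=${1:-.}
  (
    # Work inside the target folder, wherever the script was started from.
    cd "$target" || exit 1
    # Assumption: the folder prefix is the name of the target directory.
    folder=$(basename "$PWD")
    mkdir -p "${folder}_ocr" "${folder}_translation"
    for f in *_ocr.txt; do
      if [ -e "$f" ]; then mv "$f" "${folder}_ocr/"; fi
    done
    for f in *_trans.txt; do
      if [ -e "$f" ]; then mv "$f" "${folder}_translation/"; fi
    done
  )
}
```

The subshell (the parentheses) keeps the `cd` from changing the caller's directory, so the function can be run from anywhere.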

By the way, I am working from a Macbook. I do not know if the commands did not work properly because of my computer, if I have done something wrong or anything else.

I would like to thank you in advance for your support and attention!

@rivaquiroga
Member

@joanacvp, maybe the @programminghistorian/technical-team can help with the problems with the build.

@walshbr
Contributor

walshbr commented May 26, 2021

The build should be working now. The issue was that you used double quotes inside of a caption for a couple images -

{% include figure.html filename="OCR-e-traducao-automatica-2.png" caption="Figura 2: A frase com "coruja" (owl) em russo" %}

The double quotes around coruja are a problem, because the script assumes they signal the end of the caption. I changed them to single quotes, and it should be working now - 35c4fd1
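A rough way to catch this before the build fails is sketched below. It is an assumption, not something the project's tooling is known to do. It flags any figure include where the number of double quotes is not exactly two per attribute. The function name `find_bad_captions` is hypothetical, and the heuristic would misfire on a caption that itself contains `="`.

```shell
#!/bin/bash
# Hypothetical check: list figure includes whose attribute values contain stray double quotes.
# Usage: find_bad_captions lesson.md
find_bad_captions() {
  awk '/include figure\.html/ {
    line = $0
    attrs  = gsub(/="/, "&", line)   # one per attribute: filename=", caption=", ...
    quotes = gsub(/"/, "&", line)    # every double quote on the line
    # Each attribute should contribute exactly two quotes (opening and closing).
    if (quotes != 2 * attrs) print FNR ": " $0
  }' "$1"
}
```

Run against the lesson's Markdown file, it prints the line number and text of each suspect include. The caption quoted above would be reported, and the single-quoted version would not.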

@DanielAlvesLABDH
Contributor

Thank you @walshbr

@felipelmc
Collaborator

Oh! That makes total sense. Thank you very much @walshbr

@joanacvp
Collaborator Author

Thank you very much for your help @walshbr. I spent hours trying to understand what was wrong :). Who should we contact regarding the problems with the lesson @felipelmc stated? Thank you once again

@walshbr
Contributor

walshbr commented May 27, 2021

If there were problems with the original lesson you'd probably ping the managing editor for that language publication. So @svmelton in this case I think? Unless I'm misunderstanding.

@rivaquiroga
Member

@joanacvp, @felipelmc, you can open a ticket here to report the bug: https://github.com/programminghistorian/jekyll/issues

@svmelton
Contributor

Thanks @walshbr! @hawc2, would you be able to do an initial assessment of the issue described above? It may be that we need to contact the author.

@hawc2
Collaborator

hawc2 commented May 27, 2021

I took a first shot at this tutorial and am running into a lot of problems, including the ones cited by @felipelmc.

I fear the main issue may be this was composed for Linux and does not exactly translate to Unix, but the lesson introduction doesn't make this clear in any detail. Even installing Tesseract on Mac makes more sense to me using 'brew install' than how it is currently listed in the tutorial.

I'm happy to try to debug this in more detail, but reaching out to the author for clarification may help the process.

@joanacvp
Collaborator Author

joanacvp commented Aug 5, 2021

@Anisa-ProgHist we also reported problems with this lesson. Should we proceed in the same way and open an issue in jekyll? Thank you very much for all your help

@anisa-hawes
Contributor

Yes, please @joanacvp! The Lesson Maintenance workflow is explained step-by-step on our Wiki.

Open a new Issue within Jekyll including the following key information:

  • The full title of the lesson
  • The system you are using (Mac, Linux, Windows)
  • Version numbers of the relevant software you are using
  • The exact steps you took that caused the problem

Please add the label Lesson Maintenance, and I will keep an eye out for it.

It would be useful if you could reference this Issue #371 within the new one that you open in Jekyll, so that I can look back through this Conversation.

@hawc2 Did you make any progress with this? I can see in your note (above) that you were intending to assess the issue and contact the author. Let me know if you have any notes to share.

@hawc2
Collaborator

hawc2 commented Aug 5, 2021

@Anisa-ProgHist I never heard back from the author. Working with them to deal with the troubleshooting seems the right way to go. Happy to help when we hear back

@anisa-hawes
Contributor

Thank you, @hawc2! That sounds sensible. I would much appreciate their + your support to solve this one (as it sounds tricky).

@DanielAlvesLABDH
Contributor

Hi @akhlaghiandrew, do you think you can help us understand what the problem is with our translation of your lesson? Are we missing something? Thanks in advance!

@rivaquiroga
Member

rivaquiroga commented Oct 3, 2021

Hi, everyone! I gave this lesson a try yesterday to see if it was a good idea to translate it to Spanish.

I'm a Linux user. My impression is that the lesson was written for Mac, because the way to install tesseract in Ubuntu, for example, is slightly different, and because it uses MacPorts (i.e., the sudo port commands). I had to make a couple of changes to make the instructions work.
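Since the install step differs by platform, a lesson could show one command per system. The sketch below only prints a suggestion and runs nothing. The function name `suggest_tesseract_install` is hypothetical, and the Linux branch assumes a Debian or Ubuntu system.

```shell
#!/bin/bash
# Hypothetical helper: print (not run) a Tesseract install command for the current system.
# Usage: suggest_tesseract_install [KERNEL_NAME]   (defaults to the output of uname -s)
suggest_tesseract_install() {
  os=${1:-$(uname -s)}
  case "$os" in
    Darwin)
      # Homebrew, as suggested earlier in this thread; MacPorts users would use:
      #   sudo port install tesseract
      echo "brew install tesseract"
      ;;
    Linux)
      # Debian/Ubuntu package name; other distributions differ.
      echo "sudo apt install tesseract-ocr"
      ;;
    *)
      echo "see the Tesseract documentation for $os" >&2
      return 1
      ;;
  esac
}
```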

Regarding the problems @felipelmc found:

1.
Yandex didn't work for me either, and looking around, I found an open issue on the Translate Shell repository. It's from mid-2020, and it doesn't look like it has been solved. Maybe we should add a warning suggesting that readers use bing or google, and that they expect results a little different from the ones shown in the lesson.

2.
I also didn't perceive any difference after using the script that runs the despeckle command five times. I edited it, and only when using -despeckle around 30 times did I start seeing some changes:

[Screenshot, 2021-10-02 21-36-52]

3.

mkdir $folder"_ocr"
mkdir $folder"_translation"
mv *_ocr.txt *_ocr
mv *_trans.txt *_translation

To make this part work you have to run the script from your target folder in the Terminal (where the PDFs are). If you run it from the parent directory (or any other place), it won't work because the script creates the new folders in the working directory and will look for the .txt files there.

For example, in this case my folder structure was something like this:

OCR-lesson
      ¦--loop.sh
      ¦--pdf/
          ¦--120500.pdf
          ¦--119105.pdf

So I opened the Terminal in the pdf/ folder and ran the script with ../loop.sh (not ./loop.sh, as suggested) because it was in the parent folder (OCR-lesson/).

It might be a good idea to tell readers that the script will report an error if the working directory contains files that cannot be OCRed. It will work for image files and PDFs, but will show an error for some other formats (for example, if you have a bash script there).

4.
It looks like the author does not really expect you to run the renaming script on the images from the lesson, but on other files you might have that are named as in his example (something like IMG_XXXXX.png). My overall impression is that the script needs a little more explanation of how it works. For example, there is no information about the format in which you are supposed to provide the date, or how to find out which format you need (in case you are not familiar with awk). In my case it was oct2 (the three-letter abbreviation of the month in Spanish and the day number). My first instinct was to try ISO 8601 format. If your images' current names are slightly different from the example (IMG_XXXXXX.png), the script won't work as expected. Currently it replaces the first three characters of the current name with the archive name you provide. But if your image name does not start with three letters, the result will not be as neat as archive_name_XXXXXXX.png. So you need to adapt this part of the script, $archive_name${i:3}, if your file names are different.
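Two details in the renaming script quoted earlier in this thread look like bugs regardless of platform. `read -p "..." $archive_name` expands an empty variable instead of naming one, and `$visit` inside the single-quoted awk program is never filled in by the shell. The sketch below shows one possible correction, written as a function so the values are passed as arguments. The name `rename_by_visit` is hypothetical. It still assumes file names without spaces and a date typed the way `ls -l` prints it in your locale (for example Oct2 or oct2).

```shell
#!/bin/bash
# Hypothetical rework of the renaming script as a function.
# Usage: rename_by_visit ARCHIVE_NAME VISIT     e.g. rename_by_visit kew Oct2
rename_by_visit() {
  archive_name=$1
  visit=$2
  mkdir -p "$archive_name"
  # -v hands the shell value to awk; inside single quotes the shell does not expand $visit.
  # Columns 6 and 7 of ls -l are month and day, so VISIT looks like Oct2 (locale-dependent).
  ls -l | awk -v visit="$visit" '$6 $7 == visit { print $9 }' |
  while IFS= read -r i; do
    # Skip anything that is not a regular file (the new folder, for instance).
    [ -f "$i" ] || continue
    # ${i#???} drops the first three characters, like ${i:3} in the original.
    mv "$i" "$archive_name/$archive_name${i#???}"
  done
}
```

Running `ls -l` once by hand shows exactly what columns 6 and 7 contain on a given machine, which answers the question of which date format to type.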


@anisa-hawes, maybe you want to talk with @svmelton about changing the difficulty of the lesson from Medium to Advanced. There are a lot of things that you are supposed to know how to do that you cannot learn just by following the two lessons that are suggested as preparation for this one. Also, the lesson expects you to infer some intermediate steps between one instruction and the next.

BTW, Nano can be installed on Windows, not only on Mac and Linux, but only as a very old release (2017). Maybe we should point Windows users to that version, or at least explain how they can work through the section "Putting it all together with a loop" without Nano (e.g., they can use a text editor and change the file extension to .sh).
It might also be a good idea to suggest that people have at least ImageMagick version 7. With 6.x versions you might get an error when trying to convert PDFs.
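A lesson could let readers check their ImageMagick version up front. The sketch below assumes the usual first line of the `-version` output ("Version: ImageMagick 7.x.y ..."). The function name `imagemagick_major_version` is hypothetical.

```shell
#!/bin/bash
# Hypothetical check: print the major version of the installed ImageMagick.
# ImageMagick 7 ships the magick command; 6.x installs usually only have convert.
imagemagick_major_version() {
  if command -v magick >/dev/null 2>&1; then
    version_output=$(magick -version)
  else
    version_output=$(convert -version)
  fi
  # The first line is expected to look like: Version: ImageMagick 7.1.1-15 Q16 ...
  printf '%s\n' "$version_output" |
    sed -n '1s/^Version: ImageMagick \([0-9][0-9]*\)\..*/\1/p'
}
```

A script could compare the printed number with 7 and warn before attempting any PDF conversion.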

@akhlaghiandrew
Contributor

I'm happy to help with updating the lesson. I didn't know this was still an open issue.

@DanielAlvesLABDH
Contributor

@akhlaghiandrew many thanks for your availability. Also @rivaquiroga your work in reviewing this is amazing. Many thanks. @joanacvp will you follow this, please? Thank you

@anisa-hawes anisa-hawes self-assigned this Oct 6, 2021
@joanacvp
Collaborator Author

joanacvp commented Oct 6, 2021

@DanielAlvesLABDH yes, I will follow this

@anisa-hawes
Contributor

Hello all,

Please note that as part of a reorganisation of the /pt directory, this lesson's .md file has been moved to a new location within our Submissions Repository.

It is now found here: https://github.com/programminghistorian/ph-submissions/blob/gh-pages/pt/esbocos/traducoes/ocr-e-traducao-automatica.md

A consequence is that this lesson's preview link has changed. It is now: http://programminghistorian.github.io/ph-submissions/pt/esbocos/traducoes/ocr-e-traducao-automatica

Please let me know if you encounter any difficulties or have any questions.
Very best, Anisa

@anisa-hawes anisa-hawes moved this from 5 Revision 2 to 6 Sustainability + Accessibility in Active Lessons Apr 21, 2023
@DanielAlvesLABDH
Contributor

@joanacvp can we move forward with this translation, or are the errors identified still getting in the way?

@joanacvp
Collaborator Author

joanacvp commented May 2, 2023

@joanacvp can we move forward with this translation, or are the errors identified still getting in the way?

Good afternoon @felipelmc. Is the translation API working now, so that we can move forward with this lesson? Thank you very much!

@anisa-hawes anisa-hawes moved this from 6 Sustainability + Accessibility to 5 Revision 2 in Active Lessons May 4, 2023
@DanielAlvesLABDH
Contributor

Hello @joanacvp and @felipelmc. I hope all is well. Do you think we can move forward with this translation and review soon? Thank you

@DanielAlvesLABDH
Contributor

Hello @joanacvp and @felipelmc, what is the status of this translation? Do you think it would be better to abandon its publication? Thank you

@felipelmc
Collaborator

Dear all, I hope you are well! I apologise for the delay in replying to the messages here on GitHub. Unfortunately, I have little time to devote to the testing this translation requires, but I remain available for any revisions.

@joanacvp
Collaborator Author

Good morning. Should we wait for the corrections to be implemented in the original version and abandon publication for now, resuming later? What would your opinion and availability be, @felipelmc?

@felipelmc
Collaborator

I believe that is the ideal procedure, @joanacvp. I reiterate that I remain available to review the lesson, but unfortunately I would not be able to adapt the content.

@DanielAlvesLABDH
Contributor

With that in mind, and with thanks for all the work already done, I think it is best to wait and put this review process on hold. Thank you all

@joanacvp
Collaborator Author

Hi @anisa-hawes! We decided to wait until the corrections are implemented in the original version. Should I close the issue or keep it open and label it as Sustainability + Accessibility? Thank you in advance :)

@DanielAlvesLABDH DanielAlvesLABDH moved this from 5 Revision 2 to 6 Sustainability + Accessibility in Active Lessons Dec 1, 2023
@ericbrasiln
Member

@anisa-hawes in this case, wouldn't it be a good idea to create a label to inform that the publication process is on hold due to technical issues? Something like 'on hold' or 'paused'?

@charlottejmc
Collaborator

Hello all,

I've opened Issue #3362 on Jekyll to discuss the possible retirement of the original lesson, /en/OCR-and-Machine-Translation.

If we decide to retire the lesson, this Portuguese translation will unfortunately have to be abandoned. However, if the author is able to revise his original lesson within the next two months, then we may be able to keep working towards a complete translation!

Thank you all very much for your patience ✨
