Overview

To sync changes with Anki, see AnkiConnect: https://github.com/FooSoft/anki-connect

For docker-compose:

cd moonspeak
docker compose up

Open your web browser: http://localhost:80/

Contributors

  • justDOOVE (https://github.com/JustDOOVE)

    • Create the frequency module (with unit tests and all):
      • Extract kanji from URLs (website crawler or image OCR)
      • Extract kanji from text
  • Shtekovski (https://github.com/Shtekovski)

    • Create the telegram bot for dev tasks
      • Work with server load: cpu, disk, memory.
      • Work with docker metrics (number of containers, images)
  • BAStos525 (https://github.com/BAStos525)

    • Contribute to onyomi keywords tool
      • Read word phonetic from cmudict
      • Logic to find onyomi keyword candidates according to phonetic rules e.g. must have trailing vowels
      • API to record the chosen keyword

Onyomi keywords

The files related to onyomi-keywords reside in the "onyomi-keywords/" directory. There is also a more detailed README in that directory.

Each onyomi needs a keyword.

  • it should NOT be one of the 10000 most common English words.
  • it should be unique for each onyomi.
  • it should be written like the onyomi transcription.

Sample keywords for "な..." onyomi are below:

na  =  ナ  =  な  =  NApoleon
nai  =  ナイ  =  ない  =  NAIve
nan  =  ナン  =  なん  =  NANny
nei  =  ネイ  =  ねい  =  NEIghbourhood
nen  =  ネン  =  ねん  =  NEN (hunter x hunter)
netsu  =  ネツ  =  ねつ  =  NETScape

Results

The onyomi keywords in plain text format are in file:

  • onyomi-keywords.txt

The results are also packaged as Anki decks (.txt and .apkg).

A web tool (backend in Python, frontend in React+Redux) helps decide on the most fitting onyomi keyword. To run the tool (needs Python 3.9 due to type annotations, and the AnkiConnect plugin to sync with Anki):

cd onyomi-keywords/backend
python main.py

Then navigate to http://localhost:8080 in a browser. You should see the screenshot below:

Resources

Sources used to decide the most appropriate ONYOMI keywords are in the "resources/" directory:

Kanji keywords

The files related to kanji-keywords reside in the "kanji-keywords/" directory. There is also a more detailed README in that directory.

Each kanji needs a keyword.

  • it should NOT be one of the 10000 most common English words.
  • it should NOT be one of the onyomi keywords.
  • it should be unique for each kanji.
  • it should reflect meaning of the kanji.

The most common English words can be taken from the list of the 1/3 million most frequent words in the Google English corpus (https://norvig.com/ngrams/count_1w.txt)
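As a rough sketch of how such a word list could be turned into the "10000 most common words" set, the function below reads lines holding a word and a count and keeps the most frequent entries. It assumes each line has exactly two whitespace-separated fields (word, then count); `load_common_words` is a name made up for this example, and the sample counts are illustrative, not real corpus figures.

```python
def load_common_words(lines, limit=10000):
    """Return the `limit` most frequent words from 'word count' lines."""
    entries = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, count = parts
        entries.append((word.lower(), int(count)))
    # Sort explicitly by count so the result does not depend on file order.
    entries.sort(key=lambda entry: entry[1], reverse=True)
    return {word for word, _ in entries[:limit]}


# Illustrative lines only; the counts are made up for this example.
sample = ["the\t300", "of\t200", "napoleon\t1"]
common = load_common_words(sample, limit=2)
print("napoleon" in common)  # False
```

With a downloaded copy of the file, the same function can be fed an open file object, since iterating over a text file yields its lines.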

Previous works:

Results

A web tool (backend in Python, frontend in Elm) helps decide on the most fitting kanji keyword. To run the tool (needs Python 3.9 due to type annotations):

cd kanji-keywords/backend
python main.py

Then navigate to http://localhost:9000 in a browser. You should see the screenshot below:

The tool works with the SQLite database file "kanji-keywords/kanji-keywords.db". It has a single table, "kanjikeywords", which holds the results of using the tool. The table has three columns: kanji, keyword, and additional text notes.
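The snippet below sketches what reading such a table could look like with Python's built-in sqlite3 module. Only the table name "kanjikeywords" comes from the description above; the column names (`kanji`, `keyword`, `notes`), their types, and the inserted row are assumptions for illustration. An in-memory database is used so the sketch runs without the real file.

```python
import sqlite3

# In-memory database so the sketch runs without kanji-keywords.db.
conn = sqlite3.connect(":memory:")

# Assumed schema: the real column names and types may differ.
conn.execute(
    "CREATE TABLE kanjikeywords (kanji TEXT PRIMARY KEY, keyword TEXT, notes TEXT)"
)

# Placeholder row, not an actual result from the tool.
conn.execute(
    "INSERT INTO kanjikeywords VALUES (?, ?, ?)",
    ("勇", "example-keyword", "placeholder note"),
)

rows = conn.execute("SELECT kanji, keyword, notes FROM kanjikeywords").fetchall()
print(rows)
conn.close()
```

To inspect the real file, the connection string would be the path to "kanji-keywords/kanji-keywords.db", and the actual schema should be checked first.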

Useful resources

Resources to compile a list of 1700 most commonly seen kanji:

Japanese expressions ordered by frequency (to find common expressions with a particular kanji):

English words ordered by frequency (to find less common words):

English thesaurus to find synonyms to keywords:

Existing lists of kanji keywords, they are used as suggestions for keywords:

  • keywords-kanjidic2-meanings.json list of kanji and possible meaning extracted from kanjidic2.xml by a custom script for this project.
  • keywords-scriptin-kanji-keys.json list of unique (!) kanji keywords assigned by scriptin (https://github.com/scriptin/kanji-keys)

Other:

To find names for kanji that are just made-up junk, consider using drawing-to-keyword mapping software:

Kanji breakdown

A radical is a kanji with no ONYOMI and no sub-kanji. For the breakdown it does not matter whether an element is a kanji, a radical or a handmade drawing. So, against common usage, let's call everything just kanji.

Extremely useful:

Useful resources

Previous works on kanji breakdown:

  • List-of-200-radicals-used-in-Hanyu-Da-Cidian.pdf: breakdown for Chinese characters
  • kradfile-u: like kradfile, but with improved breakdowns and done in Unicode!

Methodology

There can be different ways to break down a kanji.

To find the best breakdown:

  1. Find possible breakdowns
  2. For each component in a breakdown find its possible appearances (as own kanji or sub-kanji in another kanji)
  3. For each appearance find how frequently this separate identity is enforced, i.e. frequency of own kanji, frequency of kanji with this radical.
  4. Sum up all the frequencies.
  5. The breakdown with highest frequency wins. Its members most often appear as components/kanji.
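The steps above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the project's implementation: `score_breakdown` and `best_breakdown` are hypothetical names, the data structures are one possible choice, and the letters and counts in the toy data are placeholders rather than real kanji frequencies.

```python
def score_breakdown(components, appearances, frequency):
    """Sum the frequencies of every kanji in which the components appear.

    components:  tuple of components in one candidate breakdown (step 1)
    appearances: dict mapping a component to the kanji where it appears,
                 including itself if it is a kanji in its own right (step 2)
    frequency:   dict mapping a kanji to its occurrence count (step 3)
    """
    total = 0
    for component in components:
        for kanji in appearances.get(component, []):
            total += frequency.get(kanji, 0)  # step 4: sum up all frequencies
    return total


def best_breakdown(breakdowns, appearances, frequency):
    """Step 5: return the breakdown with the highest summed frequency."""
    return max(breakdowns, key=lambda b: score_breakdown(b, appearances, frequency))


# Toy data with placeholder symbols.
frequency = {"X": 10, "Y": 5, "Z": 100, "W": 1}
appearances = {"p": ["X"], "q": ["Y"], "r": ["Z", "W"]}
breakdowns = [("p", "q"), ("r",)]

scores = [score_breakdown(b, appearances, frequency) for b in breakdowns]
best = best_breakdown(breakdowns, appearances, frequency)
print(scores, best)
```

"No breakdown" can be modelled as a breakdown with a single component, the kanji itself. A kanji that contains two components of the same breakdown is counted twice in this sketch; whether that is desirable is a design choice the steps above leave open.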

Useful links:

Example investigation

What is the best way to break up 勇? Is it (マ + 男) OR (甬 + 力)? Maybe it's best not to break it up at all, e.g. when the kanji appears much more often than its parts?

Take the first possible breakdown (マ + 男)

Investigate components:

  • マ never appears on its own, but appears in 甬
  • 男 appears on its own and in 虜

Frequency evaluation:

  • 甬 appears 9 times
  • 男 appears 95900 times
  • 虜 appears 995 times

In total components appear (9 + 95900 + 995) ~ 97000 times

Take the second possible breakdown (甬 + 力)

Investigate components:

  • 甬 appears on its own and in 踊, 桶, 勇(current investigation), 痛, 通
  • 力 appears on its own and in about 24 other kanji

Frequency evaluation:

  • 甬 appears 9 times
  • 踊 appears 8799 times
  • 痛 appears 20230 times
  • 通 appears 109080 times
  • 力 appears 112027 times
  • ....

In total components appear (9 + 8799 + 20230 + 109080 + 112027 + ...) > 250000 times (the listed terms alone sum to 250145)

Take the third possible breakdown (no breakdown)

Frequency evaluation:

  • 勇 appears 12432 times

Result

Looking at the frequency evaluations, the most commonly seen pattern is the breakdown (甬 + 力), because its members get more appearances as kanji.
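The arithmetic of this investigation can be reproduced with a short script. It uses only the frequencies listed above. The entries hidden behind "..." in the second breakdown are left out, so that total is a lower bound. The dictionary layout and variable names are just one way to write it down.

```python
# Frequencies exactly as listed in the example investigation.
breakdown_1 = {"甬": 9, "男": 95900, "虜": 995}
# Only the listed entries; the "..." ones are omitted, so this is a lower bound.
breakdown_2 = {"甬": 9, "踊": 8799, "痛": 20230, "通": 109080, "力": 112027}
no_breakdown = {"勇": 12432}

totals = {
    "マ + 男": sum(breakdown_1.values()),
    "甬 + 力": sum(breakdown_2.values()),
    "no breakdown": sum(no_breakdown.values()),
}

# The breakdown with the highest summed frequency wins.
winner = max(totals, key=totals.get)
print(totals)
print(winner)
```

Even as a lower bound, the (甬 + 力) total is more than twice that of (マ + 男), so the omitted entries cannot change the ranking.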