Releases: hyunwoongko/kss
v6.0.4
v6.0.2
- Add alias() function and fix some docs.
v6.0.1
- [hotfix] Rename idiom.txt in MANIFEST.in to idioms.txt
v6.0.0
KSS: Korean String processing Suite
KSS is a Korean string processing suite that provides various functions for processing Korean strings. It is designed to be simple and easy to use, and it is intended for fields such as natural language processing, data preprocessing, and data analysis.
Usage
1. Basic Usage
All functions can be used by creating an instance of the Kss class and calling the instance with the inputs.
from kss import Kss
module = Kss("MODULE_NAME")
output = module("YOUR_INPUT_STRING", **kwargs)
2. Available Modules
If you want to check the available modules, you can use the available() function.
from kss import Kss
Kss.available()
['augment', 'collocate', 'g2p', 'hangulize', 'split_hanja', 'is_hanja', 'hanja2hangul', 'h2j', 'h2hcj', 'j2h', 'j2hcj', 'hcj2h', 'hcj2j', 'is_jamo', 'is_jamo_modern', 'is_hcj', 'is_hcj_modern', 'is_hangul_char', 'select_josa', 'combine_josa', 'extract_keywords', 'split_morphemes', 'paradigm', 'anonymize', 'clean_news', 'is_completed_form', 'get_all_completed_form_hangul_chars', 'get_all_incompleted_form_hangul_chars', 'filter_out', 'half2full', 'reduce_char_repeats', 'reduce_emoticon_repeats', 'remove_invisible_chars', 'normalize', 'preprocess', 'qwerty', 'romanize', 'is_unsafe', 'split_sentences', 'correct_spacing', 'summarize_sentences']
3. Checking the usage of each module
If you want to check the usage of each module, you can use the help() function.
from kss import Kss
module = Kss("split_sentences")
module.help()
Split texts into sentences.
Args:
    text (Union[str, List[str], Tuple[str]]): single text or list/tuple of texts
    backend (str): morpheme analyzer backend. 'mecab', 'pecab', 'punct' are supported
    num_workers (Union[int, str]): the number of multiprocessing workers
    strip (bool): strip all sentences or not
    return_morphemes (bool): whether to return morphemes or not
    ignores (List[str]): list of strings to ignore
Returns:
    Union[List[str], List[List[str]]]: outputs of sentence splitting
Examples:
>>> from kss import Kss
>>> split_sentences = Kss("split_sentences")
>>> text = "회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요 다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다 강남역 맛집 토끼정의 외부 모습."
>>> split_sentences(text)
['회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요', '다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다', '강남역 맛집 토끼정의 외부 모습.']
4. Multiprocessing
If you pass a list of strings, Kss automatically uses multiprocessing to process them in parallel.
You can set the number of processes with the num_workers parameter.
If num_workers is less than 2, Kss does not use multiprocessing.
from kss import Kss
module = Kss("MODULE_NAME")
# using all cores
output = module(["YOUR_INPUT_STRING1", "YOUR_INPUT_STRING2", ...], **kwargs)
# using 4 cores
output = module(["YOUR_INPUT_STRING1", "YOUR_INPUT_STRING2", ...], num_workers=4, **kwargs)
# using 1 core (no multiprocessing)
output = module(["YOUR_INPUT_STRING1", "YOUR_INPUT_STRING2", ...], num_workers=1, **kwargs)
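The worker-count rule described above can be sketched in a few lines. This is only an illustration of the rule as stated in the text, not Kss's actual implementation: the function name resolve_num_workers and the "auto" default value are assumptions made for the example.

```python
import os
from typing import Sequence, Union


def resolve_num_workers(
    inputs: Union[str, Sequence[str]],
    num_workers: Union[int, str] = "auto",  # "auto" is an assumed spelling
) -> int:
    """Return how many worker processes to use; 1 means no multiprocessing."""
    # A single string leaves nothing to parallelize.
    if isinstance(inputs, str):
        return 1
    # Assumed default: use every available core.
    if num_workers == "auto":
        num_workers = os.cpu_count() or 1
    num_workers = int(num_workers)
    # Anything below 2 disables multiprocessing.
    if num_workers < 2:
        return 1
    return num_workers


print(resolve_num_workers("one string", 4))       # single input: serial
print(resolve_num_workers(["a", "b", "c"], 4))    # list input: 4 workers
print(resolve_num_workers(["a", "b", "c"], 1))    # explicitly serial
```

Keeping this decision in one small function makes it easy to see that a single string and num_workers=1 both end up on the same serial path.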
5. Backward Compatibility
Older versions of Kss exposed each feature as a plain function. Kss still supports this functional style for backward compatibility.
from kss import split_sentences
output = split_sentences("YOUR_INPUT_STRING", **kwargs)
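One way a class-based entry point and plain functions can coexist is a thin dispatcher over a name-to-function registry. The sketch below is a generic illustration of that pattern and makes no claim about how Kss is built internally; MiniSuite, reverse_text, and count_chars are hypothetical names invented for the example.

```python
from typing import Callable, Dict, List


def reverse_text(text: str) -> str:
    """Hypothetical module: return the text reversed."""
    return text[::-1]


def count_chars(text: str) -> int:
    """Hypothetical module: return the number of characters."""
    return len(text)


# Registry mapping module names to the plain functions that implement them.
_REGISTRY: Dict[str, Callable] = {
    "reverse_text": reverse_text,
    "count_chars": count_chars,
}


class MiniSuite:
    """Look a function up by name once, then call the instance like the function."""

    def __init__(self, name: str):
        if name not in _REGISTRY:
            raise ValueError(f"unknown module {name!r}; choose from {self.available()}")
        self._func = _REGISTRY[name]

    @staticmethod
    def available() -> List[str]:
        return list(_REGISTRY)

    def __call__(self, *args, **kwargs):
        return self._func(*args, **kwargs)

    def help(self) -> None:
        print(self._func.__doc__)


# Both styles reach the same underlying function.
print(MiniSuite("reverse_text")("abc"))
print(reverse_text("abc"))
```

Because the class only forwards calls, the functional style keeps working unchanged for callers who import the functions directly.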
Supported Modules
See here for more details.
v5.2.0
- Add is_compilable() function to check whether the Cython implementation is available in the user's environment.
# Assumes new_compiler (distutils.ccompiler), customize_compiler (distutils.sysconfig)
# and the project's own get_extra_compile_args helper are in scope.
def is_compilable():
    try:
        # 1. Try to compile csrc/sentence_splitter.cpp
        extra_compile_args, extra_link_args = get_extra_compile_args()
        compiler = new_compiler()
        customize_compiler(compiler)
        compiler.compile(['csrc/sentence_splitter.cpp'], extra_postargs=extra_compile_args)
        return True
    except Exception:
        # 2. Cannot compile csrc/sentence_splitter.cpp
        return False
v5.1.0
The fast backend
If you want to split sentences quickly, you can use the split_sentences function with the backend='fast' option, available from Kss 5.0.0. This method is based on the fast algorithm used in Kss versions prior to 3.0. It is significantly faster than the mecab backend but less accurate, so it is useful when you need to split sentences very quickly and do not need high accuracy. The fast backend is implemented in both Python and Cython.
- If your environment supports the installation of Cython, Kss will use the Cython implementation, which is the fastest option (about x600 faster than mecab).
- Otherwise, it will use the Python implementation, which is slower than the Cython version but still faster than the mecab backend (about x4 faster than mecab).
Given the substantial speed advantage of the Cython implementation, it is strongly recommended over the Python alternative. Kss automatically detects whether Cython is available in your environment and installs it if feasible, so you don't need to worry about Cython and C++ dependencies.
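The "compiled if available, otherwise pure Python" choice described above is commonly written as an import fallback. The following is a minimal sketch of that pattern only, not Kss's own loader: the module name hypothetical_compiled_splitter, the function load_fast_splitter, and the naive python_split stand-in are all assumptions made for the example.

```python
import importlib
import re
from typing import Callable, List, Tuple


def python_split(text: str) -> List[str]:
    """Stand-in pure-Python splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def load_fast_splitter(
    compiled_module: str = "hypothetical_compiled_splitter",  # assumed name
) -> Tuple[Callable[[str], List[str]], str]:
    """Prefer a compiled extension; fall back to the Python version if it is missing."""
    try:
        module = importlib.import_module(compiled_module)
        return module.split, "compiled"
    except (ImportError, AttributeError):
        return python_split, "python"


split, implementation = load_fast_splitter()
print(implementation)
print(split("First sentence. Second one? Third!"))
```

Returning the implementation name alongside the function makes it easy to log which path was taken, which matters when the two differ this much in speed.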
Accuracy (Normalized F1)
Backend | blogs_ko | blogs_lee | nested | sample | tweets | v_ending | wikipedia |
---|---|---|---|---|---|---|---|
mecab | 0.8860 | 0.8887 | 0.9206 | 0.9682 | 0.8137 | 0.4815 | 1.0000 |
fast (Python) | 0.6281 | 0.7899 | 0.6899 | 0.7482 | 0.5315 | 0.1596 | 0.7358 |
fast (Cython) | 0.6545 | 0.8132 | 0.6372 | 0.8407 | 0.5892 | 0.1596 | 0.9566 |
Speed (msec)
Backend | blogs_ko | blogs_lee | nested | sample | tweets | v_ending | wikipedia |
---|---|---|---|---|---|---|---|
mecab | 538.10 | 293.31 | 225.05 | 56.35 | 184.91 | 20.55 | 899.99 |
fast (Python) | 146.75 | 70.94 | 52.84 | 12.11 | 37.80 | 4.69 | 255.90 |
fast (Cython) | 0.91 | 0.55 | 0.46 | 0.09 | 0.40 | 0.05 | 1.12 |
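As a quick sanity check on the "x600" and "x4" figures, the speedups can be recomputed from the timings in the speed table. The snippet below uses only the numbers shown in that table; summing the seven datasets before dividing is a choice made for this example, and the release notes do not say how the headline figures were derived.

```python
# Timings in milliseconds, copied from the speed table
# (blogs_ko, blogs_lee, nested, sample, tweets, v_ending, wikipedia).
timings_ms = {
    "mecab": [538.10, 293.31, 225.05, 56.35, 184.91, 20.55, 899.99],
    "fast (Python)": [146.75, 70.94, 52.84, 12.11, 37.80, 4.69, 255.90],
    "fast (Cython)": [0.91, 0.55, 0.46, 0.09, 0.40, 0.05, 1.12],
}


def speedup_over(baseline: str, backend: str) -> float:
    """Ratio of total baseline time to total backend time across all datasets."""
    return sum(timings_ms[baseline]) / sum(timings_ms[backend])


for backend in ("fast (Python)", "fast (Cython)"):
    print(f"{backend}: {speedup_over('mecab', backend):.1f}x faster than mecab")
```

Aggregated this way, the Python implementation comes out at a little under 4x and the Cython implementation at a little over 600x, in line with the rounded figures quoted above.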
Please note that while the core algorithm in the fast backend mirrors that of Kss C++ 1.3.1, several bugs identified in the original implementation have been rectified in Kss 5.0.0.
v4.5.4
v4.5.3
v4.5.2
v4.5.1
- Hotfix of some bugs in 4.5.0