30 Apr 18:54

hyunwoongko

34f0168

v6.0.4 Latest

Reimplement the hanja module because it could not be installed in the Colab environment.
Fix information about the morpheme analyzer backend in the README and docs.

28 Apr 14:11

hyunwoongko

v6.0.2

a183292

v6.0.2

Add alias() function and fix some docs.

28 Apr 13:44

hyunwoongko

v6.0.1

0010fda

v6.0.1

[hotfix] Rename idiom.txt in MANIFEST.in to idioms.txt

27 Apr 14:23

hyunwoongko

v6.0.0

91355e2

v6.0.0

KSS: Korean String processing Suite

KSS is a Korean string processing suite that provides various functions for processing Korean strings. It is designed to be simple and easy to use in fields such as natural language processing, data preprocessing, and data analysis.

Usage

1. Basic Usage

All functions can be used by creating an instance of the Kss class and calling the instance with the inputs.

from kss import Kss

module = Kss("MODULE_NAME")
output = module("YOUR_INPUT_STRING", **kwargs)

2. Available Modules

To list the available modules, call the available() function.

from kss import Kss

Kss.available()

['augment', 'collocate', 'g2p', 'hangulize', 'split_hanja', 'is_hanja', 'hanja2hangul', 'h2j', 'h2hcj', 'j2h', 'j2hcj', 'hcj2h', 'hcj2j', 'is_jamo', 'is_jamo_modern', 'is_hcj', 'is_hcj_modern', 'is_hangul_char', 'select_josa', 'combine_josa', 'extract_keywords', 'split_morphemes', 'paradigm', 'anonymize', 'clean_news', 'is_completed_form', 'get_all_completed_form_hangul_chars', 'get_all_incompleted_form_hangul_chars', 'filter_out', 'half2full', 'reduce_char_repeats', 'reduce_emoticon_repeats', 'remove_invisible_chars', 'normalize', 'preprocess', 'qwerty', 'romanize', 'is_unsafe', 'split_sentences', 'correct_spacing', 'summarize_sentences']
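
The calling pattern above (pick a module by name, then call the instance) can be illustrated with a small name-based registry. This is only a sketch of the general pattern and is not taken from the Kss source; `ToySuite`, `register`, `reverse`, and `repeat` are hypothetical names invented for the illustration.

```python
from typing import Callable, Dict, List

# Hypothetical registry: maps a module name to the function that implements it.
# None of these names come from Kss itself.
_REGISTRY: Dict[str, Callable[..., object]] = {}


def register(name: str):
    """Decorator that stores a function in the registry under `name`."""
    def decorator(func):
        _REGISTRY[name] = func
        return func
    return decorator


class ToySuite:
    """Looks up a registered function by name and forwards calls to it."""

    def __init__(self, name: str):
        if name not in _REGISTRY:
            raise ValueError(f"unknown module {name!r}; choose from {sorted(_REGISTRY)}")
        self._func = _REGISTRY[name]

    def __call__(self, text, **kwargs):
        # Keyword arguments are passed through to the underlying function.
        return self._func(text, **kwargs)

    @staticmethod
    def available() -> List[str]:
        """Return the registered module names in registration order."""
        return list(_REGISTRY)


@register("reverse")
def reverse(text: str) -> str:
    return text[::-1]


@register("repeat")
def repeat(text: str, times: int = 2) -> str:
    return text * times
```

With this sketch, `ToySuite("repeat")("ab", times=3)` forwards the keyword argument to the registered function, much as `**kwargs` is forwarded in the usage shown above.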

3. Checking the usage of each module

To check the usage of a module, call its help() function.

from kss import Kss

module = Kss("split_sentences")
module.help()

Split texts into sentences.

Args:
    text (Union[str, List[str], Tuple[str]]): single text or list/tuple of texts
    backend (str): morpheme analyzer backend. 'mecab', 'pecab', 'punct' are supported
    num_workers (Union[int, str]): the number of multiprocessing workers
    strip (bool): strip all sentences or not
    return_morphemes (bool): whether to return morphemes or not
    ignores (List[str]): list of strings to ignore

Returns:
    Union[List[str], List[List[str]]]: outputs of sentence splitting

Examples:
    >>> from kss import Kss
    >>> split_sentences = Kss("split_sentences")
    >>> text = "회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요 다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다 강남역 맛집 토끼정의 외부 모습."
    >>> split_sentences(text)
    ['회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요', '다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다', '강남역 맛집 토끼정의 외부 모습.']

4. Multiprocessing

If you pass a list of strings, Kss automatically uses multiprocessing to process them in parallel.
You can set the number of processes with the num_workers parameter.
If num_workers is less than 2, Kss does not use multiprocessing.

from kss import Kss

module = Kss("MODULE_NAME")

# using all cores
output = module(["YOUR_INPUT_STRING1", "YOUR_INPUT_STRING2", ...], **kwargs)
# using 4 cores
output = module(["YOUR_INPUT_STRING1", "YOUR_INPUT_STRING2", ...], num_workers=4, **kwargs)
# using 1 core (no multiprocessing)
output = module(["YOUR_INPUT_STRING1", "YOUR_INPUT_STRING2", ...], num_workers=1, **kwargs)
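
The dispatch rule described above (a list is processed in parallel, and fewer than 2 workers means no multiprocessing) could be sketched with the standard library as follows. This is an illustration under stated assumptions, not the Kss implementation: `run` and `shout` are hypothetical names, `num_workers=None` standing for "all cores" is an assumption, and running a single-item list sequentially is a choice made only for this sketch.

```python
import multiprocessing as mp
from typing import Callable, List, Optional, Sequence, Union


def run(
    func: Callable[[str], str],
    inputs: Union[str, Sequence[str]],
    num_workers: Optional[int] = None,
) -> Union[str, List[str]]:
    """Apply `func` to one string, or to each string in a list.

    Assumption for this sketch: None means "use every available core".
    """
    # A single string is processed directly, with no pool involved.
    if isinstance(inputs, str):
        return func(inputs)

    workers = mp.cpu_count() if num_workers is None else num_workers

    # Fewer than 2 workers (or nothing to parallelize): plain sequential loop.
    if workers < 2 or len(inputs) < 2:
        return [func(item) for item in inputs]

    # Otherwise spread the items over a process pool; map() preserves order.
    with mp.Pool(processes=min(workers, len(inputs))) as pool:
        return pool.map(func, inputs)


def shout(text: str) -> str:
    """Toy stand-in for a string-processing function."""
    return text.upper()
```

Calling `run(shout, ["a", "b"], num_workers=1)` stays in the current process, while a larger `num_workers` hands the list to a pool. On platforms that start workers with "spawn", the parallel call should be made under an `if __name__ == "__main__":` guard.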

5. Backward Compatibility

Older versions of Kss exposed each feature as a plain function. Kss still supports this functional style for backward compatibility.

from kss import split_sentences

output = split_sentences("YOUR_INPUT_STRING", **kwargs)

Supported Modules

See here for more details.

02 Apr 00:34

hyunwoongko

v5.2.0

79b3956

v5.2.0

Add the is_compilable() function to check whether the Cython implementation is available in the user's environment.

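# Assumed imports: distutils left the standard library in Python 3.12,
# but setuptools provides a compatible shim.
from distutils.ccompiler import new_compiler
from distutils.sysconfig import customize_compiler


def is_compilable():
    try:
        # 1. Try to compile csrc/sentence_splitter.cpp
        # get_extra_compile_args() is a helper that is not shown in this note
        extra_compile_args, extra_link_args = get_extra_compile_args()
        compiler = new_compiler()
        customize_compiler(compiler)
        compiler.compile(['csrc/sentence_splitter.cpp'], extra_postargs=extra_compile_args)
        return True
    except Exception:
        # 2. Cannot compile csrc/sentence_splitter.cpp
        return False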

31 Mar 21:23

hyunwoongko

v5.1.0

db440cf

v5.1.0

The `fast` backend

If you want to split sentences quickly, you can use the split_sentences function with the backend='fast' option, available from Kss 5.0.0. This method is based on the fast algorithm used in Kss versions prior to 3.0. It is significantly faster than the mecab backend but less accurate, so it is useful when you need to split sentences very quickly and do not need high accuracy. The fast backend is implemented in both Python and Cython.

If your environment supports the installation of Cython, Kss will use the Cython implementation, which is the fastest option (about 600x faster than mecab).
Otherwise, it will use the Python implementation, which is slower than the Cython version but still faster than the mecab backend (about 4x faster than mecab).

Given the substantial speed advantage of the Cython implementation, it is strongly recommended over the Python alternative. Kss automatically detects the availability of Cython in your environment and will install it if feasible, so you don't need to worry about Cython and C++ dependencies.
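
The "use the compiled version if it is available, otherwise fall back to pure Python" behaviour described above is a common pattern. A minimal sketch of it is shown below; it is not the Kss loader. `load_first_available` and `_hypothetical_compiled_splitter` are made-up names, and the standard-library `json` module only stands in for a pure-Python fallback.

```python
import importlib
from types import ModuleType
from typing import Sequence


def load_first_available(candidates: Sequence[str]) -> ModuleType:
    """Return the first importable module from `candidates`, tried in order."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # This candidate is not installed or not built; try the next one.
            continue
    raise ImportError(f"none of {list(candidates)} could be imported")


# Hypothetical names: a compiled extension that does not exist here,
# followed by a module that is always importable.
backend = load_first_available(["_hypothetical_compiled_splitter", "json"])
print(backend.__name__)  # prints "json", because the first candidate is missing
```

Ordering the candidates from fastest to slowest means callers always get the best implementation their environment can provide, without having to check for a compiler themselves.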

Accuracy (Normalized F1)

| Backend | blogs_ko | blogs_lee | nested | sample | tweets | v_ending | wikipedia |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `mecab` | 0.8860 | 0.8887 | 0.9206 | 0.9682 | 0.8137 | 0.4815 | 1.0000 |
| `fast` (Python) | 0.6281 | 0.7899 | 0.6899 | 0.7482 | 0.5315 | 0.1596 | 0.7358 |
| `fast` (Cython) | 0.6545 | 0.8132 | 0.6372 | 0.8407 | 0.5892 | 0.1596 | 0.9566 |

Speed (msec)

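| Backend | blogs_ko | blogs_lee | nested | sample | tweets | v_ending | wikipedia |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `mecab` | 538.10 | 293.31 | 225.05 | 56.35 | 184.91 | 20.55 | 899.99 |
| `fast` (Python) | 146.75 | 70.94 | 52.84 | 12.11 | 37.80 | 4.69 | 255.90 |
| `fast` (Cython) | 0.91 | 0.55 | 0.46 | 0.09 | 0.40 | 0.05 | 1.12 |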
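
The quoted speedups can be checked against the timings above. The sketch below copies the numbers from the speed table and compares total time across the seven datasets. Summing over datasets is just one reasonable way to aggregate and is an assumption of this example, as is the helper name `speedup_over_mecab`.

```python
# Timings in milliseconds, copied from the speed table
# (blogs_ko, blogs_lee, nested, sample, tweets, v_ending, wikipedia).
speed_msec = {
    "mecab": [538.10, 293.31, 225.05, 56.35, 184.91, 20.55, 899.99],
    "fast (Python)": [146.75, 70.94, 52.84, 12.11, 37.80, 4.69, 255.90],
    "fast (Cython)": [0.91, 0.55, 0.46, 0.09, 0.40, 0.05, 1.12],
}


def speedup_over_mecab(backend: str) -> float:
    """Ratio of total mecab time to total time of `backend` over all datasets."""
    return sum(speed_msec["mecab"]) / sum(speed_msec[backend])


for name in ("fast (Python)", "fast (Cython)"):
    print(f"{name}: {speedup_over_mecab(name):.1f}x faster than mecab")
```

Aggregated this way, the Python implementation comes out at roughly 3.8x and the Cython implementation at roughly 620x, which is consistent with the rounded 4x and 600x figures.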

Please note that while the core algorithm in the fast backend mirrors that of Kss C++ 1.3.1, several bugs identified in the original implementation have been rectified in Kss 5.0.0.

14 Jul 06:18

hyunwoongko

v4.5.4

5b0024c

v4.5.4

Fix multiprocessing. (#67)

17 May 03:09

hyunwoongko

v4.5.3

4b79728

v4.5.3

Add return_pos parameter to split_morphemes function. (#64)

16 May 18:11

hyunwoongko

v4.5.2

f3ba148

v4.5.2

Fix a bug reported in #60.

25 Jan 12:57

hyunwoongko

v4.5.1

1812f29

v4.5.1

Hotfix for some bugs in 4.5.0.

Releases: hyunwoongko/kss
