hedges333/pg_cld2

pg_cld2 1.0.0

Synopsis

Use cld2 language detection from Postgres.

Call the function in a context that expects a record result, such as a FROM clause. The record matches the structure of the pg_cld2_language_detection composite type.

\x on
SELECT * FROM pg_cld2_detect_language('This is a sample text to detect the language.');

This will return a record with the structure:

Field Value
input_bytes 45
text_bytes 46
is_reliable t
valid_prefix_bytes 45
is_valid_utf8 f
mll_cld2_name ENGLISH
mll_language_cname ENGLISH
mll_language_code en
mll_primary_script_name Latin
mll_primary_script_code Latn
mll_script_names Latin
mll_script_codes Latn
mll_ts_name english
language_1_cld2_name ENGLISH
language_1_language_cname ENGLISH
language_1_language_code en
language_1_primary_script_name Latin
language_1_primary_script_code Latn
language_1_script_names Latin
language_1_script_codes Latn
language_1_percent 97
language_1_normalized_score 7.98e-321
language_1_ts_name english
language_2_cld2_name Unknown
language_2_language_cname UNKNOWN_LANGUAGE
language_2_language_code un
language_2_primary_script_name Latin
language_2_primary_script_code Latn
language_2_script_names Latin
language_2_script_codes Latn
language_2_percent 0
language_2_normalized_score 0
language_2_ts_name simple
language_3_cld2_name Unknown
language_3_language_cname UNKNOWN_LANGUAGE
language_3_language_code un
language_3_primary_script_name Latin
language_3_primary_script_code Latn
language_3_script_names Latin
language_3_script_codes Latn
language_3_percent 0
language_3_normalized_score 0
language_3_ts_name simple

This is the information provided by CLD2::ExtDetectLanguageSummaryCheckUTF8.

"MLL" = "Most Likely Language". This is the return value of the CLD2 function, which is probably the same as language_1. (But not guaranteed? I suppose if the probabilities of languages 1 and 2 were the same, it might not be.) See the header file public/compact_lang_det.h in CLD2 if you want to learn more.

The primary_script_name and primary_script_code fields contain the first pick of script name and code. The script_names and script_codes fields contain all of the script names and codes found, as a comma-delimited string, omitting "None" and "Common"/"Zyyy".
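Because the script lists are plain comma-delimited text, they can be split into arrays with a standard Postgres function. This is a minimal sketch, assuming the extension is installed; the alias d and the output column name script_codes are only illustrative.

```sql
-- Split the comma-delimited script codes into a text[] array.
SELECT d.mll_primary_script_code,
       string_to_array(d.mll_script_codes, ',') AS script_codes
FROM pg_cld2_detect_language('This is a sample text to detect the language.') AS d;
```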

The extension also attempts to match each detected language to a text search configuration in pg_catalog.pg_ts_config, for use in tsvector search indexing (the *_ts_name fields).
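One way the *_ts_name guess might be used is to pick the text search configuration per row when building a tsvector. The following is a sketch, not a tested recipe: the inline VALUES list stands in for a real table, and the fallback to 'simple' when no name is returned is an assumption made here, not documented behavior of the extension.

```sql
-- Build a tsvector using the configuration guessed for each text.
-- COALESCE(..., 'simple') is a defensive assumption in case mll_ts_name is NULL.
SELECT t.body,
       d.mll_ts_name,
       to_tsvector(COALESCE(d.mll_ts_name, 'simple')::regconfig, t.body) AS body_tsv
FROM (VALUES ('This is a sample text to detect the language.')) AS t(body)
CROSS JOIN LATERAL pg_cld2_detect_language(t.body) AS d;
```

The cast to regconfig fails if the name does not exist in pg_ts_config, so it may be worth checking the guess before relying on it in an index expression.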

Options

See SELECT pg_cld2_usage();

  return_record := pg_cld2_detect_language(
     text_to_analyze,         -- required
     is_plain_text,           -- boolean, default true. Parses HTML if false
     content_language_hint,   -- text. Ex: "mi,en" boosts Maori & English
     tld_hint,                -- text. Ex: "id" boosts Indonesian
     cld2_language_hint,      -- text, default NULL. Ex: "ITALIAN" boosts it. See pg_cld2_languages table.
     best_effort,             -- boolean, default true. Gives best-effort answer for short text instead of UNKNOWN.
     text_encoding,           -- text, default UTF8, will copy string if not, also sets encoding hint
     tsconfig_language_hint,  -- text, default NULL. Looks up in pg_cld2_languages table, overrides cld2_language_hint.
     locale_hint              -- text, 1st 2 chars, overrides tld_hint.
  );

YMMV.
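As an illustration of the options above, this sketch passes HTML input with a content-language hint. It assumes the arguments can be given positionally in the order listed and that the remaining arguments fall back to their defaults; check the output of pg_cld2_usage() for the authoritative signature.

```sql
-- Second argument false: treat the input as HTML rather than plain text.
-- Third argument: content-language hint boosting Maori and English.
SELECT d.mll_language_code,
       d.is_reliable,
       d.text_bytes
FROM pg_cld2_detect_language('<p>Kia ora koutou katoa</p>', false, 'mi,en') AS d;
```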

Type definition of pg_cld2_language_detection

Here is the type definition with some more informative comments:

CREATE TYPE pg_cld2_language_detection AS (
    input_bytes                     INTEGER,            -- length of original text (after conversion to utf8)
    text_bytes                      INTEGER,            -- non-markup bytes
    is_reliable                     BOOLEAN,            -- CLD2's guess
    valid_prefix_bytes              INTEGER,            -- if != input_bytes: invalid UTF8 after that byte
    is_valid_utf8                   BOOLEAN,            -- short answer whether there are invalid utf8 bytes

    mll_cld2_name                   TEXT,       -- most likely language name, e.g. "ENGLISH" or "NEPALI"
    mll_language_cname              TEXT,       -- language name, e.g. "ENGLISH" or "NEPALI" (only minor differences)
    mll_language_code               TEXT,       -- language code, e.g. "en" or "ne"
    mll_primary_script_name         TEXT,       -- first pick of script names, e.g. "Latin" or "Devanagari"
    mll_primary_script_code         TEXT,       -- first pick of script codes, e.g. "Latn" or "Deva"
    mll_script_names                TEXT,       -- all possible script names, e.g. "Latin,Devanagari" or "Devanagari,Latin" (skips "Common")
    mll_script_codes                TEXT,       -- all possible script codes, e.g. "Latn,Deva" or "Deva,Latn" (skips "Zyyy")
    mll_ts_name                     TEXT,       -- guess from pg_catalog.pg_ts_config, e.g. "english" or "nepali"

    language_1_cld2_name            TEXT,       -- first language name, e.g. "ENGLISH" or "NEPALI"
    language_1_language_cname       TEXT,       -- language name, e.g. "ENGLISH" or "NEPALI" (only minor differences)
    language_1_language_code        TEXT,       -- language code, e.g. "en" or "ne"
    language_1_primary_script_name  TEXT,       -- script name, e.g. "Latin" or "Devanagari"
    language_1_primary_script_code  TEXT,       -- script code, e.g. "Latn" or "Deva"
    language_1_script_names         TEXT,       -- script names, e.g. "Latin,Devanagari" or "Devanagari,Latin"
    language_1_script_codes         TEXT,       -- script codes, e.g. "Latn,Deva" or "Deva,Latn"
    language_1_percent              INTEGER,            -- how likely this language is
    language_1_normalized_score     DOUBLE PRECISION,   -- mumble mumble
    language_1_ts_name              TEXT,       -- guess from pg_catalog.pg_ts_config, e.g. "english" or "nepali"

    language_2_cld2_name            TEXT,       -- second likely language name
    language_2_language_cname       TEXT,       -- etc.
    language_2_language_code        TEXT,
    language_2_primary_script_name  TEXT,       -- script name, e.g. "Latin" or "Devanagari"
    language_2_primary_script_code  TEXT,       -- script code, e.g. "Latn" or "Deva"
    language_2_script_names         TEXT,
    language_2_script_codes         TEXT,
    language_2_percent              INTEGER,
    language_2_normalized_score     DOUBLE PRECISION,
    language_2_ts_name              TEXT,

    language_3_cld2_name            TEXT,       -- third likely language name
    language_3_language_cname       TEXT,       -- etc.
    language_3_language_code        TEXT,
    language_3_primary_script_name  TEXT,       -- script name, e.g. "Latin" or "Devanagari"
    language_3_primary_script_code  TEXT,       -- script code, e.g. "Latn" or "Deva"
    language_3_script_names         TEXT,
    language_3_script_codes         TEXT,
    language_3_percent              INTEGER,
    language_3_normalized_score     DOUBLE PRECISION,
    language_3_ts_name              TEXT

);

Requirements

The CLD2 libraries must be installed on your system.

Contributing

I tested it only to the point of confirming that it returns the results of the call to the CLD2 function. I figure that library tests itself well enough. If you'd like to add more tests, please open a pull request.

Author

Mark Hedges.

Copyright and License

Unofficially, this is "Jobware." If it's useful to you, please help me find a job.

Officially:

MIT License

Copyright (c) 2024 Mark Hedges

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.