flb_encoding: charset encoding for input plugins #2420

bluebike · 2020-08-04T20:50:58Z

Adds the flb_encoding library for charset conversion (CHARSET => UTF-8).

Uses the lib/tutf8e library.
Only 8-bit charsets are supported.
Initially only the in_tail plugin is supported; in_syslog and others will follow.
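
To illustrate what an 8-bit charset to UTF-8 conversion involves, here is a minimal sketch for ISO-8859-1, where every byte value equals its Unicode code point. The function name `latin1_to_utf8` is hypothetical and is not part of tutf8e or flb_encoding. Charsets such as windows-1252 additionally need a lookup table for the 0x80..0x9F range, which is the kind of data a library like tutf8e supplies.

```c
#include <stddef.h>

/*
 * Hypothetical helper (an assumption for illustration, not the PR's API).
 * Converts ISO-8859-1 bytes to a NUL-terminated UTF-8 string.
 * Returns the number of bytes written (excluding the NUL),
 * or -1 if the output buffer is too small.
 */
int latin1_to_utf8(const unsigned char *in, size_t in_len,
                   char *out, size_t out_size)
{
    size_t i;
    size_t o = 0;

    for (i = 0; i < in_len; i++) {
        unsigned char c = in[i];

        if (c < 0x80) {
            /* ASCII maps to a single UTF-8 byte; keep room for the NUL */
            if (o + 1 >= out_size) {
                return -1;
            }
            out[o++] = (char) c;
        }
        else {
            /* U+0080..U+00FF take two UTF-8 bytes: 110xxxxx 10xxxxxx */
            if (o + 2 >= out_size) {
                return -1;
            }
            out[o++] = (char) (0xC0 | (c >> 6));
            out[o++] = (char) (0x80 | (c & 0x3F));
        }
    }

    if (o >= out_size) {
        return -1;
    }
    out[o] = '\0';
    return (int) o;
}
```

Called on the four bytes `t`, 0xE4, `m`, 0xE4 (the word "tämä" in ISO-8859-1), the function writes six UTF-8 bytes, because each 0xE4 becomes the pair 0xC3 0xA4.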

Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

Example configuration file for the change
Debug log output from testing the change

Attached Valgrind output that shows no leaks or memory corruption was found

Documentation

Documentation required for this feature

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

bluebike · 2020-08-04T20:54:41Z

I built this on top of #2326 (those commits seem to be included in this PR), so I would like that one to be merged before this.
TODO: example run, configuration, valgrind.
@nigels-com @edsiper would this be usable?

edsiper · 2020-08-07T23:15:53Z

#2326 was just merged, @bluebike @nigels-com can you remove the tutf8e pieces from this PR ?

in addition, ideally we want separate commits for the core interface "encoding" and for the plugins being improved.

bluebike · 2020-08-11T11:59:04Z

Rebased on master (with #2326 merged).
In two commits:

flb_encoding
in_tail changes

TODO(?)

add in_syslog support
test example

bluebike · 2020-08-11T17:55:31Z

Added encoding support to in_syslog.
Compiling with FLB_UTF8_ENCODER=No also worked; I didn't see any related warnings in my environment (macOS 10.13).

edsiper

thanks. I wrote some comments.

About the second commit, note that it must be prefixed with in_tail: .....

edsiper · 2020-08-11T18:08:37Z

src/flb_encoding.c

+ *  windows-1251 windows-1252, ..
+ * 
+ *  <charset>                     - fail if bad chars
+ *  <charset>//IGNORE             - ignore bad chars


is the // just a separator or a common naming across encoding configs ?

The iconv(3) library uses the same convention to add parameters to a charset name.
The implementations I know of understand only //IGNORE and //TRANSLIT.
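
As a rough sketch of the convention under discussion, the helper below splits a value such as `windows-1252//IGNORE` into the charset name and the option after the `//`. The name `split_charset_spec` and its signature are assumptions made for illustration; this is not code from the PR or from iconv.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (an assumption, not the PR's code).
 * Splits "CHARSET//OPTION" in the style of iconv_open(3).
 * Copies the charset name into `charset` and sets *option to the text
 * after "//" (pointing into `spec`), or to NULL when there is no "//".
 * Returns 0 on success, -1 if the name is empty or does not fit.
 */
int split_charset_spec(const char *spec, char *charset, size_t charset_size,
                       const char **option)
{
    const char *sep = strstr(spec, "//");
    size_t len = sep ? (size_t) (sep - spec) : strlen(spec);

    if (len == 0 || len >= charset_size) {
        return -1;
    }

    memcpy(charset, spec, len);
    charset[len] = '\0';
    *option = sep ? sep + 2 : NULL;
    return 0;
}
```

With this shape, `windows-1252//IGNORE` yields the name `windows-1252` and the option `IGNORE`, while a plain `iso-8859-2` yields a NULL option, which the caller could treat as "fail on bad chars".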

I'm a bit sceptical of following the iconv API for this.
I think a nullable invalid UTF8 string is sufficient for configuration purposes.
I don't think fluent-bit should be concerned with //TRANSLIT mode in particular:

    ICONV_OPEN(3)    Linux Programmer's Manual
    ...
    NAME
        iconv_open - allocate descriptor for character set conversion
    ...
    //TRANSLIT
        When the string "//TRANSLIT" is appended to tocode, transliteration is activated. This means that when a character cannot be represented in the target character set, it can be approximated through one or several similarly looking characters.
    //IGNORE
        When the string "//IGNORE" is appended to tocode, characters that cannot be represented in the target character set will be silently discarded.

//TRANSLIT has no meaning in our case.
//IGNORE is useful...

So... are we going to reach some kind of agreement on the configuration?

I like the idea of one configuration value with //(parameter), because it is flexible and "kind of" compatible with iconv.
It is also useful to have multiple ways to handle bad input (ignore, fail, replace-with-something).

Well... that is what this code is doing: putting everything into one parameter (like, but not exactly the same as, iconv(3)).
But should this be another parameter?

Then it would probably be better for the replacement to be a (JSON) escaped string.

    Encoding iso-8859-2                 => default... FAIL, IGNORE, "?", "\uFFFD" ????
    Encoding iso-8859-2 ?               =>
    Encoding iso-8859-3 <bad\20char>    => "<bad char
    Encoding iso-8859-3 \uFFFD          => Use unicode replacement character

Maybe

    Encoding iso-8859-2 \I              => IGNORE
    Encoding iso-8859-2 \F              => FAIL
    Encoding iso-8859-2 \R              => Replacement char...

Or we put the optional second parameter in quotes: "?" ...
Hmm... this is getting complicated.

Are we getting anywhere with the configuration? @nigels-com

I'll take a fresh look. I know you came to this from an iconv perspective, I was hoping to leave that behind rather than carry that into fluent-bit. It's been a while since I was actively engaged on this one.

Hmm, maybe I have to change the way this is configured...

Changed the configuration to use two parameters:

encoding CHARSET

encoding_replacement STRING

The default is to use the Unicode replacement character (U+FFFD).
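
The effect of a replacement string can be sketched as follows. To keep the example short, the input is treated as US-ASCII, so every byte >= 0x80 counts as a bad character. The function name `ascii_to_utf8_replace` is hypothetical, and treating an empty replacement string as "drop the bad bytes" is an assumption of this sketch, not confirmed behaviour of flb_encoding.

```c
#include <stddef.h>
#include <string.h>

/* U+FFFD (REPLACEMENT CHARACTER) encoded as UTF-8 */
#define DEFAULT_REPLACEMENT "\xEF\xBF\xBD"

/*
 * Hypothetical sketch (an assumption, not the flb_encoding API).
 * Copies ASCII bytes unchanged and substitutes `replacement` for every
 * byte >= 0x80. A NULL replacement selects U+FFFD; an empty string
 * drops the bad bytes.
 * Returns the output length (excluding the NUL), or -1 if `out` is too small.
 */
int ascii_to_utf8_replace(const unsigned char *in, size_t in_len,
                          const char *replacement,
                          char *out, size_t out_size)
{
    const char *rep = replacement ? replacement : DEFAULT_REPLACEMENT;
    size_t rep_len = strlen(rep);
    size_t i;
    size_t o = 0;

    for (i = 0; i < in_len; i++) {
        if (in[i] < 0x80) {
            /* valid byte: copy it, keeping room for the NUL */
            if (o + 1 >= out_size) {
                return -1;
            }
            out[o++] = (char) in[i];
        }
        else {
            /* bad byte: emit the replacement string instead */
            if (o + rep_len >= out_size) {
                return -1;
            }
            memcpy(out + o, rep, rep_len);
            o += rep_len;
        }
    }

    if (o >= out_size) {
        return -1;
    }
    out[o] = '\0';
    return (int) o;
}
```

Keeping the replacement as a separate argument mirrors the two-parameter configuration: the charset selects the mapping, and the replacement decides what happens to bytes the mapping cannot handle.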

bluebike · 2020-08-11T18:51:33Z

Did requested changes...

checking memory allocation
formatting checked.
rebased... to get clean 3 commits.
(... I'll add test examples soon)

bluebike · 2020-08-11T19:11:44Z

in_tail: simple test run in shell. input contains ä,Ö and € characters encoded in windows-1252 (cp1252).


$  echo $'Test data'    > huuhaa.txt
$  echo $'This contains a+dots: \xe4 O+dots: \xd6. trailing data' >> huuhaa.txt
$  echo $'This contains euro character: \x80' >> huuhaa.txt



$   bin/fluent-bit -v -i tail -p path=huuhaa.txt -p 'encoding=windows-1252'  -o stdout

Fluent Bit v1.6.0
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/08/11 22:06:42] [ info] Configuration:
[2020/08/11 22:06:42] [ info]  flush time     | 5.000000 seconds
[2020/08/11 22:06:42] [ info]  grace          | 5 seconds
[2020/08/11 22:06:42] [ info]  daemon         | 0
[2020/08/11 22:06:42] [ info] ___________
[2020/08/11 22:06:42] [ info]  inputs:
[2020/08/11 22:06:42] [ info]      tail
[2020/08/11 22:06:42] [ info] ___________
[2020/08/11 22:06:42] [ info]  filters:
[2020/08/11 22:06:42] [ info] ___________
[2020/08/11 22:06:42] [ info]  outputs:
[2020/08/11 22:06:42] [ info]      stdout.0
[2020/08/11 22:06:42] [ info] ___________
[2020/08/11 22:06:42] [ info]  collectors:
[2020/08/11 22:06:42] [ info] [engine] started (pid=47950)
[2020/08/11 22:06:42] [debug] [engine] coroutine stack size: 12288 bytes (12.0K)
[2020/08/11 22:06:42] [debug] [storage] [cio stream] new stream registered: tail.0
[2020/08/11 22:06:42] [ info] [storage] version=1.0.5, initializing...
[2020/08/11 22:06:42] [ info] [storage] in-memory
[2020/08/11 22:06:42] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/08/11 22:06:42] [debug] [input:tail:tail.0] scanning path huuhaa.txt
[2020/08/11 22:06:42] [debug] [input:tail:tail.0] inode=13109344 appended as huuhaa.txt
[2020/08/11 22:06:42] [debug] [input:tail:tail.0] scan_glob add(): huuhaa.txt, inode 13109344
[2020/08/11 22:06:42] [debug] [input:tail:tail.0] 1 new files found on path 'huuhaa.txt'
[2020/08/11 22:06:42] [debug] [router] default match rule tail.0:stdout.0
[2020/08/11 22:06:42] [ info] [sp] stream processor started
[2020/08/11 22:06:42] [debug] [input:tail:tail.0] inode=13109344 file=huuhaa.txt promote to TAIL_EVENT
[2020/08/11 22:06:47] [debug] [task] created task=0x7f9b36d00a30 id=0 OK
[0] tail.0: [1597172802.983562000, {"log"=>"Test data"}]
[1] tail.0: [1597172802.983579000, {"log"=>"This contains a+dots: ä O+dots: Ö. trailing data"}]
[2] tail.0: [1597172802.983580000, {"log"=>"This contains euro character: €"}]
^C[engine] caught signal (SIGINT)
[2020/08/11 22:06:52] [ info] [input] pausing tail.0
[2020/08/11 22:06:52] [debug] [input:tail:tail.0] inode=13109344 removing file name huuhaa.txt

bluebike · 2020-08-11T19:20:54Z

in_syslog: test run using UDP syslog messages.

# Start fluent-bit first, then send these from a different terminal (bash shell)

$ echo $'<135>Aug 11 20:27:22 myhost test: nothing'   | nc -w 1 -u 127.0.0.1  7700
$ echo $'<135>Aug 11 20:27:22 myhost test: euro: \x80'   | nc -w 1 -u 127.0.0.1  7700
$ echo $'<135>Aug 11 20:27:22 myhost test:  tama: t\xe4m\xe4'   | nc -w 1 -u 127.0.0.1  7700



$ bin/fluent-bit -v -R ../conf/parsers.conf -i syslog -p mode=udp  -p port=7700 -p Parser=syslog-rfc3164-local  -p encoding=windows-1252 -o stdout

Fluent Bit v1.6.0
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/08/11 22:13:51] [ info] Configuration:
[2020/08/11 22:13:51] [ info]  flush time     | 5.000000 seconds
[2020/08/11 22:13:51] [ info]  grace          | 5 seconds
[2020/08/11 22:13:51] [ info]  daemon         | 0
[2020/08/11 22:13:51] [ info] ___________
[2020/08/11 22:13:51] [ info]  inputs:
[2020/08/11 22:13:51] [ info]      syslog
[2020/08/11 22:13:51] [ info] ___________
[2020/08/11 22:13:51] [ info]  filters:
[2020/08/11 22:13:51] [ info] ___________
[2020/08/11 22:13:51] [ info]  outputs:
[2020/08/11 22:13:51] [ info]      stdout.0
[2020/08/11 22:13:51] [ info] ___________
[2020/08/11 22:13:51] [ info]  collectors:
[2020/08/11 22:13:51] [ info] [engine] started (pid=48082)
[2020/08/11 22:13:51] [debug] [engine] coroutine stack size: 12288 bytes (12.0K)
[2020/08/11 22:13:51] [debug] [storage] [cio stream] new stream registered: syslog.0
[2020/08/11 22:13:51] [ info] [storage] version=1.0.5, initializing...
[2020/08/11 22:13:51] [ info] [storage] in-memory
[2020/08/11 22:13:51] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/08/11 22:13:51] [ info] [in_syslog] UDP buffer size set to 32768 bytes
[2020/08/11 22:13:51] [ info] [in_syslog] UDP server binding 0.0.0.0:7700
[2020/08/11 22:13:51] [debug] [router] default match rule syslog.0:stdout.0
[2020/08/11 22:13:51] [ info] [sp] stream processor started
[0] syslog.0: [1597177642.000000000, {"pri"=>"135", "time"=>"Aug 11 20:27:22", "ident"=>"myhost", "message"=>"nothing"}
[0] syslog.0: [1597177642.000000000, {"pri"=>"135", "time"=>"Aug 11 20:27:22", "ident"=>"myhost", "message"=>"euro: €"}]
[0] syslog.0: [1597177642.000000000, {"pri"=>"135", "time"=>"Aug 11 20:27:22", "ident"=>"myhost", "message"=>"tama: tämä"}]

bluebike · 2020-10-28T18:56:22Z

Added documentation PR fluent/fluent-bit-docs#410
and fixed a possible memory leak when an allocation fails in flb_encoding_open.

github-actions · 2021-04-28T01:52:57Z

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

bluebike · 2021-07-29T18:55:11Z

> thanks. I wrote some comments.
>
> About the second commit, note that it must be prefixed with in_tail: .....

Changed commit message.

kingjan1999 · 2021-11-22T20:46:26Z

Any news on this? Would love to see this merged!

edsiper · 2021-12-13T00:09:21Z

assigned to @nokute78 for review

Add flb_encoding functions for charset encodings to utf8.

* Uses lib/tutf8e-library.
* Only 8-bit source charsets are supported.
* Support for replacement string (in case of bad chars)
* This commit doesn't add support to any input plugin.

Signed-off-by: Jukka Pihl <[email protected]>

* Adds new option: "encoding" to in_tail.
* If encoding fails. message is skipped.

Signed-off-by: Jukka Pihl <[email protected]>

* encoding uses flb_encoding
* refactoring syslog_prot_process and syslog_prot_process_udp to use syslog_prot_process_msg per message.

Signed-off-by: Jukka Pihl <[email protected]>

hpernu · 2023-04-18T11:37:40Z

Could we get this included in mainline ASAP? I assume I am not the only one with non-UTF8 log entries.

As a user, I do not really care how this is implemented, but as for configuration, the input side (i.e. the tail plugin) seems the most convenient place.

bluebike requested review from edsiper, fujimotos and koleini as code owners August 4, 2020 20:50

nigels-com mentioned this pull request Aug 5, 2020

lib: tutf8e: Updated API to support conversion of invalid input input #2326

Merged

bluebike force-pushed the flb_encoding_utf8 branch from 145814e to 2bdd04c Compare August 11, 2020 11:15

bluebike mentioned this pull request Aug 11, 2020

flb_iconv: charset decoding/encoding #1180

Closed

edsiper requested changes Aug 11, 2020

bluebike force-pushed the flb_encoding_utf8 branch from 3a2523c to dea7701 Compare August 11, 2020 18:46

nigels-com reviewed Sep 18, 2020

nigels-com reviewed Sep 18, 2020

bluebike force-pushed the flb_encoding_utf8 branch from dea7701 to b9d6c72 Compare October 28, 2020 18:51

bluebike mentioned this pull request Oct 29, 2020

in_tail: document "Encoding" configuration fluent/fluent-bit-docs#410

Closed

bluebike force-pushed the flb_encoding_utf8 branch from b9d6c72 to 103066a Compare March 23, 2021 18:11

bluebike mentioned this pull request Mar 23, 2021

workflows: allow the use of an underscore ('_') in commit message #3270

Merged

nigels-com reviewed Mar 28, 2021

github-actions bot added the Stale label Apr 28, 2021

bluebike force-pushed the flb_encoding_utf8 branch from 103066a to acfc758 Compare May 26, 2021 19:50

github-actions bot added the docs-required label May 26, 2021

bluebike force-pushed the flb_encoding_utf8 branch 3 times, most recently from ef5529f to 5b08bf1 Compare July 29, 2021 18:13

bluebike force-pushed the flb_encoding_utf8 branch from 5b08bf1 to e777888 Compare July 29, 2021 18:49

bluebike changed the title ~~flb_encoding: charset encoding for input plugins~~ flb_encoding: charset encoding for input plugins Jul 29, 2021

github-actions bot removed the Stale label Aug 3, 2021

edsiper assigned nokute78 Dec 13, 2021

edsiper removed the docs-required label Dec 13, 2021

bluebike added 3 commits December 19, 2021 17:15

in_tail: add flb_encoding support to in_tail-plugin

870a1bf

in_syslog: support for input encoding to utf8

33badf9

bluebike force-pushed the flb_encoding_utf8 branch from e777888 to 33badf9 Compare December 19, 2021 15:38

bluebike requested a review from niedbalski as a code owner December 19, 2021 15:38

edsiper requested review from patrick-stephens, celalettin1286 and leonardo-albertovich as code owners August 14, 2024 17:58
