Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flb_encoding: charset encoding for input plugins #2420

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

bluebike
Copy link
Contributor

@bluebike bluebike commented Aug 4, 2020

Adds library flb_encoding for doing charset encoding CHARSET => UTF8.

  • Uses lib/tutf8e-library.
  • Only 8-bit charsets are supported.
  • At first in_tail plugin is supported, later in_syslog and others.

Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

Documentation

  • Documentation required for this feature

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@bluebike
Copy link
Contributor Author

bluebike commented Aug 4, 2020

I build this over #2326 (those commits seems to be included in PR), so I would like that to be merged before this.
TODO: example run, configuration, valgrind.
@nigels-com @edsiper would this be usable?

@edsiper
Copy link
Member

edsiper commented Aug 7, 2020

#2326 was just merged, @bluebike @nigels-com can you remove the tutf8e pieces from this PR ?

in addition, ideally we want separate commits for the core interface "encoding" and for the plugins being improved.

@bluebike
Copy link
Contributor Author

bluebike commented Aug 11, 2020

Rebased on master (with #2326 merged).
In two commits:

  1. flb_encoding
  2. in_tail changes

TODO(?)

  • add in_syslog support
  • test example

@bluebike
Copy link
Contributor Author

Added encoding support to in_syslog .
Compiling with FLB_UTF8_ENCODER=No worked also... didn't see any related warnings in my env (macOS 10..13).

Copy link
Member

@edsiper edsiper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. I wrote some comments.

About the second commit, note that it must be prefixed with in_tail: .....

src/flb_encoding.c Outdated Show resolved Hide resolved
src/flb_encoding.c Outdated Show resolved Hide resolved
* windows-1251 windows-1252, ..
*
* <charset> - fail if bad chars
* <charset>//IGNORE - ignore bad chars
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the // just a separator or a common naming across encoding configs ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iconv(3) library uses same convention to add parameters to charset.
implementations I know understand only //IGNORE and //TRANSLIT .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit sceptical of following the iconv API for this.
I think a nullable invalid UTF8 string is sufficient for configuration purposes.
I don't think fluent-bit should be concerned with //TRANSLIT mode in particular:

ICONV_OPEN(3)                                               Linux Programmer's Manual                                               
...
NAME
       iconv_open - allocate descriptor for character set conversion

...
       //TRANSLIT
              When the string "//TRANSLIT" is appended to tocode, transliteration is activated.  This means that when a character cannot be  rep‐
              resented in the target character set, it can be approximated through one or several similarly looking characters.

       //IGNORE
              When  the  string  "//IGNORE"  is  appended  to  tocode,  characters that cannot be represented in the target character set will be
              silently discarded.```

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

//TRANSLIT has no meaning in our case.
//IGNORE is usefull...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So ... we are going to get some kind agreement of configuration?

I like idea of one configuration value with //(parameter), because that is flexible and "kind of" compatible with iconv.
Also it's useful to have multiple ways to handle bad input (ignore,fail,replace-with-something).

Copy link
Contributor Author

@bluebike bluebike Jan 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well... that is this code doing... just putting everything to one parameter (like but not totally same as in iconv(3)).
But should this be an another parameter????

Then probably better that replacement is should be (json) escaped string.

Encoding  iso-8859-2               => default... FAIL,INGORE ,  "?", "\uFFFD"  ????
Encoding  iso-8859-2  ?            => 
Encoding  iso-8859-3  <bad\20char> => "<bad char
Encoding  iso-8859-3  \uFFFD       => Use unicode replacement character

Maybe  

Encoding  iso-8859-2  \I            =>  IGNORE
Encoding  iso-8859-2  \F            =>  FAIL
Encoding  iso-8859-2  \R            =>  Replacement char...

Or we put optional second parameter to quotes "?" ...
hum... getting complicated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are getting anywhere with configuration???? @nigels-com

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a fresh look. I know you came to this from an iconv perspective, I was hoping to leave that behind rather than carry that into fluent-bit. It's been a while since I was actively engaged on this one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hum.. maybe have to change way of configuration...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed configuration use have parameters:

  • encoding CHARSET
  • encoding_replacement STRING

default is to use unicode replacement char (0xfffd)

@bluebike
Copy link
Contributor Author

bluebike commented Aug 11, 2020

Did requested changes...

  • checking memory allocation
  • formatting checked.
  • rebased... to get clean 3 commits.
    (... I'll add test examples soon)

@bluebike
Copy link
Contributor Author

in_tail: simple test run in shell. input contains ä,Ö and € characters encoded in windows-1252 (cp1252).


$  echo $'Test data'    > huuhaa.txt
$  echo $'This contains a+dots: \xe4 O+dots: \xd6. trailing data' >> huuhaa.txt
$  echo $'This contains euro character: \x80' >> huuhaa.txt



$   bin/fluent-bit -v -i tail -p path=huuhaa.txt -p 'encoding=windows-1252'  -o stdout

Fluent Bit v1.6.0
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/08/11 22:06:42] [ info] Configuration:
[2020/08/11 22:06:42] [ info]  flush time     | 5.000000 seconds
[2020/08/11 22:06:42] [ info]  grace          | 5 seconds
[2020/08/11 22:06:42] [ info]  daemon         | 0
[2020/08/11 22:06:42] [ info] ___________
[2020/08/11 22:06:42] [ info]  inputs:
[2020/08/11 22:06:42] [ info]      tail
[2020/08/11 22:06:42] [ info] ___________
[2020/08/11 22:06:42] [ info]  filters:
[2020/08/11 22:06:42] [ info] ___________
[2020/08/11 22:06:42] [ info]  outputs:
[2020/08/11 22:06:42] [ info]      stdout.0
[2020/08/11 22:06:42] [ info] ___________
[2020/08/11 22:06:42] [ info]  collectors:
[2020/08/11 22:06:42] [ info] [engine] started (pid=47950)
[2020/08/11 22:06:42] [debug] [engine] coroutine stack size: 12288 bytes (12.0K)
[2020/08/11 22:06:42] [debug] [storage] [cio stream] new stream registered: tail.0
[2020/08/11 22:06:42] [ info] [storage] version=1.0.5, initializing...
[2020/08/11 22:06:42] [ info] [storage] in-memory
[2020/08/11 22:06:42] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/08/11 22:06:42] [debug] [input:tail:tail.0] scanning path huuhaa.txt
[2020/08/11 22:06:42] [debug] [input:tail:tail.0] inode=13109344 appended as huuhaa.txt
[2020/08/11 22:06:42] [debug] [input:tail:tail.0] scan_glob add(): huuhaa.txt, inode 13109344
[2020/08/11 22:06:42] [debug] [input:tail:tail.0] 1 new files found on path 'huuhaa.txt'
[2020/08/11 22:06:42] [debug] [router] default match rule tail.0:stdout.0
[2020/08/11 22:06:42] [ info] [sp] stream processor started
[2020/08/11 22:06:42] [debug] [input:tail:tail.0] inode=13109344 file=huuhaa.txt promote to TAIL_EVENT
[2020/08/11 22:06:47] [debug] [task] created task=0x7f9b36d00a30 id=0 OK
[0] tail.0: [1597172802.983562000, {"log"=>"Test data"}]
[1] tail.0: [1597172802.983579000, {"log"=>"This contains a+dots: ä O+dots: Ö. trailing data"}]
[2] tail.0: [1597172802.983580000, {"log"=>"This contains euro character: €"}]
^C[engine] caught signal (SIGINT)
[2020/08/11 22:06:52] [ info] [input] pausing tail.0
[2020/08/11 22:06:52] [debug] [input:tail:tail.0] inode=13109344 removing file name huuhaa.txt

@bluebike
Copy link
Contributor Author

in_syslog: test run using UDP syslog messages.

# start fluent-bit first.. then send these in different terminal (+ bash shell)

$ echo $'<135>Aug 11 20:27:22 myhost test: nothing'   | nc -w 1 -u 127.0.0.1  7700
$ echo $'<135>Aug 11 20:27:22 myhost test: euro: \x80'   | nc -w 1 -u 127.0.0.1  7700
$ echo $'<135>Aug 11 20:27:22 myhost test:  tama: t\xe4m\xe4'   | nc -w 1 -u 127.0.0.1  7700



$ bin/fluent-bit -v -R ../conf/parsers.conf -i syslog -p mode=udp  -p port=7700 -p Parser=syslog-rfc3164-local  -p encoding=windows-1252 -o stdout

Fluent Bit v1.6.0
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/08/11 22:13:51] [ info] Configuration:
[2020/08/11 22:13:51] [ info]  flush time     | 5.000000 seconds
[2020/08/11 22:13:51] [ info]  grace          | 5 seconds
[2020/08/11 22:13:51] [ info]  daemon         | 0
[2020/08/11 22:13:51] [ info] ___________
[2020/08/11 22:13:51] [ info]  inputs:
[2020/08/11 22:13:51] [ info]      syslog
[2020/08/11 22:13:51] [ info] ___________
[2020/08/11 22:13:51] [ info]  filters:
[2020/08/11 22:13:51] [ info] ___________
[2020/08/11 22:13:51] [ info]  outputs:
[2020/08/11 22:13:51] [ info]      stdout.0
[2020/08/11 22:13:51] [ info] ___________
[2020/08/11 22:13:51] [ info]  collectors:
[2020/08/11 22:13:51] [ info] [engine] started (pid=48082)
[2020/08/11 22:13:51] [debug] [engine] coroutine stack size: 12288 bytes (12.0K)
[2020/08/11 22:13:51] [debug] [storage] [cio stream] new stream registered: syslog.0
[2020/08/11 22:13:51] [ info] [storage] version=1.0.5, initializing...
[2020/08/11 22:13:51] [ info] [storage] in-memory
[2020/08/11 22:13:51] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/08/11 22:13:51] [ info] [in_syslog] UDP buffer size set to 32768 bytes
[2020/08/11 22:13:51] [ info] [in_syslog] UDP server binding 0.0.0.0:7700
[2020/08/11 22:13:51] [debug] [router] default match rule syslog.0:stdout.0
[2020/08/11 22:13:51] [ info] [sp] stream processor started
[0] syslog.0: [1597177642.000000000, {"pri"=>"135", "time"=>"Aug 11 20:27:22", "ident"=>"myhost", "message"=>"nothing"}
[0] syslog.0: [1597177642.000000000, {"pri"=>"135", "time"=>"Aug 11 20:27:22", "ident"=>"myhost", "message"=>"euro: €"}]
[0] syslog.0: [1597177642.000000000, {"pri"=>"135", "time"=>"Aug 11 20:27:22", "ident"=>"myhost", "message"=>"tama: tämä"}]

src/flb_encoding.c Outdated Show resolved Hide resolved
@bluebike
Copy link
Contributor Author

Added documentation PR fluent/fluent-bit-docs#410
and fixed possible memory leak if allocation fails in opening in flb_encoding_open

plugins/in_tail/tail.c Outdated Show resolved Hide resolved
@github-actions
Copy link
Contributor

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Apr 28, 2021
@bluebike bluebike force-pushed the flb_encoding_utf8 branch from 103066a to acfc758 Compare May 26, 2021 19:50
@bluebike bluebike force-pushed the flb_encoding_utf8 branch 3 times, most recently from ef5529f to 5b08bf1 Compare July 29, 2021 18:13
@bluebike bluebike force-pushed the flb_encoding_utf8 branch from 5b08bf1 to e777888 Compare July 29, 2021 18:49
@bluebike
Copy link
Contributor Author

thanks. I wrote some comments.

About the second commit, note that it must be prefixed with in_tail: .....

Changed commit message.

@bluebike bluebike changed the title flb_encoding: charset encoding for input plugins flb_encoding: charset encoding for input plugins Jul 29, 2021
@github-actions github-actions bot removed the Stale label Aug 3, 2021
@kingjan1999
Copy link

Any news on this? Would love to see this merged!

@edsiper
Copy link
Member

edsiper commented Dec 13, 2021

assigned to @nokute78 for review

Add flb_encoding functions for charset encodings to utf8.

* Uses lib/tutf8e-library.
* Only 8-bit source charsets are supported.
* Support for replacement string (in case of bad chars)
* This commit doesn't add support to any input plugin.

Signed-off-by: Jukka Pihl <[email protected]>
* Adds new option:  "encoding" to in_tail.
* If encoding fails. message is skipped.

Signed-off-by: Jukka Pihl <[email protected]>
* encoding uses flb_encoding
* refactoring syslog_prot_process and syslog_prot_process_udp to
use syslog_prot_process_msg per message.

Signed-off-by: Jukka Pihl <[email protected]>
@hpernu
Copy link

hpernu commented Apr 18, 2023

Could we get this included in mainline ASAP? I assume I am not the only one with non-UTF8 log entries .

As a user, I do not even really care how this is implemented but as for configuration, input seems the most convenient i.e. tail-plugin.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants