-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
flb_encoding: charset encoding for input plugins #2420
base: master
Are you sure you want to change the base?
Conversation
I build this over #2326 (those commits seems to be included in PR), so I would like that to be merged before this. |
#2326 was just merged, @bluebike @nigels-com can you remove the tutf8e pieces from this PR ? in addition, ideally we want separate commits for the core interface "encoding" and for the plugins being improved. |
145814e
to
2bdd04c
Compare
Rebased on master (with #2326 merged).
TODO(?)
|
Added encoding support to in_syslog . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks. I wrote some comments.
About the second commit, note that it must be prefixed with in_tail: .....
src/flb_encoding.c
Outdated
* windows-1251 windows-1252, .. | ||
* | ||
* <charset> - fail if bad chars | ||
* <charset>//IGNORE - ignore bad chars |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the //
just a separator or a common naming across encoding configs ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iconv(3) library uses same convention to add parameters to charset.
implementations I know understand only //IGNORE and //TRANSLIT .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit sceptical of following the iconv API for this.
I think a nullable invalid UTF8 string is sufficient for configuration purposes.
I don't think fluent-bit should be concerned with //TRANSLIT mode in particular:
ICONV_OPEN(3) Linux Programmer's Manual
...
NAME
iconv_open - allocate descriptor for character set conversion
...
//TRANSLIT
When the string "//TRANSLIT" is appended to tocode, transliteration is activated. This means that when a character cannot be rep‐
resented in the target character set, it can be approximated through one or several similarly looking characters.
//IGNORE
When the string "//IGNORE" is appended to tocode, characters that cannot be represented in the target character set will be
silently discarded.```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//TRANSLIT has no meaning in our case.
//IGNORE is usefull...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So ... we are going to get some kind agreement of configuration?
I like idea of one configuration value with //(parameter), because that is flexible and "kind of" compatible with iconv.
Also it's useful to have multiple ways to handle bad input (ignore,fail,replace-with-something).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well... that is this code doing... just putting everything to one parameter (like but not totally same as in iconv(3)).
But should this be an another parameter????
Then probably better that replacement is should be (json) escaped string.
Encoding iso-8859-2 => default... FAIL,INGORE , "?", "\uFFFD" ????
Encoding iso-8859-2 ? =>
Encoding iso-8859-3 <bad\20char> => "<bad char
Encoding iso-8859-3 \uFFFD => Use unicode replacement character
Maybe
Encoding iso-8859-2 \I => IGNORE
Encoding iso-8859-2 \F => FAIL
Encoding iso-8859-2 \R => Replacement char...
Or we put optional second parameter to quotes "?" ...
hum... getting complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are getting anywhere with configuration???? @nigels-com
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take a fresh look. I know you came to this from an iconv perspective, I was hoping to leave that behind rather than carry that into fluent-bit. It's been a while since I was actively engaged on this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hum.. maybe have to change way of configuration...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed configuration use have parameters:
- encoding CHARSET
- encoding_replacement STRING
default is to use unicode replacement char (0xfffd)
3a2523c
to
dea7701
Compare
Did requested changes...
|
in_tail: simple test run in shell. input contains ä,Ö and € characters encoded in windows-1252 (cp1252).
|
in_syslog: test run using UDP syslog messages.
|
dea7701
to
b9d6c72
Compare
Added documentation PR fluent/fluent-bit-docs#410 |
b9d6c72
to
103066a
Compare
This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days. |
103066a
to
acfc758
Compare
ef5529f
to
5b08bf1
Compare
5b08bf1
to
e777888
Compare
Changed commit message. |
Any news on this? Would love to see this merged! |
assigned to @nokute78 for review |
Add flb_encoding functions for charset encodings to utf8. * Uses lib/tutf8e-library. * Only 8-bit source charsets are supported. * Support for replacement string (in case of bad chars) * This commit doesn't add support to any input plugin. Signed-off-by: Jukka Pihl <[email protected]>
* Adds new option: "encoding" to in_tail. * If encoding fails. message is skipped. Signed-off-by: Jukka Pihl <[email protected]>
* encoding uses flb_encoding * refactoring syslog_prot_process and syslog_prot_process_udp to use syslog_prot_process_msg per message. Signed-off-by: Jukka Pihl <[email protected]>
e777888
to
33badf9
Compare
Could we get this included in mainline ASAP? I assume I am not the only one with non-UTF8 log entries . As a user, I do not even really care how this is implemented but as for configuration, input seems the most convenient i.e. tail-plugin. |
Adds library flb_encoding for doing charset encoding CHARSET => UTF8.
Enter
[N/A]
in the box, if an item is not applicable to your change.Testing
Before we can approve your change; please submit the following in a comment:
Documentation
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.