parser: regex: Do not skip empty regex group matches #1913

Open: 3 commits to merge into base: master

nigels-com
Contributor

Regular Expression Parser is skipping empty values #1486

Unlike the other parsers, the regex parser omits empty group matches from its output.

Sample setup:

$ cat sample.in 
{"log": "{\"time_local\":\"2019-07-31T21:17:15\",\"client_ip\":\"\"}"}

$ cat sample.conf 
[SERVICE]
    Flush                     5
    Parsers_File              parsers.conf

[INPUT]
    Name         stdin

[FILTER]
    Name         parser
    Parser       json_regex
    Match        *
    Key_Name     log
    Reserve_Data On
    Preserve_Key On

[OUTPUT]
    Name            stdout
    Format          json_lines

$ cat parsers.conf 
[PARSER]
    Name   json_regex
    Format regex
    Regex  ^{"time_local":"(?<time_local>.*?)","client_ip":"(?<client_ip>.*?)"}$

Output with this patch applied:

$ cat sample.in | bin/fluent-bit -c sample.conf -p parsers.conf 
Fluent Bit v1.4.0
Copyright (C) Treasure Data

[2020/01/27 10:21:55] [ info] [storage] initializing...
[2020/01/27 10:21:55] [ info] [storage] in-memory
[2020/01/27 10:21:55] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/01/27 10:21:55] [ info] [engine] started (pid=8468)
[2020/01/27 10:21:55] [ info] [sp] stream processor started
[2020/01/27 10:21:55] [ warn] [in_stdin] end of file (stdin closed by remote end)
[2020/01/27 10:21:55] [ info] [input] pausing stdin.0
{"date":1580084515.652593,"time_local":"2019-07-31T21:17:15","client_ip":"","log":"{\"time_local\":\"2019-07-31T21:17:15\",\"client_ip\":\"\"}"}
[2020/01/27 10:21:55] [ warn] [engine] service will stop in 5 seconds
[2020/01/27 10:21:59] [ info] [engine] service stopped

Without this change, "client_ip":"" would be missing from the output.

@nigels-com
Contributor Author

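I think a hazard of this change is that we can't tell which groups matched an empty string and which were omitted from the input altogether.

For example, with the client_ip group made optional, both records below come out with "client_ip":"", although the second input has no client_ip at all: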

$ cat parsers2.conf 
[PARSER]
    Name   json_regex
    Format regex
    Regex  ^{"time_local":"(?<time_local>.*?)"(,"client_ip":"(?<client_ip>.*?)")?}$

$ cat sample2.in 
{"log": "{\"time_local\":\"2019-07-31T21:17:15\",\"client_ip\":\"\"}"}
{"log": "{\"time_local\":\"2019-07-31T21:17:15\"}"}

$ cat sample2.in | bin/fluent-bit -c sample2.conf -p parsers2.conf 
Fluent Bit v1.4.0
Copyright (C) Treasure Data

[2020/01/27 10:31:24] [ info] [storage] initializing...
[2020/01/27 10:31:24] [ info] [storage] in-memory
[2020/01/27 10:31:24] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/01/27 10:31:24] [ info] [engine] started (pid=10386)
[2020/01/27 10:31:24] [ info] [sp] stream processor started
[2020/01/27 10:31:24] [ warn] [in_stdin] end of file (stdin closed by remote end)
[2020/01/27 10:31:24] [ info] [input] pausing stdin.0
{"date":1580085084.179838,"time_local":"2019-07-31T21:17:15","client_ip":"","log":"{\"time_local\":\"2019-07-31T21:17:15\",\"client_ip\":\"\"}"}
{"date":1580085084.179842,"time_local":"2019-07-31T21:17:15","client_ip":"","log":"{\"time_local\":\"2019-07-31T21:17:15\"}"}
[2020/01/27 10:31:24] [ warn] [engine] service will stop in 5 seconds
[2020/01/27 10:31:28] [ info] [engine] service stopped
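
The ambiguity above can be illustrated outside Fluent Bit. The following is a minimal sketch in Python, not the parser's actual code: it assumes Python's `re` module as a stand-in for the regex engine, rewrites the named groups in Python's `(?P<name>...)` spelling, and uses a hypothetical helper named `classify`. In Python, a group that did not participate in the match is reported as `None`, while a group that matched an empty string is reported as `""`, so the two cases can be told apart before the record is built.

```python
import re

# Same pattern as parsers2.conf, with named groups in Python's (?P<name>...)
# spelling and the literal braces escaped. This is an illustration only.
PATTERN = re.compile(
    r'^\{"time_local":"(?P<time_local>.*?)"'
    r'(,"client_ip":"(?P<client_ip>.*?)")?\}$'
)


def classify(line):
    """Hypothetical helper: label each named group as value, empty, or absent."""
    match = PATTERN.match(line)
    if match is None:
        return None
    states = {}
    for name, value in match.groupdict().items():
        if value is None:
            states[name] = "absent"   # group did not participate in the match
        elif value == "":
            states[name] = "empty"    # group matched a zero-length string
        else:
            states[name] = "value"
    return states


with_empty_ip = '{"time_local":"2019-07-31T21:17:15","client_ip":""}'
without_ip = '{"time_local":"2019-07-31T21:17:15"}'

print(classify(with_empty_ip))
print(classify(without_ip))
```

Whether the regex engine used by the parser exposes the same distinction, and whether the patch should act on it, is left open here.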

@edsiper
Member

edsiper commented May 5, 2020

Hmm, I suggest introducing a new parser configuration property called Skip_Empty_Keys, set to true by default, so that your patch takes effect only when the property is set to false. That way, we won't break other deployments.
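
A rough sketch of the semantics proposed here, written in Python for illustration and not taken from the patch: the function name `parse_regex` and its `skip_empty_keys` parameter are hypothetical stand-ins for the parser and the suggested Skip_Empty_Keys property. The default of `True` keeps the existing behaviour, so only configurations that opt out would see empty values in their records. Dropping groups that did not participate in the match in both modes is a choice made for this sketch, not something stated in the thread.

```python
import re

# Pattern from parsers.conf, in Python's (?P<name>...) spelling.
PATTERN = r'^\{"time_local":"(?P<time_local>.*?)","client_ip":"(?P<client_ip>.*?)"\}$'
LINE = '{"time_local":"2019-07-31T21:17:15","client_ip":""}'


def parse_regex(pattern, line, skip_empty_keys=True):
    """Hypothetical parser: build a record from the named groups of a match."""
    match = re.match(pattern, line)
    if match is None:
        return None
    record = {}
    for name, value in match.groupdict().items():
        if value is None:
            continue  # sketch choice: never emit groups that did not match
        if value == "" and skip_empty_keys:
            continue  # existing behaviour: empty matches are dropped
        record[name] = value
    return record


# Default: same output as before the patch, client_ip is dropped.
print(parse_regex(PATTERN, LINE))

# Opt-out: the empty client_ip is kept in the record.
print(parse_regex(PATTERN, LINE, skip_empty_keys=False))
```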

@edsiper edsiper self-assigned this May 5, 2020
@edsiper edsiper added the waiting-for-user Waiting for more information, tests or requested changes label May 5, 2020
@edsiper
Member

edsiper commented Jun 30, 2020

ping

@nigels-com
Contributor Author

Oh, thanks for the ping. Had completely forgotten about this one.

@nigels-com nigels-com force-pushed the regex-empty-not-skipped branch from 534fd75 to 2745390 Compare September 18, 2020 05:31
@nigels-com
Contributor Author

Updated the PR with a Skip_Empty_Keys configuration property. I will go ahead and update the documentation as well.

@edsiper
Member

edsiper commented Dec 13, 2020

@nigels-com

  • please fix the conflicts
  • add the DCO sign-off

@nokute78
Collaborator

@nigels-com What is the status of this PR?
If you have forgotten about it, is it OK if I create another PR that takes the same approach?
