
in_tail: process non utf8 encodings with conversion engine #10542

Merged


cosmo0920
Contributor

@cosmo0920 cosmo0920 commented Jul 4, 2025

This PR is a follow-up to #10464.
For a long time, we processed only UTF-8, and we recently added the UTF-16 family of character encodings.
In the real world, however, we need to handle more encodings, such as GBK, ShiftJIS, GB18030, BIG5, and UHC, as well as other European and Middle Eastern encodings.

In this PR, I added a handler to the in_tail plugin that processes such encodings.
With this PR, these encodings are processed when generic.encoding is set.
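
To make the motivation concrete, here is a minimal sketch of why a conversion step is needed before GBK input can be handled as UTF-8. It uses Python's built-in codecs purely as a stand-in; the helper name `to_utf8` is hypothetical and this is not the code path Fluent Bit uses.

```python
# Illustration only: Python's built-in codecs stand in for a conversion
# engine. The helper name to_utf8 is hypothetical, not a Fluent Bit API.

def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Transcode bytes from source_encoding into UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# "你好" in GBK is 4 bytes: two 2-byte characters.
gbk_line = "你好".encode("gbk")

# The raw GBK bytes are not valid UTF-8, so they cannot be passed through as-is.
try:
    gbk_line.decode("utf-8")
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False

# After transcoding, the same text is 6 bytes: two 3-byte UTF-8 characters.
utf8_line = to_utf8(gbk_line, "gbk")

print(valid_as_utf8, len(gbk_line), len(utf8_line), utf8_line.decode("utf-8"))
```

The point of the sketch is only that the byte sequences differ between encodings, so the source encoding has to be declared (here via `generic.encoding`) rather than guessed.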


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
[INPUT]
    Name         tail
    Tag          dummy.local
    Path         /path/to/fluent-bit/tests/runtime/data/tail/log/generic_enc_gbk.log
    generic.encoding GBK
    Read_From_Head true
[OUTPUT]
    Name  stdout
    Match *
  • Debug log output from testing the change
Fluent Bit v4.0.4
* Copyright (C) 2015-2025 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

______ _                  _    ______ _ _             ___  _____ 
|  ___| |                | |   | ___ (_) |           /   ||  _  |
| |_  | |_   _  ___ _ __ | |_  | |_/ /_| |_  __   __/ /| || |/' |
|  _| | | | | |/ _ \ '_ \| __| | ___ \ | __| \ \ / / /_| ||  /| |
| |   | | |_| |  __/ | | | |_  | |_/ / | |_   \ V /\___  |\ |_/ /
\_|   |_|\__,_|\___|_| |_|\__| \____/|_|\__|   \_/     |_(_)___/ 


[2025/07/04 18:40:09] [ info] Configuration:
[2025/07/04 18:40:09] [ info]  flush time     | 1.000000 seconds
[2025/07/04 18:40:09] [ info]  grace          | 5 seconds
[2025/07/04 18:40:09] [ info]  daemon         | 0
[2025/07/04 18:40:09] [ info] ___________
[2025/07/04 18:40:09] [ info]  inputs:
[2025/07/04 18:40:09] [ info]      tail
[2025/07/04 18:40:09] [ info] ___________
[2025/07/04 18:40:09] [ info]  filters:
[2025/07/04 18:40:10] [ info] ___________
[2025/07/04 18:40:10] [ info]  outputs:
[2025/07/04 18:40:10] [ info]      stdout.0
[2025/07/04 18:40:10] [ info] ___________
[2025/07/04 18:40:10] [ info]  collectors:
[2025/07/04 18:40:10] [ info] [fluent bit] version=4.0.4, commit=908b11d7f1, pid=1515015
[2025/07/04 18:40:10] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2025/07/04 18:40:10] [ info] [storage] ver=1.1.6, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2025/07/04 18:40:10] [ info] [simd    ] disabled
[2025/07/04 18:40:10] [ info] [cmetrics] version=1.0.3
[2025/07/04 18:40:10] [ info] [ctraces ] version=0.6.6
[2025/07/04 18:40:10] [ info] [input:tail:tail.0] initializing
[2025/07/04 18:40:10] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2025/07/04 18:40:10] [debug] [tail:tail.0] created event channels: read=25 write=26
[2025/07/04 18:40:10] [debug] [input:tail:tail.0] flb_tail_fs_inotify_init() initializing inotify tail input
[2025/07/04 18:40:10] [debug] [input:tail:tail.0] inotify watch fd=31
[2025/07/04 18:40:10] [ info] [output:stdout:stdout.0] worker #0 started
[2025/07/04 18:40:10] [debug] [input:tail:tail.0] scanning path /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/generic_enc_gbk.log
[2025/07/04 18:40:10] [debug] [input:tail:tail.0] file will be read in POSIX_FADV_DONTNEED mode /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/generic_enc_gbk.log
[2025/07/04 18:40:10] [debug] [input:tail:tail.0] inode=43131037 with offset=0 appended as /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/generic_enc_gbk.log
[2025/07/04 18:40:10] [debug] [input:tail:tail.0] scan_glob add(): /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/generic_enc_gbk.log, inode 43131037
[2025/07/04 18:40:10] [debug] [input:tail:tail.0] 1 new files found on path '/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/generic_enc_gbk.log'
[2025/07/04 18:40:10] [debug] [stdout:stdout.0] created event channels: read=33 write=34
[2025/07/04 18:40:10] [ info] [sp] stream processor started
[2025/07/04 18:40:10] [ info] [engine] Shutdown Grace Period=5, Shutdown Input Grace Period=2
[2025/07/04 18:40:10] [debug] [input:tail:tail.0] [static files] processed 46b
[2025/07/04 18:40:10] [debug] [input:tail:tail.0] inode=43131037 file=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/generic_enc_gbk.log promote to TAIL_EVENT
[2025/07/04 18:40:10] [ info] [input:tail:tail.0] inotify_fs_add(): inode=43131037 watch_fd=1 name=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/generic_enc_gbk.log
[2025/07/04 18:40:10] [debug] [input:tail:tail.0] [static files] processed 0b, done
[0] dummy.local: [[1751622010.154549011, {}], {"log"=>"你好"}]
[1] dummy.local: [[1751622010.162688831, {}], {"log"=>"谢谢"}]
[2] dummy.local: [[1751622010.162775758, {}], {"log"=>"再见"}]
[3] dummy.local: [[1751622010.162811553, {}], {"log"=>"中国"}]
[4] dummy.local: [[1751622010.162843340, {}], {"log"=>"猫"}]
[5] dummy.local: [[1751622010.162873918, {}], {"log"=>"狗"}]
[2025/07/04 18:40:10] [debug] [task] created task=0x61694a0 id=0 OK
[6] dummy.local: [[1751622010.162914570, {}], {"log"=>"吃"}]
[7] dummy.local: [[1751622010.162944805, {}], {"log"=>"喝"}]
[2025/07/04 18:40:10] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[8] dummy.local: [[1751622010.162974608, {}], {"log"=>"天"}]
[9] dummy.local: [[1751622010.163004281, {}], {"log"=>"海"}]
[10] dummy.local: [[1751622010.163034517, {}], {"log"=>"月亮"}]
[11] dummy.local: [[1751622010.163064845, {}], {"log"=>"花"}]
[2025/07/04 18:40:10] [debug] [out flush] cb_destroy coro_id=0
[2025/07/04 18:40:10] [debug] [task] destroy task=0x61694a0 (task_id=0)
^C[2025/07/04 18:40:11] [engine] caught signal (SIGINT)
[2025/07/04 18:40:11] [ warn] [engine] service will shutdown in max 5 seconds
[2025/07/04 18:40:11] [ info] [input] pausing tail.0
[2025/07/04 18:40:11] [ info] [engine] service has stopped (0 pending tasks)
[2025/07/04 18:40:11] [ info] [input] pausing tail.0
[2025/07/04 18:40:11] [ info] [output:stdout:stdout.0] thread worker #0 stopping...
[2025/07/04 18:40:11] [ info] [output:stdout:stdout.0] thread worker #0 stopped
[2025/07/04 18:40:11] [debug] [input:tail:tail.0] inode=43131037 removing file name /media/Data3/
  • Attached Valgrind output that shows no leaks or memory corruption was found
==1515015== 
==1515015== HEAP SUMMARY:
==1515015==     in use at exit: 0 bytes in 0 blocks
==1515015==   total heap usage: 3,261 allocs, 3,261 frees, 1,077,833 bytes allocated
==1515015== 
==1515015== All heap blocks were freed -- no leaks are possible
==1515015== 
==1515015== For lists of detected and suppressed errors, rerun with: -s
==1515015== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
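
For anyone who wants to reproduce the test locally, the following sketch writes a GBK-encoded input file similar to the one used above. The actual contents of generic_enc_gbk.log are not shown in this PR, so this assumes the file holds the twelve words printed in the stdout output, one per line with LF line endings. Under that assumption the file is 46 bytes, which is consistent with the "processed 46b" line in the debug log. The function name `write_gbk_fixture` and the temporary path are hypothetical.

```python
import os
import tempfile

# Assumption: these are the twelve records seen in the stdout output above.
WORDS = ["你好", "谢谢", "再见", "中国", "猫", "狗",
         "吃", "喝", "天", "海", "月亮", "花"]


def write_gbk_fixture(path: str) -> int:
    """Write WORDS to path as GBK, one per line (LF), and return the byte count."""
    data = "".join(word + "\n" for word in WORDS).encode("gbk")
    with open(path, "wb") as f:
        f.write(data)
    return len(data)


# Hypothetical location; point the Path option of the tail input at it.
path = os.path.join(tempfile.mkdtemp(), "generic_enc_gbk.log")
size = write_gbk_fixture(path)
print(path, size)
```

Each of the 17 characters takes 2 bytes in GBK and each of the 12 lines ends with a 1-byte newline, which gives 34 + 12 = 46 bytes.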

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

fluent/fluent-bit-docs#1870

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@cosmo0920 cosmo0920 force-pushed the cosmo0920-process-non-UTF8-encodings-with-conversion-engine branch from f1be65f to c453678 Compare July 4, 2025 17:12
@cosmo0920 cosmo0920 marked this pull request as ready for review July 4, 2025 17:12
@cosmo0920 cosmo0920 added this to the Fluent Bit v4.0.4 milestone Jul 5, 2025
@edsiper edsiper merged commit 2b3f468 into master Jul 6, 2025
91 of 98 checks passed
@edsiper edsiper deleted the cosmo0920-process-non-UTF8-encodings-with-conversion-engine branch July 6, 2025 23:43