Bug 1560358
Summary: Looks like fluentd is keeping open deleted journal files that have been rotated

Product: OpenShift Container Platform
Component: Logging
Version: 3.9.0
Target Release: 3.10.z
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: OpsBlocker, Reopened
Reporter: Peter Portante <pportant>
Assignee: ewolinet
QA Contact: Anping Li <anli>
CC: anli, aos-bugs, dmoessne, ecrosby, ewolinet, florian.herzog, jcantril, jlee, jupierce, mharri, mifiedle, nhosoi, pportant, rmeggins, vwalek
Flags: jcantril: needinfo-
Type: Bug
Last Closed: 2019-01-09 14:56:01 UTC
Cloned To: 1664744 (view as bug list)
Bug Blocks: 1664744

Doc Type: Bug Fix
Doc Text:
Cause: The underlying library that fluentd uses for reading journal files did not properly handle rolled-over files.
Consequence: When a journal file was rolled over, fluentd would keep it open even after it was done reading from it.
Fix: Updated code in the underlying fluent-plugin-systemd plugin and updated the version in the fluentd Dockerfile.
Result: We no longer see fluentd holding rolled-over journal files.
Description
Peter Portante
2018-03-26 02:15:24 UTC
*** Bug 1561121 has been marked as a duplicate of this bug. ***

If the issue is observed with log-driver=journald: fluent-plugin-systemd depends on the native C library for opening and closing journal files. It might be an RFE for the C library? https://www.freedesktop.org/software/systemd/man/sd_journal_open.html

It looks like bz1561121 is hitting the problem with the JSON file input. The in_tail input plugin uses cool.io for managing the watched files. I wonder if any errors from cool.io can be found in the fluentd log? in_tail/cool.io refreshes the watch list every refresh_interval, 60 seconds by default, with a rotate_wait of 5 seconds. If that many files are left open despite being deleted (for longer than the refresh_interval), something must be preventing the refresh...

It does not appear to be a problem with JSON files, as the list of deleted files still held open by fluentd are in /var/log/journal. Next time I see this, I'll capture the list.

The partial fix is to be sure the use of the sd_journal client API includes a fetch of the file descriptor for inotify immediately after calling sd_journal_open(). The final fix is really in the sd_journal APIs themselves. My Ruby fu is not good, but from what I can tell the Ruby systemd code does not call sd_journal_get_fd() immediately after sd_journal_open(). If a client calls sd_journal_get_fd() as soon as possible after sd_journal_open(), the window for leaking FDs is closed significantly, but not entirely. The only way to close that gap is with a change to the journal APIs where sd_journal_open() creates the inotify FD at the time of the open. Can somebody double-check me here?

Closing and re-opening the journal is not the right fix. That is what rsyslog used to do, and it led to duplicate messages at a major customer. See https://github.com/rsyslog/rsyslog/commit/34b2d733699954e2ac770e8607c1694aacb51862

The fix wasn't merged into the latest openshift3/logging-fluentd/images/v3.10.0-0.47.0.0.
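One way to capture that list of deleted-but-still-open journal files is to read /proc directly rather than use lsof. This is a sketch, not part of the original report: it assumes a Linux /proc filesystem, and the PID defaults to the current shell only so the example runs anywhere (in practice you would pass the fluentd worker PID).

```shell
# Sketch: list deleted journal files a process still holds open by
# reading /proc/<pid>/fd symlinks; the kernel appends " (deleted)" to
# the target of an FD whose file has been unlinked.
pid="${1:-$$}"
count=0
for fd in /proc/"$pid"/fd/*; do
  target=$(readlink "$fd" 2>/dev/null) || continue
  case "$target" in
    /var/log/journal/*" (deleted)")
      echo "$target"
      count=$((count + 1))
      ;;
  esac
done
echo "deleted journal files still open: $count"
```

This avoids depending on lsof being installed on the node, which is not guaranteed on minimal hosts.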
@Anping It looks like the fix should be available in openshift3/logging-fluentd/images/v3.10.0-0.50.0.0

It seems the fix wasn't merged or doesn't work. 299 deleted journal files are being kept open with openshift3/logging-fluentd/images/v3.10.0-0.53.0.0:

[root@shared-ug-wmeng-master-etcd-zone2-1 fluentd]# lsof -p 95570 | grep deleted | grep -F "/var/log" | wc -l
299

systemd-219-57.el7.x86_64

[Journal]
Storage=persistent
Compress=True
SyncIntervalSec=1s
RateLimitInterval=1s
RateLimitBurst=10000
SystemMaxUse=8G
SystemMaxFileSize=10M
MaxRetentionSec=1month
ForwardToSyslog=False
ForwardToWall=False

Filed PR for this upstream: https://github.com/reevoo/fluent-plugin-systemd/pull/65

Adding documentation:

# docker ps | grep fluent

and then to find the correct pid:

# ps -ef | grep of the container hash, for the one running "sh run.sh"

Created attachment 1448991 [details]
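The PID-discovery step described above can also be done in one pipeline. This is a sketch under an assumption not stated in the original comment: that the fluentd worker is the process whose command line contains "fluentd --no-supervisor" (as seen elsewhere in this bug's transcripts). It prints nothing when no such worker is running.

```shell
# Sketch: find the fluentd worker PID on the host by matching its
# "--no-supervisor" argument; the !/awk/ guard keeps the pipeline from
# matching its own awk process in the ps listing.
ps -ef | awk '/fluentd --no-supervisor/ && !/awk/ {print $2; exit}'
```

The result can then be fed straight into `lsof -p` for the deleted-file checks shown in this bug.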
The fluentd openfile watch logs
Both openshift3/logging-fluentd:v3.10.0-0.64.0.0 and logging-fluentd:journalpatch closed the rotated journal files. It took about 2 minutes to close 18 deleted files.
Steps:
1. Watch the journal file and the fluentd opened deleted file
while true; do
echo "##1 ls -lR /var/log/journal |wc -l"
ls -lR /var/log/journal |wc -l
echo "##2 ls -lR /var/log/journal"
ls -lR /var/log/journal
fluent_id=$(ps -ef | grep 'fluentd --no-supervisor' | grep -v grep | awk '{ if (NR==1) print $2}')
echo "##3 suck journal logs"
lsof -p $fluent_id | grep deleted
echo done
done
2. Reduce the journal files
journalctl --vacuum-size=1
Moving to ON_QA per #c25, with no intention to apply the supplemental change proposed in #c24.

Anping Li, can you use the mechanism in #c25 to show that it does occur with the currently released image? If you can, great. If you can't, we'll likely need to fix the reproducer.

@peter, The deleted files are closed when using the released image logging-fluentd/images/v3.7.52. I had tried both #c25 and making the journal log rotate under load testing. Any suggestion on how to reproduce it 100%? Maybe I need to generate more complicated records to keep fluentd busy.

##1 ls -l /var/log/journal/5ee9c09efa004272a03befb58bf25779 |wc -l
232
##2 ls -lh /var/log/journal/5ee9c09efa004272a03befb58bf25779
total 1.9G
...............
................
##3 suck journal logs
fluentd 26488 root 441r REG 253,0 8388608 8635532 /var/log/journal/5ee9c09efa004272a03befb58bf25779/system (deleted)
fluentd 26488 root 442r REG 253,0 8388608 8635533 /var/log/journal/5ee9c09efa004272a03befb58bf25779/system (deleted)
fluentd 26488 root 443r REG 253,0 8388608 8635534 /var/log/journal/5ee9c09efa004272a03befb58bf25779/system (deleted)

@mike, Can you watch for this in performance testing?

I have not seen this issue. Will continue to monitor it in a new logging long run starting this week.

With logging-fluentd/images/v3.10.9-1, deleted files are still being held open:
[root@starter-ca-central-1-master-692e9 ~]# lsof -p 39279 | grep deleted | grep -F "/var/log"
fluentd 39279 root 97r REG 253,3 10485760 16801226 /var/log/journal/d52c597d0f1a42aeb01b5a7d71e63f24/system (deleted)
fluentd 39279 root 98r REG 253,3 10485760 16800995 /var/log/journal/d52c597d0f1a42aeb01b5a7d71e63f24/system (deleted)
fluentd 39279 root 99r REG 253,3 10485760 16800319 /var/log/journal/d52c597d0f1a42aeb01b5a7d71e63f24/system (deleted)
fluentd 39279 root 101r REG 253,3 10485760 16799243 /var/log/journal/d52c597d0f1a42aeb01b5a7d71e63f24/system (deleted)
fluentd 39279 root 102r REG 253,3 10485760 16799244 /var/log/journal/d52c597d0f1a42aeb01b5a7d71e63f24/system (deleted)
fluentd 39279 root 103r REG 253,3 10485760 16799245 /var/log/journal/d52c597d0f1a42aeb01b5a7d71e63f24/system (deleted)
fluentd 39279 root 104r REG 253,3 10485760 16807680 /var/log/journal/d52c597d0f1a42aeb01b5a7d71e63f24/system (deleted)

@anli logging-fluentd 3.10.18 contains the gems in comment 34. I am unable to reproduce the issue with that version of logging-fluentd. Based on #c35 and your observation in #c32, doesn't that mean you have verified the issue is resolved and we should move to close it out?

I think @anli is suggesting a final test in Online. If not, we can close it out and re-open if it occurs in Online; if so, go ahead and move to ON_QA.

@Jeff, @Mike, Yes, a final test in Online is expected.

Moved to VERIFIED per comments 25, 34, and 41. If you still hit this issue, feel free to reopen this bug.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

Reopening since this issue has been observed on clusters running 3.11.44. I don't believe it has been fully resolved.
[root@starter-us-east-1-node-compute-7edda log]# df -h /var
Filesystem              Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-var   16G   14G  2.5G  85% /var
[root@starter-us-east-1-node-compute-7edda log]# du -shx /var
5.6G /var

The next place I looked was at *open* files that are in a *deleted* state. Fluentd is holding onto 803 of these files.

[root@starter-us-east-1-node-compute-7edda log]# ps -ef | grep /usr/bin/fluentd
root 13551 13530 0 Jan03 ? 00:02:19 /usr/bin/ruby /usr/bin/fluentd --no-supervisor
root 29585 9099 0 16:57 pts/1 00:00:00 grep --color=auto /usr/bin/fluentd
[root@starter-us-east-1-node-compute-7edda log]# lsof -p 13551 | grep deleted | wc -l
803

(In reply to Justin Pierce from comment #45)
> Reopening since this issue has been observed on clusters running 3.11.44. I
> don't believe it has been fully resolved.

I thought we could not REOPEN bzs that have been CLOSED via an errata . . . if that is correct, please close this bz and clone it.

*** Bug 1817059 has been marked as a duplicate of this bug. ***
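The df/du discrepancy above (14G used per df vs. 5.6G per du) is exactly the space pinned by deleted-but-open files, since du cannot see unlinked files while df counts their blocks. A rough way to quantify that pinned space is to sum the sizes lsof reports. This is a sketch, not from the original report: it assumes lsof is installed and that its SIZE/OFF column is field 7; the PID defaults to the current shell only so the example runs.

```shell
# Sketch: sum the sizes (in bytes) of deleted-but-open files held by a
# PID, from lsof's SIZE/OFF column. Prints "0 bytes pinned" when no
# deleted files are held or lsof produces no output.
pid="${1:-$$}"
lsof -p "$pid" 2>/dev/null \
  | awk '/\(deleted\)/ {sum += $7} END {printf "%d bytes pinned\n", sum}'
```

On the node above this would be expected to report several gigabytes, matching the df/du gap.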