The workaround is to periodically restart fluentd by labelling and unlabelling the nodes:

  oc label node -l logging-infra-fluentd=true logging-infra-fluentd=false
  # wait for fluentd to shut down, then
  oc label node -l logging-infra-fluentd=false logging-infra-fluentd=true

I think that should unblock Ops until we can get a better fix.
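For anyone scripting this, a minimal sketch of that cycle follows. The --overwrite flag and the wait loop are additions I have not verified as part of the workaround above, and the openshift-logging namespace and component=fluentd pod label are assumptions about the deployment:

  # Disable fluentd by flipping the node label; --overwrite is needed because
  # the label already exists with the value "true".
  oc label node -l logging-infra-fluentd=true logging-infra-fluentd=false --overwrite

  # Wait until the fluentd pods have actually terminated.
  while [ -n "$(oc get pods -n openshift-logging -l component=fluentd -o name)" ]; do
    sleep 5
  done

  # Re-enable fluentd; the daemonset reschedules the pods on the relabelled nodes.
  oc label node -l logging-infra-fluentd=false logging-infra-fluentd=true --overwrite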
This issue seems to be related to https://github.com/ledbettj/systemd-journal/issues/70.
The only known workaround at this time is to cycle the fluentd pods. [1] describes a method by which one can create a cronjob to do just that (a rough sketch of the idea follows below). It does not take 'Ready' pods into account like it probably should. Additionally, there is a balancing act around how often to check, backpressure, etc.

[1] https://github.com/openshift/origin-aggregated-logging/pull/1508
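Purely as an illustration of the shape of such a check, not the method from [1] itself: the 75% threshold, the openshift-logging namespace, and the component=fluentd label are all assumptions, and as noted above it does not check pod readiness.

  #!/bin/bash
  # Illustrative sketch only -- see [1] for the actual method.
  THRESHOLD=75
  NAMESPACE=openshift-logging

  for pod in $(oc get pods -n "$NAMESPACE" -l component=fluentd -o name); do
    # fluentd mounts the host's /var/log, so df inside the pod reflects node usage.
    usage=$(oc exec -n "$NAMESPACE" "${pod#pod/}" -- df --output=pcent /var/log | tail -1 | tr -dc '0-9')
    if [ "${usage:-0}" -gt "$THRESHOLD" ]; then
      # Deleting the pod lets the daemonset recreate it, which releases any
      # deleted-but-still-open journal files the old process was holding.
      oc delete pod -n "$NAMESPACE" "${pod#pod/}"
    fi
  done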
Users will need to deploy the cronjob as suggested by [1]. Created a JIRA card to add this into the cluster-logging-operator [2]. Closing as WONTFIX.

[1] https://github.com/openshift/origin-aggregated-logging/blob/master/docs/troubleshooting.md#fluentd-is-holding-onto-deleted-journald-files-that-have-been-rotated
[2] https://jira.coreos.com/browse/LOG-348
We should confirm that https://access.redhat.com/solutions/3958661 solves this issue. If it does, we should move this over to a docs bug so it gets documented as a known issue. Once QE verifies this bug and signs off, https://access.redhat.com/solutions/3958661 can be published while we wait for it to be included in the docs.
I tested https://access.redhat.com/solutions/3958661, which works as is, but I did find one improvement I hoped to make. The cron job will continually delete the fluentd pod regardless of whether the fluentd process is retaining deleted journal files. In other words, it makes no difference why /var/log exceeds the threshold; the fluentd pod will be deleted. I had hoped to put an lsof-based check for retained journal files into the cron, but since the fluentd process is run by root, it results in permission errors.

I will add a note to the solution warning of this possibility, along with a suggested action of either raising the threshold or finding the current source of the excessive /var/log growth.
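For reference, this is roughly the kind of lsof-based check meant here; it would need to run with root privileges (which is exactly the problem described above), and the pgrep -f fluentd match is an assumption about the process name:

  # List open files whose link count is zero (deleted but still held open)
  # and keep only journal files held by a fluentd process. Requires root.
  lsof -nP +L1 2>/dev/null | grep fluentd | grep '\.journal'

  # An equivalent check via /proc that avoids lsof:
  for pid in $(pgrep -f fluentd); do
    ls -l /proc/"$pid"/fd 2>/dev/null | grep journal | grep '(deleted)'
  done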
(In reply to Jack Ottofaro from comment #8)
> I tested https://access.redhat.com/solutions/3958661, which works as is, but
> I did find one improvement I hoped to make. The cron job will continually
> delete the fluentd pod regardless of whether the fluentd process is
> retaining deleted journal files. In other words, it makes no difference why
> /var/log exceeds the threshold; the fluentd pod will be deleted. I had hoped
> to put an lsof-based check for retained journal files into the cron, but
> since the fluentd process is run by root, it results in permission errors.

The source of this solution did this exact check, but I thought I had determined that the OCP image does not contain 'lsof'. I would love to be proven wrong.

> I will add a note to the solution warning of this possibility, along with a
> suggested action of either raising the threshold or finding the current
> source of the excessive /var/log growth.

Isn't the excessive growth due to the fact that the collector fluentd is incapable of keeping up with the log generation? The solution is to log less, for instance by reducing your log level. Do you have the debug level on?
(In reply to Jeff Cantrill from comment #9)
> (In reply to Jack Ottofaro from comment #8)
> > I tested https://access.redhat.com/solutions/3958661, which works as is,
> > but I did find one improvement I hoped to make. The cron job will
> > continually delete the fluentd pod regardless of whether the fluentd
> > process is retaining deleted journal files. In other words, it makes no
> > difference why /var/log exceeds the threshold; the fluentd pod will be
> > deleted. I had hoped to put an lsof-based check for retained journal files
> > into the cron, but since the fluentd process is run by root, it results in
> > permission errors.
>
> The source of this solution did this exact check, but I thought I had
> determined that the OCP image does not contain 'lsof'. I would love to be
> proven wrong.

lsof is there, but upon closer inspection my issue is that I cannot get the fluentd pid.

> > I will add a note to the solution warning of this possibility, along with
> > a suggested action of either raising the threshold or finding the current
> > source of the excessive /var/log growth.
>
> Isn't the excessive growth due to the fact that the collector fluentd is
> incapable of keeping up with the log generation? The solution is to log
> less, for instance by reducing your log level. Do you have the debug level
> on?

In general I suppose you are correct about the underlying reason. For my testing, I had configured my journal settings on only one node, to get it to start building up deleted files. However, I had a second node on which I did not do this, and there is only one threshold setting. The fluentd pod on the second node will be restarted continually because that node also exceeds the threshold, but restarting its fluentd pod doesn't help because no deleted files are being retained there. I think we just need a note that the threshold is cluster-wide and therefore the journal needs to be configured the same way on each node.
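Regarding that last point, a quick way to spot-check that the journald settings match across nodes; this assumes direct SSH access to the nodes and that the settings live in /etc/systemd/journald.conf rather than in a drop-in file:

  for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do
    echo "== $node"
    # Print the rotation-related journald settings so any differences stand out.
    ssh "$node" 'grep -E "^(SystemMaxUse|SystemMaxFileSize|SystemKeepFree|MaxRetentionSec)" /etc/systemd/journald.conf'
  done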
(In reply to Jack Ottofaro from comment #10)
> (In reply to Jeff Cantrill from comment #9)
> > (In reply to Jack Ottofaro from comment #8)
> > > I tested https://access.redhat.com/solutions/3958661, which works as
> > > is, but I did find one improvement I hoped to make. The cron job will
> > > continually delete the fluentd pod regardless of whether the fluentd
> > > process is retaining deleted journal files. In other words, it makes no
> > > difference why /var/log exceeds the threshold; the fluentd pod will be
> > > deleted. I had hoped to put an lsof-based check for retained journal
> > > files into the cron, but since the fluentd process is run by root, it
> > > results in permission errors.
> >
> > The source of this solution did this exact check, but I thought I had
> > determined that the OCP image does not contain 'lsof'. I would love to be
> > proven wrong.
>
> lsof is there, but upon closer inspection my issue is that I cannot get the
> fluentd pid.

From which context are you trying to get the fluentd pid? If you are on the node, trying to get the pid of the fluentd running in a pod on that node, see https://github.com/openshift/origin-aggregated-logging/blob/release-3.11/test/zzz-duplicate-entries.sh#L35

> > > I will add a note to the solution warning of this possibility, along
> > > with a suggested action of either raising the threshold or finding the
> > > current source of the excessive /var/log growth.
> >
> > Isn't the excessive growth due to the fact that the collector fluentd is
> > incapable of keeping up with the log generation? The solution is to log
> > less, for instance by reducing your log level. Do you have the debug
> > level on?
>
> In general I suppose you are correct about the underlying reason. For my
> testing, I had configured my journal settings on only one node, to get it
> to start building up deleted files. However, I had a second node on which I
> did not do this, and there is only one threshold setting. The fluentd pod
> on the second node will be restarted continually because that node also
> exceeds the threshold, but restarting its fluentd pod doesn't help because
> no deleted files are being retained there.

You might glean some ideas from https://github.com/openshift/origin-aggregated-logging/blob/release-3.11/test/zzz-duplicate-entries.sh

> I think we just need a note that the threshold is cluster-wide and
> therefore the journal needs to be configured the same way on each node.
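Not necessarily what the referenced script does, but one way to resolve the host-side pid from node context on a 3.11 node, assuming the docker runtime and that the container name contains "fluentd":

  # Find the fluentd container and ask the runtime for its host-side pid.
  cid=$(docker ps --filter name=fluentd --format '{{.ID}}' | head -1)
  pid=$(docker inspect --format '{{.State.Pid}}' "$cid")
  echo "fluentd pid: $pid"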
QE couldn't reproduce this issue. Is it related to a specific journald version? I am using systemd-219-62.el7_6.5.x86_64 and openshift3/ose-logging-fluentd/images/v3.11.88-2 (fluent-plugin-systemd 0.0.11, systemd-journal 1.3.3).
(In reply to Anping Li from comment #12)
> QE couldn't reproduce this issue. Is it related to a specific journald
> version? I am using systemd-219-62.el7_6.5.x86_64 and
> openshift3/ose-logging-fluentd/images/v3.11.88-2 (fluent-plugin-systemd
> 0.0.11, systemd-journal 1.3.3).

Was QE able to reproduce the problem with a previous version of OCP, logging, or systemd-journald?
Work in progress: https://github.com/openshift/openshift-docs/pull/14449
The doc LGTM
Looks good.
Content is now published: https://access.redhat.com/documentation/en-us/openshift_container_platform/3.11/html/release_notes/release-notes-ocp-3-11-release-notes#ocp-311-known-issues