Description of problem:
I have a cluster that was on 4.1.23 (upgraded continuously from about 4.1.4).
The upgrade to either 4.1.24 or 4.1.25 fails with a download error:
info: An upgrade is in progress. Unable to apply 4.1.25: could not download the update
oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.23    True        True          44h     Unable to apply 4.1.25: could not download the update
All Cluster Operators show on 4.1.23.
oc adm must-gather is at https://drive.google.com/open?id=18mqD6BpEwAQbApb1_5MD9j-cMBckBIA6
I can provide a kubeconfig as well to poke around there.
(In reply to Wolfgang Kulhanek from comment #0)
> oc adm must-gather is at
Can you open this up to redhat.com users, if necessary moving it to our account?
Done. I think. :-)
All the pods that are used to download the release image are failing due to DiskPressure.
- apiVersion: v1
- apiVersion: batch/v1
- mkdir -p /etc/cvo/updatepayloads/fbbe7taUDiPpcveIOxoGqQ && mv /manifests /etc/cvo/updatepayloads/fbbe7taUDiPpcveIOxoGqQ/manifests && mkdir -p /etc/cvo/updatepayloads/fbbe7taUDiPpcveIOxoGqQ && mv /release-manifests
- mountPath: /etc/cvo/updatepayloads
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
- name: default-dockercfg-94rsb
- effect: NoExecute
- key: node-role.kubernetes.io/master
- effect: NoExecute
- name: default-token-82cr5
message: 'Pod The node had condition: [DiskPressure]. '
And the node is actually under disk pressure:
- effect: NoSchedule
- effect: NoSchedule
Is it possible to find out what's taking up that space, and what's the disk size?
This is a regular IPI install on AWS. That node that these pods are running on is a master (master-1 to be exact). It's a regular m5.2xlarge instance.
EBS for the root device, 120GB.
Root disk shows 86% full
$ du -hc / | sort -rh | head -20
Biggest offender seems to be:
sh-4.4# du -hc . | sort -rh | head -20
Any idea what to do here? Delete everything in that directory? Reboot the master (from the AWS Console)?
Logging folks might know what to do about Fluentd disk space.
(In reply to W. Trevor King from comment #5)
> Logging folks might know what to do about Fluentd disk space.
This is directly related to the following behaviors of fluent, which mounts the host filesystem and relies on it for buffering:
1. Fluent buffers container logs to disk if it is unable to push them off the node.
2. Fluent writes its own logs to disk to avoid a feedback loop for ingestion. These logs are supposed to be rotated by fluent's logging library.
The only workaround at the moment is to truncate the buffer directory on the node to remove the buffered logs.
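The workaround above can be sketched roughly as follows. The node name is a placeholder, and the buffer path is an assumption based on the /var/lib/fluentd directory that shows up later in this bug; verify both on your own cluster before removing anything:

```shell
#!/bin/sh
# Sketch of the manual workaround: inspect, then (only if you accept
# losing the unshipped log data) clear fluentd's on-node file buffers.
# NODE is a hypothetical placeholder; BUFFER_DIR is an assumption based
# on the fluentd file-buffer location seen elsewhere in this bug.
NODE=ip-10-0-1-1.ec2.internal
BUFFER_DIR=/var/lib/fluentd

# Measure how much space the buffers are consuming on the node.
inspect_buffers() {
  oc debug node/"$NODE" -- chroot /host du -sh "$BUFFER_DIR"
}

# Remove the buffered chunks. This discards log data that fluentd has
# not yet shipped to Elasticsearch.
truncate_buffers() {
  oc debug node/"$NODE" -- chroot /host sh -c "rm -rf $BUFFER_DIR/*"
}

# Only attempt this against a live cluster.
if command -v oc >/dev/null 2>&1; then
  inspect_buffers
fi
```

Truncating the buffers loses whatever fluent had not yet pushed to ES, so it really is a last resort rather than a fix.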
Is this abnormally high levels of logging or would one expect this of all normal clusters?
(In reply to Scott Dodson from comment #7)
> Is this abnormally high levels of logging or would one expect this of all
> normal clusters?
If ES is unable to ingest the logs, or refuses them, and fluent is unable to push and clear its buffers, then I expect this can potentially occur. High levels of logging (e.g. high volume, DEBUG) could lead to this in any cluster.
So this is interesting. I checked this morning wanting to go and clean out the fluentd directories - and sure enough over the weekend my cluster had healed itself. The FluentD directories are now only 8k. And the upgrade actually ran through to 4.1.25.
But this still doesn't explain the 80G of FluentD data on Friday. Where did those come from? Where did they go? Is it possible the curator ran, cleaned up Elasticsearch, and then fluentd was able to send things over? Although Elasticsearch hadn't shown any abnormalities last week (as in error messages). Elasticsearch has a 200Gi PVC. Although I see two of them - the newer one got created Friday and is the one currently in use. The timestamp is consistent with the upgrade.
This cluster isn't super busy quite yet. It's actually one of our less used clusters.
I'm at a bit of a loss as to what to tell you guys and where to go from here...
(In reply to Wolfgang Kulhanek from comment #10)
> So this is interesting. I checked this morning wanting to go and clean out
> the fluentd directories - and sure enough over the weekend my cluster had
> healed itself. The FluentD directories are now only 8k. And the upgrade
> actually ran through to 4.1.25.
Fluent will push all its buffers if it is not receiving back pressure from ES. Assuming the rate at which it pushes is faster than the node's log rate, it will eventually clear its buffers.
> But this still doesn't explain the 80G of FluentD data on Friday. Where did
> those come from? Where did they go? Is it possible the curator ran, cleaned
> up Elasticsearch and then fluentd was able to send things over? Although
> Elasticsearch hadn't shown any abnormalities last week (as in error
> messages). Elasticsearch has a 200Gi PVC. Although I see two of them - the
> newer one got created Friday and is the currently used. Time stamp is
> consistent with the upgrade.
We have an internal thread discussing this issue. One point that was brought up related to fluent performance was that if ES is not properly sized, it could have issues with ingestion and thus reject documents. In regards to curation, maybe you were at the threshold limits where ES will no longer index data, thus causing backpressure. Curator could indeed have removed indices, dropping you below the threshold and allowing ES to index more. We are unable to know unless we had a snapshot of the environment or an understanding of the state of the disk.
> This cluster isn't super busy quite yet. It's actually one of our less used
> I'm at a bit of a loss as to what to tell you guys and where to go from
If this is a standard OCP 4.x installation, then a lot of internal ES cluster statistics are captured in Prometheus. We should definitely take advantage of the data recorded in Prometheus to troubleshoot cases like this.
Created https://bugzilla.redhat.com/show_bug.cgi?id=1796133 to address writing fluent's logs to disk. This will provide partial relief of this issue.
I believe that by moving the container logs back to stdout and correcting how we calculate fluent's buffers, this issue would be resolved for 4.3 and 4.4. For cluster logging that is only writing to our managed instance of Elasticsearch, I would expect fluent to stop reading files when it reaches a total of 256M of buffers, which is the value we provide for FILE_BUFFER_LIMIT. There is the main pipeline and the retry pipeline, which equates to 2 outputs. The buffers get chunked to 8M each, so the total number between the main and retry pipelines should be 32 chunks max. I would expect fluent to stop reading files when we get to that point.
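The arithmetic above can be checked with a quick back-of-the-envelope calculation, using the values stated in the comment (256M FILE_BUFFER_LIMIT, 8M chunks, shared between the main and retry outputs):

```shell
#!/bin/sh
# Values taken from the comment above: a 256M file-buffer limit split
# across the main and retry pipelines, with 8M buffer chunks.
FILE_BUFFER_LIMIT_MB=256
CHUNK_SIZE_MB=8
OUTPUTS=2   # main pipeline + retry pipeline

# 256M / 8M = 32 chunks max, shared between the two outputs.
TOTAL_CHUNKS=$((FILE_BUFFER_LIMIT_MB / CHUNK_SIZE_MB))
PER_OUTPUT=$((TOTAL_CHUNKS / OUTPUTS))
echo "max chunks total: $TOTAL_CHUNKS ($PER_OUTPUT per output)"
```

Which is consistent with the 32-chunk cap described above; the earlier miscalculation for 4.1 came from not dividing the limit across the number of outputs.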
If we can verify this for those releases, we likely have to fix the calculation for 4.1 and prior, where the number of outputs was not properly considered.
I'm asking you to verify we have resolved this issue in 4.4. I think you could deploy a cluster logging instance, make sure logs flow, and then remove the ES instance. I expect that once you get to the max number of chunks, fluent will stop collecting logs and you will see no additional buffers created.
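That verification can be scripted roughly as below, matching the timestamped du samples recorded later in this bug. The namespace and label selector are assumptions based on a default cluster-logging install:

```shell
#!/bin/sh
# Rough sketch of the verification: with ES removed, repeatedly sample
# the buffer directory of a fluentd pod. Growth should stop once the
# chunk limit is reached. NS and the label selector are assumptions
# for a default cluster-logging deployment.
NS=openshift-logging
SAMPLES=5

# Only run against a live cluster.
if command -v oc >/dev/null 2>&1; then
  POD=$(oc -n "$NS" get pods -l component=fluentd -o name | head -n1)
  i=0
  while [ "$i" -lt "$SAMPLES" ]; do
    date
    oc -n "$NS" exec "${POD#pod/}" -- du -sh /var/lib/fluentd
    sleep 60
    i=$((i + 1))
  done
fi
```

If the fix works, the reported size should plateau at roughly the chunk limit instead of growing without bound.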
The buffer grew to 7.7G. fluentd image: openshift/ose-logging-fluentd:v4.4.0-202001310654
[anli@preserve-docker-slave 79939]$ oc get pod fluentd-z8fx7 -o yaml |grep BUFFER -A 2
- name: BUFFER_QUEUE_LIMIT
- name: BUFFER_SIZE_LIMIT
- name: FILE_BUFFER_LIMIT
- name: FLUENTD_CPU_LIMIT
oc exec fluentd-z8fx7 -- du -sh /var/lib/fluentd/clo_default_output_es/
*** Bug 1819964 has been marked as a duplicate of this bug. ***
*** Bug 1824427 has been marked as a duplicate of this bug. ***
Verified and passed using
Tue May 5 21:05:08 EDT 2020
Tue May 5 21:06:10 EDT 2020
Tue May 5 21:07:11 EDT 2020
Tue May 5 21:08:13 EDT 2020
Tue May 5 21:09:14 EDT 2020
Tue May 5 21:31:50 EDT 2020
Tue May 5 22:02:42 EDT 2020
Tue May 5 22:16:05 EDT 2020
oc logs fluentd-ff8nn
2020-05-06 02:22:18 +0000 [warn]: [clo_default_output_es] failed to write data into buffer by buffer overflow action=:block
(In reply to Oscar Casal Sanchez from comment #23)
> Is going this issue to be backported to OCP 4.2?
We will not be backporting this change to 4.2
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.