Jan 06 22:29:36.188 - 999ms W node/ip-10-0-225-225.us-east-2.compute.internal node is not ready

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/14666/rehearse-14666-release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1346913352063913984 is testing the new "upgrade then run e2e tests" flow. While looking through the logs I saw that about 30m after the upgrade completed, during the e2e run, the node above went unready. Nodes should never go unready in our e2e tests, so we need to identify why this node went unready during the e2e part of the run (the upgrade completed around 22:00 and included a node rollout, as per our standard test). The upgrade may be unrelated, in which case the bug may simply be that 4.7 nodes go unready, which is obviously the same severity.
A small number of other jobs are failing with similar messages in their event logs:
https://search.ci.openshift.org/?search=node+is+not+ready&maxAge=48h&context=1&type=bug%2Bjunit&name=.*-master.*&maxMatches=5&maxBytes=20971520&groupBy=job
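If those jobs also capture gather-extra artifacts, a quick way to pull out the readiness flaps is to filter events.json the same way as for comment 0's job. This is only a sketch, assuming the node controller's usual NodeNotReady/NodeReady event reasons:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/14666/rehearse-14666-release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1346913352063913984/artifacts/e2e-aws-upgrade/gather-extra/events.json \
  | jq -r '.items[] | select(.reason == "NodeNotReady" or .reason == "NodeReady") | .firstTimestamp + " " + .reason + " " + .involvedObject.name' \
  | sort
# substitute the matching gather-extra/events.json URL to check other runs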
I don't see why this would cause a node to go unready after the upgrade. The linked PR can cause API disruption, but that doesn't explain an 8m outage of the node. I need more context before I agree this is a dupe (it's certainly complicated by the other issue).
225 is a worker; there is NO reason an apiserver being down or non-gracefully terminated should cause a substantial readiness outage.

Jan 06 22:34:31.188 - 215s W node/ip-10-0-225-225.us-east-2.compute.internal node is not ready

That's almost 4 minutes. Even an apiserver being down should be unable to cause this AFTER the upgrade is complete.
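To double-check that the outage is on the node side rather than a brief API reporting blip, the Ready condition transition times can be read straight from the node objects. A rough sketch, assuming gather-extra also dumps a nodes.json next to the pods.json/events.json used above:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/14666/rehearse-14666-release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1346913352063913984/artifacts/e2e-aws-upgrade/gather-extra/nodes.json \
  | jq -r '.items[] | .metadata.name as $n | .status.conditions[] | select(.type == "Ready") | $n + " Ready=" + .status + " lastHeartbeatTime=" + .lastHeartbeatTime + " lastTransitionTime=" + .lastTransitionTime'
# if the kubelet stopped posting status, lastHeartbeatTime shows when the last update landed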
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/14765/rehearse-14765-release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1350692829357150208

The upgrade itself fails because an unavailable node blocks the rollout:

* Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: got 1 unavailable nodes

https://search.ci.openshift.org/?search=got+1+unavailable+nodes&maxAge=168h&context=1&type=bug%2Bjunit&name=upgrade&maxMatches=5&maxBytes=20971520&groupBy=job

About 3% of 4.6-to-4.7 upgrade runs are failing on this, but if it is also happening randomly after the upgrade, as the rehearse job showed, the true impact is much higher.
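For anyone reproducing this on a live cluster, the quickest way to see which node is blocking the node-exporter rollout is to compare the DaemonSet status with node readiness. A minimal sketch using standard oc/jq calls:

$ oc -n openshift-monitoring get daemonset node-exporter -o json \
  | jq -r '.status | "desired=\(.desiredNumberScheduled) ready=\(.numberReady) unavailable=\(.numberUnavailable // 0)"'
$ oc get nodes -o wide
# the NotReady worker should line up with the unavailable node-exporter pod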
Ravig suspects the issue here is related to Prometheus-induced memory pressure. Pointing [1] at comment 0's [2] shows the two prometheus-k8s pods consuming under 2 GB on 4.6.8 and around 5 GB on the 4.7 CI release. Of course, there is also more data being indexed by that point late in the install. But however the pressure is being generated, the compute-node memory pressure led to the kernel killing a process:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/14666/rehearse-14666-release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1346913352063913984/artifacts/e2e-aws-upgrade/gather-extra/events.json | jq -r '.items[] | select(.reason == "SystemOOM") | .firstTimestamp + " " + (.count | tostring) + " " + .involvedObject.name + " " + .message' | sort
2021-01-06T23:28:37Z 1 ip-10-0-225-225.us-east-2.compute.internal System OOM encountered, victim process: opm, pid: 8975

opm is part of marketplace operations:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/14666/rehearse-14666-release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1346913352063913984/artifacts/e2e-aws-upgrade/gather-extra/pods.json | jq -r '.items[] | .metadata as $p | .spec.containers[] | . as $c | select(.command[0] == "opm") | $p.namespace + " " + $p.name + " " + $c.name'
openshift-marketplace e6b2a21b99ccbb386c67a8804cb024b0e0a9f43a1b0671f064232dc1defh2lb extract

But marketplace memory use remains reasonable throughout the update, so it was probably just an innocent bystander.

[1]: https://promecieus.svc.ci.openshift.org/
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/14666/rehearse-14666-release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1346913352063913984
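The kernel's view of the OOM should also be in the node journal from gather-extra (the same artifact linked as [7] in a later comment). A sketch of the grep I'd use; the exact message wording varies by kernel version, so the pattern is an assumption:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/14666/rehearse-14666-release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1346913352063913984/artifacts/e2e-aws-upgrade/gather-extra/nodes/ip-10-0-225-225.us-east-2.compute.internal/journal \
  | grep -iE 'invoked oom-killer|out of memory' | head
# pipe through gunzip first if the stored journal artifact is gzip-compressed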
Might be related to these two issues opened against Prometheus:
https://github.com/prometheus/prometheus/issues/8310
https://github.com/prometheus/prometheus/issues/8357

Clayton mentioned in Slack that it looks like this happens after 30 minutes of a node being Ready. Perhaps it's a compaction issue?
As long as Prometheus is suspected here, it's worth mentioning that I don't think we ever got to the bottom of this bug, where replaying the WAL causes Prometheus memory to balloon: https://bugzilla.redhat.com/show_bug.cgi?id=1810111
Created attachment 1748648 [details] prometheus-k8s container_memory_working_set_bytes
Created attachment 1748649 [details] number of series in the head
Created attachment 1748651 [details] scrape_series_added
Created attachment 1748652 [details] wal replay duration
Created attachment 1748653 [details] head chunks
From what I can tell, the increase in Prometheus memory usage is due to series churn caused by pods being rescheduled. Attachment 1748649 [details] shows that the number of series goes from 200k to more than 1M after the upgrade. From attachment 1748651 [details], the main contributors are kubelet, the Kubernetes API/controller, and kube-state-metrics. From the metrics, there's no compaction happening.

@pkrupa mentioned out of band that the OOM is for the Thanos querier, which is a hint at bug 1907496.
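For reference, the attachments above can be roughly reproduced against the in-cluster Prometheus via its HTTP query API. A sketch, assuming the usual prometheus-k8s route in openshift-monitoring and a token with access to the monitoring stack:

$ PROM=https://$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')
$ TOKEN=$(oc whoami -t)
# head series before/after the upgrade (compare attachment 1748649)
$ curl -sk -H "Authorization: Bearer $TOKEN" "$PROM/api/v1/query" --data-urlencode 'query=prometheus_tsdb_head_series'
# which scrape jobs are churning series the most (compare attachment 1748651)
$ curl -sk -H "Authorization: Bearer $TOKEN" "$PROM/api/v1/query" --data-urlencode 'query=topk(5, sum by (job) (scrape_series_added))'
# confirm whether any compaction ran in the window
$ curl -sk -H "Authorization: Bearer $TOKEN" "$PROM/api/v1/query" --data-urlencode 'query=increase(prometheus_tsdb_compactions_total[1h])'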
Created attachment 1748656 [details]
Prometheus data screenshot

This seems to be related to [1], where the Thanos querier is OOMing and taking down the node. The Prometheus data [2] seems to initially confirm it.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1906496
[2]: (see attachment "Prometheus data screenshot")
The expectation is that an OOM on the node doesn't take the node down. The current discussion is about identifying why a simple Prometheus or Thanos querier OOM kill doesn't resolve the issue and the node instead goes unready.
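To see what the kubelet itself was doing while the node was unready, the same journal artifact can be grepped for memory-pressure handling. The search terms here are assumptions about the usual kubelet log strings, not confirmed hits from this run:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/14666/rehearse-14666-release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1346913352063913984/artifacts/e2e-aws-upgrade/gather-extra/nodes/ip-10-0-225-225.us-east-2.compute.internal/journal \
  | grep -iE 'eviction manager|PLEG is not healthy|MemoryPressure' | head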
Changing the ownership to the Node team to evaluate the next steps.
@rnetser was observing this in CNV testing.
@rphillips Can we close bug 1927836 as duplicate of this one?
*** Bug 1927836 has been marked as a duplicate of this bug. ***
*** Bug 1929159 has been marked as a duplicate of this bug. ***
*** Bug 1934163 has been marked as a duplicate of this bug. ***
I've looked at recent CI failures mentioning "node is not ready" [1] and I couldn't find evidence that Prometheus was involved (see below for the details). We already have bug 1918683 to track Prometheus OOMs, so I don't think that there's anything left for the monitoring team regarding this bug. For the time being, we have decreased the priority class of the monitoring components (see bug 1929277 and bug 1929278), which means that nodes stay ready even when Prometheus uses an excessive amount of memory.

@Ryan: do you agree with reassigning the bug to the Node team? I guess it should be moved to MODIFIED.

* https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-ovirt/1366796210756653056
The cluster has only one worker node (16 GB RAM), which went down after 1h. We have no logs from the node.

* https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-ovn-upgrade/1366230606018514944
All nodes are up at the end of the job. There is an occurrence of "ci-op-7hkdtdcj-2969a-slrfl-worker-centralus1-8bvnv node is not ready" in build-log.txt around 05:14:52 [2]. From [3], node ci-op-7hkdtdcj-2969a-slrfl-worker-centralus1-8bvnv got drained at 05:12:36 because of an update from machine-config-daemon. The node rebooted and recorded itself as ready at 05:17:19.

* https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4691/pull-ci-openshift-installer-master-e2e-aws-upgrade/1364806458763382784
All nodes are up at the end of the job. build-log.txt [4] shows that nodes are being reported as not ready. For instance, ip-10-0-157-184.us-west-2.compute.internal is reported not ready around 06:53:17. Looking at the node's journal [5], it matches the time when the node got drained/restarted by machine-config-daemon.

* https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/14666/rehearse-14666-release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1346913352063913984
All nodes are up at the end of the job. build-log.txt [6] shows that ip-10-0-225-225.us-east-2.compute.internal is reported as not ready around 21:40:41. Again it matches the time when this node was drained/restarted by machine-config-daemon [7]. (A rough command sketch for reproducing this correlation follows the links below.)
[1] https://search.ci.openshift.org/?search=node+is+not+ready&maxAge=168h&context=1&type=junit&name=.*-master.*&maxMatches=5&maxBytes=20971520&groupBy=job
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-ovn-upgrade/1366230606018514944/artifacts/e2e-azure-ovn-upgrade/openshift-e2e-test/build-log.txt
[3] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-ovn-upgrade/1366230606018514944/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/nodes/ci-op-7hkdtdcj-2969a-slrfl-worker-centralus1-8bvnv/journal
[4] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/4691/pull-ci-openshift-installer-master-e2e-aws-upgrade/1364806458763382784/artifacts/e2e-aws-upgrade/openshift-e2e-test/build-log.txt
[5] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/4691/pull-ci-openshift-installer-master-e2e-aws-upgrade/1364806458763382784/artifacts/e2e-aws-upgrade/gather-extra/artifacts/nodes/ip-10-0-157-184.us-west-2.compute.internal/journal
[6] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/14666/rehearse-14666-release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1346913352063913984/artifacts/e2e-aws-upgrade/openshift-e2e-test/e2e.log
[7] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/14666/rehearse-14666-release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1346913352063913984/artifacts/e2e-aws-upgrade/gather-extra/nodes/ip-10-0-225-225.us-east-2.compute.internal/journal
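For anyone re-checking other runs, the correlation above can be reproduced mechanically. A sketch using the AWS job's artifacts ([4] and [5]); the grep patterns for the machine-config-daemon drain are assumptions about the log wording, not exact strings from this journal:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/4691/pull-ci-openshift-installer-master-e2e-aws-upgrade/1364806458763382784/artifacts/e2e-aws-upgrade/openshift-e2e-test/build-log.txt \
  | grep 'node is not ready'
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/4691/pull-ci-openshift-installer-master-e2e-aws-upgrade/1364806458763382784/artifacts/e2e-aws-upgrade/gather-extra/artifacts/nodes/ip-10-0-157-184.us-west-2.compute.internal/journal \
  | grep -iE 'machine-config-daemon|drain|reboot' | head
# the "not ready" timestamps in the build log should fall inside the drain/reboot window in the journal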
After discussing offline with Ryan, we agree that the issue described here should be fixed, hence moving to MODIFIED.
I did not find any "node is not ready" failures related to monitoring:
https://search.ci.openshift.org/?search=node+is+not+ready&maxAge=168h&context=1&type=junit&name=.*-master.*&maxMatches=5&maxBytes=20971520&groupBy=job

Closing it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438