Bug 1913532 - After a 4.6 to 4.7 upgrade, a node went unready
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1927836 1929159
Depends On:
Blocks: 1927373
 
Reported: 2021-01-07 02:10 UTC by Clayton Coleman
Modified: 2024-12-20 19:30 UTC
CC List: 24 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
[sig-node] Managed cluster should report ready nodes the entire duration of the test run [Late]
Last Closed: 2021-07-27 22:36:03 UTC
Target Upstream Version:
Embargoed:


Attachments
prometheus-k8s container_memory_working_set_bytes (144.12 KB, image/png), 2021-01-19 08:22 UTC, Simon Pasquier
number of series in the head (115.45 KB, image/png), 2021-01-19 08:24 UTC, Simon Pasquier
scrape_series_added (81.82 KB, image/png), 2021-01-19 08:25 UTC, Simon Pasquier
wal replay duration (107.30 KB, image/png), 2021-01-19 08:26 UTC, Simon Pasquier
head chunks (113.63 KB, image/png), 2021-01-19 08:27 UTC, Simon Pasquier
Prometheus data screenshot (138.68 KB, image/png), 2021-01-19 08:54 UTC, Pawel Krupa


Links
GitHub openshift/cluster-monitoring-operator pull 1055: Bug 1929277: Adjust priority class of core platform prometheus (open, last updated 2021-02-21 07:17:50 UTC)
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 22:36:27 UTC)

Description Clayton Coleman 2021-01-07 02:10:06 UTC
Jan 06 22:29:36.188 - 999ms W node/ip-10-0-225-225.us-east-2.compute.internal node is not ready

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/14666/rehearse-14666-release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1346913352063913984

This job is testing the new "upgrade then run e2e tests" flow.  While looking through the logs I saw that about 30m after the upgrade completed, during the e2e run, the specified node went unready.  Nodes should never go unready in our e2e tests, so we need to identify why this node went unready during the e2e part of the run (the upgrade completed around 22:00, which included a node rollout as per our standard test).

The upgrade may be unrelated (in which case the bug may just be that 4.7 nodes go unready, which is obviously the same severity).
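
For reference, a minimal sketch of how to pull the node-readiness events out of the gathered artifacts for this job (the artifact layout follows the gather-extra convention used later in this bug; the NodeNotReady reason filter is an assumption about which event the node controller emits):

  $ curl -s <gather-extra artifacts URL>/events.json | jq -r '.items[] | select(.reason == "NodeNotReady") | .firstTimestamp + " " + .involvedObject.name + " " + .message' | sort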

Comment 1 Clayton Coleman 2021-01-07 02:11:25 UTC
Seeing a small number of other jobs fail with similar messages in their event logs:

https://search.ci.openshift.org/?search=node+is+not+ready&maxAge=48h&context=1&type=bug%2Bjunit&name=.*-master.*&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 5 Clayton Coleman 2021-01-13 19:40:17 UTC
I don't see why this causes a node to go unready after the upgrade.  The linked PR can cause API disruption, but that doesn't explain an 8m outage of the node.  I need more context before I agree this is a dupe (it's certainly complicated by the other issue).

Comment 6 Clayton Coleman 2021-01-13 19:44:31 UTC
225 is a worker; there is NO reason an apiserver being down or non-gracefully terminated should cause a substantial readiness outage.

Jan 06 22:34:31.188 - 215s  W node/ip-10-0-225-225.us-east-2.compute.internal node is not ready

That's 4 minutes.  Even an apiserver being down should be unable to cause this AFTER the upgrade is complete.

Comment 7 Clayton Coleman 2021-01-17 15:28:51 UTC
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/14765/rehearse-14765-release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1350692829357150208

Upgrade itself fails due to unavailability of a node blocking rollout:

    * Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: got 1 unavailable nodes

https://search.ci.openshift.org/?search=got+1+unavailable+nodes&maxAge=168h&context=1&type=bug%2Bjunit&name=upgrade&maxMatches=5&maxBytes=20971520&groupBy=job

3% of 4.6-4.7 upgrade runs are failing, but if this is happening randomly post-upgrade, as the rehearse job showed, the true impact is much higher.
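
If someone catches a cluster in this state, a rough sketch (standard oc commands; the namespace and DaemonSet name are taken from the error above) of how to see which node is holding up the node-exporter rollout:

  $ oc -n openshift-monitoring get daemonset node-exporter
  $ oc -n openshift-monitoring get pods -o wide | grep node-exporter
  $ oc get nodes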

Comment 9 W. Trevor King 2021-01-19 00:50:03 UTC
Ravig suspects the issue here is related to Prometheus-induced memory pressure.  Pointing [1] at comment 0's [2] shows the two prometheus-k8s pods consuming under 2 GB for 4.6.8 and around 5 GB for the 4.7 CI release.  Of course, there is also more data being indexed by that point late in the install.  But however it is being generated, the compute-node memory pressure led to the kernel killing a process:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/14666/rehearse-14666-release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1346913352063913984/artifacts/e2e-aws-upgrade/gather-extra/events.json | jq -r '.items[] | select(.reason == "SystemOOM") | .firstTimestamp + " " + (.count | tostring) + " " + .involvedObject.name + " " + .message' | sort
  2021-01-06T23:28:37Z 1 ip-10-0-225-225.us-east-2.compute.internal System OOM encountered, victim process: opm, pid: 8975

opm is part of marketplace operations:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/14666/rehearse-14666-release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1346913352063913984/artifacts/e2e-aws-upgrade/gather-extra/pods.json | jq -r '.items[] | .metadata as $p | .spec.containers[] | . as $c | select(.command[0] == "opm") | $p.namespace + " " + $p.name + " " + $c.name'
  openshift-marketplace e6b2a21b99ccbb386c67a8804cb024b0e0a9f43a1b0671f064232dc1defh2lb extract

But marketplace memory use remains reasonable throughout the update, so it was probably just an innocent bystander.

[1]: https://promecieus.svc.ci.openshift.org/
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/14666/rehearse-14666-release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1346913352063913984
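
As a follow-up, a hedged sketch (same pods.json artifact as above; resource fields may be absent on some containers, hence the "none" fallback) of how to dump the memory requests/limits of the prometheus-k8s pods for comparison:

  $ curl -s <pods.json URL above> | jq -r '.items[] | select(.metadata.namespace == "openshift-monitoring" and (.metadata.name | startswith("prometheus-k8s"))) | .metadata.name as $p | .spec.containers[] | $p + " " + .name + " requests=" + (.resources.requests.memory // "none") + " limits=" + (.resources.limits.memory // "none")'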

Comment 10 Ryan Phillips 2021-01-19 01:00:23 UTC
Might be related to these two issues opened upstream against prometheus: https://github.com/prometheus/prometheus/issues/8310 and https://github.com/prometheus/prometheus/issues/8357

Clayton mentioned in Slack that it looks like this happens after 30 minutes of a node being Ready. Perhaps it's a compaction issue?
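
To check the compaction theory on an affected cluster, a sketch of the TSDB metrics worth graphing (standard Prometheus self-metrics; the namespace matcher is an assumption about how the platform Prometheus labels its own targets):

  # compactions run so far by each prometheus-k8s pod
  prometheus_tsdb_compactions_total{namespace="openshift-monitoring"}
  # chunks currently held in the head block
  prometheus_tsdb_head_chunks{namespace="openshift-monitoring"}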

Comment 11 Scott Dodson 2021-01-19 01:25:47 UTC
As long as Prometheus is suspected here, it's worth mentioning that I don't think we ever got to the bottom of this bug, where replaying the WAL causes Prometheus memory to balloon.

https://bugzilla.redhat.com/show_bug.cgi?id=1810111

Comment 12 Simon Pasquier 2021-01-19 08:22:56 UTC
Created attachment 1748648
prometheus-k8s container_memory_working_set_bytes

Comment 13 Simon Pasquier 2021-01-19 08:24:27 UTC
Created attachment 1748649
number of series in the head

Comment 14 Simon Pasquier 2021-01-19 08:25:08 UTC
Created attachment 1748651
scrape_series_added

Comment 15 Simon Pasquier 2021-01-19 08:26:39 UTC
Created attachment 1748652
wal replay duration

Comment 16 Simon Pasquier 2021-01-19 08:27:29 UTC
Created attachment 1748653
head chunks

Comment 17 Simon Pasquier 2021-01-19 08:35:12 UTC
From what I can tell, the increase in Prometheus memory usage is due to series churn caused by pods being rescheduled. Attachment 1748649 shows that the number of series goes from 200k to more than 1M after the upgrade. From attachment 1748651, the main contributors are kubelet, the Kubernetes API/controllers, and kube-state-metrics.
From the metrics, there's no compaction happening.

@pkrupa mentioned out of band that the OOM is for the Thanos querier, which is a hint at bug 1907496.
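
For anyone reproducing this analysis, roughly the kind of queries behind the attachments above (the metric names are standard; the label matchers are assumptions):

  # series in the TSDB head, per prometheus-k8s pod
  prometheus_tsdb_head_series{namespace="openshift-monitoring"}
  # new series added per scrape, summed per job
  sum by (job) (scrape_series_added)
  # working-set memory of the prometheus-k8s containers
  container_memory_working_set_bytes{namespace="openshift-monitoring", pod=~"prometheus-k8s-.*"}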

Comment 18 Pawel Krupa 2021-01-19 08:54:49 UTC
Created attachment 1748656
Prometheus data screenshot

This seems to be related to [1], where the Thanos querier is OOMing and taking down the node. The Prometheus data [2] seems to initially confirm it.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1906496
[2]: (see attachment "Prometheus data screenshot")

Comment 20 Clayton Coleman 2021-02-12 22:13:28 UTC
The expectation is that an OOM on the node doesn't kill the node.  The current discussion is about identifying why a simple Prometheus or Thanos querier OOM kill doesn't resolve the issue and the node instead goes unready.
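
To separate "a container got OOM-killed" from "the node itself went unready", a rough sketch (standard oc and journalctl invocations; the grep pattern is an approximation of the kernel OOM log lines):

  $ oc get events -A --field-selector reason=SystemOOM
  $ oc debug node/<node-name> -- chroot /host journalctl -k | grep -i oom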

Comment 21 ravig 2021-02-12 22:17:02 UTC
Changing the ownership to the Node team to evaluate the next steps.

Comment 22 Fabian Deutsch 2021-02-15 08:50:43 UTC
@rnetser was observing this in CNV testing.

Comment 23 Ruth Netser 2021-02-15 09:07:50 UTC
@rphillips Can we close bug 1927836 as duplicate of this one?

Comment 25 Ryan Phillips 2021-02-16 14:49:04 UTC
*** Bug 1927836 has been marked as a duplicate of this bug. ***

Comment 26 Ryan Phillips 2021-02-16 15:33:32 UTC
*** Bug 1929159 has been marked as a duplicate of this bug. ***

Comment 29 Sergiusz Urbaniak 2021-03-03 08:39:04 UTC
*** Bug 1934163 has been marked as a duplicate of this bug. ***

Comment 30 Simon Pasquier 2021-03-03 10:44:49 UTC
I've looked at recent CI failures mentioning "node is not ready" [1] and I couldn't find evidence that Prometheus was involved (see below for the details).

We already have bug 1918683 to track Prometheus OOMs so I don't think that there's anything left for the monitoring team regarding this bug. For the time being, we have decreased the priority class for the monitoring components (see bug 1929277 and bug 1929278) which means that nodes stay ready even when Prometheus uses an excessive amount of memory.
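
For reference, a quick way (a sketch; the pod name is only an example) to confirm which priority class the platform Prometheus pods end up with after that change:

  $ oc get priorityclass
  $ oc -n openshift-monitoring get pod prometheus-k8s-0 -o jsonpath='{.spec.priorityClassName}{"\n"}'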

@Ryan: do you agree if I reassign the bug to the Node team? I guess it should be moved to MODIFIED.

* https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-ovirt/1366796210756653056
The cluster has only one worker node (16 GB RAM), which went down after 1h. We have no logs from the node.

* https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-ovn-upgrade/1366230606018514944
All nodes are up at the end of the job.
There is an occurrence of "ci-op-7hkdtdcj-2969a-slrfl-worker-centralus1-8bvnv node is not ready" in build-log.txt around 05:14:52 [2].
From [3], node ci-op-7hkdtdcj-2969a-slrfl-worker-centralus1-8bvnv got drained at 05:12:36 because of an update from machine-config-daemon. The node rebooted and recorded itself as ready at 05:17:19.

* https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4691/pull-ci-openshift-installer-master-e2e-aws-upgrade/1364806458763382784
All nodes are up at the end of the job.
build-log.txt [4] shows that nodes are being reported as not ready. For instance ip-10-0-157-184.us-west-2.compute.internal is reported not ready around 06:53:17. Looking at the node's journal [5], it matches the time when the node got drained/restarted by machine-config-daemon.

* https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/14666/rehearse-14666-release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1346913352063913984
All nodes are up at the end of the job.
build-log.txt [6] shows that ip-10-0-225-225.us-east-2.compute.internal is reported as not ready around 21:40:41. Again, it matches the time when this node was drained/restarted by machine-config-daemon [7] (see the journal grep sketch after the footnotes).

[1] https://search.ci.openshift.org/?search=node+is+not+ready&maxAge=168h&context=1&type=junit&name=.*-master.*&maxMatches=5&maxBytes=20971520&groupBy=job
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-ovn-upgrade/1366230606018514944/artifacts/e2e-azure-ovn-upgrade/openshift-e2e-test/build-log.txt
[3] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-ovn-upgrade/1366230606018514944/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/nodes/ci-op-7hkdtdcj-2969a-slrfl-worker-centralus1-8bvnv/journal
[4] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/4691/pull-ci-openshift-installer-master-e2e-aws-upgrade/1364806458763382784/artifacts/e2e-aws-upgrade/openshift-e2e-test/build-log.txt
[5] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/4691/pull-ci-openshift-installer-master-e2e-aws-upgrade/1364806458763382784/artifacts/e2e-aws-upgrade/gather-extra/artifacts/nodes/ip-10-0-157-184.us-west-2.compute.internal/journal
[6] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/14666/rehearse-14666-release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1346913352063913984/artifacts/e2e-aws-upgrade/openshift-e2e-test/e2e.log
[7] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/14666/rehearse-14666-release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1346913352063913984/artifacts/e2e-aws-upgrade/gather-extra/nodes/ip-10-0-225-225.us-east-2.compute.internal/journal
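
The drain/restart correlation above can be spot-checked directly against the node journals linked in [3], [5] and [7]; a hedged sketch (the grep pattern is an approximation, not the exact machine-config-daemon log strings):

  $ curl -s <node journal URL from [7]> | grep -iE 'machine-config-daemon|drain|reboot' | head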

Comment 31 Simon Pasquier 2021-03-04 10:42:48 UTC
After discussing offline with Ryan, we agree that the issue described here should now be fixed, hence moving to MODIFIED.

Comment 37 errata-xmlrpc 2021-07-27 22:36:03 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

