The upgrade jobs frequently fail the test case "Cluster should remain functional during upgrade" with:

  "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator monitoring is not available"

Example run: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade/1428340402481532928

That test case can fail for many different reasons, but this particular monitoring operator failure appears to be one of the most frequent. Briefly digging into the artifacts:

'oc get pods' shows one node-exporter pod stuck on its init container:

  openshift-monitoring   node-exporter-5bvsr   2/2   Running    0   3h6m   10.0.205.43   ip-10-0-205-43.ec2.internal   <none>   <none>
  openshift-monitoring   node-exporter-75w29   0/2   Init:0/1   0   175m   10.0.177.20   ip-10-0-177-20.ec2.internal   <none>   <none>
  openshift-monitoring   node-exporter-gzc8z   2/2   Running    0   173m   10.0.157.65   ip-10-0-157-65.ec2.internal   <none>   <none>
  openshift-monitoring   node-exporter-vk4zx   2/2   Running    0   178m   10.0.217.91   ip-10-0-217-91.ec2.internal   <none>   <none>
  openshift-monitoring   node-exporter-wnxpv   2/2   Running    0   3h6m   10.0.136.24   ip-10-0-136-24.ec2.internal   <none>   <none>
  openshift-monitoring   node-exporter-x7n47   2/2   Running    0   3h6m   10.0.138.18   ip-10-0-138-18.ec2.internal   <none>   <none>

'oc get operators' shows the monitoring operator as Available=False, Progressing=True, Degraded=True:

  monitoring   4.9.0-0.nightly-2021-08-18-144658   False   True   True   157m   Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.

In the cluster-monitoring-operator pod log [0], this line shows up a few times:

  W0819 16:28:40.593676   1 tasks.go:71] task 7 of 15: Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 6 ready pods for "node-exporter" daemonset, got 5

[0] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade/1428340402481532928/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-dfbf6d699-lzklv_cluster-monitoring-operator.log
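For reference, the "expected 6 ready pods ... got 5" message is just the operator comparing DaemonSet status counters while it waits for the rollout. Below is a minimal client-go sketch of that kind of check, not the operator's actual code; the kubeconfig path is a placeholder.

package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; not taken from the CI run.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	ds, err := client.AppsV1().DaemonSets("openshift-monitoring").Get(context.TODO(), "node-exporter", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// The operator's message corresponds to a mismatch like this:
	// DesiredNumberScheduled ("expected") vs NumberReady ("got").
	if ds.Status.NumberReady < ds.Status.DesiredNumberScheduled {
		fmt.Printf("daemonset not rolled out: expected %d ready pods, got %d\n",
			ds.Status.DesiredNumberScheduled, ds.Status.NumberReady)
	} else {
		fmt.Println("daemonset rollout complete")
	}
}

A check along these lines, retried until a timeout, would report exactly the "expected 6 ready pods, got 5" failure above when one pod never gets past its init container.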
Hint from @simon:

> hey! looking quickly at the CI logs, I've found that one pod instance couldn't be started on node ip-10-0-177-20.ec2.internal. The events say:
>
>   {
>     "apiVersion": "v1",
>     "count": 807,
>     "eventTime": null,
>     "firstTimestamp": "2021-08-19T13:37:58Z",
>     "involvedObject": {
>       "apiVersion": "v1",
>       "kind": "Pod",
>       "name": "node-exporter-75w29",
>       "namespace": "openshift-monitoring",
>       "resourceVersion": "22157",
>       "uid": "6af24349-66a6-4c67-a759-062e2fa67242"
>     },
>     "kind": "Event",
>     "lastTimestamp": "2021-08-19T16:32:44Z",
>     "message": "unable to ensure pod container exists: failed to create container for [kubepods burstable pod6af24349-66a6-4c67-a759-062e2fa67242] : Unit kubepods-burstable-pod6af24349_66a6_4c67_a759_062e2fa67242.slice already exists.",
>     "metadata": {
>       "creationTimestamp": "2021-08-19T13:38:00Z",
>       "name": "node-exporter-75w29.169cb8bae3eb3430",
>       "namespace": "openshift-monitoring",
>       "resourceVersion": "94761",
>       "uid": "c2bc0ab7-5ca9-404c-9c1c-8f61b36ecdf4"
>     },
>     "reason": "FailedCreatePodContainer",
>     "reportingComponent": "",
>     "reportingInstance": "",
>     "source": {
>       "component": "kubelet",
>       "host": "ip-10-0-177-20.ec2.internal"
>     },
>     "type": "Warning"
>   }
>
> https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade/1428340402481532928/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json

Assigning it to the node team for their evaluation.
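Side note: instead of scanning the whole events.json, the same kubelet events can be pulled per pod with a field selector. A rough client-go sketch, with the kubeconfig path as a placeholder and the namespace/pod name taken from the run above:

package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; not taken from the CI run.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Select only events whose involvedObject is the stuck node-exporter pod.
	events, err := client.CoreV1().Events("openshift-monitoring").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "involvedObject.name=node-exporter-75w29",
	})
	if err != nil {
		log.Fatal(err)
	}

	// Print type, reason, repeat count, and message for each matching event,
	// e.g. Warning / FailedCreatePodContainer / count=807 / "unable to ensure pod container exists: ..."
	for _, e := range events.Items {
		fmt.Printf("%s\t%s\tcount=%d\t%s\n", e.Type, e.Reason, e.Count, e.Message)
	}
}

On a live cluster the equivalent is 'oc get events -n openshift-monitoring --field-selector involvedObject.name=node-exporter-75w29'.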
*** This bug has been marked as a duplicate of bug 1993980 ***