Bug 1840885 - 4.3->4.3 update stuck on UpdatingnodeExporterFailed for at least an hour
Summary: 4.3->4.3 update stuck on UpdatingnodeExporterFailed for at least an hour
Keywords:
Status: CLOSED DUPLICATE of bug 1818806
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-27 19:20 UTC by W. Trevor King
Modified: 2020-05-28 09:13 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-28 08:46:30 UTC
Target Upstream Version:
Embargoed:



Description W. Trevor King 2020-05-27 19:20:44 UTC
4.3.10 -> 4.3.22 update CI job [1] hung on monitoring for at least an hour:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/422/artifacts/launch/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "monitoring") | .status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' | sort
2020-05-21T05:08:40Z Available=False -: -
2020-05-21T05:08:40Z Degraded=True UpdatingnodeExporterFailed: Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 6, updated: 3, ready: 5, unavailable: 1)
2020-05-21T06:10:14Z Progressing=True RollOutInProgress: Rolling out the stack.
2020-05-21T06:10:14Z Upgradeable=True RollOutInProgress: Rollout of the monitoring stack is in progress. Please wait until it finishes.
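
For anyone debugging a live cluster rather than CI artifacts, standard oc commands along these lines (not taken from this job; the pod label selector may differ by release) show the same rollout numbers directly:

$ oc -n openshift-monitoring get daemonset node-exporter
$ oc -n openshift-monitoring get pods -l app=node-exporter -o wide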

A few issues:

* Going degraded on a DaemonSet may be a DaemonSet controller bug.  For example, see bug 1790989.  This UpdatingnodeExporterFailed reason might be bug 1765064, but now we have CI's must-gather to work with.
* Going Available=False without a reason and message is bad.  We should explain the outage to the cluster admin (see the example after this list).
* "UpdatingnodeExporterFailed" should probably be "UpdatingNodeExporterFailed" (capitalizing the N).
* Setting RollOutInProgress as the reason for Upgradeable is its own issue, as discussed in bug 1837832 and now tracked in Jira.

This bug should be about the first of those, but RFE Jiras for the middle two would make sense to me too.
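
As a hypothetical example for the Available=False point, even reusing the Degraded wording would be an improvement over the bare condition above:

2020-05-21T05:08:40Z Available=False UpdatingNodeExporterFailed: Rollout of the monitoring stack is blocked: daemonset node-exporter is not ready. status: (desired: 6, updated: 3, ready: 5, unavailable: 1)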

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/422

Comment 1 Pawel Krupa 2020-05-28 08:46:30 UTC
We are aware of this, and Lili already has a PoC to fix this issue [1].

Since this is not a bug but an RFE, and we already have two JIRA issues for it [2][3], I am closing this report.

[1]: https://github.com/lilic/cluster-monitoring-operator/commit/9f8a60afc2d4b37840c201762c2e583948ccf9d4
[2]: https://issues.redhat.com/browse/MON-1139
[3]: https://issues.redhat.com/browse/MON-1126

Comment 2 Junqi Zhao 2020-05-28 08:58:59 UTC
From https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/422/artifacts/launch/events.json:
*******************************************
            "lastTimestamp": null,
            "message": "0/6 nodes are available: 2 Insufficient cpu, 5 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match node selector.",
            "metadata": {
                "creationTimestamp": "2020-05-21T05:04:36Z",
                "name": "node-exporter-ldshn.1610f2a931f48bf4",
                "namespace": "openshift-monitoring",
                "resourceVersion": "27543",
                "selfLink": "/api/v1/namespaces/openshift-monitoring/events/node-exporter-ldshn.1610f2a931f48bf4",
                "uid": "0ec33f81-5646-4a92-ac5c-18f0904a1640"
            },
*******************************************
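
A jq filter in the style of the clusteroperators.json query above pulls out these scheduling events (a sketch, untested against this exact artifact; FailedScheduling is the scheduler's standard reason for this message):

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/422/artifacts/launch/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-monitoring" and .reason == "FailedScheduling") | (.involvedObject.name // "-") + ": " + .message'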

CPU resources are limited here; with a larger instance type this error would not occur.
We also have an enhancement to reduce the CPU requests for monitoring: bug 1818806.
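
On a live cluster, the per-node request totals behind a message like this can be checked with a stock oc command, e.g.:

$ oc describe nodes | grep -A 8 'Allocated resources'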

Comment 3 W. Trevor King 2020-05-28 09:13:14 UTC
I'm going to close this as a dup of 1818806 then.

*** This bug has been marked as a duplicate of bug 1818806 ***

