Bug 1840885 - 4.3->4.3 update stuck on UpdatingnodeExporterFailed for at least an hour
Summary: 4.3->4.3 update stuck on UpdatingnodeExporterFailed for at least an hour
Keywords:
Status: CLOSED DUPLICATE of bug 1818806
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-27 19:20 UTC by W. Trevor King
Modified: 2020-05-28 09:13 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-28 08:46:30 UTC
Target Upstream Version:
Embargoed:



Description W. Trevor King 2020-05-27 19:20:44 UTC
4.3.10 -> 4.3.22 update CI job [1] hung on monitoring for at least an hour:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/422/artifacts/launch/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "monitoring") | .status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' | sort
2020-05-21T05:08:40Z Available=False -: -
2020-05-21T05:08:40Z Degraded=True UpdatingnodeExporterFailed: Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 6, updated: 3, ready: 5, unavailable: 1)
2020-05-21T06:10:14Z Progressing=True RollOutInProgress: Rolling out the stack.
2020-05-21T06:10:14Z Upgradeable=True RollOutInProgress: Rollout of the monitoring stack is in progress. Please wait until it finishes.
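
For anyone debugging a live cluster rather than CI artifacts, standard oc commands along these lines (not taken from this job; the pod label selector may differ by release) show the same rollout numbers directly:

$ oc -n openshift-monitoring get daemonset node-exporter
$ oc -n openshift-monitoring get pods -l app=node-exporter -o wide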

A few issues:

* Going degraded on a DaemonSet may be a DaemonSet controller bug.  For example, see bug 1790989.  This UpdatingnodeExporterFailed reason might be bug 1765064, but now we have CI's must-gather to work with.
* Going Available=False without a reason and message is bad.  We should explain the outage to the cluster admin (see the example after this list).
* "UpdatingnodeExporterFailed" should probably be "UpdatingNodeExporterFailed" (capitalizing the N).
* Setting RollOutInProgress as the reason for Upgradeable is its own issue, as discussed in bug 1837832 and now tracked in Jira.

This bug should be about the first of those, but RFE Jiras for the middle two would make sense to me too.
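
As a hypothetical example for the Available=False point, even reusing the Degraded wording would be an improvement over the bare condition above:

2020-05-21T05:08:40Z Available=False UpdatingNodeExporterFailed: Rollout of the monitoring stack is blocked: daemonset node-exporter is not ready. status: (desired: 6, updated: 3, ready: 5, unavailable: 1)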

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/422

Comment 1 Pawel Krupa 2020-05-28 08:46:30 UTC
We are aware of this, and Lili already has a PoC to fix this issue [1].

Since this is not a bug but an RFE, and we already have two JIRA issues for it [2][3], I am closing this report.

[1]: https://github.com/lilic/cluster-monitoring-operator/commit/9f8a60afc2d4b37840c201762c2e583948ccf9d4
[2]: https://issues.redhat.com/browse/MON-1139
[3]: https://issues.redhat.com/browse/MON-1126

Comment 2 Junqi Zhao 2020-05-28 08:58:59 UTC
From https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/422/artifacts/launch/events.json:
*******************************************
            "lastTimestamp": null,
            "message": "0/6 nodes are available: 2 Insufficient cpu, 5 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match node selector.",
            "metadata": {
                "creationTimestamp": "2020-05-21T05:04:36Z",
                "name": "node-exporter-ldshn.1610f2a931f48bf4",
                "namespace": "openshift-monitoring",
                "resourceVersion": "27543",
                "selfLink": "/api/v1/namespaces/openshift-monitoring/events/node-exporter-ldshn.1610f2a931f48bf4",
                "uid": "0ec33f81-5646-4a92-ac5c-18f0904a1640"
            },
*******************************************
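
A jq filter in the style of the clusteroperators.json query above pulls out these scheduling events (a sketch, untested against this exact artifact; FailedScheduling is the scheduler's standard reason for this message):

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/422/artifacts/launch/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-monitoring" and .reason == "FailedScheduling") | (.involvedObject.name // "-") + ": " + .message'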

CPU resources are limited here; with a larger instance type this error would not occur.
We also have an enhancement to reduce the CPU requests for monitoring: bug 1818806.
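
On a live cluster, the per-node request totals behind a message like this can be checked with a stock oc command, e.g.:

$ oc describe nodes | grep -A 8 'Allocated resources'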

Comment 3 W. Trevor King 2020-05-28 09:13:14 UTC
I'm going to close this as a dup of 1818806 then.

*** This bug has been marked as a duplicate of bug 1818806 ***

