4.3.10 -> 4.3.22 update CI job [1] hung on monitoring for at least an hour:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/422/artifacts/launch/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "monitoring") | .status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' | sort
2020-05-21T05:08:40Z Available=False -: -
2020-05-21T05:08:40Z Degraded=True UpdatingnodeExporterFailed: Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 6, updated: 3, ready: 5, unavailable: 1)
2020-05-21T06:10:14Z Progressing=True RollOutInProgress: Rolling out the stack.
2020-05-21T06:10:14Z Upgradeable=True RollOutInProgress: Rollout of the monitoring stack is in progress. Please wait until it finishes.

A few issues:

* Going Degraded on a stuck DaemonSet rollout may be a DaemonSet controller bug; for example, see bug 1790989. This UpdatingnodeExporterFailed reason might be bug 1765064, but now we have CI's must-gather to work with.
* Going Available=False without a reason and message is bad. We should explain the outage to the cluster admin.
* "UpdatingnodeExporterFailed" should probably be "UpdatingNodeExporterFailed" (capitalizing the N).
* Setting RollOutInProgress for Upgradeable, as discussed in bug 1837832 and now tracked in Jira.

This bug should be about the first of those, but RFE Jiras for the middle two would make sense to me too.

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/422
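For cross-checking, the counters in that Degraded message are relayed straight from the DaemonSet's status. The CI cluster is gone, so this is just a sketch assuming oc access to a live cluster; the jsonpath fields are standard apps/v1 DaemonSetStatus properties:

$ oc -n openshift-monitoring get daemonset node-exporter -o jsonpath='desired: {.status.desiredNumberScheduled}, updated: {.status.updatedNumberScheduled}, ready: {.status.numberReady}, unavailable: {.status.numberUnavailable}{"\n"}'

For this cluster that would presumably have printed the same "desired: 6, updated: 3, ready: 5, unavailable: 1" that the operator embedded in its condition message.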
We are aware of this, and Lili already has a PoC to fix this issue [1]. Since this is not a bug but an RFE, and we already have 2 JIRA issues for this [2][3], I am closing this report.

[1]: https://github.com/lilic/cluster-monitoring-operator/commit/9f8a60afc2d4b37840c201762c2e583948ccf9d4
[2]: https://issues.redhat.com/browse/MON-1139
[3]: https://issues.redhat.com/browse/MON-1126
from https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/422/artifacts/launch/events.json:

*******************************************
        "lastTimestamp": null,
        "message": "0/6 nodes are available: 2 Insufficient cpu, 5 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match node selector.",
        "metadata": {
            "creationTimestamp": "2020-05-21T05:04:36Z",
            "name": "node-exporter-ldshn.1610f2a931f48bf4",
            "namespace": "openshift-monitoring",
            "resourceVersion": "27543",
            "selfLink": "/api/v1/namespaces/openshift-monitoring/events/node-exporter-ldshn.1610f2a931f48bf4",
            "uid": "0ec33f81-5646-4a92-ac5c-18f0904a1640"
        },
*******************************************

CPU on these nodes is limited; using a larger instance type would avoid this error. We also have an enhancement open to reduce the CPU requests for monitoring: bug 1818806
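For anyone re-checking this from the CI artifacts, all of the scheduler's complaints for the namespace can be pulled out of the same events.json in one pass (a sketch; it assumes these events carry kube-scheduler's usual FailedScheduling reason):

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/422/artifacts/launch/events.json | jq -r '.items[] | select(.reason == "FailedScheduling" and .metadata.namespace == "openshift-monitoring") | .involvedObject.name + ": " + .message' | sort -u

The "didn't have free ports" entries are presumably a side effect of the rollout itself: node-exporter requests a host port, so the replacement pod cannot schedule onto a node until the old pod there releases the port, while the nodes that do have the port free fail on CPU.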
I'm going to close this as a dup of bug 1818806 then.

*** This bug has been marked as a duplicate of bug 1818806 ***