From 4.2.23 -> 4.3.10 [1] and 4.2.27 -> 4.3.10 [2].  Both died like:

  Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator monitoring is still updating

and had monitoring ClusterOperator conditions like [3,4]:

  - lastTransitionTime: "2020-04-01T23:15:16Z"
    message: 'Failed to rollout the stack. Error: running task Updating node-exporter failed:
      reconciling node-exporter DaemonSet failed: updating DaemonSet object failed:
      waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready.
      status: (desired: 6, updated: 5, ready: 5, unavailable: 1)'
    reason: UpdatingnodeExporterFailed
    status: "True"
    type: Degraded

Seems to be a CPU shortage:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/120/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-7b4b7938d33cc9f86d822cf6a4dbffb99444566ef93f45a313f600cad11cfee3/namespaces/openshift-monitoring/core/pods.yaml | yaml2json | jq -r '.items[] | select(.status.phase != "Running") | .metadata.name + " " + .status.conditions[].message'
node-exporter-mqj26 0/6 nodes are available: 1 Insufficient cpu, 5 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match node selector.

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/122/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-7b4b7938d33cc9f86d822cf6a4dbffb99444566ef93f45a313f600cad11cfee3/namespaces/openshift-monitoring/core/pods.yaml | yaml2json | jq -r '.items[] | select(.status.phase != "Running") | .metadata.name + " " + .status.conditions[].message'
node-exporter-8mkjw 0/6 nodes are available: 1 Insufficient cpu, 5 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match node selector.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/120
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/122
[3]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/122/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-7b4b7938d33cc9f86d822cf6a4dbffb99444566ef93f45a313f600cad11cfee3/cluster-scoped-resources/config.openshift.io/clusteroperators/monitoring.yaml
[4]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/120/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-7b4b7938d33cc9f86d822cf6a4dbffb99444566ef93f45a313f600cad11cfee3/cluster-scoped-resources/config.openshift.io/clusteroperators/monitoring.yaml
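For anyone poking at a live cluster instead of a must-gather, something like the following should surface the same information (rough sketch, not verified against these CI runs; the jq filter mirrors the one above, and the 'Allocated resources' grep just eyeballs how close each node's CPU requests are to allocatable):

$ oc get pods -n openshift-monitoring --field-selector=status.phase!=Running -o json | jq -r '.items[] | .metadata.name + " " + (.status.conditions[].message // "")'
$ for node in $(oc get nodes -o name); do echo "== ${node}"; oc describe "${node}" | grep -A 6 'Allocated resources'; done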
Not really monitoring's fault that there's not enough CPU to go around.  Moving to the installer, where default node sizes can be bumped (although not for existing 4.2 releases).  It would also be good to figure out who's consuming all the CPU, and it might be worth pulling update edges to protect Azure clusters until we have a plan.
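For new installs, folks can already bump node sizes per cluster via the install-config machine pools; a minimal excerpt might look like the following (Standard_D4s_v3 / Standard_D8s_v3 are just example sizes, not recommendations, and the actual defaults live in the installer):

  compute:
  - name: worker
    replicas: 3
    platform:
      azure:
        type: Standard_D4s_v3
  controlPlane:
    name: master
    platform:
      azure:
        type: Standard_D8s_v3

That only helps greenfield installs, of course; it doesn't change anything for clusters already running on the 4.2 defaults.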
https://bugzilla.redhat.com/show_bug.cgi?id=1812709 related
I'll just go ahead and make bug 1812709 a formal blocker, and this can be the 4.3.z backport.
(In reply to W. Trevor King from comment #3)
> I'll just go ahead and make bug 1812709 a formal blocker, and this can be
> the 4.3.z backport.

Why don't we see this with 4.3 greenfield clusters as well?  Did we change Azure machine pool specs between 4.2 and 4.3?
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z.  The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way.  Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug.  It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
  Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non-standard admin activities
Is this a regression?
  No, it's always been like this, we just never noticed
  Yes, from 4.y.z to 4.y+1.z, or 4.y.z to 4.y.z+1
Set the target to 4.3.z if this is tracking the backport of the depends-on bug.
Dropping the bug 1821291 series set up to backport cluster-network-operator#530, since that has its own 4.3.z bug 1821294.
*** Bug 1807148 has been marked as a duplicate of this bug. ***
*** Bug 1798224 has been marked as a duplicate of this bug. ***
*** This bug has been marked as a duplicate of bug 1812583 ***
*** This bug has been marked as a duplicate of bug 1822770 ***