Bug 1820432

Summary: Azure: Insufficient cpu: Failed 4.2 -> 4.3.10 with UpdatingnodeExporterFailed: Cluster operator monitoring is still updating
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: Installer Assignee: Abhinav Dahiya <adahiya>
Installer sub component: openshift-installer QA Contact: Johnny Liu <jialiu>
Status: CLOSED DUPLICATE Docs Contact:
Severity: high    
Priority: medium CC: alegrand, anpicker, dougrt, erooth, kakkoyun, lcosic, lmohanty, mloibl, pkrupa, sdodson, surbania, vburton
Version: 4.3.z Keywords: Upgrades
Target Milestone: ---   
Target Release: 4.3.z   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2020-04-13 18:00:20 UTC Type: Bug
Bug Depends On: 1812709    
Bug Blocks:    

Description W. Trevor King 2020-04-03 04:57:21 UTC
Seen in updates from 4.2.23 -> 4.3.10 [1] and 4.2.27 -> 4.3.10 [2].  Both died like:

  Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator monitoring is still updating

and had monitoring ClusterOperator conditions like [3,4]:

  - lastTransitionTime: "2020-04-01T23:15:16Z"
    message: 'Failed to rollout the stack. Error: running task Updating node-exporter
      failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object
      failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter
      is not ready. status: (desired: 6, updated: 5, ready: 5, unavailable: 1)'
    reason: UpdatingnodeExporterFailed
    status: "True"
    type: Degraded
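
For triage, the counters buried in that condition message can be pulled out mechanically. The following is a minimal Python sketch, not part of the monitoring operator: the function name parse_rollout_status is hypothetical, and it assumes the message keeps the "status: (name: value, ...)" shape shown above.

```python
import re

# Matches the trailing fragment of the condition message, e.g.
# "status: (desired: 6, updated: 5, ready: 5, unavailable: 1)".
STATUS_RE = re.compile(r"status:\s*\(([^)]*)\)")


def parse_rollout_status(message):
    """Return the DaemonSet counters embedded in a condition message.

    Hypothetical helper; assumes the "status: (name: value, ...)" layout.
    """
    # YAML folding splits the message across lines, so collapse whitespace first.
    flat = " ".join(message.split())
    match = STATUS_RE.search(flat)
    if match is None:
        return {}
    counters = {}
    for field in match.group(1).split(","):
        name, _, value = field.partition(":")
        counters[name.strip()] = int(value.strip())
    return counters


message = (
    "Failed to rollout the stack. Error: running task Updating node-exporter "
    "failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object "
    "failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter "
    "is not ready. status: (desired: 6, updated: 5, ready: 5, unavailable: 1)"
)
counters = parse_rollout_status(message)
print(counters)
```

Here desired minus ready is 1, matching the single unavailable node-exporter pod.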

Seems to be a CPU shortage:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/120/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-7b4b7938d33cc9f86d822cf6a4dbffb99444566ef93f45a313f600cad11cfee3/namespaces/openshift-monitoring/core/pods.yaml | yaml2json | jq -r '.items[] | select(.status.phase != "Running") | .metadata.name + " " + .status.conditions[].message'
node-exporter-mqj26 0/6 nodes are available: 1 Insufficient cpu, 5 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match node selector.
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/122/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-7b4b7938d33cc9f86d822cf6a4dbffb99444566ef93f45a313f600cad11cfee3/namespaces/openshift-monitoring/core/pods.yaml | yaml2json | jq -r '.items[] | select(.status.phase != "Running") | .metadata.name + " " + .status.conditions[].message'
node-exporter-8mkjw 0/6 nodes are available: 1 Insufficient cpu, 5 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match node selector.
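
The same filter can be sketched in Python for anyone without yaml2json and jq handy. This is an approximation of the jq expression above, not a drop-in replacement. The function names are hypothetical, and the pod list is a trimmed, illustrative stand-in for pods.yaml: the Pending phase, the condition type, and the second pod are assumptions, and only the node-exporter-mqj26 message is taken from the output above.

```python
import re


def non_running_messages(pod_list):
    """Roughly mirror the jq filter: one "name message" line per condition
    of every pod whose phase is not Running."""
    lines = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Running":
            continue
        for condition in status.get("conditions", []):
            # jq treats null as the identity for "+", so a missing message adds nothing.
            lines.append(pod["metadata"]["name"] + " " + (condition.get("message") or ""))
    return lines


def parse_scheduler_message(message):
    """Split "0/6 nodes are available: 1 Insufficient cpu, 5 ..." into {reason: count}."""
    _, _, reasons = message.partition("nodes are available:")
    counts = {}
    for part in reasons.strip().rstrip(".").split(", "):
        match = re.match(r"(\d+)\s+(.*)", part)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


# Illustrative data only; the real pods.yaml carries many more fields.
pod_list = {
    "items": [
        {
            "metadata": {"name": "node-exporter-mqj26"},
            "status": {
                "phase": "Pending",  # assumed phase
                "conditions": [
                    {
                        "type": "PodScheduled",  # assumed condition type
                        "message": "0/6 nodes are available: 1 Insufficient cpu, "
                        "5 node(s) didn't have free ports for the requested pod ports, "
                        "5 node(s) didn't match node selector.",
                    }
                ],
            },
        },
        {
            "metadata": {"name": "example-running-pod"},  # hypothetical pod
            "status": {"phase": "Running", "conditions": [{"type": "Ready"}]},
        },
    ]
}

lines = non_running_messages(pod_list)
reasons = parse_scheduler_message(lines[0])
print(lines[0])
print(reasons)
```

Read this way, the message says five of the six nodes were never candidates for this DaemonSet pod (port conflict and node selector mismatch), and the one remaining node rejected it for insufficient CPU.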

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/120
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/122
[3]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/122/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-7b4b7938d33cc9f86d822cf6a4dbffb99444566ef93f45a313f600cad11cfee3/cluster-scoped-resources/config.openshift.io/clusteroperators/monitoring.yaml
[4]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/120/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-7b4b7938d33cc9f86d822cf6a4dbffb99444566ef93f45a313f600cad11cfee3/cluster-scoped-resources/config.openshift.io/clusteroperators/monitoring.yaml

Comment 1 W. Trevor King 2020-04-03 04:59:03 UTC
Not really monitoring's fault that there's not enough CPU to go around.  Moving to the installer where default node sizes can be bumped (although not for existing 4.2 releases).  Would also be good to figure out who's consuming all the CPU.  And might be worth pulling edges to protect Azure clusters until we have a plan.
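
One way to approach the "who's consuming all the CPU" question: the scheduler's "Insufficient cpu" is about CPU requests against node allocatable, not live usage, so summing container requests per namespace on the affected node is a reasonable first look. A minimal Python sketch under stated assumptions follows. The function names, the node name, and the sample pods are all hypothetical. It reads plain pod objects (for example from must-gather pods.yaml files converted to dicts) and ignores init containers and pod overhead.

```python
from collections import defaultdict


def cpu_millicores(quantity):
    """Convert a Kubernetes CPU quantity such as "100m" or "1.5" to millicores."""
    quantity = str(quantity).strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def requests_by_namespace(pods, node_name):
    """Sum container CPU requests per namespace for pods bound to node_name."""
    totals = defaultdict(int)
    for pod in pods:
        spec = pod.get("spec", {})
        if spec.get("nodeName") != node_name:
            continue
        for container in spec.get("containers", []):
            request = container.get("resources", {}).get("requests", {}).get("cpu")
            if request is not None:
                totals[pod["metadata"]["namespace"]] += cpu_millicores(request)
    return dict(totals)


# Hypothetical sample; names and request values are made up for illustration.
pods = [
    {
        "metadata": {"namespace": "ns-a", "name": "pod-1"},
        "spec": {
            "nodeName": "master-0",
            "containers": [
                {"resources": {"requests": {"cpu": "100m"}}},
                {"resources": {"requests": {"cpu": "0.5"}}},
            ],
        },
    },
    {
        "metadata": {"namespace": "ns-b", "name": "pod-2"},
        "spec": {
            "nodeName": "master-0",
            "containers": [{"resources": {"requests": {"cpu": "1"}}}],
        },
    },
    {
        "metadata": {"namespace": "ns-b", "name": "pod-3"},
        "spec": {
            "nodeName": "worker-0",
            "containers": [{"resources": {"requests": {"cpu": "2"}}}],
        },
    },
]

totals = requests_by_namespace(pods, "master-0")
# Largest requesters first.
for namespace, millicores in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{namespace}: {millicores}m")
```

Comparing these totals between a 4.2 and a 4.3 must-gather for the same node would show which namespaces grew their requests across the update.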

Comment 2 Scott Dodson 2020-04-03 12:27:39 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1812709 related

Comment 3 W. Trevor King 2020-04-03 14:28:31 UTC
I'll just go ahead and make bug 1812709 a formal blocker, and this can be the 4.3.z backport.

Comment 4 Scott Dodson 2020-04-03 14:32:11 UTC
(In reply to W. Trevor King from comment #3)
> I'll just go ahead and make bug 1812709 a formal blocker, and this can be
> the 4.3.z backport.

Why don't we see this with 4.3 greenfield clusters as well? Did we change Azure machine pool specs between 4.2 and 4.3?

Comment 5 Lalatendu Mohanty 2020-04-03 15:02:35 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
  Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or perform other non-standard admin activities
Is this a regression?
  No, it's always been like this; we just never noticed
  Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1

Comment 7 Scott Dodson 2020-04-03 20:06:12 UTC
Set 4.3.z if this is tracking the backport of the depends-on bug.

Comment 8 W. Trevor King 2020-04-06 21:00:43 UTC
Dropping the bug 1821291 series, which was set up to backport cluster-network-operator#530, since that has its own 4.3.z bug 1821294.

Comment 9 Mrunal Patel 2020-04-07 14:46:07 UTC
*** Bug 1807148 has been marked as a duplicate of this bug. ***

Comment 10 Ryan Phillips 2020-04-08 18:31:58 UTC
*** Bug 1798224 has been marked as a duplicate of this bug. ***

Comment 11 Scott Dodson 2020-04-13 18:00:20 UTC

*** This bug has been marked as a duplicate of bug 1812583 ***

Comment 12 Scott Dodson 2020-04-13 18:00:50 UTC

*** This bug has been marked as a duplicate of bug 1822770 ***