Bug 1820432 - Azure: Insufficient cpu: Failed 4.2 -> 4.3.10 with UpdatingnodeExporterFailed: Cluster operator monitoring is still updating
Keywords:
Status: CLOSED DUPLICATE of bug 1822770
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.3.z
Assignee: Abhinav Dahiya
QA Contact: Johnny Liu
URL:
Whiteboard:
Duplicates: 1807148
Depends On: 1812709
Blocks:
 
Reported: 2020-04-03 04:57 UTC by W. Trevor King
Modified: 2020-04-13 18:00 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-13 18:00:20 UTC
Target Upstream Version:
Embargoed:



Description W. Trevor King 2020-04-03 04:57:21 UTC
From 4.2.23 -> 4.3.10 [1] and 4.2.27 -> 4.3.10 [2].  Both died like:

  Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator monitoring is still updating

and had monitoring ClusterOperator conditions like [3,4]:

  - lastTransitionTime: "2020-04-01T23:15:16Z"
    message: 'Failed to rollout the stack. Error: running task Updating node-exporter
      failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object
      failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter
      is not ready. status: (desired: 6, updated: 5, ready: 5, unavailable: 1)'
    reason: UpdatingnodeExporterFailed
    status: "True"
    type: Degraded

Seems to be a CPU shortage:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/120/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-7b4b7938d33cc9f86d822cf6a4dbffb99444566ef93f45a313f600cad11cfee3/namespaces/openshift-monitoring/core/pods.yaml | yaml2json | jq -r '.items[] | select(.status.phase != "Running") | .metadata.name + " " + .status.conditions[].message'
node-exporter-mqj26 0/6 nodes are available: 1 Insufficient cpu, 5 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match node selector.
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/122/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-7b4b7938d33cc9f86d822cf6a4dbffb99444566ef93f45a313f600cad11cfee3/namespaces/openshift-monitoring/core/pods.yaml | yaml2json | jq -r '.items[] | select(.status.phase != "Running") | .metadata.name + " " + .status.conditions[].message'
node-exporter-8mkjw 0/6 nodes are available: 1 Insufficient cpu, 5 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match node selector.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/120
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/122
[3]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/122/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-7b4b7938d33cc9f86d822cf6a4dbffb99444566ef93f45a313f600cad11cfee3/cluster-scoped-resources/config.openshift.io/clusteroperators/monitoring.yaml
[4]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/120/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-7b4b7938d33cc9f86d822cf6a4dbffb99444566ef93f45a313f600cad11cfee3/cluster-scoped-resources/config.openshift.io/clusteroperators/monitoring.yaml
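
To narrow down which workloads are requesting the CPU on the node that can't fit node-exporter, a sketch along these lines over the same must-gather could help (assumes yaml2json and jq as in the commands above; the paths are the per-namespace pods.yaml files, and the millicore arithmetic is approximate):

  $ for f in namespaces/*/core/pods.yaml; do
      yaml2json < "$f" | jq -r '
        .items[]
        | select(.spec.nodeName != null)
        | .spec.nodeName as $node
        | .spec.containers[].resources.requests.cpu // "0"
        | "\($node) \(.)"'
    done | awk '
      { cpu = $2; sub(/m$/, "", cpu); if (cpu == $2) cpu *= 1000; sum[$1] += cpu }
      END { for (n in sum) printf "%s %dm requested\n", n, sum[n] }'

Comparing those per-node totals against each node's allocatable CPU should show whether the node reporting "Insufficient cpu" is genuinely packed or whether a single component is over-requesting.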

Comment 1 W. Trevor King 2020-04-03 04:59:03 UTC
Not really monitoring's fault that there's not enough CPU to go around.  Moving to the installer where default node sizes can be bumped (although not for existing 4.2 releases).  Would also be good to figure out who's consuming all the CPU.  And might be worth pulling edges to protect Azure clusters until we have a plan.
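
For the "who's consuming all the CPU" question, note that "Insufficient cpu" is driven by requests rather than actual usage, so the scheduler's view is easiest to read from the node itself. A couple of standard commands on a live cluster (nothing specific to this bug):

  $ oc describe node <node-name>   # the "Allocated resources" section shows CPU requests vs. allocatable
  $ oc adm top nodes               # actual usage, if cluster metrics are available

A node can trip "Insufficient cpu" even while its real CPU usage is low, if the sum of requests already reaches allocatable.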

Comment 2 Scott Dodson 2020-04-03 12:27:39 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1812709 related

Comment 3 W. Trevor King 2020-04-03 14:28:31 UTC
I'll just go ahead and make bug 1812709 a formal blocker, and this can be the 4.3.z backport.

Comment 4 Scott Dodson 2020-04-03 14:32:11 UTC
(In reply to W. Trevor King from comment #3)
> I'll just go ahead and make bug 1812709 a formal blocker, and this can be
> the 4.3.z backport.

Why don't we see this with 4.3 greenfield clusters as well? Did we change the Azure machine pool specs between 4.2 and 4.3?
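
For reference, if the answer ends up being "bump the worker size", the install-time knob is the machine pool in install-config.yaml. A minimal sketch, with an illustrative VM size rather than a claim about what the 4.2 or 4.3 default was (the size key for Azure is "type" under platform.azure; worth double-checking against the installer docs for the exact version):

  compute:
  - name: worker
    replicas: 3
    platform:
      azure:
        type: Standard_D4s_v3

That only helps greenfield installs, though; existing 4.2 clusters would need their worker MachineSets edited and machines replaced, or per-component CPU requests trimmed, before the 4.3 upgrade.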

Comment 5 Lalatendu Mohanty 2020-04-03 15:02:35 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug; it will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
  Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression?
  No, it’s always been like this, we just never noticed
  Yes, from 4.y.z to 4.y+1.z or from 4.y.z to 4.y.z+1

Comment 7 Scott Dodson 2020-04-03 20:06:12 UTC
Set 4.3.z if this is tracking the backport of the depends-on bug.

Comment 8 W. Trevor King 2020-04-06 21:00:43 UTC
Dropping the bug 1821291 series set up to backport cluster-network-operator#530, since that has its own 4.3.z bug 1821294.

Comment 9 Mrunal Patel 2020-04-07 14:46:07 UTC
*** Bug 1807148 has been marked as a duplicate of this bug. ***

Comment 10 Ryan Phillips 2020-04-08 18:31:58 UTC
*** Bug 1798224 has been marked as a duplicate of this bug. ***

Comment 11 Scott Dodson 2020-04-13 18:00:20 UTC

*** This bug has been marked as a duplicate of bug 1812583 ***

Comment 12 Scott Dodson 2020-04-13 18:00:50 UTC

*** This bug has been marked as a duplicate of bug 1822770 ***

