Bug 2034367

Summary: OCP 4.9.X cluster not able to upgrade to latest 4.10 nightly or rc on Azure
Product: OpenShift Container Platform
Reporter: Mohit Sheth <msheth>
Component: Cluster Version Operator
Assignee: Over the Air Updates <aos-team-ota>
Status: CLOSED CURRENTRELEASE
QA Contact: Johnny Liu <jialiu>
Severity: low
Priority: unspecified
Version: 4.10
CC: aos-bugs, bleanhar, jack.ottofaro, lmohanty, rsevilla, wking
Target Milestone: ---
Target Release: ---
Keywords: Reopened
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Story Points: ---
Type: Bug
Last Closed: 2022-03-21 13:42:23 UTC

Description Mohit Sheth 2021-12-20 19:05:49 UTC
Description of problem:
The PerfScale team runs cluster-density with 4000 projects on a 250-node cluster and then upgrades the cluster to the latest nightly. The upgrade was unsuccessful because the cluster became unreachable during the upgrade. The stdout while performing the upgrade:
```
2021-12-20T04:50:56Z - INFO     - MainProcess - trigger_upgrade: info: An upgrade is in progress. Working towards 4.10.0-0.nightly-2021-12-18-034942: 598 of 765 done (78% complete)
2021-12-20T04:51:07Z - INFO     - MainProcess - trigger_upgrade: info: An upgrade is in progress. Unable to apply 4.10.0-0.nightly-2021-12-18-034942: an unknown error has occurred: MultipleErrors
2021-12-20T04:54:05Z - INFO     - MainProcess - trigger_upgrade: info: An upgrade is in progress. Working towards 4.10.0-0.nightly-2021-12-18-034942: 614 of 765 done (80% complete)
2021-12-20T05:41:43Z - WARNING  - MainProcess - trigger_upgrade: 504
Reason: Gateway Timeout

```
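
For reference, the progress messages above track the ClusterVersion Progressing condition. A minimal sketch of watching the same status directly, assuming ordinary `oc` access (not necessarily how trigger_upgrade collects it):
```
# Watch the same upgrade-progress message shown in the output above.
# Assumes oc is logged in to the cluster.
oc get clusterversion version \
  -o jsonpath='{.status.conditions[?(@.type=="Progressing")].message}{"\n"}'

# Or simply:
oc adm upgrade
```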
Version-Release number of selected component (if applicable):
4.10

How reproducible:
Always; I have tried it twice.

Steps to Reproduce:
1. Scale Azure cluster to 250 nodes
2. Run cluster-density with 4000 projects and do not clean the objects
3. Upgrade to the latest nightly
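
A rough command sketch of these steps; the exact tooling and flags the PerfScale team uses are assumptions here, not confirmed in this bug:
```
# 1. Scale Azure worker MachineSets toward 250 nodes (repeat per MachineSet;
#    names and replica counts are placeholders).
oc -n openshift-machine-api scale machineset <machineset-name> --replicas=<count>

# 2. Run cluster-density with 4000 iterations and leave the objects in place
#    (kube-burner invocation is illustrative; the real config lives in the
#    PerfScale tooling).
kube-burner init -c cluster-density.yml --uuid "$(uuidgen)"

# 3. Point the cluster at the target nightly by release image
#    (registry path is the usual CI location, assumed here).
oc adm upgrade --allow-explicit-upgrade --force \
  --to-image=registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-12-18-034942
```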

Actual results:
Upgrade fails

Expected results:
Upgrades to latest nightly

Comment 1 W. Trevor King 2021-12-20 19:40:05 UTC
[1] is a happy CI update from 4.9.11 to 4.10.0-0.nightly-2021-12-18-034942.  If you have issues with your update, we're going to need more information like a link to a must-gather or something, because one MultipleErrors aggregate reason is not all that much to go on.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade/1472052766292578304

Comment 3 Scott Dodson 2022-01-11 18:40:45 UTC
We need a must-gather. Given that it's been a month since this happened, please reproduce, obtain a must-gather, attach it to this bug, and re-open the bug.
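
For completeness, collecting and packaging a must-gather is typically something like the following (paths are illustrative):
```
# Collect a must-gather and pack it for attaching to the bug
# (directory name is just an example).
oc adm must-gather --dest-dir=./must-gather
tar czf must-gather.tar.gz must-gather/
```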

Comment 4 Mohit Sheth 2022-01-12 20:35:09 UTC
The cluster was unreachable at that point, so I couldn't grab a must-gather.
Are there any particular pod logs I could stream that would be the most useful for debugging this kind of issue?

Comment 5 Jack Ottofaro 2022-01-12 22:17:47 UTC
(In reply to Mohit Sheth from comment #4)
> The cluster was unreachable at that point, so couldn't grab a must-gather.
> Is there any particular pod logs that I could stream which would be the most
> useful to debug this kind issue?

We can start with the cluster-version operator pod log in the "openshift-cluster-version" namespace.
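
A minimal way to stream that log, assuming standard `oc` access and the usual deployment name:
```
# Stream the cluster-version operator pod log mentioned above.
oc -n openshift-cluster-version logs -f deployment/cluster-version-operator
```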

Comment 6 Mohit Sheth 2022-01-17 15:21:12 UTC
I have been able to reproduce the failure. The upgrade path was 4.9.15 -> 4.10.0-fc.1
cluster-version-operator pod logs http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/msheth/410-upgrades/410-azure-upgrades.tar.gz

Comment 7 W. Trevor King 2022-01-17 18:16:09 UTC
Those logs confirm, in more detail, comment 0's suggestion that the issues happen while the machine-config operator is attempting to update:

$ tar xOz az-upgrades/cluster-version-pod2.log <410-azure-upgrades.tar.gz | grep 'Result of work'
I0114 21:11:18.584180       1 task_graph.go:546] Result of work: [Cluster operator kube-controller-manager is updating versions Cluster operator kube-scheduler is updating versions]
I0114 21:18:44.456536       1 task_graph.go:546] Result of work: [Cluster operator machine-api is updating versions]
I0114 21:26:32.262660       1 task_graph.go:546] Result of work: [Cluster operator openshift-apiserver is updating versions Cluster operator openshift-controller-manager is updating versions]

Perhaps there is sufficient load on the control plane nodes that they fall over when we drain/reboot one and are briefly down to two nodes?  If so, I would expect to see some high-resource-consumption alerts going off before the update is initiated.  A must-gather from before the update kicks off might be useful; I think those contain information about firing alerts now...
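
A hedged sketch of that pre-update check, for whoever retries this; the pod and container names are the usual monitoring-stack defaults, and whether curl is available in the prometheus container is an assumption:
```
# Capture cluster state (including firing alerts) before starting the update.
oc adm must-gather --dest-dir=./pre-upgrade-must-gather

# Spot-check control-plane node load before kicking off the update.
oc adm top nodes

# List the alerts Prometheus currently has firing (names assumed, not verified
# against this cluster).
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
  curl -s http://localhost:9090/api/v1/alerts
```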

Comment 10 Mohit Sheth 2022-03-21 13:42:23 UTC
The upgrade from 4.9.25 to 4.10.5 was successful with 4k cluster-density projects. The upgrade took 2h21m.