Bug 2034367

Summary: OCP 4.9.X cluster not able to upgrade to latest 4.10 nightly or rc on Azure
Product: OpenShift Container Platform
Reporter: Mohit Sheth <msheth>
Component: Cluster Version Operator
Assignee: Over the Air Updates <aos-team-ota>
Status: CLOSED CURRENTRELEASE
QA Contact: Johnny Liu <jialiu>
Severity: low
Priority: unspecified
Version: 4.10
CC: aos-bugs, bleanhar, jack.ottofaro, lmohanty, rsevilla, wking
Target Milestone: ---
Target Release: ---
Keywords: Reopened
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Story Points: ---
Type: Bug
Last Closed: 2022-03-21 13:42:23 UTC

Description Mohit Sheth 2021-12-20 19:05:49 UTC
Description of problem:
The PerfScale team runs cluster-density with 4000 projects on a 250-node cluster and then upgrades the cluster to the latest nightly. The upgrade was unsuccessful because the cluster became unreachable during the upgrade. The stdout while performing the upgrade:
```
2021-12-20T04:50:56Z - INFO     - MainProcess - trigger_upgrade: info: An upgrade is in progress. Working towards 4.10.0-0.nightly-2021-12-18-034942: 598 of 765 done (78% complete)
2021-12-20T04:51:07Z - INFO     - MainProcess - trigger_upgrade: info: An upgrade is in progress. Unable to apply 4.10.0-0.nightly-2021-12-18-034942: an unknown error has occurred: MultipleErrors
2021-12-20T04:54:05Z - INFO     - MainProcess - trigger_upgrade: info: An upgrade is in progress. Working towards 4.10.0-0.nightly-2021-12-18-034942: 614 of 765 done (80% complete)
2021-12-20T05:41:43Z - WARNING  - MainProcess - trigger_upgrade: 504
Reason: Gateway Timeout

```
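
For reference, the progress messages above track the ClusterVersion Progressing condition. A minimal sketch of watching the same status directly, assuming ordinary `oc` access (not necessarily how trigger_upgrade collects it):
```
# Watch the same upgrade-progress message shown in the output above.
# Assumes oc is logged in to the cluster.
oc get clusterversion version \
  -o jsonpath='{.status.conditions[?(@.type=="Progressing")].message}{"\n"}'

# Or simply:
oc adm upgrade
```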
Version-Release number of selected component (if applicable):
4.10

How reproducible:
Always; I have tried it twice.

Steps to Reproduce:
1. Scale Azure cluster to 250 nodes
2. Run cluster-density with 4000 projects and do not clean the objects
3. Upgrade to the latest nightly
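
A rough command sketch of these steps; the exact tooling and flags the PerfScale team uses are assumptions here, not confirmed in this bug:
```
# 1. Scale Azure worker MachineSets toward 250 nodes (repeat per MachineSet;
#    names and replica counts are placeholders).
oc -n openshift-machine-api scale machineset <machineset-name> --replicas=<count>

# 2. Run cluster-density with 4000 iterations and leave the objects in place
#    (kube-burner invocation is illustrative; the real config lives in the
#    PerfScale tooling).
kube-burner init -c cluster-density.yml --uuid "$(uuidgen)"

# 3. Point the cluster at the target nightly by release image
#    (registry path is the usual CI location, assumed here).
oc adm upgrade --allow-explicit-upgrade --force \
  --to-image=registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-12-18-034942
```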

Actual results:
Upgrade fails

Expected results:
Upgrades to latest nightly

Comment 1 W. Trevor King 2021-12-20 19:40:05 UTC
[1] is a happy CI update from 4.9.11 to 4.10.0-0.nightly-2021-12-18-034942.  If you have issues with your update, we're going to need more information like a link to a must-gather or something, because one MultipleErrors aggregate reason is not all that much to go on.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade/1472052766292578304

Comment 3 Scott Dodson 2022-01-11 18:40:45 UTC
We need a must-gather. Given that it's been a month since this happened, please reproduce, obtain a must-gather, attach it to this bug, and re-open the bug.
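
For completeness, collecting and packaging a must-gather is typically something like the following (paths are illustrative):
```
# Collect a must-gather and pack it for attaching to the bug
# (directory name is just an example).
oc adm must-gather --dest-dir=./must-gather
tar czf must-gather.tar.gz must-gather/
```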

Comment 4 Mohit Sheth 2022-01-12 20:35:09 UTC
The cluster was unreachable at that point, so I couldn't grab a must-gather.
Are there any particular pod logs I could stream that would be the most useful for debugging this kind of issue?

Comment 5 Jack Ottofaro 2022-01-12 22:17:47 UTC
(In reply to Mohit Sheth from comment #4)
> The cluster was unreachable at that point, so couldn't grab a must-gather.
> Is there any particular pod logs that I could stream which would be the most
> useful to debug this kind issue?

We can start with the cluster-version operator pod log in the "openshift-cluster-version" namespace.
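
A minimal way to stream that log, assuming standard `oc` access and the usual deployment name:
```
# Stream the cluster-version operator pod log mentioned above.
oc -n openshift-cluster-version logs -f deployment/cluster-version-operator
```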

Comment 6 Mohit Sheth 2022-01-17 15:21:12 UTC
I have been able to reproduce the failure. The upgrade path was 4.9.15 -> 4.10.0-fc.1
cluster-version-operator pod logs http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/msheth/410-upgrades/410-azure-upgrades.tar.gz

Comment 7 W. Trevor King 2022-01-17 18:16:09 UTC
Those logs confirm, in more detail, comment 0's suggestion that the issues happen while the machine-config operator is attempting to update:

$ tar xOz az-upgrades/cluster-version-pod2.log <410-azure-upgrades.tar.gz | grep 'Result of work'
I0114 21:11:18.584180       1 task_graph.go:546] Result of work: [Cluster operator kube-controller-manager is updating versions Cluster operator kube-scheduler is updating versions]
I0114 21:18:44.456536       1 task_graph.go:546] Result of work: [Cluster operator machine-api is updating versions]
I0114 21:26:32.262660       1 task_graph.go:546] Result of work: [Cluster operator openshift-apiserver is updating versions Cluster operator openshift-controller-manager is updating versions]

Perhaps there is sufficient load on the control plane nodes that they fall over when we drain/reboot one and are briefly down to two nodes?  If so, I would expect to see some high-resource-consumption alerts going off before the update is initiated.  A must-gather from before the update kicks off might be useful; I think those contain information about firing alerts now...
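
A hedged sketch of that pre-update check, for whoever retries this; the pod and container names are the usual monitoring-stack defaults, and whether curl is available in the prometheus container is an assumption:
```
# Capture cluster state (including firing alerts) before starting the update.
oc adm must-gather --dest-dir=./pre-upgrade-must-gather

# Spot-check control-plane node load before kicking off the update.
oc adm top nodes

# List the alerts Prometheus currently has firing (names assumed, not verified
# against this cluster).
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
  curl -s http://localhost:9090/api/v1/alerts
```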

Comment 10 Mohit Sheth 2022-03-21 13:42:23 UTC
The upgrade from 4.9.25 to 4.10.5 was successful with 4k cluster-density projects. The upgrade took 2h21m.