Description of problem:

The PerfScale team runs cluster-density with 4000 projects on a 250-node cluster and then upgrades the cluster to the latest nightly. This was unsuccessful: the cluster became unreachable during the upgrade. Stdout while the upgrade was in progress:

```
2021-12-20T04:50:56Z - INFO - MainProcess - trigger_upgrade: info: An upgrade is in progress. Working towards 4.10.0-0.nightly-2021-12-18-034942: 598 of 765 done (78% complete)
2021-12-20T04:51:07Z - INFO - MainProcess - trigger_upgrade: info: An upgrade is in progress. Unable to apply 4.10.0-0.nightly-2021-12-18-034942: an unknown error has occurred: MultipleErrors
2021-12-20T04:54:05Z - INFO - MainProcess - trigger_upgrade: info: An upgrade is in progress. Working towards 4.10.0-0.nightly-2021-12-18-034942: 614 of 765 done (80% complete)
2021-12-20T05:41:43Z - WARNING - MainProcess - trigger_upgrade: 504 Reason: Gateway Timeout
```

Version-Release number of selected component (if applicable):
4.10

How reproducible:
Always; I have tried it twice.

Steps to Reproduce:
1. Scale an Azure cluster to 250 nodes
2. Run cluster-density with 4000 projects and do not clean up the objects
3. Upgrade to the latest nightly

Actual results:
Upgrade fails.

Expected results:
Upgrade to the latest nightly succeeds.
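For context, the trigger_upgrade messages above are the progress messages exposed on the ClusterVersion object. A rough sketch of how the update can be triggered and watched from the CLI; the release pullspec below is an illustrative assumption, not necessarily the exact image or tooling the PerfScale run used:

```
# Sketch only: trigger the update to a nightly by explicit pullspec and then
# poll the ClusterVersion "Progressing" condition message.
oc adm upgrade --allow-explicit-upgrade --force \
  --to-image=registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-12-18-034942

# The progress message that trigger_upgrade appears to be logging:
oc get clusterversion version \
  -o jsonpath='{.status.conditions[?(@.type=="Progressing")].message}'
```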
[1] is a happy CI update from 4.9.11 to 4.10.0-0.nightly-2021-12-18-034942. If you have issues with your update, we're going to need more information, like a link to a must-gather or something, because one MultipleErrors aggregate reason is not all that much to go on.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade/1472052766292578304
We need a must-gather. Given it's been a month since this happened, please reproduce, obtain a must-gather, attach it to this bug, and re-open the bug.
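For reference, assuming the cluster is reachable and you have cluster-admin credentials, a must-gather can be collected with something like:

```
# Collect a must-gather archive; --dest-dir is optional and shown here only
# to keep the output in one place for attaching to the bug.
oc adm must-gather --dest-dir=./must-gather
```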
The cluster was unreachable at that point, so I couldn't grab a must-gather. Are there any particular pod logs I could stream that would be the most useful for debugging this kind of issue?
(In reply to Mohit Sheth from comment #4)
> The cluster was unreachable at that point, so I couldn't grab a must-gather.
> Are there any particular pod logs I could stream that would be the most
> useful for debugging this kind of issue?

We can start with the cluster-version operator pod log in the "openshift-cluster-version" namespace.
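If it helps, a sketch of how that log can be streamed to a local file while the upgrade is running, so it survives the cluster becoming unreachable mid-upgrade (the deployment name below is the default in that namespace):

```
# Follow the cluster-version operator log and keep a local copy.
oc logs -f -n openshift-cluster-version deployment/cluster-version-operator > cvo.log
```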
I have been able to reproduce the failure. The upgrade path was 4.9.15 -> 4.10.0-fc.1.

cluster-version-operator pod logs: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/msheth/410-upgrades/410-azure-upgrades.tar.gz
Looking at those logs, they confirm in more detail comment 0's suggestion that the issues happen while the machine-config operator is attempting to update:

```
$ tar xOz az-upgrades/cluster-version-pod2.log <410-azure-upgrades.tar.gz | grep 'Result of work'
I0114 21:11:18.584180 1 task_graph.go:546] Result of work: [Cluster operator kube-controller-manager is updating versions Cluster operator kube-scheduler is updating versions]
I0114 21:18:44.456536 1 task_graph.go:546] Result of work: [Cluster operator machine-api is updating versions]
I0114 21:26:32.262660 1 task_graph.go:546] Result of work: [Cluster operator openshift-apiserver is updating versions Cluster operator openshift-controller-manager is updating versions]
```

Perhaps there is sufficient load on the control-plane nodes that they fall over when we drain/reboot one and are briefly down to two nodes? If so, I would expect to see some high-resource-consumption alerts going off before the update is initiated. A must-gather from before the update kicks off might be useful; I think those contain information about firing alerts now...
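A quick way to spot-check control-plane load before kicking off the update; this is a sketch assuming cluster-admin access and that cluster monitoring/metrics are running, and the master node-role label below is the 4.9/4.10 default:

```
# Control-plane CPU/memory usage and node readiness before the upgrade.
oc adm top nodes -l node-role.kubernetes.io/master=
oc get nodes -l node-role.kubernetes.io/master= -o wide
```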
Upgrade from 4.9.25 to 4.10.5 was successful with 4k cluster-density projects. The upgrade took 2h21m.