Bug 2034367
Summary: | OCP 4.9.X cluster not able to upgrade to latest 4.10 nightly or rc on Azure | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Mohit Sheth <msheth>
Component: | Cluster Version Operator | Assignee: | Over the Air Updates <aos-team-ota>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Johnny Liu <jialiu>
Severity: | low | Docs Contact: |
Priority: | unspecified | |
Version: | 4.10 | CC: | aos-bugs, bleanhar, jack.ottofaro, lmohanty, rsevilla, wking
Target Milestone: | --- | Keywords: | Reopened
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-03-21 13:42:23 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Mohit Sheth, 2021-12-20 19:05:49 UTC
[1] is a happy CI update from 4.9.11 to 4.10.0-0.nightly-2021-12-18-034942. If you have issues with your update, we're going to need more information, like a link to a must-gather, because one MultipleErrors aggregate reason is not all that much to go on.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade/1472052766292578304

Need to have a must-gather. Given it's been a month since this happened, please reproduce, obtain a must-gather, attach it to this bug, and re-open the bug.

The cluster was unreachable at that point, so I couldn't grab a must-gather. Are there any particular pod logs that I could stream that would be the most useful for debugging this kind of issue?

(In reply to Mohit Sheth from comment #4)
> The cluster was unreachable at that point, so couldn't grab a must-gather.
> Is there any particular pod logs that I could stream which would be the most
> useful to debug this kind issue?

We can start with the cluster-version operator pod log in the "openshift-cluster-version" namespace.

I have been able to reproduce the failure.
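The log-streaming suggestion above can be followed with the standard client; a minimal sketch, assuming the default namespace and deployment name and an already authenticated `oc` session:

```shell
# Hedged sketch: follow the cluster-version operator log live during an
# update. The namespace and deployment name below are the defaults on a
# standard OCP install; adjust if your cluster differs.
follow_cvo_log() {
  oc -n openshift-cluster-version logs deployment/cluster-version-operator -f
}

# On a machine with a logged-in oc client you would run:
#   follow_cvo_log > cvo.log
# and keep the stream going through the upgrade so the log survives even
# if the cluster becomes unreachable afterwards.
```

Redirecting the stream to a local file is the point here: it preserves the CVO's view of the upgrade even when the API server later stops answering and a must-gather is no longer possible.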
The upgrade path was 4.9.15 -> 4.10.0-fc.1.

cluster-version-operator pod logs: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/msheth/410-upgrades/410-azure-upgrades.tar.gz

Looking at those logs, they confirm in more detail comment 0's suggestion that the issues happen while the machine-config operator is attempting to update:

$ tar xOz az-upgrades/cluster-version-pod2.log <410-azure-upgrades.tar.gz | grep 'Result of work'
I0114 21:11:18.584180 1 task_graph.go:546] Result of work: [Cluster operator kube-controller-manager is updating versions Cluster operator kube-scheduler is updating versions]
I0114 21:18:44.456536 1 task_graph.go:546] Result of work: [Cluster operator machine-api is updating versions]
I0114 21:26:32.262660 1 task_graph.go:546] Result of work: [Cluster operator openshift-apiserver is updating versions Cluster operator openshift-controller-manager is updating versions]

Perhaps there is sufficient load on the control-plane nodes that they fall over when we drain/reboot one and are briefly down to two nodes? If so, I would expect to see some high-resource-consumption alerts firing before the update is initiated. A must-gather from before the update kicks off might be useful; I think those now contain information about firing alerts.

Upgrade from 4.9.25 to 4.10.5 was successful with 4k cluster-density projects. The upgrade took 2h21m.
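For reference, the `grep 'Result of work'` filter used in the log analysis above can be exercised locally. This sketch runs the same filter against a fabricated two-entry sample; against the real attachment you would pipe `tar xOz az-upgrades/cluster-version-pod2.log <410-azure-upgrades.tar.gz` into the same grep:

```shell
# Hedged sketch: isolate the CVO's per-sync "Result of work" summaries
# from a pod log. The sample lines below are fabricated for illustration;
# only the grep pattern comes from the analysis above.
cat > /tmp/cvo-sample.log <<'EOF'
I0114 21:11:18.584180 1 task_graph.go:546] Result of work: [Cluster operator kube-controller-manager is updating versions]
I0114 21:12:00.000000 1 sync_worker.go:100] unrelated progress line
I0114 21:18:44.456536 1 task_graph.go:546] Result of work: [Cluster operator machine-api is updating versions]
EOF

# Count the sync-cycle summaries (prints 2 for this sample).
grep -c 'Result of work' /tmp/cvo-sample.log
```

Filtering to these summary lines is a quick way to see which cluster operators the CVO was still pushing when the upgrade stalled, without reading the full log.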