Bug 2082962
| Summary: | Upgrade Fails on Cluster that was Deployed with Day1 NetworkConfig | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Adina Wolff <awolff> |
| Component: | Bare Metal Hardware Provisioning | Assignee: | Jacob Anders <janders> |
| Bare Metal Hardware Provisioning sub component: | cluster-baremetal-operator | QA Contact: | Amit Ugol <augol> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | aos-bugs, janders, shardy, yliu1 |
| Version: | 4.11 | Keywords: | Triaged |
| Target Milestone: | --- | | |
| Target Release: | 4.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-06-14 14:07:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Adina Wolff
2022-05-08 18:56:38 UTC
Triage notes: @Adina - do you know if this is specific to this version combination, or will it happen when upgrading from any 4.10 build to any 4.11 build (e.g. the current latest ones) as long as static IPs are used on the 4.10 cluster? This will help us correctly assess the severity and whether this is likely to be a blocker BZ. Setting priority/severity to high, as there is a possibility this may break upgrades.

@janders I don't remember if I've tried other versions, but I have no reason to assume the issue is limited to these specific versions.

@awolff noted. I will start a thread on Slack to work out the details. I haven't looked at the logs, but this may be worth reproducing.

As discussed, I re-ran this scenario with later versions. The same results are seen.

from version: 4.10.0-0.nightly-2022-06-06-184250
to: 4.11.0-0.nightly-2022-06-06-025509

[kni@provisionhost-0-0 ~]$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.11.0-0.nightly-2022-06-06-025509: 104 of 802 done (12% complete)
Upgradeable=False
  Reason: PoolUpdating
  Message: Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details
Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.10
warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.10&id=31935b5d-04a7-4d1f-805f-826fbddbfd7b&version=4.11.0-0.nightly-2022-06-06-025509": dial tcp 52.204.165.161:443: connect: connection refused

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-06-06-184250   True        True          98m     Working towards 4.11.0-0.nightly-2022-06-06-025509: 104 of 802 done (12% complete)

[kni@provisionhost-0-0 ~]$ oc get nodes
NAME                                              STATUS                        ROLES    AGE   VERSION
master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   Ready                         master   17h   v1.23.5+3afdacb
master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   Ready                         master   17h   v1.23.5+3afdacb
master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   NotReady,SchedulingDisabled   master   17h   v1.23.5+3afdacb
worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   NotReady,SchedulingDisabled   worker   16h   v1.23.5+3afdacb
worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   Ready                         worker   16h   v1.23.5+3afdacb
[kni@provisionhost-0-0 ~]$

Thank you @awolff. Noting that the failure happened between the current latest 4.10 and 4.11 nightlies (as of 8 June 2022) in addition to the versions originally reported against, which adds to the severity. Will discuss with the team.
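When comparing runs like the one above, it can help to pull the affected nodes out of the `oc get nodes` listing programmatically. The following is only a minimal sketch and not part of the original report: the function name `not_ready_nodes` is hypothetical, and it assumes the default column layout (NAME, STATUS, ROLES, AGE, VERSION) with whitespace-separated columns.

```python
from typing import List


def not_ready_nodes(oc_get_nodes_output: str) -> List[str]:
    """Return names of nodes whose STATUS column contains NotReady.

    Assumption: the input is the default `oc get nodes` table, i.e. a
    header row followed by whitespace-separated columns.
    """
    names = []
    for line in oc_get_nodes_output.strip().splitlines()[1:]:  # skip header row
        fields = line.split()
        if len(fields) < 2:
            continue  # ignore blank or truncated lines
        name, status = fields[0], fields[1]
        # STATUS can be a comma-separated list, e.g. "NotReady,SchedulingDisabled"
        if "NotReady" in status.split(","):
            names.append(name)
    return names


# Node listing copied from the 4.10 nightly to 4.11 nightly run in this report.
SAMPLE = """\
NAME STATUS ROLES AGE VERSION
master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com Ready master 17h v1.23.5+3afdacb
master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com Ready master 17h v1.23.5+3afdacb
master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com NotReady,SchedulingDisabled master 17h v1.23.5+3afdacb
worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com NotReady,SchedulingDisabled worker 16h v1.23.5+3afdacb
worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com Ready worker 16h v1.23.5+3afdacb
"""

if __name__ == "__main__":
    for node in not_ready_nodes(SAMPLE):
        print(node)
```

Splitting the STATUS value on commas avoids treating a hypothetical status that merely contains the substring as a match, and it keeps `Ready` nodes from being confused with `NotReady` ones.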
must-gather from latest failure: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/must-gather.local.144738023270642634.tar.gz

This was reproduced with an update from 4.10.13 to 4.10.17:

[kni@provisionhost-0-0 ~]$ oc get nodes
NAME                                              STATUS                        ROLES    AGE     VERSION
master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   NotReady,SchedulingDisabled   master   3h37m   v1.23.5+b463d71
master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   Ready                         master   3h33m   v1.23.5+b463d71
master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   Ready                         master   3h33m   v1.23.5+b463d71
worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   NotReady,SchedulingDisabled   worker   3h6m    v1.23.5+b463d71
worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   Ready                         worker   3h6m    v1.23.5+b463d71

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.13   True        True          147m    Working towards 4.10.17: 95 of 771 done (12% complete)

[kni@provisionhost-0-0 ~]$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.10.17: 95 of 771 done (12% complete)
Upgradeable=False
  Reason: PoolUpdating
  Message: Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details
Upstream: http://192.168.123.68:80/yp-graph
Channel: stable-4.10
No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and may result in downtime or data loss.
[kni@provisionhost-0-0 ~]$

Looking at the logs:
cluster-scoped-resources/config.openshift.io/clusteroperators.yaml:
At least one master appears to be down (in line with the reports), along with ingress (which may be a consequence rather than the cause):
kind: ClusterOperator
(...)
  name: etcd
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: ac48dfab-714f-4645-a314-40b6a3919adc
  resourceVersion: "105801"
  uid: 5614968d-7e26-4ec9-9ec2-29ffacd16b44
spec: {}
status:
  conditions:
  - lastTransitionTime: "2022-05-08T14:23:16Z"
    reason: ControllerStarted
    status: Unknown
    type: RecentBackup
  - lastTransitionTime: "2022-05-08T17:52:46Z"
    message: |-
      ClusterMemberControllerDegraded: unhealthy members found during reconciling members
      DefragControllerDegraded: cluster is unhealthy: 2 of 3 members are available, master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com is unhealthy
      EtcdMembersDegraded: 2 of 3 members are available, master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com is unhealthy
      NodeControllerDegraded: The master nodes not ready: node "master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com" not ready since 2022-05-08 17:50:57 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
    reason: ClusterMemberController_SyncError::DefragController_Error::EtcdMembers_UnhealthyMembers::NodeController_MasterNodesReady
    status: "True"
    type: Degraded
  - lastTransitionTime: "2022-05-08T17:01:38Z"
    message: |-
      NodeInstallerProgressing: 3 nodes are at revision 9
      EtcdMembersProgressing: No unstarted etcd members found
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-05-08T14:28:30Z"
    message: |-
      StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 9
      EtcdMembersAvailable: 2 of 3 members are available, master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com is unhealthy
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2022-05-08T14:23:16Z"
    message: All is well
    reason: AsExpected
    status: "True"
    type: Upgradeable
(...)
  name: ingress
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: ac48dfab-714f-4645-a314-40b6a3919adc
  resourceVersion: "107450"
  uid: 65bc52b9-e3bd-4a78-8da8-a5ce98995f57
spec: {}
status:
  conditions:
  - lastTransitionTime: "2022-05-08T14:39:13Z"
    message: The "default" ingress controller reports Available=True.
    reason: IngressAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2022-05-08T17:21:08Z"
    message: desired and current number of IngressControllers are equal
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-05-08T17:56:45Z"
    message: 'The "default" ingress controller reports Degraded=True: DegradedConditions:
      One or more other status conditions indicate a degraded state: PodsScheduled=False
      (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-86cc7db48b-88sxc"
      cannot be scheduled: 0/5 nodes are available: 1 node(s) didn''t have free
      ports for the requested pod ports, 2 node(s) had taint {node-role.kubernetes.io/master:
      }, that the pod didn''t tolerate, 2 node(s) were unschedulable. Make sure
      you have sufficient worker nodes.)'
    reason: IngressDegraded
    status: "True"
    type: Degraded
  - lastTransitionTime: "2022-05-08T14:23:40Z"
    reason: IngressControllersUpgradeable
    status: "True"
    type: Upgradeable
(...)
  name: kube-apiserver
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: ac48dfab-714f-4645-a314-40b6a3919adc
  resourceVersion: "105902"
  uid: 856cab6c-c920-418f-818d-1fda4d796de1
spec: {}
status:
  conditions:
  - lastTransitionTime: "2022-05-08T17:52:57Z"
    message: 'NodeControllerDegraded: The master nodes not ready: node "master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com"
      not ready since 2022-05-08 17:50:57 +0000 UTC because NodeStatusUnknown (Kubelet
      stopped posting node status.)'
    reason: NodeController_MasterNodesReady
    status: "True"
    type: Degraded
  - lastTransitionTime: "2022-05-08T17:06:39Z"
    message: 'NodeInstallerProgressing: 3 nodes are at revision 9'
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-05-08T14:33:23Z"
    message: 'StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 9'
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2022-05-08T14:23:29Z"
    message: 'KubeletMinorVersionUpgradeable: Kubelet and API server minor versions
      are synced.'
    reason: AsExpected
    status: "True"
    type: Upgradeable
Some errors point to machine-config:
  name: machine-config
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: ac48dfab-714f-4645-a314-40b6a3919adc
  resourceVersion: "120692"
  uid: bb575edc-d8c2-4d4a-ab0f-50b6db480b11
spec: {}
status:
  conditions:
  - lastTransitionTime: "2022-05-08T17:42:38Z"
    message: Working towards 4.11.0-0.ci-2022-05-07-211700
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-05-08T17:46:33Z"
    message: One or more machine config pools are updating, please see `oc get mcp`
      for further details
    reason: PoolUpdating
    status: "False"
    type: Upgradeable
  - lastTransitionTime: "2022-05-08T17:59:49Z"
    message: 'Unable to apply 4.11.0-0.ci-2022-05-07-211700: failed to apply machine
      config daemon manifests: error during waitForDaemonsetRollout: [timed out
      waiting for the condition, daemonset machine-config-daemon is not ready. status:
      (desired: 5, updated: 5, ready: 3, unavailable: 2)]'
    reason: MachineConfigDaemonFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2022-05-08T18:09:51Z"
    message: Cluster not available for [{operator 4.10.0-0.nightly-2022-05-07-205137}]
    status: "False"
    type: Available
(...)
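The machine-config Degraded message above reports 2 unavailable machine-config-daemon pods out of 5, which lines up with the two NotReady nodes in the node listings. As a rough sketch (not from the original report), the counts can be extracted from such a message with a regular expression. The helper name `rollout_counts` is hypothetical, and the pattern is an assumption based only on the message text as it appears in this bug; the operator may word it differently in other versions.

```python
import re
from typing import Dict

# Pattern assumed from the message quoted in this report; not a documented format.
_STATUS_RE = re.compile(
    r"\(desired: (\d+), updated: (\d+), ready: (\d+), unavailable: (\d+)\)"
)


def rollout_counts(message: str) -> Dict[str, int]:
    """Extract daemonset rollout counters from a Degraded condition message."""
    # Folded YAML scalars may contain line breaks; collapse all whitespace first.
    flat = " ".join(message.split())
    match = _STATUS_RE.search(flat)
    if match is None:
        raise ValueError("no daemonset status found in message")
    keys = ("desired", "updated", "ready", "unavailable")
    return dict(zip(keys, (int(g) for g in match.groups())))


# Message copied from the machine-config ClusterOperator excerpt, line breaks included.
MESSAGE = """Unable to apply 4.11.0-0.ci-2022-05-07-211700: failed to apply machine
config daemon manifests: error during waitForDaemonsetRollout: [timed out
waiting for the condition, daemonset machine-config-daemon is not ready. status:
(desired: 5, updated: 5, ready: 3, unavailable: 2)]"""

if __name__ == "__main__":
    print(rollout_counts(MESSAGE))
```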
Prometheus is upset too (most certainly a consequence):
  name: monitoring
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: ac48dfab-714f-4645-a314-40b6a3919adc
  resourceVersion: "119949"
  uid: 89f0aca6-6110-4511-9594-d7d979047156
spec: {}
status:
  conditions:
  - lastTransitionTime: "2022-05-08T14:45:03Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2022-05-08T18:02:23Z"
    message: Rollout of the monitoring stack failed and is degraded. Please investigate
      the degraded status error.
    reason: UpdatingPrometheusOperatorFailed
    status: "False"
    type: Available
  - lastTransitionTime: "2022-05-08T18:37:24Z"
    message: Rolling out the stack.
    reason: RollOutInProgress
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-05-08T18:02:23Z"
    message: 'Failed to rollout the stack. Error: updating prometheus operator:
      reconciling Prometheus Operator Admission Webhook Deployment failed: updating
      Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook:
      got 1 unavailable replicas'
    reason: UpdatingPrometheusOperatorFailed
    status: "True"
    type: Degraded
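Reading the ClusterOperator conditions by hand, as done above, gets tedious on a large dump. A minimal sketch of the same triage step follows; it is not part of the original report. It assumes the input is the JSON list produced by `oc get clusteroperators -o json` (an object with an `items` array), and the function name `unhealthy_operators` and the helper `_op` are hypothetical. The sample carries only the `type`/`status` pairs of three operators excerpted in this report.

```python
import json
from typing import Dict, List


def unhealthy_operators(cluster_operators_json: str) -> Dict[str, List[str]]:
    """Map operator name to its failing conditions (Degraded=True, Available=False)."""
    doc = json.loads(cluster_operators_json)
    result: Dict[str, List[str]] = {}
    for item in doc.get("items", []):
        name = item["metadata"]["name"]
        failing = []
        for cond in item.get("status", {}).get("conditions", []):
            ctype, status = cond.get("type"), cond.get("status")
            # Condition statuses are strings ("True"/"False"/"Unknown"), not booleans.
            if (ctype, status) in (("Degraded", "True"), ("Available", "False")):
                failing.append(f"{ctype}={status}")
        if failing:
            result[name] = failing
    return result


def _op(name, conditions):
    """Build a stripped-down ClusterOperator object for the sample (hypothetical helper)."""
    return {
        "metadata": {"name": name},
        "status": {"conditions": [{"type": t, "status": s} for t, s in conditions]},
    }


SAMPLE_JSON = json.dumps({"items": [
    _op("etcd", [("RecentBackup", "Unknown"), ("Degraded", "True"),
                 ("Progressing", "False"), ("Available", "True"),
                 ("Upgradeable", "True")]),
    _op("machine-config", [("Progressing", "True"), ("Upgradeable", "False"),
                           ("Degraded", "True"), ("Available", "False")]),
    _op("monitoring", [("Upgradeable", "True"), ("Available", "False"),
                       ("Progressing", "True"), ("Degraded", "True")]),
]})

if __name__ == "__main__":
    for operator, conditions in unhealthy_operators(SAMPLE_JSON).items():
        print(operator, ", ".join(conditions))
```

A summary like this only narrows down where to look; it does not distinguish cause from consequence, which still requires reading the condition messages and timestamps.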
The question is: what caused the failing master to go down, and is this related to static IPs?
Indeed. What we saw was a result of BZ2036677. After applying the workaround for this bug (https://access.redhat.com/articles/6865841), the upgrade completes successfully. Right now I'm checking if we need to apply the workaround again post upgrade.

After further investigation, this bug is a manifestation of BZ2036677 and will be closed as such.

*** This bug has been marked as a duplicate of bug 2036677 ***