Bug 2082962

Summary:	Upgrade Fails on Cluster that was Deployed with Day1 NetworkConfig
Product:	OpenShift Container Platform	Reporter:	Adina Wolff <awolff>
Component:	Bare Metal Hardware Provisioning	Assignee:	Jacob Anders <janders>
Bare Metal Hardware Provisioning sub component:	cluster-baremetal-operator	QA Contact:	Amit Ugol <augol>
Status:	CLOSED DUPLICATE	Docs Contact:
Severity:	medium
Priority:	medium	CC:	aos-bugs, janders, shardy, yliu1
Version:	4.11	Keywords:	Triaged
Target Milestone:	---
Target Release:	4.11.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-06-14 14:07:01 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Adina Wolff 2022-05-08 18:56:38 UTC

Description of problem:
Upgrade to 4.11 fails on a cluster that was deployed as 4.10 with networkConfig (static IPs) set via install-config.


Version-Release number of selected component (if applicable):
4.11.0-0.ci-2022-05-07-211700

How reproducible:
2/2


Steps to Reproduce:
1. Deploy OCP version 4.10.0-0.nightly-2022-05-07-205137 with static IPs for the nodes set via networkConfig in install-config.
2. Run upgrade to version 4.11.0-0.ci-2022-05-07-211700

Actual results:
Upgrade fails. Two of the nodes are not schedulable.


Expected results:
Upgrade succeeds


Additional info:
must-gather is at: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/must-gather.local.1367829659851044494.tar.gz

install-config is at: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/install-config.4.10.yaml

Comment 1 Jacob Anders 2022-06-07 05:35:59 UTC

Triage notes:

@Adina - do you know if this is specific to this version combination, or will it happen when upgrading from any 4.10 build to any 4.11 build (e.g. the current latest ones) as long as static IPs are used on the 4.10 cluster?

This will help us assess the severity correctly and decide whether this is likely to be a blocker BZ.

Setting it to priority/severity high as there is a possibility this may break upgrades.

Comment 2 Adina Wolff 2022-06-07 09:38:37 UTC

@janders I don't remember if I've tried other versions, but I definitely have no reason to assume the issue is limited to these specific versions.

Comment 3 Jacob Anders 2022-06-07 10:57:52 UTC

@awolff noted. I will start a thread on Slack to work out the details. I haven't looked at the logs yet, but this may be worth reproducing.

Comment 4 Adina Wolff 2022-06-08 11:40:58 UTC

As discussed, I re-ran this scenario with later versions. The same results are seen.
from version: 4.10.0-0.nightly-2022-06-06-184250
to: 4.11.0-0.nightly-2022-06-06-025509


[kni@provisionhost-0-0 ~]$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.11.0-0.nightly-2022-06-06-025509: 104 of 802 done (12% complete)

Upgradeable=False

  Reason: PoolUpdating
  Message: Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.10
warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.10&id=31935b5d-04a7-4d1f-805f-826fbddbfd7b&version=4.11.0-0.nightly-2022-06-06-025509": dial tcp 52.204.165.161:443: connect: connection refused

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-06-06-184250   True        True          98m     Working towards 4.11.0-0.nightly-2022-06-06-025509: 104 of 802 done (12% complete)
[kni@provisionhost-0-0 ~]$ oc get nodes
NAME                                              STATUS                        ROLES    AGE   VERSION
master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   Ready                         master   17h   v1.23.5+3afdacb
master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   Ready                         master   17h   v1.23.5+3afdacb
master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   NotReady,SchedulingDisabled   master   17h   v1.23.5+3afdacb
worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   NotReady,SchedulingDisabled   worker   16h   v1.23.5+3afdacb
worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   Ready                         worker   16h   v1.23.5+3afdacb
[kni@provisionhost-0-0 ~]$
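
For triage, the nodes that are not Ready can be picked out of `oc get nodes` output with a small script. This is only a minimal sketch: the function name `not_ready_nodes` is hypothetical, and it assumes the default column layout shown above (NAME first, STATUS second) and that only the table itself is passed in, without shell prompt lines.

```python
from typing import List


def not_ready_nodes(oc_output: str) -> List[str]:
    """Return names of nodes whose STATUS does not include 'Ready'.

    Assumes the default `oc get nodes` table: NAME in the first column,
    STATUS in the second. Hypothetical helper, not part of any oc tooling.
    """
    names = []
    for line in oc_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and any line too short to be a node row.
        if len(fields) < 2 or fields[0] == "NAME":
            continue
        name, status = fields[0], fields[1]
        # STATUS is a comma-separated list, e.g. "NotReady,SchedulingDisabled".
        if "Ready" not in status.split(","):
            names.append(name)
    return names


# Rows copied from the output in this comment.
sample = """\
NAME                                              STATUS                        ROLES    AGE   VERSION
master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   Ready                         master   17h   v1.23.5+3afdacb
master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   NotReady,SchedulingDisabled   master   17h   v1.23.5+3afdacb
worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   NotReady,SchedulingDisabled   worker   16h   v1.23.5+3afdacb
"""

for node in not_ready_nodes(sample):
    print(node)
```

A node that is cordoned but healthy (`Ready,SchedulingDisabled`) is deliberately not reported, since the failure here is the kubelet going NotReady rather than the cordon itself.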

Comment 5 Jacob Anders 2022-06-08 12:08:44 UTC

Thank you @awolff 

Noting that the failure also happened between the current latest 4.10 and 4.11 nightlies (as of 8 June 2022), in addition to the versions originally reported against, which adds to the severity.

Will discuss with the team.

Comment 6 Adina Wolff 2022-06-08 12:52:10 UTC

must-gather from latest failure: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/must-gather.local.144738023270642634.tar.gz

Comment 7 Adina Wolff 2022-06-13 06:51:46 UTC

This was also reproduced with an update from 4.10.13 to 4.10.17:

[kni@provisionhost-0-0 ~]$ oc get nodes
NAME                                              STATUS                        ROLES    AGE     VERSION
master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   NotReady,SchedulingDisabled   master   3h37m   v1.23.5+b463d71
master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   Ready                         master   3h33m   v1.23.5+b463d71
master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   Ready                         master   3h33m   v1.23.5+b463d71
worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   NotReady,SchedulingDisabled   worker   3h6m    v1.23.5+b463d71
worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   Ready                         worker   3h6m    v1.23.5+b463d71
[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.13   True        True          147m    Working towards 4.10.17: 95 of 771 done (12% complete)
[kni@provisionhost-0-0 ~]$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.10.17: 95 of 771 done (12% complete)

Upgradeable=False

  Reason: PoolUpdating
  Message: Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details

Upstream: http://192.168.123.68:80/yp-graph
Channel: stable-4.10
No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and may result in downtime or data loss.
[kni@provisionhost-0-0 ~]$

Comment 8 Jacob Anders 2022-06-14 05:28:52 UTC

Looking at the logs.

cluster-scoped-resources/config.openshift.io/clusteroperators.yaml:

At least one master seems to be down (in line with the reports), along with ingress (which may be a consequence rather than the cause):


  kind: ClusterOperator
(...)
    name: etcd
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: ac48dfab-714f-4645-a314-40b6a3919adc
    resourceVersion: "105801"
    uid: 5614968d-7e26-4ec9-9ec2-29ffacd16b44
  spec: {}
  status:
    conditions:
    - lastTransitionTime: "2022-05-08T14:23:16Z"
      reason: ControllerStarted
      status: Unknown
      type: RecentBackup
    - lastTransitionTime: "2022-05-08T17:52:46Z"
      message: |-
        ClusterMemberControllerDegraded: unhealthy members found during reconciling members
        DefragControllerDegraded: cluster is unhealthy: 2 of 3 members are available, master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com is unhealthy
        EtcdMembersDegraded: 2 of 3 members are available, master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com is unhealthy
        NodeControllerDegraded: The master nodes not ready: node "master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com" not ready since 2022-05-08 17:50:57 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
      reason: ClusterMemberController_SyncError::DefragController_Error::EtcdMembers_UnhealthyMembers::NodeController_MasterNodesReady
      status: "True"
      type: Degraded
    - lastTransitionTime: "2022-05-08T17:01:38Z"
      message: |-
        NodeInstallerProgressing: 3 nodes are at revision 9
        EtcdMembersProgressing: No unstarted etcd members found
      reason: AsExpected
      status: "False"
      type: Progressing
    - lastTransitionTime: "2022-05-08T14:28:30Z"
      message: |-
        StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 9
        EtcdMembersAvailable: 2 of 3 members are available, master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com is unhealthy
      reason: AsExpected
      status: "True"
      type: Available
    - lastTransitionTime: "2022-05-08T14:23:16Z"
      message: All is well
      reason: AsExpected
      status: "True"
      type: Upgradeable
(...)
    name: ingress
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: ac48dfab-714f-4645-a314-40b6a3919adc
    resourceVersion: "107450"
    uid: 65bc52b9-e3bd-4a78-8da8-a5ce98995f57
  spec: {}
  status:
    conditions:
    - lastTransitionTime: "2022-05-08T14:39:13Z"
      message: The "default" ingress controller reports Available=True.
      reason: IngressAvailable
      status: "True"
      type: Available
    - lastTransitionTime: "2022-05-08T17:21:08Z"
      message: desired and current number of IngressControllers are equal
      reason: AsExpected
      status: "False"
      type: Progressing
    - lastTransitionTime: "2022-05-08T17:56:45Z"
      message: 'The "default" ingress controller reports Degraded=True: DegradedConditions:
        One or more other status conditions indicate a degraded state: PodsScheduled=False
        (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-86cc7db48b-88sxc"
        cannot be scheduled: 0/5 nodes are available: 1 node(s) didn''t have free
        ports for the requested pod ports, 2 node(s) had taint {node-role.kubernetes.io/master:
        }, that the pod didn''t tolerate, 2 node(s) were unschedulable. Make sure
        you have sufficient worker nodes.)'
      reason: IngressDegraded
      status: "True"
      type: Degraded
    - lastTransitionTime: "2022-05-08T14:23:40Z"
      reason: IngressControllersUpgradeable
      status: "True"
      type: Upgradeable
(...)
    name: kube-apiserver
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: ac48dfab-714f-4645-a314-40b6a3919adc
    resourceVersion: "105902"
    uid: 856cab6c-c920-418f-818d-1fda4d796de1
  spec: {}
  status:
    conditions:
    - lastTransitionTime: "2022-05-08T17:52:57Z"
      message: 'NodeControllerDegraded: The master nodes not ready: node "master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com"
        not ready since 2022-05-08 17:50:57 +0000 UTC because NodeStatusUnknown (Kubelet
        stopped posting node status.)'
      reason: NodeController_MasterNodesReady
      status: "True"
      type: Degraded
    - lastTransitionTime: "2022-05-08T17:06:39Z"
      message: 'NodeInstallerProgressing: 3 nodes are at revision 9'
      reason: AsExpected
      status: "False"
      type: Progressing
    - lastTransitionTime: "2022-05-08T14:33:23Z"
      message: 'StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 9'
      reason: AsExpected
      status: "True"
      type: Available
    - lastTransitionTime: "2022-05-08T14:23:29Z"
      message: 'KubeletMinorVersionUpgradeable: Kubelet and API server minor versions
        are synced.'
      reason: AsExpected
      status: "True"
      type: Upgradeable

Some errors point to machine-config:

    name: machine-config
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: ac48dfab-714f-4645-a314-40b6a3919adc
    resourceVersion: "120692"
    uid: bb575edc-d8c2-4d4a-ab0f-50b6db480b11
  spec: {}
  status:
    conditions:
    - lastTransitionTime: "2022-05-08T17:42:38Z"
      message: Working towards 4.11.0-0.ci-2022-05-07-211700
      status: "True"
      type: Progressing
    - lastTransitionTime: "2022-05-08T17:46:33Z"
      message: One or more machine config pools are updating, please see `oc get mcp`
        for further details
      reason: PoolUpdating
      status: "False"
      type: Upgradeable
    - lastTransitionTime: "2022-05-08T17:59:49Z"
      message: 'Unable to apply 4.11.0-0.ci-2022-05-07-211700: failed to apply machine
        config daemon manifests: error during waitForDaemonsetRollout: [timed out
        waiting for the condition, daemonset machine-config-daemon is not ready. status:
        (desired: 5, updated: 5, ready: 3, unavailable: 2)]'
      reason: MachineConfigDaemonFailed
      status: "True"
      type: Degraded
    - lastTransitionTime: "2022-05-08T18:09:51Z"
      message: Cluster not available for [{operator 4.10.0-0.nightly-2022-05-07-205137}]
      status: "False"
      type: Available
(...)
Prometheus is upset, too (most certainly a consequence):

    name: monitoring
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: ac48dfab-714f-4645-a314-40b6a3919adc
    resourceVersion: "119949"
    uid: 89f0aca6-6110-4511-9594-d7d979047156
  spec: {}
  status:
    conditions:
    - lastTransitionTime: "2022-05-08T14:45:03Z"
      status: "True"
      type: Upgradeable
    - lastTransitionTime: "2022-05-08T18:02:23Z"
      message: Rollout of the monitoring stack failed and is degraded. Please investigate
        the degraded status error.
      reason: UpdatingPrometheusOperatorFailed
      status: "False"
      type: Available
    - lastTransitionTime: "2022-05-08T18:37:24Z"
      message: Rolling out the stack.
      reason: RollOutInProgress
      status: "True"
      type: Progressing
    - lastTransitionTime: "2022-05-08T18:02:23Z"
      message: 'Failed to rollout the stack. Error: updating prometheus operator:
        reconciling Prometheus Operator Admission Webhook Deployment failed: updating
        Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook:
        got 1 unavailable replicas'
      reason: UpdatingPrometheusOperatorFailed
      status: "True"
      type: Degraded

The question is: what caused the failing master to go down, and is this related to static IPs?
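
One way to separate likely causes from consequences is to order the Degraded=True conditions by their lastTransitionTime. The following is a rough sketch rather than an established tool: the function name `degraded_timeline` is hypothetical, and it assumes the input is the JSON list produced by `oc get clusteroperators -o json` (a top-level `items` array of ClusterOperator objects) instead of the YAML file from the must-gather.

```python
import json
from typing import List, Tuple


def degraded_timeline(cluster_operators_json: str) -> List[Tuple[str, str, str]]:
    """Return (lastTransitionTime, operator, reason) for Degraded=True conditions.

    Rows are sorted by timestamp. The timestamps are RFC 3339 strings in UTC
    with a fixed layout, so plain string sorting matches chronological order.
    Hypothetical helper; assumes `oc get clusteroperators -o json` input.
    """
    doc = json.loads(cluster_operators_json)
    rows = []
    for item in doc.get("items", []):
        name = item["metadata"]["name"]
        for cond in item.get("status", {}).get("conditions", []):
            # Condition status is the string "True", not a boolean.
            if cond.get("type") == "Degraded" and cond.get("status") == "True":
                rows.append(
                    (cond.get("lastTransitionTime", ""), name, cond.get("reason", ""))
                )
    return sorted(rows)


if __name__ == "__main__":
    import sys

    # Usage (assumed): oc get clusteroperators -o json | python3 this_script.py
    for when, name, reason in degraded_timeline(sys.stdin.read()):
        print(when, name, reason)
```

Applied to the conditions quoted above, this ordering would put etcd and kube-apiserver (both reporting master-0-0 not ready) ahead of ingress, machine-config and monitoring. That ordering is only a hint about what went wrong first, not proof of the root cause.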

Comment 11 Adina Wolff 2022-06-14 13:35:31 UTC

Indeed, what we saw was a result of BZ2036677.
After applying the workaround for that bug (https://access.redhat.com/articles/6865841), the upgrade completes successfully.
I'm now checking whether the workaround needs to be applied again after the upgrade.

Comment 13 Adina Wolff 2022-06-14 14:07:01 UTC

After further investigation, this bug is a manifestation of bz2036677 and will be closed as such.

*** This bug has been marked as a duplicate of bug 2036677 ***