1888075 – 4.5.14 -> 4.6.rc3 in OVN cluster failed with controller version mismatch

Summary: 4.5.14 -> 4.6.rc3 in OVN cluster failed with controller version mismatch

Keywords:
Status:	CLOSED DUPLICATE of bug 1880591
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Antonio Murdaca
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-10-14 00:44 UTC by Mike Fiedler
Modified:	2021-04-05 17:47 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-10-14 15:22:03 UTC
Target Upstream Version:
Embargoed:

Description Mike Fiedler 2020-10-14 00:44:21 UTC

Description of problem:

The cluster started at 4.4.27 and upgraded to 4.5.14 OK.  The upgrade from there to 4.6.0.rc3 hung with the workers at kube 1.19 and the masters at 1.18, and the MCO reported this error:

  Extension:
    Last Sync Error:  pool master has not progressed to latest configuration: controller version mismatch for rendered-master-c34c90c3ab3bcf286f4074a555f7c1ad expected 48d52f385642cbecf5c95e0ac4b0ec8c37664fe7 has 13dd7810adc20c7e6d99adc4179969eac54e7783: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-1e06e226a3c41e1ba728addf7e860cf6, retrying
    Master:           0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-1e06e226a3c41e1ba728addf7e860cf6
    Worker:           all 3 nodes are at latest configuration rendered-worker-74d8bdd6b44d383a653c72938cb7d6e8
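
For triage, the rendered config name and the two controller hashes can be pulled out of the Last Sync Error text above. This is a minimal sketch: the regular expression is modeled only on the message quoted in this report, and the function name parse_mismatch is made up for illustration, so other MCO releases may need a different pattern.

```python
import re

# Pattern derived from the single error message quoted in this bug;
# it is an assumption that other MCO versions word it the same way.
MISMATCH_RE = re.compile(
    r"controller version mismatch for (?P<config>\S+) "
    r"expected (?P<expected>[0-9a-f]+) has (?P<actual>[0-9a-f]+)"
)


def parse_mismatch(message):
    """Return the config name and both controller hashes, or None."""
    match = MISMATCH_RE.search(message)
    return match.groupdict() if match else None


SAMPLE_ERROR = (
    "pool master has not progressed to latest configuration: "
    "controller version mismatch for "
    "rendered-master-c34c90c3ab3bcf286f4074a555f7c1ad "
    "expected 48d52f385642cbecf5c95e0ac4b0ec8c37664fe7 "
    "has 13dd7810adc20c7e6d99adc4179969eac54e7783: "
    "0 (ready 0) out of 3 nodes are updating"
)

if __name__ == "__main__":
    print(parse_mismatch(SAMPLE_ERROR))
```

The two hashes are the controller versions the message says it expected and found for that rendered config; the sketch only extracts them and does not interpret them.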

Will add must-gather location shortly.

The cluster is a UPI install on GCP with the OVN network plugin. I will keep it around until tomorrow (14-Oct).



Version-Release number of selected component (if applicable): 4.6.0.rc3


How reproducible: Unknown


Steps to Reproduce:
1.  Installed UPI on GCP at 4.4.27
2.  Upgraded to 4.5.14 successfully.
3.  Set channel to 4.6-candidate and upgraded to 4.6.rc3

Actual results:

Upgrade hangs with an MCO error and mixed Kubernetes versions for master/compute.

Comment 2 Antonio Murdaca 2020-10-14 08:48:19 UTC

Adding Tim to assess whether those nodes just lost connectivity (ready 0).

Comment 3 Mike Fiedler 2020-10-14 11:34:47 UTC

12 hours later the cluster is still in this state - I'll put a kubeconfig location in a private comment.

root@ip-172-31-64-58: ~ # oc get nodes
NAME                                                          STATUS   ROLES    AGE   VERSION
mffiedler1013b-zk5xv-m-0.c.openshift-qe.internal              Ready    master   16h   v1.18.3+970c1b3
mffiedler1013b-zk5xv-m-1.c.openshift-qe.internal              Ready    master   16h   v1.18.3+970c1b3
mffiedler1013b-zk5xv-m-2.c.openshift-qe.internal              Ready    master   16h   v1.18.3+970c1b3
mffiedler1013b-zk5xv-worker-a-2cngf.c.openshift-qe.internal   Ready    worker   16h   v1.19.0+d59ce34
mffiedler1013b-zk5xv-worker-b-hvln5.c.openshift-qe.internal   Ready    worker   16h   v1.19.0+d59ce34
mffiedler1013b-zk5xv-worker-c-ghllj.c.openshift-qe.internal   Ready    worker   16h   v1.19.0+d59ce34
root@ip-172-31-64-58: ~ # oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-rc.3   True        False         False      6h10m
cloud-credential                           4.6.0-rc.3   True        False         False      16h
cluster-autoscaler                         4.6.0-rc.3   True        False         False      16h
config-operator                            4.6.0-rc.3   True        False         False      15h
console                                    4.6.0-rc.3   True        False         False      14h
csi-snapshot-controller                    4.6.0-rc.3   True        False         False      14h
dns                                        4.6.0-rc.3   True        False         False      14h
etcd                                       4.6.0-rc.3   True        False         False      16h
image-registry                             4.6.0-rc.3   True        False         False      16h
ingress                                    4.6.0-rc.3   True        False         False      14h
insights                                   4.6.0-rc.3   True        False         False      16h
kube-apiserver                             4.6.0-rc.3   True        False         False      16h
kube-controller-manager                    4.6.0-rc.3   True        False         False      16h
kube-scheduler                             4.6.0-rc.3   True        False         False      16h
kube-storage-version-migrator              4.6.0-rc.3   True        False         False      13h
machine-api                                4.6.0-rc.3   True        False         False      16h
machine-approver                           4.6.0-rc.3   True        False         False      15h
machine-config                             4.5.14       False       True          True       14h
marketplace                                4.6.0-rc.3   True        False         False      14h
monitoring                                 4.6.0-rc.3   True        False         False      14h
network                                    4.6.0-rc.3   True        False         False      16h
node-tuning                                4.6.0-rc.3   True        False         False      14h
openshift-apiserver                        4.6.0-rc.3   True        False         False      14h
openshift-controller-manager               4.6.0-rc.3   True        False         False      16h
openshift-samples                          4.6.0-rc.3   True        False         False      14h
operator-lifecycle-manager                 4.6.0-rc.3   True        False         False      16h
operator-lifecycle-manager-catalog         4.6.0-rc.3   True        False         False      16h
operator-lifecycle-manager-packageserver   4.6.0-rc.3   True        False         False      14h
service-ca                                 4.6.0-rc.3   True        False         False      16h
storage                                    4.6.0-rc.3   True        False         False      14h
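
A small script can pick out the operator that is holding the upgrade back from output like the above. This is a rough sketch that parses the plain-text table as printed by oc get co; the function name lagging_operators and the whitespace-splitting approach are assumptions for illustration, not part of any OpenShift tooling.

```python
def lagging_operators(table, target_version):
    """Names of operators not at target_version, unavailable, or degraded.

    Expects the plain-text table with the columns
    NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE.
    """
    lagging = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue  # ignore blank or truncated rows
        name, version, available, _progressing, degraded = fields[:5]
        if (version != target_version
                or available != "True"
                or degraded != "False"):
            lagging.append(name)
    return lagging


# A few rows copied from the output in this comment.
SAMPLE_CO = """
NAME                 VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver       4.6.0-rc.3   True        False         False      16h
machine-config       4.5.14       False       True          True       14h
network              4.6.0-rc.3   True        False         False      16h
"""

if __name__ == "__main__":
    print(lagging_operators(SAMPLE_CO, "4.6.0-rc.3"))
```

On the rows shown here, only machine-config is reported, matching the MCO error in the description.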

Comment 5 Tim Rozet 2020-10-14 15:21:09 UTC

This looks the same as the other upgrade issues with OVN. The node running 4.5 has a screwed up br-local bridge with an extra patch port, so kapi access won't work:

[root@mffiedler1013b-zk5xv-m-0 ~]# ovs-vsctl show
df305a14-74e2-4694-8e42-bebcc55fe21d
    Bridge br-local
        Port patch-lnet-node_local_switch-to-br-int
            Interface patch-lnet-node_local_switch-to-br-int
                type: patch
                options: {peer=patch-br-int-to-lnet-node_local_switch}
        Port ovn-k8s-gw0
            Interface ovn-k8s-gw0
                type: internal
        Port br-local
            Interface br-local
                type: internal
        Port patch--to-br-int
            Interface patch--to-br-int
                type: patch
                options: {peer=patch-br-int-to-}
        Port patch-br-local_mffiedler1013b-zk5xv-m-0.c.openshift-qe.internal-to-br-int
            Interface patch-br-local_mffiedler1013b-zk5xv-m-0.c.openshift-qe.internal-to-br-int
                type: patch
                options: {peer=patch-br-int-to-br-local_mffiedler1013b-zk5xv-m-0.c.openshift-qe.internal}
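
To check other nodes for the same symptom, the patch ports on each bridge can be listed from ovs-vsctl show output. The sketch below is an illustration only: the helper names and the heuristic (flagging a port with an empty name segment, such as patch--to-br-int, or a peer ending in "-to-") are assumptions based on the output above. The comment does not say which port is the extra one, so treat a flagged port as a hint rather than a diagnosis.

```python
import re


def patch_ports(show_output):
    """Map each bridge name to a list of (port, peer) for patch ports."""
    bridges = {}
    bridge = port = None
    is_patch = False
    for raw in show_output.splitlines():
        line = raw.strip()
        if line.startswith("Bridge "):
            bridge = line.split(None, 1)[1].strip('"')
            bridges.setdefault(bridge, [])
        elif line.startswith("Port "):
            port = line.split(None, 1)[1].strip('"')
            is_patch = False
        elif line == "type: patch":
            is_patch = True
        elif line.startswith("options:") and is_patch and bridge is not None:
            match = re.search(r"peer=([^,}\s]+)", line)
            peer = match.group(1).strip('"') if match else None
            bridges[bridge].append((port, peer))
    return bridges


def looks_malformed(port, peer):
    """Heuristic (an assumption): an empty name segment in port or peer."""
    return "--" in port or (peer or "").endswith("-to-")


# Abbreviated from the output in this comment.
SAMPLE_OVS = """
    Bridge br-local
        Port patch-lnet-node_local_switch-to-br-int
            Interface patch-lnet-node_local_switch-to-br-int
                type: patch
                options: {peer=patch-br-int-to-lnet-node_local_switch}
        Port ovn-k8s-gw0
            Interface ovn-k8s-gw0
                type: internal
        Port patch--to-br-int
            Interface patch--to-br-int
                type: patch
                options: {peer=patch-br-int-to-}
"""

if __name__ == "__main__":
    for name, ports in patch_ports(SAMPLE_OVS).items():
        for port, peer in ports:
            flag = "SUSPECT" if looks_malformed(port, peer) else "ok"
            print(name, port, peer, flag)
```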

Comment 6 Tim Rozet 2020-10-14 15:22:03 UTC


*** This bug has been marked as a duplicate of bug 1880591 ***

Comment 7 W. Trevor King 2021-04-05 17:47:09 UTC

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
