Description of problem:

Version-Release number of selected component (if applicable):
After successful upgrade from 4.4.5 to 4.4.11

How reproducible:

Steps to Reproduce:
1. Deploy 4.4.5 with OVN
2. Upgrade to 4.4.11
3. Check coredumps and northd container logs for "inconsistent data"

Actual results:
OVN reports inconsistent data in the northd container of the ovnkube-master pod, and systemd-coredump reports ovn-northd and ovsdb-server crashing with SIGSEGV (11).

Expected results:

Additional info:

Finished the update successfully:

oc adm upgrade
Cluster version is 4.4.11

oc logs ovnkube-master-rgjjx -n openshift-ovn-kubernetes -c northd | grep inconsistent
2020-09-03T05:26:50Z|00034|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T05:26:51Z|00035|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T05:26:56Z|00036|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T05:27:01Z|00037|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T05:27:06Z|00038|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T05:27:51Z|00043|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T05:28:54Z|00045|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T05:29:54Z|00047|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T05:30:55Z|00049|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T05:31:55Z|00051|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T05:32:55Z|00053|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T05:33:56Z|00073|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
...
2020-09-03T06:27:51Z|00207|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:28:52Z|00209|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:29:51Z|00211|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:30:52Z|00213|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:31:52Z|00215|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:32:54Z|00217|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}

oc logs ovnkube-master-jtl7p -n openshift-ovn-kubernetes -c northd | grep inconsistent
2020-09-03T06:44:28Z|00097|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:44:29Z|00098|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:44:32Z|00099|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:44:36Z|00100|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:44:42Z|00101|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:45:32Z|00105|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:46:31Z|00113|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:47:29Z|00119|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:48:31Z|00134|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:49:29Z|00138|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}

I can see this inconsistent data in my cluster after the upgrade, and coredumps:

oc debug node/master-2.rcerninaj.lab.pnq2.cee.redhat.com -- chroot /host sh -c 'coredumpctl'
Starting pod/master-2rcerninajlabpnq2ceeredhatcom-debug ...
To use host binaries, run `chroot /host`
TIME                          PID  UID  GID  SIG  COREFILE  EXE
Thu 2020-09-03 05:21:39 UTC  4510    0    0   11  present   /usr/bin/ovn-northd
Thu 2020-09-03 05:21:46 UTC  5412    0    0   11  present   /usr/sbin/ovsdb-server
Thu 2020-09-03 05:21:46 UTC  4997    0    0   11  present   /usr/sbin/ovsdb-server
Thu 2020-09-03 07:54:49 UTC  2650    0    0   11  present   /usr/bin/ovn-northd
Thu 2020-09-03 07:54:56 UTC  4248    0    0   11  present   /usr/sbin/ovsdb-server
Thu 2020-09-03 07:54:56 UTC  3293    0    0   11  present   /usr/sbin/ovsdb-server

I have identified my OVN pods:

oc get pods -A -o wide | grep ovnkube-master
openshift-ovn-kubernetes  ovnkube-master-jtl7p  4/4  Running  0  148m  10.74.177.22   master-0.rcerninaj.lab.pnq2.cee.redhat.com  <none>  <none>
openshift-ovn-kubernetes  ovnkube-master-rgjjx  4/4  Running  0  143m  10.74.178.167  master-2.rcerninaj.lab.pnq2.cee.redhat.com  <none>  <none>
openshift-ovn-kubernetes  ovnkube-master-s2bm9  4/4  Running  0  142m  10.74.178.132  master-1.rcerninaj.lab.pnq2.cee.redhat.com  <none>  <none>

Deleted them including the DB: [1]

oc debug node/master-0.rcerninaj.lab.pnq2.cee.redhat.com -- chroot /host sh -c 'rm -f /var/lib/ovn/etc/ovn*_db.db' && oc delete pod ovnkube-master-jtl7p -n openshift-ovn-kubernetes
oc debug node/master-1.rcerninaj.lab.pnq2.cee.redhat.com -- chroot /host sh -c 'rm -f /var/lib/ovn/etc/ovn*_db.db' && oc delete pod ovnkube-master-rgjjx -n openshift-ovn-kubernetes
oc debug node/master-2.rcerninaj.lab.pnq2.cee.redhat.com -- chroot /host sh -c 'rm -f /var/lib/ovn/etc/ovn*_db.db' && oc delete pod ovnkube-master-s2bm9 -n openshift-ovn-kubernetes

After the DB is recreated from the Kube API:

oc get pods -A -o wide | grep ovnkube-master
openshift-ovn-kubernetes  ovnkube-master-gtljh  4/4  Running  0  2m13s  10.74.177.22   master-0.rcerninaj.lab.pnq2.cee.redhat.com  <none>  <none>
openshift-ovn-kubernetes  ovnkube-master-l5hqh  4/4  Running  0  107s   10.74.178.167  master-2.rcerninaj.lab.pnq2.cee.redhat.com  <none>  <none>
openshift-ovn-kubernetes  ovnkube-master-s64fh  4/4  Running  0  52s    10.74.178.132  master-1.rcerninaj.lab.pnq2.cee.redhat.com  <none>  <none>

As of now, I believe I can't see any coredumps after recreating the DB following [1] above.
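For anyone hitting the same state, the workaround in [1] can be scripted as a small per-node loop. This is only a sketch based on the commands above: the node names are specific to this cluster, and the field-selector lookup is an assumption used to pair each node with the ovnkube-master pod actually scheduled on it, so the pod that gets deleted is the one whose local DB was just wiped:

for node in master-0 master-1 master-2; do
  fqdn="${node}.rcerninaj.lab.pnq2.cee.redhat.com"
  # wipe the local OVN NB/SB database files on the node
  oc debug node/"${fqdn}" -- chroot /host sh -c 'rm -f /var/lib/ovn/etc/ovn*_db.db'
  # find and delete the ovnkube-master pod running on that node so it restarts
  # and rebuilds its database from the Kube API
  pod=$(oc get pods -n openshift-ovn-kubernetes -o wide \
        --field-selector spec.nodeName="${fqdn}" | awk '/ovnkube-master/ {print $1}')
  oc delete pod "${pod}" -n openshift-ovn-kubernetes
done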
The 4.4 failure needs to be investigated and a fix, perhaps, backported.
Hi,

The errors:
> inconsistent data","error":"ovsdb error"}
should be resolved with openvswitch version 2.13.0-52, which should have been picked up by openshift/ovn-kubernetes as of a couple of days ago. It is *believed* that this version may fix the core dumps as well.

Important to note: this will not be delivered in 4.4.11, as that version was built 9 weeks ago:
https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4-stable/release/4.4.11

4.4.7 contains:

rpm -qa | grep openvswitch
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
openvswitch2.13-2.13.0-29.el7fdp.x86_64
openvswitch2.13-devel-2.13.0-29.el7fdp.x86_64

Upgrading from 4.4.5 -> 4.4.11 is thus not supported; any minor upgrade on 4.4.X will need to go to the latest 4.4.X version.
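For reference, one way to confirm which openvswitch build a cluster is actually running is shown below; this is only a sketch, it assumes the northd container image carries an rpm database, and it reuses the pod/node names from the report above:

# version shipped in the ovnkube-master image
oc -n openshift-ovn-kubernetes exec ovnkube-master-jtl7p -c northd -- rpm -qa | grep openvswitch

# version installed on the RHCOS host
oc debug node/master-0.rcerninaj.lab.pnq2.cee.redhat.com -- chroot /host sh -c 'rpm -qa | grep openvswitch'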
Sorry, typo on my part: not 4.4.7; it is 4.4.10 that contains the above.
Well, sorry again...4.4.11 :)
So it looks like the PR for openvswitch2.13-2.13.0-52.el8fdp is not even merged yet:
https://github.com/openshift/ovn-kubernetes/pull/244

Are there plans to backport this to 4.4? If so, is there a deadline or planned release? This is blocking our customer upgrades right now.
I would say no updates on 4.4 can be assumed to function properly until we have a fix for https://github.com/openshift/ovn-kubernetes/pull/244 on 4.4 and a fix for https://bugzilla.redhat.com/show_bug.cgi?id=1875438. And yes, we are working on both.
I notice the 4.5.z blocking bug https://bugzilla.redhat.com/show_bug.cgi?id=1879023 has been closed as DEFERRED. Does that close the path to getting this addressed for this critical customer?
4.4.5 to 4.5.14 is not a valid upgrade path. We should be using 4.4.5->4.4.26->4.5.14 as the upgrade path.
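For reference, a stepwise upgrade along that path would look roughly like this; it is only a sketch and assumes 4.4.26 and 4.5.14 are available update targets in the cluster's configured channel (otherwise --to-image with the release payload digest would be needed):

# first hop to the latest 4.4.z
oc adm upgrade --to=4.4.26

# once that completes, switch channels and move to 4.5
oc patch clusterversion version --type merge -p '{"spec":{"channel":"stable-4.5"}}'
oc adm upgrade --to=4.5.14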