Bug 1875282 - [OVN][upgrade] 4.4.5 -> 4.4.11 ovn crashing and generating coredump
Summary: [OVN][upgrade] 4.4.5 -> 4.4.11 ovn crashing and generating coredump
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.4.z
Assignee: Alexander Constantinescu
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On: 1879023
Blocks: 1937118
 
Reported: 2020-09-03 08:14 UTC by Robin Cernin
Modified: 2023-12-15 19:09 UTC (History)
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1879023 (view as bug list)
Environment:
Last Closed: 2020-10-22 08:26:54 UTC
Target Upstream Version:
Embargoed:



Description Robin Cernin 2020-09-03 08:14:01 UTC
Description of problem:


Version-Release number of selected component (if applicable):
After successful upgrade from 4.4.5 to 4.4.11

How reproducible:

Steps to Reproduce:
1. Deploy 4.4.5 with OVN
2. Upgrade to 4.4.11
3. Check coredumps/nortd container logs for "inconsistent data"
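
Step 3 can be sketched as a small script. This is a hedged sketch: the `app=ovnkube-master` label selector is an assumption and may need adjusting to match your cluster's pods.

```shell
# Count "inconsistent data" warnings on stdin. grep -c exits non-zero
# when there are zero matches, so `|| true` keeps the loop going.
count_inconsistent() { grep -c 'inconsistent data' || true; }

# Scan the northd container of every ovnkube-master pod.
# The app=ovnkube-master label selector is an assumption.
for pod in $(oc get pods -n openshift-ovn-kubernetes \
      -l app=ovnkube-master -o name); do
  echo "== $pod =="
  oc logs "$pod" -n openshift-ovn-kubernetes -c northd | count_inconsistent
done
```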

Actual results:

OVN reports "inconsistent data" errors in the northd container of the ovnkube-master pod, and systemd-coredump reports ovn-northd and ovsdb-server crashing with SIGSEGV (signal 11).

Expected results:


Additional info:

Finished the update successfully:

  oc adm upgrade
Cluster version is 4.4.11

 oc logs ovnkube-master-rgjjx -n openshift-ovn-kubernetes -c northd | grep inconsistent
2020-09-03T05:26:50Z|00034|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T05:26:51Z|00035|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T05:26:56Z|00036|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T05:27:01Z|00037|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T05:27:06Z|00038|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T05:27:51Z|00043|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T05:28:54Z|00045|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T05:29:54Z|00047|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T05:30:55Z|00049|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T05:31:55Z|00051|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T05:32:55Z|00053|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T05:33:56Z|00073|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
...
2020-09-03T06:27:51Z|00207|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:28:52Z|00209|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:29:51Z|00211|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:30:52Z|00213|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:31:52Z|00215|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:32:54Z|00217|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}

 oc logs ovnkube-master-jtl7p -n openshift-ovn-kubernetes -c northd | grep inconsistent
2020-09-03T06:44:28Z|00097|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:44:29Z|00098|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:44:32Z|00099|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:44:36Z|00100|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:44:42Z|00101|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:45:32Z|00105|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:46:31Z|00113|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:47:29Z|00119|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:48:31Z|00134|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T06:49:29Z|00138|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}

After the upgrade I can still see those "inconsistent data" errors in my cluster, along with coredumps:

 oc debug node/master-2.rcerninaj.lab.pnq2.cee.redhat.com -- chroot /host sh -c 'coredumpctl'
Starting pod/master-2rcerninajlabpnq2ceeredhatcom-debug ...
To use host binaries, run `chroot /host`
TIME                            PID   UID   GID SIG COREFILE  EXE
Thu 2020-09-03 05:21:39 UTC    4510     0     0  11 present   /usr/bin/ovn-northd
Thu 2020-09-03 05:21:46 UTC    5412     0     0  11 present   /usr/sbin/ovsdb-server
Thu 2020-09-03 05:21:46 UTC    4997     0     0  11 present   /usr/sbin/ovsdb-server
Thu 2020-09-03 07:54:49 UTC    2650     0     0  11 present   /usr/bin/ovn-northd
Thu 2020-09-03 07:54:56 UTC    4248     0     0  11 present   /usr/sbin/ovsdb-server
Thu 2020-09-03 07:54:56 UTC    3293     0     0  11 present   /usr/sbin/ovsdb-server
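
The listing above can be inspected further from the same debug shell. A hedged sketch follows; the PID is taken from this node's listing and will differ elsewhere, and `coredumpctl gdb` needs gdb plus matching debuginfo on the host.

```shell
# List only the executables that dumped core on SIGSEGV: the signal
# number is the 8th field of coredumpctl's default listing.
coredumpctl | awk '$8 == 11 {print $NF}' | sort -u

# Show metadata (signal, timestamp, command line) for one dump;
# PID 4510 comes from the listing above.
coredumpctl info 4510

# For a backtrace, open the dump in gdb (requires gdb + debuginfo):
#   coredumpctl gdb 4510
```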

I have identified my OVN pods:

 oc get pods -A -o wide | grep ovnkube-master
openshift-ovn-kubernetes                                ovnkube-master-jtl7p                                                  4/4     Running            0          148m   10.74.177.22    master-0.rcerninaj.lab.pnq2.cee.redhat.com   <none>           <none>
openshift-ovn-kubernetes                                ovnkube-master-rgjjx                                                  4/4     Running            0          143m   10.74.178.167   master-2.rcerninaj.lab.pnq2.cee.redhat.com   <none>           <none>
openshift-ovn-kubernetes                                ovnkube-master-s2bm9                                                  4/4     Running            0          142m   10.74.178.132   master-1.rcerninaj.lab.pnq2.cee.redhat.com   <none>           <none>


Deleted them, including the DBs [1]:

 oc debug node/master-0.rcerninaj.lab.pnq2.cee.redhat.com -- chroot /host sh -c 'rm -f /var/lib/ovn/etc/ovn*_db.db' && oc delete pod ovnkube-master-jtl7p -n openshift-ovn-kubernetes

 oc debug node/master-1.rcerninaj.lab.pnq2.cee.redhat.com -- chroot /host sh -c 'rm -f /var/lib/ovn/etc/ovn*_db.db' && oc delete pod ovnkube-master-rgjjx -n openshift-ovn-kubernetes

 oc debug node/master-2.rcerninaj.lab.pnq2.cee.redhat.com -- chroot /host sh -c 'rm -f /var/lib/ovn/etc/ovn*_db.db' && oc delete pod ovnkube-master-s2bm9 -n openshift-ovn-kubernetes
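
The three per-node commands above can be written as one loop. This is a sketch based only on the commands in this report: the `node-role.kubernetes.io/master=` and `app=ovnkube-master` selectors are assumptions, and since wiping a DB file on the wrong node risks losing OVN state, verify the pod-to-node mapping with `oc get pods -o wide` first.

```shell
# Wipe the local OVN NB/SB database files on every master node.
# The master node label selector is an assumption.
for node in $(oc get nodes -l node-role.kubernetes.io/master= -o name); do
  oc debug "$node" -- chroot /host sh -c 'rm -f /var/lib/ovn/etc/ovn*_db.db'
done

# Delete the ovnkube-master pods so the DBs are rebuilt from the Kube API.
# The app=ovnkube-master label selector is an assumption.
oc delete pods -n openshift-ovn-kubernetes -l app=ovnkube-master
```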

After the DB is recreated from Kube API:

 oc get pods -A -o wide | grep ovnkube-master
openshift-ovn-kubernetes                                ovnkube-master-gtljh                                                  4/4     Running            0          2m13s   10.74.177.22    master-0.rcerninaj.lab.pnq2.cee.redhat.com   <none>           <none>
openshift-ovn-kubernetes                                ovnkube-master-l5hqh                                                  4/4     Running            0          107s    10.74.178.167   master-2.rcerninaj.lab.pnq2.cee.redhat.com   <none>           <none>
openshift-ovn-kubernetes                                ovnkube-master-s64fh                                                  4/4     Running            0          52s     10.74.178.132   master-1.rcerninaj.lab.pnq2.cee.redhat.com   <none>           <none>

As of now, I no longer see any coredumps after recreating the DBs following [1] above.

Comment 3 Ben Bennett 2020-09-03 14:12:57 UTC
The 4.4 failure needs to be investigated and, perhaps, backported.

Comment 4 Alexander Constantinescu 2020-09-03 17:34:48 UTC
Hi

The errors:

> transaction error: {"details":"inconsistent data","error":"ovsdb error"}

Should be resolved with openvswitch version 2.13.0-52, which should have been picked up by openshift/ovn-kubernetes as of a couple of days ago. It is *believed* that this version may fix the core dumps as well.

Important to note: this will not be delivered on 4.4.11 as that version was built 9 weeks ago: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4-stable/release/4.4.11

4.4.7 contains: 

rpm -qa | grep openvswitch
openvswitch-selinux-extra-policy-1.0-15.el7fdp.noarch
openvswitch2.13-2.13.0-29.el7fdp.x86_64
openvswitch2.13-devel-2.13.0-29.el7fdp.x86_64
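
To confirm which openvswitch build each master actually runs (a hedged sketch; the master node label selector is an assumption):

```shell
# Query the installed openvswitch packages on every master node.
for node in $(oc get nodes -l node-role.kubernetes.io/master= -o name); do
  echo "== $node =="
  oc debug "$node" -- chroot /host rpm -qa 'openvswitch*'
done

# sort -V orders RPM-style version strings, so the newer build sorts last:
printf '2.13.0-29\n2.13.0-52\n' | sort -V | tail -n1    # -> 2.13.0-52
```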

Upgrading from 4.4.5 -> 4.4.11 is thus not supported; any minor upgrade on 4.4.x needs to go to the latest 4.4.x version.

Comment 5 Alexander Constantinescu 2020-09-03 17:35:44 UTC
Sorry, typo on my part: not 4.4.7, 4.4.10 contains

Comment 6 Alexander Constantinescu 2020-09-03 17:36:31 UTC
Well, sorry again...4.4.11 :)

Comment 7 Michael Zamot 2020-09-03 18:28:25 UTC
So it looks like the PR for openvswitch2.13-2.13.0-52.el8fdp is not even merged yet: https://github.com/openshift/ovn-kubernetes/pull/244

Are there plans to backport this to 4.4? If so, is there a deadline or planned release? This is blocking our customer upgrades right now.

Comment 10 Alexander Constantinescu 2020-09-04 11:53:37 UTC
I would say that no upgrades on 4.4 can be assumed to function properly until we have a fix for https://github.com/openshift/ovn-kubernetes/pull/244 on 4.4 and a fix for https://bugzilla.redhat.com/show_bug.cgi?id=1875438

And yes, we are working on them.

Comment 20 Steve Reichard 2020-09-23 22:30:49 UTC
I notice the 4.5.z blocking bug https://bugzilla.redhat.com/show_bug.cgi?id=1879023 has been closed as deferred.

Does that close the path to getting this addressed for this critical customer?

Comment 26 Feng Pan 2020-10-20 14:18:41 UTC
4.4.5 to 4.5.14 is not a valid upgrade path. We should be using 4.4.5->4.4.26->4.5.14 as the upgrade path.
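
The stepwise path above can be driven with `oc adm upgrade`. This is a sketch: `--to` selects a specific available version, and each hop must fully complete before the next one starts.

```shell
# First hop: latest 4.4.z
oc adm upgrade --to=4.4.26

# Wait for the hop to finish before continuing:
oc get clusterversion    # PROGRESSING should return to False

# Second hop: the target 4.5.z
oc adm upgrade --to=4.5.14
```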

