In our upgrade jobs, like this one [0], we end up with core dumps [1] on the master nodes. In this particular job, ovn-northd dumped core on all 3 masters [2], and one of the masters (but not the other two) also had an ovsdb-server core dump. Another example job has core dumps of both ovn-northd and ovsdb-server on all masters [3]. I'm filing a single bug for now in case the two processes are aborting for the same reason.

It's unclear whether this is causing any functional issues, as we still see these core dumps in the rare job that passes all test cases. The cores also appear to come from the initial release software (4.8 in these examples) and not the release we are upgrading to (4.9); the little bit of debugging I was able to do only worked when I was using the 4.8 binary.

To debug ovn-northd:

1. Bring up a 4.8 cluster-bot cluster.
2. exec into one of the ovnkube-master containers.
3. Download all the debuginfo and debugsource rpms from brew [4] and 'oc cp' them into the container.
4. Copy in the core file from the job artifacts.
5. yum install gdb
6. Install the rpms and run gdb:

[root@ip-10-0-144-74 tmp]# rpm -Uhv --force --nodeps *rpm
Verifying...                          ################################# [100%]
Preparing...                          ################################# [100%]
Updating / installing...
   1:ovn2.13-debugsource-20.12.0-140.e################################# [ 11%]
   2:ovn2.13-debuginfo-20.12.0-140.el8################################# [ 22%]
   3:ovn2.13-20.12.0-140.el8fdp       ################################# [ 33%]
   4:ovn2.13-central-20.12.0-140.el8fd################################# [ 44%]
   5:ovn2.13-host-20.12.0-140.el8fdp  ################################# [ 56%]
   6:ovn2.13-vtep-20.12.0-140.el8fdp  ################################# [ 67%]
   7:ovn2.13-central-debuginfo-20.12.0################################# [ 78%]
   8:ovn2.13-host-debuginfo-20.12.0-14################################# [ 89%]
   9:ovn2.13-vtep-debuginfo-20.12.0-14################################# [100%]
[root@ip-10-0-144-74 tmp]# gdb /usr/bin/ovn-northd core.ovn-northd.0.b15565cd412e47b89ea1e2dfb7cd97cd.2968.1628175478000000
<snip> ... <snip>
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/ovn-northd...Reading symbols from /usr/lib/debug/usr/bin/ovn-northd-20.12.0-140.el8fdp.x86_64.debug...done.
done.
[New LWP 1]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `ovn-northd --no-chdir -vconsole:info -vfile:off --ovnnb-db ssl:10.0.149.79:9641'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f6fe46d5e91 in abort () from /lib64/libc.so.6
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-151.el8.x86_64 libcap-ng-0.7.9-5.el8.x86_64 libevent-2.1.8-5.el8.x86_64 openssl-libs-1.1.1g-15.el8_3.x86_64 python3-libs-3.6.8-37.el8.x86_64 unbound-libs-1.7.3-15.el8.x86_64 zlib-1.2.11-17.el8.x86_64
(gdb) bt
#0  0x00007f6fe46d5e91 in abort () from /lib64/libc.so.6
#1  0x000055f9187a93fc in fatal_signal_run () at lib/fatal-signal.c:313
#2  0x000055f9187ec1d0 in poll_block () at lib/poll-loop.c:388
#3  0x000055f918732dc5 in main (argc=<optimized out>, argv=<optimized out>) at northd/ovn-northd.c:14390
(gdb) frame 3
#3  0x000055f918732dc5 in main (argc=<optimized out>, argv=<optimized out>) at northd/ovn-northd.c:14390
14390           poll_block();
(gdb) list
14385               VLOG_INFO("Resetting northbound database cluster state");
14386               ovsdb_idl_reset_min_index(ovnnb_idl_loop.idl);
14387               reset_ovnnb_idl_min_index = false;
14388           }
14389
14390           poll_block();
14391           if (should_service_stop()) {
14392               exiting = true;
14393           }
14394       }
(gdb)

For ovsdb-server the process was similar, but I could not find any relevant debuginfo or debugsymbol rpms to install. I tried building ovsdb-server from scratch with CFLAGS="-g" (because google told me to), but gdb complained that the resulting binary was newer than the core file. Still, I was able to grab something that might be useful using the installed ovsdb-server binary on that container:

[root@ip-10-0-196-88 openvswitch-2.15.1]# gdb /usr/sbin/ovsdb-server /tmp/core.ovsdb-server.0.b15565cd412e47b89ea1e2dfb7cd97cd.3698.1628175478000000
<snip> ... <snip>
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/ovsdb-server...Reading symbols from .gnu_debugdata for /usr/sbin/ovsdb-server...(no debugging symbols found)...done.
(no debugging symbols found)...done.
[New LWP 193]
[New LWP 1]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `ovsdb-server -vconsole:info -vfile:off --log-file=/var/log/ovn/ovsdb-server-sb.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f68a5feee91 in abort () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x7f68a4c14700 (LWP 193))]
Missing separate debuginfos, use: yum debuginfo-install openvswitch2.15-2.15.0-9.el8fdp.x86_64
(gdb) bt
#0  0x00007f68a5feee91 in abort () from /lib64/libc.so.6
#1  0x00005604ee6a826c in fatal_signal_run ()
#2  0x00005604ee6b8600 in poll_block ()
#3  0x00005604ee6aff5a in ovsrcu_postpone_thread ()
#4  0x00005604ee6b1143 in ovsthread_wrapper ()
#5  0x00007f68a6c5a14a in start_thread () from /lib64/libpthread.so.0
#6  0x00007f68a60c9dc3 in clone () from /lib64/libc.so.6
(gdb) frame 1

I think the consensus is that having these processes abort during an upgrade is not ideal. In the same vein, there is a PR [5] in progress that will fail the step that gathers core files if any are found, so these upgrade jobs will never be able to pass once that step can fail.
[0] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1423271989505691648
[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1423271989505691648/artifacts/e2e-aws-ovn-upgrade/gather-core-dump/artifacts/
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1423271989505691648/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/oc_cmds/machines
[3] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1425022728976470016/artifacts/e2e-aws-ovn-upgrade/gather-core-dump/artifacts/
[4] https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1630236
Hi. This looks very much like a duplicate of BZ1957030. The sequence of events is as follows:

1. The container is terminating and sends SIGTERM to the process inside.
2. The process intercepts the SIGTERM, finalizes a few things, and tries to re-raise the signal to terminate itself with the correct exit code.
3. The process calls signal(SIGTERM, SIG_DFL); raise(SIGTERM);
4. raise(SIGTERM) fails inside glibc!
5. The process tries to terminate itself with abort().
6. abort() inside glibc calls raise(SIGABRT), and that fails too!
7. After a few attempts to raise a signal, glibc gives up and executes ABORT_INSTRUCTION, which basically generates SIGSEGV.
8. SIGSEGV terminates the process with a core dump.

There is really nothing the application can do about this. The only way to fix it is to figure out why glibc fails to raise the signal. So this is not an OVN or OVS issue; if anything, this BZ should be reassigned to glibc for investigation.
Yes, agreed, it is a dup of bz1957030. Not sure how I missed that in my original search for this.

*** This bug has been marked as a duplicate of bug 1957030 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days