In our upgrade jobs, like this one [0], we end up with core dumps [1] on the master nodes. In this particular job, ovn-northd dumped core on all 3 masters [2], and one of the masters (but not the other two) also had an ovsdb-server core dump. Another example job has core dumps of both ovn-northd and ovsdb-server on all masters [3]. I'm filing a single bug for now in case the two processes are aborting for the same reason.

It's unclear whether this is causing any functional issues, as we still see these core dumps in the rare job that passes all test cases. The cores also appear to come from the initial release software (4.8 in these examples) and not the release we are upgrading to (4.9); the little bit of debugging I was able to do only worked when I was using the 4.8 binary.

To debug ovn-northd:

1. Bring up a 4.8 cluster-bot cluster.
2. exec into one of the ovnkube-master containers.
3. Download all the debuginfo and debugsource rpms from brew [4] and 'oc cp' them into the container.
4. Copy in the core file from the job artifacts.
5. yum install gdb
6. Install the rpms and run gdb:

[root@ip-10-0-144-74 tmp]# rpm -Uhv --force --nodeps *rpm
Verifying...                          ################################# [100%]
Preparing...                          ################################# [100%]
Updating / installing...
   1:ovn2.13-debugsource-20.12.0-140.e################################# [ 11%]
   2:ovn2.13-debuginfo-20.12.0-140.el8################################# [ 22%]
   3:ovn2.13-20.12.0-140.el8fdp       ################################# [ 33%]
   4:ovn2.13-central-20.12.0-140.el8fd################################# [ 44%]
   5:ovn2.13-host-20.12.0-140.el8fdp  ################################# [ 56%]
   6:ovn2.13-vtep-20.12.0-140.el8fdp  ################################# [ 67%]
   7:ovn2.13-central-debuginfo-20.12.0################################# [ 78%]
   8:ovn2.13-host-debuginfo-20.12.0-14################################# [ 89%]
   9:ovn2.13-vtep-debuginfo-20.12.0-14################################# [100%]
[root@ip-10-0-144-74 tmp]# gdb /usr/bin/ovn-northd core.ovn-northd.0.b15565cd412e47b89ea1e2dfb7cd97cd.2968.1628175478000000
<snip> ... <snip>
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/ovn-northd...Reading symbols from /usr/lib/debug/usr/bin/ovn-northd-20.12.0-140.el8fdp.x86_64.debug...done.
done.
[New LWP 1]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `ovn-northd --no-chdir -vconsole:info -vfile:off --ovnnb-db ssl:10.0.149.79:9641'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f6fe46d5e91 in abort () from /lib64/libc.so.6
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-151.el8.x86_64 libcap-ng-0.7.9-5.el8.x86_64 libevent-2.1.8-5.el8.x86_64 openssl-libs-1.1.1g-15.el8_3.x86_64 python3-libs-3.6.8-37.el8.x86_64 unbound-libs-1.7.3-15.el8.x86_64 zlib-1.2.11-17.el8.x86_64
(gdb) bt
#0  0x00007f6fe46d5e91 in abort () from /lib64/libc.so.6
#1  0x000055f9187a93fc in fatal_signal_run () at lib/fatal-signal.c:313
#2  0x000055f9187ec1d0 in poll_block () at lib/poll-loop.c:388
#3  0x000055f918732dc5 in main (argc=<optimized out>, argv=<optimized out>) at northd/ovn-northd.c:14390
(gdb) frame 3
#3  0x000055f918732dc5 in main (argc=<optimized out>, argv=<optimized out>) at northd/ovn-northd.c:14390
14390           poll_block();
(gdb) list
14385               VLOG_INFO("Resetting northbound database cluster state");
14386               ovsdb_idl_reset_min_index(ovnnb_idl_loop.idl);
14387               reset_ovnnb_idl_min_index = false;
14388           }
14389
14390           poll_block();
14391           if (should_service_stop()) {
14392               exiting = true;
14393           }
14394       }
(gdb)

For ovsdb-server the process was similar, but I could not find any relevant debuginfo or debugsymbol rpms to install. I tried building ovsdb-server from scratch with CFLAGS="-g" (because google told me to), but gdb complained that the resulting binary was newer than the core file. Still, I was able to grab something that might be useful using the installed ovsdb-server binary on that container:

[root@ip-10-0-196-88 openvswitch-2.15.1]# gdb /usr/sbin/ovsdb-server /tmp/core.ovsdb-server.0.b15565cd412e47b89ea1e2dfb7cd97cd.3698.1628175478000000
<snip> ... <snip>
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/ovsdb-server...Reading symbols from .gnu_debugdata for /usr/sbin/ovsdb-server...(no debugging symbols found)...done.
(no debugging symbols found)...done.
[New LWP 193]
[New LWP 1]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `ovsdb-server -vconsole:info -vfile:off --log-file=/var/log/ovn/ovsdb-server-sb.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f68a5feee91 in abort () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x7f68a4c14700 (LWP 193))]
Missing separate debuginfos, use: yum debuginfo-install openvswitch2.15-2.15.0-9.el8fdp.x86_64
(gdb) bt
#0  0x00007f68a5feee91 in abort () from /lib64/libc.so.6
#1  0x00005604ee6a826c in fatal_signal_run ()
#2  0x00005604ee6b8600 in poll_block ()
#3  0x00005604ee6aff5a in ovsrcu_postpone_thread ()
#4  0x00005604ee6b1143 in ovsthread_wrapper ()
#5  0x00007f68a6c5a14a in start_thread () from /lib64/libpthread.so.0
#6  0x00007f68a60c9dc3 in clone () from /lib64/libc.so.6
(gdb) frame 1

I think the consensus is that having these processes abort during an upgrade is not ideal. In the same vein, there is a PR [5] in progress that will fail the step that gathers core files if any are found, so these upgrade jobs will never be able to pass once that step can fail.
[0] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1423271989505691648
[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1423271989505691648/artifacts/e2e-aws-ovn-upgrade/gather-core-dump/artifacts/
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1423271989505691648/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/oc_cmds/machines
[3] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1425022728976470016/artifacts/e2e-aws-ovn-upgrade/gather-core-dump/artifacts/
[4] https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1630236
Hi. This looks very much like a duplicate of BZ1957030. The sequence of events is as follows:

1. The container is terminating and sends SIGTERM to the process inside.
2. The process intercepts the SIGTERM, finalizes a few things, and tries to re-raise the signal to terminate itself with the correct exit code.
3. The process calls signal(SIGTERM, SIG_DFL); raise(SIGTERM);
4. raise(SIGTERM) fails inside glibc!
5. The process tries to terminate itself with abort().
6. abort() inside glibc calls raise(SIGABRT), and that fails too!
7. After a few attempts to raise a signal, glibc gives up and executes ABORT_INSTRUCTION, which basically generates SIGSEGV.
8. SIGSEGV terminates the process with a core dump.

There is really nothing the application can do about this. The only way to fix it is to figure out why glibc fails to raise the signal. So this is not an OVN or OVS issue; if anything, this BZ should be reassigned to glibc for investigation.
Yes, agreed, it is a dup of bz1957030. Not sure how I missed that in my original search for this.

*** This bug has been marked as a duplicate of bug 1957030 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days