Bug 1883242 - northd and ovn NBDB both crash during ovn-k8s deployment
Summary: northd and ovn NBDB both crash during ovn-k8s deployment
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: OVN
Version: RHEL 8.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Dumitru Ceara
QA Contact: Jianlin Shi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-28 14:52 UTC by Tim Rozet
Modified: 2021-05-31 00:54 UTC
2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-12 13:40:44 UTC
Target Upstream Version:


Attachments
ovsdb server core dump (9.88 MB, application/x-lz4)
2020-09-28 14:54 UTC, Tim Rozet
northd core dump (5.24 MB, application/x-lz4)
2020-09-28 14:54 UTC, Tim Rozet
ovn dbs (981.14 KB, application/gzip)
2020-09-28 14:55 UTC, Tim Rozet

Description Tim Rozet 2020-09-28 14:52:30 UTC
Description of problem:
While investigating upgrade failure in:
https://bugzilla.redhat.com/show_bug.cgi?id=1880591#c29

I see that both northd and nbdb crashed. The nbdb crash is a segfault:
2020-09-28T07:05:42Z|00048|raft|INFO|current entry eid c412ebfa-44ea-4406-a6f8-3f0260801ab4 does not match prerequisite 0a392e2c-d534-47d0-8c65-7329dec47f6f in execute_command_request
2020-09-28T07:05:52Z|00049|raft|INFO|current entry eid 7887dfd3-7719-4a20-98c7-ea5552f70363 does not match prerequisite e50798a4-d7d1-4be2-b20b-11a8734cd445 in execute_command_request
2020-09-28T07:06:11Z|00050|raft|INFO|Dropped 4 log messages in last 20 seconds (most recently, 18 seconds ago) due to excessive rate
2020-09-28T07:06:11Z|00051|raft|INFO|current entry eid d9154f99-15c3-401b-9b1b-cdc006f15357 does not match prerequisite 438c7302-f9ba-492e-b1b1-c657f4c446a0 in execute_command_request

2020-09-28T07:08:08Z|00001|fatal_signal(log_fsync0)|WARN|terminating with signal 15 (Terminated)
2020-09-28T07:08:08Z|00001|fatal_signal(urcu2)|WARN|terminating with signal 15 (Terminated)
2020-09-28T07:08:08Z|00052|fatal_signal|WARN|terminating with signal 11 (Segmentation fault)

The only thing unique to this upgrade was that QE was trying a patch I wrote to remove certain reject ACLs. The patch was flawed and just ends up producing an error from ovn-nbctl:

E0928 14:12:52.224724       1 loadbalancer.go:291] Error while removing ACL: 866090f8-397f-4446-9b51-5d0e0f07a118-172.30.229.223\:443, from switches, error: OVN command '/usr/bin/ovn-nbctl --timeout=15 -- --if-exists remove logical_switch 51f5ee00-ae2b-48a9-92f3-7beb877b5106 acl 866090f8-397f-4446-9b51-5d0e0f07a118-172.30.229.223\:443 -- --if-exists remove logical_switch e9c07de0-65c5-47e3-ad1e-072a4ecc0bdb acl 866090f8-397f-4446-9b51-5d0e0f07a118-172.30.229.223\:443 -- --if-exists remove logical_switch e9319c12-33b5-49d0-bea7-eb41fcc3bd0a acl 866090f8-397f-4446-9b51-5d0e0f07a118-172.30.229.223\:443 -- --if-exists remove logical_switch 7ac93672-2b9a-4221-9ee9-1d9ad0fccfbb acl 866090f8-397f-4446-9b51-5d0e0f07a118-172.30.229.223\:443 -- --if-exists remove logical_switch 8af3f2d8-3a11-48c7-9015-d8501bad79b4 acl 866090f8-397f-4446-9b51-5d0e0f07a118-172.30.229.223\:443 -- --if-exists remove logical_switch 4650f909-f5cf-4465-97e5-01d52b9f5469 acl 866090f8-397f-4446-9b51-5d0e0f07a118-172.30.229.223\:443' failed: exit status 1
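For reference, the failing command above chains several `--if-exists remove` clauses with `--` so that one ovn-nbctl invocation (and thus one NBDB transaction) removes the ACL from every switch. A minimal shell sketch of how such a batched command line can be assembled (ovn-k8s actually builds it in Go; the ACL name and switch UUIDs below are placeholders taken from the log):

```shell
# Placeholders: in the real deployment these come from the NBDB.
acl='866090f8-397f-4446-9b51-5d0e0f07a118-172.30.229.223\:443'
switches='51f5ee00-ae2b-48a9-92f3-7beb877b5106 e9c07de0-65c5-47e3-ad1e-072a4ecc0bdb'

# Chain one "--if-exists remove" clause per switch, separated by "--",
# so a single ovn-nbctl run covers all of them in one transaction.
cmd='ovn-nbctl --timeout=15'
for sw in $switches; do
    cmd="$cmd -- --if-exists remove logical_switch $sw acl $acl"
done
printf '%s\n' "$cmd"
```

With `--if-exists`, a clause whose switch or ACL is already gone is skipped instead of failing the whole batch, which is why the exit status 1 above points at a different flaw in the patch (the ACL reference itself being malformed) rather than a missing row.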

[root@ip-10-0-74-3 ~]# rpm -qa |grep ovn
ovn2.13-central-20.06.2-11.el8fdp.x86_64
ovn2.13-vtep-20.06.2-11.el8fdp.x86_64
ovn2.13-20.06.2-11.el8fdp.x86_64
ovn2.13-host-20.06.2-11.el8fdp.x86_64

Will attach the dbs and core dumps. If the nbdb crash is different from the northd one, let me know and I'll open a separate bug.

Comment 1 Tim Rozet 2020-09-28 14:54:06 UTC
Created attachment 1717279 [details]
ovsdb server core dump

Comment 2 Tim Rozet 2020-09-28 14:54:39 UTC
Created attachment 1717280 [details]
northd core dump

Comment 3 Tim Rozet 2020-09-28 14:55:41 UTC
Created attachment 1717281 [details]
ovn dbs

Comment 5 Tim Rozet 2020-10-12 13:14:34 UTC
I'm not seeing this anymore, and it looks like the segfault was caused by the termination. I'm not sure when it happened during upgrade due to the log rotation. I'd suggest we close this for now and if we see it again reopen.

Comment 6 Dumitru Ceara 2020-10-12 13:40:44 UTC
(In reply to Tim Rozet from comment #5)
> I'm not seeing this anymore, and it looks like the segfault was caused by
> the termination. I'm not sure when it happened during upgrade due to the log
> rotation. I'd suggest we close this for now and if we see it again reopen.

Thanks for the update. Closing the BZ for now.

