Currently, ovn-ctl just executes 'ovs-appctl -t <database> exit' when running the stop_ovsdb/stop_nb_ovsdb/stop_sb_ovsdb commands. However, 'ovs-appctl -t <database> exit' doesn't wait for the process to actually exit; it only notifies the process that it needs to exit.

This causes issues in container environments. When the container engine asks the OVN container to exit, it sends SIGTERM. 'ovn-ctl stop_ovsdb' could be executed as a preStop hook for this container to perform a graceful shutdown of the databases, but that will not happen, because right after 'ovn-ctl stop_ovsdb' returns, the container engine will send SIGTERM to stop all the remaining processes. Since the databases are still alive at this point, they will receive the signal and terminate without detaching the storage and closing connections gracefully. This may result in longer failure detection and service downtime. And while this should still be OK for a cluster, it's better not to stress all the failover mechanisms if not necessary.

ovn-ctl should use a procedure similar to stop_ovn_daemon() for databases and actually wait for the process to exit.
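The difference between "ask the process to exit" and "wait until it has exited" can be sketched in plain shell. This is only an illustration, not the actual ovn-ctl change: wait_for_exit and stop_and_wait are hypothetical helper names, the 10 second default timeout is an assumption, and the exit request is passed in as an arbitrary command (for example 'ovs-appctl -t <database> exit').

```shell
#!/bin/sh
# Hypothetical helpers, not part of ovn-ctl.

# Poll until the given pid no longer exists, or until the timeout
# (in seconds, assumed default 10) runs out.
# Returns 0 if the process is gone, 1 on timeout.
wait_for_exit() {
    pid=$1
    timeout=${2:-10}
    elapsed=0
    # 'kill -0' sends no signal; it only checks that the pid exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Run the exit request (everything after the first two arguments),
# then block until the process has really gone away.
stop_and_wait() {
    pid=$1
    timeout=$2
    shift 2
    "$@" || return 1
    wait_for_exit "$pid" "$timeout"
}
```

With helpers like these, a preStop hook would return only after the database process has finished its own shutdown, so a SIGTERM sent afterwards by the container engine finds nothing left to interrupt.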
Related OCP issue: https://bugzilla.redhat.com/show_bug.cgi?id=1944264
http://patchwork.ozlabs.org/project/openvswitch/list/?series=238747&state=%2A&archive=both
v2: http://patchwork.ozlabs.org/project/ovn/list/?series=238774&state=%2A&archive=both
Tested on ovn2.13-20.12.0-97.el8fdp

[root@dell-per740-30 ovn]# pgrep -f OVN_Northbound
52570
[root@dell-per740-30 ovn]# ovn-ctl stop_nb_ovsdb && kill -15 52570
[root@dell-per740-30 ovn]# cat ovsdb-server-nb.log
2021-06-16T19:00:13.290Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-06-16T19:00:13.320Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-16T19:00:23.332Z|00003|memory|INFO|6048 kB peak resident set size after 10.0 seconds
2021-06-16T19:00:23.332Z|00004|memory|INFO|cells:43 monitors:2 sessions:1
2021-06-16T19:06:29.110Z|00002|daemon_unix(monitor)|INFO|pid 39527 died, exit status 0, exiting
=======> kill command executed without error

Tested on ovn2.13-20.12.0-135.el8fdp.

[root@dell-per740-33 ~]# pgrep -f OVN_Northbound
52676
[root@dell-per740-33 ~]# ovn-ctl stop_nb_ovsdb && kill -15 52676
Exiting ovnnb_db (52676) [ OK ]
-bash: kill: (52676) - No such process
[root@dell-per740-33 ~]# cd /var/log/ovn
[root@dell-per740-33 ovn]# cat ovsdb-server-nb.log
2021-06-17T19:09:51.829Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-06-17T19:09:51.839Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-17T19:10:01.851Z|00005|memory|INFO|6224 kB peak resident set size after 10.0 seconds
2021-06-17T19:10:01.851Z|00006|memory|INFO|cells:34 monitors:0
2021-06-17T19:10:31.035Z|00002|daemon_unix(monitor)|INFO|pid 52676 died, exit status 0, exiting
==========> kill ended up in error

strace the OVN_Northbound process in both the versions showed similar results.

After stop_nb_ovsdb, in both the versions, the ports remained opened and I can ping on these ports to netns. However ovn-nbctl does not show any valid databases. Is this what is expected?
(In reply to Ehsan Elahi from comment #7)
> strace the OVN_Northbound process in both the versions showed similar
> results.

OVS processes signals and the "exit" command in the main loop. I see that OVS reported exit code 0 in both cases, but 'kill' succeeded in the first case. It's probably because in the first case the signal was delivered to ovsdb-server while it was already too far into the processing of the "exit" command, so it didn't get a chance to process SIGTERM.

Though it seems to be fine to just check that kill fails, you may want to use a non-maskable signal to get a clearer result, e.g. SIGKILL or SIGSEGV. This way OVS will not be able to trap it, so the exit code will reflect the signal regardless of the current state of ovsdb-server.

> After stop_nb_ovsdb, in both the versions, the ports remained opened and I
> can ping on these ports to netns. However ovn-nbctl does not show any valid
> databases. Is this what is expected?

Yes, this is fine. A dead Northbound database should not affect the dataplane, so ports and traffic should still work.
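To illustrate the point about trappable versus non-trappable signals, here is a small shell sketch that uses a stand-in child process rather than a real OVS daemon. status_after_signal is a hypothetical helper; the child traps SIGTERM and exits 0, roughly as a daemon with a SIGTERM handler could, while SIGKILL cannot be trapped at all.

```shell
#!/bin/sh
# Hypothetical helper: start a child that traps SIGTERM, send it the
# given signal, and print the exit status the parent shell observes.
status_after_signal() {
    sig=$1
    # Stand-in for a daemon: exits cleanly (status 0) on SIGTERM.
    sh -c 'trap "exit 0" TERM; while :; do sleep 1; done' &
    child=$!
    sleep 1                     # give the child time to install its trap
    kill -s "$sig" "$child"
    status=0
    wait "$child" || status=$?
    echo "$status"
}

# In shells such as bash and dash, a process killed by a signal is
# reported as 128 + signal number, so SIGKILL (9) shows up as 137,
# while the trapped SIGTERM is reported as a normal exit with 0.
```

A status of 0 after SIGTERM therefore does not show whether the signal arrived before or after the process decided to exit, whereas a SIGKILL that reaches a live process always leaves a visible trace in the exit status.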
Tried different types of signals.

Reproduced on:

# rpm -qa | grep ovn
ovn2.13-20.12.0-97.el8fdp.x86_64
ovn2.13-host-20.12.0-97.el8fdp.x86_64
ovn2.13-central-20.12.0-97.el8fdp.x86_64

# export PATH=$PATH:/usr/share/ovn/scripts
# pgrep -f OVN_Northbound
39579
# ovn-ctl stop_nb_ovsdb && kill -9 39579
# cat ovsdb-server-nb.log
2021-06-21T11:02:51.928Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-06-21T11:02:51.949Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-21T11:03:01.962Z|00003|memory|INFO|6052 kB peak resident set size after 10.0 seconds
2021-06-21T11:03:01.962Z|00004|memory|INFO|cells:43 monitors:2 sessions:1
2021-06-21T11:06:03.922Z|00002|daemon_unix(monitor)|INFO|pid 39579 died, killed (Killed), exiting
<=========== db server killed through the signal and the signal details mentioned in the log

# ovn-ctl start_nb_ovsdb
/etc/ovn/ovnnb_db.db does not exist ... (warning).
Creating empty database /etc/ovn/ovnnb_db.db [ OK ]
Starting ovsdb-nb [ OK ]
# pgrep -f OVN_Northbound
40173
# ovn-ctl stop_nb_ovsdb && kill -11 40173
# cat ovsdb-server-nb.log
2021-06-21T11:02:51.928Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-06-21T11:02:51.949Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-21T11:03:01.962Z|00003|memory|INFO|6052 kB peak resident set size after 10.0 seconds
2021-06-21T11:03:01.962Z|00004|memory|INFO|cells:43 monitors:2 sessions:1
2021-06-21T11:06:03.922Z|00002|daemon_unix(monitor)|INFO|pid 39579 died, killed (Killed), exiting
2021-06-21T11:12:45.186Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-06-21T11:12:45.196Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-21T11:12:55.208Z|00003|memory|INFO|6024 kB peak resident set size after 10.0 seconds
2021-06-21T11:12:55.208Z|00004|memory|INFO|cells:34 monitors:0
2021-06-21T11:13:25.470Z|00002|backtrace(monitor)|WARN|Backtrace using libunwind not supported.
2021-06-21T11:13:25.470Z|00003|daemon_unix(monitor)|ERR|1 crashes: pid 40173 died, killed (Segmentation fault), core dumped, restarting
2021-06-21T11:13:25.474Z|00004|ovsdb_server(ovsdb-server)|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-21T11:13:25.474Z|00005|memory(ovsdb-server)|INFO|4736 kB peak resident set size after 40.3 seconds
2021-06-21T11:13:25.474Z|00006|memory(ovsdb-server)|INFO|cells:14 monitors:0
<============ db server killed through the signal and the signal name can be seen in the log above

Verified on:

# rpm -qa | grep ovn
ovn2.13-20.12.0-135.el8fdp.x86_64
ovn2.13-host-20.12.0-135.el8fdp.x86_64
ovn2.13-central-20.12.0-135.el8fdp.x86_64

# rpm -qa | grep ovn
ovn2.13-central-20.12.0-135.el7fdp.x86_64
ovn2.13-20.12.0-135.el7fdp.x86_64
ovn2.13-host-20.12.0-135.el7fdp.x86_64

# rpm -qa | grep ovn
ovn-2021-21.03.0-40.el8fdp.x86_64
ovn-2021-host-21.03.0-40.el8fdp.x86_64
ovn-2021-central-21.03.0-40.el8fdp.x86_64

## Below results are from verification on 135.el8fdp. Similar results on the other two releases.

# pgrep -f OVN_Northbound
97671
# ovn-ctl stop_nb_ovsdb && kill -9 97671
Exiting ovnnb_db (97671) [ OK ]
-bash: kill: (97671) - No such process
# cat ovsdb-server-nb.log
2021-06-21T10:37:41.529Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-06-21T10:37:41.536Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-21T10:37:41.546Z|00003|jsonrpc|WARN|unix#0: receive error: Connection reset by peer
2021-06-21T10:37:41.546Z|00004|reconnect|WARN|unix#0: connection dropped (Connection reset by peer)
2021-06-21T10:37:51.547Z|00005|memory|INFO|6224 kB peak resident set size after 10.0 seconds
2021-06-21T10:37:51.547Z|00006|memory|INFO|cells:34 monitors:0
2021-06-21T10:53:05.863Z|00002|daemon_unix(monitor)|INFO|pid 97671 died, exit status 0, exiting

# ovn-ctl start_nb_ovsdb
Starting ovsdb-nb [ OK ]
# pgrep -f OVN_Northbound
97856
# ovn-ctl stop_nb_ovsdb && kill -11 97856
Exiting ovnnb_db (97856) [ OK ]
-bash: kill: (97856) - No such process
# cat ovsdb-server-nb.log
2021-06-21T10:37:41.529Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-06-21T10:37:41.536Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-21T10:37:41.546Z|00003|jsonrpc|WARN|unix#0: receive error: Connection reset by peer
2021-06-21T10:37:41.546Z|00004|reconnect|WARN|unix#0: connection dropped (Connection reset by peer)
2021-06-21T10:37:51.547Z|00005|memory|INFO|6224 kB peak resident set size after 10.0 seconds
2021-06-21T10:37:51.547Z|00006|memory|INFO|cells:34 monitors:0
2021-06-21T10:53:05.863Z|00002|daemon_unix(monitor)|INFO|pid 97671 died, exit status 0, exiting
2021-06-21T10:54:20.602Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-06-21T10:54:20.611Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-21T10:54:20.622Z|00003|jsonrpc|WARN|unix#0: receive error: Connection reset by peer
2021-06-21T10:54:20.622Z|00004|reconnect|WARN|unix#0: connection dropped (Connection reset by peer)
2021-06-21T10:54:30.621Z|00005|memory|INFO|6024 kB peak resident set size after 10.0 seconds
2021-06-21T10:54:30.621Z|00006|memory|INFO|cells:34 monitors:0
2021-06-21T10:55:24.073Z|00002|daemon_unix(monitor)|INFO|pid 97856 died, exit status 0, exiting

# ovn-ctl start_nb_ovsdb
Starting ovsdb-nb [ OK ]
# pgrep -f OVN_Northbound
98371
# ovn-ctl stop_nb_ovsdb && kill -15 98371
Exiting ovnnb_db (98371) [ OK ]
-bash: kill: (98371) - No such process
# cat ovsdb-server-nb.log
2021-06-21T10:37:41.529Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-06-21T10:37:41.536Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-21T10:37:41.546Z|00003|jsonrpc|WARN|unix#0: receive error: Connection reset by peer
2021-06-21T10:37:41.546Z|00004|reconnect|WARN|unix#0: connection dropped (Connection reset by peer)
2021-06-21T10:37:51.547Z|00005|memory|INFO|6224 kB peak resident set size after 10.0 seconds
2021-06-21T10:37:51.547Z|00006|memory|INFO|cells:34 monitors:0
2021-06-21T10:53:05.863Z|00002|daemon_unix(monitor)|INFO|pid 97671 died, exit status 0, exiting
2021-06-21T10:54:20.602Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-06-21T10:54:20.611Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-21T10:54:20.622Z|00003|jsonrpc|WARN|unix#0: receive error: Connection reset by peer
2021-06-21T10:54:20.622Z|00004|reconnect|WARN|unix#0: connection dropped (Connection reset by peer)
2021-06-21T10:54:30.621Z|00005|memory|INFO|6024 kB peak resident set size after 10.0 seconds
2021-06-21T10:54:30.621Z|00006|memory|INFO|cells:34 monitors:0
2021-06-21T10:55:24.073Z|00002|daemon_unix(monitor)|INFO|pid 97856 died, exit status 0, exiting
2021-06-21T11:50:49.421Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-06-21T11:50:49.432Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-21T11:50:49.442Z|00003|jsonrpc|WARN|unix#0: receive error: Connection reset by peer
2021-06-21T11:50:49.442Z|00004|reconnect|WARN|unix#0: connection dropped (Connection reset by peer)
2021-06-21T11:50:59.443Z|00005|memory|INFO|5920 kB peak resident set size after 10.0 seconds
2021-06-21T11:50:59.443Z|00006|memory|INFO|cells:34 monitors:0
2021-06-21T11:51:26.157Z|00002|daemon_unix(monitor)|INFO|pid 98371 died, exit status 0, exiting
<============= For every type of kill signal, the db stopped normally as expected with exit status 0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2507