Bug 1944239 - [OVN] ovn-ctl should wait for database processes being stopped
Summary: [OVN] ovn-ctl should wait for database processes being stopped
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: OVN
Version: FDP 21.C
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: OVN Team
QA Contact: Ehsan Elahi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-29 15:08 UTC by Ilya Maximets
Modified: 2021-06-21 14:46 UTC
4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-21 14:44:39 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1944264 1 urgent CLOSED [ovn] CNO should gracefully terminate OVN databases 2022-08-10 10:36:37 UTC
Red Hat Product Errata RHBA-2021:2507 0 None None None 2021-06-21 14:46:02 UTC

Internal Links: 1944264

Description Ilya Maximets 2021-03-29 15:08:56 UTC
Currently ovn-ctl just executes 'ovs-appctl -t <database> exit' when
running the ovn-ctl stop_ovsdb/stop_nb_ovsdb/stop_sb_ovsdb commands.

However, 'ovs-appctl -t <database> exit' doesn't wait for the process
to actually exit; it only notifies the process that it needs to exit.
This causes issues in container environments.
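
For illustration, the race can be observed directly from a shell (the
ctl socket path below is an assumption about the default OVN run
directory, not something taken from this report):

  pid=$(pgrep -f OVN_Northbound)
  ovs-appctl -t /var/run/ovn/ovnnb_db.ctl exit   # only requests exit, returns at once
  kill -0 "$pid" 2>/dev/null && echo "ovsdb-server still running after 'exit'"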

When the container engine asks the OVN container to exit, it sends
SIGTERM.  A preStop hook for this container could run 'ovn-ctl
stop_ovsdb' to shut the databases down gracefully, but that graceful
shutdown will not happen: right after 'ovn-ctl stop_ovsdb' returns, the
container engine sends SIGTERM to stop all the remaining processes.
Since the databases are still alive at this point, they receive the
signal and terminate without detaching the storage or closing
connections gracefully.

This may result in longer failure detection and service downtime.
And while this should still be OK for a cluster, it's better not to
stress all the failover mechanisms unless necessary.

ovn-ctl should use a procedure similar to stop_ovn_daemon() for the
databases and actually wait for the process to exit.
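
As a rough sketch of the intended behaviour (not the actual ovn-ctl
implementation; the function name, pidfile path, and ctl socket path
are assumptions about the default run directory):

  stop_nb_ovsdb_and_wait() {
      # Ask ovsdb-server to exit; this call does not wait by itself.
      ovs-appctl -t /var/run/ovn/ovnnb_db.ctl exit

      # Poll until the process recorded in the pidfile is really gone,
      # the same way stop_ovn_daemon() waits for the OVN daemons.
      pid=$(cat /var/run/ovn/ovnnb_db.pid 2>/dev/null)
      while [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; do
          sleep 0.1
      done
  }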

Comment 1 Ilya Maximets 2021-03-29 15:48:15 UTC
Related OCP issue:
  https://bugzilla.redhat.com/show_bug.cgi?id=1944264

Comment 7 Ehsan Elahi 2021-06-17 19:51:40 UTC
Tested on ovn2.13-20.12.0-97.el8fdp 

[root@dell-per740-30 ovn]# pgrep -f OVN_Northbound
52570
[root@dell-per740-30 ovn]# ovn-ctl stop_nb_ovsdb && kill -15 52570
[root@dell-per740-30 ovn]# cat ovsdb-server-nb.log 
2021-06-16T19:00:13.290Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-06-16T19:00:13.320Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-16T19:00:23.332Z|00003|memory|INFO|6048 kB peak resident set size after 10.0 seconds
2021-06-16T19:00:23.332Z|00004|memory|INFO|cells:43 monitors:2 sessions:1
2021-06-16T19:06:29.110Z|00002|daemon_unix(monitor)|INFO|pid 39527 died, exit status 0, exiting

=======> the kill command executed without error: the database process was still running when stop_nb_ovsdb returned

Tested on ovn2.13-20.12.0-135.el8fdp. 

[root@dell-per740-33 ~]# pgrep -f OVN_Northbound
52676
[root@dell-per740-33 ~]# ovn-ctl stop_nb_ovsdb && kill -15 52676
Exiting ovnnb_db (52676)                                   [  OK  ]
-bash: kill: (52676) - No such process
[root@dell-per740-33 ~]# cd /var/log/ovn
[root@dell-per740-33 ovn]# cat ovsdb-server-nb.log 
2021-06-17T19:09:51.829Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-06-17T19:09:51.839Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-17T19:10:01.851Z|00005|memory|INFO|6224 kB peak resident set size after 10.0 seconds
2021-06-17T19:10:01.851Z|00006|memory|INFO|cells:34 monitors:0
2021-06-17T19:10:31.035Z|00002|daemon_unix(monitor)|INFO|pid 52676 died, exit status 0, exiting

==========> the kill command failed: the database process had already exited before stop_nb_ovsdb returned

Running strace on the OVN_Northbound process in both versions showed similar results.
After stop_nb_ovsdb, in both versions, the ports remained open and I could still ping these ports into the netns.  However, ovn-nbctl does not show any valid databases.  Is this what is expected?

Comment 8 Ilya Maximets 2021-06-21 09:53:50 UTC
(In reply to Ehsan Elahi from comment #7)
> 
> strace the OVN_Northbound process in both the versions showed similar
> results.

OVS processes signals and the "exit" command in its main loop.  I see
that OVS reported exit code 0 in both cases, but 'kill' succeeded in the
first case.  That is probably because, in the first case, the signal was
delivered to the ovsdb-server process while it was already too far into
processing the "exit" command, so it didn't get a chance to handle
SIGTERM.  Though it seems fine to just check that kill fails, you may
want to use a non-maskable signal to get a clearer result, e.g. SIGKILL
or SIGSEGV.  That way the process will not be able to trap it, so the
exit code will reflect the signal regardless of the current state of the
daemon.
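
A minimal sketch of that suggestion (illustrative only; it is
essentially the check carried out in comment 11 below):

  pid=$(pgrep -f OVN_Northbound)
  # With the fix, stop_nb_ovsdb returns only after the process has exited,
  # so the SIGKILL should fail with "No such process" and the log should
  # show "exit status 0" rather than "killed (Killed)".
  ovn-ctl stop_nb_ovsdb && kill -9 "$pid"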

> After stop_nb_ovsdb, in both the versions, the ports remained opened and I
> can ping on these ports to netns. However ovn-nbctl does not show any valid
> databases. Is this what is expected?

Yes, this is fine.  A dead Northbound database should not affect the
dataplane, so ports and traffic should still work.

Comment 11 Ehsan Elahi 2021-06-21 13:06:44 UTC
Tried different types of signals. 
Reproduced on:
# rpm -qa | grep ovn
ovn2.13-20.12.0-97.el8fdp.x86_64
ovn2.13-host-20.12.0-97.el8fdp.x86_64
ovn2.13-central-20.12.0-97.el8fdp.x86_64

# export PATH=$PATH:/usr/share/ovn/scripts
# pgrep -f OVN_Northbound
39579
# ovn-ctl stop_nb_ovsdb && kill -9 39579
# cat ovsdb-server-nb.log 
2021-06-21T11:02:51.928Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-06-21T11:02:51.949Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-21T11:03:01.962Z|00003|memory|INFO|6052 kB peak resident set size after 10.0 seconds
2021-06-21T11:03:01.962Z|00004|memory|INFO|cells:43 monitors:2 sessions:1
2021-06-21T11:06:03.922Z|00002|daemon_unix(monitor)|INFO|pid 39579 died, killed (Killed), exiting

<=========== the db server was killed by the signal, and the signal details are mentioned in the log

# ovn-ctl start_nb_ovsdb

/etc/ovn/ovnnb_db.db does not exist ... (warning).
Creating empty database /etc/ovn/ovnnb_db.db               [  OK  ]
Starting ovsdb-nb                                          [  OK  ]

# pgrep -f OVN_Northbound
40173
# ovn-ctl stop_nb_ovsdb && kill -11 40173
# cat ovsdb-server-nb.log 
2021-06-21T11:02:51.928Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-06-21T11:02:51.949Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-21T11:03:01.962Z|00003|memory|INFO|6052 kB peak resident set size after 10.0 seconds
2021-06-21T11:03:01.962Z|00004|memory|INFO|cells:43 monitors:2 sessions:1
2021-06-21T11:06:03.922Z|00002|daemon_unix(monitor)|INFO|pid 39579 died, killed (Killed), exiting
2021-06-21T11:12:45.186Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-06-21T11:12:45.196Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-21T11:12:55.208Z|00003|memory|INFO|6024 kB peak resident set size after 10.0 seconds
2021-06-21T11:12:55.208Z|00004|memory|INFO|cells:34 monitors:0
2021-06-21T11:13:25.470Z|00002|backtrace(monitor)|WARN|Backtrace using libunwind not supported.
2021-06-21T11:13:25.470Z|00003|daemon_unix(monitor)|ERR|1 crashes: pid 40173 died, killed (Segmentation fault), core dumped, restarting
2021-06-21T11:13:25.474Z|00004|ovsdb_server(ovsdb-server)|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-21T11:13:25.474Z|00005|memory(ovsdb-server)|INFO|4736 kB peak resident set size after 40.3 seconds
2021-06-21T11:13:25.474Z|00006|memory(ovsdb-server)|INFO|cells:14 monitors:0

<============ the db server was killed by the signal, and the signal name can be seen in the log above

Verified on:
# rpm -qa | grep ovn
ovn2.13-20.12.0-135.el8fdp.x86_64
ovn2.13-host-20.12.0-135.el8fdp.x86_64
ovn2.13-central-20.12.0-135.el8fdp.x86_64

# rpm -qa | grep ovn
ovn2.13-central-20.12.0-135.el7fdp.x86_64
ovn2.13-20.12.0-135.el7fdp.x86_64
ovn2.13-host-20.12.0-135.el7fdp.x86_64

# rpm -qa | grep ovn
ovn-2021-21.03.0-40.el8fdp.x86_64
ovn-2021-host-21.03.0-40.el8fdp.x86_64
ovn-2021-central-21.03.0-40.el8fdp.x86_64

## Below results are from verification on 135.el8fdp. Similar results on the other two releases. 

# pgrep -f OVN_Northbound
97671
# ovn-ctl stop_nb_ovsdb && kill -9 97671
Exiting ovnnb_db (97671)                                   [  OK  ]
-bash: kill: (97671) - No such process
# cat ovsdb-server-nb.log
2021-06-21T10:37:41.529Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-06-21T10:37:41.536Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-21T10:37:41.546Z|00003|jsonrpc|WARN|unix#0: receive error: Connection reset by peer
2021-06-21T10:37:41.546Z|00004|reconnect|WARN|unix#0: connection dropped (Connection reset by peer)
2021-06-21T10:37:51.547Z|00005|memory|INFO|6224 kB peak resident set size after 10.0 seconds
2021-06-21T10:37:51.547Z|00006|memory|INFO|cells:34 monitors:0
2021-06-21T10:53:05.863Z|00002|daemon_unix(monitor)|INFO|pid 97671 died, exit status 0, exiting

# ovn-ctl start_nb_ovsdb
Starting ovsdb-nb                                          [  OK  ]
# pgrep -f OVN_Northbound
97856
# ovn-ctl stop_nb_ovsdb && kill -11 97856
Exiting ovnnb_db (97856)                                   [  OK  ]
-bash: kill: (97856) - No such process
# cat ovsdb-server-nb.log
2021-06-21T10:37:41.529Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-06-21T10:37:41.536Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-21T10:37:41.546Z|00003|jsonrpc|WARN|unix#0: receive error: Connection reset by peer
2021-06-21T10:37:41.546Z|00004|reconnect|WARN|unix#0: connection dropped (Connection reset by peer)
2021-06-21T10:37:51.547Z|00005|memory|INFO|6224 kB peak resident set size after 10.0 seconds
2021-06-21T10:37:51.547Z|00006|memory|INFO|cells:34 monitors:0
2021-06-21T10:53:05.863Z|00002|daemon_unix(monitor)|INFO|pid 97671 died, exit status 0, exiting
2021-06-21T10:54:20.602Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-06-21T10:54:20.611Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-21T10:54:20.622Z|00003|jsonrpc|WARN|unix#0: receive error: Connection reset by peer
2021-06-21T10:54:20.622Z|00004|reconnect|WARN|unix#0: connection dropped (Connection reset by peer)
2021-06-21T10:54:30.621Z|00005|memory|INFO|6024 kB peak resident set size after 10.0 seconds
2021-06-21T10:54:30.621Z|00006|memory|INFO|cells:34 monitors:0
2021-06-21T10:55:24.073Z|00002|daemon_unix(monitor)|INFO|pid 97856 died, exit status 0, exiting

# ovn-ctl start_nb_ovsdb
Starting ovsdb-nb                                          [  OK  ]
# pgrep -f OVN_Northbound
98371
# ovn-ctl stop_nb_ovsdb && kill -15 98371
Exiting ovnnb_db (98371)                                   [  OK  ]
-bash: kill: (98371) - No such process
# cat ovsdb-server-nb.log 
2021-06-21T10:37:41.529Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-06-21T10:37:41.536Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-21T10:37:41.546Z|00003|jsonrpc|WARN|unix#0: receive error: Connection reset by peer
2021-06-21T10:37:41.546Z|00004|reconnect|WARN|unix#0: connection dropped (Connection reset by peer)
2021-06-21T10:37:51.547Z|00005|memory|INFO|6224 kB peak resident set size after 10.0 seconds
2021-06-21T10:37:51.547Z|00006|memory|INFO|cells:34 monitors:0
2021-06-21T10:53:05.863Z|00002|daemon_unix(monitor)|INFO|pid 97671 died, exit status 0, exiting
2021-06-21T10:54:20.602Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-06-21T10:54:20.611Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-21T10:54:20.622Z|00003|jsonrpc|WARN|unix#0: receive error: Connection reset by peer
2021-06-21T10:54:20.622Z|00004|reconnect|WARN|unix#0: connection dropped (Connection reset by peer)
2021-06-21T10:54:30.621Z|00005|memory|INFO|6024 kB peak resident set size after 10.0 seconds
2021-06-21T10:54:30.621Z|00006|memory|INFO|cells:34 monitors:0
2021-06-21T10:55:24.073Z|00002|daemon_unix(monitor)|INFO|pid 97856 died, exit status 0, exiting
2021-06-21T11:50:49.421Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2021-06-21T11:50:49.432Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.4
2021-06-21T11:50:49.442Z|00003|jsonrpc|WARN|unix#0: receive error: Connection reset by peer
2021-06-21T11:50:49.442Z|00004|reconnect|WARN|unix#0: connection dropped (Connection reset by peer)
2021-06-21T11:50:59.443Z|00005|memory|INFO|5920 kB peak resident set size after 10.0 seconds
2021-06-21T11:50:59.443Z|00006|memory|INFO|cells:34 monitors:0
2021-06-21T11:51:26.157Z|00002|daemon_unix(monitor)|INFO|pid 98371 died, exit status 0, exiting

<============= For every type of kill signal, the db stopped normally as expected with exit status 0

Comment 13 errata-xmlrpc 2021-06-21 14:44:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2507

