Bug 2000375
| Summary: | ovsdb-server silently drops monitors | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Tim Rozet <trozet> |
| Component: | ovsdb2.13 | Assignee: | Ilya Maximets <i.maximets> |
| Status: | CLOSED DEFERRED | QA Contact: | ovs-qe |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | RHEL 8.0 | CC: | bbennett, ctrautma, dblack, dcbw, i.maximets, jhsiao, jiji, ralongi, smalleni, surya |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-02-23 12:07:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1998614 | | |
Description
Tim Rozet
2021-09-02 02:50:03 UTC
The full log is too large to attach. Please message me and I'll share it. Looking in the ovsdb code I see:

```c
ovsdb_idl_txn_process_reply(struct ovsdb_idl *idl,
                            const struct jsonrpc_msg *msg)
{
    struct ovsdb_idl_txn *txn = ovsdb_idl_txn_find(idl, msg->id);
    if (!txn) {
        return;
    }

    enum ovsdb_idl_txn_status status;
    if (msg->type == JSONRPC_ERROR) {
        if (msg->error && msg->error->type == JSON_STRING
            && !strcmp(json_string(msg->error), "canceled")) {
            /* ovsdb-server uses this error message to indicate that the
             * transaction was canceled because the database in question was
             * removed, converted, etc. */
            /* ... (fragment truncated in the original report) ... */
```

which I think corresponds to:

```
2021-09-02T01:56:38Z|2639304|jsonrpc|DBG|ssl:192.168.216.12:53134: send reply, error="canceled", id=4785
```

Not sure why this happened.

---

Hi. I looked through the logs. Unfortunately, they don't contain enough information. What we know:

1. The ovsdb-server in question is a raft follower.
2. It's OVS 2.13 and it is loaded heavily and used non-optimally (transactions are executed against the follower, the client uses monitor v1, and OVS 2.13 doesn't support column diffs).
3. The follower was a little behind the leader, but communicated normally, i.e. it installed a snapshot at some point and replied to heartbeats and append requests.
4. At some point it decided to cancel the transaction from a client.
5. For some reason that happened on the same iteration where the memory/show appctl was executed. Likely a coincidence.

Typically, transaction canceling happens if the server decides to re-connect clients or if a monitor is removed. Presumably, the monitor was removed, as we see that memory/show reports 0 monitors. This usually happens if the database itself got removed, converted, or disconnected from its storage, i.e. the server fell out of the raft cluster. But we have no indication of any of these events in the logs. On the other hand, the log covers only 40 seconds of run time, and it doesn't even include the moment where the transaction in question was received from the client. It would be great to collect more logs.
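As a side note, locating cancellations like the one quoted above in a jsonrpc DBG log can be sketched with a simple grep. This is only an illustration: `ovsdb-server.log` is a stand-in file name (here pre-filled with the sample line from the report), and the pattern assumes the log format shown above.

```shell
# Create a stand-in log containing the sample line quoted in this report;
# in practice, point grep at the real ovsdb-server DBG log instead.
cat > ovsdb-server.log <<'EOF'
2021-09-02T01:56:38Z|2639304|jsonrpc|DBG|ssl:192.168.216.12:53134: send reply, error="canceled", id=4785
EOF

# Find every canceled-transaction reply so the cancellation can be matched
# to its transaction id and client connection.
grep -E 'send reply, error="canceled", id=[0-9]+' ovsdb-server.log
```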
At least, we need to collect logs from the raft leader for the same period of time to see what is going on with the cluster. I understand that the logs are huge. If you can collect only the first 1000 characters per line, that should be enough, and the resulting log file will be small. E.g. after trimming the attached log this way, its size dropped from almost 1GB down to ~4MB. This can be done by redirecting the log output to 'cut -c -1000'.

---

Since it's not easy to get logs from this setup, as we discussed on slack, I prepared a special build to test with. It's the same package that you have, except that it crashes itself every time it tries to cancel a transaction. The build is available here:

http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.13/2.13.0/79.bz2000375.0.1.el8fdp/

Please try to reproduce the issue with it, so we can analyze the coredump and find out why the monitor is getting canceled. Cancellation of a transaction should not be a very common operation, so I hope that it will not just crash randomly. Though, false positives are possible.

---

Tim, any luck on reproducing the issue with Ilya's test packages? https://github.com/openshift/ovn-kubernetes/pull/767 will create an ovn-k image for the scale team to test later on.

---

Hi, Tim, Surya. I don't think any work has been done on this BZ for the last 4-5 months. Should we close it as DEFERRED or at least reduce the priority/severity?

---

Yeah, I think it is OK to close it for now unless we reproduce it again.

---

(In reply to Tim Rozet from comment #10)
> Yeah I think it is OK to close it for now unless we reproduce it again.

++
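The per-line trimming suggested earlier ('cut -c -1000') can be sketched as below. The file names are stand-ins for the real ovsdb-server log path; a fake one-line 2000-character log is generated just to make the example self-contained.

```shell
# Fake log: a single 2000-character line (stand-in for the real DBG log).
printf '%02000d\n' 0 > ovsdb-server-full.log

# Keep only the first 1000 characters of each line; this is what shrank
# the ~1GB log in this report down to ~4MB.
cut -c -1000 ovsdb-server-full.log > ovsdb-server.trimmed.log

wc -c < ovsdb-server.trimmed.log   # 1000 chars + trailing newline = 1001
```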