Bug 2000375
| Summary: | ovsdb-server silently drops monitors | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Tim Rozet <trozet> |
| Component: | ovsdb2.13 | Assignee: | Ilya Maximets <i.maximets> |
| Status: | CLOSED DEFERRED | QA Contact: | ovs-qe |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | RHEL 8.0 | CC: | bbennett, ctrautma, dblack, dcbw, i.maximets, jhsiao, jiji, ralongi, smalleni, surya |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-02-23 12:07:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1998614 | | |
Description
Tim Rozet
2021-09-02 02:50:03 UTC
The full log is too large to attach. Please message me and I'll share it. Looking in the ovsdb code I see:

```c
ovsdb_idl_txn_process_reply(struct ovsdb_idl *idl,
                            const struct jsonrpc_msg *msg)
{
    struct ovsdb_idl_txn *txn = ovsdb_idl_txn_find(idl, msg->id);
    if (!txn) {
        return;
    }

    enum ovsdb_idl_txn_status status;
    if (msg->type == JSONRPC_ERROR) {
        if (msg->error && msg->error->type == JSON_STRING
            && !strcmp(json_string(msg->error), "canceled")) {
            /* ovsdb-server uses this error message to indicate that the
             * transaction was canceled because the database in question was
             * removed, converted, etc. */
            /* ... (fragment truncated in the original report) ... */
```

which I think corresponds to:

```
2021-09-02T01:56:38Z|2639304|jsonrpc|DBG|ssl:192.168.216.12:53134: send reply, error="canceled", id=4785
```

Not sure why this happened.

---

Hi. I looked through the logs. Unfortunately, they don't contain enough information. What we know:

1. The ovsdb-server in question is a raft follower.
2. It's OVS 2.13 and it is loaded heavily and used non-optimally (transactions are executed against the follower, the client uses monitor v1, and OVS 2.13 doesn't support column diffs).
3. The follower was a little behind the leader, but communicated normally, i.e. it installed a snapshot at some point and replied to heartbeats and append requests.
4. At some point it decided to cancel the transaction from a client.
5. For some reason that happened on the same iteration where the memory/show appctl was executed. Likely a coincidence.

Typically, transaction canceling happens if the server decides to re-connect clients or if a monitor is removed. Presumably, the monitor was removed, as we see that memory/show reports 0 monitors. This usually happens if the database itself got removed, converted, or disconnected from its storage, i.e. the server fell out of the raft cluster. But we have no indication of any of these events in the logs. On the other hand, the log covers only 40 seconds of run time, and it doesn't even include the moment where the transaction in question was received from the client. It would be great to collect more logs.
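As a side note, locating cancellations like the one quoted above in a jsonrpc DBG log can be sketched with a simple grep. This is only an illustration: `ovsdb-server.log` is a stand-in file name (here pre-filled with the sample line from the report), and the pattern assumes the log format shown above.

```shell
# Create a stand-in log containing the sample line quoted in this report;
# in practice, point grep at the real ovsdb-server DBG log instead.
cat > ovsdb-server.log <<'EOF'
2021-09-02T01:56:38Z|2639304|jsonrpc|DBG|ssl:192.168.216.12:53134: send reply, error="canceled", id=4785
EOF

# Find every canceled-transaction reply so the cancellation can be matched
# to its transaction id and client connection.
grep -E 'send reply, error="canceled", id=[0-9]+' ovsdb-server.log
```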
At least, we need to collect logs from the raft leader for the same period of time to see what is going on with the cluster. I understand that the logs are huge. If you can collect only the first 1000 characters per line, that should be enough, and the resulting log file will be small. E.g. after trimming the attached log this way, its size dropped from almost 1GB down to ~4MB. This can be done by redirecting the log output to 'cut -c -1000'.

---

Since it's not easy to get logs from this setup, as we discussed on slack, I prepared a special build to test with. It's the same package that you have, except that it crashes itself every time it tries to cancel a transaction. The build is available here:

http://brew-task-repos.usersys.redhat.com/repos/scratch/imaximet/openvswitch2.13/2.13.0/79.bz2000375.0.1.el8fdp/

Please try to reproduce the issue with it, so we can analyze the coredump and find out why the monitor is getting canceled. Cancellation of a transaction should not be a very common operation, so I hope that it will not just crash randomly. Though, false positives are possible.

---

Tim, any luck on reproducing the issue with Ilya's test packages? https://github.com/openshift/ovn-kubernetes/pull/767 will create an ovn-k image for the scale team to test later on.

---

Hi, Tim, Surya. I don't think any work has been done on this BZ for the last 4-5 months. Should we close it as DEFERRED or at least reduce the priority/severity?

---

Yeah, I think it is OK to close it for now unless we reproduce it again.

---

(In reply to Tim Rozet from comment #10)
> Yeah I think it is OK to close it for now unless we reproduce it again.

++
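The per-line trimming suggested earlier ('cut -c -1000') can be sketched as below. The file names are stand-ins for the real ovsdb-server log path; a fake one-line 2000-character log is generated just to make the example self-contained.

```shell
# Fake log: a single 2000-character line (stand-in for the real DBG log).
printf '%02000d\n' 0 > ovsdb-server-full.log

# Keep only the first 1000 characters of each line; this is what shrank
# the ~1GB log in this report down to ~4MB.
cut -c -1000 ovsdb-server-full.log > ovsdb-server.trimmed.log

wc -c < ovsdb-server.trimmed.log   # 1000 chars + trailing newline = 1001
```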