Bug 1821360 - [OVN SCALE] HA/RAFT not integrated correctly into ovsdb
Summary: [OVN SCALE] HA/RAFT not integrated correctly into ovsdb
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovsdb
Version: RHEL 8.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: unspecified
Target Milestone: ---
Assignee: OVN Team
QA Contact: ovs-qe
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-06 16:37 UTC by Anton Ivanov
Modified: 2023-07-13 07:31 UTC
8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker FD-556 0 None None None 2022-02-16 14:50:31 UTC

Description Anton Ivanov 2020-04-06 16:37:07 UTC
This was found when analysing failures in the early prototypes of the async IO patchset.

Initial condition:

A database is removed from ovsdb-server due to an admin request or a RAFT HA decision.

This triggers monitor-cancel notifications for all clients subscribed to the database. These notifications are enqueued for transmission to the clients, which takes a finite amount of time.

Any client that issues a transaction against the database during this window receives a "syntax error" JSON reply.
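The race can be sketched in miniature: the server has already removed the database and queued the monitor_canceled notification, but the client has not drained it yet and still sends a transact. This is an illustrative model, not ovsdb-server code; the message shapes and the handle_transact helper are hypothetical.

```python
from collections import deque

server_to_client = deque()   # messages queued but not yet delivered to the client
db_exists = False            # the DB was just removed (admin request / RAFT decision)

# Server side: removing the DB enqueues the cancel notification.
server_to_client.append({"method": "monitor_canceled",
                         "params": ["OVN_Southbound"]})

# Client side: unaware of the pending notification, it issues a transact.
def handle_transact(request):
    # The database is gone, so the server rejects the request;
    # the observed symptom is a "syntax error" JSON reply.
    if not db_exists:
        return {"id": request["id"], "error": "syntax error", "result": None}
    return {"id": request["id"], "error": None, "result": []}

reply = handle_transact({"id": 1, "method": "transact",
                         "params": ["OVN_Southbound"]})
# The client sees the error while the cancel notification is still in flight.
```

The key point is the ordering: the error reply reaches the client before the notification that would have told it to stop using the database.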

This will be extremely difficult to fix without major API additions: there is no mandatory flush, and there is no "upper level" means of triggering a JSON-RPC "echo" and waiting for the echo reply to ensure that anything on the wire between the server and the client has been flushed.

This is likely to be an issue ONLY at scale when there are a lot of pending requests and a lot of pending notifications to transmit.

It is somewhat mitigated by the ovsdb connection being effectively half-duplex: the server does not invoke the jsonrpc session receive path while there is pending transmit. This narrows the window but does not close it, and if ovsdb is optimized for throughput in any way, the race is likely to become easier to reproduce.
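The half-duplex mitigation amounts to a service loop like the following sketch (names and queue shapes are hypothetical, not the ovsdb-server implementation): output is drained before any new request is read, so a client usually sees the cancel notification before its next transact is processed.

```python
def service_session(tx_queue, rx_queue, handle):
    """Drain pending transmit before receiving; return handled replies."""
    replies = []
    while tx_queue or rx_queue:
        if tx_queue:
            tx_queue.pop(0)                  # flush queued output first
        else:
            replies.append(handle(rx_queue.pop(0)))  # only then receive
    return replies

# Pending cancel notifications are sent before the stale transact is read.
handled = []
service_session(["monitor_canceled"], ["transact"],
                lambda msg: handled.append(msg) or msg)
```

Because the receive path only runs once the transmit queue is empty, the window exists only when notifications pile up faster than they can be flushed, which is exactly the at-scale condition described above.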

