Bug 1798158 - Lower bound for OVN raft membership is not respected
Summary: Lower bound for OVN raft membership is not respected
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: ovn2.12
Version: RHEL 8.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: OVN Team
QA Contact: Jianlin Shi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-02-04 17:12 UTC by Alexander Constantinescu
Modified: 2022-10-13 03:05 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-10 14:26:09 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Issue Tracker FD-431 (last updated 2022-10-13 03:05:38 UTC)

Description Alexander Constantinescu 2020-02-04 17:12:49 UTC
Description of problem:

There seems to be an issue with the OVN cluster implementation once the number of cluster members falls below the lower bound required for quorum.

If a cluster is created with 3 members (making the minimum number of members for quorum 2, according to the Raft consensus formula) and members then start dropping off (using: ovs-appctl -t /var/run/openvswitch/ovnnb_db.ctl cluster/leave OVN_Northbound) until there is only one member left, the cluster behaves normally. No errors (which would be expected) or even warnings are logged, and transactions against the last member still succeed.
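
For reference, Raft requires a majority of the configured cluster size, floor(N/2) + 1; a minimal shell sketch of that arithmetic:

    # Majority needed for a 3-member cluster:
    n=3
    echo $(( n / 2 + 1 ))    # prints 2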

Version-Release number of selected component (if applicable):

OVN 2.12

How reproducible:


Steps to Reproduce:
1. Create an OVN cluster with 3 members
2. Delete 2 of them
3. Check the OVN cluster status and the last member's logs (see the command sketch after this list).
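
A sketch of the commands involved, assuming the socket path from the description above (exact paths vary by packaging):

    # Run on each of the two members being removed:
    ovs-appctl -t /var/run/openvswitch/ovnnb_db.ctl cluster/leave OVN_Northbound

    # Run on the surviving member to inspect membership, role, and term:
    ovs-appctl -t /var/run/openvswitch/ovnnb_db.ctl cluster/status OVN_Northbound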

Actual results:

No indication of problems; new transactions are accepted.

Expected results:

An indication that cluster consensus cannot be established, and rejection of future transactions.

Additional info:

-

Comment 2 Mark Michelson 2020-04-24 15:39:13 UTC
Hi, I reached out to Ben Pfaff about this issue. I'll just quote his response here:

    "The confusion is over what "cluster/leave" does.  I guess the
     documentation isn't clear enough!  This command removes a server from
     the cluster.  That is, if you use it to remove 2 servers from a 3-node
     cluster, the remaining server is a 1-node cluster and thus quorum
     exists.  It uses the Raft procedure for safely updating cluster
     membership.

     To see the behavior when quorum isn't available, just kill two of the
     server processes."

So "leave" in this case is meant to permanently alter the cluster size, not just remove a server and keep the cluster the same configured size. Does this explanation make sense? If so, would a documentation tweak be enough to fix this?

Comment 4 Mark Michelson 2020-05-13 20:09:27 UTC
I submitted a documentation fix upstream and it has been merged. The documentation for cluster/leave now has an additional paragraph:

"Note that removing the server from the cluster alters the total size
of the cluster. For example, if you remove two servers from a three
server cluster, then the "cluster" becomes a single functioning server.
This does not result in a three server cluster that lacks quorum."
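
The arithmetic behind that distinction, as a sketch:

    # Majority is floor(n/2) + 1 of the configured cluster size:
    for n in 1 3; do echo "n=$n majority=$(( n / 2 + 1 ))"; done
    # n=1 majority=1 : after two cluster/leave operations the survivor has quorum
    # n=3 majority=2 : after killing two servers the survivor lacks quorum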

Comment 5 Dan Williams 2020-07-28 16:15:49 UTC
Mark, can we consider this bug finished now?

