Description of problem:
There seems to be an issue with the OVN cluster implementation once the lower bound on the number of cluster members is no longer respected. If a cluster is created with 3 members (so that, according to the Raft consensus formula, at least 2 members are required) and members then start dropping off (using: ovs-appctl -t /var/run/openvswitch/ovnnb_db.ctl cluster/leave OVN_Northbound) until there is only one member left, the cluster continues to behave normally. No errors (which would be expected) or warnings (at the very least) are logged, and transactions against the last member appear to work fine.

Version-Release number of selected component (if applicable):
OVN 2.12

How reproducible:

Steps to Reproduce:
1. Create an OVN cluster with 3 members
2. Delete 2 of them
3. Check the OVN cluster status and the last member's logs.

Actual results:
No indication of problems, and new transactions are accepted.

Expected results:
An indication that cluster consensus cannot be established, and future transactions are not accepted.

Additional info: -
Hi, I reached out to Ben Pfaff about this issue. I'll just quote his response here: "The confusion is over what "cluster/leave" does. I guess the documentation isn't clear enough! This command removes a server from the cluster. That is, if you use it to remove 2 servers from a 3-node cluster, the remaining server is a 1-node cluster and thus quorum exists. It uses the Raft procedure for safely updating cluster membership. To see the behavior when quorum isn't available, just kill two of the server processes." So "leave" in this case is meant to permanently alter the cluster size, not just remove a server and keep the cluster the same configured size. Does this explanation make sense? If so, would a documentation tweak be enough to fix this?
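To make the difference concrete, here is a minimal sketch of the quorum arithmetic Ben describes. It is plain Python, not OVN code: the function names `quorum` and `has_quorum` are made up for illustration, and it assumes the usual Raft majority rule (more than half of the configured servers must be reachable).

```python
def quorum(configured_size: int) -> int:
    """Smallest majority of a cluster with the given configured size."""
    return configured_size // 2 + 1


def has_quorum(configured_size: int, alive: int) -> bool:
    """True if the servers still alive form a majority of the configured size."""
    return alive >= quorum(configured_size)


# Case 1: two servers are removed with cluster/leave.
# The configured size shrinks from 3 to 1, so the one remaining server is a majority.
after_leave = has_quorum(configured_size=1, alive=1)

# Case 2: two server processes are killed.
# The configured size stays at 3, so one live server is not a majority.
after_kill = has_quorum(configured_size=3, alive=1)

print(after_leave, after_kill)  # prints: True False
```

In other words, the reproducer in the description exercises case 1, while the expected results describe case 2.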
I submitted a documentation fix upstream and it has been merged. The documentation for cluster/leave now has an additional paragraph: "Note that removing the server from the cluster alters the total size of the cluster. For example, if you remove two servers from a three server cluster, then the "cluster" becomes a single functioning server. This does not result in a three server cluster that lacks quorum."
Mark, can we consider this bug finished now?