Bug 1953688

Summary: Make cluster status user-friendly no matter which node it is run on
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Carlos Goncalves <cgoncalves>
Component: ovsdb2.15
Assignee: Ilya Maximets <i.maximets>
Status: NEW
QA Contact: qding
Severity: medium
Priority: medium
Version: RHEL 8.0
CC: ctrautma, echaudro, jhsiao, mmichels, ralongi
Hardware: Unspecified   
OS: Unspecified   
Type: Bug

Description Carlos Goncalves 2021-04-26 16:23:59 UTC
This is a follow-up to BZ #1929690, requesting improvements to the CLI output of the "ovs-appctl cluster/status" command for a better user experience.

Whenever a clustered node goes offline or the cluster ends up in a split-brain situation, the cluster status output appears to reflect that through a combination of server IDs, parentheses and arrows. This output is far from intuitive to non-developer OVS users and is therefore likely to mislead them into thinking the cluster is healthy. Failing to clearly indicate a known network issue is arguably even more harmful when troubleshooting production environments.

Below is an example copy-pasted from Michele's comment #10 in BZ #1929690.

We have the following three nodes (running in VMs):
controller-0 172.16.2.241
controller-1 172.16.2.79
controller-2 172.16.2.57

ovn-dbs is clustered across those three nodes, and we take controller-0 down with "virsh destroy".
Then on controller-1 we see the following:
[root@controller-1 ovs-2.13.0]# podman exec -ti ovn_cluster_north_db_server sh
sh-4.4# ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
45fb
Name: OVN_Northbound
Cluster ID: 613c (613c0b6e-65af-4810-bb48-c9cbea43d442)
Server ID: 45fb (45fba88d-5980-49ab-b562-1ea6e0db266c)
Address: ssl:172.16.2.79:6643
Status: cluster member
Role: follower
Term: 2
Leader: c7dd
Vote: c7dd

Election timer: 1000
Log: [2, 118]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->c7dd (->9287) <-c7dd
Servers:
    45fb (45fb at ssl:172.16.2.79:6643) (self)
    c7dd (c7dd at ssl:172.16.2.57:6643)
    9287 (9287 at ssl:172.16.2.241:6643)

Nothing in the output above clearly indicates that controller-0 (172.16.2.241) is really gone, even though we query from the surviving, quorate partition.
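
To illustrate the kind of information a friendlier status view could surface, here is a minimal Python sketch that post-processes the existing text output. It rests on an assumption that is not confirmed in this report: that a parenthesized entry on the "Connections:" line, such as "(->9287)", denotes a session that is not currently established. The function name unreachable_servers and the SAMPLE constant are hypothetical and exist only for this example; this is not how ovs-appctl itself works.

```python
import re

# One token of the "Connections:" line, e.g. "->c7dd", "<-c7dd" or "(->9287)".
# ASSUMPTION: a leading "(" marks a session that is not currently connected.
TOKEN = re.compile(r"^(\(?)(<-|->)([0-9a-f]+)\)?$")

# One entry of the "Servers:" list, e.g. "c7dd (c7dd at ssl:172.16.2.57:6643)".
SERVER = re.compile(r"^([0-9a-f]+) \([0-9a-f]+ at ([^)\s]+)\)")

# Excerpt of the output quoted in this report.
SAMPLE = """\
Connections: ->c7dd (->9287) <-c7dd
Servers:
    45fb (45fb at ssl:172.16.2.79:6643) (self)
    c7dd (c7dd at ssl:172.16.2.57:6643)
    9287 (9287 at ssl:172.16.2.241:6643)
"""


def unreachable_servers(status_text):
    """Return {server_id: address} for peers with no established connection."""
    connected = set()
    servers = {}  # server_id -> (address, is_self)
    in_servers = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Connections:"):
            for token in stripped.split()[1:]:
                match = TOKEN.match(token)
                # Count only tokens that are not wrapped in parentheses.
                if match and not match.group(1):
                    connected.add(match.group(3))
        elif stripped == "Servers:":
            in_servers = True
        elif in_servers:
            match = SERVER.match(stripped)
            if match:
                is_self = "(self)" in stripped
                servers[match.group(1)] = (match.group(2), is_self)
    return {
        sid: addr
        for sid, (addr, is_self) in servers.items()
        if not is_self and sid not in connected
    }


if __name__ == "__main__":
    for sid, addr in unreachable_servers(SAMPLE).items():
        print(f"WARNING: no established connection to {sid} at {addr}")
```

Run against the excerpt above, the sketch flags only 9287 at ssl:172.16.2.241:6643, which is controller-0. A line of that kind in the command output itself would make the failure visible without having to interpret the arrows and parentheses.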