Bug 1953688

Summary:	Make cluster status user-friendly no matter which node it is run on
Product:	Red Hat Enterprise Linux Fast Datapath
Component:	ovsdb2.15
Version:	RHEL 8.0
Status:	CLOSED WONTFIX
Severity:	medium
Priority:	medium
Type:	Bug
Reporter:	Carlos Goncalves <cgoncalves>
Assignee:	Ilya Maximets <i.maximets>
QA Contact:	qding
CC:	ctrautma, echaudro, jhsiao, mmichels, ralongi
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2024-04-23 21:09:25 UTC

Description Carlos Goncalves 2021-04-26 16:23:59 UTC

This is a follow-up to BZ #1929690 to request improvements to the CLI output of the "ovs-appctl cluster/status" command for a better user experience.

Whenever a clustered node goes offline or the cluster is in a split-brain situation, the cluster status output appears to reflect that only through a combination of server IDs, parentheses and arrows. The output is far from intuitive to non-developer OVS users, and is thus likely to mislead them into thinking the cluster is in a healthy state. The impact of not clearly indicating a known network issue is arguably even more critical when troubleshooting production environments.

Below is an example copy-pasted from Michele's comment #10 in BZ #1929690.

We have the following three nodes (running in VMs):
controller-0 172.16.2.241
controller-1 172.16.2.79
controller-2 172.16.2.57

ovn-dbs is clustered across those three nodes and we virsh destroy controller-0.
Then on controller-1 we see the following:
[root@controller-1 ovs-2.13.0]# podman exec -ti ovn_cluster_north_db_server sh
sh-4.4# ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
45fb
Name: OVN_Northbound
Cluster ID: 613c (613c0b6e-65af-4810-bb48-c9cbea43d442)
Server ID: 45fb (45fba88d-5980-49ab-b562-1ea6e0db266c)
Address: ssl:172.16.2.79:6643
Status: cluster member
Role: follower
Term: 2
Leader: c7dd
Vote: c7dd

Election timer: 1000
Log: [2, 118]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->c7dd (->9287) <-c7dd
Servers:
    45fb (45fb at ssl:172.16.2.79:6643) (self)
    c7dd (c7dd at ssl:172.16.2.57:6643)
    9287 (9287 at ssl:172.16.2.241:6643)

And above I have no indication that controller-0 (aka 172.16.2.241) is really gone when we query from the surviving, quorate partition.
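
To make the hidden signal in that output explicit, here is a minimal sketch of a helper that reads the "Connections:" and "Servers:" lines and lists the servers for which no established connection is shown. It assumes, based only on the sample above, that a parenthesized token such as "(->9287)" marks a connection that is not fully established; the function name unreachable_servers and the regular expressions are illustrative and not part of OVS.

```python
import re

# Assumed token shapes, taken from the sample output: "->c7dd", "<-c7dd", "(->9287)".
CONN_RE = re.compile(r"(\()?(->|<-)([0-9a-f]+)\)?")
# Assumed server line shape: "    9287 (9287 at ssl:172.16.2.241:6643)" plus optional suffix.
SERVER_RE = re.compile(r"\s+([0-9a-f]+) \([0-9a-f]+ at ([^)]+)\)(.*)")

SAMPLE = """\
Role: follower
Leader: c7dd
Connections: ->c7dd (->9287) <-c7dd
Servers:
    45fb (45fb at ssl:172.16.2.79:6643) (self)
    c7dd (c7dd at ssl:172.16.2.57:6643)
    9287 (9287 at ssl:172.16.2.241:6643)
"""


def unreachable_servers(status_text):
    """Return {server_id: address} for non-self servers with no established connection."""
    established = set()
    servers = {}
    in_servers = False
    for line in status_text.splitlines():
        if line.startswith("Connections:"):
            for token in line.split(":", 1)[1].split():
                m = CONN_RE.fullmatch(token)
                # Tokens wrapped in parentheses are treated as not established.
                if m and m.group(1) is None:
                    established.add(m.group(3))
        elif line.startswith("Servers:"):
            in_servers = True
        elif in_servers:
            m = SERVER_RE.match(line)
            if not m:
                in_servers = False
                continue
            sid, address, rest = m.groups()
            if "(self)" not in rest:
                servers[sid] = address
    return {sid: addr for sid, addr in servers.items() if sid not in established}


if __name__ == "__main__":
    for sid, addr in unreachable_servers(SAMPLE).items():
        print(f"no established connection to {sid} at {addr}")
```

Run against the sample, this flags only 9287 (172.16.2.241, i.e. controller-0). A server flagged this way is merely one this node has no working connection to; on a follower that alone does not prove the server is down.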

Comment 2 Ilya Maximets 2024-04-23 21:09:25 UTC

The situation with the readability of the cluster/status output
was significantly improved by Lorenzo in:
  https://github.com/openvswitch/ovs/commit/e8451e1443e7c677190da9ddce5dbd4dfffc3590

The number of disconnections was added as well as the time since
the last message from each server.  This should make it clear
that some connections are not actually working.

The parentheses syntax for connections that are not fully established
is still there, but I'm not sure how to make it clearer without
adding a bunch of text to the output.

Also, followers do not talk to each other, so it's hard to reliably
determine the state of the other follower.  Only the leader actually
talks to all the followers, so only the leader has the most
up-to-date state of the cluster members.  The cluster may be healthy
even if followers are fenced from each other, as long as the leader
can talk to them.
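
As a rough illustration of that point, a monitoring script could collect
cluster/status from every node and trust only the report of the node that
says "Role: leader".  The sketch below assumes the "Role:" line looks like
the one in the description; leader_report and the node labels are
hypothetical names, not anything provided by OVS.

```python
import re


def role_of(status_text):
    """Extract the value of the 'Role:' line, or None if it is absent."""
    m = re.search(r"^Role: (\S+)", status_text, re.MULTILINE)
    return m.group(1) if m else None


def leader_report(reports):
    """Given {node_label: cluster/status output}, return the label of the
    single node reporting itself as leader, or None if there is no leader
    or more than one (e.g. during an election or a partition)."""
    leaders = [name for name, text in reports.items()
               if role_of(text) == "leader"]
    return leaders[0] if len(leaders) == 1 else None
```

Returning None when zero or several nodes claim leadership is a deliberate
choice here: in that case no single report should be treated as
authoritative.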

I'll close this issue for now.  If we need further improvements,
a new FDP Jira issue can be opened for that.