Bug 1519742

Summary: Split Brain data is not reflected in grafana
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Vijay Avuthu <vavuthu>
Component: web-admin-tendrl-monitoring-integration Assignee: Nishanth Thomas <nthomas>
Status: CLOSED ERRATA QA Contact: Lubos Trilety <ltrilety>
Severity: high Docs Contact:
Priority: unspecified    
Version: unspecified CC: fbalak, ltrilety, rhs-bugs, sanandpa, sankarshan, shtripat, ssaha
Target Milestone: --- Keywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: tendrl-monitoring-integration-1.5.4-14.el7rhgs.noarch Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-12-18 04:38:26 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description: Healing panel Flags: none

Description Vijay Avuthu 2017-12-01 10:59:18 UTC
Description of problem:

Split-brain data is not reflected in Grafana. The dashboard shows 0 even though gluster reports a non-zero split-brain count.

Version-Release number of selected component (if applicable):

#dhcp46-247.lab.eng.blr.redhat.com
[root@dhcp46-247 ~]# rpm -qa | grep -i tendrl
tendrl-monitoring-integration-1.5.4-8.el7rhgs.noarch
tendrl-commons-1.5.4-5.el7rhgs.noarch
tendrl-api-httpd-1.5.4-3.el7rhgs.noarch
tendrl-selinux-1.5.4-1.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-8.el7rhgs.noarch
tendrl-notifier-1.5.4-5.el7rhgs.noarch
tendrl-node-agent-1.5.4-8.el7rhgs.noarch
tendrl-ui-1.5.4-4.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-1.el7rhgs.noarch
tendrl-ansible-1.5.4-2.el7rhgs.noarch
tendrl-api-1.5.4-3.el7rhgs.noarch
[root@dhcp46-247 ~]# 

How reproducible:

Always

Steps to Reproduce:

1. Create a volume (1 x 2).
2. Create split-brain files (with all heal daemons stopped).

e.g.:

[root@dhcp42-129 ~]# gluster vol heal vol12 info split-brain
Brick 10.70.42.127:/gluster/brick10/b10
/
Status: Connected
Number of entries in split-brain: 1

Brick 10.70.42.119:/gluster/brick10/b10
/
Status: Connected
Number of entries in split-brain: 1

[root@dhcp42-129 ~]# 

3. Check Grafana for the same volume.

Actual results:

Split Brain is shown as 0.

Expected results:

It should show Split Brain - 1

Additional info:

Comment 3 Shubhendu Tripathi 2017-12-04 05:39:37 UTC
@Vijay, we have modified the logic to use `gluster volume heal <volname> info` and `gluster volume heal <volname> info split-brain`. If the latter command, `gluster volume heal <volname> info split-brain`, returns a value, it should be reported in Grafana. It might take some time to show up, since the value is only updated at the next sync.
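
For illustration only, here is a minimal Python sketch of how per-brick counts could be extracted from `gluster volume heal <volname> info split-brain` output. It is not the Tendrl implementation: the function name `parse_split_brain` is hypothetical, and the sample text is the output quoted in the description of this bug. A brick whose count line is not a plain integer is simply left out of the result.

```python
import re

# Matches the per-brick summary line printed by `heal info split-brain`.
COUNT_RE = re.compile(r"Number of entries in split-brain:\s*(\d+)$")


def parse_split_brain(output):
    """Return {brick: count} parsed from `heal info split-brain` text.

    Hypothetical helper for illustration; not taken from Tendrl.
    """
    counts = {}
    brick = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            # Remember which brick the following lines belong to.
            brick = line[len("Brick "):]
            continue
        match = COUNT_RE.match(line)
        if match and brick is not None:
            counts[brick] = int(match.group(1))
    return counts


# Sample output copied from the description of this bug.
SAMPLE = """\
Brick 10.70.42.127:/gluster/brick10/b10
/
Status: Connected
Number of entries in split-brain: 1

Brick 10.70.42.119:/gluster/brick10/b10
/
Status: Connected
Number of entries in split-brain: 1
"""

counts = parse_split_brain(SAMPLE)
print(counts)
```

With the sample above, each of the two bricks maps to a count of 1, which is the value the reporter expected to see in Grafana instead of 0.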

Comment 4 Shubhendu Tripathi 2017-12-04 06:18:33 UTC
Also, please check whether you are using the latest builds of tendrl, as there have been changes around the heal info commands. If not, I would suggest moving to the latest builds and verifying the scenario once more.

Comment 6 Filip Balák 2017-12-05 18:11:53 UTC
I followed the reproducer for creating a split-brain file from https://usmqe-testdoc.readthedocs.io/en/latest/web/alerting/glusternative_subvolume.html
I ended up with:

````
# gluster volume heal volume_alpha_distrep_6x2 info split-brain
...
Brick fbalak-usm1-gl5.usmqe.com:/mnt/brick_alpha_distrep_2/2
/
Status: Connected
Number of entries in split-brain: 1

Brick fbalak-usm1-gl6.usmqe.com:/mnt/brick_alpha_distrep_2/2
/
Status: Connected
Number of entries in split-brain: 1
````

But even after an hour, the `Volume` dashboard shows:
`Split Brain: -`

--> ASSIGNED

Tested with:
tendrl-gluster-integration-1.5.4-8.el7rhgs.noarch
tendrl-ansible-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.5.4-5.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-11.el7rhgs.noarch
tendrl-selinux-1.5.4-1.el7rhgs.noarch
tendrl-commons-1.5.4-6.el7rhgs.noarch
tendrl-api-1.5.4-4.el7rhgs.noarch
tendrl-api-httpd-1.5.4-4.el7rhgs.noarch
tendrl-monitoring-integration-1.5.4-11.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-1.el7rhgs.noarch
tendrl-node-agent-1.5.4-9.el7rhgs.noarch
tendrl-notifier-1.5.4-6.el7rhgs.noarch

Comment 7 Filip Balák 2017-12-05 18:13:22 UTC
Created attachment 1363305 [details]
Healing panel

Comment 9 Nishanth Thomas 2017-12-05 19:07:18 UTC
@filip, what is the system configuration of the Tendrl server and the storage nodes?
We also require the logs.
We need access to your setup. We have seen this working in our environment.

Comment 11 Shubhendu Tripathi 2017-12-06 09:55:39 UTC
Added a PR https://github.com/Tendrl/node-agent/pull/701 to handle parsing of `heal info` output in split brain case.

Comment 12 Lubos Trilety 2017-12-08 13:43:33 UTC
Tested with:
tendrl-monitoring-integration-1.5.4-13.el7rhgs.noarch

The split-brain number is not zero or empty anymore.

However, as the number is the sum of all entries, it is always a multiple of the replica count.
For example, 1 'bad' file on a volume with replica 3 means there will be 'Split Brains - 3' in the Grafana Healing panel, because there is 1 occurrence of the same file with different content on each brick of the replica set. I am not sure if this is correct behaviour.
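
The arithmetic behind this observation can be sketched in Python. This is an illustration under stated assumptions, not Tendrl code: the brick names, the file path `/file1`, and both helper functions are hypothetical. Counting distinct paths is shown only for contrast with the summed value; the follow-up in this bug settled on per-brick display instead.

```python
def summed_count(per_brick_entries):
    """Sum the split-brain entries reported by every brick."""
    return sum(len(entries) for entries in per_brick_entries.values())


def distinct_count(per_brick_entries):
    """Count each split-brain path once, however many bricks list it."""
    return len(set().union(*per_brick_entries.values()))


# Hypothetical replica 3 set: the same 'bad' file is listed by all bricks.
replica3 = {
    "node1:/bricks/b1": ["/file1"],
    "node2:/bricks/b1": ["/file1"],
    "node3:/bricks/b1": ["/file1"],
}

print(summed_count(replica3))    # sum over bricks
print(distinct_count(replica3))  # unique paths
```

Here the summed value is 3 while only 1 distinct file is affected, which matches the 'Split Brains - 3' reading described above.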

Comment 13 Shubhendu Tripathi 2017-12-08 15:23:29 UTC
Lubos, yes, this concern was discussed with Bala as well. Bala will raise a separate BZ for it.

Comment 14 Nishanth Thomas 2017-12-10 11:23:55 UTC
Please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1523786; the fix is available in the latest builds.

Comment 15 Lubos Trilety 2017-12-11 09:42:37 UTC
Tested with:
tendrl-monitoring-integration-1.5.4-14.el7rhgs.noarch

The split-brain status is displayed on Grafana per brick.

Comment 17 errata-xmlrpc 2017-12-18 04:38:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3478