1687333 – Incorrect number of hosts down reported

Summary: Incorrect number of hosts down reported

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	web-admin-tendrl-monitoring-integration
Sub Component:
Version:	rhgs-3.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	RHGS 3.5.0
Assignee:	Timothy Asir
QA Contact:	Sweta Anandpara
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1645221 1696807

Reported:	2019-03-11 09:45 UTC by Filip Balák
Modified:	2019-10-30 12:23 UTC
CC List:	8 users
Fixed In Version:	tendrl-node-agent-1.6.3-19.el7rhgs
Doc Type:	Bug Fix
Doc Text:	Previously, when all nodes in a cluster were offline, the web administration interface did not report the correct number of nodes offline. Node status is now correctly tracked and reported.
Clone Of:
Environment:
Last Closed:	2019-10-30 12:23:13 UTC
Embargoed:
Dependent Products:

Attachments
Cluster dashboard (138.14 KB, image/png) 2019-03-11 09:45 UTC, Filip Balák

Links
System	ID	Priority	Status	Summary	Last Updated
Github	Tendrl commons issues 1076	None	open	Incorrect number of hosts down reported	2020-02-27 12:07:24 UTC
Github	Tendrl commons pull 1087	None	closed	Fixed issue: Incorrect number of hosts down reported	2020-02-27 12:07:24 UTC
Red Hat Product Errata	RHBA-2019:3251	None	None	None	2019-10-30 12:23:34 UTC

Description Filip Balák 2019-03-11 09:45:56 UTC

Created attachment 1542818
Cluster dashboard

Description of problem:
Grafana sometimes reports the wrong number of hosts down when all nodes are shut down.

When I stop the tendrl-node-agent, collectd and tendrl-gluster-integration services for the first time, Grafana usually shows correctly that all nodes are down. If I then start the services and, after a while, stop them again, Grafana reports that 4 hosts are down and 2 are up. This happens consistently on multiple installations with 6 nodes.

Version-Release number of selected component (if applicable):
tendrl-ansible-1.6.3-11.el7rhgs.noarch
tendrl-api-1.6.3-13.el7rhgs.noarch
tendrl-api-httpd-1.6.3-13.el7rhgs.noarch
tendrl-commons-1.6.3-17.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-21.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-3.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-21.el7rhgs.noarch
tendrl-node-agent-1.6.3-18.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch
tendrl-ui-1.6.3-15.el7rhgs.noarch

How reproducible:
60%

Steps to Reproduce:
1. Import a cluster with 6 nodes into Tendrl.
2. Stop the tendrl-node-agent, collectd and tendrl-gluster-integration services on all nodes.
3. Wait for 5 minutes.
4. Check the Cluster dashboard.
5. Start the tendrl-node-agent, collectd and tendrl-gluster-integration services on all nodes.
6. Wait for all nodes to start.
7. Repeat steps 2-6 multiple times.
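
The steps above can be sketched as a small script. This is only an illustration under stated assumptions: the node names are hypothetical, passwordless ssh as a user allowed to run systemctl is assumed, and the delay after restarting is a fixed value standing in for step 6, which the report does not quantify. The automation actually used for this report is the usmqe-setup playbooks listed under Additional info.

```python
import subprocess
import time

# Services named in the report.
SERVICES = ("tendrl-node-agent", "collectd", "tendrl-gluster-integration")

# Hypothetical host names; replace with the six imported cluster nodes.
NODES = tuple("node{}.example.com".format(i) for i in range(1, 7))


def service_commands(nodes, action, services=SERVICES):
    """Build one ssh argv per node that runs `systemctl <action>` on all services."""
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    remote = "systemctl {} {}".format(action, " ".join(services))
    return [["ssh", node, remote] for node in nodes]


def run_cycle(nodes, run=subprocess.check_call, sleep=time.sleep,
              check=None, down_wait=300, up_wait=300):
    """One pass of steps 2-6: stop services, wait, check, start services, wait."""
    for cmd in service_commands(nodes, "stop"):
        run(cmd)                      # step 2
    sleep(down_wait)                  # step 3: wait 5 minutes
    if check is not None:
        check()                       # step 4: inspect the Cluster dashboard
    for cmd in service_commands(nodes, "start"):
        run(cmd)                      # step 5
    sleep(up_wait)                    # step 6: assumed fixed delay, not from the report
```

Calling `run_cycle(NODES)` several times in a row corresponds to step 7. The `run`, `sleep` and `check` parameters exist only so the sequence can be exercised without touching real hosts.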

Actual results:
Most of the time, the dashboard reports that 4 nodes are down and 2 nodes are up.

Expected results:
All nodes should always be reported as down.
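
The gap between the actual and expected results can be expressed as a check on per-node status. The sketch below is illustrative only: the status mapping, the node names and the "UP"/"DOWN" strings are assumptions made for the example and are not taken from the Tendrl API or from Grafana.

```python
from collections import Counter


def host_counts(statuses):
    """Count nodes per status from a {node_name: status} mapping."""
    counts = Counter(statuses.values())
    return {"up": counts.get("UP", 0), "down": counts.get("DOWN", 0)}


def all_hosts_down(statuses):
    """Expected result: every known node is reported as down."""
    counts = host_counts(statuses)
    return counts["up"] == 0 and counts["down"] == len(statuses)


# Hypothetical snapshot matching the actual result in this report:
# 4 of 6 nodes shown as down, 2 still shown as up.
observed = {"node1": "DOWN", "node2": "DOWN", "node3": "DOWN",
            "node4": "DOWN", "node5": "UP", "node6": "UP"}
```

With this snapshot, `all_hosts_down(observed)` is false, which is the failure described above. It becomes true only when all six entries are "DOWN".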

Additional info:
Stopping/starting of services is automated by:
https://github.com/usmqe/usmqe-setup/blob/master/test_setup.tendrl_services_stopped_on_nodes.yml
https://github.com/usmqe/usmqe-setup/blob/master/test_teardown.tendrl_services_stopped_on_nodes.yml

Comment 2 gowtham 2019-04-01 17:05:22 UTC

PR: https://github.com/Tendrl/commons/pull/1077

Modified the tendrl-node-agent watcher thread logic to report node status correctly.

Comment 17 errata-xmlrpc 2019-10-30 12:23:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3251
