1509873 – Errors Per Second panel doesn't reflect errors on hosts

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	web-admin-tendrl-monitoring-integration
Sub Component:
Version:	rhgs-3.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Ankush Behl
QA Contact:	Martin Kudlej
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-11-06 09:14 UTC by Filip Balák
Modified:	2019-05-08 18:13 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-05-08 16:14:10 UTC
Embargoed:
Dependent Products:

Attachments
Errors Per Second panel (41.11 KB, image/png) 2017-11-06 09:15 UTC, Filip Balák

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	https://github.com/Tendrl/monitoring-integration/issues/235	0	None	None	None	2017-11-07 12:57:57 UTC
Red Hat Bugzilla	1508041	0	unspecified	CLOSED	5 from 6 nodes are down and some chart don't reflect it	2021-02-22 00:41:40 UTC

Internal Links: 1508041

Description Filip Balák 2017-11-06 09:14:42 UTC

Description of problem:
When I call:

# journalctl -u collectd --no-pager

I get the following errors periodically, almost every minute:

...
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: Exception in thread Thread-59438:
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: Traceback (most recent call last):
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: self.run()
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib64/python2.7/threading.py", line 765, in run
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: self.__target(*self.__args, **self.__kwargs)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py", line 181, in populate_disk_details
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: self.get_brick_devices(brick_path)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py", line 138, in get_brick_devices
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: return self.fetch_brick_devices(brick_path)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py", line 61, in fetch_brick_devices
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: brick_path.replace('/', '_').replace("_", "", 1)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib/python2.7/site-packages/etcd/client.py", line 598, in read
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: return self._result_from_response(response)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib/python2.7/site-packages/etcd/client.py", line 812, in _result_from_response
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: 'Server response was not valid JSON: %r' % e)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: EtcdException: Server response was not valid JSON: ValueError('No JSON object could be decoded',)
Nov 06 03:43:26 fbalak-usm1-gl1.usmqe collectd[3253]: Failed to fetch volume heal statistics.The error is: Gathering crawl statistics on volume volume_alpha_distrep_6x2 has been unsuccessful on bricks that are down. Please check if all brick processes are running.
...

But when I open the Hosts dashboard, I see no errors in the Errors Per Second panel.
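
For readers of the traceback: the last frame inside tendrl_gluster_brick_disk_stats.py shows the expression `brick_path.replace('/', '_').replace("_", "", 1)`, which appears to be part of the argument passed to the etcd `read` call that then fails. The following is a minimal sketch of what that expression evaluates to. The function name `brick_path_to_key` and the sample brick path are hypothetical and do not come from the Tendrl source.

```python
def brick_path_to_key(brick_path):
    """Evaluate the same expression that appears in the traceback.

    Every '/' becomes '_', then only the first '_' is removed
    (for an absolute path, that is the one produced by the leading '/').
    """
    return brick_path.replace('/', '_').replace("_", "", 1)


# Hypothetical brick path, used only for illustration.
print(brick_path_to_key("/bricks/brick1"))  # bricks_brick1
```

The expression itself is plain string manipulation, so the `EtcdException` in the log comes from the etcd response not being valid JSON rather than from this transformation.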

Version-Release number of selected component (if applicable):
tendrl-ansible-1.5.3-2.el7rhgs.noarch
tendrl-commons-1.5.3-1.el7rhgs.noarch
tendrl-api-1.5.3-2.el7rhgs.noarch
tendrl-notifier-1.5.3-1.el7rhgs.noarch
tendrl-selinux-1.5.3-2.el7rhgs.noarch
tendrl-node-agent-1.5.3-3.el7rhgs.noarch
tendrl-ui-1.5.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.5.3-2.el7rhgs.noarch
tendrl-monitoring-integration-1.5.3-2.el7rhgs.noarch
tendrl-grafana-selinux-1.5.3-2.el7rhgs.noarch
tendrl-api-httpd-1.5.3-2.el7rhgs.noarch
glusterfs-3.8.4-48.el7rhgs.x86_64

How reproducible:
Not sure, roughly 60%.

Steps to Reproduce:
1. Import a cluster with a volume.
2. Restart all machines.
3. Mount the volume and add some files.
4. Remove a file directly from disk. (I was trying to break the volume so that some errors would show up in the chart.)
5. Run `journalctl -u collectd --no-pager` and look for errors in the output.
6. Open the Hosts dashboard and look at the Errors Per Second panel.
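
To compare step 5 with step 6 more concretely, the journal output can be bucketed per minute. The following is a rough sketch, assuming the `journalctl` line format shown in the description. The two error markers are taken from the log excerpt above, other messages would need their own patterns, and the function name `errors_per_minute` is hypothetical. It says nothing about what the Errors Per Second panel actually measures.

```python
import re
from collections import Counter

# Markers taken from the log excerpt in this report (assumption: these are
# the lines worth counting; traceback continuation lines are ignored).
ERROR_MARKERS = ("Exception in thread", "Failed to fetch")

# Captures the "Mon DD HH:MM" prefix of a journalctl line, e.g. "Nov 06 03:41".
MINUTE_RE = re.compile(r"^(\w{3} \d{2} \d{2}:\d{2}):\d{2} ")


def errors_per_minute(lines):
    """Count lines containing an error marker, grouped by minute."""
    counts = Counter()
    for line in lines:
        match = MINUTE_RE.match(line)
        if match and any(marker in line for marker in ERROR_MARKERS):
            counts[match.group(1)] += 1
    return counts


# Sample lines abbreviated from the excerpt in the description.
sample = [
    "Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: Exception in thread Thread-59438:",
    "Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: Traceback (most recent call last):",
    "Nov 06 03:43:26 fbalak-usm1-gl1.usmqe collectd[3253]: Failed to fetch volume heal statistics.",
]

for minute, count in sorted(errors_per_minute(sample).items()):
    print(minute, count)
```

In practice the input would be the saved output of `journalctl -u collectd --no-pager`, read line by line from a file.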

Actual results:
The collectd log is full of errors, but the Errors Per Second panel shows no errors for the whole time the cluster has been running.

Expected results:
I think that errors on machines running collectd should be reported in the UI. Maybe the panel should have a tooltip describing what is reported.

Additional info:
Bug https://bugzilla.redhat.com/show_bug.cgi?id=1508041 also seems to occur here.

Comment 1 Filip Balák 2017-11-06 09:15:54 UTC

Created attachment 1348484
Errors Per Second panel

Comment 2 Nishanth Thomas 2017-11-07 09:08:27 UTC

Ideally, this should be handled as part of the integration with common logging. Parts of it could also be addressed by service-level monitoring in Tendrl. Ankush will file an upstream issue and link it here.

This is an enhancement and should be taken up in a future release.

Comment 3 Petr Penicka 2017-11-08 13:39:14 UTC

Triage Nov 8: QE agrees this can be postponed to the next release.
