Bug 1509873

Summary: Errors Per Second panel doesn't reflect errors on hosts
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Filip Balák <fbalak>
Component: web-admin-tendrl-monitoring-integration Assignee: Ankush Behl <anbehl>
Status: CLOSED WONTFIX QA Contact: Martin Kudlej <mkudlej>
Severity: medium Docs Contact:
Priority: unspecified    
Version: rhgs-3.3 CC: gshanmug, mbukatov, mkarnik, ppenicka
Target Milestone: --- Keywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-05-08 16:14:10 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Errors Per Second panel none

Description Filip Balák 2017-11-06 09:14:42 UTC
Description of problem:
When I call:

# journalctl -u collectd --no-pager

I get these errors periodically, almost every minute:

...
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: Exception in thread Thread-59438:
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: Traceback (most recent call last):
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: self.run()
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib64/python2.7/threading.py", line 765, in run
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: self.__target(*self.__args, **self.__kwargs)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py", line 181, in populate_disk_details
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: self.get_brick_devices(brick_path)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py", line 138, in get_brick_devices
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: return self.fetch_brick_devices(brick_path)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py", line 61, in fetch_brick_devices
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: brick_path.replace('/', '_').replace("_", "", 1)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib/python2.7/site-packages/etcd/client.py", line 598, in read
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: return self._result_from_response(response)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib/python2.7/site-packages/etcd/client.py", line 812, in _result_from_response
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: 'Server response was not valid JSON: %r' % e)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: EtcdException: Server response was not valid JSON: ValueError('No JSON object could be decoded',)
Nov 06 03:43:26 fbalak-usm1-gl1.usmqe collectd[3253]: Failed to fetch volume heal statistics.The error is: Gathering crawl statistics on volume volume_alpha_distrep_6x2 has been unsuccessful on bricks that are down. Please check if all brick processes are running.
...

But when I open the Hosts dashboard, I see no errors in the Errors Per Second panel.
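
For readers following the traceback: the frame at line 61 of tendrl_gluster_brick_disk_stats.py shows the brick path being rewritten before the etcd read that fails. A minimal sketch of what that expression does, assuming a hypothetical brick path (the function name and the sample path are illustrative only; how the result is placed into the etcd key is not visible in the traceback):

```python
def brick_key_fragment(brick_path):
    """Rewrite a brick path the way the traceback frame does.

    Every '/' becomes '_', then the first '_' (which came from the
    leading slash of an absolute path) is dropped.
    """
    return brick_path.replace('/', '_').replace("_", "", 1)


# Hypothetical brick path, not taken from the reported cluster.
print(brick_key_fragment("/bricks/brick1/data"))  # bricks_brick1_data
```

The transformation itself cannot raise the reported EtcdException; the exception comes from the etcd client failing to decode the server response.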

Version-Release number of selected component (if applicable):
tendrl-ansible-1.5.3-2.el7rhgs.noarch
tendrl-commons-1.5.3-1.el7rhgs.noarch
tendrl-api-1.5.3-2.el7rhgs.noarch
tendrl-notifier-1.5.3-1.el7rhgs.noarch
tendrl-selinux-1.5.3-2.el7rhgs.noarch
tendrl-node-agent-1.5.3-3.el7rhgs.noarch
tendrl-ui-1.5.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.5.3-2.el7rhgs.noarch
tendrl-monitoring-integration-1.5.3-2.el7rhgs.noarch
tendrl-grafana-selinux-1.5.3-2.el7rhgs.noarch
tendrl-api-httpd-1.5.3-2.el7rhgs.noarch
glusterfs-3.8.4-48.el7rhgs.x86_64

How reproducible:
Not sure, about 60%.

Steps to Reproduce:
1. Import a cluster with a volume.
2. Restart all machines.
3. Mount the volume and add some files.
4. Remove a file directly from disk. (I was deliberately trying to break things so that errors would show up in the chart.)
5. Run `journalctl -u collectd --no-pager` and check the output for errors.
6. Open the Hosts dashboard and look at the Errors Per Second panel.
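
To compare step 5 with step 6, the journal output can be reduced to a per-minute error count. The following is a rough sketch only, not part of tendrl: the function name and the two marker strings are assumptions picked from the log excerpt above, and the sample lines are shortened copies of lines from that excerpt.

```python
from collections import Counter

# Assumed markers, chosen from the messages seen in this report's log.
ERROR_MARKERS = ("Exception in thread", "Failed to fetch")


def count_errors_per_minute(lines):
    """Count journal lines containing an error marker, grouped by minute.

    Assumes the default journalctl short format, where the first
    12 characters are the timestamp up to the minute, e.g. "Nov 06 03:41".
    """
    counts = Counter()
    for line in lines:
        if any(marker in line for marker in ERROR_MARKERS):
            counts[line[:12]] += 1
    return counts


# Shortened sample lines from the excerpt in this report.
sample = [
    "Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: Exception in thread Thread-59438:",
    "Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: Traceback (most recent call last):",
    "Nov 06 03:43:26 fbalak-usm1-gl1.usmqe collectd[3253]: Failed to fetch volume heal statistics.",
]

for minute, count in sorted(count_errors_per_minute(sample).items()):
    print(minute, count)
```

On a real host the function could be fed `sys.stdin` with the output of `journalctl -u collectd --no-pager` piped in; any minute with a non-zero count is one where the panel would be expected to show something if it tracked collectd errors.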

Actual results:
The collectd log is full of errors, but the Errors Per Second panel shows no errors for the whole time collectd has been running.

Expected results:
I think that collectd errors on the machines should be reported in the UI. Maybe the panel should also have a tooltip describing what it reports.

Additional info:
Bug https://bugzilla.redhat.com/show_bug.cgi?id=1508041 also seems to occur.

Comment 1 Filip Balák 2017-11-06 09:15:54 UTC
Created attachment 1348484
Errors Per Second panel

Comment 2 Nishanth Thomas 2017-11-07 09:08:27 UTC
Ideally, this should be handled as part of the integration with common logging. Parts of it can also be addressed by service-level monitoring in tendrl. Ankush will file an upstream issue and link it here.

This is an enhancement; it can be taken up in a future release.

Comment 3 Petr Penicka 2017-11-08 13:39:14 UTC
Triage Nov 8: QE agrees this can be postponed to the next release.