Bug 1509873 - Errors Per Second panel doesn't reflect errors on hosts
Summary: Errors Per Second panel doesn't reflect errors on hosts
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: web-admin-tendrl-monitoring-integration
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Ankush Behl
QA Contact: Martin Kudlej
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-06 09:14 UTC by Filip Balák
Modified: 2019-05-08 18:13 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-08 16:14:10 UTC
Embargoed:


Attachments
Errors Per Second panel (41.11 KB, image/png)
2017-11-06 09:15 UTC, Filip Balák


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | https://github.com/Tendrl monitoring-integration issues 235 | 0 | None | None | None | 2017-11-07 12:57:57 UTC
Red Hat Bugzilla | 1508041 | 0 | unspecified | CLOSED | 5 from 6 nodes are down and some chart don't reflect it | 2021-02-22 00:41:40 UTC

Internal Links: 1508041

Description Filip Balák 2017-11-06 09:14:42 UTC
Description of problem:
When I run:

# journalctl -u collectd --no-pager

the following errors appear periodically, almost every minute:

...
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: Exception in thread Thread-59438:
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: Traceback (most recent call last):
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: self.run()
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib64/python2.7/threading.py", line 765, in run
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: self.__target(*self.__args, **self.__kwargs)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py", line 181, in populate_disk_details
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: self.get_brick_devices(brick_path)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py", line 138, in get_brick_devices
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: return self.fetch_brick_devices(brick_path)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py", line 61, in fetch_brick_devices
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: brick_path.replace('/', '_').replace("_", "", 1)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib/python2.7/site-packages/etcd/client.py", line 598, in read
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: return self._result_from_response(response)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib/python2.7/site-packages/etcd/client.py", line 812, in _result_from_response
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: 'Server response was not valid JSON: %r' % e)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: EtcdException: Server response was not valid JSON: ValueError('No JSON object could be decoded',)
Nov 06 03:43:26 fbalak-usm1-gl1.usmqe collectd[3253]: Failed to fetch volume heal statistics.The error is: Gathering crawl statistics on volume volume_alpha_distrep_6x2 has been unsuccessful on bricks that are down. Please check if all brick processes are running.
...

However, when I open the Hosts dashboard, the Errors Per Second panel shows no errors.
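
For readers following the traceback: the expression shown for line 61 of tendrl_gluster_brick_disk_stats.py flattens a brick path into a single name before the etcd read that fails. The sketch below reproduces only that expression. The function name and the sample path are hypothetical, and the use of the result as part of an etcd key is an assumption based on the next frame in the traceback.

```python
def flatten_brick_path(brick_path):
    """Reproduce the path expression seen in the traceback.

    Every '/' becomes '_', then the first '_' (normally the one
    produced by the leading slash) is dropped.
    """
    return brick_path.replace('/', '_').replace("_", "", 1)


# Hypothetical brick path, used for illustration only.
print(flatten_brick_path("/bricks/brick1/data"))  # bricks_brick1_data
```

The expression itself is plain string handling; the exception in the log is raised later, when the etcd client cannot decode the server response as JSON.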

Version-Release number of selected component (if applicable):
tendrl-ansible-1.5.3-2.el7rhgs.noarch
tendrl-commons-1.5.3-1.el7rhgs.noarch
tendrl-api-1.5.3-2.el7rhgs.noarch
tendrl-notifier-1.5.3-1.el7rhgs.noarch
tendrl-selinux-1.5.3-2.el7rhgs.noarch
tendrl-node-agent-1.5.3-3.el7rhgs.noarch
tendrl-ui-1.5.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.5.3-2.el7rhgs.noarch
tendrl-monitoring-integration-1.5.3-2.el7rhgs.noarch
tendrl-grafana-selinux-1.5.3-2.el7rhgs.noarch
tendrl-api-httpd-1.5.3-2.el7rhgs.noarch
glusterfs-3.8.4-48.el7rhgs.x86_64

How reproducible:
Not sure; roughly 60%.

Steps to Reproduce:
1. Import a cluster with a volume.
2. Restart all machines.
3. Mount the volume and add some files.
4. Remove a file directly from disk. (I was deliberately trying to break the volume so that errors would show up in the chart.)
5. Run `journalctl -u collectd --no-pager` and check the output for errors.
6. Open the Hosts dashboard and look at the Errors Per Second panel.
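
Step 5 can be made quantitative with a small script that counts collectd error events per minute, which gives a number to compare against the panel in step 6. This is only a sketch for Python 3.7+: the function names, the regular expression, and the choice of which messages count as an error are assumptions derived from the log excerpt in this report, not part of Tendrl or collectd.

```python
import re
import subprocess
from collections import Counter

# Matches the syslog-style prefix seen in the log excerpt, e.g.
# "Nov 06 03:41:25 host collectd[3253]: message".
# Group 1 is the timestamp truncated to the minute, group 2 the message.
LINE_RE = re.compile(
    r"^(\w{3} +\d{1,2} \d{2}:\d{2}):\d{2} \S+ collectd\[\d+\]: (.*)$"
)

# Assumption: each of these substrings marks the start of one error event.
ERROR_MARKERS = ("Exception in thread", "Failed to ")


def errors_per_minute(lines):
    """Count error events per minute in collectd journal lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and any(marker in match.group(2) for marker in ERROR_MARKERS):
            counts[match.group(1)] += 1
    return counts


def read_collectd_journal():
    """Return the collectd journal as a list of lines (run on the host)."""
    result = subprocess.run(
        ["journalctl", "-u", "collectd", "--no-pager"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

On an affected host, `errors_per_minute(read_collectd_journal())` returns a Counter keyed by minute. Only the first line of each traceback is counted, so one exception is one event rather than one event per traceback line.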

Actual results:
The collectd log is full of errors, but the Errors Per Second panel shows no errors for the entire time the system has been running.

Expected results:
I think that errors on machines running collectd should be reported in the UI. Perhaps the panel should also have a tooltip describing what it reports.

Additional info:
Bug https://bugzilla.redhat.com/show_bug.cgi?id=1508041 also seems to occur.

Comment 1 Filip Balák 2017-11-06 09:15:54 UTC
Created attachment 1348484
Errors Per Second panel

Comment 2 Nishanth Thomas 2017-11-07 09:08:27 UTC
Ideally, this should be handled as part of the integration with common logging. Parts of it could also be addressed by service-level monitoring in Tendrl. Ankush will file an upstream issue and link it here.

This is an enhancement; it should be taken up in a future release.

Comment 3 Petr Penicka 2017-11-08 13:39:14 UTC
Triage Nov 8: QE agrees this can be postponed to the next release.

