Description of problem:
When I call:

# journalctl -u collectd --no-pager

I get these errors periodically, almost every minute:

...
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: Exception in thread Thread-59438:
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: Traceback (most recent call last):
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: self.run()
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib64/python2.7/threading.py", line 765, in run
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: self.__target(*self.__args, **self.__kwargs)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py", line 181, in populate_disk_details
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: self.get_brick_devices(brick_path)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py", line 138, in get_brick_devices
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: return self.fetch_brick_devices(brick_path)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py", line 61, in fetch_brick_devices
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: brick_path.replace('/', '_').replace("_", "", 1)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib/python2.7/site-packages/etcd/client.py", line 598, in read
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: return self._result_from_response(response)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: File "/usr/lib/python2.7/site-packages/etcd/client.py", line 812, in _result_from_response
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: 'Server response was not valid JSON: %r' % e)
Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: EtcdException: Server response was not valid JSON: ValueError('No JSON object could be decoded',)
Nov 06 03:43:26 fbalak-usm1-gl1.usmqe collectd[3253]: Failed to fetch volume heal statistics.The error is: Gathering crawl statistics on volume volume_alpha_distrep_6x2 has been unsuccessful on bricks that are down. Please check if all brick processes are running.
...

But when I open the Hosts dashboard, I see no errors in the Errors Per Second panel.

Version-Release number of selected component (if applicable):
tendrl-ansible-1.5.3-2.el7rhgs.noarch
tendrl-commons-1.5.3-1.el7rhgs.noarch
tendrl-api-1.5.3-2.el7rhgs.noarch
tendrl-notifier-1.5.3-1.el7rhgs.noarch
tendrl-selinux-1.5.3-2.el7rhgs.noarch
tendrl-node-agent-1.5.3-3.el7rhgs.noarch
tendrl-ui-1.5.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.5.3-2.el7rhgs.noarch
tendrl-monitoring-integration-1.5.3-2.el7rhgs.noarch
tendrl-grafana-selinux-1.5.3-2.el7rhgs.noarch
tendrl-api-httpd-1.5.3-2.el7rhgs.noarch
glusterfs-3.8.4-48.el7rhgs.x86_64

How reproducible:
Not sure. 60%

Steps to Reproduce:
1. Import a cluster with a volume.
2. Restart all machines.
3. Mount the volume and add some files.
4. Remove some file directly from disk. (I tried to break it so that I would see some errors in the chart.)
5. Run `journalctl -u collectd --no-pager` and look at the output for errors.
6. Open the Hosts dashboard and look at Errors Per Second.

Actual results:
The collectd log is full of errors, but the Errors Per Second panel has shown no errors for the whole time it has been running.

Expected results:
I think that errors on collectd machines should be reported in the UI. Maybe there should be a tooltip describing what is reported.

Additional info:
Bug https://bugzilla.redhat.com/show_bug.cgi?id=1508041 also seems to appear.
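
To compare the journal against the Errors Per Second panel, the collectd log can be reduced to a per-minute error count. The following is only a minimal sketch in Python, assuming the journalctl short output format shown in the description. The function name `count_errors_per_minute` and the two message prefixes treated as error events are illustrative choices, not part of tendrl or collectd, and the sample lines are shortened copies of the log above.

```python
import re
from collections import Counter

# Journal short format: "Nov 06 03:41:25 <host> collectd[<pid>]: <message>"
LINE_RE = re.compile(
    r"^(?P<minute>\w{3} \d{2} \d{2}:\d{2}):\d{2} \S+ collectd\[\d+\]: (?P<msg>.*)$"
)

# Assumption: one error event starts with one of these prefixes.
# Traceback continuation lines are deliberately not counted.
ERROR_PREFIXES = ("Exception in thread", "Failed to fetch")


def count_errors_per_minute(lines):
    """Return a Counter mapping 'Mon DD HH:MM' to the number of error events."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("msg").startswith(ERROR_PREFIXES):
            counts[match.group("minute")] += 1
    return counts


# Shortened sample taken from the log in the description.
sample = [
    "Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: Exception in thread Thread-59438:",
    "Nov 06 03:41:25 fbalak-usm1-gl1.usmqe collectd[3253]: Traceback (most recent call last):",
    "Nov 06 03:43:26 fbalak-usm1-gl1.usmqe collectd[3253]: Failed to fetch volume heal statistics.",
]
print(dict(count_errors_per_minute(sample)))
```

Any minute with a non-zero count here that shows nothing in the panel is an instance of the mismatch reported above.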
Created attachment 1348484 [details] Errors Per Second panel
Ideally, this should be handled as part of the integration with common logging. Parts of it can also be addressed by doing service-level monitoring in tendrl. Ankush will file an upstream issue and link it here. This is an enhancement, to be taken up in a future release.
Triage Nov 8: QE agrees this can be postponed to next release.