Description of problem:
I see in the Tendrl UI and also in Grafana that bricks are down on the "gl5" node.

Etcd:
$ etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 get /clusters/ded5d5dc-3930-4628-9791-983e37a40eea/Bricks/all/mkudlej-usm2-gl5/mnt_brick_gama_disperse_1_1/status
Stopped

All tendrl and gluster services (including collectd) run on the gl5 node.

$ systemctl -a | grep tendrl
tendrl-gluster-integration.service   loaded active running   Tendrl Gluster Daemon to Manage gluster tasks
tendrl-node-agent.service            loaded active running   A python agent local to every managed storage node in the sds cluster
tendrl-node-agent.socket             loaded active running   Tendrl message socket for logging

At the same time, gluster shows that these 2 bricks on gl5 are started.

$ gluster volume status
Status of volume: volume_gama_disperse_4_plus_2x2
Gluster process                                               TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick mkudlej-usm2-gl1 dhat.com:/mnt/brick_gama_disperse_1/1  49152     0          Y       1981
Brick mkudlej-usm2-gl2 dhat.com:/mnt/brick_gama_disperse_1/1  49152     0          Y       1712
Brick mkudlej-usm2-gl3 dhat.com:/mnt/brick_gama_disperse_1/1  49152     0          Y       1689
Brick mkudlej-usm2-gl4 dhat.com:/mnt/brick_gama_disperse_1/1  49152     0          Y       1741
Brick mkudlej-usm2-gl5 dhat.com:/mnt/brick_gama_disperse_1/1  49152     0          Y       1669
Brick mkudlej-usm2-gl6 dhat.com:/mnt/brick_gama_disperse_1/1  49152     0          Y       1676
Brick mkudlej-usm2-gl1 dhat.com:/mnt/brick_gama_disperse_2/2  49153     0          Y       1992
Brick mkudlej-usm2-gl2 dhat.com:/mnt/brick_gama_disperse_2/2  49153     0          Y       1724
Brick mkudlej-usm2-gl3 dhat.com:/mnt/brick_gama_disperse_2/2  49153     0          Y       1699
Brick mkudlej-usm2-gl4 dhat.com:/mnt/brick_gama_disperse_2/2  49153     0          Y       1750
Brick mkudlej-usm2-gl5 dhat.com:/mnt/brick_gama_disperse_2/2  49153     0          Y       1680
Brick mkudlej-usm2-gl6 dhat.com:/mnt/brick_gama_disperse_2/2  49153     0          Y       1684
Self-heal Daemon on localhost                                 N/A       N/A        Y       1516
Self-heal Daemon on mkudlej-usm2-gl1.usmqe.                   N/A       N/A        Y       1744
Self-heal Daemon on mkudlej-usm2-gl2.usmqe.                   N/A       N/A        Y       1472
Self-heal Daemon on mkudlej-usm2-gl4.usmqe.                   N/A       N/A        Y       1514
Self-heal Daemon on mkudlej-usm2-gl6.usmqe.                   N/A       N/A        Y       1506
Self-heal Daemon on mkudlej-usm2-gl3.usmqe.                   N/A       N/A        Y       1499

Task Status of Volume volume_gama_disperse_4_plus_2x2
------------------------------------------------------------------------------
There are no active volume tasks

I don't see any errors in the logs.

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-52.el7rhgs.x86_64
glusterfs-api-3.8.4-52.el7rhgs.x86_64
glusterfs-cli-3.8.4-52.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-52.el7rhgs.x86_64
glusterfs-events-3.8.4-52.el7rhgs.x86_64
glusterfs-fuse-3.8.4-52.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-52.el7rhgs.x86_64
glusterfs-libs-3.8.4-52.el7rhgs.x86_64
glusterfs-rdma-3.8.4-52.el7rhgs.x86_64
glusterfs-server-3.8.4-52.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.3.x86_64
python-gluster-3.8.4-52.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-1.el7rhgs.noarch
tendrl-commons-1.5.4-5.el7rhgs.noarch
tendrl-gluster-integration-1.5.4-6.el7rhgs.noarch
tendrl-node-agent-1.5.4-8.el7rhgs.noarch
tendrl-selinux-1.5.4-1.el7rhgs.noarch
vdsm-gluster-4.17.33-1.2.el7rhgs.noarch

How reproducible:
1/1

Steps to Reproduce:
1. Install and configure gluster and WA.
2. Import the gluster cluster into WA.
3. Shut down some gluster nodes.
4. Start them again after a couple of hours.
5. Watch the WA UI and Grafana.

Actual results:
Some bricks are marked as down/stopped.

Expected results:
All bricks are marked as up/started after a node restart or outage.
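For reference, a minimal sketch of how the mismatch between Tendrl's view (etcd) and gluster's own view can be re-checked; the endpoint, certificate paths, cluster ID, and brick key are the environment-specific values from the description above:

# Tendrl/etcd view of the brick status (same key as quoted above)
$ etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt \
          --cert-file /etc/pki/tls/certs/etcd.crt \
          --key-file /etc/pki/tls/private/etcd.key \
          --endpoints https://${HOSTNAME}:2379 \
          get /clusters/ded5d5dc-3930-4628-9791-983e37a40eea/Bricks/all/mkudlej-usm2-gl5/mnt_brick_gama_disperse_1_1/status

# Gluster's own view of the bricks on the same node
$ gluster volume status volume_gama_disperse_4_plus_2x2 | grep gl5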
I saw this issue again after rebooting the machines, so reproducibility is now 2/3.
Having discussed this with dev, it has been agreed to document this bug as a known_issue for this release, with detailed steps noting that the node-agent and gluster-integration services need to be restarted explicitly (just to be on the safe side) after an unplanned reboot of a storage node.
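For reference, a minimal sketch of that documented workaround, using the service names shown in the systemctl output above (to be run on the storage node that had the unplanned reboot):

# Restart the Tendrl services explicitly after an unplanned reboot
$ systemctl restart tendrl-node-agent
$ systemctl restart tendrl-gluster-integration

# Confirm both services came back up
$ systemctl is-active tendrl-node-agent tendrl-gluster-integration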
Tried the steps as given in the description, but was not able to reproduce this; the bricks were shown in started state after the reboot.
Since this bug is no longer seen, moving this to ON_QA.
Seems ok. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-3.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-4.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-2.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-2.el7rhgs.noarch
tendrl-node-agent-1.6.3-4.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616