Bug 1518678

Summary: bricks are marked as down in UI
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Martin Kudlej <mkudlej>
Component: web-admin-tendrl-gluster-integration
Assignee: Nishanth Thomas <nthomas>
Status: CLOSED ERRATA
QA Contact: Filip Balák <fbalak>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: rhgs-3.3
CC: amukherj, asriram, fbalak, mkudlej, nthomas, rghatvis, rhs-bugs, sanandpa, sankarshan, srmukher, ssaha
Target Milestone: ---
Target Release: RHGS 3.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: tendrl-gluster-integration-1.6.1-1.el7rhgs, tendrl-api-1.6.1-1.el7rhgs.noarch.rpm, tendrl-commons-1.6.1-1.el7rhgs.noarch.rpm, tendrl-monitoring-integration-1.6.1-1.el7rhgs.noarch.rpm, tendrl-node-agent-1.6.1-1.el7, tendrl-ui-1.6.1-1.el7rhgs.noarch.rpm
Doc Type: Known Issue
Doc Text:
An unexpected reboot of storage nodes can leave services misconfigured. As a result, bricks are marked ‘Down’ in the user interface. Workaround: restart the node-agent and gluster-integration services to obtain the correct brick status.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-04 06:59:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1503134    

Description Martin Kudlej 2017-11-29 12:58:54 UTC
Description of problem:
I see in the Tendrl UI and also in Grafana that bricks are shown as down on the "gl5" node.

Etcd:
$ etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 get /clusters/ded5d5dc-3930-4628-9791-983e37a40eea/Bricks/all/mkudlej-usm2-gl5/mnt_brick_gama_disperse_1_1/status
Stopped
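
For completeness, the remaining brick keys for this node can be listed the same way (a sketch reusing the etcd v2 CLI flags and key layout from the query above; the cluster ID and node name are from this report and will differ per deployment):

$ etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 ls --recursive /clusters/ded5d5dc-3930-4628-9791-983e37a40eea/Bricks/all/mkudlej-usm2-gl5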

All Tendrl and Gluster services (including collectd) are running on the gl5 node.

$ systemctl -a | grep tendrl
  tendrl-gluster-integration.service                                                                             loaded    active   running   Tendrl Gluster Daemon to Manage gluster tasks
  tendrl-node-agent.service                                                                                      loaded    active   running   A python agent local to every managed storage node in the sds cluster
  tendrl-node-agent.socket                                                                                       loaded    active   running   Tendrl message socket for logging

I also see in gluster that these two bricks are started.

$ gluster volume status
Status of volume: volume_gama_disperse_4_plus_2x2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick mkudlej-usm2-gl1
dhat.com:/mnt/brick_gama_disperse_1/1       49152     0          Y       1981 
Brick mkudlej-usm2-gl2
dhat.com:/mnt/brick_gama_disperse_1/1       49152     0          Y       1712 
Brick mkudlej-usm2-gl3
dhat.com:/mnt/brick_gama_disperse_1/1       49152     0          Y       1689 
Brick mkudlej-usm2-gl4
dhat.com:/mnt/brick_gama_disperse_1/1       49152     0          Y       1741 
Brick mkudlej-usm2-gl5
dhat.com:/mnt/brick_gama_disperse_1/1       49152     0          Y       1669 
Brick mkudlej-usm2-gl6
dhat.com:/mnt/brick_gama_disperse_1/1       49152     0          Y       1676 
Brick mkudlej-usm2-gl1
dhat.com:/mnt/brick_gama_disperse_2/2       49153     0          Y       1992 
Brick mkudlej-usm2-gl2
dhat.com:/mnt/brick_gama_disperse_2/2       49153     0          Y       1724 
Brick mkudlej-usm2-gl3
dhat.com:/mnt/brick_gama_disperse_2/2       49153     0          Y       1699 
Brick mkudlej-usm2-gl4
dhat.com:/mnt/brick_gama_disperse_2/2       49153     0          Y       1750 
Brick mkudlej-usm2-gl5
dhat.com:/mnt/brick_gama_disperse_2/2       49153     0          Y       1680 
Brick mkudlej-usm2-gl6
dhat.com:/mnt/brick_gama_disperse_2/2       49153     0          Y       1684 
Self-heal Daemon on localhost               N/A       N/A        Y       1516 
Self-heal Daemon on mkudlej-usm2-gl1.usmqe.
                      N/A       N/A        Y       1744 
Self-heal Daemon on mkudlej-usm2-gl2.usmqe.
                      N/A       N/A        Y       1472 
Self-heal Daemon on mkudlej-usm2-gl4.usmqe.
                      N/A       N/A        Y       1514 
Self-heal Daemon on mkudlej-usm2-gl6.usmqe.
                      N/A       N/A        Y       1506 
Self-heal Daemon on mkudlej-usm2-gl3.usmqe.
                      N/A       N/A        Y       1499 
 
Task Status of Volume volume_gama_disperse_4_plus_2x2
------------------------------------------------------------------------------
There are no active volume tasks


I don't see any errors in the logs.
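
(For reference, a sketch of one way to check the relevant service logs, assuming the journald units listed above; the time window is illustrative:)

$ journalctl -u tendrl-node-agent -u tendrl-gluster-integration --since "2017-11-29" | grep -i error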


Version-Release number of selected component (if applicable):
glusterfs-3.8.4-52.el7rhgs.x86_64
glusterfs-api-3.8.4-52.el7rhgs.x86_64
glusterfs-cli-3.8.4-52.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-52.el7rhgs.x86_64
glusterfs-events-3.8.4-52.el7rhgs.x86_64
glusterfs-fuse-3.8.4-52.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-52.el7rhgs.x86_64
glusterfs-libs-3.8.4-52.el7rhgs.x86_64
glusterfs-rdma-3.8.4-52.el7rhgs.x86_64
glusterfs-server-3.8.4-52.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.3.x86_64
python-gluster-3.8.4-52.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-1.el7rhgs.noarch
tendrl-commons-1.5.4-5.el7rhgs.noarch
tendrl-gluster-integration-1.5.4-6.el7rhgs.noarch
tendrl-node-agent-1.5.4-8.el7rhgs.noarch
tendrl-selinux-1.5.4-1.el7rhgs.noarch
vdsm-gluster-4.17.33-1.2.el7rhgs.noarch


How reproducible:
1/1

Steps to Reproduce:
1. Install and configure Gluster and WA.
2. Import the Gluster cluster into WA.
3. Shut down some Gluster nodes (see the sketch below).
4. Start them again after a couple of hours.
5. Watch the WA UI and Grafana.
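
A minimal sketch of steps 3-5 (node name taken from the volume status output above; the exact shutdown and power-on mechanism is an assumption, any clean shutdown should do):

$ ssh mkudlej-usm2-gl5 'shutdown -h now'
# power the node back on after a couple of hours, then on that node:
$ systemctl is-active tendrl-node-agent tendrl-gluster-integration
# and compare the brick status shown in the WA UI and Grafana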

Actual results:
Some bricks are marked as down/stopped.

Expected results:
All bricks are marked as up/started after a node restart or outage.

Comment 3 Martin Kudlej 2017-11-29 15:07:11 UTC
I see this issue after rebooting machines, so reproducibility is now 2/3.

Comment 4 Sweta Anandpara 2017-12-06 10:18:57 UTC
Having discussed this with dev, it has been agreed to document this bug as a known issue for this release, with detailed steps stating that the node-agent and gluster-integration services need to be restarted explicitly (just to be on the safe side) after an unplanned reboot of a storage node.
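
The explicit restart would look along these lines, using the unit names from the systemctl listing in the description, followed by re-checking the brick status key in etcd (the expected 'Started' value is an assumption, mirroring the 'Stopped' value seen above):

# on the rebooted storage node:
$ systemctl restart tendrl-node-agent.service tendrl-gluster-integration.service
# then re-check the brick status key from the description:
$ etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 get /clusters/ded5d5dc-3930-4628-9791-983e37a40eea/Bricks/all/mkudlej-usm2-gl5/mnt_brick_gama_disperse_1_1/status
Started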

Comment 8 Darshan 2018-01-30 09:53:14 UTC
Tried the steps given in the description but was not able to reproduce this; bricks were in the started state after reboot.

Comment 9 Nishanth Thomas 2018-01-30 13:49:31 UTC
Since this bug is no longer seen, moving this to ON_QA.

Comment 12 Filip Balák 2018-05-14 08:43:27 UTC
Seems ok. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-3.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-4.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-2.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-2.el7rhgs.noarch
tendrl-node-agent-1.6.3-4.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch

Comment 16 errata-xmlrpc 2018-09-04 06:59:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616