Bug 1518678 - bricks are marked as down in UI
Summary: bricks are marked as down in UI
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: web-admin-tendrl-gluster-integration
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Nishanth Thomas
QA Contact: Filip Balák
URL:
Whiteboard:
Depends On:
Blocks: 1503134
 
Reported: 2017-11-29 12:58 UTC by Martin Kudlej
Modified: 2018-09-04 07:00 UTC
CC List: 11 users

Fixed In Version: tendrl-gluster-integration-1.6.1-1.el7rhgs, tendrl-api-1.6.1-1.el7rhgs, tendrl-commons-1.6.1-1.el7rhgs, tendrl-monitoring-integration-1.6.1-1.el7rhgs, tendrl-node-agent-1.6.1-1.el7, tendrl-ui-1.6.1-1.el7rhgs
Doc Type: Known Issue
Doc Text:
An unexpected reboot of storage nodes leads to service misconfiguration. As a result, bricks are marked ‘Down’ in the user interface. Workaround: restart the node-agent and gluster-integration services to get the correct brick status.
Clone Of:
Environment:
Last Closed: 2018-09-04 06:59:21 UTC
Embargoed:




Links:
Red Hat Product Errata RHSA-2018:2616 (last updated 2018-09-04 07:00:23 UTC)

Description Martin Kudlej 2017-11-29 12:58:54 UTC
Description of problem:
I see in both the Tendrl UI and Grafana that bricks are reported as down on the "gl5" node.

Etcd:
$ etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt --cert-file /etc/pki/tls/certs/etcd.crt --key-file /etc/pki/tls/private/etcd.key --endpoints https://${HOSTNAME}:2379 get /clusters/ded5d5dc-3930-4628-9791-983e37a40eea/Bricks/all/mkudlej-usm2-gl5/mnt_brick_gama_disperse_1_1/status
Stopped

All tendrl and gluster services (including collectd) are running on the gl5 node.

$ systemctl -a | grep tendrl
  tendrl-gluster-integration.service                                                                             loaded    active   running   Tendrl Gluster Daemon to Manage gluster tasks
  tendrl-node-agent.service                                                                                      loaded    active   running   A python agent local to every managed storage node in the sds cluster
  tendrl-node-agent.socket                                                                                       loaded    active   running   Tendrl message socket for logging

Gluster itself also reports these two bricks as started.

$ gluster volume status
Status of volume: volume_gama_disperse_4_plus_2x2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick mkudlej-usm2-gl1.usmqe.redhat.com:/mnt/brick_gama_disperse_1/1   49152   0     Y   1981
Brick mkudlej-usm2-gl2.usmqe.redhat.com:/mnt/brick_gama_disperse_1/1   49152   0     Y   1712
Brick mkudlej-usm2-gl3.usmqe.redhat.com:/mnt/brick_gama_disperse_1/1   49152   0     Y   1689
Brick mkudlej-usm2-gl4.usmqe.redhat.com:/mnt/brick_gama_disperse_1/1   49152   0     Y   1741
Brick mkudlej-usm2-gl5.usmqe.redhat.com:/mnt/brick_gama_disperse_1/1   49152   0     Y   1669
Brick mkudlej-usm2-gl6.usmqe.redhat.com:/mnt/brick_gama_disperse_1/1   49152   0     Y   1676
Brick mkudlej-usm2-gl1.usmqe.redhat.com:/mnt/brick_gama_disperse_2/2   49153   0     Y   1992
Brick mkudlej-usm2-gl2.usmqe.redhat.com:/mnt/brick_gama_disperse_2/2   49153   0     Y   1724
Brick mkudlej-usm2-gl3.usmqe.redhat.com:/mnt/brick_gama_disperse_2/2   49153   0     Y   1699
Brick mkudlej-usm2-gl4.usmqe.redhat.com:/mnt/brick_gama_disperse_2/2   49153   0     Y   1750
Brick mkudlej-usm2-gl5.usmqe.redhat.com:/mnt/brick_gama_disperse_2/2   49153   0     Y   1680
Brick mkudlej-usm2-gl6.usmqe.redhat.com:/mnt/brick_gama_disperse_2/2   49153   0     Y   1684
Self-heal Daemon on localhost                                          N/A     N/A   Y   1516
Self-heal Daemon on mkudlej-usm2-gl1.usmqe.redhat.com                  N/A     N/A   Y   1744
Self-heal Daemon on mkudlej-usm2-gl2.usmqe.redhat.com                  N/A     N/A   Y   1472
Self-heal Daemon on mkudlej-usm2-gl4.usmqe.redhat.com                  N/A     N/A   Y   1514
Self-heal Daemon on mkudlej-usm2-gl6.usmqe.redhat.com                  N/A     N/A   Y   1506
Self-heal Daemon on mkudlej-usm2-gl3.usmqe.redhat.com                  N/A     N/A   Y   1499
 
Task Status of Volume volume_gama_disperse_4_plus_2x2
------------------------------------------------------------------------------
There are no active volume tasks


I don't see any errors in the logs.


Version-Release number of selected component (if applicable):
glusterfs-3.8.4-52.el7rhgs.x86_64
glusterfs-api-3.8.4-52.el7rhgs.x86_64
glusterfs-cli-3.8.4-52.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-52.el7rhgs.x86_64
glusterfs-events-3.8.4-52.el7rhgs.x86_64
glusterfs-fuse-3.8.4-52.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-52.el7rhgs.x86_64
glusterfs-libs-3.8.4-52.el7rhgs.x86_64
glusterfs-rdma-3.8.4-52.el7rhgs.x86_64
glusterfs-server-3.8.4-52.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.3.x86_64
python-gluster-3.8.4-52.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-1.el7rhgs.noarch
tendrl-commons-1.5.4-5.el7rhgs.noarch
tendrl-gluster-integration-1.5.4-6.el7rhgs.noarch
tendrl-node-agent-1.5.4-8.el7rhgs.noarch
tendrl-selinux-1.5.4-1.el7rhgs.noarch
vdsm-gluster-4.17.33-1.2.el7rhgs.noarch


How reproducible:
1/1

Steps to Reproduce:
1. Install and configure Gluster and Web Administration (WA).
2. Import the Gluster cluster into WA.
3. Shut down some Gluster nodes.
4. Start them again after a couple of hours.
5. Watch the WA UI and Grafana (a check comparing WA's state with gluster's own is sketched below).
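
To tell whether WA's view has diverged from gluster's own after the reboot, the etcd query from the description can be reused. A minimal sketch, where <cluster-id>, <node> and <brick-path> are placeholders for the values seen in this report:

$ etcdctl --ca-file /etc/pki/tls/certs/ca-usmqe.crt \
          --cert-file /etc/pki/tls/certs/etcd.crt \
          --key-file /etc/pki/tls/private/etcd.key \
          --endpoints https://${HOSTNAME}:2379 \
          get /clusters/<cluster-id>/Bricks/all/<node>/<brick-path>/status
$ gluster volume status | grep -A1 <node>

If etcd reports Stopped while gluster shows the brick Online (Y), the UI state is stale, which is the symptom reported here.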

Actual results:
Some bricks are marked as down/stopped.

Expected results:
All bricks are marked as up/started after a node restart or outage.

Comment 3 Martin Kudlej 2017-11-29 15:07:11 UTC
I saw this issue again after rebooting the machines, so reproducibility is now 2/3.

Comment 4 Sweta Anandpara 2017-12-06 10:18:57 UTC
Having discussed it with dev, it has been agreed to document this bug as a known issue for this release, with detailed steps noting that the node-agent and gluster-integration services need to be restarted explicitly (just to be on the safe side) when there is an unplanned reboot of a storage node.
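
For reference, a minimal sketch of those restart steps, assuming the unit names shown in the description (tendrl-node-agent.service and tendrl-gluster-integration.service) and root privileges on each affected storage node:

$ systemctl restart tendrl-node-agent.service
$ systemctl restart tendrl-gluster-integration.service
$ systemctl -a | grep tendrl   # confirm both units are back to active/running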

Comment 8 Darshan 2018-01-30 09:53:14 UTC
Tried the steps given in the description but was not able to reproduce this; bricks were shown in the started state after reboot.

Comment 9 Nishanth Thomas 2018-01-30 13:49:31 UTC
Since this bug is no longer seen, moving it to ON_QA.

Comment 12 Filip Balák 2018-05-14 08:43:27 UTC
Seems ok. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-3.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-4.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-2.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-2.el7rhgs.noarch
tendrl-node-agent-1.6.3-4.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch

Comment 16 errata-xmlrpc 2018-09-04 06:59:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616

