Bug 1728766 - Volume start failed when shd is down on one of the nodes in the cluster
Summary: Volume start failed when shd is down on one of the nodes in the cluster
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On: 1726219
Blocks: 1696809
 
Reported: 2019-07-10 15:57 UTC by Mohammed Rafi KC
Modified: 2019-08-05 06:48 UTC (History)
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1726219
Environment:
Last Closed: 2019-08-05 06:48:55 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:




Links:
Gluster.org Gerrit 23007 (Merged): glusterd/shd: Return null proc if process is not running. (Last Updated: 2019-08-05 06:48:54 UTC)

Description Mohammed Rafi KC 2019-07-10 15:57:22 UTC
+++ This bug was initially created as a clone of Bug #1726219 +++

Description of problem:


Volume info output is not consistent across the cluster: two nodes report the volume as Stopped, whereas one node reports it as Started.


Node1: 
[root@dhcp35-50 ~]# gluster v info test3
 
Volume Name: test3
Type: Replicate
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.35.50:/bricks/brick1/tes3
Brick2: 10.70.46.216:/bricks/brick1/tes3
Brick3: 10.70.46.132:/bricks/brick1/tes3
Options Reconfigured:
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
[root@dhcp35-50 ~]# gluster v status test3
Staging failed on 10.70.46.216. Error: Volume test3 is not started
Staging failed on 10.70.46.132. Error: Volume test3 is not started

Node 2: 

[root@dhcp46-216 ~]# gluster v info test3 | egrep 'Volume ID|Status'
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Stopped

Node3:
[root@dhcp46-132 ~]# gluster v info test3 | egrep 'Volume ID|Status'
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Stopped
==================================================
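
For a quick check of the inconsistency in one pass, the Status line can be compared across all peers. A minimal sketch, assuming passwordless ssh to the three node addresses shown above:

#!/bin/bash
# Compare the reported status of test3 on each peer (addresses from the outputs above).
for node in 10.70.35.50 10.70.46.216 10.70.46.132; do
    echo -n "$node: "
    ssh "$node" gluster volume info test3 | grep '^Status'
done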

Version-Release number of selected component (if applicable):


How reproducible:
2/2

Steps to Reproduce:
1.  Create two replica 3 volumes.
2.  Stop one volume; execute the command on node 1 (35.50):
[root@dhcp35-50 ~]# gluster v stop test3
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: test3: success

3.  Kill shd on one node
kill -15 5928
4.  Check gluster v info output on all 3 nodes
The volume is reported as Stopped in the output of all three nodes.
5. Now start the volume from node 1
# gluster v start test3
volume start: test3: failed: Commit failed on localhost. Please check log file for details.
The output says the volume start failed.
6. Now check the vol info output on all three nodes (the full sequence is condensed into a script sketch after these outputs)

Node1:
[root@dhcp35-50 ~]# gluster v info test3 | egrep 'Volume ID|Status'
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Started

Node2:
[root@dhcp46-216 ~]# gluster v info test3 | egrep 'Volume ID|Status'
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Stopped

Node3:
[root@dhcp46-132 ~]# gluster v info test3 | egrep 'Volume ID|Status'
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Stopped
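
The steps above condense into a short script. A minimal sketch, run from node 1 (35.50): the test3 brick paths and peer addresses are taken from the outputs above, the second volume (test2) and its brick paths are illustrative, passwordless ssh is assumed, and a bracketed pgrep -f pattern is just one assumed way to find the shd pid (the report used a plain kill -15 on the pid).

#!/bin/bash
# Reproduction sketch for the volume start failure (steps 1-6 above).

# 1. Create two replica 3 volumes and start them.
gluster volume create test2 replica 3 \
    10.70.35.50:/bricks/brick1/tes2 \
    10.70.46.216:/bricks/brick1/tes2 \
    10.70.46.132:/bricks/brick1/tes2 force
gluster volume create test3 replica 3 \
    10.70.35.50:/bricks/brick1/tes3 \
    10.70.46.216:/bricks/brick1/tes3 \
    10.70.46.132:/bricks/brick1/tes3 force
gluster volume start test2
gluster volume start test3

# 2. Stop one volume (--mode=script answers the confirmation prompt).
gluster --mode=script volume stop test3

# 3. Kill shd on one node (node 3 here); the bracketed pattern keeps pgrep
#    from matching the ssh-spawned shell itself.
ssh 10.70.46.132 'kill -15 $(pgrep -f "[g]lustershd")'

# 4. The volume should now report Stopped (checked locally here; the report
#    checked all three nodes).
gluster volume info test3 | grep '^Status'

# 5. Start the volume again; on affected builds this fails with
#    "Commit failed on localhost".
gluster volume start test3

# 6. Compare the reported status across the peers.
for node in 10.70.35.50 10.70.46.216 10.70.46.132; do
    echo -n "$node: "
    ssh "$node" gluster volume info test3 | grep '^Status'
done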


Actual results:

As described above in the Steps to Reproduce.

Expected results:

1. The volume should start without any error (confirmed that the volume starts on the older release glusterfs-fuse-3.12.2-47.2.el7rhgs.x86_64).

2. Command output should be consistent when executed from any node (all automation cases randomly pick a node as the master for command execution).

3. Volume start force should bring up shd on a node where it was killed (confirmed on the older release glusterfs-fuse-3.12.2-47.2.el7rhgs.x86_64); a verification sketch follows below.
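
Once a build with the fix is installed, the expectations above can be spot-checked with a short sequence. A minimal sketch, assuming the same nodes as in the reproduction sketch, that test3 is currently in the Stopped state, and the same bracketed pgrep pattern as an assumed way to locate shd remotely:

#!/bin/bash
# Expectations 1 and 2: the start succeeds on the first attempt; afterwards the
# per-node status loop from the earlier sketches should print Started everywhere.
gluster volume start test3

# Expectation 3: start force brings shd back on the node where it was killed.
ssh 10.70.46.132 'kill -15 $(pgrep -f "[g]lustershd")'
gluster volume start test3 force
sleep 2
ssh 10.70.46.132 'pgrep -f "[g]lustershd" > /dev/null && echo "shd is running again"'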

Additional info:

There is also a discrepancy in the output of vol status when executed from different nodes.

[root@dhcp35-50 ~]# gluster v status test3
Staging failed on 10.70.46.132. Error: Volume test3 is not started
Staging failed on 10.70.46.216. Error: Volume test3 is not started

[root@dhcp46-132 ~]# gluster v status test3
Volume test3 is not started

[root@dhcp46-216 ~]# gluster v status test3
Volume test3 is not started

[root@dhcp46-216 ~]# gluster v start test3 force
volume start: test3: failed: Commit failed on dhcp35-50.lab.eng.blr.redhat.com. Please check log file for details.

Comment 1 Worker Ant 2019-07-10 16:00:55 UTC
REVIEW: https://review.gluster.org/23007 (glusterd/shd: Return null proc if process is not running.) posted (#2) for review on master by mohammed rafi  kc

Comment 2 Worker Ant 2019-08-05 06:48:55 UTC
REVIEW: https://review.gluster.org/23007 (glusterd/shd: Return null proc if process is not running.) merged (#5) on master by Amar Tumballi

