Bug 1728766 - Volume start failed when shd is down on one of the nodes in the cluster
Summary: Volume start failed when shd is down on one of the nodes in the cluster
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On: 1726219
Blocks: 1696809
 
Reported: 2019-07-10 15:57 UTC by Mohammed Rafi KC
Modified: 2019-08-05 06:48 UTC (History)
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1726219
Environment:
Last Closed: 2019-08-05 06:48:55 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:




Links:
Gluster.org Gerrit 23007 (Merged): glusterd/shd: Return null proc if process is not running. (Last Updated: 2019-08-05 06:48:54 UTC)

Description Mohammed Rafi KC 2019-07-10 15:57:22 UTC
+++ This bug was initially created as a clone of Bug #1726219 +++

Description of problem:


Volume info output is not consistent across the cluster: two nodes report the volume as Stopped, whereas one node reports it as Started.


Node1: 
[root@dhcp35-50 ~]# gluster v info test3
 
Volume Name: test3
Type: Replicate
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.35.50:/bricks/brick1/tes3
Brick2: 10.70.46.216:/bricks/brick1/tes3
Brick3: 10.70.46.132:/bricks/brick1/tes3
Options Reconfigured:
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
[root@dhcp35-50 ~]# gluster v status test3
Staging failed on 10.70.46.216. Error: Volume test3 is not started
Staging failed on 10.70.46.132. Error: Volume test3 is not started

Node 2: 

[root@dhcp46-216 ~]# gluster v info test3 | egrep 'Volume ID|Status'
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Stopped

Node3:
[root@dhcp46-132 ~]# gluster v info test3 | egrep 'Volume ID|Status'
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Stopped
==================================================
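
For a quick check of the inconsistency in one pass, the Status line can be compared across all peers. A minimal sketch, assuming passwordless ssh to the three node addresses shown above:

#!/bin/bash
# Compare the reported status of test3 on each peer (addresses from the outputs above).
for node in 10.70.35.50 10.70.46.216 10.70.46.132; do
    echo -n "$node: "
    ssh "$node" gluster volume info test3 | grep '^Status'
done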

Version-Release number of selected component (if applicable):


How reproducible:
2/2

Steps to Reproduce:
1.  Create two replica 3 volumes.
2.  Stop one volume; execute the command on node 1 (35.50):
[root@dhcp35-50 ~]# gluster v stop test3
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: test3: success

3.  Kill shd on one node
kill -15 5928
4.  Check gluster v info output on all 3 nodes
The volume is reported as Stopped in the output of all three nodes.
5. Now start the volume from node 1
# gluster v start test3
volume start: test3: failed: Commit failed on localhost. Please check log file for details.
The output says the volume start failed.
6. Now check the vol info output on all three nodes (the full sequence is condensed into a script sketch after these outputs)

Node1:
[root@dhcp35-50 ~]# gluster v info test3 | egrep 'Volume ID|Status'
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Started

Node2:
[root@dhcp46-216 ~]# gluster v info test3 | egrep 'Volume ID|Status'
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Stopped

Node3:
[root@dhcp46-132 ~]# gluster v info test3 | egrep 'Volume ID|Status'
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Stopped
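
The steps above condense into a short script. A minimal sketch, run from node 1 (35.50): the test3 brick paths and peer addresses are taken from the outputs above, the second volume (test2) and its brick paths are illustrative, passwordless ssh is assumed, and a bracketed pgrep -f pattern is just one assumed way to find the shd pid (the report used a plain kill -15 on the pid).

#!/bin/bash
# Reproduction sketch for the volume start failure (steps 1-6 above).

# 1. Create two replica 3 volumes and start them.
gluster volume create test2 replica 3 \
    10.70.35.50:/bricks/brick1/tes2 \
    10.70.46.216:/bricks/brick1/tes2 \
    10.70.46.132:/bricks/brick1/tes2 force
gluster volume create test3 replica 3 \
    10.70.35.50:/bricks/brick1/tes3 \
    10.70.46.216:/bricks/brick1/tes3 \
    10.70.46.132:/bricks/brick1/tes3 force
gluster volume start test2
gluster volume start test3

# 2. Stop one volume (--mode=script answers the confirmation prompt).
gluster --mode=script volume stop test3

# 3. Kill shd on one node (node 3 here); the bracketed pattern keeps pgrep
#    from matching the ssh-spawned shell itself.
ssh 10.70.46.132 'kill -15 $(pgrep -f "[g]lustershd")'

# 4. The volume should now report Stopped (checked locally here; the report
#    checked all three nodes).
gluster volume info test3 | grep '^Status'

# 5. Start the volume again; on affected builds this fails with
#    "Commit failed on localhost".
gluster volume start test3

# 6. Compare the reported status across the peers.
for node in 10.70.35.50 10.70.46.216 10.70.46.132; do
    echo -n "$node: "
    ssh "$node" gluster volume info test3 | grep '^Status'
done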


Actual results:

As described above in the Steps to Reproduce.

Expected results:

1. The volume should start without any error (confirmed that the volume starts on the older release glusterfs-fuse-3.12.2-47.2.el7rhgs.x86_64).

2. Command output should be consistent when executed from any node (all automation cases randomly pick a node as the master for command execution).

3. Volume start force should bring up shd on a node where it was killed (confirmed on the older release glusterfs-fuse-3.12.2-47.2.el7rhgs.x86_64); a verification sketch follows below.
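
Once a build with the fix is installed, the expectations above can be spot-checked with a short sequence. A minimal sketch, assuming the same nodes as in the reproduction sketch, that test3 is currently in the Stopped state, and the same bracketed pgrep pattern as an assumed way to locate shd remotely:

#!/bin/bash
# Expectations 1 and 2: the start succeeds on the first attempt; afterwards the
# per-node status loop from the earlier sketches should print Started everywhere.
gluster volume start test3

# Expectation 3: start force brings shd back on the node where it was killed.
ssh 10.70.46.132 'kill -15 $(pgrep -f "[g]lustershd")'
gluster volume start test3 force
sleep 2
ssh 10.70.46.132 'pgrep -f "[g]lustershd" > /dev/null && echo "shd is running again"'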

Additional info:

There is also a discrepancy in the output of vol status when executed from different nodes.

[root@dhcp35-50 ~]# gluster v status test3
Staging failed on 10.70.46.132. Error: Volume test3 is not started
Staging failed on 10.70.46.216. Error: Volume test3 is not started

[root@dhcp46-132 ~]# gluster v status test3
Volume test3 is not started

[root@dhcp46-216 ~]# gluster v status test3
Volume test3 is not started

[root@dhcp46-216 ~]# gluster v start test3 force
volume start: test3: failed: Commit failed on dhcp35-50.lab.eng.blr.redhat.com. Please check log file for details.

Comment 1 Worker Ant 2019-07-10 16:00:55 UTC
REVIEW: https://review.gluster.org/23007 (glusterd/shd: Return null proc if process is not running.) posted (#2) for review on master by mohammed rafi  kc

Comment 2 Worker Ant 2019-08-05 06:48:55 UTC
REVIEW: https://review.gluster.org/23007 (glusterd/shd: Return null proc if process is not running.) merged (#5) on master by Amar Tumballi

