Bug 1726219
Summary: | Volume start failed when shd is down on one of the nodes in the cluster | | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Anees Patel <anepatel>
Component: | glusterd | Assignee: | Mohammed Rafi KC <rkavunga>
Status: | CLOSED DEFERRED | QA Contact: | Bala Konda Reddy M <bmekala>
Severity: | medium | Docs Contact: |
Priority: | unspecified | |
Version: | rhgs-3.5 | CC: | amukherj, nchilaka, rhs-bugs, rkavunga, sheggodu, srakonde, storage-qa-internal, vbellur, vdas
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Linux | |
Whiteboard: | shd-multiplexing | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | |
Cloned To: | 1728766 (view as bug list) | Environment: |
Last Closed: | 2020-01-20 07:57:56 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1728766 | |
Description
Anees Patel 2019-07-02 11:04:56 UTC
Version-Release number of selected component:

# rpm -qa | grep gluster
glusterfs-cli-6.0-7.el7rhgs.x86_64
glusterfs-api-6.0-7.el7rhgs.x86_64
glusterfs-resource-agents-6.0-7.el7rhgs.noarch
python2-gluster-6.0-7.el7rhgs.x86_64
glusterfs-geo-replication-6.0-7.el7rhgs.x86_64
glusterfs-6.0-7.el7rhgs.x86_64
glusterfs-fuse-6.0-7.el7rhgs.x86_64
glusterfs-api-devel-6.0-7.el7rhgs.x86_64

Changing the component to CLI, as it is failing in volume start and giving inconsistent outputs in the status. It might need some attention from the glusterd folks, so CCing them as well.

From the reproducer:

5. Now start the volume from node 1:
# gluster v start test3
volume start: test3: failed: Commit failed on localhost. Please check log file for details.

The output says the volume start failed. Because the volume failed to start, the half-cooked "volume start" transaction might have written to the store that this volume is started. But since the commit failed, the commit request was not sent to the peers. That is why the peers show this volume as stopped while the originator shows it as started.

Here, we need to root-cause why the volume start transaction failed. I will follow the reproducer and try to reproduce this on my setup. It looks like it has some relation with shd too; I will update the BZ with details soon. Also, changing the component to glusterd for now.

Thanks,
Sanju

(In reply to Sanju from comment #4)
> From the reproducer:
> 5. Now start volume from node 1
> # gluster v start test3
> volume start: test3: failed: Commit failed on localhost. Please check log
> file for details.
> O/p says volume start failed.
>
> As the volume failed to start, the half-cooked "volume start" transaction
> might have written to the store that this volume is started. But as the
> commit failed, the commit request is not sent to the peers. That's why
> peers show this volume as stopped when the originator shows it as started.

No doubt, that is how it happened. The title of the bug is misleading now, and that is expected. Just like you mentioned, we should check why the volume start failed. Have we not looked at the respective glusterd logs to see what happened there?

> Here, we need to root cause why the volume start transaction has failed. I
> will follow the reproducer and try to reproduce this on my setup. Looks
> like it has some relation with shd too; will update the BZ with details
> soon.
>
> And, changing the component to glusterd for now.
>
> Thanks,
> Sanju

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
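The store inconsistency described above (originator shows the volume as started, peers still show it as stopped) can be observed directly in glusterd's on-disk store, which keeps per-volume state in `/var/lib/glusterd/vols/<volname>/info` (the `status=` line: 1 for started, 0 for stopped). Below is a minimal, hedged sketch of that check; the `/tmp/node1` and `/tmp/node2` directories are mocked stand-ins for the two nodes' store paths so the snippet is runnable anywhere, and the real check on a cluster would run the `grep` against `/var/lib/glusterd` on each node:

```shell
# Mock the glusterd store of two nodes for volume "test3".
mkdir -p /tmp/node1/vols/test3 /tmp/node2/vols/test3

# Originator (node 1): the half-cooked transaction persisted "started".
printf 'status=1\n' > /tmp/node1/vols/test3/info
# Peer (node 2): never received the commit, so the store still says "stopped".
printf 'status=0\n' > /tmp/node2/vols/test3/info

# On a real cluster, run on each node instead:
#   grep '^status=' /var/lib/glusterd/vols/test3/info
for node in node1 node2; do
    printf '%s: %s\n' "$node" "$(grep '^status=' /tmp/$node/vols/test3/info)"
done
# → node1: status=1
# → node2: status=0
```

If the two nodes print different `status=` values for the same volume, the stores have diverged exactly as the failed-commit analysis in comment #4 predicts.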