Description of problem:

While validating BZ#1716626, hit this issue. The volume info output is not consistent across the cluster: two nodes report the volume in Stopped state, whereas one node reports it as Started.

Node1:
[root@dhcp35-50 ~]# gluster v info test3

Volume Name: test3
Type: Replicate
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.35.50:/bricks/brick1/tes3
Brick2: 10.70.46.216:/bricks/brick1/tes3
Brick3: 10.70.46.132:/bricks/brick1/tes3
Options Reconfigured:
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off

[root@dhcp35-50 ~]# gluster v status test3
Staging failed on 10.70.46.216. Error: Volume test3 is not started
Staging failed on 10.70.46.132. Error: Volume test3 is not started

Node2:
[root@dhcp46-216 ~]# gluster v info test3 | egrep 'Volume ID|Status'
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Stopped

Node3:
[root@dhcp46-132 ~]# gluster v info test3 | egrep 'Volume ID|Status'
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Stopped

==================================================

How reproducible:
2/2

Steps to Reproduce:
1. Create 2 replica 3 volumes.

2. Stop 1 volume, executing the command on node 1 (35.50):
[root@dhcp35-50 ~]# gluster v stop test3
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: test3: success

3. Kill shd on one node:
kill -15 5928

4. Check gluster v info from all 3 nodes. The volume is in Stopped state, as seen from the output of all three nodes.

5. Now start the volume from node 1:
# gluster v start test3
volume start: test3: failed: Commit failed on localhost. Please check log file for details.
The output says the volume start failed.

6. Now check the vol info output on all three nodes:

Node1:
[root@dhcp35-50 ~]# gluster v info test3 | egrep 'Volume ID|Status'
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Started

Node2:
[root@dhcp46-216 ~]# gluster v info test3 | egrep 'Volume ID|Status'
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Stopped

Node3:
[root@dhcp46-132 ~]# gluster v info test3 | egrep 'Volume ID|Status'
Volume ID: 11e30537-ce20-42d6-8a5e-a2668dc6b983
Status: Stopped

Actual results:
As described above in Steps to Reproduce.

Expected results:
1. The volume should start without any error (confirmed that the volume starts in the older release, glusterfs-fuse-3.12.2-47.2.el7rhgs.x86_64).
2. Command output should be consistent when executed from any node, as all automation cases randomly take any node as the master for command execution (a simple cross-node check like the sketch below can catch such divergence).
3. Volume start force should bring up shd on the node where it was killed (confirmed on the older release, glusterfs-fuse-3.12.2-47.2.el7rhgs.x86_64).

Additional info:
There is also a discrepancy in the output of vol status when executed from different nodes:

[root@dhcp35-50 ~]# gluster v status test3
Staging failed on 10.70.46.132. Error: Volume test3 is not started
Staging failed on 10.70.46.216. Error: Volume test3 is not started

[root@dhcp46-132 ~]# gluster v status test3
Volume test3 is not started

[root@dhcp46-216 ~]# gluster v status test3
Volume test3 is not started

[root@dhcp46-216 ~]# gluster v start test3 force
volume start: test3: failed: Commit failed on dhcp35-50.lab.eng.blr.redhat.com. Please check log file for details.

The issue is consistently reproducible. I will upload the sos-reports in the following comment.
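For automation, the divergence is easy to surface with a minimal cross-node check run from any peer. This is only a sketch, assuming passwordless ssh to the peers; the node IPs are the ones from this setup:

#!/bin/bash
# Minimal sketch (assumes passwordless ssh to all peers): print the
# Status line of 'gluster v info' for the same volume from every node,
# so a Started/Stopped mismatch is immediately visible.
VOL=test3
for node in 10.70.35.50 10.70.46.216 10.70.46.132; do
    status=$(ssh "$node" "gluster v info $VOL" | awk -F': ' '/^Status/ {print $2}')
    echo "$node: $status"
done

Running this after step 6 above should print Started for 10.70.35.50 and Stopped for the other two nodes.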
Version-Release number of selected component:
# rpm -qa | grep gluster
glusterfs-cli-6.0-7.el7rhgs.x86_64
glusterfs-api-6.0-7.el7rhgs.x86_64
glusterfs-resource-agents-6.0-7.el7rhgs.noarch
python2-gluster-6.0-7.el7rhgs.x86_64
glusterfs-geo-replication-6.0-7.el7rhgs.x86_64
glusterfs-6.0-7.el7rhgs.x86_64
glusterfs-fuse-6.0-7.el7rhgs.x86_64
glusterfs-api-devel-6.0-7.el7rhgs.x86_64
Changing the component to CLI, as the failure is in volume start and the status output is inconsistent. This might need some attention from the glusterd folks; CCing them as well.
From the reproducer:

5. Now start the volume from node 1:
# gluster v start test3
volume start: test3: failed: Commit failed on localhost. Please check log file for details.
The output says the volume start failed.

As the volume failed to start, the half-finished "volume start" transaction might have written to the store that this volume is started. But as the commit failed, the commit request was never sent to the peers. That's why the peers show this volume as stopped while the originator shows it as started.

Here, we need to root-cause why the volume start transaction failed. I will follow the reproducer and try to reproduce this on my setup. It looks like it has some relation with shd too; will update the BZ with details soon.

Also, changing the component to glusterd for now.

Thanks,
Sanju
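One way to confirm the half-written-transaction theory is to compare the persisted volume state on each node directly. A sketch, assuming the usual glusterd store layout under /var/lib/glusterd (the exact encoding of the status key should be verified against the source):

# Run on each node: read the persisted status of test3 from the glusterd
# store. Assumption: the store lives at /var/lib/glusterd/vols/<volname>/info
# and keeps a 'status=' key (1 = started, 0 = stopped).
grep '^status=' /var/lib/glusterd/vols/test3/info

If the theory holds, node 1 should show status=1 while the two peers show status=0, matching the divergent gluster v info output.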
(In reply to Sanju from comment #4)
> From the reproducer:
> 5. Now start the volume from node 1:
> # gluster v start test3
> volume start: test3: failed: Commit failed on localhost. Please check log
> file for details.
> The output says the volume start failed.
>
> As the volume failed to start, the half-finished "volume start" transaction
> might have written to the store that this volume is started. But as the
> commit failed, the commit request was never sent to the peers. That's why
> the peers show this volume as stopped while the originator shows it as
> started.

No doubt, that's how it has happened. The title of the bug is now misleading, and that's expected. Just like you mentioned, we should check why the volume start failed. Have we not looked at the respective glusterd logs to see what happened there?

> Here, we need to root-cause why the volume start transaction failed. I
> will follow the reproducer and try to reproduce this on my setup. It looks
> like it has some relation with shd too; will update the BZ with details
> soon.
>
> Also, changing the component to glusterd for now.
>
> Thanks,
> Sanju
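On the log question: the commit failure on the originator should leave a trace in the glusterd log around the time of the start command. A sketch of what to look for, using the standard log path; the grep patterns are starting points, not exact message strings:

# On node 1 (the originator), look for the failed commit of the start op
# around the time the command was run. Patterns are guesses, not the
# exact log message text.
grep -iE 'commit|volume start|test3' /var/log/glusterfs/glusterd.log | tail -50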
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days