+++ This bug was initially created as a clone of Bug #858333 +++

Description of problem:
Running the command...

[root@gluster1 /]# gluster volume status all
operation failed
Failed to get names of volumes

...yields the above error when one of the gluster peers is shut down, rebooted, or loses communication during an active transaction (such as a heal, rebalance, etc.).

Version-Release number of selected component (if applicable):
glusterfs-3.3.0rhs-25.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a 2-node replicate volume.
2. Copy some data into the volume.
3. While the copy is in progress, reboot one of the nodes.
4. Run the command 'gluster volume status all' from the active node.

Actual results:
'gluster volume status all' fails to report any details about the other bricks in the volume, with the error:

operation failed
Failed to get names of volumes

Expected results:
'gluster volume status all' reports the status of the remaining active bricks in the volume.

Additional info:
The error is being generated from glusterd_unlock(), which means the call to uuid_compare() is returning a non-zero result.

--- Additional comment from ksquizza on 2012-09-18 13:10:21 EDT ---

Thought I'd also add that restarting the glusterd service will allow the 'gluster volume status all' command to report correctly.

--- Additional comment from kaushal on 2012-10-04 10:18:31 EDT ---

This looks like it's caused by a stale lock being held, because of a frame which hasn't been replied to or hasn't timed out yet. Waiting 30 or 10 minutes (depending on the RHS build; recent builds have the timeout changed to 10 minutes) should lead to resumption of normal activity without needing a glusterd restart.

--- Additional comment from kaushal on 2012-10-15 07:44:11 EDT ---

Commit 9c0cbe6955f702b1ca27e0f48e309382f5d59186 (rpc: Reduce frame-timeout for glusterd connections) should mitigate this problem for the present.
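The failure path named in "Additional info" can be pictured with a simplified sketch. This is NOT the actual glusterd source; the struct and function names here are invented for illustration. It only shows the idea that glusterd tags its cluster-wide transaction lock with the originator's UUID, and that glusterd_unlock() fails when the caller's UUID does not match the lock owner's (the non-zero uuid_compare() case), which is exactly the state a stale lock leaves behind until the frame times out:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for glusterd's lock state. */
typedef uint8_t uuid16[16];

typedef struct {
    uuid16 owner; /* UUID of the peer that took the lock */
    int    held;  /* non-zero while a transaction holds it */
} gd_lock_sketch;

/* uuid_compare() is effectively a byte comparison over the
 * 16-byte UUID: 0 when equal, non-zero otherwise. */
static int uuid_compare_sketch(const uuid16 a, const uuid16 b)
{
    return memcmp(a, b, sizeof(uuid16));
}

/* Returns 0 on success, -1 when the caller does not own the lock.
 * A lock held on behalf of a dead peer keeps every later caller in
 * the -1 branch ("operation failed") until the pending frame times
 * out or glusterd is restarted. */
static int gd_unlock_sketch(gd_lock_sketch *lk, const uuid16 caller)
{
    if (!lk->held || uuid_compare_sketch(lk->owner, caller) != 0)
        return -1;
    lk->held = 0;
    return 0;
}
```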
Verified the bug by executing the steps given to recreate the problem. The bug no longer exists. For 10-13 minutes after powering off one of the servers, gluster CLI commands fail with the error message "operation failed". More than 10 minutes after powering off the machine, gluster CLI commands execute successfully.

Server command execution output:
=================================

[root@darrel ~]# gluster --version
glusterfs 3.3.0.3rhs built on Oct 10 2012 09:16:20
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.

[root@darrel ~]# uname -a
Linux darrel.lab.eng.blr.redhat.com 2.6.32-220.28.1.el6.x86_64 #1 SMP Wed Oct 3 12:26:28 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

Server1:
=============
[root@darrel ~]# service glusterd start
Starting glusterd:                                         [  OK  ]
[root@darrel ~]# service glusterd status
glusterd (pid 2811) is running...
[root@darrel ~]# hostname
darrel.lab.eng.blr.redhat.com
[root@darrel ~]# gluster peer probe king.lab.eng.blr.redhat.com
Probe successful
[root@darrel ~]# gluster peer status
Number of Peers: 1

Hostname: king.lab.eng.blr.redhat.com
Port: 24007
Uuid: 0f7403e2-86dd-4347-b168-5181f4ff1c31
State: Peer in Cluster (Connected)
[root@darrel ~]# gluster volume create rep replica 2 darrel.lab.eng.blr.redhat.com:/home/export1 king.lab.eng.blr.redhat.com:/home/export1
Creation of volume rep has been successful. Please start the volume to access data.
[root@darrel ~]# gluster v info rep

Volume Name: rep
Type: Replicate
Volume ID: 665bf1a7-4289-471f-9647-e1144cd1242d
Status: Created
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: darrel.lab.eng.blr.redhat.com:/home/export1
Brick2: king.lab.eng.blr.redhat.com:/home/export1
[root@darrel ~]# gluster v start rep
Starting volume rep has been successful
[root@darrel ~]# gluster v status rep
Status of volume: rep
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick darrel.lab.eng.blr.redhat.com:/home/export1       24009   Y       2915
Brick king.lab.eng.blr.redhat.com:/home/export1         24009   Y       2879
NFS Server on localhost                                 38467   Y       2920
Self-heal Daemon on localhost                           N/A     Y       2926
NFS Server on king.lab.eng.blr.redhat.com               38467   Y       2884
Self-heal Daemon on king.lab.eng.blr.redhat.com         N/A     Y       2891

[root@darrel ~]# gluster v status rep
Unable to obtain volume status information.

[root@darrel ~]# gluster v status all
Status of volume: rep
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick darrel.lab.eng.blr.redhat.com:/home/export1       24009   Y       2915
Brick king.lab.eng.blr.redhat.com:/home/export1         24009   Y       1583
NFS Server on localhost                                 38467   Y       2920
Self-heal Daemon on localhost                           N/A     Y       2926
NFS Server on king.lab.eng.blr.redhat.com               38467   Y       1588
Self-heal Daemon on king.lab.eng.blr.redhat.com         N/A     Y       1594

Server2:
===============
[root@king ~]# gluster v status
Status of volume: rep
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick darrel.lab.eng.blr.redhat.com:/home/export1       24009   Y       2915
Brick king.lab.eng.blr.redhat.com:/home/export1         24009   Y       1583
NFS Server on localhost                                 38467   Y       1588
Self-heal Daemon on localhost                           N/A     Y       1594
NFS Server on 10.70.34.115                              38467   Y       2920
Self-heal Daemon on 10.70.34.115                        N/A     Y       2926

[root@king ~]# poweroff

Broadcast message from root.eng.blr.redhat.com (/dev/pts/0) at 23:56 ...
The system is going down for power off NOW!

Server1:
============
[root@darrel ~]# gluster v status
[root@darrel ~]# echo $?
130
[root@darrel ~]# gluster v status all
operation failed
Failed to get names of volumes
[root@darrel ~]# gluster v heal rep info
operation failed
[root@darrel ~]# gluster v set rep stat-prefetch off
[root@darrel ~]# echo $?
255
[root@darrel ~]# gluster v status all
operation failed
Failed to get names of volumes

After 13 minutes on server1:
==========================
[root@darrel ~]# gluster v status all
Status of volume: rep
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick darrel.lab.eng.blr.redhat.com:/home/export1       24009   Y       2915
NFS Server on localhost                                 38467   Y       2920
Self-heal Daemon on localhost                           N/A     Y       2926
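Given the behavior verified above (CLI commands fail with "operation failed" until the ~10-minute frame timeout releases the stale lock, then succeed again), a small polling wrapper can wait out the timeout instead of retrying by hand or restarting glusterd. This is a hypothetical helper, not part of the gluster package; the function name, 5-second poll interval, and 780-second (13-minute) budget in the usage comment are all assumptions:

```shell
#!/bin/sh
# retry_until TIMEOUT_SECS CMD [ARGS...]
# Re-runs CMD until it exits 0 or TIMEOUT_SECS elapse; returns CMD's
# success (0) or 1 on deadline. Useful for waiting out glusterd's
# frame timeout after a peer goes down.
retry_until() {
    _deadline=$(( $(date +%s) + $1 ))
    shift
    until "$@"; do
        if [ "$(date +%s)" -ge "$_deadline" ]; then
            return 1
        fi
        sleep 5
    done
    return 0
}

# Example usage on the surviving node (assumes the gluster CLI is
# installed; 780 s covers the 13 minutes observed above):
#   retry_until 780 gluster volume status all
```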
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2012-1456.html