Bug 866758
Summary: | gluster volume status all "Failed to get names of volumes" when peer in volume is restarted during transaction | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Scott Haines <shaines>
Component: | glusterfs | Assignee: | Amar Tumballi <amarts>
Status: | CLOSED ERRATA | QA Contact: | spandura
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 2.0 | CC: | cww, kristof.wevers, ksquizza, ndevos, rhs-bugs, shaines, spandura, ujjwala, vbellur, vinaraya, vraman
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | glusterfs-3.3.0.3rhs-33.el6rhs | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | 858333 | Environment: |
Last Closed: | 2012-11-12 18:47:38 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 858333 | |
Bug Blocks: | | |
Description
Scott Haines
2012-10-16 04:38:46 UTC
Verified the bug by executing the steps given to recreate the problem. The bug no longer exists. For 10-13 minutes after powering off one of the servers, gluster CLI commands fail with the error message "operation failed". More than 10 minutes after powering off the machine, gluster CLI commands succeed again.

Server command execution output:

```
[root@darrel ~]# gluster --version
glusterfs 3.3.0.3rhs built on Oct 10 2012 09:16:20
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.

[root@darrel ~]# uname -a
Linux darrel.lab.eng.blr.redhat.com 2.6.32-220.28.1.el6.x86_64 #1 SMP Wed Oct 3 12:26:28 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
```

Server1:

```
[root@darrel ~]# service glusterd start
Starting glusterd:                                         [  OK  ]
[root@darrel ~]# service glusterd status
glusterd (pid 2811) is running...
[root@darrel ~]# hostname
darrel.lab.eng.blr.redhat.com
[root@darrel ~]# gluster peer probe king.lab.eng.blr.redhat.com
Probe successful
[root@darrel ~]# gluster peer status
Number of Peers: 1

Hostname: king.lab.eng.blr.redhat.com
Port: 24007
Uuid: 0f7403e2-86dd-4347-b168-5181f4ff1c31
State: Peer in Cluster (Connected)
[root@darrel ~]# gluster volume create rep replica 2 darrel.lab.eng.blr.redhat.com:/home/export1 king.lab.eng.blr.redhat.com:/home/export1
Creation of volume rep has been successful. Please start the volume to access data.
```
```
[root@darrel ~]# gluster v info rep

Volume Name: rep
Type: Replicate
Volume ID: 665bf1a7-4289-471f-9647-e1144cd1242d
Status: Created
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: darrel.lab.eng.blr.redhat.com:/home/export1
Brick2: king.lab.eng.blr.redhat.com:/home/export1
[root@darrel ~]# gluster v start rep
Starting volume rep has been successful
[root@darrel ~]# gluster v status rep
Status of volume: rep
Gluster process                                        Port    Online  Pid
------------------------------------------------------------------------------
Brick darrel.lab.eng.blr.redhat.com:/home/export1      24009   Y       2915
Brick king.lab.eng.blr.redhat.com:/home/export1        24009   Y       2879
NFS Server on localhost                                38467   Y       2920
Self-heal Daemon on localhost                          N/A     Y       2926
NFS Server on king.lab.eng.blr.redhat.com              38467   Y       2884
Self-heal Daemon on king.lab.eng.blr.redhat.com        N/A     Y       2891

[root@darrel ~]# gluster v status rep
Unable to obtain volume status information.
[root@darrel ~]# gluster v status all
Status of volume: rep
Gluster process                                        Port    Online  Pid
------------------------------------------------------------------------------
Brick darrel.lab.eng.blr.redhat.com:/home/export1      24009   Y       2915
Brick king.lab.eng.blr.redhat.com:/home/export1        24009   Y       1583
NFS Server on localhost                                38467   Y       2920
Self-heal Daemon on localhost                          N/A     Y       2926
NFS Server on king.lab.eng.blr.redhat.com              38467   Y       1588
Self-heal Daemon on king.lab.eng.blr.redhat.com        N/A     Y       1594
```

Server2:

```
[root@king ~]# gluster v status
Status of volume: rep
Gluster process                                        Port    Online  Pid
------------------------------------------------------------------------------
Brick darrel.lab.eng.blr.redhat.com:/home/export1      24009   Y       2915
Brick king.lab.eng.blr.redhat.com:/home/export1        24009   Y       1583
NFS Server on localhost                                38467   Y       1588
Self-heal Daemon on localhost                          N/A     Y       1594
NFS Server on 10.70.34.115                             38467   Y       2920
Self-heal Daemon on 10.70.34.115                       N/A     Y       2926
[root@king ~]# poweroff

Broadcast message from root.eng.blr.redhat.com (/dev/pts/0) at 23:56 ...
```
```
The system is going down for power off NOW!
```

Server1:

```
[root@darrel ~]# gluster v status
[root@darrel ~]# echo $?
130
[root@darrel ~]# gluster v status all
operation failed
Failed to get names of volumes
[root@darrel ~]# gluster v heal rep info
operation failed
[root@darrel ~]# gluster v set rep stat-prefetch off
[root@darrel ~]# echo $?
255
[root@darrel ~]# gluster v status all
operation failed
Failed to get names of volumes
```

After 13 minutes on Server1:

```
[root@darrel ~]# gluster v status all
Status of volume: rep
Gluster process                                        Port    Online  Pid
------------------------------------------------------------------------------
Brick darrel.lab.eng.blr.redhat.com:/home/export1      24009   Y       2915
NFS Server on localhost                                38467   Y       2920
Self-heal Daemon on localhost                          N/A     Y       2926
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1456.html
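The verification above measured the 10-13 minute outage by rerunning `gluster v status all` by hand. A small polling helper like the following can time the recovery window instead; this is only a sketch, and the `wait_for_cmd`, timeout, and interval names are illustrative, not part of the gluster CLI:

```shell
#!/bin/sh
# Retry a command until it succeeds or a deadline passes, and report
# how long recovery took. Usage: wait_for_cmd TIMEOUT_S INTERVAL_S CMD...
wait_for_cmd() {
    timeout=$1; interval=$2; shift 2
    start=$(date +%s)
    while :; do
        if "$@" >/dev/null 2>&1; then
            # Command succeeded: print elapsed seconds since first attempt.
            echo "recovered after $(( $(date +%s) - start ))s"
            return 0
        fi
        if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
            echo "still failing after ${timeout}s"
            return 1
        fi
        sleep "$interval"
    done
}

# Against the symptom in this bug one might run (hypothetical invocation,
# requires a gluster node):
#   wait_for_cmd 900 10 gluster volume status all
```

Run right after powering off the peer, the reported elapsed time corresponds to the window during which `gluster v status all` returns "operation failed / Failed to get names of volumes".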