Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 788696

Summary:	Gluster CLI cmds "operation failed" when a node off network until glusterd svc is restarted.
Product:	[Community] GlusterFS	Reporter:	Jeff Byers <jbyers>
Component:	glusterd	Assignee:	Kaushal <kaushal>
Status:	CLOSED WORKSFORME	QA Contact:
Severity:	medium	Docs Contact:
Priority:	high
Version:	3.2.5	CC:	amarts, gluster-bugs, jbyers
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2012-05-10 06:20:39 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Jeff Byers 2012-02-08 20:17:08 UTC

Gluster CLI cmds "operation failed" when a node off network until glusterd svc is restarted.

When the communication path between two Gluster nodes is
disrupted, and any CLI command that changes the configuration is
attempted, it fails with "operation failed". This situation
persists until the 'glusterd' service is restarted.

To reproduce, with a two-way replica volume, disrupt
communication between the two gluster nodes by downing the
network interface on one of them.

Then attempt to change a volume option. After a longish 2 minute
timeout, the command will fail with "operation failed".

Repeating the command immediately fails with the same message,
but different exit status.

Re-establish the network communications by bringing the network
back up on the remote gluster node.

Again, attempting to change a volume option fails, and will 
continue to fail even though network path to all nodes is back 
up. CLI commands will fail until the local 'glusterd' service is 
restarted. This is a known work-around:

  Can't set volume options: "operation failed"
  http://comments.gmane.org/gmane.comp.file-systems.gluster.user/7652

But the work-around interrupts client I/O to all volumes.

Step-by-step below:

[root@S-100]# gluster peer status
Number of Peers: 1

Hostname: 192.168.5.99
Uuid: c0efccec-b0a0-a091-e517-00114331c257
State: Peer in Cluster (Connected)

[root@S-100]# gluster volume info nas-volume-0001

Volume Name: nas-volume-0001
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 192.168.5.100:/exports/nas-segment-0004/nas-volume-0001
Brick2: 192.168.5.99:/exports/nas-segment-0007/nas-volume-0001
Options Reconfigured:
performance.read-ahead: off
performance.write-behind: off
performance.stat-prefetch: on
nfs.disable: on
nfs.addr-namelookup: off

[root@S-100]# time gluster volume set nas-volume-0001 performance.stat-prefetch off
Set volume successful

real    0m2.244s

[root@S-99]# ifdown eth1

[root@S-100]# ping 192.168.5.99
PING 192.168.5.99 (192.168.5.99) 56(84) bytes of data.
From 192.168.5.100 icmp_seq=1 Destination Host Unreachable

[root@S-100]# time gluster volume set nas-volume-0001 performance.stat-prefetch on

real    2m0.144s

[root@S-100]# echo $?
110

[root@S-100]# tail -f /usr/local/var/log/glusterfs/usr-local-etc-glusterfs-glusterd.vol.log
[2012-02-08 11:37:29.987084] I [glusterd-utils.c:243:glusterd_lock] 0-glusterd: Cluster lock held by c0efccec-b0a0-a091-e517-00114331bf4c
[2012-02-08 11:37:29.987117] I [glusterd-handler.c:420:glusterd_op_txn_begin] 0-glusterd: Acquired local lock
[2012-02-08 11:37:41.950195] W [socket.c:204:__socket_rwv] 0-socket.management: readv failed (Connection timed out)
[2012-02-08 11:37:41.950282] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Connection timed out), peer (192.168.5.99:1019)
[2012-02-08 11:39:30.8670] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (127.0.0.1:952)

[root@S-100]# time gluster volume set nas-volume-0001 performance.stat-prefetch on
operation failed

real    0m0.142s

[root@S-100]# tail -f /usr/local/var/log/glusterfs/usr-local-etc-glusterfs-glusterd.vol.log
[2012-02-08 11:41:35.627701] E [glusterd-handler.c:1905:glusterd_handle_set_volume] 0-: Unable to set cli op: 16
[2012-02-08 11:41:35.628143] E [glusterd-rpc-ops.c:61:glusterd_op_send_cli_response] (-->/usr/local/lib/libgfrpc.so.0(rpcsvc_notify+0x16c) [0x2b7f31e7737c] (-->/usr/local/lib/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x291) [0x2b7f31e77171] (-->/usr/local/lib/glusterfs/3.2.5-SF.1/xlator/mgmt/glusterd.so(glusterd_handle_set_volume+0x105) [0x2aaaaaacc3a5]))) 0-: Assertion failed: op_ctx
[2012-02-08 11:41:35.630258] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (127.0.0.1:1015)

[root@S-100]# gluster volume info nas-volume-0001

Volume Name: nas-volume-0001
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 192.168.5.100:/exports/nas-segment-0004/nas-volume-0001
Brick2: 192.168.5.99:/exports/nas-segment-0007/nas-volume-0001
Options Reconfigured:
performance.read-ahead: off
performance.write-behind: off
performance.stat-prefetch: off
nfs.disable: on
nfs.addr-namelookup: off

[root@S-100]# gluster volume stop nas-volume-0001
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
operation failed
[root@S-100]# gluster peer status
Number of Peers: 1

Hostname: 192.168.5.99
Uuid: c0efccec-b0a0-a091-e517-00114331c257
State: Peer in Cluster (Connected)

[root@S-99]# ifup eth1

[root@S-100]# ping 192.168.5.99
PING 192.168.5.99 (192.168.5.99) 56(84) bytes of data.
64 bytes from 192.168.5.99: icmp_seq=1 ttl=64 time=0.121 ms

[root@S-100]# gluster peer status
Number of Peers: 1

Hostname: 192.168.5.99
Uuid: c0efccec-b0a0-a091-e517-00114331c257
State: Peer in Cluster (Connected)

[root@S-100]# time gluster volume set nas-volume-0001 performance.stat-prefetch off
operation failed

real    0m0.124s

[root@S-100]# tail -f /usr/local/var/log/glusterfs/usr-local-etc-glusterfs-glusterd.vol.log
[2012-02-08 11:44:02.822372] E [glusterd-handler.c:1905:glusterd_handle_set_volume] 0-: Unable to set cli op: 16
[2012-02-08 11:44:02.822523] E [glusterd-rpc-ops.c:61:glusterd_op_send_cli_response] (-->/usr/local/lib/libgfrpc.so.0(rpcsvc_notify+0x16c) [0x2b7f31e7737c] (-->/usr/local/lib/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x291) [0x2b7f31e77171] (-->/usr/local/lib/glusterfs/3.2.5-SF.1/xlator/mgmt/glusterd.so(glusterd_handle_set_volume+0x105) [0x2aaaaaacc3a5]))) 0-: Assertion failed: op_ctx
[2012-02-08 11:44:02.824656] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (127.0.0.1:1009)

[root@S-100]# /etc/init.d/glusterd restart
Stopping glusterd:                                         [  OK  ]
Starting glusterd:                                         [  OK  ]

[root@S-100]# time gluster volume set nas-volume-0001 performance.stat-prefetch on
Set volume successful

real    0m2.293s

[root@S-100]# glusterd --version
glusterfs 3.2.5 built on Jan  6 2012 12:57:01

Comment 1 Amar Tumballi 2012-03-12 07:41:01 UTC

need to check the behavior on master branch. Lets not focus much on 3.2.x branch, but make sure this will not be happening on master branch.

Comment 2 Kaushal 2012-05-08 11:53:54 UTC

Doesn't happen anymore on master. Will check once again and update the bug.

Comment 3 Kaushal 2012-05-10 06:20:39 UTC

This doesn't happen anymore on the master branch.