Bug 1335357
| Field | Value |
|---|---|
| Summary | Modified volume options are not syncing once glusterd comes up. |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Reporter | Byreddy <bsrirama> |
| Component | glusterd |
| Assignee | Atin Mukherjee <amukherj> |
| Status | CLOSED ERRATA |
| QA Contact | Byreddy <bsrirama> |
| Severity | high |
| Docs Contact | |
| Priority | unspecified |
| Version | rhgs-3.1 |
| CC | asrivast, rhinduja, rhs-bugs, storage-qa-internal, vbellur |
| Target Milestone | --- |
| Keywords | Regression, ZStream |
| Target Release | RHGS 3.1.3 |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | |
| Fixed In Version | glusterfs-3.7.9-6 |
| Doc Type | Bug Fix |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| | 1335531 (view as bug list) |
| Environment | |
| Last Closed | 2016-06-23 05:22:40 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1311817, 1335359, 1335367, 1335531, 1337113, 1337114 |
Description (Byreddy, 2016-05-12 04:08:08 UTC)
I will provide the logs.

glusterd log from node-1:
=========================
[2016-05-12 05:55:30.764115] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2016-05-12 05:55:30.766892] I [MSGID: 106163] [glusterd-handshake.c:1194:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30707
[2016-05-12 05:55:30.825808] I [MSGID: 106490] [glusterd-handler.c:2600:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: f6dea3b5-a249-4108-9ac8-0d0319e9ed5d
[2016-05-12 05:55:30.827399] I [MSGID: 106009] [glusterd-utils.c:2824:glusterd_compare_friend_volume] 0-management: Version of volume Dis differ. local version = 2, remote version = 3 on peer 10.70.43.151
[2016-05-12 05:55:30.827750] C [MSGID: 106425] [glusterd-utils.c:3134:glusterd_import_new_brick] 0-management: realpath() failed for brick /bricks/brick0/h0. The underlying file system may be in bad state [Input/output error]
[2016-05-12 05:55:30.827838] E [MSGID: 106376] [glusterd-sm.c:1401:glusterd_friend_sm] 0-glusterd: handler returned: -1
[2016-05-12 05:55:30.834463] I [MSGID: 106493] [glusterd-rpc-ops.c:481:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: f6dea3b5-a249-4108-9ac8-0d0319e9ed5d, host: 10.70.43.151, port: 0
[2016-05-12 05:55:30.890545] I [MSGID: 106492] [glusterd-handler.c:2776:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: f6dea3b5-a249-4108-9ac8-0d0319e9ed5d
[2016-05-12 05:55:30.890595] I [MSGID: 106502] [glusterd-handler.c:2821:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2016-05-12 05:55:30.891211] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-nfs: setting frame-timeout to 600
[2016-05-12 05:55:31.892768] I [MSGID: 106540] [glusterd-utils.c:4384:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered MOUNTV3 successfully
[2016-05-12 05:55:31.893132] I [MSGID: 106540] [glusterd-utils.c:4393:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered MOUNTV1 successfully
[2016-05-12 05:55:31.893455] I [MSGID: 106540] [glusterd-utils.c:4402:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NFSV3 successfully
[2016-05-12 05:55:31.893808] I [MSGID: 106540] [glusterd-utils.c:4411:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NLM v4 successfully
[2016-05-12 05:55:31.894266] I [MSGID: 106540] [glusterd-utils.c:4420:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered NLM v1 successfully
[2016-05-12 05:55:31.894626] I [MSGID: 106540] [glusterd-utils.c:4429:glusterd_nfs_pmap_deregister] 0-glusterd: De-registered ACL v3 successfully
[2016-05-12 05:55:31.900965] W [socket.c:3133:socket_connect] 0-nfs: Ignore failed connection attempt on , (No such file or directory)
[2016-05-12 05:55:31.901072] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600
[2016-05-12 05:55:31.901196] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: glustershd already stopped
[2016-05-12 05:55:31.901274] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-quotad: setting frame-timeout to 600
[2016-05-12 05:55:31.901408] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: quotad already stopped
[2016-05-12 05:55:31.901463] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-bitd: setting frame-timeout to 600
[2016-05-12 05:55:31.901589] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
[2016-05-12 05:55:31.901636] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-scrub: setting frame-timeout to 600
[2016-05-12 05:55:31.901745] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped
[2016-05-12 05:55:31.909398] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2016-05-12 05:55:31.909624] I [rpc-clnt.c:984:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2016-05-12 05:55:31.909847] I [MSGID: 106493] [glusterd-rpc-ops.c:696:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: f6dea3b5-a249-4108-9ac8-0d0319e9ed5d
[2016-05-12 05:55:31.910016] W [socket.c:701:__socket_rwv] 0-nfs: readv on /var/run/gluster/b7f9049f47b3ab44cdbaf600d7f5c6f8.socket failed (Invalid argument)
[2016-05-12 05:55:31.910034] I [MSGID: 106006] [glusterd-svc-mgmt.c:323:glusterd_svc_common_rpc_notify] 0-management: nfs has disconnected from glusterd.
[2016-05-12 05:55:31.910296] W [socket.c:701:__socket_rwv] 0-management: readv on /var/run/gluster/c1eec530a1c811faf8e3d20e6c09c320.socket failed (Invalid argument)
[2016-05-12 05:55:31.910527] I [MSGID: 106005] [glusterd-handler.c:5034:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.42.77:/bricks/brick0/h0 has disconnected from glusterd.
[2016-05-12 05:55:34.993921] I [MSGID: 106499] [glusterd-handler.c:4330:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Dis
[2016-05-12 05:55:35.742207] I [MSGID: 106005] [glusterd-handler.c:5034:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.42.77:/bricks/brick0/h0 has disconnected from glusterd.
[2016-05-12 05:55:36.189912] I [MSGID: 106499] [glusterd-handler.c:4330:__glusterd_handle_status_volume] 0-management: Received status volume req for volume Dis
[2016-05-12 05:55:38.742298] I [MSGID: 106005] [glusterd-handler.c:5034:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.42.77:/bricks/brick0/h0 has disconnected from glusterd.
[2016-05-12 05:55:41.742760] I [MSGID: 106005] [glusterd-handler.c:5034:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.42.77:/bricks/brick0/h0 has disconnected from glusterd.
[2016-05-12 05:55:41.962352] I [MSGID: 106488] [glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2016-05-12 05:55:41.963934] I [MSGID: 106488] [glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2016-05-12 05:55:44.743559] I [MSGID: 106005] [glusterd-handler.c:5034:__glusterd_brick_rpc_notify] 0-management: Brick 10.70.42.77:/bricks/brick0/h0 has disconnected from glusterd.

glusterd log from node-2:
=========================
[2016-05-12 05:55:27.968405] W [socket.c:984:__socket_keepalive] 0-socket: failed to set TCP_USER_TIMEOUT -1000 on socket 7, Invalid argument
[2016-05-12 05:55:27.968450] E [socket.c:3089:socket_connect] 0-management: Failed to set keep-alive: Invalid argument
The message "I [MSGID: 106004] [glusterd-handler.c:5192:__glusterd_peer_rpc_notify] 0-management: Peer <dhcp42-77.lab.eng.blr.redhat.com> (<f6f03f6a-dc88-4592-95b4-76482b1742d5>), in state <Peer in Cluster>, has disconnected from glusterd." repeated 11 times between [2016-05-12 05:54:44.009808] and [2016-05-12 05:55:24.962631]
[2016-05-12 05:55:30.769240] I [MSGID: 106163] [glusterd-handshake.c:1194:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30707
[2016-05-12 05:55:30.828060] I [MSGID: 106490] [glusterd-handler.c:2600:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: f6f03f6a-dc88-4592-95b4-76482b1742d5
[2016-05-12 05:55:30.836792] I [MSGID: 106493] [glusterd-handler.c:3842:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to dhcp42-77.lab.eng.blr.redhat.com (0), ret: 0
[2016-05-12 05:55:30.894602] I [MSGID: 106493] [glusterd-rpc-ops.c:696:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: f6f03f6a-dc88-4592-95b4-76482b1742d5
[2016-05-12 05:55:30.894652] I [MSGID: 106492] [glusterd-handler.c:2776:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: f6f03f6a-dc88-4592-95b4-76482b1742d5
[2016-05-12 05:55:30.894683] I [MSGID: 106502] [glusterd-handler.c:2821:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
The message "I [MSGID: 106488] [glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req" repeated 3 times between [2016-05-12 05:54:20.746141] and [2016-05-12 05:55:18.238141]

RCA: The volume configuration was changed on the glusterd instance on Node 1 while glusterd was down on Node 2, so when glusterd restarts on Node 2 it imports the updated volume. During this import the brickinfo objects are created afresh, and because the underlying file system had crashed, the realpath() call in glusterd_import_new_brick() fails. As a result the volume import fails and the volume data goes inconsistent between the peers (a simplified sketch of this code path is included at the end of this report).

An upstream patch, http://review.gluster.org/14306, has been posted for review.

Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/74663/

Upstream patches:
mainline: http://review.gluster.org/#/c/14306
release-3.7: http://review.gluster.org/#/c/14410
release-3.8: http://review.gluster.org/#/c/14411

Verified this bug using the build "glusterfs-3.7.9-6" and found that the fix is working as expected. Steps done: repeated the reproduction steps mentioned in the description section. Moving to the verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240
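For context on the code path named in the RCA above, the following is a minimal, self-contained C sketch of the failure mode it describes. It is not the actual glusterd source and does not claim to reproduce the linked patches: import_brick_path() and old_real_path are illustrative names only. The sketch shows how insisting on a fresh realpath() during import makes the whole operation fail when the brick's file system returns an error, and how falling back to a previously resolved path would let the import proceed.

```c
/*
 * Illustrative sketch only (not glusterd code): resolving a brick path while
 * importing an updated volume definition, with a fallback when realpath()
 * fails because the underlying file system is in a bad state.
 */
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Resolve the on-disk path for an imported brick.
 * old_real_path may carry a path resolved before the file system went bad;
 * pass NULL when no prior resolution exists. Returns 0 on success, -1 on
 * failure (which, in the bug scenario, aborts the whole volume import). */
static int
import_brick_path(const char *brick_path, const char *old_real_path,
                  char *out, size_t out_len)
{
    char resolved[PATH_MAX];

    if (realpath(brick_path, resolved)) {
        snprintf(out, out_len, "%s", resolved);
        return 0;
    }

    /* realpath() failed, e.g. with EIO from a crashed file system. Instead
     * of failing the import (and leaving the volume data inconsistent),
     * reuse the previously known path if one is available. */
    if (old_real_path && *old_real_path) {
        snprintf(out, out_len, "%s", old_real_path);
        return 0;
    }

    fprintf(stderr, "realpath() failed for brick %s: %s\n",
            brick_path, strerror(errno));
    return -1;
}

int
main(void)
{
    char real_path[PATH_MAX];

    /* Healthy path: realpath() succeeds. */
    if (import_brick_path("/tmp", NULL, real_path, sizeof(real_path)) == 0)
        printf("imported brick path: %s\n", real_path);

    /* Bad path with a previously resolved value to fall back on. */
    if (import_brick_path("/bricks/brick0/h0", "/bricks/brick0/h0",
                          real_path, sizeof(real_path)) == 0)
        printf("imported brick path (fallback): %s\n", real_path);

    return 0;
}
```

The exact behaviour implemented for this bug is the one in the patches linked above (http://review.gluster.org/14306 and its release branch backports), not necessarily this fallback.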