Bug 1308837

Summary: Peers go to Rejected state after reboot of one node when quota is enabled on a cloned volume.
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Shashank Raj <sraj>
Component: snapshot    Assignee: Avra Sengupta <asengupt>
Status: CLOSED ERRATA QA Contact: Anil Shah <ashah>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: rhgs-3.1    CC: asengupt, byarlaga, rcyriac, rhinduja, rhs-bugs, rjoseph, sashinde, storage-qa-internal
Target Milestone: ---    Keywords: Patch, ZStream
Target Release: RHGS 3.1.3   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-3.7.9-3 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1316848    Environment:
Last Closed: 2016-06-23 05:08:15 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1268895, 1299184, 1316848, 1329492    

Description Shashank Raj 2016-02-16 09:18:11 UTC
Description of problem:
Peers go to Rejected state after a reboot of one node when quota is enabled on a cloned volume.

Version-Release number of selected component (if applicable):
glusterfs-3.7.5-19

How reproducible:
2/2

Steps to Reproduce:
1. Create a volume, start it, enable quota on it, and attach a tier, resulting in the tiered volume shown here (example commands are sketched after the volume info below).

Volume Name: tiervolume
Type: Tier
Volume ID: ec14de5c-45dc-4a3a-80f1-b5f5b569fab2
Status: Started
Number of Bricks: 15
Transport-type: tcp
Hot Tier :
Hot Tier Type : Replicate
Number of Bricks: 1 x 3 = 3
Brick1: 10.70.35.142:/bricks/brick3/b3
Brick2: 10.70.35.141:/bricks/brick3/b3
Brick3: 10.70.35.228:/bricks/brick3/b3
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (4 + 2) = 12
Brick4: 10.70.35.228:/bricks/brick0/b0
Brick5: 10.70.35.141:/bricks/brick0/b0
Brick6: 10.70.35.142:/bricks/brick0/b0
Brick7: 10.70.35.140:/bricks/brick0/b0
Brick8: 10.70.35.228:/bricks/brick1/b1
Brick9: 10.70.35.141:/bricks/brick1/b1
Brick10: 10.70.35.142:/bricks/brick1/b1
Brick11: 10.70.35.140:/bricks/brick1/b1
Brick12: 10.70.35.228:/bricks/brick2/b2
Brick13: 10.70.35.141:/bricks/brick2/b2
Brick14: 10.70.35.142:/bricks/brick2/b2
Brick15: 10.70.35.140:/bricks/brick2/b2
Options Reconfigured:
features.barrier: disable
cluster.tier-mode: cache
features.ctr-enabled: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on
cluster.enable-shared-storage: enable

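A minimal sketch of the commands behind step 1, reconstructed from the volume info above (the disperse-data/redundancy arguments and brick ordering are assumptions based on the 2 x (4 + 2) cold tier and 1 x 3 hot tier layouts shown; the attach-tier syntax matches the one used later in comment 9):

# cold tier: distributed-disperse 2 x (4 + 2), bricks as listed above
# (force may be needed because hosts repeat within a disperse set)
gluster volume create tiervolume disperse-data 4 redundancy 2 \
    10.70.35.228:/bricks/brick0/b0 10.70.35.141:/bricks/brick0/b0 \
    10.70.35.142:/bricks/brick0/b0 10.70.35.140:/bricks/brick0/b0 \
    10.70.35.228:/bricks/brick1/b1 10.70.35.141:/bricks/brick1/b1 \
    10.70.35.142:/bricks/brick1/b1 10.70.35.140:/bricks/brick1/b1 \
    10.70.35.228:/bricks/brick2/b2 10.70.35.141:/bricks/brick2/b2 \
    10.70.35.142:/bricks/brick2/b2 10.70.35.140:/bricks/brick2/b2 force
gluster volume start tiervolume
gluster volume quota tiervolume enable
# hot tier: replica 3, bricks as listed above
gluster volume attach-tier tiervolume replica 3 \
    10.70.35.142:/bricks/brick3/b3 10.70.35.141:/bricks/brick3/b3 \
    10.70.35.228:/bricks/brick3/b3
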
2. Create a snapshot of this volume and activate it.

3. Create a clone of this snapshot and start it. Observe in the gluster volume info output that quota is enabled on the cloned volume (example commands for steps 2 and 3 are sketched after the clone's volume info below).

Volume Name: clone1
Type: Tier
Volume ID: 3dbf687c-c2cf-46c0-af4e-ca542c5bae0d
Status: Started
Number of Bricks: 15
Transport-type: tcp
Hot Tier :
Hot Tier Type : Replicate
Number of Bricks: 1 x 3 = 3
Brick1: 10.70.35.142:/run/gluster/snaps/clone1/brick1/b3
Brick2: 10.70.35.141:/run/gluster/snaps/clone1/brick2/b3
Brick3: 10.70.35.228:/run/gluster/snaps/clone1/brick3/b3
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (4 + 2) = 12
Brick4: 10.70.35.228:/run/gluster/snaps/clone1/brick4/b0
Brick5: 10.70.35.141:/run/gluster/snaps/clone1/brick5/b0
Brick6: 10.70.35.142:/run/gluster/snaps/clone1/brick6/b0
Brick7: 10.70.35.140:/run/gluster/snaps/clone1/brick7/b0
Brick8: 10.70.35.228:/run/gluster/snaps/clone1/brick8/b1
Brick9: 10.70.35.141:/run/gluster/snaps/clone1/brick9/b1
Brick10: 10.70.35.142:/run/gluster/snaps/clone1/brick10/b1
Brick11: 10.70.35.140:/run/gluster/snaps/clone1/brick11/b1
Brick12: 10.70.35.228:/run/gluster/snaps/clone1/brick12/b2
Brick13: 10.70.35.141:/run/gluster/snaps/clone1/brick13/b2
Brick14: 10.70.35.142:/run/gluster/snaps/clone1/brick14/b2
Brick15: 10.70.35.140:/run/gluster/snaps/clone1/brick15/b2
Options Reconfigured:
cluster.tier-mode: cache
features.ctr-enabled: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on
cluster.enable-shared-storage: enable

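A sketch of the commands for steps 2 and 3 (the snapshot name "snap1" is illustrative; the command syntax matches the verification run in comment 9):

gluster snapshot create snap1 tiervolume no-timestamp
gluster snapshot activate snap1
gluster snapshot clone clone1 snap1
gluster volume start clone1
gluster volume info clone1
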
4. Reboot one of the nodes. When it comes back up, observe that peer status on that node shows the other nodes in the "Peer Rejected" state.

[root@dhcp35-140 ~]#  gluster peer status
Number of Peers: 3

Hostname: dhcp35-228.lab.eng.blr.redhat.com
Uuid: 66d2c49c-dd6c-4ba3-8840-e57e34dbaf3a
State: Peer Rejected (Connected)

Hostname: 10.70.35.141
Uuid: a7e0bb8a-d7bb-4d61-83e9-67de349bd250
State: Peer Rejected (Connected)

Hostname: 10.70.35.142
Uuid: 8ea53341-1055-4288-8692-b1adc8244168
State: Peer Rejected (Connected)

5. The following messages are observed in the glusterd logs (a command to search for them is sketched after the excerpt):

[2016-02-16 08:53:28.318532] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2016-02-16 08:53:28.320218] I [MSGID: 106163] [glusterd-handshake.c:1194:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30707
The message "I [MSGID: 106163] [glusterd-handshake.c:1194:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30707" repeated 2 times between [2016-02-16 08:53:28.320218] and [2016-02-16 08:53:28.353309]
[2016-02-16 08:53:28.415908] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: a7e0bb8a-d7bb-4d61-83e9-67de349bd250
[2016-02-16 08:53:28.418308] E [MSGID: 106012] [glusterd-utils.c:2845:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume clone1 differ. local cksum = 1405646976, remote  cksum = 0 on peer 10.70.35.141
[2016-02-16 08:53:28.418453] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.35.141 (0), ret: 0
[2016-02-16 08:53:28.435497] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 8ea53341-1055-4288-8692-b1adc8244168
[2016-02-16 08:53:28.436686] E [MSGID: 106012] [glusterd-utils.c:2845:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume clone1 differ. local cksum = 1405646976, remote  cksum = 0 on peer 10.70.35.142
[2016-02-16 08:53:28.436797] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.35.142 (0), ret: 0
[2016-02-16 08:53:28.460544] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 66d2c49c-dd6c-4ba3-8840-e57e34dbaf3a
[2016-02-16 08:53:28.461872] E [MSGID: 106012] [glusterd-utils.c:2845:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume clone1 differ. local cksum = 1405646976, remote  cksum = 0 on peer dhcp35-228.lab.eng.blr.redhat.com
[2016-02-16 08:53:28.461972] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to dhcp35-228.lab.eng.blr.redhat.com (0), ret: 0
[2016-02-16 08:53:28.475055] I [MSGID: 106493] [glusterd-rpc-ops.c:481:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 8ea53341-1055-4288-8692-b1adc8244168, host: 10.70.35.142, port: 0
[2016-02-16 08:53:28.478578] I [MSGID: 106493] [glusterd-rpc-ops.c:481:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 66d2c49c-dd6c-4ba3-8840-e57e34dbaf3a, host: dhcp35-228.lab.eng.blr.redhat.com, port: 0
[2016-02-16 08:53:28.482286] I [MSGID: 106493] [glusterd-rpc-ops.c:481:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: a7e0bb8a-d7bb-4d61-83e9-67de349bd250, host: 10.70.35.141, port: 0

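A quick way to spot the same rejection on any node is to search the glusterd log (a sketch, assuming the default glusterd log location for this release):

grep -E "MSGID: 106012|Cksums of quota configuration" \
    /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
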
6. Checking under /var/lib/glusterd/snaps/<snap-name>/<snap-id> and /var/lib/glusterd/vols/clone1 shows that the "quota.cksum" file is missing, even though it is present for tiervolume under /var/lib/glusterd/vols/tiervolume (a quick check is sketched after the listing below):

[root@dhcp35-228 tiervolume]# ls
bricks           tiervolume.10.70.35.140.bricks-brick0-b0.vol  tiervolume.10.70.35.142.bricks-brick2-b2.vol
cksum            tiervolume.10.70.35.140.bricks-brick1-b1.vol  tiervolume.10.70.35.142.bricks-brick3-b3.vol
info             tiervolume.10.70.35.140.bricks-brick2-b2.vol  tiervolume.10.70.35.228.bricks-brick0-b0.vol
node_state.info  tiervolume.10.70.35.141.bricks-brick0-b0.vol  tiervolume.10.70.35.228.bricks-brick1-b1.vol
quota.cksum      tiervolume.10.70.35.141.bricks-brick1-b1.vol  tiervolume.10.70.35.228.bricks-brick2-b2.vol
quota.conf       tiervolume.10.70.35.141.bricks-brick2-b2.vol  tiervolume.10.70.35.228.bricks-brick3-b3.vol
run              tiervolume.10.70.35.141.bricks-brick3-b3.vol  tiervolume-rebalance.vol
snapd.info       tiervolume.10.70.35.142.bricks-brick0-b0.vol  tiervolume.tcp-fuse.vol
tier             tiervolume.10.70.35.142.bricks-brick1-b1.vol  trusted-tiervolume.tcp-
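
A simple check for the missing file is to compare the parent volume and the clone on the same node (paths as in step 6):

ls -l /var/lib/glusterd/vols/tiervolume/quota.cksum
ls -l /var/lib/glusterd/vols/clone1/quota.cksum
# on an affected node the second command fails: quota.cksum: No such file or directory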


Actual results:
The quota.cksum file is not copied when a snapshot and its clone are created.

Expected results:
The quota.cksum file should also be copied when a snapshot and its clone are created, and peers should not go into the Peer Rejected state after a node is rebooted.

Additional info:

Comment 3 Avra Sengupta 2016-02-22 05:45:52 UTC
It looks great. Thanks Laura

Comment 5 Avra Sengupta 2016-03-17 08:32:53 UTC
Master URL: http://review.gluster.org/#/c/13760/ (IN REVIEW)

Comment 6 Mike McCune 2016-03-28 22:16:42 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation; please see mmccune with any questions.

Comment 7 rjoseph 2016-04-27 06:24:58 UTC
Patches:

Upstream master : http://review.gluster.org/13760
Upstream release-3.7 : http://review.gluster.org/14047
Downstream : https://code.engineering.redhat.com/gerrit/73092

Comment 9 Anil Shah 2016-05-03 10:18:23 UTC
Volume Name: vol
Type: Distributed-Replicate
Volume ID: b1b9fcd8-8025-4884-8d7a-f99a346f3a18
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.46.4:/rhs/brick1/b1
Brick2: 10.70.47.46:/rhs/brick1/b2
Brick3: 10.70.46.213:/rhs/brick1/b3
Brick4: 10.70.46.148:/rhs/brick1/b4
Options Reconfigured:
performance.readdir-ahead: on
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
features.bitrot: on
features.scrub: Active
features.barrier: disable


=========================

[root@dhcp46-4 ~]# gluster v attach-tier vol replica 2 10.70.46.4:/rhs/brick2/b1 10.70.47.46:/rhs/brick2/b2 10.70.46.213:/rhs/brick2/b3 10.70.46.148:/rhs/brick2/b4
volume attach-tier: success
Tiering Migration Functionality: vol: success: Attach tier is successful on vol. use tier status to check the status.
ID: 0f504264-81c8-46e4-85b8-348ea76123b3
================================================

[root@dhcp46-4 ~]# gluster v info vol
 
Volume Name: vol
Type: Tier
Volume ID: b1b9fcd8-8025-4884-8d7a-f99a346f3a18
Status: Started
Number of Bricks: 8
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: 10.70.46.148:/rhs/brick2/b4
Brick2: 10.70.46.213:/rhs/brick2/b3
Brick3: 10.70.47.46:/rhs/brick2/b2
Brick4: 10.70.46.4:/rhs/brick2/b1
Cold Tier:
Cold Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick5: 10.70.46.4:/rhs/brick1/b1
Brick6: 10.70.47.46:/rhs/brick1/b2
Brick7: 10.70.46.213:/rhs/brick1/b3
Brick8: 10.70.46.148:/rhs/brick1/b4
Options Reconfigured:
cluster.tier-mode: cache
features.ctr-enabled: on
performance.readdir-ahead: on
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
features.bitrot: on
features.scrub: Active
features.barrier: disable
[root@dhcp46-4 ~]# 
===========================================
[root@dhcp46-4 ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.47.46
Uuid: 112df27c-d246-4b89-9b24-f52536da263c
State: Peer in Cluster (Connected)

Hostname: 10.70.46.213
Uuid: 0e6f19f6-3dde-487c-a10f-c1c53b37ed2b
State: Peer in Cluster (Connected)

Hostname: 10.70.46.148
Uuid: fc406ac0-2cd5-4aef-ab21-77707f7a17d0
State: Peer in Cluster (Connected)
===========================================
[root@dhcp46-4 ~]# gluster snapshot create snap2 vol no-timestamp
snapshot create: success: Snap snap2 created successfully
[root@dhcp46-4 ~]# gluster snapshot activate snap2
Snapshot activate: snap2: Snap activated successfully
[root@dhcp46-4 ~]# gluster snapshot clone clone2 snap2
snapshot clone: success: Clone clone2 created successfully
============================================
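
A pre-reboot sanity check worth running (a sketch; it assumes password-less root ssh and that the clone's config directory follows the same /var/lib/glusterd/vols/<volname> layout seen in the original report) is to confirm that quota.cksum now exists for clone2 on every node:

for h in 10.70.46.4 10.70.47.46 10.70.46.213 10.70.46.148; do
    ssh root@$h "ls -l /var/lib/glusterd/vols/clone2/quota.cksum"
done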


[root@dhcp46-4 ~]#  init 6
Connection to 10.70.46.4 closed by remote host.
Connection to 10.70.46.4 closed.
[ashah@localhost ~]$ ssh root@10.70.46.4
root@10.70.46.4's password: 
Last login: Tue May  3 20:37:13 2016 from dhcp-0-50.blr.redhat.com
[root@dhcp46-4 ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.47.46
Uuid: 112df27c-d246-4b89-9b24-f52536da263c
State: Peer in Cluster (Connected)

Hostname: 10.70.46.213
Uuid: 0e6f19f6-3dde-487c-a10f-c1c53b37ed2b
State: Peer in Cluster (Connected)

Hostname: 10.70.46.148
Uuid: fc406ac0-2cd5-4aef-ab21-77707f7a17d0
State: Peer in Cluster (Connected)

Bug verified on build glusterfs-3.7.9-3.el7rhgs.x86_64

Comment 12 errata-xmlrpc 2016-06-23 05:08:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240