Bug 1308837

Summary: Peers go to Rejected state after reboot of one node when quota is enabled on a cloned volume.
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Shashank Raj <sraj>
Component: snapshot    Assignee: Avra Sengupta <asengupt>
Status: CLOSED ERRATA QA Contact: Anil Shah <ashah>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: rhgs-3.1    CC: asengupt, byarlaga, rcyriac, rhinduja, rhs-bugs, rjoseph, sashinde, storage-qa-internal
Target Milestone: ---    Keywords: Patch, ZStream
Target Release: RHGS 3.1.3   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-3.7.9-3 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1316848    Environment:
Last Closed: 2016-06-23 05:08:15 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1268895, 1299184, 1316848, 1329492    

Description Shashank Raj 2016-02-16 09:18:11 UTC
Description of problem:
Peers go to Rejected state after a reboot of one node when quota is enabled on a cloned volume.

Version-Release number of selected component (if applicable):
glusterfs-3.7.5-19

How reproducible:
2/2

Steps to Reproduce:
1. Create a volume, start it, enable quota on it, and attach a tier, resulting in the tiered volume shown here (example commands are sketched after the volume info below).

Volume Name: tiervolume
Type: Tier
Volume ID: ec14de5c-45dc-4a3a-80f1-b5f5b569fab2
Status: Started
Number of Bricks: 15
Transport-type: tcp
Hot Tier :
Hot Tier Type : Replicate
Number of Bricks: 1 x 3 = 3
Brick1: 10.70.35.142:/bricks/brick3/b3
Brick2: 10.70.35.141:/bricks/brick3/b3
Brick3: 10.70.35.228:/bricks/brick3/b3
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (4 + 2) = 12
Brick4: 10.70.35.228:/bricks/brick0/b0
Brick5: 10.70.35.141:/bricks/brick0/b0
Brick6: 10.70.35.142:/bricks/brick0/b0
Brick7: 10.70.35.140:/bricks/brick0/b0
Brick8: 10.70.35.228:/bricks/brick1/b1
Brick9: 10.70.35.141:/bricks/brick1/b1
Brick10: 10.70.35.142:/bricks/brick1/b1
Brick11: 10.70.35.140:/bricks/brick1/b1
Brick12: 10.70.35.228:/bricks/brick2/b2
Brick13: 10.70.35.141:/bricks/brick2/b2
Brick14: 10.70.35.142:/bricks/brick2/b2
Brick15: 10.70.35.140:/bricks/brick2/b2
Options Reconfigured:
features.barrier: disable
cluster.tier-mode: cache
features.ctr-enabled: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on
cluster.enable-shared-storage: enable

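A minimal sketch of the commands behind step 1, reconstructed from the volume info above (the disperse-data/redundancy arguments and brick ordering are assumptions based on the 2 x (4 + 2) cold tier and 1 x 3 hot tier layouts shown; the attach-tier syntax matches the one used later in comment 9):

# cold tier: distributed-disperse 2 x (4 + 2), bricks as listed above
# (force may be needed because hosts repeat within a disperse set)
gluster volume create tiervolume disperse-data 4 redundancy 2 \
    10.70.35.228:/bricks/brick0/b0 10.70.35.141:/bricks/brick0/b0 \
    10.70.35.142:/bricks/brick0/b0 10.70.35.140:/bricks/brick0/b0 \
    10.70.35.228:/bricks/brick1/b1 10.70.35.141:/bricks/brick1/b1 \
    10.70.35.142:/bricks/brick1/b1 10.70.35.140:/bricks/brick1/b1 \
    10.70.35.228:/bricks/brick2/b2 10.70.35.141:/bricks/brick2/b2 \
    10.70.35.142:/bricks/brick2/b2 10.70.35.140:/bricks/brick2/b2 force
gluster volume start tiervolume
gluster volume quota tiervolume enable
# hot tier: replica 3, bricks as listed above
gluster volume attach-tier tiervolume replica 3 \
    10.70.35.142:/bricks/brick3/b3 10.70.35.141:/bricks/brick3/b3 \
    10.70.35.228:/bricks/brick3/b3
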
2. Create a snapshot of this volume and activate it.

3. Create a clone of this snapshot and start it. Observe in the gluster volume info output that quota is enabled on the cloned volume (example commands for steps 2 and 3 are sketched after the clone's volume info below).

Volume Name: clone1
Type: Tier
Volume ID: 3dbf687c-c2cf-46c0-af4e-ca542c5bae0d
Status: Started
Number of Bricks: 15
Transport-type: tcp
Hot Tier :
Hot Tier Type : Replicate
Number of Bricks: 1 x 3 = 3
Brick1: 10.70.35.142:/run/gluster/snaps/clone1/brick1/b3
Brick2: 10.70.35.141:/run/gluster/snaps/clone1/brick2/b3
Brick3: 10.70.35.228:/run/gluster/snaps/clone1/brick3/b3
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (4 + 2) = 12
Brick4: 10.70.35.228:/run/gluster/snaps/clone1/brick4/b0
Brick5: 10.70.35.141:/run/gluster/snaps/clone1/brick5/b0
Brick6: 10.70.35.142:/run/gluster/snaps/clone1/brick6/b0
Brick7: 10.70.35.140:/run/gluster/snaps/clone1/brick7/b0
Brick8: 10.70.35.228:/run/gluster/snaps/clone1/brick8/b1
Brick9: 10.70.35.141:/run/gluster/snaps/clone1/brick9/b1
Brick10: 10.70.35.142:/run/gluster/snaps/clone1/brick10/b1
Brick11: 10.70.35.140:/run/gluster/snaps/clone1/brick11/b1
Brick12: 10.70.35.228:/run/gluster/snaps/clone1/brick12/b2
Brick13: 10.70.35.141:/run/gluster/snaps/clone1/brick13/b2
Brick14: 10.70.35.142:/run/gluster/snaps/clone1/brick14/b2
Brick15: 10.70.35.140:/run/gluster/snaps/clone1/brick15/b2
Options Reconfigured:
cluster.tier-mode: cache
features.ctr-enabled: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on
cluster.enable-shared-storage: enable

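A sketch of the commands for steps 2 and 3 (the snapshot name "snap1" is illustrative; the command syntax matches the verification run in comment 9):

gluster snapshot create snap1 tiervolume no-timestamp
gluster snapshot activate snap1
gluster snapshot clone clone1 snap1
gluster volume start clone1
gluster volume info clone1
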
4. Reboot one of the nodes. When it comes back up, observe that peer status on that node shows the other nodes in the "Peer Rejected" state.

[root@dhcp35-140 ~]#  gluster peer status
Number of Peers: 3

Hostname: dhcp35-228.lab.eng.blr.redhat.com
Uuid: 66d2c49c-dd6c-4ba3-8840-e57e34dbaf3a
State: Peer Rejected (Connected)

Hostname: 10.70.35.141
Uuid: a7e0bb8a-d7bb-4d61-83e9-67de349bd250
State: Peer Rejected (Connected)

Hostname: 10.70.35.142
Uuid: 8ea53341-1055-4288-8692-b1adc8244168
State: Peer Rejected (Connected)

5. The following messages are observed in the glusterd logs (a command to search for them is sketched after the excerpt):

[2016-02-16 08:53:28.318532] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2016-02-16 08:53:28.320218] I [MSGID: 106163] [glusterd-handshake.c:1194:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30707
The message "I [MSGID: 106163] [glusterd-handshake.c:1194:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30707" repeated 2 times between [2016-02-16 08:53:28.320218] and [2016-02-16 08:53:28.353309]
[2016-02-16 08:53:28.415908] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: a7e0bb8a-d7bb-4d61-83e9-67de349bd250
[2016-02-16 08:53:28.418308] E [MSGID: 106012] [glusterd-utils.c:2845:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume clone1 differ. local cksum = 1405646976, remote  cksum = 0 on peer 10.70.35.141
[2016-02-16 08:53:28.418453] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.35.141 (0), ret: 0
[2016-02-16 08:53:28.435497] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 8ea53341-1055-4288-8692-b1adc8244168
[2016-02-16 08:53:28.436686] E [MSGID: 106012] [glusterd-utils.c:2845:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume clone1 differ. local cksum = 1405646976, remote  cksum = 0 on peer 10.70.35.142
[2016-02-16 08:53:28.436797] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.35.142 (0), ret: 0
[2016-02-16 08:53:28.460544] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 66d2c49c-dd6c-4ba3-8840-e57e34dbaf3a
[2016-02-16 08:53:28.461872] E [MSGID: 106012] [glusterd-utils.c:2845:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume clone1 differ. local cksum = 1405646976, remote  cksum = 0 on peer dhcp35-228.lab.eng.blr.redhat.com
[2016-02-16 08:53:28.461972] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to dhcp35-228.lab.eng.blr.redhat.com (0), ret: 0
[2016-02-16 08:53:28.475055] I [MSGID: 106493] [glusterd-rpc-ops.c:481:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 8ea53341-1055-4288-8692-b1adc8244168, host: 10.70.35.142, port: 0
[2016-02-16 08:53:28.478578] I [MSGID: 106493] [glusterd-rpc-ops.c:481:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 66d2c49c-dd6c-4ba3-8840-e57e34dbaf3a, host: dhcp35-228.lab.eng.blr.redhat.com, port: 0
[2016-02-16 08:53:28.482286] I [MSGID: 106493] [glusterd-rpc-ops.c:481:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: a7e0bb8a-d7bb-4d61-83e9-67de349bd250, host: 10.70.35.141, port: 0

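A quick way to spot the same rejection on any node is to search the glusterd log (a sketch, assuming the default glusterd log location for this release):

grep -E "MSGID: 106012|Cksums of quota configuration" \
    /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
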
6. Checking under /var/lib/glusterd/snaps/<snap-name>/<snap-id> and /var/lib/glusterd/vols/clone1 shows that the "quota.cksum" file is missing, even though it is present for tiervolume under /var/lib/glusterd/vols/tiervolume (a quick check is sketched after the listing below):

[root@dhcp35-228 tiervolume]# ls
bricks           tiervolume.10.70.35.140.bricks-brick0-b0.vol  tiervolume.10.70.35.142.bricks-brick2-b2.vol
cksum            tiervolume.10.70.35.140.bricks-brick1-b1.vol  tiervolume.10.70.35.142.bricks-brick3-b3.vol
info             tiervolume.10.70.35.140.bricks-brick2-b2.vol  tiervolume.10.70.35.228.bricks-brick0-b0.vol
node_state.info  tiervolume.10.70.35.141.bricks-brick0-b0.vol  tiervolume.10.70.35.228.bricks-brick1-b1.vol
quota.cksum      tiervolume.10.70.35.141.bricks-brick1-b1.vol  tiervolume.10.70.35.228.bricks-brick2-b2.vol
quota.conf       tiervolume.10.70.35.141.bricks-brick2-b2.vol  tiervolume.10.70.35.228.bricks-brick3-b3.vol
run              tiervolume.10.70.35.141.bricks-brick3-b3.vol  tiervolume-rebalance.vol
snapd.info       tiervolume.10.70.35.142.bricks-brick0-b0.vol  tiervolume.tcp-fuse.vol
tier             tiervolume.10.70.35.142.bricks-brick1-b1.vol  trusted-tiervolume.tcp-
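
A simple check for the missing file is to compare the parent volume and the clone on the same node (paths as in step 6):

ls -l /var/lib/glusterd/vols/tiervolume/quota.cksum
ls -l /var/lib/glusterd/vols/clone1/quota.cksum
# on an affected node the second command fails: quota.cksum: No such file or directory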


Actual results:
The quota.cksum file is not copied when a snapshot and its clone are created.

Expected results:
The quota.cksum file should also be copied when a snapshot and its clone are created, and peers should not go into the Peer Rejected state after a node is rebooted.

Additional info:

Comment 3 Avra Sengupta 2016-02-22 05:45:52 UTC
It looks great. Thanks Laura

Comment 5 Avra Sengupta 2016-03-17 08:32:53 UTC
Master URL: http://review.gluster.org/#/c/13760/ (IN REVIEW)

Comment 6 Mike McCune 2016-03-28 22:16:42 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation; please see mmccune with any questions.

Comment 7 rjoseph 2016-04-27 06:24:58 UTC
Patches:

Upstream master : http://review.gluster.org/13760
Upstream release-3.7 : http://review.gluster.org/14047
Downstream : https://code.engineering.redhat.com/gerrit/73092

Comment 9 Anil Shah 2016-05-03 10:18:23 UTC
Volume Name: vol
Type: Distributed-Replicate
Volume ID: b1b9fcd8-8025-4884-8d7a-f99a346f3a18
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.46.4:/rhs/brick1/b1
Brick2: 10.70.47.46:/rhs/brick1/b2
Brick3: 10.70.46.213:/rhs/brick1/b3
Brick4: 10.70.46.148:/rhs/brick1/b4
Options Reconfigured:
performance.readdir-ahead: on
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
features.bitrot: on
features.scrub: Active
features.barrier: disable


=========================

[root@dhcp46-4 ~]# gluster v attach-tier vol replica 2 10.70.46.4:/rhs/brick2/b1 10.70.47.46:/rhs/brick2/b2 10.70.46.213:/rhs/brick2/b3 10.70.46.148:/rhs/brick2/b4
volume attach-tier: success
Tiering Migration Functionality: vol: success: Attach tier is successful on vol. use tier status to check the status.
ID: 0f504264-81c8-46e4-85b8-348ea76123b3
================================================

[root@dhcp46-4 ~]# gluster v info vol
 
Volume Name: vol
Type: Tier
Volume ID: b1b9fcd8-8025-4884-8d7a-f99a346f3a18
Status: Started
Number of Bricks: 8
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: 10.70.46.148:/rhs/brick2/b4
Brick2: 10.70.46.213:/rhs/brick2/b3
Brick3: 10.70.47.46:/rhs/brick2/b2
Brick4: 10.70.46.4:/rhs/brick2/b1
Cold Tier:
Cold Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick5: 10.70.46.4:/rhs/brick1/b1
Brick6: 10.70.47.46:/rhs/brick1/b2
Brick7: 10.70.46.213:/rhs/brick1/b3
Brick8: 10.70.46.148:/rhs/brick1/b4
Options Reconfigured:
cluster.tier-mode: cache
features.ctr-enabled: on
performance.readdir-ahead: on
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
features.bitrot: on
features.scrub: Active
features.barrier: disable
[root@dhcp46-4 ~]# 
===========================================
[root@dhcp46-4 ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.47.46
Uuid: 112df27c-d246-4b89-9b24-f52536da263c
State: Peer in Cluster (Connected)

Hostname: 10.70.46.213
Uuid: 0e6f19f6-3dde-487c-a10f-c1c53b37ed2b
State: Peer in Cluster (Connected)

Hostname: 10.70.46.148
Uuid: fc406ac0-2cd5-4aef-ab21-77707f7a17d0
State: Peer in Cluster (Connected)
===========================================
[root@dhcp46-4 ~]# gluster snapshot create snap2 vol no-timestamp
snapshot create: success: Snap snap2 created successfully
[root@dhcp46-4 ~]# gluster snapshot activate snap2
Snapshot activate: snap2: Snap activated successfully
[root@dhcp46-4 ~]# gluster snapshot clone clone2 snap2
snapshot clone: success: Clone clone2 created successfully
============================================
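
A pre-reboot sanity check worth running (a sketch; it assumes password-less root ssh and that the clone's config directory follows the same /var/lib/glusterd/vols/<volname> layout seen in the original report) is to confirm that quota.cksum now exists for clone2 on every node:

for h in 10.70.46.4 10.70.47.46 10.70.46.213 10.70.46.148; do
    ssh root@$h "ls -l /var/lib/glusterd/vols/clone2/quota.cksum"
done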


[root@dhcp46-4 ~]#  init 6
Connection to 10.70.46.4 closed by remote host.
Connection to 10.70.46.4 closed.
[ashah@localhost ~]$ ssh root@10.70.46.4
root@10.70.46.4's password: 
Last login: Tue May  3 20:37:13 2016 from dhcp-0-50.blr.redhat.com
[root@dhcp46-4 ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.47.46
Uuid: 112df27c-d246-4b89-9b24-f52536da263c
State: Peer in Cluster (Connected)

Hostname: 10.70.46.213
Uuid: 0e6f19f6-3dde-487c-a10f-c1c53b37ed2b
State: Peer in Cluster (Connected)

Hostname: 10.70.46.148
Uuid: fc406ac0-2cd5-4aef-ab21-77707f7a17d0
State: Peer in Cluster (Connected)

Bug verified on build glusterfs-3.7.9-3.el7rhgs.x86_64

Comment 12 errata-xmlrpc 2016-06-23 05:08:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240