Bug 1308837 - Peers go to the rejected state after reboot of one node when quota is enabled on a cloned volume
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: snapshot
Version: 3.1
Hardware: All
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.1.3
Assigned To: Avra Sengupta
QA Contact: Anil Shah
Keywords: Patch, ZStream
Depends On:
Blocks: 1268895 1299184 1316848 1329492
Reported: 2016-02-16 04:18 EST by Shashank Raj
Modified: 2016-11-07 22:53 EST
CC: 8 users

See Also:
Fixed In Version: glusterfs-3.7.9-3
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Cloned As: 1316848
Environment:
Last Closed: 2016-06-23 01:08:15 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Shashank Raj 2016-02-16 04:18:11 EST
Description of problem:
Peers go to the rejected state after reboot of one node when quota is enabled on a cloned volume.

Version-Release number of selected component (if applicable):
glusterfs-3.7.5-19

How reproducible:
2/2

Steps to Reproduce:
1. Create a volume, start it, enable quota, and attach a tier to it, as in the sketch below.
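
A hedged sketch of the commands for this step (host IPs and brick paths are taken from the volume info below; the disperse and redundancy counts are inferred from the 2 x (4 + 2) cold tier and 1 x 3 hot tier layout):

gluster volume create tiervolume disperse 6 redundancy 2 \
  10.70.35.228:/bricks/brick0/b0 10.70.35.141:/bricks/brick0/b0 \
  10.70.35.142:/bricks/brick0/b0 10.70.35.140:/bricks/brick0/b0 \
  10.70.35.228:/bricks/brick1/b1 10.70.35.141:/bricks/brick1/b1 \
  10.70.35.142:/bricks/brick1/b1 10.70.35.140:/bricks/brick1/b1 \
  10.70.35.228:/bricks/brick2/b2 10.70.35.141:/bricks/brick2/b2 \
  10.70.35.142:/bricks/brick2/b2 10.70.35.140:/bricks/brick2/b2
gluster volume start tiervolume
gluster volume quota tiervolume enable
gluster volume attach-tier tiervolume replica 3 \
  10.70.35.142:/bricks/brick3/b3 10.70.35.141:/bricks/brick3/b3 \
  10.70.35.228:/bricks/brick3/b3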

Volume Name: tiervolume
Type: Tier
Volume ID: ec14de5c-45dc-4a3a-80f1-b5f5b569fab2
Status: Started
Number of Bricks: 15
Transport-type: tcp
Hot Tier :
Hot Tier Type : Replicate
Number of Bricks: 1 x 3 = 3
Brick1: 10.70.35.142:/bricks/brick3/b3
Brick2: 10.70.35.141:/bricks/brick3/b3
Brick3: 10.70.35.228:/bricks/brick3/b3
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (4 + 2) = 12
Brick4: 10.70.35.228:/bricks/brick0/b0
Brick5: 10.70.35.141:/bricks/brick0/b0
Brick6: 10.70.35.142:/bricks/brick0/b0
Brick7: 10.70.35.140:/bricks/brick0/b0
Brick8: 10.70.35.228:/bricks/brick1/b1
Brick9: 10.70.35.141:/bricks/brick1/b1
Brick10: 10.70.35.142:/bricks/brick1/b1
Brick11: 10.70.35.140:/bricks/brick1/b1
Brick12: 10.70.35.228:/bricks/brick2/b2
Brick13: 10.70.35.141:/bricks/brick2/b2
Brick14: 10.70.35.142:/bricks/brick2/b2
Brick15: 10.70.35.140:/bricks/brick2/b2
Options Reconfigured:
features.barrier: disable
cluster.tier-mode: cache
features.ctr-enabled: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on
cluster.enable-shared-storage: enable

2. Create a snapshot of this volume and activate it.

3. Create a clone of this snapshot and start it. Observe in the gluster volume info output that quota is enabled on the cloned volume (a command sketch for steps 2 and 3 follows).
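
A hedged sketch of steps 2 and 3; the snapshot name snap1 is assumed, since the report does not name the snapshot (comment 9 below uses the same command forms):

gluster snapshot create snap1 tiervolume no-timestamp
gluster snapshot activate snap1
gluster snapshot clone clone1 snap1
gluster volume start clone1
gluster volume info clone1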

Volume Name: clone1
Type: Tier
Volume ID: 3dbf687c-c2cf-46c0-af4e-ca542c5bae0d
Status: Started
Number of Bricks: 15
Transport-type: tcp
Hot Tier :
Hot Tier Type : Replicate
Number of Bricks: 1 x 3 = 3
Brick1: 10.70.35.142:/run/gluster/snaps/clone1/brick1/b3
Brick2: 10.70.35.141:/run/gluster/snaps/clone1/brick2/b3
Brick3: 10.70.35.228:/run/gluster/snaps/clone1/brick3/b3
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (4 + 2) = 12
Brick4: 10.70.35.228:/run/gluster/snaps/clone1/brick4/b0
Brick5: 10.70.35.141:/run/gluster/snaps/clone1/brick5/b0
Brick6: 10.70.35.142:/run/gluster/snaps/clone1/brick6/b0
Brick7: 10.70.35.140:/run/gluster/snaps/clone1/brick7/b0
Brick8: 10.70.35.228:/run/gluster/snaps/clone1/brick8/b1
Brick9: 10.70.35.141:/run/gluster/snaps/clone1/brick9/b1
Brick10: 10.70.35.142:/run/gluster/snaps/clone1/brick10/b1
Brick11: 10.70.35.140:/run/gluster/snaps/clone1/brick11/b1
Brick12: 10.70.35.228:/run/gluster/snaps/clone1/brick12/b2
Brick13: 10.70.35.141:/run/gluster/snaps/clone1/brick13/b2
Brick14: 10.70.35.142:/run/gluster/snaps/clone1/brick14/b2
Brick15: 10.70.35.140:/run/gluster/snaps/clone1/brick15/b2
Options Reconfigured:
cluster.tier-mode: cache
features.ctr-enabled: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on
cluster.enable-shared-storage: enable

4. Reboot one of the nodes. When it comes back up, observe that the peer status on it shows the other nodes in the "Peer Rejected" state:

[root@dhcp35-140 ~]#  gluster peer status
Number of Peers: 3

Hostname: dhcp35-228.lab.eng.blr.redhat.com
Uuid: 66d2c49c-dd6c-4ba3-8840-e57e34dbaf3a
State: Peer Rejected (Connected)

Hostname: 10.70.35.141
Uuid: a7e0bb8a-d7bb-4d61-83e9-67de349bd250
State: Peer Rejected (Connected)

Hostname: 10.70.35.142
Uuid: 8ea53341-1055-4288-8692-b1adc8244168
State: Peer Rejected (Connected)

5. The following messages are observed in the glusterd logs:

[2016-02-16 08:53:28.318532] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2016-02-16 08:53:28.320218] I [MSGID: 106163] [glusterd-handshake.c:1194:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30707
The message "I [MSGID: 106163] [glusterd-handshake.c:1194:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30707" repeated 2 times between [2016-02-16 08:53:28.320218] and [2016-02-16 08:53:28.353309]
[2016-02-16 08:53:28.415908] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: a7e0bb8a-d7bb-4d61-83e9-67de349bd250
[2016-02-16 08:53:28.418308] E [MSGID: 106012] [glusterd-utils.c:2845:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume clone1 differ. local cksum = 1405646976, remote  cksum = 0 on peer 10.70.35.141
[2016-02-16 08:53:28.418453] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.35.141 (0), ret: 0
[2016-02-16 08:53:28.435497] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 8ea53341-1055-4288-8692-b1adc8244168
[2016-02-16 08:53:28.436686] E [MSGID: 106012] [glusterd-utils.c:2845:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume clone1 differ. local cksum = 1405646976, remote  cksum = 0 on peer 10.70.35.142
[2016-02-16 08:53:28.436797] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.35.142 (0), ret: 0
[2016-02-16 08:53:28.460544] I [MSGID: 106490] [glusterd-handler.c:2539:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 66d2c49c-dd6c-4ba3-8840-e57e34dbaf3a
[2016-02-16 08:53:28.461872] E [MSGID: 106012] [glusterd-utils.c:2845:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume clone1 differ. local cksum = 1405646976, remote  cksum = 0 on peer dhcp35-228.lab.eng.blr.redhat.com
[2016-02-16 08:53:28.461972] I [MSGID: 106493] [glusterd-handler.c:3780:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to dhcp35-228.lab.eng.blr.redhat.com (0), ret: 0
[2016-02-16 08:53:28.475055] I [MSGID: 106493] [glusterd-rpc-ops.c:481:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 8ea53341-1055-4288-8692-b1adc8244168, host: 10.70.35.142, port: 0
[2016-02-16 08:53:28.478578] I [MSGID: 106493] [glusterd-rpc-ops.c:481:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: 66d2c49c-dd6c-4ba3-8840-e57e34dbaf3a, host: dhcp35-228.lab.eng.blr.redhat.com, port: 0
[2016-02-16 08:53:28.482286] I [MSGID: 106493] [glusterd-rpc-ops.c:481:__glusterd_friend_add_cbk] 0-glusterd: Received RJT from uuid: a7e0bb8a-d7bb-4d61-83e9-67de349bd250, host: 10.70.35.141, port: 0

6. Checking under /var/lib/glusterd/snaps/<snap-name>/<snap-id> and /var/lib/glusterd/vols/clone1 shows that the "quota.cksum" file is missing, although it is present for the parent volume under /var/lib/glusterd/vols/tiervolume (a check sketch follows the listing below):

[root@dhcp35-228 tiervolume]# ls
bricks           tiervolume.10.70.35.140.bricks-brick0-b0.vol  tiervolume.10.70.35.142.bricks-brick2-b2.vol
cksum            tiervolume.10.70.35.140.bricks-brick1-b1.vol  tiervolume.10.70.35.142.bricks-brick3-b3.vol
info             tiervolume.10.70.35.140.bricks-brick2-b2.vol  tiervolume.10.70.35.228.bricks-brick0-b0.vol
node_state.info  tiervolume.10.70.35.141.bricks-brick0-b0.vol  tiervolume.10.70.35.228.bricks-brick1-b1.vol
quota.cksum      tiervolume.10.70.35.141.bricks-brick1-b1.vol  tiervolume.10.70.35.228.bricks-brick2-b2.vol
quota.conf       tiervolume.10.70.35.141.bricks-brick2-b2.vol  tiervolume.10.70.35.228.bricks-brick3-b3.vol
run              tiervolume.10.70.35.141.bricks-brick3-b3.vol  tiervolume-rebalance.vol
snapd.info       tiervolume.10.70.35.142.bricks-brick0-b0.vol  tiervolume.tcp-fuse.vol
tier             tiervolume.10.70.35.142.bricks-brick1-b1.vol  trusted-tiervolume.tcp-
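
A quick check for the missing file (a hedged sketch; run on any node, paths as above):

ls -l /var/lib/glusterd/vols/tiervolume/quota.cksum   # present for the parent volume
ls -l /var/lib/glusterd/vols/clone1/quota.cksum       # fails: the file was never copied to the clone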


Actual results:
The quota.cksum file is not copied when a snapshot or a clone is created.

Expected results:
The quota.cksum file should also be copied when a snapshot or a clone is created, and peers should not go into the rejected state after a node is rebooted.
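
With the fix in place, a cross-node comparison such as this hedged sketch (node IPs from the setup above; passwordless root SSH is assumed) should report an identical checksum for the clone on every peer:

for h in 10.70.35.140 10.70.35.141 10.70.35.142 10.70.35.228; do
  ssh root@$h 'hostname; cksum /var/lib/glusterd/vols/clone1/quota.cksum'
done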

Additional info:
Comment 3 Avra Sengupta 2016-02-22 00:45:52 EST
It looks great. Thanks, Laura.
Comment 5 Avra Sengupta 2016-03-17 04:32:53 EDT
Master URL: http://review.gluster.org/#/c/13760/ (IN REVIEW)
Comment 6 Mike McCune 2016-03-28 18:16:42 EDT
This bug was accidentally moved from POST to MODIFIED via an error in automation; please contact mmccune@redhat.com with any questions.
Comment 7 rjoseph 2016-04-27 02:24:58 EDT
Patches:

Upstream master : http://review.gluster.org/13760
Upstream release-3.7 : http://review.gluster.org/14047
Downstream : https://code.engineering.redhat.com/gerrit/73092
Comment 9 Anil Shah 2016-05-03 06:18:23 EDT
Volume Name: vol
Type: Distributed-Replicate
Volume ID: b1b9fcd8-8025-4884-8d7a-f99a346f3a18
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.46.4:/rhs/brick1/b1
Brick2: 10.70.47.46:/rhs/brick1/b2
Brick3: 10.70.46.213:/rhs/brick1/b3
Brick4: 10.70.46.148:/rhs/brick1/b4
Options Reconfigured:
performance.readdir-ahead: on
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
features.bitrot: on
features.scrub: Active
features.barrier: disable


=========================

[root@dhcp46-4 ~]# gluster v attach-tier vol replica 2 10.70.46.4:/rhs/brick2/b1 10.70.47.46:/rhs/brick2/b2 10.70.46.213:/rhs/brick2/b3 10.70.46.148:/rhs/brick2/b4
volume attach-tier: success
Tiering Migration Functionality: vol: success: Attach tier is successful on vol. use tier status to check the status.
ID: 0f504264-81c8-46e4-85b8-348ea76123b3
================================================

[root@dhcp46-4 ~]# gluster v info vol
 
Volume Name: vol
Type: Tier
Volume ID: b1b9fcd8-8025-4884-8d7a-f99a346f3a18
Status: Started
Number of Bricks: 8
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: 10.70.46.148:/rhs/brick2/b4
Brick2: 10.70.46.213:/rhs/brick2/b3
Brick3: 10.70.47.46:/rhs/brick2/b2
Brick4: 10.70.46.4:/rhs/brick2/b1
Cold Tier:
Cold Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick5: 10.70.46.4:/rhs/brick1/b1
Brick6: 10.70.47.46:/rhs/brick1/b2
Brick7: 10.70.46.213:/rhs/brick1/b3
Brick8: 10.70.46.148:/rhs/brick1/b4
Options Reconfigured:
cluster.tier-mode: cache
features.ctr-enabled: on
performance.readdir-ahead: on
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
features.bitrot: on
features.scrub: Active
features.barrier: disable
[root@dhcp46-4 ~]# 
===========================================
[root@dhcp46-4 ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.47.46
Uuid: 112df27c-d246-4b89-9b24-f52536da263c
State: Peer in Cluster (Connected)

Hostname: 10.70.46.213
Uuid: 0e6f19f6-3dde-487c-a10f-c1c53b37ed2b
State: Peer in Cluster (Connected)

Hostname: 10.70.46.148
Uuid: fc406ac0-2cd5-4aef-ab21-77707f7a17d0
State: Peer in Cluster (Connected)
===========================================
[root@dhcp46-4 ~]# gluster snapshot create snap2 vol no-timestamp
snapshot create: success: Snap snap2 created successfully
[root@dhcp46-4 ~]# gluster snapshot activate snap2
Snapshot activate: snap2: Snap activated successfully
[root@dhcp46-4 ~]# gluster snapshot clone clone2 snap2
snapshot clone: success: Clone clone2 created successfully
============================================


[root@dhcp46-4 ~]#  init 6
Connection to 10.70.46.4 closed by remote host.
Connection to 10.70.46.4 closed.
[ashah@localhost ~]$ ssh root@10.70.46.4
root@10.70.46.4's password: 
Last login: Tue May  3 20:37:13 2016 from dhcp-0-50.blr.redhat.com
[root@dhcp46-4 ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.47.46
Uuid: 112df27c-d246-4b89-9b24-f52536da263c
State: Peer in Cluster (Connected)

Hostname: 10.70.46.213
Uuid: 0e6f19f6-3dde-487c-a10f-c1c53b37ed2b
State: Peer in Cluster (Connected)

Hostname: 10.70.46.148
Uuid: fc406ac0-2cd5-4aef-ab21-77707f7a17d0
State: Peer in Cluster (Connected)

Bug verified on build glusterfs-3.7.9-3.el7rhgs.x86_64
Comment 12 errata-xmlrpc 2016-06-23 01:08:15 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240
