Description of problem:
=======================
If glusterd is down on a node and a snapshot is deactivated or activated from a different node, the activation or deactivation succeeds. But when the node comes back up, the snapshot status is not updated on that node; it remains as it was before the node went down.

For example:
============
>> Status of the snapshot initially from all the nodes: Started

[root@inception ~]# gluster snapshot info RS2 | grep "Status"
Status                    : Started
[root@inception ~]#

[root@rhs-arch-srv2 ~]# gluster snapshot info RS2 | grep "Status"
Status                    : Started
[root@rhs-arch-srv2 ~]#

>> Stop glusterd on one of the nodes:

[root@rhs-arch-srv2 ~]# service glusterd status
glusterd is stopped
[root@rhs-arch-srv2 ~]#

>> Deactivate the snapshot from one of the nodes:

[root@inception ~]# gluster snapshot deactivate RS2
Deactivating snap will make its data inaccessible. Do you want to continue? (y/n) y
Snapshot deactivate: RS2: Snap deactivated successfully
[root@inception ~]#

>> Status on all the machines which are UP is deactivated:

[root@inception ~]# gluster snapshot info RS2 | grep "Status"
Status                    : Stopped
[root@inception ~]#

>> Bring back the node which was down:

[root@rhs-arch-srv2 ~]# service glusterd status
glusterd (pid 20450) is running...
[root@rhs-arch-srv2 ~]#

>> Check the status of the snap on all the nodes; it is deactivated on all the nodes except the node which was down and came back:

[root@inception ~]# gluster snapshot info RS2 | grep "Status"
Status                    : Stopped
[root@inception ~]#

[root@rhs-arch-srv2 ~]# gluster snapshot info RS2 | grep "Status"
Status                    : Started
[root@rhs-arch-srv2 ~]#

Version-Release number of selected component (if applicable):
==============================================================
glusterfs-3.6.0.25-1.el6rhs.x86_64

How reproducible:
=================
1/1

Steps to Reproduce:
===================
1. Have a volume on a multi-node cluster
2. Create a snapshot
3. Bring down one of the nodes in the cluster
4. Deactivate the snapshot, which should be successful
5. Bring the node back UP
6. Check the status of the snapshot

Actual results:
===============
The snapshot is deactivated on all the nodes except the node which was brought back UP.

Expected results:
=================
Once the node is brought back online, a handshake should be performed to set the correct status.

Additional info:
================
The only way to get the correct status is to activate the snapshot again using force and then deactivate it once all nodes are UP.
Based on the discussion, removing the blocker flag from this bug.
Please review and sign off on the edited doc text.
Version : glusterfs 3.6.0.33

With the latest change that snapshots are deactivated by default and must be explicitly activated before use, fixing this bug takes higher priority.
Scenario for comment 4:
1. Create a 4-node cluster
2. Create a 6*2 volume
3. Start the volume
4. Create a snapshot of the volume (snap1)
5. Kill glusterd on node2
6. Activate the snapshot snap1
7. Activating the snapshot should succeed, and it should bring the 9 brick processes on node1, node3, and node4 Online
8. Bring back glusterd on node2
9. Once glusterd comes back on node2, it does not start the snapshot brick processes on node2

Network fluctuation and glusterd going down are valid use cases, and activating/deactivating a snapshot during that period will lead to inconsistent snapshot states. The chances of hitting this are now very high. One way of preventing it for this release is to disallow activate/deactivate while a node/glusterd is down, unless the user explicitly issues activate/deactivate force.
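The guard proposed above (reject activate/deactivate while any glusterd is down, unless force is given) can be sketched as follows. This is illustrative logic only, not GlusterD code; the function names and peer dictionaries are hypothetical.

```python
def down_peers(peers):
    """Return the names of peers whose glusterd is not reachable."""
    return [p["name"] for p in peers if not p["connected"]]

def snapshot_state_change(peers, snap, action, force=False):
    """Reject activate/deactivate if any peer is down, unless forced."""
    down = down_peers(peers)
    if down and not force:
        raise RuntimeError(
            "snapshot %s of %s rejected: glusterd is down on %s. "
            "Use '%s ... force' to override."
            % (action, snap, ", ".join(down), action)
        )
    return "Snapshot %s: %s successful" % (action, snap)

peers = [
    {"name": "node1", "connected": True},
    {"name": "node2", "connected": False},  # glusterd killed on node2
    {"name": "node3", "connected": True},
]

# Without force the operation is rejected while node2 is down:
try:
    snapshot_state_change(peers, "snap1", "activate")
except RuntimeError as e:
    print(e)

# With force, the user explicitly accepts the inconsistency risk:
print(snapshot_state_change(peers, "snap1", "activate", force=True))
```

With such a check in place, the inconsistent-state window only opens when the user explicitly asks for it with force.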
RCA:
During the glusterd handshake, we are not checking the version of the snaps. Whenever a change is made to a snap, its version is incremented. So during the handshake we have to compare the version of the peer's snap data with the local snap data; if the version of the snap details on the local host is lower than the peer's, the data on the local host must be updated.

upstream patch : http://review.gluster.org/#/c/9664/
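The version-compare step described in the RCA can be sketched as follows. This is a minimal illustration under assumed data structures (a per-snap dict with `version` and `status` keys), not the actual glusterd handshake code.

```python
def reconcile_snaps(local_snaps, peer_snaps):
    """During handshake, adopt the peer's snap data wherever the
    local version counter is behind the peer's."""
    for name, peer in peer_snaps.items():
        local = local_snaps.get(name)
        if local is None or local["version"] < peer["version"]:
            # Local copy is stale (or missing): take the peer's copy,
            # which carries the up-to-date status.
            local_snaps[name] = dict(peer)
    return local_snaps

# Node that was down still thinks RS2 is Started at version 1;
# the deactivate performed elsewhere bumped the peer's version to 2.
local = {"RS2": {"version": 1, "status": "Started"}}
peer = {"RS2": {"version": 2, "status": "Stopped"}}

reconcile_snaps(local, peer)
print(local["RS2"]["status"])  # prints "Stopped"
```

Because every snap modification bumps the version, the plain integer comparison is enough for the rejoining node to detect and pick up the activate/deactivate it missed.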
Version : glusterfs-3.7.1-4.el6rhs.x86_64
========

Create a snapshot. It is deactivated by default.
Stop glusterd on Node2.
Activate the snapshot from Node1 - successful.
Bring back glusterd on Node2.
Check gluster snapshot info from Node2 - snapshot status shows 'Started'.

Bring down glusterd on Node4 while deactivating the activated snapshot; when glusterd comes back up, check on Node4 - gluster snapshot info shows Status 'Stopped' and snapshot status shows the bricks are not running.

The above is as expected.

However, when a node is brought down and the snapshot is activated, then when the node comes back, snapshot info on that node still shows 'Stopped' and status shows its bricks are not running.

Snapshot info from other nodes:
===============================
gluster snapshot info Snap2_GMT-2015.06.23-09.37.26
Snapshot                  : Snap2_GMT-2015.06.23-09.37.26
Snap UUID                 : 5961a313-62ea-41d0-8cad-0a8a0fafe766
Created                   : 2015-06-23 09:37:26
Snap Volumes:

	Snap Volume Name          : 3ee8f93e484540dcae8d55a64702e961
	Origin Volume name        : vol0
	Snaps taken for vol0      : 2
	Snaps available for vol0  : 1
	Status                    : Started

Node2 (which was rebooted)
==========================
gluster snapshot info Snap2_GMT-2015.06.23-09.37.26
Snapshot                  : Snap2_GMT-2015.06.23-09.37.26
Snap UUID                 : 5961a313-62ea-41d0-8cad-0a8a0fafe766
Created                   : 2015-06-23 09:37:26
Snap Volumes:

	Snap Volume Name          : 3ee8f93e484540dcae8d55a64702e961
	Origin Volume name        : vol0
	Snaps taken for vol0      : 2
	Snaps available for vol0  : 1
	Status                    : Stopped

[root@rhs-arch-srv2 ~]# gluster snapshot status Snap2_GMT-2015.06.23-09.37.26

Snap Name : Snap2_GMT-2015.06.23-09.37.26
Snap UUID : 5961a313-62ea-41d0-8cad-0a8a0fafe766

	Brick Path        :   inception.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick1/b1
	Volume Group      :   RHS_vg1
	Brick Running     :   Yes
	Brick PID         :   7536
	Data Percentage   :   0.05
	LV Size           :   1.80t

	Brick Path        :   rhs-arch-srv2.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick2/b1
	Volume Group      :   RHS_vg1
	Brick Running     :   No
	Brick PID         :   N/A
	Data Percentage   :   0.13
	LV Size           :   29.66g

	Brick Path        :   rhs-arch-srv3.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick3/b1
	Volume Group      :   RHS_vg1
	Brick Running     :   Yes
	Brick PID         :   14376
	Data Percentage   :   0.13
	LV Size           :   29.66g

	Brick Path        :   rhs-arch-srv4.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick4/b1
	Volume Group      :   RHS_vg1
	Brick Running     :   Yes
	Brick PID         :   7975
	Data Percentage   :   0.13
	LV Size           :   29.66g

	Brick Path        :   inception.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick5/b2
	Volume Group      :   RHS_vg2
	Brick Running     :   Yes
	Brick PID         :   7554
	Data Percentage   :   0.05
	LV Size           :   1.80t

	Brick Path        :   rhs-arch-srv2.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick6/b2
	Volume Group      :   RHS_vg2
	Brick Running     :   No
	Brick PID         :   N/A
	Data Percentage   :   0.05
	LV Size           :   1.80t

	Brick Path        :   rhs-arch-srv3.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick7/b2
	Volume Group      :   RHS_vg2
	Brick Running     :   Yes
	Brick PID         :   14394
	Data Percentage   :   0.05
	LV Size           :   1.80t

	Brick Path        :   rhs-arch-srv4.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick8/b2
	Volume Group      :   RHS_vg2
	Brick Running     :   Yes
	Brick PID         :   7993
	Data Percentage   :   0.03
	LV Size           :   7.26t

	Brick Path        :   inception.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick9/b3
	Volume Group      :   RHS_vg3
	Brick Running     :   Yes
	Brick PID         :   7572
	Data Percentage   :   0.05
	LV Size           :   1.80t

	Brick Path        :   rhs-arch-srv2.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick10/b3
	Volume Group      :   RHS_vg3
	Brick Running     :   No
	Brick PID         :   N/A
	Data Percentage   :   0.03
	LV Size           :   7.26t

	Brick Path        :   rhs-arch-srv3.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick11/b3
	Volume Group      :   RHS_vg3
	Brick Running     :   Yes
	Brick PID         :   14412
	Data Percentage   :   0.03
	LV Size           :   7.26t

	Brick Path        :   rhs-arch-srv4.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick12/b4
	Volume Group      :   RHS_vg4
	Brick Running     :   Yes
	Brick PID         :   8011
	Data Percentage   :   0.03
	LV Size           :   7.26t

	Brick Path        :   inception.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick13/b5
	Volume Group      :   RHS_vg5
	Brick Running     :   Yes
	Brick PID         :   7590
	Data Percentage   :   0.05
	LV Size           :   1.80t

	Brick Path        :   rhs-arch-srv2.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick14/b5
	Volume Group      :   RHS_vg5
	Brick Running     :   No
	Brick PID         :   N/A
	Data Percentage   :   0.03
	LV Size           :   7.26t

	Brick Path        :   rhs-arch-srv3.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick15/b5
	Volume Group      :   RHS_vg5
	Brick Running     :   Yes
	Brick PID         :   14430
	Data Percentage   :   0.03
	LV Size           :   7.26t

	Brick Path        :   rhs-arch-srv4.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick16/b5
	Volume Group      :   RHS_vg5
	Brick Running     :   Yes
	Brick PID         :   8029
	Data Percentage   :   0.04
	LV Size           :   5.44t

	Brick Path        :   inception.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick17/b6
	Volume Group      :   RHS_vg6
	Brick Running     :   Yes
	Brick PID         :   7608
	Data Percentage   :   0.05
	LV Size           :   1.80t

	Brick Path        :   rhs-arch-srv2.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick18/b6
	Volume Group      :   RHS_vg6
	Brick Running     :   No
	Brick PID         :   N/A
	Data Percentage   :   0.04
	LV Size           :   5.44t

The above case fails in a node-down scenario. Moving back to 'Assigned'.
I tested the above case with a 2*2 volume and it is working fine. There is a short delay in starting the bricks after the nodes come back online; if you check the snapshot status during that window, it will show as offline.

Tested using the latest available downstream build:
glusterfs-debuginfo-3.7.1-6.el6rhs.x86_64
glusterfs-client-xlators-3.7.1-6.el6rhs.x86_64
glusterfs-server-3.7.1-6.el6rhs.x86_64
glusterfs-rdma-3.7.1-6.el6rhs.x86_64
glusterfs-3.7.1-6.el6rhs.x86_64
glusterfs-api-3.7.1-6.el6rhs.x86_64
glusterfs-cli-3.7.1-6.el6rhs.x86_64
glusterfs-devel-3.7.1-6.el6rhs.x86_64
glusterfs-geo-replication-3.7.1-6.el6rhs.x86_64
glusterfs-libs-3.7.1-6.el6rhs.x86_64
glusterfs-fuse-3.7.1-6.el6rhs.x86_64
glusterfs-api-devel-3.7.1-6.el6rhs.x86_64
Version : glusterfs-3.7.1-6.el6rhs.x86_64
=======

Retried the scenario as mentioned in Comment 8 and the Description. Snapshot status shows Started and all bricks are running after the node reboot. Waited for a while before checking the status after the node rebooted.

Marking bug Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html