Bug 1122064
| Summary: | [SNAPSHOT]: activate and deactivate doesn't do a handshake when a glusterd comes back | | |
| --- | --- | --- | --- |
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Rahul Hinduja <rhinduja> |
| Component: | snapshot | Assignee: | Mohammed Rafi KC <rkavunga> |
| Status: | CLOSED ERRATA | QA Contact: | Rahul Hinduja <rhinduja> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | rhgs-3.0 | CC: | asengupt, asriram, mzywusko, nsathyan, rhs-bugs, rjoseph, senaik, storage-qa-internal, vagarwal |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.1.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | SNAPSHOT | | |
| Fixed In Version: | glusterfs-3.7.0-3.el6rhs | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| | 1122377 (view as bug list) | Environment: | |
| Last Closed: | 2015-07-29 04:34:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1087818, 1122377, 1202842, 1219744, 1223636 | | |
Description
Rahul Hinduja
2014-07-22 13:09:57 UTC
Based on discussion, removing the blocker flag from this. Please review and sign off the edited doc text.

Version: glusterfs 3.6.0.33

With the latest change that snapshots are deactivated by default and have to be explicitly activated before use, fixing this bug takes a higher priority.

Scenario for comment 4 (a command-line sketch of this flow follows below):

1. Create a 4-node cluster.
2. Create a 6*2 volume.
3. Start the volume.
4. Create a snapshot of the volume (snap1).
5. Kill glusterd on node2.
6. Activate the snapshot snap1.
7. Activating the snapshot should be successful and should bring the 9 brick processes on node1, node3 and node4 online.
8. Bring glusterd back up on node2.
9. Once glusterd comes back on node2, it does not start the snapshot brick processes on node2.

Network fluctuation and glusterd going down are valid use cases, and activating/deactivating a snapshot during that period will lead to inconsistent snapshot states. The chance of hitting this is now very high. One way of preventing it for this release is to not allow activate/deactivate while a node/glusterd is down, unless the user explicitly issues activate/deactivate force.

RCA: During the glusterd handshake, we are not checking the version of the snaps. Whenever a change is made to a snap, its version is incremented, so during the handshake we have to compare the version of the peer's snap data with that of the local snap. If the version of the snap details on the local host is lower than the peer's, the data on the local host must be updated.

Upstream patch: http://review.gluster.org/#/c/9664/

Version: glusterfs-3.7.1-4.el6rhs.x86_64
========

- Create a snapshot. It is deactivated by default.
- Stop glusterd on node2.
- Activate the snapshot from Node1 - successful.
- Bring glusterd back up on Node2.
- Check gluster snapshot info from Node2 - the snapshot status shows 'Started'.
- Bring down glusterd on Node4 while deactivating the activated snapshot, and check on Node4 when glusterd comes back up - gluster snapshot info shows Status 'Stopped' and snapshot status shows that all bricks are not running.

The above is as expected.
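For reference, here is a minimal gluster CLI sketch of the activate-while-glusterd-is-down flow from the scenario in the description (the same flow that the failing case below exercises). This is a hedged sketch, not the exact test setup: the host names node1-node4, the volume name vol0, and the brick paths are illustrative placeholders, the bricks are assumed to sit on thin-provisioned LVs (a prerequisite for snapshots), and the created snapshot name gets a GMT timestamp suffix appended by default.

```
# On node1: create and start a 6x2 distributed-replicate volume across 4 nodes
# (replica pairs are formed from consecutive bricks, so each pair spans two hosts)
gluster volume create vol0 replica 2 \
    node1:/bricks/b1 node2:/bricks/b1 node3:/bricks/b1 node4:/bricks/b1 \
    node1:/bricks/b2 node2:/bricks/b2 node3:/bricks/b2 node4:/bricks/b2 \
    node1:/bricks/b3 node2:/bricks/b3 node3:/bricks/b3 node4:/bricks/b3
gluster volume start vol0

# Take a snapshot; it stays deactivated by default, and its real name carries
# a GMT timestamp suffix, so look it up before using it
gluster snapshot create snap1 vol0
SNAP=$(gluster snapshot list vol0 | tail -n 1)

# On node2: stop glusterd to simulate the node going down
service glusterd stop

# On node1: activate the snapshot while node2 is down; this succeeds and brings
# the snapshot bricks on node1, node3 and node4 online
gluster snapshot activate "$SNAP"

# On node2: bring glusterd back up; before the fix, the handshake did not sync
# the changed snapshot state, so the node2 snapshot bricks stayed down
service glusterd start
gluster snapshot info "$SNAP"      # node2 still reports Status : Stopped
gluster snapshot status "$SNAP"    # node2 bricks show Brick Running : No
```

The commands are shown in one block for brevity; in practice each step runs on the node named in its comment, and the snapshot name can be looked up on any node with gluster snapshot list.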
However, when a node is brought down and a snapshot is activated, then when the node comes back up the snapshot info on that node still shows 'Stopped' and the snapshot status shows that its bricks are not running.

Snapshot info from other nodes:
===============================

gluster snapshot info Snap2_GMT-2015.06.23-09.37.26

Snapshot : Snap2_GMT-2015.06.23-09.37.26
Snap UUID : 5961a313-62ea-41d0-8cad-0a8a0fafe766
Created : 2015-06-23 09:37:26
Snap Volumes:
Snap Volume Name : 3ee8f93e484540dcae8d55a64702e961
Origin Volume name : vol0
Snaps taken for vol0 : 2
Snaps available for vol0 : 1
Status : Started

Node2 (which was rebooted):
==========================

gluster snapshot info Snap2_GMT-2015.06.23-09.37.26

Snapshot : Snap2_GMT-2015.06.23-09.37.26
Snap UUID : 5961a313-62ea-41d0-8cad-0a8a0fafe766
Created : 2015-06-23 09:37:26
Snap Volumes:
Snap Volume Name : 3ee8f93e484540dcae8d55a64702e961
Origin Volume name : vol0
Snaps taken for vol0 : 2
Snaps available for vol0 : 1
Status : Stopped

[root@rhs-arch-srv2 ~]# gluster snapshot status Snap2_GMT-2015.06.23-09.37.26 reported Snap Name : Snap2_GMT-2015.06.23-09.37.26, Snap UUID : 5961a313-62ea-41d0-8cad-0a8a0fafe766, and the following per-brick state:

| Brick Path | Volume Group | Brick Running | Brick PID | Data Percentage | LV Size |
| --- | --- | --- | --- | --- | --- |
| inception.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick1/b1 | RHS_vg1 | Yes | 7536 | 0.05 | 1.80t |
| rhs-arch-srv2.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick2/b1 | RHS_vg1 | No | N/A | 0.13 | 29.66g |
| rhs-arch-srv3.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick3/b1 | RHS_vg1 | Yes | 14376 | 0.13 | 29.66g |
| rhs-arch-srv4.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick4/b1 | RHS_vg1 | Yes | 7975 | 0.13 | 29.66g |
| inception.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick5/b2 | RHS_vg2 | Yes | 7554 | 0.05 | 1.80t |
| rhs-arch-srv2.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick6/b2 | RHS_vg2 | No | N/A | 0.05 | 1.80t |
| rhs-arch-srv3.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick7/b2 | RHS_vg2 | Yes | 14394 | 0.05 | 1.80t |
| rhs-arch-srv4.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick8/b2 | RHS_vg2 | Yes | 7993 | 0.03 | 7.26t |
| inception.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick9/b3 | RHS_vg3 | Yes | 7572 | 0.05 | 1.80t |
| rhs-arch-srv2.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick10/b3 | RHS_vg3 | No | N/A | 0.03 | 7.26t |
| rhs-arch-srv3.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick11/b3 | RHS_vg3 | Yes | 14412 | 0.03 | 7.26t |
| rhs-arch-srv4.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick12/b4 | RHS_vg4 | Yes | 8011 | 0.03 | 7.26t |
| inception.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick13/b5 | RHS_vg5 | Yes | 7590 | 0.05 | 1.80t |
| rhs-arch-srv2.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick14/b5 | RHS_vg5 | No | N/A | 0.03 | 7.26t |
| rhs-arch-srv3.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick15/b5 | RHS_vg5 | Yes | 14430 | 0.03 | 7.26t |
| rhs-arch-srv4.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick16/b5 | RHS_vg5 | Yes | 8029 | 0.04 | 5.44t |
| inception.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick17/b6 | RHS_vg6 | Yes | 7608 | 0.05 | 1.80t |
| rhs-arch-srv2.lab.eng.blr.redhat.com:/var/run/gluster/snaps/3ee8f93e484540dcae8d55a64702e961/brick18/b6 | RHS_vg6 | No | N/A | 0.04 | 5.44t |

The above case fails in a node-down scenario. Moving back to 'Assigned'.

I tested the above case with a 2*2 volume and it is working fine. There is a short delay before the bricks start after the nodes come back online. If you check the snapshot status at that time, it will show them as offline.

Tested using the latest available downstream build:

glusterfs-debuginfo-3.7.1-6.el6rhs.x86_64
glusterfs-client-xlators-3.7.1-6.el6rhs.x86_64
glusterfs-server-3.7.1-6.el6rhs.x86_64
glusterfs-rdma-3.7.1-6.el6rhs.x86_64
glusterfs-3.7.1-6.el6rhs.x86_64
glusterfs-api-3.7.1-6.el6rhs.x86_64
glusterfs-cli-3.7.1-6.el6rhs.x86_64
glusterfs-devel-3.7.1-6.el6rhs.x86_64
glusterfs-geo-replication-3.7.1-6.el6rhs.x86_64
glusterfs-libs-3.7.1-6.el6rhs.x86_64
glusterfs-fuse-3.7.1-6.el6rhs.x86_64
glusterfs-api-devel-3.7.1-6.el6rhs.x86_64

Version: glusterfs-3.7.1-6.el6rhs.x86_64
=======

Retried the scenario as mentioned in Comment 8 and the Description. Snapshot status shows 'Started' and all bricks are running after the node reboot. Waited for a while before checking the status after the node rebooted. Marking the bug Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html
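Since the verification above notes a short delay before the snapshot bricks start once a node comes back online, a small polling loop like the one below can be used instead of checking the status immediately. This is an illustrative sketch, not part of the original test procedure; the snapshot name is a placeholder taken from the output above, and the timeout values are arbitrary.

```
#!/bin/bash
# Wait until no brick of the given snapshot reports "Brick Running : No",
# or give up after roughly five minutes.
SNAP="Snap2_GMT-2015.06.23-09.37.26"   # placeholder: substitute the real snapshot name

for attempt in $(seq 1 30); do
    if ! gluster snapshot status "$SNAP" | grep -qE 'Brick Running[[:space:]]*:[[:space:]]*No'; then
        echo "All bricks of $SNAP are running."
        exit 0
    fi
    sleep 10
done

echo "Timed out waiting for the snapshot bricks of $SNAP to start." >&2
exit 1
```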