Description of problem:
=======================
Currently, restore of a volume succeeds even when peer nodes that are participating in the volume (bricks) are down or their glusterds are down. There should be a check in pre-validation before restoring the volume.

[root@snapshot-09 ~]# gluster volume info vol1

Volume Name: vol1
Type: Distributed-Replicate
Volume ID: b93f3400-6be9-4181-a496-40c5ed22e481
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.220:/brick1/b1
Brick2: 10.70.43.20:/brick1/b1
Brick3: 10.70.43.186:/brick1/b1
Brick4: 10.70.43.70:/brick1/b1
[root@snapshot-09 ~]#
[root@snapshot-09 ~]# gluster snapshot list vol1
Volume Name : vol1
Number of snaps taken : 1
Number of snaps available : 255
Snap Name : s1
Snap Time : 2014-02-20 07:36:42
Snap UUID : 5b03002a-2371-4478-a9f5-8db05eea890e
[root@snapshot-09 ~]#
[root@snapshot-09 ~]# gluster volume stop vol1
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: vol1: success
[root@snapshot-09 ~]#
[root@snapshot-09 ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.43.20
Uuid: 0a51103a-71f7-4a00-ae59-17711e4f8f9b
State: Peer in Cluster (Disconnected)

Hostname: 10.70.43.186
Uuid: 1786d00d-25b9-4ad6-b7d5-23c165bf600c
State: Peer in Cluster (Connected)

Hostname: 10.70.43.70
Uuid: 83a22dd5-47e1-423a-8ee9-2bb5230916fa
State: Peer in Cluster (Disconnected)
[root@snapshot-09 ~]#
[root@snapshot-09 ~]# gluster snapshot restore -v vol1 s1
Snapshot restore: s1: Snap restored successfully
[root@snapshot-09 ~]#
[root@snapshot-09 ~]# gluster volume info vol1

Volume Name: vol1
Type: Distributed-Replicate
Volume ID: 5b03002a-2371-4478-a9f5-8db05eea890e
Status: Stopped
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.220:/run/gluster/snaps/5b03002a23714478a9f58db05eea890e/dev-VolGroup0-5b03002a23714478a9f58db05eea890e-brick/b1
Brick2: 10.70.43.20:/run/gluster/snaps/5b03002a23714478a9f58db05eea890e/dev-VolGroup0-5b03002a23714478a9f58db05eea890e-brick/b1
Brick3: 10.70.43.186:/run/gluster/snaps/5b03002a23714478a9f58db05eea890e/dev-VolGroup0-5b03002a23714478a9f58db05eea890e-brick/b1
Brick4: 10.70.43.70:/run/gluster/snaps/5b03002a23714478a9f58db05eea890e/dev-VolGroup0-5b03002a23714478a9f58db05eea890e-brick/b1
[root@snapshot-09 ~]#
[root@snapshot-09 ~]# gluster volume start vol1
volume start: vol1: success
[root@snapshot-09 ~]#

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.4.1.1.snap.feb17.2014git-1.el6.x86_64

How reproducible:
=================
1/1

Steps to Reproduce:
===================
1. Create and start the volume (2x2) from 4 nodes
2. Create a snapshot of the volume
3. Kill glusterd on node2
4. Bring down node4 (poweroff)
5. Take the volume offline from node1 (gluster volume stop <volname>)
6. Restore the volume to the snapshot taken in step 2 (a condensed command sketch follows at the end of this comment)

Actual results:
===============
Restore is successful.

Expected results:
=================
Restore should fail with a proper error message.

Additional info:
================
When glusterd on node2 is brought back online, the restored volfiles on node1 are overwritten from node2 and show the original volume information:

[root@snapshot-09 ~]# gluster volume info vol1

Volume Name: vol1
Type: Distributed-Replicate
Volume ID: b93f3400-6be9-4181-a496-40c5ed22e481
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.42.220:/brick1/b1
Brick2: 10.70.43.20:/brick1/b1
Brick3: 10.70.43.186:/brick1/b1
Brick4: 10.70.43.70:/brick1/b1
[root@snapshot-09 ~]#
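For quick reproduction, here is a condensed shell sketch of the steps above. It assumes the volume and snapshot names from this report and a current gluster CLI syntax (the development build in the original transcript used "gluster snapshot restore -v vol1 s1"); node1..node4 are placeholder hostnames, not part of the original report.

# Condensed sketch of the reproduction steps; run each command on the node
# indicated in the trailing comment.
gluster volume create vol1 replica 2 \
    node1:/brick1/b1 node2:/brick1/b1 node3:/brick1/b1 node4:/brick1/b1
gluster volume start vol1

gluster snapshot create s1 vol1    # step 2: snapshot the volume (on node1)
pkill glusterd                     # step 3: kill glusterd (on node2)
poweroff                           # step 4: bring the node down (on node4)
gluster volume stop vol1           # step 5: take the volume offline (on node1)
gluster snapshot restore s1        # step 6: restore (on node1) - succeeds, but should fail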
Fix at http://review.gluster.org/#/c/7455/
Marking snapshot BZs to RHS 3.0.
Fixed with http://review.gluster.org/7455
Setting flags required to add BZs to RHS 3.0 Errata
Can't verify this bug until bz 1100282 is fixed.
Please move this to ON_QA only when the dependent fix is moved to ON_QA.
During verification, hit bz 1108652; marking it as a dependency for verification.
Cannot verify this bug until bz 1108018 is fixed: after restore, the versions mismatch and the handshake does not happen, so no missed entry is created once glusterd is back online. Moving the bug to MODIFIED state. Assign back to ON_QA once the dependent bug is fixed.
1100324 is an upstream bug, therefore removing it from the depends-on list. 1100282 is the corresponding downstream bug.
Verified with build: glusterfs-3.6.0.19-1.el6rhs.x86_64

With server-side quorum support, snapshot restore should fail only when glusterd quorum is not met; otherwise it should succeed. Hence trying the case with a 5-node cluster.

Case 1: When 3/5 machines were brought down, the restore fails as expected with the following error message:

[root@inception ~]# gluster snapshot restore snap1
snapshot restore: failed: glusterds are not in quorum
Snapshot command failed
[root@inception ~]#

Case 2: When 2/5 machines were down, the restore is successful as expected and the entries are registered in missed_snaps_list:

[root@inception ~]# cat /var/lib/glusterd/snaps/missed_snaps_list
d7f5e47b-70d8-457e-bce1-615d91c8591e:f17150a5-6099-4995-9e99-5a4fbebe9380=413a77c67519440a865e61ebc283f267:2:/var/run/gluster/snaps/413a77c67519440a865e61ebc283f267/brick2/b0:3:1
b77af951-841b-427e-a7ca-2e9677a896ca:f17150a5-6099-4995-9e99-5a4fbebe9380=413a77c67519440a865e61ebc283f267:4:/var/run/gluster/snaps/413a77c67519440a865e61ebc283f267/brick4/b0:3:1
[root@inception ~]#

After glusterd start on one machine:

[root@rhs-arch-srv2 ~]# cat /var/lib/glusterd/snaps/missed_snaps_list
d7f5e47b-70d8-457e-bce1-615d91c8591e:f17150a5-6099-4995-9e99-5a4fbebe9380=413a77c67519440a865e61ebc283f267:2:/var/run/gluster/snaps/413a77c67519440a865e61ebc283f267/brick2/b0:3:2
b77af951-841b-427e-a7ca-2e9677a896ca:f17150a5-6099-4995-9e99-5a4fbebe9380=413a77c67519440a865e61ebc283f267:4:/var/run/gluster/snaps/413a77c67519440a865e61ebc283f267/brick4/b0:3:1
[root@rhs-arch-srv2 ~]#

After glusterd start on another machine:

[root@rhs-arch-srv4 ~]# cat /var/lib/glusterd/snaps/missed_snaps_list
d7f5e47b-70d8-457e-bce1-615d91c8591e:f17150a5-6099-4995-9e99-5a4fbebe9380=413a77c67519440a865e61ebc283f267:2:/var/run/gluster/snaps/413a77c67519440a865e61ebc283f267/brick2/b0:3:2
b77af951-841b-427e-a7ca-2e9677a896ca:f17150a5-6099-4995-9e99-5a4fbebe9380=413a77c67519440a865e61ebc283f267:4:/var/run/gluster/snaps/413a77c67519440a865e61ebc283f267/brick4/b0:3:2
[root@rhs-arch-srv4 ~]#

When all the machines are up and running, the restore is replayed on the machines that were brought down earlier and all nodes in the cluster are in sync. Moving the bug to verified state.
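For reference, a quick way to check which missed-snap entries are still pending; this is only an illustrative sketch, and the field layout is assumed from the sample entries above (node-uuid : snap-uuid=snap-vol-id : brick-number : brick-path : op : status, where a trailing status of 1 appears to mean pending and 2 means replayed):

# Print node uuid, brick number and status for each entry in
# missed_snaps_list; field positions assume the layout of the sample
# entries above and that brick paths contain no ':' characters.
awk -F: '{ printf "node=%s brick=%s status=%s\n", $1, $3, $NF }' \
    /var/lib/glusterd/snaps/missed_snaps_list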
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html