Description of problem:
=======================
Snapshot delete failed with error:

[root@snapshot-01 ~]# gluster snapshot delete -v vol1 -s SNAP2
snapshot remove: failed: Commit failed on 10.70.43.151. Please check log file for details.
Snapshot command failed

Version-Release number of selected component (if applicable):
=============================================================
glusterfs 3.4.0.snap.dec30.2013git

How reproducible:

Steps to Reproduce:
==================
1. Create 2 volumes (vol1, vol2) and start them
2. Create 2 snapshots on vol1 and 2 snapshots on vol2
3. List the snapshots

gluster snapshot list vol1

Volume Name : vol1
Number of snaps taken : 2
Number of snaps available : 254

Snap Name : SNAP1
Snap Time : 2014-01-07 07:39:07
Snap ID : 4ef3d75c-1154-4a9d-b22b-d8fca63d51c3

Snap Name : SNAP2
Snap Time : 2014-01-07 07:39:37
Snap ID : 99d9a27a-08e2-4058-b0a3-2de6d42193f9

[root@snapshot-01 ~]# gluster snapshot list vol2

Volume Name : vol2
Number of snaps taken : 2
Number of snaps available : 254

Snap Name : SNAP1_vol2
Snap Time : 2014-01-07 07:41:06
Snap ID : d3761a00-2a67-440b-871b-eb73d658172e

Snap Name : SNAP2_vol2
Snap Time : 2014-01-07 07:41:21
Snap ID : 4c854293-e2dd-4b19-a9da-caba23283cfd

4. Delete snapshots

gluster snapshot delete -v vol1 -s SNAP1
snapshot remove: failed: Commit failed on 10.70.43.151. Please check log file for details.
Snapshot command failed
[root@snapshot-01 ~]# gluster snapshot delete -v vol1 -s SNAP2
snapshot remove: failed: Commit failed on 10.70.43.151. Please check log file for details.
Snapshot command failed

---------------Part of log---------------
[2014-01-07 02:12:48.257146] W [glusterd-snapshot.c:3580:glusterd_remove_snap] 0-management: unmounting the path /run/gluster/snaps/SNAP1/dev-mapper-VolGroup0-SNAP1-brick (brick: /run/gluster/snaps/SNAP1/dev-mapper-VolGroup0-SNAP1-brick/a1) failed (Bad file descriptor)
[2014-01-07 02:12:48.257233] E [glusterd-snapshot.c:3662:glusterd_brick_snapshot_remove] 0-management: failed to remove the snapshot /run/gluster/snaps/SNAP1/dev-mapper-VolGroup0-SNAP1-brick/a1 (/dev/mapper/VolGroup0-SNAP1)
[2014-01-07 02:12:48.257260] E [glusterd-snapshot.c:3742:glusterd_do_snap_remove] 0-management: removing the bricks snapshots for the snap SNAP1 (volume: vol1) failed
[2014-01-07 02:12:48.257277] E [glusterd-snapshot.c:3893:glusterd_snapshot_remove_commit] 0-management: removing the snap SNAP1 failed
[2014-01-07 02:12:48.257295] E [glusterd-snapshot.c:4375:glusterd_snapshot] 0-management: Failed to delete snapshot
[2014-01-07 02:12:48.257312] E [glusterd-mgmt-handler.c:552:glusterd_handle_commit_fn] 0-management: commit failed on operation Snapshot
[2014-01-07 02:12:48.257341] E [rpcsvc.c:495:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2014-01-07 02:12:48.266895] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=2236 max=0 total=0
[2014-01-07 02:12:48.266940] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=124 max=0 total=0
[2014-01-07 02:12:48.267053] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /run/gluster/snaps/SNAP1/dev-mapper-VolGroup0-SNAP1-brick/a1 on port 49154
[2014-01-07 02:14:08.061592] W [glusterd-snapshot.c:3580:glusterd_remove_snap] 0-management: unmounting the path /run/gluster/snaps/SNAP2/dev-mapper-VolGroup0-SNAP2-brick (brick: /run/gluster/snaps/SNAP2/dev-mapper-VolGroup0-SNAP2-brick/a1) failed (Bad file descriptor)
[2014-01-07 02:14:08.061661] E [glusterd-snapshot.c:3662:glusterd_brick_snapshot_remove] 0-management: failed to remove the snapshot /run/gluster/snaps/SNAP2/dev-mapper-VolGroup0-SNAP2-brick/a1 (/dev/mapper/VolGroup0-SNAP2)
[2014-01-07 02:14:08.061684] E [glusterd-snapshot.c:3742:glusterd_do_snap_remove] 0-management: removing the bricks snapshots for the snap SNAP2 (volume: vol1) failed
[2014-01-07 02:14:08.061702] E [glusterd-snapshot.c:3893:glusterd_snapshot_remove_commit] 0-management: removing the snap SNAP2 failed
[2014-01-07 02:14:08.061721] E [glusterd-snapshot.c:4375:glusterd_snapshot] 0-management: Failed to delete snapshot
[2014-01-07 02:14:08.061740] E [glusterd-mgmt-handler.c:552:glusterd_handle_commit_fn] 0-management: commit failed on operation Snapshot
[2014-01-07 02:14:08.061791] E [rpcsvc.c:495:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2014-01-07 02:14:08.071172] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=2236 max=0 total=0
===========================================================

Actual results:

Expected results:

Additional info:
http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/snapshots/1049353/
I have seen this issue multiple times: snap delete fails, but the snap is still deleted from some nodes and not from others.

For example:
============
Scenario: 3 snaps present on the volume as listed below:

[root@snapshot-01 ~]# gluster snapshot list

Volume Name : vol0
Number of snaps taken : 3
Number of snaps available : 253

Snap Name : s1
Snap Time : 2014-01-17 07:37:11
Snap UUID : dc9a2ae8-ba36-4c80-b30b-b095b4af4ec9

Snap Name : s3
Snap Time : 2014-01-17 07:37:27
Snap UUID : 1fcccd77-1044-4e12-8a7f-0452c864a159

Snap Name : s2
Snap Time : 2014-01-17 07:37:35
Snap UUID : 8b60a028-4fa6-4149-98a3-cfa5478ceff5

Volume Name : vol1
Number of snaps taken : 0
Number of snaps available : 256
[root@snapshot-01 ~]#

[root@snapshot-01 ~]# gluster snap delete vol0 -s s1
Deleting snap will erase all information about the snap. Do you want to continue? (y/n) y
snapshot delete: failed: Commit failed on 10.70.43.151. Please check log file for details.
Snapshot command failed
[root@snapshot-01 ~]#

[root@snapshot-01 ~]# gluster snap delete vol0 -s s1
Deleting snap will erase all information about the snap. Do you want to continue? (y/n) y
snapshot delete: failed: snap s1 does not exist
Snapshot command failed
[root@snapshot-01 ~]#

[root@snapshot-01 ~]# gluster snap list

Volume Name : vol0
Number of snaps taken : 2
Number of snaps available : 254

Snap Name : s3
Snap Time : 2014-01-17 07:37:27
Snap UUID : 1fcccd77-1044-4e12-8a7f-0452c864a159

Snap Name : s2
Snap Time : 2014-01-17 07:37:35
Snap UUID : 8b60a028-4fa6-4149-98a3-cfa5478ceff5

Volume Name : vol1
Number of snaps taken : 0
Number of snaps available : 256
[root@snapshot-01 ~]#

[root@snapshot-02 ~]# gluster snap list

Volume Name : vol0
Number of snaps taken : 3
Number of snaps available : 253

Snap Name : s1
Snap Time : 2014-01-17 07:37:14
Snap UUID : dc9a2ae8-ba36-4c80-b30b-b095b4af4ec9

Snap Name : s3
Snap Time : 2014-01-17 07:37:30
Snap UUID : 1fcccd77-1044-4e12-8a7f-0452c864a159

Snap Name : s2
Snap Time : 2014-01-17 07:37:39
Snap UUID : 8b60a028-4fa6-4149-98a3-cfa5478ceff5

Volume Name : vol1
Number of snaps taken : 0
Number of snaps available : 256
[root@snapshot-02 ~]#

As the command outputs above show, the snap named s1 is deleted from snapshot-01 (where the CLI was executed and failed) but is not deleted from the other node in the cluster (snapshot-02).
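The partial deletion above can be spotted mechanically by comparing the snapshot names each node reports. The following is only a rough sketch: it assumes the list output has already been captured per node as text, and the helper names `snap_names` and `inconsistent_snaps` are hypothetical, not part of gluster.

```python
import re
from typing import Dict, Set

# Matches the "Snap Name : <name>" fields as they appear in the
# `gluster snapshot list` transcripts in this report.
SNAP_NAME_RE = re.compile(r"Snap Name\s*:\s*(\S+)")


def snap_names(list_output: str) -> Set[str]:
    """Return the snapshot names found in one node's list output."""
    return set(SNAP_NAME_RE.findall(list_output))


def inconsistent_snaps(outputs_by_node: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each snap that is missing on some node to the nodes still holding it."""
    names_by_node = {node: snap_names(out) for node, out in outputs_by_node.items()}
    all_nodes = set(names_by_node)
    all_names = set().union(*names_by_node.values())
    result = {}
    for name in all_names:
        holders = {node for node, names in names_by_node.items() if name in names}
        if holders != all_nodes:
            result[name] = holders
    return result


if __name__ == "__main__":
    # Trimmed-down versions of the two listings shown in this comment.
    outputs = {
        "snapshot-01": "Volume Name : vol0\nSnap Name : s3\nSnap Name : s2\n",
        "snapshot-02": "Volume Name : vol0\nSnap Name : s1\nSnap Name : s3\nSnap Name : s2\n",
    }
    print(inconsistent_snaps(outputs))  # prints {'s1': {'snapshot-02'}}
```

Running such a check after every failed delete would show which snaps were left behind and on which peers.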
Clear steps to reproduce this case are as follows:

1. Create multiple snaps of a volume (let's call them r1, r2, r3).
2. Delete all the snaps together using (gluster snap delete -s r1 r2 r3). This reports successful removal of only one snap, r1 (bz 1048122). Note that r2 and r3 do not get deleted here.
3. Delete the r2 snap using (snap delete vol -s r2). It fails with the above message; the snap gets deleted from a few nodes in the cluster and remains on the others.
Another very simple case to reproduce this issue:

1. Create 10 snapshots of a volume using "for i in {1..10} ; do gluster snapshot create vol2 -n y$i ; done"
2. Snapshot creation should be successful
3. Try to delete the snaps, again in a for loop, using "for i in {1..10} ; do gluster snapshot delete vol2 -s y$i ; done"

Only a few snaps get deleted successfully; the remaining commands fail, leaving the snaps partially deleted from a few nodes.

Log Snippet:
=============
[2014-01-17 04:29:57.156376] E [glusterd-mgmt.c:108:gd_mgmt_v3_collate_errors] 0-: Commit failed on 10.70.43.71. Please check log file for details.
[2014-01-17 04:29:57.670433] E [glusterd-mgmt.c:1009:glusterd_mgmt_v3_commit] 0-management: Commit failed on peers
[2014-01-17 04:29:57.670565] E [glusterd-mgmt.c:1581:glusterd_mgmt_v3_initiate_snap_phases] 0-: Commit Op Failed
[2014-01-17 04:29:57.677018] E [glusterd-mgmt.c:1601:glusterd_mgmt_v3_initiate_snap_phases] 0-: Brick Ops Failed
[2014-01-17 04:30:01.531081] E [glusterd-snapshot.c:3460:glusterd_snapshot_remove_prevalidate] 0-management: snap y4 does not exist, (volume: vol2)
[2014-01-17 04:30:01.531197] W [glusterd-snapshot.c:4568:glusterd_snapshot_prevalidate] 0-management: Snapshot remove validation failed
[2014-01-17 04:30:01.531226] E [glusterd-mgmt.c:550:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed for operation Snapshot on local node
[2014-01-17 04:30:01.531280] E [glusterd-mgmt.c:1562:glusterd_mgmt_v3_initiate_snap_phases] 0-: Pre Validation Failed
[2014-01-17 04:30:06.658121] W [glusterd-snapshot.c:3524:glusterd_remove_snap] 0-management: unmounting the path /run/gluster/snaps/y5/dev-mapper-VolGroup0-y5-brick (brick: /run/gluster/snaps/y5/dev-mapper-VolGroup0-y5-brick/b2) failed (Bad file descriptor)
[2014-01-17 04:30:06.658202] E [glusterd-snapshot.c:3606:glusterd_brick_snapshot_remove] 0-management: failed to remove the snapshot /run/gluster/snaps/y5/dev-mapper-VolGroup0-y5-brick/b2 (/dev/mapper/VolGroup0-y5)
[2014-01-17 04:30:06.658237] E [glusterd-snapshot.c:3686:glusterd_do_snap_remove] 0-management: removing the bricks snapshots for the snap y5 (volume: vol2) failed
[2014-01-17 04:30:06.658258] E [glusterd-snapshot.c:3837:glusterd_snapshot_remove_commit] 0-management: removing the snap y5 failed
[2014-01-17 04:30:06.658297] E [glusterd-snapshot.c:4423:glusterd_snapshot] 0-management: Failed to delete snapshot
[2014-01-17 04:30:06.658330] E [glusterd-mgmt.c:964:glusterd_mgmt_v3_commit] 0-management: Commit failed for operation Snapshot on local node
[2014-01-17 04:30:06.658369] E [glusterd-mgmt.c:1581:glusterd_mgmt_v3_initiate_snap_phases] 0-: Commit Op Failed
[2014-01-17 04:30:06.658924] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=2236 max=0 total=0
[2014-01-17 04:30:06.659177] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=124 max=0 total=0
[2014-01-17 04:30:06.659563] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /run/gluster/snaps/y5/dev-mapper-VolGroup0-y5-brick/b2 on port 49179
[2014-01-17 04:30:06.667001] E [glusterd-mgmt.c:1601:glusterd_mgmt_v3_initiate_snap_phases] 0-: Brick Ops Failed
[2014-01-17 04:30:16.031422] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=2236 max=0 total=0
[2014-01-17 04:30:16.031522] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=124 max=0 total=0
[2014-01-17 04:30:16.031688] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /run/gluster/snaps/y6/dev-mapper-VolGroup0-y6-brick/b2 on port 49180
[2014-01-17 04:31:36.632080] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=2236 max=0 total=0
[2014-01-17 04:31:36.632181] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=124 max=0 total=0
[2014-01-17 04:31:36.632372] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /run/gluster/snaps/y7/dev-mapper-VolGroup0-y7-brick/b2 on port 49181
[2014-01-17 04:31:36.654105] E [glusterd-mgmt.c:108:gd_mgmt_v3_collate_errors] 0-: Commit failed on 10.70.43.151. Please check log file for details.
[2014-01-17 04:31:37.168325] E [glusterd-mgmt.c:1009:glusterd_mgmt_v3_commit] 0-management: Commit failed on peers
[2014-01-17 04:31:37.168462] E [glusterd-mgmt.c:1581:glusterd_mgmt_v3_initiate_snap_phases] 0-: Commit Op Failed
[2014-01-17 04:31:37.174441] E [glusterd-mgmt.c:1601:glusterd_mgmt_v3_initiate_snap_phases] 0-: Brick Ops Failed
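When reading through logs like the snippet above, it can help to drop the I-level noise and keep only the chain of failing functions. The sketch below is an illustration only: the regular expression is inferred from the log lines quoted in this report (one entry per line), not from any documented glusterd log specification, and `LogEntry`, `parse_line` and `error_entries` are hypothetical names.

```python
import re
from typing import List, NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: str
    level: str        # single letter as seen in the report: E, W, I
    source_file: str
    line: int
    function: str
    domain: str       # e.g. "0-management", or "0-" when the name is empty
    message: str


# Layout inferred from the quoted lines:
# [timestamp] LEVEL [file:line:function] domain: message
LOG_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] "
    r"(?P<level>[A-Z]) "
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\] "
    r"(?P<domain>\S+): "
    r"(?P<msg>.*)"
)


def parse_line(text: str) -> Optional[LogEntry]:
    """Parse one log line; return None if it does not match the layout."""
    m = LOG_RE.match(text.strip())
    if m is None:
        return None
    return LogEntry(m["ts"], m["level"], m["file"], int(m["line"]),
                    m["func"], m["domain"], m["msg"])


def error_entries(log_text: str) -> List[LogEntry]:
    """Return only the E-level entries, in their original order."""
    parsed = (parse_line(line) for line in log_text.splitlines())
    return [entry for entry in parsed if entry is not None and entry.level == "E"]


if __name__ == "__main__":
    sample = (
        "[2014-01-17 04:30:06.658258] E [glusterd-snapshot.c:3837:glusterd_snapshot_remove_commit] "
        "0-management: removing the snap y5 failed\n"
        "[2014-01-17 04:30:06.658924] I [mem-pool.c:539:mem_pool_destroy] "
        "0-management: size=2236 max=0 total=0\n"
    )
    for entry in error_entries(sample):
        print(entry.timestamp, entry.function, "->", entry.message)
```

Listing the functions of consecutive E-level entries gives the failure path from the brick-level removal up to the commit handler.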
The delete was failing because of a race condition between killing the brick process and unmounting the snap volume: the unmount was attempted before the brick process had actually been killed. This issue is fixed in patch: http://review.gluster.org/#/c/6772/
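The ordering problem described above can be illustrated with a small sketch. This is not the content of the patch (glusterd is written in C, and the patch itself is not quoted here); it only shows the general idea of waiting for a process to exit before releasing what it was using. The function name `stop_then_cleanup` and the sleeping child used as a stand-in for a brick process are assumptions made for the example.

```python
import subprocess
import sys
from typing import Callable


def stop_then_cleanup(proc: subprocess.Popen,
                      cleanup: Callable[[], None],
                      timeout: float = 10.0) -> int:
    """Terminate proc, wait until it has really exited, then run cleanup."""
    proc.terminate()                  # request termination (SIGTERM on POSIX)
    try:
        proc.wait(timeout=timeout)    # block until the process is gone
    except subprocess.TimeoutExpired:
        proc.kill()                   # escalate if the request is ignored
        proc.wait()
    cleanup()                         # only now release resources (e.g. unmount)
    return proc.returncode


if __name__ == "__main__":
    # Stand-in for a brick process: a child that just sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    stop_then_cleanup(child, lambda: print("cleanup ran after the process exited"))
```

Calling the cleanup step without the wait would reproduce the shape of the race: the cleanup could run while the process still holds the mount.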
After quite a few backend changes, we are hitting this issue again, with a Bad File Descriptor error, so moving this back to the assigned state.
This bug is fixed as a part of another patch. Link for that is http://review.gluster.org/#/c/7123/
Could reproduce this with build: glusterfs-3.4.1.6.snap.mar25.2014git-1.el6.x86_64

Scenario where it got reproduced:

1. Start creating 100 snaps of volume1 using "for i in {1..100} ; do gluster snapshot create snap$i volume1; done"
2. Start creating 50 snaps of volume2 using "for i in {1..50}; do gluster snapshot create s$i volume2; done"
3. Once the creation of the 50 snaps s1 to s50 is successful, start deleting them using "for i in {1..50} ; do gluster --mode=script snapshot delete s$i; done"

Snapshot delete fails:
======================
[root@snapshot-10 ~]# for i in {1..50} ; do gluster --mode=script snapshot delete s$i; done
snapshot delete: failed: Commit failed on 10.70.42.220. Please check log file for details.
Snapshot command failed
snapshot delete: failed: Commit failed on 10.70.42.220. Please check log file for details.
Snapshot command failed
snapshot delete: s3: snap removed successfully
^C
[root@snapshot-10 ~]#

But when you query the snapshot info, it is deleted from the node:

[root@snapshot-10 ~]# gluster snapshot info s1
Snapshot info : failed: Snap (s1) does not exist
Snapshot command failed
[root@snapshot-10 ~]#

But it is not deleted on the node where another snapshot creation was in progress (snap1 to snap100):

[root@snapshot-09 ~]# gluster snapshot info s1
Snapshot : s1
Snap UUID : cdc47124-a642-4439-a053-0fc742dc4938
Created : 2014-03-26 09:35:07
Snap Volumes:
Snap Volume Name : e8fa7d8b51b04098b872a90c450fb50b
Origin Volume name : vol1
Snaps taken for vol1 : 46
Snaps available for vol1 : 210
Status : Started
[root@snapshot-09 ~]#

Logs are as follows:
=====================
[2014-03-26 04:17:48.510248] E [glusterd-utils.c:1715:glusterd_brick_unlink_socket_file] 0-management: Failed to remove /var/run/f2609549e70d13f3aa87690cd1c1447c.socket error: Permission denied
[2014-03-26 04:17:48.547055] E [glusterd-snapshot.c:993:glusterd_lvm_snapshot_remove] 0-management: failed to remove the snapshot /var/run/gluster/snaps/e8fa7d8b51b04098b872a90c450fb50b/dev-VolGroup0-e8fa7d8b51b04098b872a90c450fb50b-brick/b1 (/dev/mapper/VolGroup0-e8fa7d8b51b04098b872a90c450fb50b)
[2014-03-26 04:17:48.547126] E [glusterd-snapshot.c:3196:glusterd_snapshot_remove_commit] 0-management: Failed to remove snap s1
[2014-03-26 04:17:48.547144] E [glusterd-snapshot.c:4387:glusterd_snapshot] 0-management: Failed to delete snapshot
[2014-03-26 04:17:48.547162] E [glusterd-mgmt-handler.c:543:glusterd_handle_commit_fn] 0-management: commit failed on operation Snapshot
[2014-03-26 04:17:48.547199] E [rpcsvc.c:495:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2014-03-26 04:17:50.396834] E [glusterd-utils.c:1715:glusterd_brick_unlink_socket_file] 0-management: Failed to remove /var/run/03040240a5afa81739d74cbaa5b41ff2.socket error: Permission denied
[2014-03-26 04:17:50.486651] E [glusterd-snapshot.c:993:glusterd_lvm_snapshot_remove] 0-management: failed to remove the snapshot /var/run/gluster/snaps/b42fd0dabd4d48a2aff4849188e9ce31/dev-VolGroup0-b42fd0dabd4d48a2aff4849188e9ce31-brick/b1 (/dev/mapper/VolGroup0-b42fd0dabd4d48a2aff4849188e9ce31)
[2014-03-26 04:17:50.486719] E [glusterd-snapshot.c:3196:glusterd_snapshot_remove_commit] 0-management: Failed to remove snap s2
[2014-03-26 04:17:50.486737] E [glusterd-snapshot.c:4387:glusterd_snapshot] 0-management: Failed to delete snapshot
[2014-03-26 04:17:50.486754] E [glusterd-mgmt-handler.c:543:glusterd_handle_commit_fn] 0-management: commit failed on operation Snapshot
[2014-03-26 04:17:50.486823] E [rpcsvc.c:495:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2014-03-26 04:17:52.320335] E [glusterd-utils.c:1715:glusterd_brick_unlink_socket_file] 0-management: Failed to remove /var/run/76cf228179e066c1e0e7b76016bc3325.socket error: Permission denied
[2014-03-26 04:17:57.222179] E [glusterd-utils.c:1715:glusterd_brick_unlink_socket_file] 0-management: Failed to remove /var/run/657bc4cbb9ace06d29e2e339376a8511.socket error: Permission denied
[2014-03-26 04:17:57.268608] E [glusterd-snapshot.c:993:glusterd_lvm_snapshot_remove] 0-management: failed to remove the snapshot /var/run/gluster/snaps/215383bb8d664a8ca1a92da4be6dbed5/dev-VolGroup0-215383bb8d664a8ca1a92da4be6dbed5-brick/b1 (/dev/mapper/VolGroup0-215383bb8d664a8ca1a92da4be6dbed5)
[2014-03-26 04:17:57.268673] E [glusterd-snapshot.c:3196:glusterd_snapshot_remove_commit] 0-management: Failed to remove snap s4
[2014-03-26 04:17:57.268691] E [glusterd-snapshot.c:4387:glusterd_snapshot] 0-management: Failed to delete snapshot
[2014-03-26 04:17:57.268708] E [glusterd-mgmt-handler.c:543:glusterd_handle_commit_fn] 0-management: commit failed on operation Snapshot
[2014-03-26 04:17:57.268745] E [rpcsvc.c:495:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
I am currently trying to figure out which patch reintroduced this bug. The root cause may be one of these two things:

1) The LVM snapshot could not be removed for some reason, and hence gluster snapshot delete is failing.
2) A race between killing a brick process and unmounting the bricks after the snapshot delete command is issued (the unmount happens before the brick process is killed completely).

The second was fixed with patch http://review.gluster.org/#/c/6772/. I'll investigate further to see how these two problems can be fixed.
Marking snapshot BZs to RHS 3.0.
Sorry, wrong bug. Please ignore comment 13.
Patch http://review.gluster.org/#/c/7532/ fixes the issue
Verified with build: glusterfs-3.6.0-1.0.el6rhs.x86_64

Able to delete the snapshots in a loop.

Initially had 256 snaps on the system:

[root@snapshot09 ~]# gluster snapshot list | wc
256 256 1940
[root@snapshot09 ~]#

Started deleting snaps:

[root@snapshot09 ~]# for i in {1..256} ; do time gluster --mode=script snapshot delete snap$i; done

All snaps are deleted successfully:

[root@snapshot09 ~]# gluster snapshot list
No snapshots present
[root@snapshot09 ~]#

Moving the bug to verified
*** Bug 1062122 has been marked as a duplicate of this bug. ***
Setting flags required to add BZs to RHS 3.0 Errata
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-1278.html