Description of problem:
=======================
Restoring a volume to a snapshot that was activated and deactivated a couple of times failed with a "Commit failed" message. After the failure, "gluster snapshot info" reports "Operation failed", and "gluster snapshot status" shows the snapshot bricks are not running.

Version-Release number of selected component (if applicable):
============================================================
glusterfs-3.6.0.14-1.el6rhs.x86_64

How reproducible:
=================
2/2

Steps to Reproduce:
===================
1. Create a dist-rep volume, FUSE and NFS mount the volume, and create a few files.
2. Create a snapshot of the volume (vol1_snap1).
3. Below is the series of operations tried before trying to restore the snapshot:
  569  gluster snapshot create vol1_snap1 vol1
  570  gluster snapshot info
  571  gluster snapshot status vol1_snap1
  572  gluster snapshot info vol1_snap1
  573  gluster snapshot deactivate vol1_snap1
  574  gluster snapshot info vol1_snap1
  575  gluster snapshot status vol1_snap1
  576  gluster snapshot deactivate vol1_snap1
  577  service glusterd restart
  578  gluster snapshot info snap1
  579  gluster snapshot info vol1_snap1
  580  gluster snapshot activate vol1_snap1
  581  gluster snapshot info vol1_snap1
  582  gluster snapshot status vol1_snap1
  583  gluster snapshot deactivate vol1_snap1
  584  gluster snapshot status vol1_snap1
  585  gluster snapshot activate vol1_snap1
  586  gluster snapshot status vol1_snap1
  587  gluster v restore vol1_snap1
  588  gluster snapshot restore vol1_snap1
  589  gluster v stop vol0
  590  gluster v stop vol1
  591  gluster snapshot restore vol1_snap1   -----> commit failed
  592  gluster snapshot info vol1_snap1
5. Restoring while the volume was in the started state failed with the appropriate message:
# gluster snapshot restore vol1_snap1
snapshot restore: failed: Volume (vol1) has been started. Volume needs to be stopped before restoring a snapshot.
Snapshot command failed
6. Stopped the volume:
gluster v stop vol1
Stopping volume will make its data inaccessible.
Do you want to continue? (y/n) y
volume stop: vol1: success
7. Restore the volume:
[root@snapshot14 ~]# gluster snapshot restore vol1_snap1
snapshot restore: failed: Commit failed on snapshot15.lab.eng.blr.redhat.com. Please check log file for details.
Commit failed on 10.70.40.169. Please check log file for details.
Commit failed on snapshot16.lab.eng.blr.redhat.com. Please check log file for details.
Snapshot command failed
8. Check gluster snapshot info:
[root@snapshot14 ~]# gluster snapshot info vol1_snap1
Snapshot info : failed: Operation failed
Snapshot command failed
9. Snapshot list shows the snapshot is still present:
gluster snapshot list
vol1_snap1
10. gluster snapshot status shows the bricks are not running:
[root@snapshot13 ~]# gluster snapshot status

Snap Name : vol1_snap1
Snap UUID : c7ce5846-199b-4bce-a180-08bd02f8c728

	Brick Path        : snapshot13.lab.eng.blr.redhat.com:/var/run/gluster/snaps/2c3b4ae3a7054710bb855ab88e0d2da1/brick1/b1
	Volume Group      : VolGroup0
	Brick Running     : No
	Brick PID         : N/A
	Data Percentage   : 0.14
	LV Size           : 100.00g

	Brick Path        : snapshot14.lab.eng.blr.redhat.com:/var/run/gluster/snaps/2c3b4ae3a7054710bb855ab88e0d2da1/brick2/b1
	Volume Group      : VolGroup0
	Brick Running     : No
	Brick PID         : N/A
	Data Percentage   : 0.14
	LV Size           : 100.00g

	Brick Path        : snapshot15.lab.eng.blr.redhat.com:/var/run/gluster/snaps/2c3b4ae3a7054710bb855ab88e0d2da1/brick3/b1
	Volume Group      : VolGroup0
	Brick Running     : No
	Brick PID         : N/A
	Data Percentage   : 0.20
	LV Size           : 100.00g

	Brick Path        : snapshot16.lab.eng.blr.redhat.com:/var/run/gluster/snaps/2c3b4ae3a7054710bb855ab88e0d2da1/brick4/b1
	Volume Group      : VolGroup0
	Brick Running     : No
	Brick PID         : N/A
	Data Percentage   : 0.20
	LV Size           : 100.00g

Actual results:
===============
Restoring the snapshot fails.

Expected results:
=================
Restore should be successful.

Additional info:
================
-------Part of log messages from snapshot13:-----------------------
[2014-06-09 12:00:58.240300] E [glusterd-utils.c:12395:glusterd_copy_quota_files]
0-management: /var/lib/glusterd/snaps/vol1_snap1/2c3b4ae3a7054710bb855ab88e0d2da1/quota.cksum not found
[2014-06-09 12:00:58.240359] E [glusterd-snapshot.c:7126:gd_restore_snap_volume] 0-management: Failed to restore quota files for snap vol1_snap1
[2014-06-09 12:00:58.240389] E [glusterd-snapshot.c:745:glusterd_snapshot_restore] 0-management: Failed to restore snap for vol1_snap1
[2014-06-09 12:00:58.240407] W [glusterd-snapshot.c:5884:glusterd_snapshot] 0-management: Failed to restore snapshot
[2014-06-09 12:00:58.240422] W [glusterd-mgmt.c:212:gd_mgmt_v3_commit_fn] 0-management: Snapshot Commit Failed
[2014-06-09 12:00:58.240441] E [glusterd-mgmt-handler.c:554:glusterd_handle_commit_fn] 0-management: commit failed on operation Snapshot
[2014-06-09 12:00:58.241647] I [socket.c:2246:socket_event_handler] 0-transport: disconnecting now
The message "I [MSGID: 106006] [glusterd-handler.c:4241:__glusterd_nodesvc_rpc_notify] 0-management: nfs has disconnected from glusterd." repeated 2 times between [2014-06-09 12:00:40.911003] and [2014-06-09 12:00:57.211652]

---------Part of log messages from snapshot14----------
[2014-06-09 12:00:58.956223] E [glusterd-mgmt.c:116:gd_mgmt_v3_collate_errors] 0-management: Commit failed on snapshot16.lab.eng.blr.redhat.com. Please check log file for details.
[2014-06-09 12:00:58.956419] E [glusterd-mgmt.c:1173:glusterd_mgmt_v3_commit] 0-management: Commit failed on peers
[2014-06-09 12:00:58.956574] E [glusterd-mgmt.c:1894:glusterd_mgmt_v3_initiate_snap_phases] 0-management: Commit Op Failed
[2014-06-09 12:01:28.384927] E [glusterd-snapshot.c:2456:glusterd_snapshot_get_snap_detail] 0-management: Invalid snap status
[2014-06-09 12:01:28.384977] E [glusterd-snapshot.c:2725:glusterd_handle_snapshot_info] 0-management: Failed to get snap detail of snap vol1_snap1
[2014-06-09 12:01:28.384996] W [glusterd-snapshot.c:6593:glusterd_handle_snapshot_fn] 0-management: Snapshot info failed
The message "I [MSGID: 106006] [glusterd-handler.c:4241:__glusterd_nodesvc_rpc_notify] 0-management: nfs has disconnected from glusterd." repeated 3 times between [2014-06-09 12:00:36.812226] and [2014-06-09 12:00:56.149184]
The message "I [MSGID: 106006] [glusterd-handler.c:4241:__glusterd_nodesvc_rpc_notify] 0-management: glustershd has disconnected from glusterd." repeated 2 times between [2014-06-09 12:00:37.827659] and [2014-06-09 12:00:56.149575]
[2014-06-09 12:21:39.594096] W [glusterd-snapshot.c:6489:glusterd_snapshot_postvalidate] 0-management: invalid snap command
sosreports:
===========
http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/snapshots/1107575/
Was able to reproduce the issue with the following steps (without activate/deactivate):
1) Create volume
2) Start volume
3) Create data via nfs/gluster client
4) Create snap
5) service glusterd restart
6) Restore snap (fails, asking to stop the volume first)
7) Stop volume
8) Restore snap:
[root@joeremote1 ~]# gluster snapshot restore snap1
snapshot restore: failed: Commit failed on joeremote2. Error Please check log file for details.
Snapshot command failed
9) Snap info:
[root@joeremote1 ~]# gluster snapshot info snap1
Snapshot info : failed: Operation failed
Snapshot command failed
10) When checked, the snap bricks and volume bricks are all down.

Clearly not an activate/deactivate issue. Will investigate more on why this is happening.
The log suggests that copying of the quota config files failed:

[2014-06-10 10:56:38.558608] E [glusterd-utils.c:12250:glusterd_copy_quota_files] 0-management: /var/lib/glusterd/snaps/snap1/81c37f9740054c0d9f077195f7d5f3b9/quota.cksum not found
[2014-06-10 10:56:38.558627] E [glusterd-snapshot.c:7130:gd_restore_snap_volume] 0-management: Failed to restore quota files for snap snap1
[2014-06-10 10:56:38.558642] E [glusterd-snapshot.c:744:glusterd_snapshot_restore] 0-management: Failed to restore snap for snap1
[2014-06-10 10:56:38.558651] W [glusterd-snapshot.c:5888:glusterd_snapshot] 0-management: Failed to restore snapshot
[2014-06-10 10:56:38.558658] W [glusterd-mgmt.c:222:gd_mgmt_v3_commit_fn] 0-management: Snapshot Commit Failed
[2014-06-10 10:56:38.558665] E [glusterd-mgmt-handler.c:554:glusterd_handle_commit_fn] 0-management: commit failed on operation Snapshot
[2014-06-10 10:56:38.559101] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /var/run/gluster/snaps/81c37f9740054c0d9f077195f7d5f3b9/brick2/brick1 on port 49154
[2014-06-10 10:56:38.559542] W [socket.c:522:__socket_rwv] 0-socket.management: writev on 10.70.43.166:1018 failed (Broken pipe)
[2014-06-10 10:56:38.559560] I [socket.c:2239:socket_event_handler] 0-transport: disconnecting now
[2014-06-10 10:56:38.559593] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /var/run/gluster/snaps/81c37f9740054c0d9f077195f7d5f3b9/brick3/brick2 on port 49155
[2014-06-10 10:56:38.559903] W [socket.c:522:__socket_rwv] 0-socket.management: writev on 10.70.43.166:1017 failed (Broken pipe)
[2014-06-10 10:56:38.559918] I [socket.c:2239:socket_event_handler] 0-transport: disconnecting now
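The failure mode the log points at can be illustrated with a minimal Python sketch: the restore commit copies the snapshot's quota files back into place, and a missing quota.cksum aborts the whole operation. This is only a simulation of the behavior seen in the log; the function name, file handling, and directory layout here are illustrative assumptions, not glusterd's actual C implementation.

```python
import os
import tempfile

def copy_quota_files(snap_dir, vol_dir):
    """Illustrative restore step: both quota files must be present,
    otherwise the whole commit is failed (as in the log above).
    Not glusterd's actual implementation."""
    for name in ("quota.conf", "quota.cksum"):
        src = os.path.join(snap_dir, name)
        if not os.path.exists(src):
            # glusterd logs "<path> not found" and fails the commit
            raise RuntimeError("%s not found" % src)
        with open(src, "rb") as fin, \
                open(os.path.join(vol_dir, name), "wb") as fout:
            fout.write(fin.read())

# Simulate the state after the glusterd restart: quota.conf survives,
# but quota.cksum is missing from the snap's config directory.
snap_dir = tempfile.mkdtemp()
vol_dir = tempfile.mkdtemp()
with open(os.path.join(snap_dir, "quota.conf"), "wb") as f:
    f.write(b"quota config\n")

try:
    copy_quota_files(snap_dir, vol_dir)
    print("restore commit succeeded")
except RuntimeError as err:
    print("restore commit failed:", err)
```

With quota.cksum absent, the copy raises and the "restore commit" fails, matching the "Commit failed" seen on the CLI.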
This issue is fixed by the patch http://review.gluster.org/#/c/7934/. The root cause is a quota.conf checksum mismatch after glusterd is restarted; the above patch fixes it. This bug depends on the fix for bug 1101483.
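Conceptually, the fix amounts to making the restore path tolerate a stale or missing checksum file, for example by recomputing quota.cksum from quota.conf instead of aborting the commit. A minimal Python sketch of that idea follows; the real patch is C code inside glusterd, and the helper name, CRC32 checksum scheme, and file layout here are assumptions for illustration only.

```python
import os
import tempfile
import zlib

def ensure_quota_cksum(snap_dir):
    """If quota.cksum is absent, recompute it from quota.conf so the
    restore can proceed. (Conceptual sketch of the fix; the checksum
    algorithm and layout are assumptions, not glusterd's code.)"""
    conf_path = os.path.join(snap_dir, "quota.conf")
    cksum_path = os.path.join(snap_dir, "quota.cksum")
    if not os.path.exists(conf_path):
        return False  # no quota config at all; nothing to regenerate
    if not os.path.exists(cksum_path):
        with open(conf_path, "rb") as f:
            checksum = zlib.crc32(f.read()) & 0xFFFFFFFF
        with open(cksum_path, "w") as f:
            f.write("%u\n" % checksum)
    return True

# A snap config directory that lost its quota.cksum across a restart:
snap_dir = tempfile.mkdtemp()
with open(os.path.join(snap_dir, "quota.conf"), "wb") as f:
    f.write(b"quota config\n")

ensure_quota_cksum(snap_dir)
print(os.path.exists(os.path.join(snap_dir, "quota.cksum")))
```

After the call, quota.cksum exists again and the copy step sketched earlier would no longer fail the commit.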
The patch is merged upstream. I have tested the fix and it works.
https://code.engineering.redhat.com/gerrit/#/c/26641/1
Version: glusterfs-3.6.0.16-1.el6rhs.x86_64

Repeated the steps as mentioned in 'Steps to Reproduce' and Comment 3. Could not reproduce the issue. Marking the bug as 'Verified'.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-1278.html