Bug 1107575 - [SNAPSHOT]: Restoring snapshot to a volume fails with "Commit failed on <peer_node_name>", when glusterd is restarted after creation of the snapshot
Summary: [SNAPSHOT]: Restoring snapshot to a volume fails with "Commit failed on <peer...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: snapshot
Version: rhgs-3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.0.0
Assignee: Joseph Elwin Fernandes
QA Contact: senaik
URL:
Whiteboard: SNAPSHOT
Depends On: 1098087
Blocks:
 
Reported: 2014-06-10 09:08 UTC by senaik
Modified: 2016-09-17 12:54 UTC
CC List: 9 users

Fixed In Version: glusterfs-3.6.0.16-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-09-22 19:40:55 UTC
Embargoed:


Links:
  System:       Red Hat Product Errata
  ID:           RHEA-2014:1278
  Private:      0
  Priority:     normal
  Status:       SHIPPED_LIVE
  Summary:      Red Hat Storage Server 3.0 bug fix and enhancement update
  Last Updated: 2014-09-22 23:26:55 UTC

Description senaik 2014-06-10 09:08:02 UTC
Description of problem:
=======================
Restoring a volume to a snapshot that had been activated and deactivated a couple of times failed with a "Commit failed" message, and gluster snapshot info reports "Operation failed". gluster snapshot status also shows that the bricks are not running.


Version-Release number of selected component (if applicable):
============================================================
glusterfs-3.6.0.14-1.el6rhs.x86_64

How reproducible:
=================
2/2

Steps to Reproduce:
==================
1. Create a dist-rep volume, FUSE and NFS mount the volume, and create a few files
2. Create a snapshot of the volume (vol1_snap1)
3. The following series of operations was run before attempting to restore the snapshot (a condensed reproduction sketch follows the status output in step 10):

  569  gluster snapshot create vol1_snap1 vol1
  570  gluster snapshot info
  571  gluster snapshot status vol1_snap1
  572  gluster snapshot info vol1_snap1
  573  gluster snapshot deactivate vol1_snap1
  574  gluster snapshot info vol1_snap1
  575  gluster snapshot status vol1_snap1
  576  gluster snapshot deactivate vol1_snap1
  577  service glusterd restart
  578  gluster snapshot info snap1
  579  gluster snapshot info vol1_snap1
  580  gluster snapshot activate vol1_snap1
  581  gluster snapshot info vol1_snap1
  582  gluster snapshot status vol1_snap1
  583  gluster snapshot deactivate vol1_snap1
  584  gluster snapshot status vol1_snap1
  585  gluster snapshot activate vol1_snap1
  586  gluster snapshot status vol1_snap1
  587  gluster v restore vol1_snap1
  588  gluster snapshot restore vol1_snap1
  589  gluster v stop vol0
  590  gluster v stop vol1
  591  gluster snapshot restore vol1_snap1 -----> commit failed 
  592  gluster snapshot info vol1_snap1


5. Attempted to restore the snapshot while the volume was still in the started state; it failed with the appropriate message:

# gluster snapshot restore vol1_snap1
snapshot restore: failed: Volume (vol1) has been started. Volume needs to be stopped before restoring a snapshot.
Snapshot command failed

6. Stopped the volume
 gluster v stop vol1
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: vol1: success

7. Restore the volume

[root@snapshot14 ~]# gluster snapshot restore vol1_snap1
snapshot restore: failed: Commit failed on snapshot15.lab.eng.blr.redhat.com. Please check log file for details.
Commit failed on 10.70.40.169. Please check log file for details.
Commit failed on snapshot16.lab.eng.blr.redhat.com. Please check log file for details.
Snapshot command failed

8. Check gluster snapshot info
[root@snapshot14 ~]# gluster snapshot info vol1_snap1
Snapshot info : failed: Operation failed
Snapshot command failed

9. Snapshot list shows the snapshot is still present
gluster snapshot list
vol1_snap1

10. gluster snapshot status shows the bricks are not running

[root@snapshot13 ~]# gluster snapshot status 

Snap Name : vol1_snap1
Snap UUID : c7ce5846-199b-4bce-a180-08bd02f8c728

	Brick Path        :   snapshot13.lab.eng.blr.redhat.com:/var/run/gluster/snaps/2c3b4ae3a7054710bb855ab88e0d2da1/brick1/b1
	Volume Group      :   VolGroup0
	Brick Running     :   No
	Brick PID         :   N/A
	Data Percentage   :   0.14
	LV Size           :   100.00g


	Brick Path        :   snapshot14.lab.eng.blr.redhat.com:/var/run/gluster/snaps/2c3b4ae3a7054710bb855ab88e0d2da1/brick2/b1
	Volume Group      :   VolGroup0
	Brick Running     :   No
	Brick PID         :   N/A
	Data Percentage   :   0.14
	LV Size           :   100.00g


	Brick Path        :   snapshot15.lab.eng.blr.redhat.com:/var/run/gluster/snaps/2c3b4ae3a7054710bb855ab88e0d2da1/brick3/b1
	Volume Group      :   VolGroup0
	Brick Running     :   No
	Brick PID         :   N/A
	Data Percentage   :   0.20
	LV Size           :   100.00g


	Brick Path        :   snapshot16.lab.eng.blr.redhat.com:/var/run/gluster/snaps/2c3b4ae3a7054710bb855ab88e0d2da1/brick4/b1
	Volume Group      :   VolGroup0
	Brick Running     :   No
	Brick PID         :   N/A
	Data Percentage   :   0.20
	LV Size           :   100.00g
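
A condensed reproduction sketch of the sequence above (volume and snapshot names are the ones used in this setup; --mode=script only suppresses the interactive y/n prompt):

#!/bin/bash
# Condensed reproduction sketch, following the steps reported above.
# Assumes an existing dist-rep volume "vol1" on a trusted storage pool
# with thinly provisioned bricks, mounted and populated with data.
set -x

VOL=vol1
SNAP=vol1_snap1

gluster snapshot create $SNAP $VOL            # create the snapshot
gluster snapshot deactivate $SNAP             # toggle activation a couple of times
gluster snapshot activate $SNAP
service glusterd restart                      # restart glusterd after the snapshot exists

gluster snapshot restore $SNAP                # fails: volume must be stopped first
gluster --mode=script volume stop $VOL        # stop the volume as instructed
gluster snapshot restore $SNAP                # fails with "Commit failed on <peer>"

gluster snapshot info $SNAP                   # reports "Operation failed"
gluster snapshot status $SNAP                 # shows the snapshot bricks as not running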


Actual results:
===============
Restoring the snapshot fails with "Commit failed" on the peer nodes, and gluster snapshot info subsequently reports "Operation failed".


Expected results:
=================
The snapshot restore should succeed.


Additional info:
================
-------Part of log messages from snapshot13: -----------------------

[2014-06-09 12:00:58.240300] E [glusterd-utils.c:12395:glusterd_copy_quota_files] 
0-management: /var/lib/glusterd/snaps/vol1_snap1/2c3b4ae3a7054710bb855ab88e0d2da1/quota.cksum not found
[2014-06-09 12:00:58.240359] E [glusterd-snapshot.c:7126:gd_restore_snap_volume] 0-management: Failed to restore quota files for snap vol1_snap1
[2014-06-09 12:00:58.240389] E [glusterd-snapshot.c:745:glusterd_snapshot_restore] 0-management: Failed to restore snap for vol1_snap1
[2014-06-09 12:00:58.240407] W [glusterd-snapshot.c:5884:glusterd_snapshot] 0-management: Failed to restore snapshot
[2014-06-09 12:00:58.240422] W [glusterd-mgmt.c:212:gd_mgmt_v3_commit_fn] 0-management: Snapshot Commit Failed
[2014-06-09 12:00:58.240441] E [glusterd-mgmt-handler.c:554:glusterd_handle_commit_fn] 0-management: commit failed on operation Snapshot
[2014-06-09 12:00:58.241647] I [socket.c:2246:socket_event_handler] 0-transport: disconnecting now
The message "I [MSGID: 106006] [glusterd-handler.c:4241:__glusterd_nodesvc_rpc_notify] 0-management: nfs has disconnected from glusterd." repeated 2 times between [2014-06-09 12:00:40.911003] and [2014-06-09 12:00:57.211652]



---------part of log messages from snapshot14----------

[2014-06-09 12:00:58.956223] E [glusterd-mgmt.c:116:gd_mgmt_v3_collate_errors] 0-management: Commit failed on snapshot16.lab.eng.blr.redhat.com. Please check log file for details.
[2014-06-09 12:00:58.956419] E [glusterd-mgmt.c:1173:glusterd_mgmt_v3_commit] 0-management: Commit failed on peers
[2014-06-09 12:00:58.956574] E [glusterd-mgmt.c:1894:glusterd_mgmt_v3_initiate_snap_phases] 0-management: Commit Op Failed
[2014-06-09 12:01:28.384927] E [glusterd-snapshot.c:2456:glusterd_snapshot_get_snap_detail] 0-management: Invalid snap status
[2014-06-09 12:01:28.384977] E [glusterd-snapshot.c:2725:glusterd_handle_snapshot_info] 0-management: Failed to get snap detail of snap vol1_snap1
[2014-06-09 12:01:28.384996] W [glusterd-snapshot.c:6593:glusterd_handle_snapshot_fn] 0-management: Snapshot info failed
The message "I [MSGID: 106006] [glusterd-handler.c:4241:__glusterd_nodesvc_rpc_notify] 0-management: nfs has disconnected from glusterd." repeated 3 times between [2014-06-09 12:00:36.812226] and [2014-06-09 12:00:56.149184]
The message "I [MSGID: 106006] [glusterd-handler.c:4241:__glusterd_nodesvc_rpc_notify] 0-management: glustershd has disconnected from glusterd." repeated 2 times between [2014-06-09 12:00:37.827659] and [2014-06-09 12:00:56.149575]
[2014-06-09 12:21:39.594096] W [glusterd-snapshot.c:6489:glusterd_snapshot_postvalidate] 0-management: invalid snap command
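
The entries above are from the glusterd log on each node. Assuming the default setup, where glusterd is started from /etc/glusterfs/glusterd.vol, that log is /var/log/glusterfs/etc-glusterfs-glusterd.vol.log; a quick way to pull the restore-related errors on a peer is:

# Extract the snapshot-restore related errors from the glusterd log.
# The log path assumes the default glusterd configuration; adjust if needed.
LOG=/var/log/glusterfs/etc-glusterfs-glusterd.vol.log
grep -E 'glusterd_copy_quota_files|gd_restore_snap_volume|gd_mgmt_v3_commit_fn' "$LOG" | tail -n 20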

Comment 3 Joseph Elwin Fernandes 2014-06-10 11:25:51 UTC
Was able to reproduce the issue with the following steps (without activate/deactivate):

1) Create volume 
2) Start volume 
3) Create Data via nfs/gluster client
4) Create snap 
5) service glusterd restart
6) restore snap (fails, saying the volume must be stopped)
7) Stop volume 
8) restore snap 
[root@joeremote1 ~]# gluster snapshot restore snap1
snapshot restore: failed: Commit failed on joeremote2. Error Please check log file for details.
Snapshot command failed
9) snap info
[root@joeremote1 ~]# gluster snapshot info snap1
Snapshot info : failed: Operation failed
Snapshot command failed
10) When checked, all snap bricks and volume bricks are down

Clearly not an activate/deactivate issue.

Will investigate further into why this is happening.

Comment 4 Joseph Elwin Fernandes 2014-06-10 16:12:24 UTC
The log suggests that copying the quota config files failed:

[2014-06-10 10:56:38.558608] E [glusterd-utils.c:12250:glusterd_copy_quota_files] 0-management: /var/lib/glusterd/snaps/snap1/81c37f9740054c0d9f077195f7d5f3b9/quota.cksum not found
[2014-06-10 10:56:38.558627] E [glusterd-snapshot.c:7130:gd_restore_snap_volume] 0-management: Failed to restore quota files for snap snap1
[2014-06-10 10:56:38.558642] E [glusterd-snapshot.c:744:glusterd_snapshot_restore] 0-management: Failed to restore snap for snap1
[2014-06-10 10:56:38.558651] W [glusterd-snapshot.c:5888:glusterd_snapshot] 0-management: Failed to restore snapshot
[2014-06-10 10:56:38.558658] W [glusterd-mgmt.c:222:gd_mgmt_v3_commit_fn] 0-management: Snapshot Commit Failed
[2014-06-10 10:56:38.558665] E [glusterd-mgmt-handler.c:554:glusterd_handle_commit_fn] 0-management: commit failed on operation Snapshot
[2014-06-10 10:56:38.559101] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /var/run/gluster/snaps/81c37f9740054c0d9f077195f7d5f3b9/brick2/brick1 on port 49154
[2014-06-10 10:56:38.559542] W [socket.c:522:__socket_rwv] 0-socket.management: writev on 10.70.43.166:1018 failed (Broken pipe)
[2014-06-10 10:56:38.559560] I [socket.c:2239:socket_event_handler] 0-transport: disconnecting now
[2014-06-10 10:56:38.559593] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /var/run/gluster/snaps/81c37f9740054c0d9f077195f7d5f3b9/brick3/brick2 on port 49155
[2014-06-10 10:56:38.559903] W [socket.c:522:__socket_rwv] 0-socket.management: writev on 10.70.43.166:1017 failed (Broken pipe)
[2014-06-10 10:56:38.559918] I [socket.c:2239:socket_event_handler] 0-transport: disconnecting now
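
A quick way to confirm this state on a peer is to check whether the quota files were ever written out for the snapshot. The directory layout below is taken from the paths in the log above; the snapshot volume UUID will differ per setup:

# List the quota-related files for every snapshot known to this node.
# Layout: /var/lib/glusterd/snaps/<snapname>/<snap-volume-uuid>/
for d in /var/lib/glusterd/snaps/*/*/; do
    echo "== $d"
    ls -l "${d}quota.conf" "${d}quota.cksum" 2>/dev/null || echo "   quota.conf and/or quota.cksum missing"
done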

Comment 5 Joseph Elwin Fernandes 2014-06-11 03:11:19 UTC
This issue is fixed under the patch 
http://review.gluster.org/#/c/7934/

The issue is a quota.conf checksum mismatch when glusterd is restarted. The above patch fixes the issue. This bug depends on the fix for bug 1101483.

Comment 6 Joseph Elwin Fernandes 2014-06-11 03:13:24 UTC
The patch is merged upstream. I have tested the fix and it works.

Comment 7 Joseph Elwin Fernandes 2014-06-12 07:00:45 UTC
https://code.engineering.redhat.com/gerrit/#/c/26641/1

Comment 8 senaik 2014-06-12 10:10:40 UTC
Version : glusterfs-3.6.0.16-1.el6rhs.x86_64

Repeated the steps as mentioned in 'Steps to Reproduce' and Comment 3. Could not reproduce the issue. 

Marking the bug as 'Verified'
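
For anyone re-verifying on their own cluster, it is worth first confirming that every node carries a build at or above the Fixed In Version before repeating the steps; a simple check per node:

# List the installed glusterfs packages; the fix is in glusterfs-3.6.0.16-1
# and later (see the "Fixed In Version" field above).
rpm -qa 'glusterfs*' | sort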

Comment 10 errata-xmlrpc 2014-09-22 19:40:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html

