Bug 1085278

Summary: [SNAPSHOT]: After adding a new peer to the cluster, gluster snapshot create, delete, and restore give a Post Validation error message
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: snapshot
Version: rhgs-3.0
Target Release: RHGS 3.0.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: senaik
Assignee: Avra Sengupta <asengupt>
QA Contact: senaik
CC: asengupt, josferna, rhinduja, rhs-bugs, rjoseph, ssamanta, storage-qa-internal, vagarwal, vmallika
Whiteboard: SNAPSHOT
Fixed In Version: glusterfs-3.6.0.12-1.el6rhs
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-09-22 19:35:20 UTC
Bug Depends On: 1112250

Description senaik 2014-04-08 09:32:04 UTC
Description of problem:
=======================
While snapshot creation is in progress, attach a new peer to the cluster. Snapshot creation then fails with the following message:

snapshot create: failed: Post Validation failed on 10.70.42.227. Please check log file for details.
Snapshot command failed


Version-Release number of selected component (if applicable):
============================================================
glusterfs 3.4.1.7.snap.mar27.2014git


How reproducible:


Steps to Reproduce:
==================
1. Create a distributed-replicate volume and start it.

2. FUSE and NFS mount the volume and create some files (example commands below).

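A minimal sketch of what steps 1 and 2 could look like (hostnames, brick paths, and mount points here are illustrative, not taken from the actual test setup):

# Create and start a 2x2 distributed-replicate volume
gluster volume create vol2 replica 2 server1:/rhs/brick1/b2 server2:/rhs/brick1/b2 server3:/rhs/brick2/b2 server4:/rhs/brick2/b2
gluster volume start vol2

# FUSE and NFS (v3) mounts from a client, then create some files
mount -t glusterfs server1:/vol2 /mnt/vol2_fuse
mount -t nfs -o vers=3 server1:/vol2 /mnt/vol2_nfs
for i in {1..20}; do dd if=/dev/urandom of=/mnt/vol2_fuse/file$i bs=1M count=10; done
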
3. Create snapshots on the volume. While snapshot creation is in progress, attach a new peer to the cluster from another node.

gluster peer probe 10.70.42.227
peer probe: success. 

[root@snapshot-02 ~]# gluster peer status
Number of Peers: 4

Hostname: 10.70.43.74
Uuid: 97c4e585-0915-4a97-b610-79b10d7978e4
State: Peer in Cluster (Connected)

Hostname: 10.70.43.32
Uuid: 6cee10cf-5745-43f9-8e2d-df494fee3544
State: Peer in Cluster (Connected)

Hostname: 10.70.43.71
Uuid: 6aa084d3-9c8e-496c-afae-15144327ff22
State: Peer in Cluster (Connected)

Hostname: 10.70.42.227
Uuid: f838e4ce-04ea-4cb5-858c-1c1a9d672649
State: Peer in Cluster (Connected)


for i in {1..100} ; do gluster snapshot create snap_vol2_$i vol2 ; done
snapshot create: snap_vol2_1: snap created successfully
snapshot create: snap_vol2_2: snap created successfully
snapshot create: snap_vol2_3: snap created successfully
snapshot create: snap_vol2_4: snap created successfully
snapshot create: snap_vol2_5: snap created successfully
snapshot create: failed: Post Validation failed on 10.70.42.227. Please check log file for details.
Snapshot command failed
snapshot create: failed: Post Validation failed on 10.70.42.227. Please check log file for details.
Snapshot command failed

The snapshots are created successfully, but the command reports that post validation failed on the newly added peer.
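
The "Please check log file for details" message points at the glusterd log on the reported peer. Assuming the standard RHS log layout (the path may differ on other builds), the relevant entries can be pulled out with something like:

# on the newly probed peer (10.70.42.227)
grep -i "post validation" /var/log/glusterfs/etc-glusterfs-glusterd.vol.log | tail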

gluster snapshot list vol2
snap_vol2_1
snap_vol2_2
snap_vol2_3
snap_vol2_4
snap_vol2_5
snap_vol2_6
snap_vol2_7
snap_vol2_8
snap_vol2_9
snap_vol2_10
snap_vol2_11
snap_vol2_12


Actual results:
==============
After the peer probe succeeds, snapshot creation reports "Post Validation failed" on the newly added peer.


Expected results:
================
When a new peer is attached to the cluster while snapshot creation is in progress, snapshot creation should continue successfully with no error message shown.


Additional info:

Comment 3 senaik 2014-04-08 10:19:57 UTC
Some more issues after adding a new peer to the cluster:
========================================================
1) Snap-Delete

 gluster snapshot delete snap_vol2_35
Deleting snap will erase all the information about the snap. Do you still want to continue? (y/n) y
snapshot delete: failed: Post Validation failed on 10.70.42.227. Please check log file for details.
Snapshot command failed

The snapshot is deleted, but a "Post Validation" error message is shown.

3) Snap-restore

 gluster v stop vol2
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: vol2: success

[root@snapshot-02 ~]# gluster snapshot restore snappy 
snapshot restore: failed: Commit failed on 10.70.42.227. Please check log file for details.
Snapshot command failed

[root@snapshot-02 ~]# gluster v start vol2
volume start: vol2: success

Mounted the restored volume and checked for files. Restore is successful, but the restore command reports a failure ("Commit failed" in the output above) on the newly added peer.
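
For reference, the mount-and-check in the last step could be done like this (the client mount point is illustrative):

mount -t glusterfs 10.70.43.74:/vol2 /mnt/vol2_restored
ls /mnt/vol2_restored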

gluster v status vol2
Status of volume: vol2
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick 10.70.43.74:/var/run/gluster/snaps/3752f9b2d53d463f8f0ced3d35328f21/dev-VolGroup0-3752f9b2d53d463f8f0ced3d35328f21-brick/b2	49296	Y	13933
Brick 10.70.43.151:/var/run/gluster/snaps/3752f9b2d53d463f8f0ced3d35328f21/dev-VolGroup0-3752f9b2d53d463f8f0ced3d35328f21-brick/b2	49296	Y	21261
Brick 10.70.43.32:/var/run/gluster/snaps/3752f9b2d53d463f8f0ced3d35328f21/dev-VolGroup0-3752f9b2d53d463f8f0ced3d35328f21-brick/b2	49295	Y	22599
Brick 10.70.43.71:/var/run/gluster/snaps/3752f9b2d53d463f8f0ced3d35328f21/dev-VolGroup0-3752f9b2d53d463f8f0ced3d35328f21-brick/b2	49295	Y	24348
NFS Server on localhost					2049	Y	21273
Self-heal Daemon on localhost				N/A	Y	21280
NFS Server on 10.70.43.32				2049	Y	22611
Self-heal Daemon on 10.70.43.32				N/A	Y	22618
NFS Server on 10.70.42.227				2049	Y	20270
Self-heal Daemon on 10.70.42.227			N/A	Y	20277
NFS Server on 10.70.43.71				2049	Y	24360
Self-heal Daemon on 10.70.43.71				N/A	Y	24367
NFS Server on 10.70.43.74				2049	Y	13945
Self-heal Daemon on 10.70.43.74				N/A	Y	13952
 
Task Status of Volume vol2
------------------------------------------------------------------------------
There are no active volume tasks

Comment 5 Nagaprasad Sathyanarayana 2014-04-21 06:18:13 UTC
Marking snapshot BZs to RHS 3.0.

Comment 6 Avra Sengupta 2014-05-26 07:27:39 UTC
Fixed with http://review.gluster.org/7525

Comment 7 senaik 2014-06-23 06:40:16 UTC
Version : glusterfs 3.6.0.20 built on Jun 19 2014
========

Marking this bug as a dependent of bz 1104478, as we are getting the error message "glusterd quorum not met" when a new node is attached to the cluster.

snapshot create: success: Snap snap4 created successfully
snapshot create: failed: glusterds are not in quorum
Snapshot command failed
snapshot create: success: Snap snap6 created successfully

All glusterd instances were up and running on the nodes, but we still get the message that glusterd quorum is not met.
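
(For reference, a check that glusterd is up on a node, assuming RHEL 6 based RHS servers, could look like the following.)

service glusterd status
pgrep -l glusterd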

----------------Part of log---------------------

name:snapshot15.lab.eng.blr.redhat.com
[2014-06-23 06:03:31.887252] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: 7e97d0f0-8ae9-40eb-b822-952cc5a8dc46, hostname:10.70.44.54
[2014-06-23 06:03:32.166226] W [glusterd-utils.c:12909:glusterd_snap_quorum_check_for_create] 0-management: glusterds are not in quorum
[2014-06-23 06:03:32.166352] W [glusterd-utils.c:13058:glusterd_snap_quorum_check] 0-management: Quorum checkfailed during snapshot create command
[2014-06-23 06:03:32.166374] W [glusterd-mgmt.c:1846:glusterd_mgmt_v3_initiate_snap_phases] 0-management: quorum check failed
[2014-06-23 06:03:32.166416] W [glusterd-snapshot.c:7012:glusterd_snapshot_postvalidate] 0-management: Snapshot create post-validation failed
[2014-06-23 06:03:32.166433] W [glusterd-mgmt.c:248:gd_mgmt_v3_post_validate_fn] 0-management: postvalidate operation failed
[2014-06-23 06:03:32.166451] E [glusterd-mgmt.c:1335:glusterd_mgmt_v3_post_validate] 0-management: Post Validation failed for operation Snapshot on local node
[2014-06-23 06:03:32.166467] E [glusterd-mgmt.c:1944:glusterd_mgmt_v3_initiate_snap_phases] 0-management: Post Validation Failed
[2014-06-23 06:03:33.972792] I [glusterd-handshake.c:1014:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30000

Comment 8 senaik 2014-06-24 06:05:14 UTC
Raised a new bz to track the issue mentioned in Comment 7.

Marking this bug as dependent on bz 1112250.

Comment 9 Avra Sengupta 2014-07-24 07:51:02 UTC
Removing 1114403 from the dependency list, as it's a clone of 1112250.

Comment 10 Vijaikumar Mallikarjuna 2014-08-08 07:20:11 UTC
I verified this bug by executing the steps mentioned in the description and didn't find any issues:

Created a 2 x 2 volume:

[root@snapshot-01 ~]# gluster pool list
UUID					Hostname	State
bd1f458d-09cf-481d-a0b8-dff4a8afb8d0	10.70.42.209	Disconnected 
a90793ca-58a4-429e-b39b-5ad1b88dafa7	localhost	Connected 

[root@snapshot-01 ~]# gluster v i
Volume Name: vol1
Type: Distributed-Replicate
Volume ID: ad2a01be-c045-412e-9c84-0696492beb19
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: s1:/rhs/brick1/dir
Brick2: s3:/brick0/dir
Brick3: s1:/rhs/brick2/dir
Brick4: s3:/brick1/dir
Options Reconfigured:
features.barrier: disable
performance.readdir-ahead: on
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256


From one terminal, started snapshot creation in a loop:
[root@snapshot-03 ~]# for  i in {21..30}; do gluster snap create snap$i vol1; done
snapshot create: success: Snap snap21 created successfully
snapshot create: success: Snap snap22 created successfully
snapshot create: success: Snap snap23 created successfully
snapshot create: success: Snap snap24 created successfully
snapshot create: success: Snap snap25 created successfully
snapshot create: success: Snap snap26 created successfully
snapshot create: success: Snap snap27 created successfully
snapshot create: success: Snap snap28 created successfully
snapshot create: success: Snap snap29 created successfully
snapshot create: success: Snap snap30 created successfully

From another terminal, attached a new peer:
[root@snapshot-03 ~]# gluster peer probe s4
peer probe: success. 
[root@snapshot-03 ~]# gluster pool list
UUID					Hostname	State
a90793ca-58a4-429e-b39b-5ad1b88dafa7	10.70.42.16	Connected 
f1c5bfa4-997a-4c7e-990e-a45e68bb3c11	s4       	Connected 
bd1f458d-09cf-481d-a0b8-dff4a8afb8d0	localhost	Connected 



Marking the bug as verified.

Comment 12 errata-xmlrpc 2014-09-22 19:35:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html