Bug 1277414

Summary: [Snapshot]: Snapshot restore gets stuck in post-validation.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Shashank Raj <sraj>
Component: snapshot
Assignee: Avra Sengupta <asengupt>
Status: CLOSED ERRATA
QA Contact: Anil Shah <ashah>
Severity: high
Priority: high
Docs Contact:
Version: rhgs-3.1
CC: asengupt, rcyriac, rhinduja, rhs-bugs, rjoseph, sashinde, storage-qa-internal
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: RHGS 3.1.3
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.7.9-1
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1300979 (view as bug list)
Environment:
Last Closed: 2016-06-23 04:55:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1299184, 1300979, 1301030
Attachments: restore_failure_logs (flags: none)

Description Shashank Raj 2015-11-03 09:40:53 UTC
Created attachment 1088849 [details]
restore_failure_logs

Description of problem:
After consecutive restores of the snapshots, the restore operation gets stuck in post-validation.

Version-Release number of selected component (if applicable):
glusterfs-3.7.5-5

How reproducible:
1/1

Steps to Reproduce:
1.Create a tiered volume and start it
2.Create 10 snapshots of the volume.
3. Restore the snapshots one by one. During the 6th restore, the command fails with "Request timed out", and the logs show that the restore is stuck in post-validation.
4. After that, every snapshot command gets stuck and times out.
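The steps above can be sketched with the gluster CLI roughly as follows. Volume, node, and brick names are hypothetical, and the `no-timestamp` option to `snapshot create` is assumed to be available in this build:

```shell
#!/bin/sh
# Sketch of the reproducer; run as root on one of the cluster nodes.
VOL=tiervol   # hypothetical volume name

# 1. Create a volume, start it, and attach a hot tier to make it tiered
gluster volume create $VOL replica 2 node1:/bricks/b1 node2:/bricks/b2
gluster volume start $VOL
gluster volume attach-tier $VOL node1:/bricks/hot1 node2:/bricks/hot2

# 2. Take 10 snapshots of the volume
for i in $(seq 1 10); do
    gluster snapshot create snap$i $VOL no-timestamp
done

# 3. Restore the snapshots one by one (a volume must be stopped
#    before it can be restored); per the bug, the 6th iteration
#    times out and hangs in post-validation
for i in $(seq 1 10); do
    gluster --mode=script volume stop $VOL
    gluster snapshot restore snap$i
    gluster volume start $VOL
done
```

This is only a reconstruction of the reported scenario, not the reporter's exact script.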

Actual results:

After consecutive restores of the snapshots, the restore operation gets stuck in post-validation.

Expected results:

Additional info:

Logs are attached for the reference

Comment 2 Shashank Raj 2016-01-21 12:55:49 UTC
Hit this issue again with the latest build, this time on a volume cloned from a dist-rep volume: the second restore timed out, and after that all snapshot commands on that node time out.

BUILD: glusterfs-3.7.5-17

Steps followed:

1) Create a dist-replica volume and start it.
2) FUSE mount the volume and write some files from the mount point.
3) Create a snapshot of the volume and activate it.
4) Create a clone of the snapshot and mount it using FUSE.
5) Create data on the cloned volume from FUSE (file 1 to file10).
6) Create a snapshot of the cloned volume (snap1).
7) Create some more data on cloned volume from FUSE (file11 to file20).
8) Create another snapshot of the cloned volume (snap2).
9) Repeat steps 5 to 8 (until 50 files and 5 snaps).
10) Stop the cloned volume.
11) Restore the cloned volume to snap3 created above.
12) Start the volume and check for files. Cloned volume should have files from file1 to file30.
13) List the snapshots of the cloned volume. It should show all snapshots except snap3.
14) Stop the cloned volume and restore it again, this time to snap5.
15) Observe that the restore times out, and all subsequent snapshot commands time out on the node from which the restore command was issued.
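The clone-based reproducer above can be sketched roughly as follows. All volume, mount-point, node, and brick names are hypothetical, and `no-timestamp` is assumed to be available:

```shell
#!/bin/sh
# Sketch of comment 2's reproducer; run as root on a cluster node.
VOL=distrep
CLONE=clonevol
MNT=/mnt/clone   # hypothetical FUSE mount point for the clone

# 1-2. Create a dist-rep volume, start it, FUSE-mount it, write data
gluster volume create $VOL replica 2 node1:/bricks/b1 node2:/bricks/b2 \
                                     node3:/bricks/b3 node4:/bricks/b4
gluster volume start $VOL
mkdir -p /mnt/$VOL && mount -t glusterfs node1:/$VOL /mnt/$VOL
touch /mnt/$VOL/seed{1..10}

# 3-4. Snapshot the volume, activate it, clone the snapshot, mount clone
gluster snapshot create basesnap $VOL no-timestamp
gluster snapshot activate basesnap
gluster snapshot clone $CLONE basesnap
gluster volume start $CLONE
mkdir -p $MNT && mount -t glusterfs node1:/$CLONE $MNT

# 5-9. Alternate: write 10 files on the clone, snapshot it; 5 rounds
for s in $(seq 1 5); do
    start=$(( (s - 1) * 10 + 1 ))
    for f in $(seq $start $(( start + 9 ))); do touch $MNT/file$f; done
    gluster snapshot create snap$s $CLONE no-timestamp
done

# 10-13. Stop the clone, restore to snap3, verify file1..file30 exist
umount $MNT
gluster --mode=script volume stop $CLONE
gluster snapshot restore snap3
gluster volume start $CLONE
gluster snapshot list $CLONE   # snap3 is consumed by the restore

# 14. Stop and restore to snap5 -- this restore timed out per the bug
gluster --mode=script volume stop $CLONE
gluster snapshot restore snap5
```

The file-count arithmetic matches the report: after restoring to snap3 the clone should hold file1 through file30.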

sos reports from the nodes are available at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1277414

Comment 3 Avra Sengupta 2016-01-22 09:22:25 UTC
Reproducible every time with the following steps:

1. Create and start a volume and take 5 snapshots of it.
2. Stop the volume and restore it to snap1
3. Have an open fd at one of the brick backends (This step is to simulate umount failure on one of the nodes.)
4. Restore the volume to snap2.
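This minimal reproducer can be sketched as below. Names and paths are hypothetical; the open file descriptor inside the brick backend stands in for whatever keeps the umount from succeeding on one node:

```shell
#!/bin/sh
# Sketch of comment 3's minimal reproducer; run as root.
VOL=testvol
BRICK=/bricks/b1   # hypothetical backend path of a brick on this node

# 1. Create and start a volume, then take 5 snapshots of it
gluster volume create $VOL node1:$BRICK
gluster volume start $VOL
for i in $(seq 1 5); do
    gluster snapshot create snap$i $VOL no-timestamp
done

# 2. Stop the volume and restore it to snap1
gluster --mode=script volume stop $VOL
gluster snapshot restore snap1

# 3. Hold an fd open inside the brick backend; as long as fd 9 is
#    open, unmounting that brick fails, simulating the umount failure
exec 9<> $BRICK/.held_open_fd

# 4. Restore the volume to snap2 -- per the bug, this restore hangs
#    in post-validation
gluster snapshot restore snap2

exec 9>&-   # release the fd afterwards
```

The `exec 9<>` trick is just one convenient way to pin a mount with an open fd; any process with a file open under the brick path would do.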

Comment 5 Avra Sengupta 2016-03-10 09:49:43 UTC
Master URL: http://review.gluster.org/#/c/13282/ (MERGED)
Release 3.7 URL: http://review.gluster.org/#/c/13548/ (IN REVIEW)

Comment 6 Avra Sengupta 2016-03-11 08:10:04 UTC
Master URL: http://review.gluster.org/#/c/13282/ (MERGED)
Release 3.7 URL: http://review.gluster.org/#/c/13548/ (MERGED)

Comment 8 Anil Shah 2016-04-01 10:28:19 UTC
Performed consecutive snapshot restores of 15 snapshots. Did not see any failures or post-validation error messages in the logs.


Bug verified on build glusterfs-3.7.9-1.el7rhgs.x86_64

Comment 10 errata-xmlrpc 2016-06-23 04:55:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240