Bug 1277414
Summary: | [Snapshot]: Snapshot restore stucks in post validation. | | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Shashank Raj <sraj>
Component: | snapshot | Assignee: | Avra Sengupta <asengupt>
Status: | CLOSED ERRATA | QA Contact: | Anil Shah <ashah>
Severity: | high | Docs Contact: |
Priority: | high
Version: | rhgs-3.1 | CC: | asengupt, rcyriac, rhinduja, rhs-bugs, rjoseph, sashinde, storage-qa-internal
Target Milestone: | --- | Keywords: | Triaged, ZStream
Target Release: | RHGS 3.1.3
Hardware: | x86_64
OS: | Linux
Whiteboard: |
Fixed In Version: | glusterfs-3.7.9-1 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: |
Clones: | 1300979 | Environment: |
Last Closed: | 2016-06-23 04:55:54 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: |
Bug Depends On: |
Bug Blocks: | 1299184, 1300979, 1301030
Attachments: |
Hit this issue again with the latest build, on a volume cloned from a distributed-replicate volume: the second restore timed out, and after that every snapshot command on that node times out.

BUILD: glusterfs-3.7.5-17

Steps followed:
1) Create a distributed-replicate volume and start it.
2) FUSE-mount the volume and write some files from the mount point.
3) Create a snapshot of the volume and activate it.
4) Create a clone of the snapshot and mount it using FUSE.
5) Create data on the cloned volume from the FUSE mount (file1 to file10).
6) Create a snapshot of the cloned volume (snap1).
7) Create more data on the cloned volume from the FUSE mount (file11 to file20).
8) Create another snapshot of the cloned volume (snap2).
9) Repeat steps 5 to 8, continuing the file numbering, until there are 50 files and 5 snapshots.
10) Stop the cloned volume.
11) Restore the cloned volume to snap3.
12) Start the volume and check the files. The cloned volume should contain file1 to file30.
13) List the snapshots of the cloned volume. Every snapshot except snap3 should be listed.
14) Stop the cloned volume and restore it again, this time to snap5.
15) Observe that the restore times out, and that every subsequent snapshot command times out on the node from which the restore command was issued.

sosreports from the nodes are available at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1277414

Reproducible every time with the following steps:
1. Create and start a volume and take 5 snapshots of it.
2. Stop the volume and restore it to snap1.
3. Keep an fd open on one of the brick backends (this simulates an umount failure on one of the nodes).
4. Restore the volume to snap2.
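The four-step reproducer can be scripted roughly as follows. This is a hypothetical sketch, not taken from the report: it assumes the named volume already exists and is started, that the `gluster` CLI is on the PATH of the node you run it from, and that the caller supplies a function (hypothetical, here `find_brick_file`) returning the path of any file on one brick's backend once the volume has been restored to snap1, for example a path located with `gluster volume info`. With the default `dry_run=True` it only prints and records the commands.

```python
import subprocess
from typing import Callable, List


def gluster(*args: str) -> List[str]:
    """Build a gluster CLI command; --mode=script auto-confirms y/n prompts."""
    return ["gluster", "--mode=script", *args]


def run(cmd: List[str], log: List[str], dry_run: bool, timeout: int = 300) -> None:
    """Record and print the command; execute it only when dry_run is False."""
    log.append(" ".join(cmd))
    print(log[-1])
    if not dry_run:
        # A restore that reports "Request timed out" exits non-zero and raises
        # CalledProcessError; one that hangs past `timeout` seconds raises
        # subprocess.TimeoutExpired instead.
        subprocess.run(cmd, check=True, timeout=timeout)


def reproduce(volume: str, find_brick_file: Callable[[], str],
              dry_run: bool = True) -> List[str]:
    """Replay the 4-step reproducer and return the commands in order.

    `find_brick_file` is a hypothetical caller-supplied callback; it is only
    invoked after the restore to snap1, because restore changes brick paths.
    """
    log: List[str] = []
    # 1. Take 5 snapshots of the started volume; no-timestamp keeps the names
    #    as snap1..snap5 instead of appending a GMT timestamp.
    for i in range(1, 6):
        run(gluster("snapshot", "create", f"snap{i}", volume, "no-timestamp"),
            log, dry_run)
    # 2. Stop the volume and restore it to snap1.
    run(gluster("volume", "stop", volume), log, dry_run)
    run(gluster("snapshot", "restore", "snap1"), log, dry_run)
    # 3. Hold an fd open on a brick backend so that its umount fails.
    with open(find_brick_file(), "rb"):
        # 4. Restore to snap2 while the fd is still open.
        run(gluster("snapshot", "restore", "snap2"), log, dry_run)
    return log
```

In a dry run, `reproduce("testvol", lambda: "/dev/null")` prints the eight CLI commands without touching the cluster. The `with` block guarantees the fd is released once the second restore returns or raises.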
Master URL: http://review.gluster.org/#/c/13282/ (MERGED)
Release 3.7 URL: http://review.gluster.org/#/c/13548/ (IN REVIEW)

Master URL: http://review.gluster.org/#/c/13282/ (MERGED)
Release 3.7 URL: http://review.gluster.org/#/c/13548/ (MERGED)

Performed consecutive snapshot restores of 15 snapshots. No failures and no "post validation failed" error messages were seen in the logs. Bug verified on build glusterfs-3.7.9-1.el7rhgs.x86_64.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240
Created attachment 1088849: restore_failure_logs

Description of problem:
After successive restores of snapshots, the restore gets stuck in post-validation.

Version-Release number of selected component (if applicable):
glusterfs-3.7.5-5

How reproducible:
1/1

Steps to Reproduce:
1. Create a tiered volume and start it.
2. Create 10 snapshots of the volume.
3. Restore the snapshots one by one. During the 6th restore, the command fails with "Request timed out", and the logs show that the restore is stuck in post-validation.
4. After that, every snapshot command hangs and eventually times out.

Actual results:
After successive restores of snapshots, the restore gets stuck in post-validation.

Expected results:

Additional info:
Logs are attached for reference.
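The steps to reproduce can be expressed as a command sequence. This is a sketch only, under stated assumptions: the tiered volume (called `tiervol` below, a hypothetical name) already exists and is started, the `gluster` CLI is available on the local node, and each restore is wrapped in a stop/restore/start cycle, which the report does not spell out. The helper names `restore_sequence` and `run_all` are illustrative.

```python
import subprocess
from typing import List, Optional, Tuple


def gluster(*args: str) -> List[str]:
    """Build a gluster CLI command; --mode=script auto-confirms y/n prompts."""
    return ["gluster", "--mode=script", *args]


def restore_sequence(volume: str, count: int = 10) -> List[List[str]]:
    """Take `count` snapshots, then restore them one by one, snap1 first."""
    cmds = [gluster("snapshot", "create", f"snap{i}", volume, "no-timestamp")
            for i in range(1, count + 1)]
    for i in range(1, count + 1):
        # Restore requires a stopped volume; start it again afterwards
        # (assumed cycle, not stated in the report).
        cmds.append(gluster("volume", "stop", volume))
        cmds.append(gluster("snapshot", "restore", f"snap{i}"))
        cmds.append(gluster("volume", "start", volume))
    return cmds


def run_all(cmds: List[List[str]],
            timeout: int = 300) -> Optional[Tuple[int, Exception]]:
    """Run commands in order; return (1-based index, error) of the first failure."""
    for n, cmd in enumerate(cmds, 1):
        try:
            subprocess.run(cmd, check=True, timeout=timeout)
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as err:
            return n, err
    return None
```

On a disposable test cluster, `run_all(restore_sequence("tiervol"))` stops at the first restore that fails or hangs instead of issuing further snapshot commands against a stuck glusterd. Passing `count=15` approximates the verification run of 15 consecutive restores.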