Bug 1412982

Summary: [Snapshot]: Glusterd fails to restart, complains "unable to restore snapshot"
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Sweta Anandpara <sanandpa>
Component: snapshot
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED CURRENTRELEASE
QA Contact: Rahul Hinduja <rhinduja>
Severity: medium
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: rcyriac, rhs-bugs, rkavunga, storage-qa-internal
Target Milestone: ---
Keywords: Reopened, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-11-19 09:40:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Sweta Anandpara 2017-01-13 10:28:33 UTC
Description of problem:
=======================

Had a 6-node cluster on the 3.8.4-10 build, imported to RHGS-Console. A 2 x (4+2) disperse volume 'disp' and a replica volume 'testvol' had already been created from the CLI. After the sequence of steps below (during testing and debugging), reached a stage where glusterd failed to restart via 'systemctl'. Starting 'glusterd -LDEBUG' from the shell, however, succeeded; only 'systemctl start' and 'systemctl stop' kept failing. The glusterd logs showed 'Unable to restore snapshot <snap1>'. On further debugging by Avra, the said snapshot <snap1> turned out to have half-baked (incomplete) on-disk data:

[root@dhcp46-239 ~]# cat snap_info 
snap-id=bfd2a355-73ef-4770-8526-cfe521dc4fd7
status=1
snap-restored=0
desc=
time-stamp=1484113346
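
Since the broken entry above was found only by manual inspection, sweeping every node for similarly half-baked entries may save time. A minimal shell sketch, assuming the default /var/lib/glusterd/snaps/<snapname>/info layout; the keys it checks are illustrative, not glusterd's authoritative schema:

#!/bin/sh
# Illustrative sweep for suspicious snapshot info files.
# The keys checked are a guessed sane minimum, not the full schema.
SNAPS_DIR=/var/lib/glusterd/snaps

for info in "$SNAPS_DIR"/*/info; do
    [ -f "$info" ] || continue
    snap=$(basename "$(dirname "$info")")
    for key in snap-id status time-stamp; do
        grep -q "^${key}=" "$info" || echo "$snap: missing ${key}"
    done
    # An empty value (e.g. 'snap-id=') is as suspect as a missing key.
    grep -q '^snap-id=$' "$info" && echo "$snap: empty snap-id"
done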

Raising a 'medium' severity BZ for now. Will update this space if I hit this again.
Sequence of steps followed:

1. A 6-node cluster with the 3.8.4-10 build, imported to RHGS-Console.
2. Configured a snapshot schedule policy from the console, to take a snapshot every hour. The first snapshot to be created by the scheduler failed. Manual creation of a snapshot from the console also failed.
3. The same thing was attempted from the CLI, and it complained "another transaction is in progress". The cmd_history.log was flooded with 'gluster volume status tasks' at a frequency of roughly once a minute. This was suspected to be a probe from the console, which would take a lock on the volume.
4. Hence, moved the nodes to maintenance mode in the console, and made multiple attempts to take a snapshot from the CLI. Out of the 4 attempts, 2 failed with the same 'another transaction is in progress' error and 2 succeeded in taking a snapshot.
5. glusterd was started at DEBUG log level ('glusterd -LDEBUG') to probe the cause further. The glusterd logs complained about restoring one particular snapshot.
6. Went back to the setup to reproduce the issue. This was when 'systemctl start'/'systemctl stop' failed to start/stop glusterd. Killing the glusterd process (pkill glusterd) and then 'systemctl start' also failed to bring the system back to normal.
7. Left the setup for a day; found a kernel panic on one of the nodes, and an LVM hang on two other nodes.
8. Could not make out much from the journalctl logs or the kernel panic. Got all the nodes back up; 'systemctl start'/'systemctl stop' of glusterd continued to fail.
9. Removed the said snapshot's directory from /var/lib/glusterd/snaps/, after which glusterd came back online (a sketch of this cleanup follows the list).
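
A minimal sketch of the step-9 recovery, assuming the default /var/lib/glusterd layout and that the half-baked snapshot's name is known from the "Unable to restore snapshot" message in the glusterd log. Moving the directory aside instead of deleting it preserves the metadata for later analysis; note that this touches only glusterd's snapshot metadata, not any backing LVM snapshot volumes:

# <snapname> is the snapshot named in the glusterd log.
pkill glusterd                    # 'systemctl stop glusterd' was failing, so stop the daemon directly
mv /var/lib/glusterd/snaps/<snapname> /root/<snapname>.bak
systemctl start glusterd
systemctl status glusterd         # confirm the daemon came back online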

Sosreports will be copied at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>/


Version-Release number of selected component (if applicable):
==========================================================
3.8.4-10


How reproducible:
================
1:1

Comment 2 Sweta Anandpara 2017-01-13 10:58:53 UTC
[qe@rhsqe-repo 1412982]$ 
[qe@rhsqe-repo 1412982]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo 1412982]$ 
[qe@rhsqe-repo 1412982]$ pwd
/home/repo/sosreports/1412982
[qe@rhsqe-repo 1412982]$ 
[qe@rhsqe-repo 1412982]$ ls -lrt
total 355740
-rwxr-xr-x. 1 qe qe 32906176 Jan 13 15:55 sosreport-dhcp46-221.lab.eng.blr.redhat.com-20170112164030_node5.tar.xz
-rwxr-xr-x. 1 qe qe 31650592 Jan 13 15:55 sosreport-dhcp46-222.lab.eng.blr.redhat.com-20170112164030_node6.tar.xz
-rwxr-xr-x. 1 qe qe 63130736 Jan 13 15:55 sosreport-dhcp46-239.lab.eng.blr.redhat.com-20170112164030_node1.tar.xz
-rwxr-xr-x. 1 qe qe 68889584 Jan 13 15:55 sosreport-dhcp46-240.lab.eng.blr.redhat.com-20170112164030_node2.tar.xz
-rwxr-xr-x. 1 qe qe 75073036 Jan 13 15:55 sosreport-dhcp46-242.lab.eng.blr.redhat.com-20170112164030_node3.tar.xz
-rwxr-xr-x. 1 qe qe 92614860 Jan 13 15:55 sosreport-dhcp46-218.lab.eng.blr.redhat.com-20170112164030_node4.tar.xz
[qe@rhsqe-repo 1412982]$ 

Comment 10 Mohammed Rafi KC 2018-11-19 09:40:37 UTC
The fixes for this issue ([1], [2], [3]) were already present in rhgs-3.4.0, delivered as part of bug 1615578.


[1] : https://review.gluster.org/#/c/20747/
[2] : https://review.gluster.org/#/c/20770/
[3] : https://review.gluster.org/#/c/20854/

So closing this bug, as the fix is already present in the current release.