Description of problem:
Had a 6-node cluster on the 3.8.4-10 build, imported to the RHGS Console. A 2 x (4+2) disperse volume 'disp' had already been created from the CLI, along with a replica volume 'testvol'. After following the sequence of steps below (during testing and debugging), glusterd eventually failed to restart via 'systemctl'. Running 'glusterd -LDEBUG' succeeded, but attempts to stop/start glusterd with 'systemctl stop' and 'systemctl start' kept failing. The glusterd logs showed 'Unable to restore a snapshot <snap1>'. On further debugging by Avra, the said snapshot <snap1> was found to have half-baked (partially written) data.
[root@dhcp46-239 ~]# cat snap_info
Raising a 'medium' severity BZ for now. Will update this space if I hit this again.
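The failure signature described above can be spotted by grepping the glusterd log after a failed start. A minimal sketch against a simulated log excerpt (the sample lines, timestamps, and file names are illustrative, not copied from the actual setup):

```shell
# Simulated excerpt of /var/log/glusterfs/glusterd.log (illustrative lines only).
log=$(mktemp)
cat > "$log" <<'EOF'
[2017-01-12 16:40:01] E [glusterd-snapshot.c] 0-management: Failed to restore snapshot snap1
[2017-01-12 16:40:01] E [glusterd.c] 0-management: Unable to restore snapshot
EOF
# glusterd refuses to come up if an entry under /var/lib/glusterd/snaps/
# cannot be restored, so this is the line to look for after a failed start.
grep -ic 'restore snapshot' "$log"
rm -f "$log"
```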
Sequence of steps followed:
1. Set up a 6-node cluster with the 3.8.4-10 build and imported it to RHGS Console.
2. Configured a snapshot schedule policy from the console, to take a snapshot every hour. The first snapshot to be created by the scheduler failed. Manual creation of a snapshot from the console also failed.
3. The same was attempted from the CLI, which failed with "another transaction is in progress". The cmd_history.log was flooded with 'gluster volume status tasks' roughly every minute. This was suspected to be a probe from the console, which would take a lock on the volume.
4. Hence, moved the nodes to maintenance mode on the console and made multiple attempts to take a snapshot from the CLI. Out of 4 attempts, 2 failed with the same error 'another transaction is in progress' and 2 succeeded in taking a snapshot.
5. glusterd was started with -LDEBUG to probe further into the cause. The glusterd logs complained about restoring a particular snapshot.
6. Took the setup back to reproduce this issue. This was when systemctl start/stop failed to start/stop glusterd. Killing glusterd (pkill glusterd) followed by systemctl start also failed to get the system back to normal.
7. Left the setup for a day; found a kernel panic on one of the nodes and an LVM hang on two other nodes.
8. Could not make out much from the journalctl logs or the kernel panic. Got all the nodes back up; systemctl start/stop of glusterd continued to fail.
9. Removed the said snapshot from the /var/lib/glusterd/snaps/ directory, after which glusterd came back online.
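The recovery in the final step can be sketched as follows. This is a hedged outline, assuming the half-baked snapshot is named snap1 and the default glusterd working directory; it moves the metadata aside rather than deleting it, so the broken snapshot remains available for debugging. Do not run this against a healthy cluster.

```shell
#!/bin/sh
# Sketch of the workaround: remove a half-baked snapshot from glusterd's view
# so that restore is not attempted on the next start. The snapshot name
# "snap1" and the default /var/lib/glusterd path are assumptions.
SNAP=${1:-snap1}
SNAPDIR="/var/lib/glusterd/snaps/$SNAP"

systemctl stop glusterd          # may fail or hang, as observed in step 6
# Move the incomplete snapshot metadata out of /var/lib/glusterd/snaps/,
# keeping a copy outside that directory for later analysis.
mv "$SNAPDIR" "/var/lib/glusterd/snaps-broken-$SNAP"
systemctl start glusterd
systemctl status glusterd --no-pager
```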
Sosreports will be copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>/
Version-Release number of selected component (if applicable):
[qe@rhsqe-repo 1412982]$ hostname
[qe@rhsqe-repo 1412982]$ pwd
[qe@rhsqe-repo 1412982]$ ls -lrt
-rwxr-xr-x. 1 qe qe 32906176 Jan 13 15:55 sosreport-dhcp46-221.lab.eng.blr.redhat.com-20170112164030_node5.tar.xz
-rwxr-xr-x. 1 qe qe 31650592 Jan 13 15:55 sosreport-dhcp46-222.lab.eng.blr.redhat.com-20170112164030_node6.tar.xz
-rwxr-xr-x. 1 qe qe 63130736 Jan 13 15:55 sosreport-dhcp46-239.lab.eng.blr.redhat.com-20170112164030_node1.tar.xz
-rwxr-xr-x. 1 qe qe 68889584 Jan 13 15:55 sosreport-dhcp46-240.lab.eng.blr.redhat.com-20170112164030_node2.tar.xz
-rwxr-xr-x. 1 qe qe 75073036 Jan 13 15:55 sosreport-dhcp46-242.lab.eng.blr.redhat.com-20170112164030_node3.tar.xz
-rwxr-xr-x. 1 qe qe 92614860 Jan 13 15:55 sosreport-dhcp46-218.lab.eng.blr.redhat.com-20170112164030_node4.tar.xz
The fix for this was already present in rhgs-3.4.0, as part of bug 1615578.
 : https://review.gluster.org/#/c/20747/
 : https://review.gluster.org/#/c/20770/
 : https://review.gluster.org/#/c/20854/
Closing this bug, as the fix is already present in rhgs-3.4.0.