Description of problem:
=======================
Had a 6 node cluster with the 3.8.4-10 build, and the cluster was imported to RHGS-Console. A 2*(4+2) disperse volume 'disp' had already been created from CLI, along with a replica volume 'testvol'.

After following the sequence of steps below (during testing and debugging), finally reached a stage where glusterd failed to restart using 'systemctl'. 'glusterd -LDEBUG' however succeeded, but any attempt to stop/start glusterd using 'systemctl stop' / 'systemctl start' failed. The glusterd logs showed 'Unable to restore snapshot <snap1>'. On further debugging by Avra, the said snapshot <snap1> had half-baked data:

[root@dhcp46-239 ~]# cat snap_info
snap-id=bfd2a355-73ef-4770-8526-cfe521dc4fd7
status=1
snap-restored=0
desc=
time-stamp=1484113346

Raising a 'medium' severity BZ for now. Will update this space if I hit this again.

Sequence of steps followed:
1. A 6 node cluster with the 3.8.4-10 build, imported to RHGS-Console.
2. Configured a snapshot schedule policy from the console to take a snapshot every hour. The first snapshot to be created by the scheduler failed. Manual creation of a snapshot from the console also failed.
3. The same was attempted from CLI, and it complained "another transaction is in progress". The cmd_history.log was flooded with 'gluster volume status tasks' at a frequency of roughly once a minute. This was suspected to be a probe attempt from the console, which would take a lock on the volume.
4. Hence, moved the nodes to maintenance mode on the console and made multiple attempts to take a snapshot from CLI. Out of the 4 attempts, 2 failed with the same error 'another transaction is in progress' and 2 succeeded in taking a snapshot.
5. glusterd was started in LDEBUG mode to probe further into the cause. The glusterd logs complained about restoring a particular snapshot.
6. Took the setup back to reproduce this issue. This was when systemctl start/stop failed to start/stop glusterd. A pkill of the glusterd process followed by systemctl start also failed to get the system back to normal.
7. Left the setup for a day; found a kernel panic on one of the nodes and an lvm hang on two other nodes.
8. journalctl logs/kernel panic - was not able to make out much. Got all the nodes back up. systemctl start/stop of glusterd continued to fail.
9. Removed the said snapshot from the /var/lib/glusterd/snaps/ folder, and was able to get glusterd back online. (A rough sketch of this workaround is given below.)

Sosreports will be copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>/

Version-Release number of selected component (if applicable):
=============================================================
3.8.4-10

How reproducible:
=================
1:1
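For reference, a minimal sketch of the workaround from step 9, under the following assumptions: the half-baked snapshot is the placeholder <snap1>, glusterd uses the default working directory /var/lib/glusterd, and the snap_info shown above corresponds to the 'info' file kept under the snapshot's directory. The snapshot metadata is moved aside rather than deleted, so it can still be inspected afterwards.

# stop any running glusterd process (systemctl stop may fail in this state)
systemctl stop glusterd || pkill glusterd

# inspect the snapshot's on-disk state; snap-restored=0 here pointed to the half-baked restore
cat /var/lib/glusterd/snaps/<snap1>/info

# move the broken snapshot metadata out of glusterd's path (destination path is only an example)
mv /var/lib/glusterd/snaps/<snap1> /root/<snap1>.bak

# glusterd should now come up cleanly
systemctl start glusterd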
[qe@rhsqe-repo 1412982]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo 1412982]$ pwd
/home/repo/sosreports/1412982
[qe@rhsqe-repo 1412982]$ ls -lrt
total 355740
-rwxr-xr-x. 1 qe qe 32906176 Jan 13 15:55 sosreport-dhcp46-221.lab.eng.blr.redhat.com-20170112164030_node5.tar.xz
-rwxr-xr-x. 1 qe qe 31650592 Jan 13 15:55 sosreport-dhcp46-222.lab.eng.blr.redhat.com-20170112164030_node6.tar.xz
-rwxr-xr-x. 1 qe qe 63130736 Jan 13 15:55 sosreport-dhcp46-239.lab.eng.blr.redhat.com-20170112164030_node1.tar.xz
-rwxr-xr-x. 1 qe qe 68889584 Jan 13 15:55 sosreport-dhcp46-240.lab.eng.blr.redhat.com-20170112164030_node2.tar.xz
-rwxr-xr-x. 1 qe qe 75073036 Jan 13 15:55 sosreport-dhcp46-242.lab.eng.blr.redhat.com-20170112164030_node3.tar.xz
-rwxr-xr-x. 1 qe qe 92614860 Jan 13 15:55 sosreport-dhcp46-218.lab.eng.blr.redhat.com-20170112164030_node4.tar.xz
The fix for this ([1], [2], [3]) is already present in rhgs-3.4.0 as part of bug 1615578.

[1]: https://review.gluster.org/#/c/20747/
[2]: https://review.gluster.org/#/c/20770/
[3]: https://review.gluster.org/#/c/20854/

So closing this bug, as the fix is already present.