Bug 1412982 - [Snapshot]: Glusterd fails to restart, complains "unable to restore snapshot"
Reported: 2017-01-13 10:28 UTC by Sweta Anandpara
Modified: 2018-11-19 09:40 UTC
Description Sweta Anandpara 2017-01-13 10:28:33 UTC
Description of problem:

Had a 6-node cluster on the 3.8.4-10 build, imported into RHGS-Console. A 2*(4+2) disperse volume 'disp' had already been created from the CLI, along with a replica volume 'testvol'. After following the sequence of steps below (during testing and debugging), reached a stage where glusterd failed to restart using 'systemctl'. Starting it directly with 'glusterd -LDEBUG' succeeded, but attempts to stop or start glusterd using 'systemctl stop' and 'systemctl start' failed. The glusterd logs showed 'Unable to restore a snapshot <snap1>'. On further debugging by Avra, the snapshot <snap1> was found to have half-baked (incomplete) data.

[root@dhcp46-239 ~]# cat snap_info 

Raising a 'medium' severity BZ for now. Will update this space if I hit this again.
Sequence of steps followed:

1. A 6-node cluster with the 3.8.4-10 build, imported to RHGS-C.
2. Configured a snapshot schedule policy from the console, to take a snapshot every hour. The first snapshot to be created by the scheduler failed. Manual creation of a snapshot from the console also failed.
3. The same was attempted from the CLI, which complained: "another transaction is in progress." The cmd_history.log was flooded with 'gluster volume status tasks' entries, roughly once a minute. This was suspected to be a probe attempt from the console, which would take a lock on the volume.
4. Hence, moved the nodes to maintenance mode in the console, and made multiple attempts to take a snapshot from the CLI. Out of the 4 attempts, 2 failed with the same error 'another transaction is in progress' and 2 succeeded in taking a snapshot.
5. glusterd was started in LDEBUG mode to probe the cause further. The glusterd logs complained about restoring a particular snapshot.
6. Took the setup back to reproduce this issue. This was when systemctl start/stop failed to start/stop glusterd. pkill <pid_of_glusterd> followed by systemctl start also failed to get the system back to normal.
7. Left the setup for a day, found a kernel panic on one of the nodes, and an LVM hang on two other nodes.
8. Could not make out much from the journalctl logs or the kernel panic. Got all the nodes back up. systemctl start/stop glusterd continued to fail.
9. Removed the said snapshot from the /var/lib/glusterd/snaps/ directory, and was able to bring glusterd back online.
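
The workaround in the last step can be sketched as a small shell helper. This is only an illustration, not the exact commands that were run: it moves the snapshot's directory aside instead of deleting it, so the half-baked data stays available for debugging. The function name quarantine_snap, the backup location, and the snapshot name snap1 are hypothetical placeholders, and it assumes glusterd is not running when the directory is moved.

```shell
#!/bin/sh
# Sketch only: move a damaged snapshot's metadata directory out of
# glusterd's snaps directory instead of deleting it.
#
# quarantine_snap SNAPS_DIR SNAP_NAME BACKUP_DIR
#   SNAPS_DIR  - directory holding per-snapshot metadata
#                (/var/lib/glusterd/snaps/ in this report)
#   SNAP_NAME  - snapshot that fails to restore (placeholder: snap1)
#   BACKUP_DIR - hypothetical location for the quarantined copy
quarantine_snap() {
    snaps_dir="$1"
    snap_name="$2"
    backup_dir="$3"

    if [ -z "$snaps_dir" ] || [ -z "$snap_name" ] || [ -z "$backup_dir" ]; then
        echo "usage: quarantine_snap SNAPS_DIR SNAP_NAME BACKUP_DIR" >&2
        return 2
    fi
    if [ ! -d "$snaps_dir/$snap_name" ]; then
        echo "no such snapshot directory: $snaps_dir/$snap_name" >&2
        return 1
    fi

    mkdir -p "$backup_dir" || return 1
    # Refuse to overwrite an earlier backup with the same name.
    if [ -e "$backup_dir/$snap_name" ]; then
        echo "backup already exists: $backup_dir/$snap_name" >&2
        return 1
    fi
    mv -- "$snaps_dir/$snap_name" "$backup_dir/$snap_name"
}

# Example (not executed here; run as root on the affected node):
#   quarantine_snap /var/lib/glusterd/snaps snap1 /root/glusterd-snap-backup
#   systemctl start glusterd
```

Moving rather than removing is a deliberate deviation from the step above: it keeps the directory contents around in case the developers ask for them later.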

Comment 10 Mohammed Rafi KC 2018-11-19 09:40:37 UTC
The fixes for this [1], [2], [3] were already included in rhgs-3.4.0 as part of bug 1615578.

[1] : https://review.gluster.org/#/c/20747/
[2] : https://review.gluster.org/#/c/20770/
[3] : https://review.gluster.org/#/c/20854/

Closing this bug, as the fix is already present.

