After OCS uninstall, Gluster-blockd fails to start with "targetcli doesn't list user:glfs handler" on re-install
Description of problem:
We had an OCP 3.11.51 + OCS 3.11.1 (gluster-block-0.2.1-30.el7rhgs.x86_64) setup with both glusterfs and glusterfs-registry clusters.
Since block creations were failing (BZ#1660280), we had to use the workaround to get block creations working again in OCP 3.11.51.
OCS was uninstalled and the OCS playbook was then re-run to install both glusterfs (app-storage) and glusterfs-registry. The pods in the glusterfs app-storage namespace came up successfully, but one of the glusterfs-registry pods failed to come up with the following error message:
Warning Unhealthy 4s (x23 over 9m) kubelet, dhcp46-153.lab.eng.blr.redhat.com Readiness probe failed: /usr/local/bin/status-probe.sh
failed check: systemctl -q is-active gluster-blockd.service
On checking further, we saw the following error message in the gluster-blockd logs:
[2018-12-19 08:49:06.940246] ERROR: tcmu-runner running, but targetcli doesn't list user:glfs handler [at gluster-blockd.c+391 :<blockNodeSanityCheck>]
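The error above means tcmu-runner is up but its glfs handler is not visible in the target configuration. A minimal sketch of the manual check, written as a function that reads `targetcli ls /backstores` output on stdin (the node name `user:glfs` is taken from the error message itself; the exact tree layout of targetcli output may vary between versions):

```shell
# Succeeds when a user:glfs backstore node appears in the
# `targetcli ls /backstores` output fed on stdin.
has_glfs_handler() {
    grep -q 'user:glfs'
}

# On a live gluster node this would be driven as:
#   targetcli ls /backstores | has_glfs_handler && echo "glfs handler present"
```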
Steps to Reproduce:
1. Created an OCP 3.11.51 + OCS 3.11.1 setup with both glusterfs and glusterfs-registry pods
2. Block device creations were failing due to BZ#1660280
3. Uninstalled both the glusterfs and glusterfs-registry clusters with the wipe option:
ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/uninstall.yml -e "openshift_storage_glusterfs_wipe=True"
4. This setup had some old data in the /etc/target folder of all the gluster nodes, so the contents of /etc/target were deleted.
5. Edited the template to use the workaround mentioned in the mail thread for BZ#1660280
6. Re-installed OCS with both glusterfs and infra-storage (glusterfs-registry) pods.
7. All pods in the glusterfs namespace came up fine, but one of the 3 glusterfs-registry pods in infra-storage failed to reach the Ready 1/1 state.
The gluster-blockd liveness and readiness probes were failing because the gluster-blockd service had failed to start on the affected pod.
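Step 4 above (clearing the stale /etc/target contents on each gluster node) can be sketched as below. The directory is parameterized here purely so the logic can be exercised safely; on a real node it would be /etc/target:

```shell
# Remove everything inside the saved-target-config directory (e.g.
# saveconfig.json and its backups) while keeping the directory itself.
clear_target_config() {
    local dir="$1"
    # -mindepth 1 spares the directory; -delete removes depth-first.
    find "$dir" -mindepth 1 -delete
}

# On a real node: clear_target_config /etc/target   (run on every gluster node)
```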
1. Checked the tcmu-runner, gluster-block-target, and gluster-blockd services. Only the tcmu-runner service was running.
2. Checked lsmod for target_core_user; it is loaded:
lsmod | grep target_core_user
target_core_user 35043 4
target_core_mod 342480 12 target_core_iblock,target_core_pscsi,iscsi_target_mod,target_core_file,target_core_user
uio 19338 1 target_core_user
3. Xiubo checked on the gluster node and confirmed that even after cleaning the config (uninstall), some files in /sys/kernel/config/target/core/ still existed.
These leftovers prevented gluster-blockd from starting on the new install.
[root@dhcp46-153 ~]# ls -ltrh /sys/kernel/config/target/core
drwxr-xr-x. 3 root root 0 Dec 17 14:53 alua
drwxr-xr-x. 3 root root 0 Dec 20 14:55 user_3 <---- extra files
drwxr-xr-x. 3 root root 0 Dec 20 14:55 user_1
More details are shared in the next comment.
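The cleanup this finding implies can be sketched as follows. configfs entries cannot be removed with rm; they are torn down with rmdir, innermost entries first. This is only a sketch: on a live system the storage objects should normally be deconfigured through targetcli rather than raw rmdir, and the core directory is parameterized so the traversal can be tested on an ordinary directory tree (the real path is /sys/kernel/config/target/core):

```shell
# Remove leftover user_* backstore groups from the target configfs tree,
# deleting each group's storage-object subdirectories before the group
# itself, since configfs directories must be empty to rmdir.
remove_stale_user_backstores() {
    local core_dir="${1:-/sys/kernel/config/target/core}"
    local grp obj
    for grp in "$core_dir"/user_*; do
        [ -d "$grp" ] || continue
        for obj in "$grp"/*; do
            [ -d "$obj" ] && rmdir "$obj"
        done
        rmdir "$grp"
    done
}
```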
How reproducible:
Only once, and only for one pod out of the 3 in infra-storage.
Actual results:
All OCS pods failed to come up in the infra-storage (glusterfs-registry) namespace because one of the gluster pods reported issues with gluster-blockd.
Expected results:
Since the uninstall was done and the config was cleaned up, re-installing OCS should have brought up all the gluster pods with both glusterd and gluster-blockd in a running state.
As discussed at this comment https://bugzilla.redhat.com/show_bug.cgi?id=1661388#c4, this is not a sane case, and hence we really don't have to worry about it :-)
Marking the 'qe-test-coverage' flag as '-' based on https://bugzilla.redhat.com/show_bug.cgi?id=1661388#c9