After OCS uninstall, Gluster-blockd fails to start with "targetcli doesn't list user:glfs handler" on re-install
Description of problem:
We had an OCP 3.11.51 + OCS 3.11.1 (gluster-block-0.2.1-30.el7rhgs.x86_64) setup with both glusterfs and glusterfs-registry clusters.
Since block creations were failing (BZ#1660280), we had to use the workaround to get block creations working again in OCP 3.11.51.
OCS was uninstalled and the OCS playbook was then re-run to install both glusterfs (app-storage) and glusterfs-registry. The pods in the glusterfs app-storage namespace came up successfully, but one of the glusterfs-registry pods failed to come up with the following error message:
Warning Unhealthy 4s (x23 over 9m) kubelet, dhcp46-153.lab.eng.blr.redhat.com Readiness probe failed: /usr/local/bin/status-probe.sh
failed check: systemctl -q is-active gluster-blockd.service
On checking further, we saw the following error message in the gluster-blockd logs:
[2018-12-19 08:49:06.940246] ERROR: tcmu-runner running, but targetcli doesn't list user:glfs handler [at gluster-blockd.c+391 :<blockNodeSanityCheck>]
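The error above means tcmu-runner is up but its glfs handler is not visible in the target configuration. A minimal sketch of the manual check, written as a function that reads `targetcli ls /backstores` output on stdin (the node name `user:glfs` is taken from the error message itself; the exact tree layout of targetcli output may vary between versions):

```shell
# Succeeds when a user:glfs backstore node appears in the
# `targetcli ls /backstores` output fed on stdin.
has_glfs_handler() {
    grep -q 'user:glfs'
}

# On a live gluster node this would be driven as:
#   targetcli ls /backstores | has_glfs_handler && echo "glfs handler present"
```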
Steps to Reproduce:
1. Created an OCP 3.11.51 + OCS 3.11.1 setup with both glusterfs and glusterfs-registry pods
2. Block device creations were failing due to BZ#1660280
3. Uninstalled both the glusterfs and glusterfs-registry clusters with the wipe option:
ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/uninstall.yml -e "openshift_storage_glusterfs_wipe=True"
4. This setup had some old data in the /etc/target folder of all the gluster nodes, so the contents of /etc/target were deleted.
5. Edited the template to use the workaround mentioned in the mail thread for BZ#1660280
6. Re-installed OCS with both glusterfs and infra-storage (glusterfs-registry) pods.
7. All pods in the glusterfs namespace came up fine, but one of the 3 glusterfs-registry pods in infra-storage failed to reach the Ready 1/1 state.
The gluster-blockd liveness and readiness probes were failing because the gluster-blockd service had failed to start on the affected pod.
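Step 4 above (clearing the stale /etc/target contents on each gluster node) can be sketched as below. The directory is parameterized here purely so the logic can be exercised safely; on a real node it would be /etc/target:

```shell
# Remove everything inside the saved-target-config directory (e.g.
# saveconfig.json and its backups) while keeping the directory itself.
clear_target_config() {
    local dir="$1"
    # -mindepth 1 spares the directory; -delete removes depth-first.
    find "$dir" -mindepth 1 -delete
}

# On a real node: clear_target_config /etc/target   (run on every gluster node)
```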
1. Checked the tcmu-runner, gluster-block-target, and gluster-blockd services. Only the tcmu-runner service was running.
2. Checked lsmod for target_core_user; it is loaded:
lsmod | grep target_core_user
target_core_user 35043 4
target_core_mod 342480 12 target_core_iblock,target_core_pscsi,iscsi_target_mod,target_core_file,target_core_user
uio 19338 1 target_core_user
3. Xiubo checked on the gluster node and confirmed that even after cleaning the config (uninstall), some files in /sys/kernel/config/target/core/ still existed.
These leftovers prevented gluster-blockd from starting on the new install.
[root@dhcp46-153 ~]# ls -ltrh /sys/kernel/config/target/core
drwxr-xr-x. 3 root root 0 Dec 17 14:53 alua
drwxr-xr-x. 3 root root 0 Dec 20 14:55 user_3 <---- extra files
drwxr-xr-x. 3 root root 0 Dec 20 14:55 user_1
More details are shared in the next comment.
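The cleanup this finding implies can be sketched as follows. configfs entries cannot be removed with rm; they are torn down with rmdir, innermost entries first. This is only a sketch: on a live system the storage objects should normally be deconfigured through targetcli rather than raw rmdir, and the core directory is parameterized so the traversal can be tested on an ordinary directory tree (the real path is /sys/kernel/config/target/core):

```shell
# Remove leftover user_* backstore groups from the target configfs tree,
# deleting each group's storage-object subdirectories before the group
# itself, since configfs directories must be empty to rmdir.
remove_stale_user_backstores() {
    local core_dir="${1:-/sys/kernel/config/target/core}"
    local grp obj
    for grp in "$core_dir"/user_*; do
        [ -d "$grp" ] || continue
        for obj in "$grp"/*; do
            [ -d "$obj" ] && rmdir "$obj"
        done
        rmdir "$grp"
    done
}
```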
How reproducible:
Only once, and only for one pod out of the 3 in infra-storage.
Actual results:
All OCS pods failed to come up in the infra-storage (glusterfs-registry) namespace because one of the gluster pods reported issues with gluster-blockd.
Expected results:
Since the uninstall was done and the config was cleaned up, re-installing OCS should have brought up all the gluster pods with both glusterd and gluster-blockd in a running state.
As discussed at this comment https://bugzilla.redhat.com/show_bug.cgi?id=1661388#c4, this is not a sane case, and hence we really don't have to worry about it :-)
Marking the 'qe-test-coverage' flag as '-' based on https://bugzilla.redhat.com/show_bug.cgi?id=1661388#c9