Bug 1661388

Summary:	After OCS uninstall, Gluster-blockd fails to start with "targetcli doesn't list user:glfs handler" on re-install
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Neha Berry <nberry>
Component:	gluster-block	Assignee:	Prasanna Kumar Kalever <prasanna.kalever>
Status:	CLOSED WONTFIX	QA Contact:	Rahul Hinduja <rhinduja>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	ocs-3.11	CC:	bgoyal, kramdoss, nberry, pkarampu, pprakash, prasanna.kalever, rhs-bugs, sankarshan, vbellur, vinug, xiubli
Target Milestone:	---	Keywords:	ZStream
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-02-07 08:27:38 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Neha Berry 2018-12-21 04:50:26 UTC


Description of problem:
========================
We had an OCP 3.11.51 + OCS 3.11.1 setup (gluster-block-0.2.1-30.el7rhgs.x86_64) with glusterfs and glusterfs-registry clusters.
Because block creations were failing (BZ#1660280), we had to use a workaround to get block creation working on OCP 3.11.51.

OCS was then uninstalled, and the OCS playbook was re-run to install both glusterfs-app-storage and glusterfs-registry. The pods in the glusterfs app-storage namespace came up successfully, but one of the glusterfs-registry pods failed to come up with the following error:

  Warning  Unhealthy  4s (x23 over 9m)  kubelet, dhcp46-153.lab.eng.blr.redhat.com  Readiness probe failed: /usr/local/bin/status-probe.sh
failed check: systemctl -q is-active gluster-blockd.service
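
The probe above only checks gluster-blockd. When triaging a pod in this state it can help to see all three related services at once. A minimal sketch, assuming the unit names tcmu-runner.service, gluster-block-target.service and gluster-blockd.service (the services named in the notes below) and that it runs where systemctl is available, e.g. inside the gluster pod; report_block_services is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the systemd state of each service gluster-block relies on.
# Assumes the unit names below exist where this runs (e.g. inside the gluster pod).
report_block_services() {
  local svc state
  for svc in tcmu-runner gluster-block-target gluster-blockd; do
    # `systemctl is-active` prints the state (active, inactive, failed, ...) and
    # exits non-zero when the unit is not active; only the printed text is kept.
    state=$(systemctl is-active "${svc}.service" 2>/dev/null) || true
    printf '%s: %s\n' "$svc" "${state:-unknown}"
  done
}
```

Run report_block_services inside the affected pod (for example after `oc rsh` into it). In the situation described in this report, only tcmu-runner would be expected to show as active.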


On checking further, we saw the following error in the gluster-blockd logs:


[2018-12-19 08:49:06.940246] ERROR: tcmu-runner running, but targetcli doesn't list user:glfs handler [at gluster-blockd.c+391 :<blockNodeSanityCheck>]
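
The message above comes from gluster-blockd's startup sanity check (blockNodeSanityCheck). The sketch below is only a rough approximation of that condition for manual checking, not gluster-blockd's actual code. It assumes targetcli's one-shot command mode (`targetcli /backstores ls`) and that the listing contains the string "user:glfs" once tcmu-runner has registered its glfs handler; check_glfs_handler is a hypothetical name:

```shell
#!/usr/bin/env bash
# Hypothetical approximation of the condition behind the log message above.
# NOT gluster-blockd's implementation; assumes `targetcli /backstores ls`
# mentions "user:glfs" when tcmu-runner's glfs handler is registered.
check_glfs_handler() {
  if ! systemctl -q is-active tcmu-runner.service; then
    echo "tcmu-runner is not running" >&2
    return 1
  fi
  # grep -q succeeds as soon as any line of the listing contains user:glfs.
  if ! targetcli /backstores ls 2>/dev/null | grep -q 'user:glfs'; then
    echo "tcmu-runner running, but targetcli doesn't list user:glfs handler" >&2
    return 2
  fi
  echo "user:glfs handler present"
}
```

The distinct return codes separate "tcmu-runner is down" (1) from "tcmu-runner is up but the handler is missing" (2), which is the case hit here.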


Steps Performed
====================

1. Created an OCP 3.11.51 + OCS 3.11.1 setup with both glusterfs and glusterfs-registry pods

2. Block device creations were failing due to BZ#1660280

3. Uninstalled both the glusterfs and glusterfs-registry clusters with the wipe option:
ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/uninstall.yml -e "openshift_storage_glusterfs_wipe=True"

4. This setup had some old data in the /etc/target directory on all the gluster nodes, so the contents of /etc/target were deleted.

5. Edited the following template to apply the workaround for BZ#1660280 (as described over email):
/usr/share/ansible/openshift-ansible/playbooks/roles/openshift_storage_glusterfs/files/glusterfs-template.yml

6. Re-installed OCS with both the glusterfs and infra-storage (glusterfs-registry) pods.

7. All pods in the glusterfs namespace came up fine, but one of the 3 glusterfs-registry pods in infra-storage failed to reach the Ready 1/1 state.

The gluster-blockd liveness and readiness probes were failing because the gluster-blockd service had failed to start on the affected pod.

Note: 

1. Checked the tcmu-runner, gluster-block-target and gluster-blockd services. Only the tcmu-runner service was running.
2. Checked lsmod for target_core_user; the module is loaded:

lsmod | grep target_core_user
target_core_user       35043  4 
target_core_mod       342480  12 target_core_iblock,target_core_pscsi,iscsi_target_mod,target_core_file,target_core_user
uio                    19338  1 target_core_user

3. Xiubo checked the gluster node and confirmed that even after the config was cleaned up (uninstall), some entries in /sys/kernel/config/target/core/ still existed.
These leftover entries prevented gluster-blockd from starting on the new install.


[root@dhcp46-153 ~]# ls -ltrh /sys/kernel/config/target/core
total 0
drwxr-xr-x. 3 root root 0 Dec 17 14:53 alua
drwxr-xr-x. 3 root root 0 Dec 20 14:55 user_3   <---- extra files
drwxr-xr-x. 3 root root 0 Dec 20 14:55 user_1   
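
Leftover entries like the ones above could be checked for before a re-install. A minimal sketch, assuming that on a node with no block volumes configured yet, no user_* directories should remain under the LIO configfs core directory; list_stale_user_backstores is a hypothetical name, and the path is a parameter only so the function can be tried against a scratch directory. It only reports entries and does not attempt to remove them:

```shell
#!/usr/bin/env bash
# Hypothetical helper: list user_N (TCMU backstore) entries left under the
# LIO configfs core directory. Reports only; removes nothing.
list_stale_user_backstores() {
  local core_dir="${1:-/sys/kernel/config/target/core}"
  if [ ! -d "$core_dir" ]; then
    echo "no such directory: $core_dir" >&2
    return 1
  fi
  local entry
  for entry in "$core_dir"/user_*; do
    # If nothing matches, the glob stays literal and the -d test skips it.
    if [ -d "$entry" ]; then
      basename "$entry"
    fi
  done
}
```

Called with no argument on the node from this report, it would be expected to print user_1 and user_3.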



More details are shared in the next comment.


How reproducible:
===============
Seen only once, and only for one of the 3 pods in infra-storage.


Actual results:
================
Not all OCS pods came up in the infra-storage (glusterfs-registry) namespace, because one of the gluster pods reported issues with gluster-blockd.

Expected results:
===================

Since the uninstall was done and the config was cleaned up, re-installing OCS should have brought up all the gluster pods with both glusterd and gluster-blockd running.

Comment 9 Prasanna Kumar Kalever 2019-02-07 08:27:38 UTC

As discussed in this comment https://bugzilla.redhat.com/show_bug.cgi?id=1661388#c4, this is not a sane case, so we really don't have to worry about it :-)

Comment 10 vinutha 2019-02-07 09:21:54 UTC

Marking the 'qe-test-coverage' flag as '-' based on https://bugzilla.redhat.com/show_bug.cgi?id=1661388#c9