Bug 1637105

Summary: TASK [openshift_storage_glusterfs : Verify heketi service] fails while trying to scale up OCP node with CNS
Product: OpenShift Container Platform
Component: Installer
Version: 3.10.0
Target Release: 3.10.z
Hardware: Unspecified
OS: Unspecified
Reporter: Sudarshan Chaudhari <suchaudh>
Assignee: Jose A. Rivera <jarrpa>
QA Contact: Prasanth <pprakash>
CC: aos-bugs, bward, hchiramm, jarrpa, jkaur, jokerman, klaas, ksalunkh, mmccomas, sarumuga, suchaudh
Status: CLOSED NOTABUG
Severity: urgent
Priority: urgent
Target Milestone: ---
Type: Bug
Doc Type: No Doc Update
Last Closed: 2019-04-10 14:57:59 UTC

Description Sudarshan Chaudhari 2018-10-08 15:58:14 UTC
Description of problem:

Scale Up task failing with the error:
~~~
TASK [openshift_storage_glusterfs : Verify heketi service] *****************************************************************************
fatal: [x01.example.com]: FAILED! => {"changed": false, "cmd": ["oc", "--config=/tmp/openshift-glusterfs-ansible-ww2zqV/admin.kubeconfig", "rsh", "--namespace=glusterfs", "heketi-storage-1-xvc8b", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "cluster", "list"], "delta": "0:00:00.240447", "end": "2018-10-02 16:47:14.190869", "failed": true, "msg": "non-zero return code", "rc": 255, "start": "2018-10-02 16:47:13.950422", "stderr": "Error: Invalid JWT token: signature is invalid (client and server secrets may not match)\ncommand terminated with exit code 255", "stderr_lines": ["Error: Invalid JWT token: signature is invalid (client and server secrets may not match)", "command terminated with exit code 255"], "stdout": "", "stdout_lines": []}
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/openshift-node/scaleup.retry
~~~
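
The failing check can be reproduced by hand. A minimal sketch, assuming the heketi deploymentconfig is named heketi-storage in the glusterfs namespace (matching the error output above); the pod name and jsonpath expression are illustrative:

~~~
# Read the admin key the running heketi pod was deployed with
oc get dc/heketi-storage -n glusterfs \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="HEKETI_ADMIN_KEY")].value}'

# Re-run the failing command manually with that key; if it still fails,
# the secret in the pod and the one the playbook uses have diverged
oc rsh -n glusterfs heketi-storage-1-xvc8b \
  heketi-cli -s http://localhost:8080 --user admin --secret '<key-from-above>' cluster list
~~~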

Test scenario:
- Replacing an OCP node that has the CNS pod running on it by:
 1. reinstalling RHEL on the host
 2. performing host preparation
 3. cleaning the disk previously used by glusterfs (see the sketch after this list)
 4. performing the scale-up of the node
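
A minimal sketch of the disk-clean step, assuming /dev/sdX is a placeholder for the device glusterfs previously used:

~~~
# Wipe old LVM and filesystem signatures so heketi can reuse the disk;
# /dev/sdX is a placeholder, not a value from this bug
wipefs --all /dev/sdX
~~~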

After the node itself is scaled up, the scale-up playbook initiates the glusterfs playbook, which updates the heketi topology so the new host replaces the old CNS node.

# ansible-playbook -i <inventory> /usr/share/ansible/openshift-ansible/playbooks/openshift-node/scaleup.yml 

However, this playbook fails at the task above with the error shown.

Version-Release number of the following components:
# rpm -q openshift-ansible
openshift-ansible-roles-3.10.47-1.git.0.95bc2d2.el7_5.noarch
openshift-ansible-playbooks-3.10.47-1.git.0.95bc2d2.el7_5.noarch
openshift-ansible-docs-3.10.47-1.git.0.95bc2d2.el7_5.noarch
openshift-ansible-3.10.47-1.git.0.95bc2d2.el7_5.noarch

How reproducible:
Every time we try to replace a CNS node.

Steps to Reproduce:
1. Delete the host.
2. Provision a new host to replace the old node.
3. Perform host preparation.
4. Update the inventory for the new host and run the playbook (see the inventory sketch below).
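
A hedged sketch of the inventory change in step 4; the host name and node group value are placeholders, but scaleup.yml acts on hosts listed under [new_nodes]:

~~~
[OSEv3:children]
masters
nodes
new_nodes

[new_nodes]
x01.example.com openshift_node_group_name='node-config-compute'
~~~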

Actual results:
The playbook fails at the above-mentioned task.

Expected results:
The CNS node should also be replaced.
Additional info:

Similar upstream issues with relative error:
https://github.com/openshift/origin/issues/17947
https://github.com/openshift/openshift-ansible/issues/9068

Comment 17 Klaas Demter 2018-11-15 07:40:39 UTC
I'll try to summarize the issue for public display:
Currently I am unable to replace a failed gluster storage node in CNS/OCS.
While setting HEKETI_ADMIN_KEY manually in the inventory as openshift_storage_glusterfs_heketi_admin_key=NNN (extracting it from the deploymentconfig, for example) makes the scale-up playbook succeed, it does not create the gluster storage on the node, and it does not properly integrate the node with the two remaining (good) nodes.
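
A minimal inventory sketch of that workaround; NNN stands in for the key extracted from the heketi deploymentconfig:

~~~
[OSEv3:vars]
# NNN is a placeholder: reuse the key from the existing heketi-storage DC so
# the playbook does not generate a fresh secret that the server rejects
openshift_storage_glusterfs_heketi_admin_key=NNN
~~~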

This means either my thinking is wrong that I can replace a gluster node with the scale_up playbook, or the scale_up playbook is not doing what it should do. Either fix scale_up or provide an alternative way to replace a gluster node. I am open to ideas; however, not being able to replace a failed node makes OpenShift with gluster totally unusable in production for me.

Comment 19 Klaas Demter 2018-11-16 07:04:57 UTC
This also seems not to have made it here from the support case: the same issue occurs in 3.11. When I ran the last test a couple of weeks ago, I used the latest 3.11.

Comment 25 kedar 2019-01-15 05:47:45 UTC
Hello Team,

Any updates on this issue?

Regards,
Kedar

Comment 28 Jose A. Rivera 2019-04-10 14:57:59 UTC
The attached customer cases are closed, and this bug has been dormant for some time. Closing this BZ.