1637105 – TASK [openshift_storage_glusterfs : Verify heketi service] fails while trying to scale up OCP node with CNS

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	3.10.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	3.10.z
Assignee:	Jose A. Rivera
QA Contact:	Prasanth
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-10-08 15:58 UTC by Sudarshan Chaudhari
Modified:	2019-04-12 17:49 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:	undefined
Clone Of:
Environment:
Last Closed:	2019-04-10 14:57:59 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	3890821	0	Upgrade	None	Error 'Error: Invalid JWT token: signature is invalid (client and server secrets may not match)'	2019-04-12 17:49:37 UTC

Description Sudarshan Chaudhari 2018-10-08 15:58:14 UTC

Description of problem:

The scale-up task fails with the error:
~~~
TASK [openshift_storage_glusterfs : Verify heketi service] *****************************************************************************
fatal: [x01.example.com]: FAILED! => {"changed": false, "cmd": ["oc", "--config=/tmp/openshift-glusterfs-ansible-ww2zqV/admin.kubeconfig", "rsh", "--namespace=glusterfs", "heketi-storage-1-xvc8b", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "cluster", "list"], "delta": "0:00:00.240447", "end": "2018-10-02 16:47:14.190869", "failed": true, "msg": "non-zero return code", "rc": 255, "start": "2018-10-02 16:47:13.950422", "stderr": "Error: Invalid JWT token: signature is invalid (client and server secrets may not match)\ncommand terminated with exit code 255", "stderr_lines": ["Error: Invalid JWT token: signature is invalid (client and server secrets may not match)", "command terminated with exit code 255"], "stdout": "", "stdout_lines": []}
to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/openshift-node/scaleup.retry
~~~
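The failing command in the log above passes `--user admin` but no `--secret`. A minimal sketch for re-running the same check by hand with an explicit key follows. The function name `verify_heketi` and its arguments are hypothetical; the namespace, server URL, and subcommand are copied from the log, and the key is assumed to be the admin key the running heketi server was deployed with.

```shell
# Hypothetical helper: repeat the "Verify heketi service" check manually,
# supplying the admin secret explicitly.
#   $1 = heketi pod name (for example heketi-storage-1-xvc8b from the log)
#   $2 = heketi admin key (assumption: the key the server was deployed with)
verify_heketi() {
  pod=$1
  key=$2
  # Note: the key is visible in the process list while the command runs.
  oc rsh --namespace=glusterfs "$pod" \
    heketi-cli -s http://localhost:8080 --user admin --secret "$key" cluster list
}
```

If this succeeds while the playbook task fails, the key known to the playbook run presumably differs from the one the server is using.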

Test scenario:
- Replace the OCP node that has the CNS pod running on it by:
1. reformatting the RHEL host
2. performing host preparation
3. cleaning the disk
4. performing the scale-up of the node.

After the node scale-up, the scale-up playbook initiates the gluster playbook, which updates the heketi topology to replace the old CNS node with the new OCP node.

# ansible-playbook -i <inventory> /usr/share/ansible/openshift-ansible/playbooks/openshift-node/scaleup.yml

However, this playbook fails at the task shown above with that error.

Version-Release number of the following components:
# rpm -q openshift-ansible
openshift-ansible-roles-3.10.47-1.git.0.95bc2d2.el7_5.noarch
openshift-ansible-playbooks-3.10.47-1.git.0.95bc2d2.el7_5.noarch
openshift-ansible-docs-3.10.47-1.git.0.95bc2d2.el7_5.noarch
openshift-ansible-3.10.47-1.git.0.95bc2d2.el7_5.noarch

How reproducible:
Every time we try to replace the CNS node.

Steps to Reproduce:
1. Delete the host.
2. Get a new host for the old node.
3. Perform host preparation.
4. Update the inventory for the new host and run the playbook.

Actual results:
The playbook fails at the above-mentioned task.

Expected results:
The CNS node should also have been replaced.

Additional info:

Similar upstream issues with a related error:
https://github.com/openshift/origin/issues/17947
https://github.com/openshift/openshift-ansible/issues/9068

Comment 17 Klaas Demter 2018-11-15 07:40:39 UTC

I'll try to summarize the issue for public display:
Currently I am unable to replace a failed gluster storage node in CNS/OCS.
Setting HEKETI_ADMIN_KEY manually in the inventory as openshift_storage_glusterfs_heketi_admin_key=NNN (extracting it from a deploymentconfig, for example) makes the scale-up playbook succeed, but it does not create the gluster storage on the node and it does not properly integrate the node with the 2 remaining (good) nodes.
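A minimal sketch of the workaround described above, turning the key found in the deploymentconfig into the inventory line. The function name `heketi_key_to_inventory` is hypothetical, and the sketch assumes the key is stored as a plain `HEKETI_ADMIN_KEY=value` environment entry and that the deploymentconfig is named `heketi-storage` in the `glusterfs` namespace (inferred from the pod name in the error, not confirmed in this report).

```shell
# Hypothetical helper: reads "NAME=value" environment lines on stdin and
# prints the matching inventory variable line.
heketi_key_to_inventory() {
  # Keep only the value of the first HEKETI_ADMIN_KEY entry
  # (everything after the first "=").
  key=$(sed -n 's/^HEKETI_ADMIN_KEY=//p' | head -n 1)
  if [ -z "$key" ]; then
    echo "HEKETI_ADMIN_KEY not found in input" >&2
    return 1
  fi
  printf 'openshift_storage_glusterfs_heketi_admin_key=%s\n' "$key"
}
```

Under those assumptions it could be fed from `oc set env dc/heketi-storage --list -n glusterfs`, with the printed line then added to the inventory before running scaleup.yml. As noted in this comment, this only gets the playbook past the failing task; it does not make the new node join the gluster storage.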

This means either I am wrong in thinking that I can replace a gluster node with the scale_up playbook, or the scale_up playbook is not doing what it should do. Either fix scale_up or provide me with an alternative way to replace a gluster node. I am open to ideas; however, not being able to replace a failed node makes OpenShift with gluster totally unusable in production for me.

Comment 19 Klaas Demter 2018-11-16 07:04:57 UTC

This does not seem to have made it here from the case: the same issue occurs in 3.11. When I ran the last test a couple of weeks ago, I used the latest 3.11.

Comment 25 kedar 2019-01-15 05:47:45 UTC

Hello Team,

Are there any updates on this issue?

Regards,
Kedar

Comment 28 Jose A. Rivera 2019-04-10 14:57:59 UTC

The attached customer cases are closed, and the bug has been dormant for some time. Closing this BZ.
