Bug 1637105

Summary: TASK [openshift_storage_glusterfs : Verify heketi service] fails while trying to scale up OCP node with CNS
Product: OpenShift Container Platform
Component: Installer
Version: 3.10.0
Target Release: 3.10.z
Hardware: Unspecified
OS: Unspecified
Reporter: Sudarshan Chaudhari <suchaudh>
Assignee: Jose A. Rivera <jarrpa>
QA Contact: Prasanth <pprakash>
CC: aos-bugs, bward, hchiramm, jarrpa, jkaur, jokerman, klaas, ksalunkh, mmccomas, sarumuga, suchaudh
Status: CLOSED NOTABUG
Severity: urgent
Priority: urgent
Target Milestone: ---
Type: Bug
Doc Type: No Doc Update
Last Closed: 2019-04-10 14:57:59 UTC

Description Sudarshan Chaudhari 2018-10-08 15:58:14 UTC
Description of problem:

Scale Up task failing with the error:
~~~
TASK [openshift_storage_glusterfs : Verify heketi service] *****************************************************************************
fatal: [x01.example.com]: FAILED! => {"changed": false, "cmd": ["oc", "--config=/tmp/openshift-glusterfs-ansible-ww2zqV/admin.kubeconfig", "rsh", "--namespace=glusterfs", "heketi-storage-1-xvc8b", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "cluster", "list"], "delta": "0:00:00.240447", "end": "2018-10-02 16:47:14.190869", "failed": true, "msg": "non-zero return code", "rc": 255, "start": "2018-10-02 16:47:13.950422", "stderr": "Error: Invalid JWT token: signature is invalid (client and server secrets may not match)\ncommand terminated with exit code 255", "stderr_lines": ["Error: Invalid JWT token: signature is invalid (client and server secrets may not match)", "command terminated with exit code 255"], "stdout": "", "stdout_lines": []}
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/openshift-node/scaleup.retry
~~~
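
The failing check can be reproduced by hand. A minimal sketch, assuming the heketi deploymentconfig is named heketi-storage in the glusterfs namespace (matching the error output above); the pod name and jsonpath expression are illustrative:

~~~
# Read the admin key the running heketi pod was deployed with
oc get dc/heketi-storage -n glusterfs \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="HEKETI_ADMIN_KEY")].value}'

# Re-run the failing command manually with that key; if it still fails,
# the secret in the pod and the one the playbook uses have diverged
oc rsh -n glusterfs heketi-storage-1-xvc8b \
  heketi-cli -s http://localhost:8080 --user admin --secret '<key-from-above>' cluster list
~~~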

Test scenario:
- Replacing an OCP node that has the CNS pod running on it by:
 1. reinstalling RHEL on the host
 2. performing host preparation
 3. cleaning the disk previously used by glusterfs (see the sketch after this list)
 4. performing the scale-up of the node
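
A minimal sketch of the disk-clean step, assuming /dev/sdX is a placeholder for the device glusterfs previously used:

~~~
# Wipe old LVM and filesystem signatures so heketi can reuse the disk;
# /dev/sdX is a placeholder, not a value from this bug
wipefs --all /dev/sdX
~~~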

After the node itself is scaled up, the scale-up playbook initiates the glusterfs playbook, which updates the heketi topology so the new host replaces the old CNS node.

# ansible-playbook -i <inventory> /usr/share/ansible/openshift-ansible/playbooks/openshift-node/scaleup.yml 

However, this playbook fails at the task above with the error shown.

Version-Release number of the following components:
# rpm -q openshift-ansible
openshift-ansible-roles-3.10.47-1.git.0.95bc2d2.el7_5.noarch
openshift-ansible-playbooks-3.10.47-1.git.0.95bc2d2.el7_5.noarch
openshift-ansible-docs-3.10.47-1.git.0.95bc2d2.el7_5.noarch
openshift-ansible-3.10.47-1.git.0.95bc2d2.el7_5.noarch

How reproducible:
Every time we try to replace a CNS node.

Steps to Reproduce:
1. Delete the host.
2. Provision a new host to replace the old node.
3. Perform host preparation.
4. Update the inventory for the new host and run the playbook (see the inventory sketch below).
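
A hedged sketch of the inventory change in step 4; the host name and node group value are placeholders, but scaleup.yml acts on hosts listed under [new_nodes]:

~~~
[OSEv3:children]
masters
nodes
new_nodes

[new_nodes]
x01.example.com openshift_node_group_name='node-config-compute'
~~~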

Actual results:
The playbook fails at the above-mentioned task.

Expected results:
The CNS node should also be replaced.
Additional info:

Similar upstream issues with relative error:
https://github.com/openshift/origin/issues/17947
https://github.com/openshift/openshift-ansible/issues/9068

Comment 17 Klaas Demter 2018-11-15 07:40:39 UTC
I'll try to summarize the issue for public display:
Currently I am unable to replace a failed gluster storage node in CNS/OCS.
While setting HEKETI_ADMIN_KEY manually in the inventory as openshift_storage_glusterfs_heketi_admin_key=NNN (extracting it from the deploymentconfig, for example) makes the scale-up playbook succeed, it does not create the gluster storage on the node, and it does not properly integrate the node with the two remaining (good) nodes.
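
A minimal inventory sketch of that workaround; NNN stands in for the key extracted from the heketi deploymentconfig:

~~~
[OSEv3:vars]
# NNN is a placeholder: reuse the key from the existing heketi-storage DC so
# the playbook does not generate a fresh secret that the server rejects
openshift_storage_glusterfs_heketi_admin_key=NNN
~~~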

This means either my thinking is wrong that I can replace a gluster node with the scale_up playbook, or the scale_up playbook is not doing what it should do. Either fix scale_up or provide an alternative way to replace a gluster node. I am open to ideas; however, not being able to replace a failed node makes OpenShift with gluster totally unusable in production for me.

Comment 19 Klaas Demter 2018-11-16 07:04:57 UTC
This also seems not to have made it here from the support case: the same issue occurs in 3.11. When I ran the last test a couple of weeks ago, I used the latest 3.11.

Comment 25 kedar 2019-01-15 05:47:45 UTC
Hello Team,

Any updates on this issue?

Regards,
Kedar

Comment 28 Jose A. Rivera 2019-04-10 14:57:59 UTC
The attached customer cases are closed, and this bug has been dormant for some time. Closing this BZ.