Description of problem:
The scale-up task fails with the following error:
~~~
TASK [openshift_storage_glusterfs : Verify heketi service] *****************************************************************************
fatal: [x01.example.com]: FAILED! => {"changed": false, "cmd": ["oc", "--config=/tmp/openshift-glusterfs-ansible-ww2zqV/admin.kubeconfig", "rsh", "--namespace=glusterfs", "heketi-storage-1-xvc8b", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "cluster", "list"], "delta": "0:00:00.240447", "end": "2018-10-02 16:47:14.190869", "failed": true, "msg": "non-zero return code", "rc": 255, "start": "2018-10-02 16:47:13.950422", "stderr": "Error: Invalid JWT token: signature is invalid (client and server secrets may not match)\ncommand terminated with exit code 255", "stderr_lines": ["Error: Invalid JWT token: signature is invalid (client and server secrets may not match)", "command terminated with exit code 255"], "stdout": "", "stdout_lines": []}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/openshift-node/scaleup.retry
~~~

Test scenario:
Replacing the OCP node that has the CNS pod running on it by:
1. Format the RHEL host.
2. Perform host preparation.
3. Clean the disk.
4. Perform the node scale-up.

After scaling up the node, the scale-up playbook initiates the gluster playbook, which updates the heketi topology to replace the OCP node as a CNS node:
~~~
# ansible-playbook -i <inventory> /usr/share/ansible/openshift-ansible/playbooks/openshift-node/scaleup.yml
~~~
This playbook fails at the task above with the error shown.

Version-Release number of the following components:
~~~
# rpm -q openshift-ansible
openshift-ansible-roles-3.10.47-1.git.0.95bc2d2.el7_5.noarch
openshift-ansible-playbooks-3.10.47-1.git.0.95bc2d2.el7_5.noarch
openshift-ansible-docs-3.10.47-1.git.0.95bc2d2.el7_5.noarch
openshift-ansible-3.10.47-1.git.0.95bc2d2.el7_5.noarch
~~~

How reproducible:
Every time we try to replace the CNS node.

Steps to Reproduce:
1. Delete the host.
2. Get a new host to replace the old node.
3. Perform host preparation.
4. Update the inventory for the new host (see the inventory sketch at the end of this comment) and run the playbook.

Actual results:
The playbook fails at the above-mentioned task.

Expected results:
The CNS node should also have been replaced.

Additional info:
Similar upstream issues with a related error:
https://github.com/openshift/origin/issues/17947
https://github.com/openshift/openshift-ansible/issues/9068
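For completeness, here is a rough sketch of the inventory fragment used for the scale-up in step 4. The host name and device path are placeholders, and the group and variable names follow the usual openshift-ansible 3.10 CNS conventions; your inventory may differ:
~~~
[OSEv3:children]
masters
nodes
new_nodes
glusterfs

# replacement host being scaled up (placeholder name)
[new_nodes]
x04.example.com openshift_node_group_name='node-config-compute'

# same host listed as a CNS/gluster node with its raw device (placeholder path)
[glusterfs]
x04.example.com glusterfs_devices='[ "/dev/sdb" ]'
~~~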
I'll try to summarize the issue for public display: currently I am unable to replace a failed gluster storage node in CNS/OCS. Setting HEKETI_ADMIN_KEY manually in the inventory as openshift_storage_glusterfs_heketi_admin_key=NNN (extracting it from the deploymentconfig, for example) makes the scale-up playbook succeed, but it does not create the gluster storage on the node and does not properly integrate it with the two remaining (good) nodes. This means either my assumption that I can replace a gluster node with the scale-up playbook is wrong, or the scale-up playbook is not doing what it should.

--> Either fix scale-up or provide me with an alternative way to replace a gluster node. I am open to ideas; however, not being able to replace a failed node makes OpenShift with Gluster totally unusable in production for me.
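For reference, this is roughly how the key can be pulled out of the running heketi deploymentconfig before setting it in the inventory. This is only a sketch: the dc name and namespace are taken from the error output above, and the key may instead be wired in via a secretKeyRef, in which case you would decode the referenced secret instead:
~~~
# read the admin key from the heketi DC environment (dc/namespace as in the
# error above; adjust to your deployment)
oc get dc/heketi-storage -n glusterfs \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="HEKETI_ADMIN_KEY")].value}'

# sanity-check that the key matches the server before re-running the playbook
# (same command the "Verify heketi service" task runs, plus --secret)
oc rsh --namespace=glusterfs heketi-storage-1-xvc8b \
  heketi-cli -s http://localhost:8080 --user admin --secret '<key from above>' cluster list

# then set it in the inventory:
# openshift_storage_glusterfs_heketi_admin_key=<key from above>
~~~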
And this seems not to have made it here from the case: the same issue occurs in 3.11. When I ran the last test a couple of weeks ago, I used the latest 3.11.
Hello Team,

Any updates on this issue?

Regards,
Kedar
The attached customer cases are closed, and this bug has been dormant for some time. Closing this BZ.