Description of problem:

Director deployed OCP 3.11: replacing an infra node fails during TASK [openshift_storage_glusterfs : Verify heketi service]:

TASK [openshift_storage_glusterfs : Verify heketi service] *********************
fatal: [openshift-master-2]: FAILED! => {"changed": false, "cmd": ["oc", "--config=/tmp/openshift-glusterfs-ansible-b18knu/admin.kubeconfig", "rsh", "--namespace=default", "heketi-registry-1-mjsg9", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "cluster", "list"], "delta": "0:00:00.434645", "end": "2018-11-02 07:08:09.487683", "msg": "non-zero return code", "rc": 255, "start": "2018-11-02 07:08:09.053038", "stderr": "Error: Invalid JWT token: signature is invalid (client and server secrets may not match)\ncommand terminated with exit code 255", "stderr_lines": ["Error: Invalid JWT token: signature is invalid (client and server secrets may not match)", "command terminated with exit code 255"], "stdout": "", "stdout_lines": []}

PLAY RECAP *********************************************************************
localhost            : ok=36   changed=0   unreachable=0   failed=0
openshift-infra-1    : ok=41   changed=6   unreachable=0   failed=0
openshift-infra-2    : ok=41   changed=6   unreachable=0   failed=0
openshift-infra-3    : ok=219  changed=74  unreachable=0   failed=0
openshift-master-0   : ok=81   changed=6   unreachable=0   failed=0
openshift-master-1   : ok=81   changed=6   unreachable=0   failed=0
openshift-master-2   : ok=138  changed=7   unreachable=0   failed=1
openshift-worker-0   : ok=41   changed=6   unreachable=0   failed=0
openshift-worker-1   : ok=41   changed=6   unreachable=0   failed=0
openshift-worker-2   : ok=42   changed=6   unreachable=0   failed=0

INSTALLER STATUS ***************************************************************
Initialization               : Complete (0:01:55)
Node Bootstrap Preparation   : Complete (0:03:58)
Node Join                    : Complete (0:00:17)

Failure summary:

  1. Hosts:    openshift-master-2
     Play:     Reload glusterfs topology
     Task:     Verify heketi service
     Message:  non-zero return code

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-9.0.1-0.20181013060867.ffbe879.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP overcloud with 3 masters + 3 infra nodes + 3 worker nodes with CNS enabled
2. Remove the openshift-infra-0 node
3. Re-run the overcloud deploy to add an openshift-infra-3 node

Actual results:
Fails during TASK [openshift_storage_glusterfs : Verify heketi service]

Expected results:
No failure.

Additional info:
Attaching /var/lib/mistral
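The error means heketi-cli signed its request with a JWT whose key does not match the admin key the heketi server is running with, so the server rejects it. A minimal diagnostic sketch, assuming the secret, namespace, and pod names seen in this report (the `--secret` flag passes the admin key to heketi-cli explicitly; the cluster-dependent commands are shown as comments):

```shell
# Read the admin key the running heketi server actually uses, then retry
# the failing call with that key passed explicitly (names assumed from
# this environment):
#
#   KEY=$(sudo oc get secret heketi-storage-admin-secret --namespace glusterfs \
#         -o json | jq -r .data.key | base64 -d)
#   oc rsh --namespace=default heketi-registry-1-mjsg9 \
#     heketi-cli -s http://localhost:8080 --user admin --secret "$KEY" cluster list
#
# The key is stored base64-encoded in the secret; the decode step alone,
# shown with a placeholder value:
encoded=$(printf 'example-admin-key' | base64)
printf '%s' "$encoded" | base64 -d
```

If the call succeeds with the decoded key but fails without it, the playbook side is holding a stale or freshly generated key rather than the server's.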
Note: https://bugzilla.redhat.com/show_bug.cgi?id=1640382#c7 was applied as a workaround on the env.
(In reply to Marius Cornea from comment #2)
> Note: https://bugzilla.redhat.com/show_bug.cgi?id=1640382#c7 was applied as
> a workaround on the env.

For the infra node scale up, you would have to set the openshift-ansible openshift_storage_glusterfs_registry_heketi_admin_key variable instead.

Now, I'm less sure about how to retrieve the heketi secret... According to the code, I would expect the secret to be named heketi-registry-admin-secret, but all I can see in my environment is a heketi-storage-admin-secret secret. Possibly the storage and registry share the same secret?

Anyway, here is how to retrieve the heketi-storage-admin-secret secret:

sudo oc get secret heketi-storage-admin-secret --namespace glusterfs -o json | jq -r .data.key | base64 -d
Scaling out the infra node indeed succeeds if I set `openshift_storage_glusterfs_registry_heketi_admin_key` to the output of:

sudo oc get secret heketi-storage-admin-secret --namespace glusterfs -o json | jq -r .data.key | base64 -d
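For reference, openshift-ansible variables like this one are normally set in the `[OSEv3:vars]` section of the inventory; a minimal sketch, where the placeholder value stands for the decoded output of the `oc get secret` command above:

```ini
[OSEv3:vars]
# Reuse the admin key of the already-running heketi so the scale-up
# playbook can authenticate against it instead of generating a new key.
# <decoded-key> is the output of:
#   oc get secret heketi-storage-admin-secret --namespace glusterfs \
#     -o json | jq -r .data.key | base64 -d
openshift_storage_glusterfs_registry_heketi_admin_key=<decoded-key>
```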
Submitted a partial fix in openshift-ansible: https://github.com/openshift/openshift-ansible/pull/10710

However, there is an issue with the name of the registry heketi secret (https://github.com/openshift/openshift-ansible/issues/10712), so the above patch does not completely fix the issue.
Removing blocker flag because we have a workaround: https://bugzilla.redhat.com/show_bug.cgi?id=1640382#c7
Workaround for this is actually in comment 4: https://bugzilla.redhat.com/show_bug.cgi?id=1645656#c4
Proposed a fix to openshift-ansible: https://github.com/openshift/openshift-ansible/pull/11072
Fix included in openshift-ansible-3.11.74-1.
The ose-ansible container image was updated to v3.11.82-5 on the registry and should have the fix. https://access.redhat.com/containers/?tab=tags#/registry.access.redhat.com/openshift3/ose-ansible
TASK [openshift_storage_glusterfs : Verify heketi service] *********************
ok: [openshift-master-2]

TASK [openshift_storage_glusterfs : Wait for GlusterFS pods] *******************
FAILED - RETRYING: Wait for GlusterFS pods (30 retries left).
FAILED - RETRYING: Wait for GlusterFS pods (29 retries left).
FAILED - RETRYING: Wait for GlusterFS pods (28 retries left).
FAILED - RETRYING: Wait for GlusterFS pods (27 retries left).
FAILED - RETRYING: Wait for GlusterFS pods (26 retries left).
ok: [openshift-master-2]
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1605