Bug 1645656
| Summary: | Director deployed OCP 3.11: replacing an Infra node fails during TASK [openshift_storage_glusterfs : Verify heketi service] | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Marius Cornea <mcornea> |
| Component: | Installer | Assignee: | Martin André <m.andre> |
| Installer sub component: | openshift-ansible | QA Contact: | Johnny Liu <jialiu> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | ||
| Priority: | urgent | CC: | dbecker, jtrowbri, ltomasbo, m.andre, mburns, morazi |
| Version: | 3.11.0 | Keywords: | ZStream |
| Target Milestone: | --- | ||
| Target Release: | 3.11.z | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | openshift-ansible-3.11.74-1.git.0.cde4c69.el7 | Doc Type: | No Doc Update |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2019-06-26 09:07:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Note: https://bugzilla.redhat.com/show_bug.cgi?id=1640382#c7 was applied as a workaround on the env.

(In reply to Marius Cornea from comment #2)
> Note: https://bugzilla.redhat.com/show_bug.cgi?id=1640382#c7 was applied as
> a workaround on the env.

For the infra node scale up, you would have to set the openshift-ansible openshift_storage_glusterfs_registry_heketi_admin_key variable instead.

Now, I'm less sure about how to retrieve the heketi secret... According to the code, I would expect the secret to be named heketi-registry-admin-secret but all I can see in my environment is a heketi-storage-admin-secret secret. Possibly, the storage and registry share the same secret?

Anyway, here is how to retrieve the heketi-storage-admin-secret secret:

    sudo oc get secret heketi-storage-admin-secret --namespace glusterfs -o json | jq -r .data.key | base64 -d

It indeed succeeds scaling out infra node if I set `openshift_storage_glusterfs_registry_heketi_admin_key` to the output of:

    sudo oc get secret heketi-storage-admin-secret --namespace glusterfs -o json | jq -r .data.key | base64 -d

Submitted a partial fix in openshift-ansible: https://github.com/openshift/openshift-ansible/pull/10710

However, there is an issue with the name of the registry heketi secret (https://github.com/openshift/openshift-ansible/issues/10712) so the above patch does not completely fix the issue.

Removing blocker flag because we have a workaround. https://bugzilla.redhat.com/show_bug.cgi?id=1640382#c7

Workaround for this is actually in comment 4: https://bugzilla.redhat.com/show_bug.cgi?id=1645656#c4

Proposed a fix to openshift-ansible: https://github.com/openshift/openshift-ansible/pull/11072

Fix included in openshift-ansible-3.11.74-1.

The ose-ansible container image was updated to v3.11.82-5 on the registry and should have the fix.
https://access.redhat.com/containers/?tab=tags#/registry.access.redhat.com/openshift3/ose-ansible

    TASK [openshift_storage_glusterfs : Verify heketi service] *********************
    ok: [openshift-master-2]

    TASK [openshift_storage_glusterfs : Wait for GlusterFS pods] *******************
    FAILED - RETRYING: Wait for GlusterFS pods (30 retries left).
    FAILED - RETRYING: Wait for GlusterFS pods (29 retries left).
    FAILED - RETRYING: Wait for GlusterFS pods (28 retries left).
    FAILED - RETRYING: Wait for GlusterFS pods (27 retries left).
    FAILED - RETRYING: Wait for GlusterFS pods (26 retries left).
    ok: [openshift-master-2]

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1605
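The workaround from the comments above can be wrapped in a small shell helper. This is only a sketch: the function name heketi_admin_key and the OC override variable are made up for illustration, and it uses oc's `-o jsonpath` output in place of the jq pipeline quoted in the comments. It assumes the secret name and namespace seen in this environment (heketi-storage-admin-secret in glusterfs), which may differ elsewhere.

```shell
# Client command; the bug comments run it as "sudo oc".
# OC is a hypothetical override hook, not an oc or openshift-ansible setting.
OC="${OC:-sudo oc}"

# heketi_admin_key [SECRET] [NAMESPACE]
# Prints the decoded heketi admin key stored under .data.key of the secret.
# Defaults are the names observed in this bug, not guaranteed elsewhere.
heketi_admin_key() {
  # $OC is left unquoted on purpose so "sudo oc" splits into two words.
  $OC get secret "${1:-heketi-storage-admin-secret}" \
    --namespace "${2:-glusterfs}" \
    -o jsonpath='{.data.key}' | base64 -d
}

# Example (needs a logged-in client with access to the namespace):
#   heketi_admin_key
#   heketi_admin_key heketi-storage-admin-secret glusterfs
```

The printed value is what `openshift_storage_glusterfs_registry_heketi_admin_key` would be set to before re-running the infra node scale up.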
Description of problem:

Director deployed OCP 3.11: replacing an Infra node fails during TASK [openshift_storage_glusterfs : Verify heketi service]:

    TASK [openshift_storage_glusterfs : Verify heketi service] *********************
    fatal: [openshift-master-2]: FAILED! => {"changed": false, "cmd": ["oc", "--config=/tmp/openshift-glusterfs-ansible-b18knu/admin.kubeconfig", "rsh", "--namespace=default", "heketi-registry-1-mjsg9", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "cluster", "list"], "delta": "0:00:00.434645", "end": "2018-11-02 07:08:09.487683", "msg": "non-zero return code", "rc": 255, "start": "2018-11-02 07:08:09.053038", "stderr": "Error: Invalid JWT token: signature is invalid (client and server secrets may not match)\ncommand terminated with exit code 255", "stderr_lines": ["Error: Invalid JWT token: signature is invalid (client and server secrets may not match)", "command terminated with exit code 255"], "stdout": "", "stdout_lines": []}

    PLAY RECAP *********************************************************************
    localhost                  : ok=36   changed=0    unreachable=0    failed=0
    openshift-infra-1          : ok=41   changed=6    unreachable=0    failed=0
    openshift-infra-2          : ok=41   changed=6    unreachable=0    failed=0
    openshift-infra-3          : ok=219  changed=74   unreachable=0    failed=0
    openshift-master-0         : ok=81   changed=6    unreachable=0    failed=0
    openshift-master-1         : ok=81   changed=6    unreachable=0    failed=0
    openshift-master-2         : ok=138  changed=7    unreachable=0    failed=1
    openshift-worker-0         : ok=41   changed=6    unreachable=0    failed=0
    openshift-worker-1         : ok=41   changed=6    unreachable=0    failed=0
    openshift-worker-2         : ok=42   changed=6    unreachable=0    failed=0

    INSTALLER STATUS ***************************************************************
    Initialization              : Complete (0:01:55)
    Node Bootstrap Preparation  : Complete (0:03:58)
    Node Join                   : Complete (0:00:17)

    Failure summary:

      1. Hosts:    openshift-master-2
         Play:     Reload glusterfs topology
         Task:     Verify heketi service
         Message:  non-zero return code

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-9.0.1-0.20181013060867.ffbe879.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP overcloud with 3 masters + 3 infra nodes + 3 worker nodes with CNS enabled
2. Remove openshift-infra-0 node
3. Re-run overcloud deploy to add openshift-infra-3 node

Actual results:
Fails during TASK [openshift_storage_glusterfs : Verify heketi service]

Expected results:
No failure.

Additional info:
Attaching /var/lib/mistral
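To check by hand whether a candidate admin key matches what the heketi server expects, the failing command from the task output can be re-run with the key passed explicitly. A rough sketch, assuming heketi-cli's `--secret` option and a pod name and namespace like the ones in the log above (heketi-registry-1-mjsg9 in default). verify_heketi_key and the OC variable are hypothetical names, and the `--config` argument from the original command is dropped on the assumption that the client is already logged in.

```shell
# Client command; OC is a hypothetical override hook for illustration.
OC="${OC:-sudo oc}"

# verify_heketi_key POD NAMESPACE KEY
# Runs "heketi-cli cluster list" inside the heketi pod with the given admin key.
# A mismatched key is expected to fail with the same "Invalid JWT token" error
# shown in the task output.
verify_heketi_key() {
  # $OC is left unquoted on purpose so "sudo oc" splits into two words.
  $OC rsh --namespace="$2" "$1" \
    heketi-cli -s http://localhost:8080 --user admin --secret "$3" cluster list
}

# Example (pod name taken from the log in this report; yours will differ):
#   verify_heketi_key heketi-registry-1-mjsg9 default "$ADMIN_KEY"
```

If the command lists the cluster with a given key, that key is a reasonable candidate for the admin key variable used during scale up.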