Description of problem:
Director deployed OCP: replacing a worker node fails during TASK [openshift_storage_glusterfs : Verify heketi service]:

TASK [openshift_storage_glusterfs : Verify heketi service] *********************
fatal: [openshift-master-2]: FAILED! => {"changed": false, "cmd": ["oc", "--config=/tmp/openshift-glusterfs-ansible-1W8rUl/admin.kubeconfig", "rsh", "--namespace=glusterfs", "heketi-storage-1-2lx5b", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "cluster", "list"], "delta": "0:00:00.397903", "end": "2018-10-17 18:48:13.396208", "msg": "non-zero return code", "rc": 255, "start": "2018-10-17 18:48:12.998305", "stderr": "Error: Invalid JWT token: signature is invalid (client and server secrets may not match)\ncommand terminated with exit code 255", "stderr_lines": ["Error: Invalid JWT token: signature is invalid (client and server secrets may not match)", "command terminated with exit code 255"], "stdout": "", "stdout_lines": []}

PLAY RECAP *********************************************************************
localhost           : ok=39   changed=0   unreachable=0   failed=0
openshift-infra-0   : ok=206  changed=23  unreachable=0   failed=0
openshift-infra-1   : ok=206  changed=23  unreachable=0   failed=0
openshift-infra-2   : ok=208  changed=23  unreachable=0   failed=0
openshift-master-0  : ok=326  changed=38  unreachable=0   failed=0
openshift-master-1  : ok=326  changed=38  unreachable=0   failed=0
openshift-master-2  : ok=531  changed=83  unreachable=0   failed=1
openshift-worker-1  : ok=206  changed=23  unreachable=0   failed=0
openshift-worker-2  : ok=207  changed=23  unreachable=0   failed=0
openshift-worker-3  : ok=216  changed=73  unreachable=0   failed=0

INSTALLER STATUS ***************************************************************
Initialization             : Complete (0:03:47)
Node Bootstrap Preparation : Complete (0:06:34)
Master Install             : Complete (0:07:29)
Node Join                  : Complete (0:00:39)
Load Balancer Install      : Complete (0:00:01)

Failure summary:

1. Hosts:    openshift-master-2
   Play:     Reload glusterfs topology
   Task:     Verify heketi service
   Message:  non-zero return code

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-9.0.0-0.20181001174822.90afd18.0rc2.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy an OpenShift overcloud with 3 masters + 3 workers + 3 infra nodes
2. Delete 1 worker node
3. Re-run the deploy to re-provision the node deleted in step 2

Actual results:
openshift-ansible fails with:

TASK [openshift_storage_glusterfs : Verify heketi service] *********************
fatal: [openshift-master-2]: FAILED! => {"changed": false, "cmd": ["oc", "--config=/tmp/openshift-glusterfs-ansible-1W8rUl/admin.kubeconfig", "rsh", "--namespace=glusterfs", "heketi-storage-1-2lx5b", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "cluster", "list"], "delta": "0:00:00.397903", "end": "2018-10-17 18:48:13.396208", "msg": "non-zero return code", "rc": 255, "start": "2018-10-17 18:48:12.998305", "stderr": "Error: Invalid JWT token: signature is invalid (client and server secrets may not match)\ncommand terminated with exit code 255", "stderr_lines": ["Error: Invalid JWT token: signature is invalid (client and server secrets may not match)", "command terminated with exit code 255"], "stdout": "", "stdout_lines": []}

Expected results:
No failure.

Additional info:
The same error shows up when re-running the overcloud deploy a second time without any changes.
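For reference, a rough sketch of the reproduction flow as run from the undercloud; the stack name, node UUID, and environment files below are placeholders, not taken from this report:

  # Step 2: delete one worker node from the existing overcloud stack
  # (the node UUID is a placeholder).
  openstack overcloud node delete --stack overcloud <worker-node-uuid>

  # Step 3: re-run the deploy with the same templates and environment files
  # used for the initial deployment to re-provision the worker.
  openstack overcloud deploy --templates \
      -e <same-environment-files-as-initial-deploy>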
Created attachment 1494991 [details]
openshift.tar.gz

Attaching /var/lib/mistral/openshift
Possibly fixed with https://review.openstack.org/#/c/611306/? New node detection was broken downstream due to different service naming; that patch fixed it.
Just remembered that node replacement is targeted at OSP15 (https://bugzilla.redhat.com/show_bug.cgi?id=1591288); removing the triage information so that the bug shows up in our triage meeting.
Looks awfully similar to https://bugzilla.redhat.com/show_bug.cgi?id=1637105.

Maybe one element of a response:

"By default, the GlusterFS playbooks will auto-generate a new heketi secret key for each run. You need to extract the key from the heketi config secret and set it as the value for "openshift_storage_glusterfs_heketi_admin_key" in your inventory file. That will reuse the existing key in your cluster when running the scaleup playbook."
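For illustration, a minimal sketch of what that inventory setting could look like; the group name and the placeholder key value are assumptions, not taken from this report:

  # openshift-ansible inventory fragment (hypothetical values): reuse the
  # existing heketi admin key instead of letting the playbooks generate a
  # new one on each run.
  [OSEv3:vars]
  openshift_storage_glusterfs_heketi_admin_key=<existing-heketi-admin-key>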
(In reply to Martin André from comment #5)
> Looks awfully similar to https://bugzilla.redhat.com/show_bug.cgi?id=1637105.
>
> Maybe one element of a response:
>
> "By default, the GlusterFS playbooks will auto-generate a new heketi secret
> key for each run. You need to extract the key from the heketi config secret
> and set it as the value for "openshift_storage_glusterfs_heketi_admin_key"
> in your inventory file. That will reuse the existing key in your cluster
> when running the scaleup playbook."

Yes, it looks like the same issue, but IMO we can't consider this a viable solution to the problem because it involves manual actions by the operator on the overcloud nodes. Can we do these steps automatically on the TripleO side?
I can confirm that the scale-up runs to completion if I set the openshift_storage_glusterfs_heketi_admin_key variable to the existing heketi secret.

Here is a one-liner to get the heketi secret in an environment where openshift_storage_glusterfs_namespace=glusterfs (the default):

  sudo oc get secret heketi-storage-admin-secret --namespace glusterfs -o json | jq -r .data.key | base64 -d

I've chatted a bit with Jose A. Rivera about the issue and this looks like a regression in openshift-ansible. Our options are:

1) Fix the issue in openshift-ansible and ship it in 3.11 in time for OSP14
2) Document how to set the openshift_storage_glusterfs_heketi_admin_key variable for a scale-up
3) Implement a way in TripleO to retrieve the secret and inject it into openshift-ansible before the scale-up operation

Since we have a workaround, I suggest we remove the "blocker?" flag.
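To make option 2 concrete, a rough end-to-end sketch of the workaround, assuming the default "glusterfs" namespace; the playbook path and inventory file name are illustrative, not taken from this report:

  # Extract the existing heketi admin key from the running cluster.
  HEKETI_KEY=$(sudo oc get secret heketi-storage-admin-secret --namespace glusterfs \
      -o json | jq -r .data.key | base64 -d)

  # Pass it to the scale-up run so openshift-ansible reuses the existing key
  # instead of auto-generating a new one.
  ansible-playbook -i inventory \
      -e "openshift_storage_glusterfs_heketi_admin_key=${HEKETI_KEY}" \
      playbooks/openshift-node/scaleup.yml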
Submitted a fix in openshift-ansible: https://github.com/openshift/openshift-ansible/pull/10710
Removing the blocker flag because we have a workaround: https://bugzilla.redhat.com/show_bug.cgi?id=1640382#c7
Fix included in openshift-ansible-3.11.74-1.
The ose-ansible container image was updated to v3.11.82-5 on the registry and should have the fix. https://access.redhat.com/containers/?tab=tags#/registry.access.redhat.com/openshift3/ose-ansible
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1605