Bug 1640382

Summary: Director deployed OCP: replacing worker node fails during TASK [openshift_storage_glusterfs : Verify heketi service]
Product: OpenShift Container Platform Reporter: Marius Cornea <mcornea>
Component: Installer    Assignee: Martin André <m.andre>
Installer sub component: openshift-ansible QA Contact: Johnny Liu <jialiu>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: dbecker, jtrowbri, ltomasbo, m.andre, mburns, mlopes, morazi, racedoro, tsedovic
Version: 3.11.0    Keywords: ZStream
Target Milestone: ---   
Target Release: 3.11.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openshift-ansible-3.11.74-1.git.0.cde4c69.el7 Doc Type: Known Issue
Doc Text:
On a director-deployed OpenShift environment, the GlusterFS playbooks auto-generate a new heketi secret key for each run. As a result, operations such as scale out or configuration changes on CNS deployments fail. As a workaround, complete the following steps:
1. Post-deployment, retrieve the heketi secret key by running this command on one of the master nodes:
   sudo oc get secret heketi-storage-admin-secret --namespace glusterfs -o json | jq -r .data.key | base64 -d
2. In an environment file, set the following parameters to that value:
   openshift_storage_glusterfs_heketi_admin_key
   openshift_storage_glusterfs_registry_heketi_admin_key
With these parameters set, operations such as scale out or configuration changes on CNS deployments succeed.
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-06-26 09:07:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
openshift.tar.gz (no flags)

Description Marius Cornea 2018-10-17 23:23:26 UTC
Description of problem:
Director deployed OCP: replacing worker node fails during TASK [openshift_storage_glusterfs : Verify heketi service]:

TASK [openshift_storage_glusterfs : Verify heketi service] *********************
fatal: [openshift-master-2]: FAILED! => {"changed": false, "cmd": ["oc", "--config=/tmp/openshift-glusterfs-ansible-1W8rUl/admin.kubeconfig", "rsh", "--namespace=glusterfs", "heketi-storage-1-2lx5b", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "cluster", "list"], "delta": "0:00:00.397903", "end": "2018-10-17 18:48:13.396208", "msg": "non-zero return code", "rc": 255, "start": "2018-10-17 18:48:12.998305", "stderr": "Error: Invalid JWT token: signature is invalid (client and server secrets may not match)\ncommand terminated with exit code 255", "stderr_lines": ["Error: Invalid JWT token: signature is invalid (client and server secrets may not match)", "command terminated with exit code 255"], "stdout": "", "stdout_lines": []}

PLAY RECAP *********************************************************************
localhost                  : ok=39   changed=0    unreachable=0    failed=0   
openshift-infra-0          : ok=206  changed=23   unreachable=0    failed=0   
openshift-infra-1          : ok=206  changed=23   unreachable=0    failed=0   
openshift-infra-2          : ok=208  changed=23   unreachable=0    failed=0   
openshift-master-0         : ok=326  changed=38   unreachable=0    failed=0   
openshift-master-1         : ok=326  changed=38   unreachable=0    failed=0   
openshift-master-2         : ok=531  changed=83   unreachable=0    failed=1   
openshift-worker-1         : ok=206  changed=23   unreachable=0    failed=0   
openshift-worker-2         : ok=207  changed=23   unreachable=0    failed=0   
openshift-worker-3         : ok=216  changed=73   unreachable=0    failed=0   


INSTALLER STATUS ***************************************************************
Initialization              : Complete (0:03:47)
Node Bootstrap Preparation  : Complete (0:06:34)
Master Install              : Complete (0:07:29)
Node Join                   : Complete (0:00:39)
Load Balancer Install       : Complete (0:00:01)


Failure summary:


  1. Hosts:    openshift-master-2
     Play:     Reload glusterfs topology
     Task:     Verify heketi service
     Message:  non-zero return code


Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-9.0.0-0.20181001174822.90afd18.0rc2.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy an OpenShift overcloud with 3 masters + 3 workers + 3 infra nodes
2. Delete 1 worker node
3. Re-run deploy to re-provision the node deleted in step 2

Actual results:
openshift-ansible fails with:

TASK [openshift_storage_glusterfs : Verify heketi service] *********************
fatal: [openshift-master-2]: FAILED! => {"changed": false, "cmd": ["oc", "--config=/tmp/openshift-glusterfs-ansible-1W8rUl/admin.kubeconfig", "rsh", "--namespace=glusterfs", "heketi-storage-1-2lx5b", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "cluster", "list"], "delta": "0:00:00.397903", "end": "2018-10-17 18:48:13.396208", "msg": "non-zero return code", "rc": 255, "start": "2018-10-17 18:48:12.998305", "stderr": "Error: Invalid JWT token: signature is invalid (client and server secrets may not match)\ncommand terminated with exit code 255", "stderr_lines": ["Error: Invalid JWT token: signature is invalid (client and server secrets may not match)", "command terminated with exit code 255"], "stdout": "", "stdout_lines": []}


Expected results:
No failure.

Additional info:

The same error shows up when re-running the overcloud deploy a second time without any changes.

Comment 1 Marius Cornea 2018-10-17 23:25:46 UTC
Created attachment 1494991 [details]
openshift.tar.gz

Attaching /var/lib/mistral/openshift

Comment 2 Martin André 2018-10-18 11:31:52 UTC
Possibly fixed with https://review.openstack.org/#/c/611306/? New node detection was broken due to different service naming downstream; this patch fixed it.

Comment 3 Martin André 2018-10-22 07:53:57 UTC
Just remembered that node replacement is targeted at OSP15 (https://bugzilla.redhat.com/show_bug.cgi?id=1591288); removing the triaged information so that the bug appears in our triage meeting.

Comment 5 Martin André 2018-10-24 10:26:58 UTC
Looks awfully similar to https://bugzilla.redhat.com/show_bug.cgi?id=1637105.

Maybe one element of response:

"By default, the GlusterFS playbooks will auto-generate a new heketi secret key for each run. You need to extract the key from the heketi config secret and set it as the value for "openshift_storage_glusterfs_heketi_admin_key" in your inventory file. That will reuse the existing key in your cluster when running the scaleup playbook."

Comment 6 Marius Cornea 2018-10-24 14:08:56 UTC
(In reply to Martin André from comment #5)
> Looks awfully similar to https://bugzilla.redhat.com/show_bug.cgi?id=1637105.
> 
> Maybe one element of response:
> 
> "By default, the GlusterFS playbooks will auto-generate a new heketi secret
> key for each run. You need to extract the key from the heketi config secret
> and set it as the value for "openshift_storage_glusterfs_heketi_admin_key"
> in your inventory file. That will reuse the existing key in your cluster
> when running the scaleup playbook."

Yes, it looks like the same issue, but imo we can't consider this a viable solution to the problem because it involves manual actions by the operator on the overcloud nodes. Can we do these steps automatically on the tripleo side?

Comment 7 Martin André 2018-10-25 06:23:01 UTC
I can confirm that the scale up goes to completion if I set the openshift_storage_glusterfs_heketi_admin_key var to the existing heketi secret.

Here is a one-liner to get the heketi secret in an environment where openshift_storage_glusterfs_namespace=glusterfs (the default):

  sudo oc get secret heketi-storage-admin-secret --namespace glusterfs -o json | jq -r .data.key | base64 -d
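
For a director-deployed environment, a minimal sketch of feeding that value back in on the next overcloud deploy could look like the following. The OpenShiftGlobalVariables parameter is assumed to be the tripleo-heat-templates hook for passing extra openshift-ansible variables, and the key values are placeholders; neither is confirmed in this bug:

  # heketi-admin-key.yaml -- hypothetical environment file passed with -e to
  # the next openstack overcloud deploy run.
  parameter_defaults:
    # Assumed tripleo-heat-templates parameter for extra openshift-ansible
    # variables; replace the placeholders with the key extracted above.
    OpenShiftGlobalVariables:
      openshift_storage_glusterfs_heketi_admin_key: "<key from heketi-storage-admin-secret>"
      openshift_storage_glusterfs_registry_heketi_admin_key: "<key from heketi-storage-admin-secret>"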

I've chatted a little bit with Jose A. Rivera about the issue and this looks like a regression in openshift-ansible.

Our options are:
1) Fix the issue in openshift-ansible and ship it in 3.11 in time for OSP14
2) Document how to set the openshift_storage_glusterfs_heketi_admin_key variable for a scale up
3) Implement a way in tripleo to retrieve the secret and inject it into openshift-ansible before the scale up operation.

Since we have a workaround, I suggest we remove the "blocker?" flag.

Comment 8 Martin André 2018-11-19 13:45:15 UTC
Submitted a fix in openshift-ansible: https://github.com/openshift/openshift-ansible/pull/10710

Comment 9 John Trowbridge 2018-11-19 15:14:02 UTC
Removing blocker flag because we have a workaround. https://bugzilla.redhat.com/show_bug.cgi?id=1640382#c7

Comment 12 Martin André 2019-01-25 06:53:56 UTC
Fix included in openshift-ansible-3.11.74-1.

Comment 15 Martin André 2019-02-20 10:57:06 UTC
The ose-ansible container image was updated to v3.11.82-5 on the registry and should have the fix.

https://access.redhat.com/containers/?tab=tags#/registry.access.redhat.com/openshift3/ose-ansible

Comment 18 errata-xmlrpc 2019-06-26 09:07:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1605