Bug 1640382 - Director deployed OCP: replacing worker node fails during TASK [openshift_storage_glusterfs : Verify heketi service]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.11.z
Assignee: Martin André
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-10-17 23:23 UTC by Marius Cornea
Modified: 2019-06-26 09:08 UTC
CC: 9 users

Fixed In Version: openshift-ansible-3.11.74-1.git.0.cde4c69.el7
Doc Type: Known Issue
Doc Text:
On a director-deployed OpenShift environment, the GlusterFS playbooks auto-generate a new heketi secret key for each run. As a result of this, operations such as scale out or configuration changes on CNS deployments fail. As a workaround, complete the following steps: 1. Post-deployment, retrieve the heketi secret key. Use this command on one of the master nodes: sudo oc get secret heketi-storage-admin-secret --namespace glusterfs -o json | jq -r .data.key | base64 -d 2. In an environment file, set the following parameters to that value: openshift_storage_glusterfs_heketi_admin_key openshift_storage_glusterfs_registry_heketi_admin_key As a result of this workaround, operations such as scale out or configuration changes on CNS deployments work as long as the parameters were manually extracted.
Clone Of:
Environment:
Last Closed: 2019-06-26 09:07:51 UTC
Target Upstream Version:


Attachments (Terms of Use)
openshift.tar.gz (6.50 MB, application/x-gzip)
2018-10-17 23:25 UTC, Marius Cornea


Links
System ID Priority Status Summary Last Updated
Github openshift openshift-ansible pull 10710 None closed Retrieve heketi secret before setting CLI command 2020-07-04 18:40:28 UTC
Red Hat Product Errata RHBA-2019:1605 None None None 2019-06-26 09:07:59 UTC

Description Marius Cornea 2018-10-17 23:23:26 UTC
Description of problem:
Director deployed OCP: replacing worker node fails during TASK [openshift_storage_glusterfs : Verify heketi service]:

TASK [openshift_storage_glusterfs : Verify heketi service] *********************
fatal: [openshift-master-2]: FAILED! => {"changed": false, "cmd": ["oc", "--config=/tmp/openshift-glusterfs-ansible-1W8rUl/admin.kubeconfig", "rsh", "--namespace=glusterfs", "heketi-storage-1-2lx5b", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "cluster", "list"], "delta": "0:00:00.397903", "end": "2018-10-17 18:48:13.396208", "msg": "non-zero return code", "rc": 255, "start": "2018-10-17 18:48:12.998305", "stderr": "Error: Invalid JWT token: signature is invalid (client and server secrets may not match)\ncommand terminated with exit code 255", "stderr_lines": ["Error: Invalid JWT token: signature is invalid (client and server secrets may not match)", "command terminated with exit code 255"], "stdout": "", "stdout_lines": []}

PLAY RECAP *********************************************************************
localhost                  : ok=39   changed=0    unreachable=0    failed=0   
openshift-infra-0          : ok=206  changed=23   unreachable=0    failed=0   
openshift-infra-1          : ok=206  changed=23   unreachable=0    failed=0   
openshift-infra-2          : ok=208  changed=23   unreachable=0    failed=0   
openshift-master-0         : ok=326  changed=38   unreachable=0    failed=0   
openshift-master-1         : ok=326  changed=38   unreachable=0    failed=0   
openshift-master-2         : ok=531  changed=83   unreachable=0    failed=1   
openshift-worker-1         : ok=206  changed=23   unreachable=0    failed=0   
openshift-worker-2         : ok=207  changed=23   unreachable=0    failed=0   
openshift-worker-3         : ok=216  changed=73   unreachable=0    failed=0   


INSTALLER STATUS ***************************************************************
Initialization              : Complete (0:03:47)
Node Bootstrap Preparation  : Complete (0:06:34)
Master Install              : Complete (0:07:29)
Node Join                   : Complete (0:00:39)
Load Balancer Install       : Complete (0:00:01)


Failure summary:


  1. Hosts:    openshift-master-2
     Play:     Reload glusterfs topology
     Task:     Verify heketi service
     Message:  non-zero return code


Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-9.0.0-0.20181001174822.90afd18.0rc2.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy openshift overcloud with 3 masters + 3 workers + 3 infra nodes
2. Delete 1 worker node
3. Re-run deploy to re-provision the node deleted in step 2

Actual results:
openshift-ansible fails during TASK [openshift_storage_glusterfs : Verify heketi service] with the "Error: Invalid JWT token: signature is invalid (client and server secrets may not match)" failure shown in the description.


Expected results:
No failure.

Additional info:

The same error shows up when re-running the overcloud deploy a second time without any changes.

Comment 1 Marius Cornea 2018-10-17 23:25:46 UTC
Created attachment 1494991 [details]
openshift.tar.gz

Attaching /var/lib/mistral/openshift

Comment 2 Martin André 2018-10-18 11:31:52 UTC
Possibly fixed with https://review.openstack.org/#/c/611306/? New node detection was broken due to different service naming downstream; this patch fixed it.

Comment 3 Martin André 2018-10-22 07:53:57 UTC
Just remembered that node replacement is targeted at OSP15 (https://bugzilla.redhat.com/show_bug.cgi?id=1591288). Removing triaged information so that the bug appears in our triage meeting.

Comment 5 Martin André 2018-10-24 10:26:58 UTC
Looks awfully similar to https://bugzilla.redhat.com/show_bug.cgi?id=1637105.

Maybe one element of response:

"By default, the GlusterFS playbooks will auto-generate a new heketi secret key for each run. You need to extract the key from the heketi config secret and set it as the value for "openshift_storage_glusterfs_heketi_admin_key" in your inventory file. That will reuse the existing key in your cluster when running the scaleup playbook."
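The key-extraction step described above can be sketched as follows. The JSON here is a mocked stand-in for the output of `oc get secret heketi-storage-admin-secret -o json`, since the real command needs a live cluster, and the encoded key value is purely illustrative:

```shell
# Mocked output of: oc get secret heketi-storage-admin-secret --namespace glusterfs -o json
# (the value is a hypothetical base64-encoded key, for illustration only)
secret_json='{"data":{"key":"bXktaGVrZXRpLWtleQ=="}}'

# Same pipeline as against a live cluster: pull .data.key and base64-decode it
heketi_key=$(printf '%s' "$secret_json" | jq -r .data.key | base64 -d)
printf '%s\n' "$heketi_key"
```

The decoded value is what would go into openshift_storage_glusterfs_heketi_admin_key in the inventory.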

Comment 6 Marius Cornea 2018-10-24 14:08:56 UTC
(In reply to Martin André from comment #5)
> Looks awfully similar to https://bugzilla.redhat.com/show_bug.cgi?id=1637105.
> 
> Maybe one element of response:
> 
> "By default, the GlusterFS playbooks will auto-generate a new heketi secret
> key for each run. You need to extract the key from the heketi config secret
> and set it as the value for "openshift_storage_glusterfs_heketi_admin_key"
> in your inventory file. That will reuse the existing key in your cluster
> when running the scaleup playbook."

Yes, it looks like the same issue, but in my opinion we can't consider this a viable solution to the problem because it involves manual operator actions on the overcloud nodes. Can we perform these steps automatically on the tripleo side?

Comment 7 Martin André 2018-10-25 06:23:01 UTC
I can confirm that the scale up goes to completion if I set the openshift_storage_glusterfs_heketi_admin_key var to the existing heketi secret.

Here is a one-liner to get the heketi secret in an environment where openshift_storage_glusterfs_namespace=glusterfs (the default):

  sudo oc get secret heketi-storage-admin-secret --namespace glusterfs -o json | jq -r .data.key | base64 -d
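For example, the retrieved key can then be persisted into the Ansible inventory so that subsequent runs reuse it. This is only a sketch: the inventory path and the fallback value are hypothetical, and on a real deployment $heketi_key would come from the oc one-liner above:

```shell
# Hypothetical fallback value and inventory path, for illustration only;
# in practice, populate heketi_key from the oc one-liner shown above.
heketi_key="${heketi_key:-example-key}"
inventory="${inventory:-./inventory.yml}"

# Append both heketi admin key parameters so re-runs reuse the existing key
cat >> "$inventory" <<EOF
openshift_storage_glusterfs_heketi_admin_key: ${heketi_key}
openshift_storage_glusterfs_registry_heketi_admin_key: ${heketi_key}
EOF
```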

I've chatted a little bit with Jose A. Rivera about the issue and this looks like a regression in openshift-ansible.

Our options are:
1) Fix the issue in openshift-ansible and ship it in 3.11 in time for OSP14
2) Document how to set the openshift_storage_glusterfs_heketi_admin_key variable for a scale up
3) Implement a way in tripleo to retrieve the secret and inject it into openshift-ansible before the scale up operation.

Since we have a workaround, I suggest we remove the "blocker?" flag.

Comment 8 Martin André 2018-11-19 13:45:15 UTC
Submitted a fix in openshift-ansible: https://github.com/openshift/openshift-ansible/pull/10710

Comment 9 John Trowbridge 2018-11-19 15:14:02 UTC
Removing blocker flag because we have a workaround. https://bugzilla.redhat.com/show_bug.cgi?id=1640382#c7

Comment 12 Martin André 2019-01-25 06:53:56 UTC
Fix included in openshift-ansible-3.11.74-1.

Comment 15 Martin André 2019-02-20 10:57:06 UTC
The ose-ansible container image was updated to v3.11.82-5 on the registry and should have the fix.

https://access.redhat.com/containers/?tab=tags#/registry.access.redhat.com/openshift3/ose-ansible

Comment 18 errata-xmlrpc 2019-06-26 09:07:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1605

