Bug 1645656

Summary:	Director deployed OCP 3.11: replacing an Infra node fails during TASK [openshift_storage_glusterfs : Verify heketi service]
Product:	OpenShift Container Platform	Reporter:	Marius Cornea <mcornea>
Component:	Installer	Assignee:	Martin André <m.andre>
Installer sub component:	openshift-ansible	QA Contact:	Johnny Liu <jialiu>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	urgent
Priority:	urgent	CC:	dbecker, jtrowbri, ltomasbo, m.andre, mburns, morazi
Version:	3.11.0	Keywords:	ZStream
Target Milestone:	---
Target Release:	3.11.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	openshift-ansible-3.11.74-1.git.0.cde4c69.el7	Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-06-26 09:07:51 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Marius Cornea 2018-11-02 18:43:17 UTC

Description of problem:
Director deployed OCP 3.11: replacing an Infra node fails during TASK [openshift_storage_glusterfs : Verify heketi service]:

TASK [openshift_storage_glusterfs : Verify heketi service] *********************
fatal: [openshift-master-2]: FAILED! => {"changed": false, "cmd": ["oc", "--config=/tmp/openshift-glusterfs-ansible-b18knu/admin.kubeconfig", "rsh", "--namespace=default", "heketi-registry-1-mjsg9", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "cluster", "list"], "delta": "0:00:00.434645", "end": "2018-11-02 07:08:09.487683", "msg": "non-zero return code", "rc": 255, "start": "2018-11-02 07:08:09.053038", "stderr": "Error: Invalid JWT token: signature is invalid (client and server secrets may not match)\ncommand terminated with exit code 255", "stderr_lines": ["Error: Invalid JWT token: signature is invalid (client and server secrets may not match)", "command terminated with exit code 255"], "stdout": "", "stdout_lines": []}

PLAY RECAP *********************************************************************
localhost                  : ok=36   changed=0    unreachable=0    failed=0
openshift-infra-1          : ok=41   changed=6    unreachable=0    failed=0
openshift-infra-2          : ok=41   changed=6    unreachable=0    failed=0
openshift-infra-3          : ok=219  changed=74   unreachable=0    failed=0
openshift-master-0         : ok=81   changed=6    unreachable=0    failed=0
openshift-master-1         : ok=81   changed=6    unreachable=0    failed=0
openshift-master-2         : ok=138  changed=7    unreachable=0    failed=1
openshift-worker-0         : ok=41   changed=6    unreachable=0    failed=0
openshift-worker-1         : ok=41   changed=6    unreachable=0    failed=0
openshift-worker-2         : ok=42   changed=6    unreachable=0    failed=0


INSTALLER STATUS ***************************************************************
Initialization              : Complete (0:01:55)
Node Bootstrap Preparation  : Complete (0:03:58)
Node Join                   : Complete (0:00:17)


Failure summary:


  1. Hosts:    openshift-master-2
     Play:     Reload glusterfs topology
     Task:     Verify heketi service
     Message:  non-zero return code

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-9.0.1-0.20181013060867.ffbe879.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP overcloud with 3 masters + 3 infra nodes + 3 worker nodes with CNS enabled
2. Remove openshift-infra-0 node
3. Re-run overcloud deploy to add openshift-infra-3 node

Actual results:
Fails during TASK [openshift_storage_glusterfs : Verify heketi service]

Expected results:
No failure.

Additional info:
Attaching /var/lib/mistral

Comment 2 Marius Cornea 2018-11-02 18:44:53 UTC

Note: https://bugzilla.redhat.com/show_bug.cgi?id=1640382#c7 was applied as a workaround on the env.

Comment 3 Martin André 2018-11-15 08:58:38 UTC

(In reply to Marius Cornea from comment #2)
> Note: https://bugzilla.redhat.com/show_bug.cgi?id=1640382#c7 was applied as
> a workaround on the env.

For the infra node scale up, you would have to set the openshift-ansible openshift_storage_glusterfs_registry_heketi_admin_key variable instead.

Now, I'm less sure about how to retrieve the heketi secret... According to the code, I would expect the secret to be named heketi-registry-admin-secret but all I can see in my environment is a heketi-storage-admin-secret secret. Possibly, the storage and registry share the same secret?

Anyway, here is how to retrieve the heketi-storage-admin-secret secret:

sudo oc get secret heketi-storage-admin-secret --namespace glusterfs -o json | jq -r .data.key | base64 -d
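For scripted use, the `jq -r .data.key | base64 -d` part of that pipeline can be sketched in Python. This is a minimal illustration, not part of openshift-ansible: the function name `extract_heketi_key` and the sample secret (including its key value) are hypothetical, and the input is assumed to be the JSON printed by `oc get secret ... -o json`.

```python
import base64
import json


def extract_heketi_key(secret_json: str) -> str:
    """Return the decoded .data.key field of a Secret given as JSON text."""
    secret = json.loads(secret_json)
    # Values under .data are base64-encoded, hence the decode step.
    return base64.b64decode(secret["data"]["key"]).decode("utf-8")


# Hypothetical sample shaped like `oc get secret ... -o json` output;
# the key value is made up for illustration.
sample = json.dumps({
    "kind": "Secret",
    "metadata": {"name": "heketi-storage-admin-secret", "namespace": "glusterfs"},
    "data": {"key": base64.b64encode(b"example-admin-key").decode("ascii")},
})

print(extract_heketi_key(sample))
```

In a real environment the `sample` string would be replaced by the captured output of the `oc get secret` command above.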

Comment 4 Martin André 2018-11-15 15:12:59 UTC

Scaling out the infra node does indeed succeed if I set `openshift_storage_glusterfs_registry_heketi_admin_key` to the output of:

  sudo oc get secret heketi-storage-admin-secret --namespace glusterfs -o json | jq -r .data.key | base64 -d
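Why the matching key matters can be illustrated with a small sketch. It assumes, as the "Invalid JWT token: signature is invalid (client and server secrets may not match)" error suggests, that the client signs its request token with an HMAC over a shared secret and the server recomputes that signature with its own copy. The helper names, the claims, and both key values below are hypothetical; this is not heketi's actual code.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign(claims: dict, secret: str) -> str:
    """Build an HS256-style token: header.payload.signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = (
        b64url(json.dumps(header).encode()) + "." + b64url(json.dumps(claims).encode())
    )
    sig = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)


def verify(token: str, secret: str) -> bool:
    """Recompute the signature with the verifier's secret and compare."""
    signing_input, _, sig = token.rpartition(".")
    expected = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return hmac.compare_digest(b64url(expected), sig)


# Hypothetical keys: the client signs with one secret, the server holds another.
token = sign({"iss": "admin"}, "key-from-existing-secret")
print(verify(token, "key-from-existing-secret"))  # secrets match
print(verify(token, "freshly-generated-key"))     # secrets differ
```

Under that assumption, a scale-up run that does not reuse the deployed admin key would sign with a different secret than the running heketi pod expects, and verification fails as in the second call.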

Comment 5 Martin André 2018-11-19 13:49:34 UTC

Submitted a partial fix in openshift-ansible: https://github.com/openshift/openshift-ansible/pull/10710

However, there is an issue with the name of the registry heketi secret (https://github.com/openshift/openshift-ansible/issues/10712) so the above patch does not completely fix the issue.

Comment 6 John Trowbridge 2018-11-19 15:12:29 UTC

Removing blocker flag because we have a workaround. https://bugzilla.redhat.com/show_bug.cgi?id=1640382#c7

Comment 7 John Trowbridge 2018-11-19 15:15:19 UTC

Workaround for this is actually in comment 4: https://bugzilla.redhat.com/show_bug.cgi?id=1645656#c4

Comment 9 Martin André 2019-01-24 18:51:47 UTC

Proposed a fix to openshift-ansible: https://github.com/openshift/openshift-ansible/pull/11072

Comment 10 Martin André 2019-01-25 06:52:57 UTC

Fix included in openshift-ansible-3.11.74-1.

Comment 13 Martin André 2019-02-20 10:56:24 UTC

The ose-ansible container image was updated to v3.11.82-5 on the registry and should have the fix.

https://access.redhat.com/containers/?tab=tags#/registry.access.redhat.com/openshift3/ose-ansible

Comment 15 Marius Cornea 2019-04-10 15:45:28 UTC

TASK [openshift_storage_glusterfs : Verify heketi service] *********************
ok: [openshift-master-2]

TASK [openshift_storage_glusterfs : Wait for GlusterFS pods] *******************
FAILED - RETRYING: Wait for GlusterFS pods (30 retries left).
FAILED - RETRYING: Wait for GlusterFS pods (29 retries left).
FAILED - RETRYING: Wait for GlusterFS pods (28 retries left).
FAILED - RETRYING: Wait for GlusterFS pods (27 retries left).
FAILED - RETRYING: Wait for GlusterFS pods (26 retries left).
ok: [openshift-master-2]

Comment 17 errata-xmlrpc 2019-06-26 09:07:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1605