Bug 1533582 - OCP 3.7 with CNS deployment fails with "Error: signature is invalid" during "heketi-cli cluster list"
Summary: OCP 3.7 with CNS deployment fails with "Error: signature is invalid" during "heketi-cli cluster list"
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.7.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.10.0
Assignee: Jose A. Rivera
QA Contact: Wenkai Shi
URL:
Whiteboard:
Depends On:
Blocks: 1724792
 
Reported: 2018-01-11 16:42 UTC by Thom Carlin
Modified: 2019-06-28 16:04 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-09 15:30:03 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1537500 0 unspecified CLOSED [RFE] Improve "Signature is invalid" error message 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1545452 0 unspecified CLOSED [DOCS] [RFE] Add cautionary note about re-running CNS (gluster) playbooks for installation 2021-02-22 00:41:40 UTC

Internal Links: 1537500 1545452

Description Thom Carlin 2018-01-11 16:42:29 UTC
Description of problem:

During initial deployment of new OCP 3.7/CNS deployment, receive error while playbook is deploying heketi

Version-Release number of selected component (if applicable):

3.7

How reproducible:

100%

Steps to Reproduce:
1. Follow steps to deploy 3.7 with CNS.  I used VMs as target systems (RPM, non-containerized)
2. At step 2.6.5.1 https://access.redhat.com/documentation/en-us/openshift_container_platform/3.7/html/installation_and_configuration/installing-a-cluster#running-the-advanced-installation receive the error
3.

Actual results:

Failed install

 fatal: [<<hostname>>]: FAILED! => {"changed": false, "cmd": ["oc", "rsh", "--namespace=glusterfs", "deploy-heketi-storage-1-pw88h", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "--secret", "<<string>>", "cluster", "list"], "delta": "0:00:01.506115", "end": "2018-01-11 08:54:32.386670", "failed": true, "msg": "non-zero return code", "rc": 255, "start": "2018-01-11 08:54:30.880555", "stderr": "Error: signature is invalid\ncommand terminated with exit code 255", "stderr_lines": ["Error: signature is invalid", "command terminated with exit code 255"], "stdout": "", "stdout_lines": []}

Expected results:

Successful install

Additional info:

Reviewing deploy-heketi-storage pod logs now... More information as it develops.


Comment 1 Thom Carlin 2018-01-11 20:05:35 UTC
From logs:

[negroni] Started GET /queue/9934d276c640b5611ecf29471b8f8c6a
[negroni] Completed 303 See Other in 314.717µs
[negroni] Started GET /volumes/9dfe8ee6c6a523ed4d204a829b88f0fe
[negroni] Completed 200 OK in 2.264309ms
[negroni] Started GET /backup/db
[negroni] Completed 200 OK in 9.681242ms
[negroni] Started GET /clusters
[negroni] Completed 401 Unauthorized in 2.478413ms

Determine HEKETI_ADMIN_KEY:
oc describe dc deploy-heketi-storage | egrep HEKETI_ADMIN_KEY | cut -b27-

oc rsh deploy-heketi-storage-<<number>>-<<string>>
# heketi-cli --user admin --secret <<HEKETI_ADMIN_KEY>> cluster list 
[...Cluster list...]
# exit
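
The same check can also be run in one shot from outside the pod. This is a rough sketch, not taken from the playbook itself; it assumes the "glusterfs" namespace and pod name pattern from the failing task above, and that HEKETI_ADMIN_KEY is set as a literal env value on the dc (as the describe/cut extraction above suggests):

HEKETI_ADMIN_KEY=$(oc describe dc/deploy-heketi-storage -n glusterfs | awk '/HEKETI_ADMIN_KEY/ {print $2}')
oc rsh -n glusterfs deploy-heketi-storage-<<number>>-<<string>> heketi-cli -s http://localhost:8080 --user admin --secret "$HEKETI_ADMIN_KEY" cluster list

If this succeeds with the key read from the dc while the playbook task still fails, the key the playbook passes differs from the one the running pod was started with.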

 ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml -vvv >/tmp/playbook_output.log 2>/tmp/playbook_output.err
The 2 files will be uploaded as private attachments

Comment 2 Thom Carlin 2018-01-11 20:17:41 UTC
Also uploading /etc/ansible/hosts (inventory) file
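
For context, the CNS-relevant portion of such an inventory typically looks roughly like the sketch below. Group and variable names are the standard openshift-ansible ones; host names and devices are placeholders, not taken from the attached file:

[OSEv3:children]
masters
nodes
glusterfs

[OSEv3:vars]
# (other cluster variables omitted)
openshift_storage_glusterfs_namespace=glusterfs
openshift_storage_glusterfs_storageclass=true

[glusterfs]
node1.example.com glusterfs_devices='[ "/dev/sdb" ]'
node2.example.com glusterfs_devices='[ "/dev/sdb" ]'
node3.example.com glusterfs_devices='[ "/dev/sdb" ]'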

Comment 4 Thom Carlin 2018-01-11 20:55:20 UTC
Note: "Signature is invalid" does not refer to the image signature

Instead, it refers to the JSON Web Token (JWT) https://jwt.io/introduction/
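
One quick way to confirm this is an HTTP-level auth failure rather than anything image-related (a sketch; it assumes curl is available inside the deploy-heketi image): heketi's unauthenticated /hello endpoint answers normally, while /clusters returns the same 401 seen in the negroni log above.

oc rsh deploy-heketi-storage-<<number>>-<<string>>
# curl -s http://localhost:8080/hello
# curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/clusters
# exit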

Comment 7 Thom Carlin 2018-01-11 22:06:21 UTC
There appears to be a difference between the following (compared in the sketch after the list):
* dc/deploy-heketi-storage (HEKETI_ADMIN_KEY)
* key in /usr/share/ansible/openshift-ansible/roles/openshift_storage_glusterfs/tasks/glusterfs_common.yml.  The decoded value is null since the oc secret heketi-storage-admin-secret was not found
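
A quick way to compare the two (a sketch; it assumes the default "glusterfs" namespace, that the key is a literal env value on the dc, and that the secret, when it exists, stores the admin key base64-encoded under a data field named "key"):

oc get dc/deploy-heketi-storage -n glusterfs -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="HEKETI_ADMIN_KEY")].value}'
oc get secret heketi-storage-admin-secret -n glusterfs -o jsonpath='{.data.key}' | base64 -d

Here the second command errors out, which matches the missing-secret observation above.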

Comment 8 Jose A. Rivera 2018-01-19 23:26:06 UTC
Sorry for the delay, I've been traveling.

Are you still seeing the problem? Does it persist if you retry the installer? Note that we currently do not support running the GlusterFS playbook more than once in a given environment, even if it fails. You'll have to either reset the entire environment, manually remove the GlusterFS installation, or try your luck with the openshift_storage_glusterfs_wipe option (also not officially supported). You could also abandon the openshift-ansible route and go to using cns-deploy, documented in the CNS documentation.
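
For reference, the cns-deploy route boils down to something like the sketch below, run against a topology.json describing the storage nodes; the exact flags should be checked against the CNS documentation, and the project name is a placeholder:

cns-deploy -n <<project_name>> -g topology.json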

Also, you have a typo: openshift_storage_glsuterfs_storageclass=true should have "glusterfs", not "glsuterfs".

Comment 9 Thom Carlin 2018-01-22 14:32:20 UTC
Thank you for spotting the typo -- it has been corrected.

After retrying installer, it failed at the same place.

Please explain "Note that we currently do not support running the GlusterFS playbook more than once in a given environment, even if it fails. You'll have to either reset the entire environment, manually remove the GlusterFS installation, or try your luck with the openshift_storage_glusterfs_wipe option (also not officially supported). ":
1) How does an end-user determine if the GlusterFS playbook has been run already?  
2) Could we add a fact to determine so as a fail-safe?
3) How can an end-user reset the entire environment?
4) How can an end-user manually remove the GlusterFS installation?
If openshift_storage_glusterfs_wipe is not supported, we don't need those instructions
5) To abandon the openshift-ansible route, would the steps be to remove all "glusterfs" related commands in /etc/ansible/hosts and follow the steps in the CNS doc (as was done in previous docs)?

From this, I see the following next steps:
A) Improving the "Error: signature is invalid" message
B) The fact to determine if the GlusterFS playbook has run should be added
C) A doc BZ incorporating the above information should be created (once the information from 3 to 5 is supplied)

Comment 10 Jose A. Rivera 2018-01-22 15:30:45 UTC
(In reply to Thom Carlin from comment #9)
> Please explain "Note that we currently do not support running the GlusterFS
> playbook more than once in a given environment, even if it fails. You'll
> have to either reset the entire environment, manually remove the GlusterFS
> installation, or try your luck with the openshift_storage_glusterfs_wipe
> option (also not officially supported). ":
>
> 1) How does an end-user determine if the GlusterFS playbook has been run
> already?  

By remembering if they did. There's no way to discover if they did.

> 2) Could we add a fact to determine so as a fail-safe?

Not desirable; there's already work to make it more idempotent, so this would be a bit of a step back.

> 3) How can an end-user reset the entire environment?
> 4) How can an end-user manually remove the GlusterFS installation?
> If openshift_storage_glusterfs_wipe is not supported, we don't need those
> instructions
> 5) To abandon the openshift-ansible route, would the steps be to remove all
> "glusterfs" related commands in /etc/ansible/hosts and follow the steps in
> the CNS doc (as was done in previous docs)?

These all seem to be the same question, so: yes. You can also just delete the project containing CNS to delete all its related pods and other resources if you don't care about preserving data integrity.

> From this, I see the following next steps:
> A) Improving the "Error: signature is invalid" message
> B) The fact to determine if the GlusterFS playbook has run should be added
> C) A doc bz that incorporating the above information be created (once the
> information from 3 to 5 are supplied)

And yes, this all belongs in a separate BZ. :)

Comment 11 Thom Carlin 2018-01-23 11:51:19 UTC
Sample steps to reset/remove (a rough shell sketch follows the list):
1) oc delete project <<project_name>>
2) Repeat "oc get project"/"oc get namespace" until the project disappears
3) On each storage node:
   A) vgremove <<vg_name_for_cns>>
   B) pvremove <<pv_name_for_cns>>

Please note these steps have not been fully vetted and GSS should be consulted before this procedure is followed.  Data loss and system downtime are potential side-effects of errors/issues.
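
The list above, expressed as a rough shell sketch (same caveats: placeholders only, not fully vetted, consult GSS first):

oc delete project <<project_name>>
# wait for the namespace to be fully removed before re-running the installer
while oc get namespace <<project_name>> >/dev/null 2>&1; do sleep 10; done
# on each storage node (destructive):
vgremove <<vg_name_for_cns>>
pvremove <<pv_name_for_cns>>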

I removed the CNS environment and reran the installer.  Currently, I see issues with heketi-cli topology load:
"[...]
Creating node <<node3_fqdn>> ... Unable to create node: Unable to execute command on glusterfs-storage-gtkll: peer probe: failed: <<node3_IP_address>> is either already part of another cluster or having volumes configured"
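
That peer probe failure usually means the node still carries GlusterFS metadata from the previous run. A commonly suggested (and equally unvetted) extension of the wipe above is to also clear, on each storage node, the host directories the gluster pods mount and the device used for bricks; the paths below are the usual host mounts, not confirmed for this environment:

rm -rf /var/lib/glusterd/* /var/lib/heketi/*
wipefs -a <<cns_block_device>>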

Since these errors have now drifted off the original error, I'm OK with closing this bz.

Comment 12 Thom Carlin 2018-01-23 12:07:40 UTC
Note: https://bugzilla.redhat.com/show_bug.cgi?id=1533582#c9 step C) has been submitted as a future KCS article

