Bug 1717469 - Introspection failed: Could not establish a connection to the Zaqar websocket
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-tripleoclient
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: RHOS Maint
QA Contact: Sasha Smolyak
URL:
Whiteboard:
Depends On: 1719354
Blocks: 1719265
 
Reported: 2019-06-05 14:30 UTC by Filip Hubík
Modified: 2019-07-17 10:58 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1719265 (view as bug list)
Environment:
Last Closed: 2019-07-17 10:58:20 UTC
Target Upstream Version:
Embargoed:


Attachments
stackrc_osp14_zaqarfail (811 bytes, text/plain)
2019-06-06 10:12 UTC, Filip Hubík
instackenv.json_osp14_zaqarfail (1.00 KB, text/plain)
2019-06-06 10:18 UTC, Filip Hubík

Description Filip Hubík 2019-06-05 14:30:33 UTC
Description of problem:

OSP14 deployment steps fail on introspection:

$ source ~/stackrc
$ openstack overcloud node import --instance-boot-option=local /home/stack/instackenv.json
STDERR:
Could not establish a connection to the Zaqar websocket. The command was sent but the answer could not be read.
[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:618)

Version-Release number of selected component (if applicable):
OSP14, puddle 2019-05-15.1, 31.1

How reproducible:
Always

Steps to Reproduce:
1. Deploy OSP14 using InfraRed
2. Introspection fails

Additional info:
It could be related to the issues in https://bugzilla.redhat.com/show_bug.cgi?id=1702918, but we already have the newer python-tripleoclient-10.6.2-0.20190425150604.ba03c5e.el7ost.noarch installed, so I assume this is something new, likely introduced by the latest respin of the containers.

Comment 1 Adriano Petrich 2019-06-05 16:18:39 UTC
That seems odd.

Could you verify that there's an OS_CACERT set in the stackrc?

Something like:
export OS_CACERT="/etc/pki/ca-trust/source/anchors/cm-local-ca.pem"

Do other actions work or is it just the "overcloud node import" that fails?
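
The OS_CACERT question above can be answered with a quick grep. This is only a sketch: the function name check_os_cacert is hypothetical, and the default path ~/stackrc is an assumption taken from the "source ~/stackrc" step in the description.

```shell
# Hypothetical helper: report whether an rc file sets OS_CACERT.
# The default path (~/stackrc) is an assumption based on the report.
check_os_cacert() {
    rcfile="${1:-$HOME/stackrc}"
    # Match "OS_CACERT=..." with or without a leading "export".
    if grep -Eq '^[[:space:]]*(export[[:space:]]+)?OS_CACERT=' "$rcfile"; then
        echo "OS_CACERT is set in $rcfile"
    else
        echo "OS_CACERT is not set in $rcfile"
    fi
}
```

Calling check_os_cacert with no argument inspects ~/stackrc; pass another path to inspect a different rc file.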

Comment 2 Adriano Petrich 2019-06-05 16:21:28 UTC
Also, could you try restarting the mistral and zaqar processes, just to cover the low-hanging fruit?

Comment 3 Filip Hubík 2019-06-06 10:12:16 UTC
Thanks to jschlueter's tip, I was able to work around this specific issue by downgrading to python-websocket-client-0.32.0-116.el7, though the nodes are still not able to reach the manageable state so far. Upgrading back to python-websocket-client-0.56.0-1.git3c25814.el7 reproduces the error.

As for OS_CACERT, it is not set; see the attached stackrc file. Restarting any of the mistral or zaqar containers doesn't have any effect.

As for other commands, I'm not sure what is meant here, but the ironic service seems to respond to "openstack baremetal node xyz" commands in both cases.

Comment 4 Filip Hubík 2019-06-06 10:12:49 UTC
Created attachment 1577836
stackrc_osp14_zaqarfail

Comment 5 Filip Hubík 2019-06-06 10:18:34 UTC
Created attachment 1577837
instackenv.json_osp14_zaqarfail

Comment 6 Adriano Petrich 2019-06-06 10:44:44 UTC
Thank you for the files and comment.

Comment 7 Adriano Petrich 2019-06-06 10:56:23 UTC
To make sense of this: RHEL 7.6 shipped an updated python-websocket-client that is not compatible with our deployment as it stands.

https://bugzilla.redhat.com/show_bug.cgi?id=1702715#c12 has a workaround for the issue until the problem is fixed.

So I'm closing this bug as a duplicate of the main bug.

*** This bug has been marked as a duplicate of bug 1702715 ***

Comment 8 Filip Hubík 2019-06-06 10:56:43 UTC
Note what happens when I try this:

stack@uc $ export OS_CACERT="/etc/pki/ca-trust/source/anchors/cm-local-ca.pem"
stack@uc $ openstack overcloud node import --instance-boot-option=local /home/stack/instackenv.json
Failed to discover available identity versions when contacting https://192.168.24.2:13000/. Attempting to parse version from URL.
Could not find versioned identity endpoints when attempting to authenticate. Please check that your auth_url is correct. SSL exception connecting to https://192.168.24.2:13000/: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:618)

Comment 9 Adriano Petrich 2019-06-06 10:58:34 UTC
What about export WEBSOCKET_CLIENT_CA_BUNDLE="/etc/pki/ca-trust/source/anchors/cm-local-ca.pem"? Could you try that, please?

Comment 10 Adriano Petrich 2019-06-06 10:59:17 UTC
Setting a more appropriate duplicate bug.

*** This bug has been marked as a duplicate of bug 1714205 ***

Comment 11 Adriano Petrich 2019-06-06 11:06:57 UTC
So the workaround works, but you have to point it at the CA bundle installed by the RPM, not at cm-local-ca.pem.

(undercloud) [stack@undercloud-0 ~]$ export WEBSOCKET_CLIENT_CA_BUNDLE=/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
(undercloud) [stack@undercloud-0 ~]$ openstack overcloud node import --instance-boot-option=local /home/stack/instackenv.json
Waiting for messages on queue 'tripleo' with no timeout.
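
To keep the export above across new shells, it can be appended to the rc file once. A minimal sketch, assuming the rc file is ~/stackrc; the function name persist_ws_ca_bundle is hypothetical and not part of any OpenStack tooling.

```shell
# Hypothetical helper: append the export line to an rc file only if
# the exact same line is not already present (safe to run repeatedly).
persist_ws_ca_bundle() {
    bundle="$1"
    rcfile="${2:-$HOME/stackrc}"   # default path is an assumption
    line="export WEBSOCKET_CLIENT_CA_BUNDLE=$bundle"
    # -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF -- "$line" "$rcfile" 2>/dev/null || echo "$line" >> "$rcfile"
}
```

For example, "persist_ws_ca_bundle /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem" followed by "source ~/stackrc" would apply the setting to the current shell as well.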

Comment 12 Filip Hubík 2019-06-07 09:56:33 UTC
Yes, appending "export WEBSOCKET_CLIENT_CA_BUNDLE" to the ~/stackrc file works around the issue and the deployment passes (IR workaround merged in https://review.gerrithub.io/c/redhat-openstack/infrared/+/457097). However, it looks like we are still missing related changes:

1) If this is python-websocket's fault, this BZ should be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1702715
2) If this is tripleo's fault, I assume we should have https://review.opendev.org/#/c/633024 included (Rocky), but so far I don't see these changes on the undercloud node directly. Maybe I am looking in the wrong place?

@Adriano, afaik this cannot be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1714205, since that one is targeted at OSP13; for OSP14 fixes and package tracking we need a separate BZ (this one).

Considering there is no development in 1) (https://bugzilla.redhat.com/show_bug.cgi?id=1702715), which is CLOSED ERRATA, and that we have significant development around 2), I assume this bug should be reopened to track the required fixes specific to OSP14 only.

Comment 20 Carlos Camacho 2019-06-12 11:53:21 UTC
Hi folks, the final issue with this BZ is tracked in [1]. The culprit is the cert bundle checking: the new version dropped the default certs path.

The workaround is to set WEBSOCKET_CLIENT_CA_BUNDLE to the correct bundle path, e.g. WEBSOCKET_CLIENT_CA_BUNDLE=/etc/pki/tls/certs/ca-bundle.crt.
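
Since the workaround depends on the path being right, it may help to check that the bundle file exists before exporting it. This is a sketch only: first_existing_bundle is a hypothetical helper, and the two candidate paths are simply the ones mentioned in this bug, not an exhaustive list.

```shell
# Hypothetical helper: print the first argument that is a readable,
# non-empty file; return non-zero if none of them qualifies.
first_existing_bundle() {
    for candidate in "$@"; do
        if [ -r "$candidate" ] && [ -s "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Candidate paths are the two bundles named in this bug (assumption:
# at least one of them is present on the undercloud).
if bundle=$(first_existing_bundle \
        /etc/pki/tls/certs/ca-bundle.crt \
        /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem); then
    export WEBSOCKET_CLIENT_CA_BUNDLE="$bundle"
else
    echo "no CA bundle found at the expected paths" >&2
fi
```

If neither file is found, the variable is left untouched and a message goes to stderr, so a wrong path is noticed before rerunning the import.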

An async delivery of this fix for several OSP versions is also tracked in [1].


[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1719354

