Bug 1717469

Summary: Introspection failed: Could not establish a connection to the Zaqar websocket
Product: Red Hat OpenStack
Reporter: Filip Hubík <fhubik>
Component: python-tripleoclient
Assignee: RHOS Maint <rhos-maint>
Status: CLOSED CURRENTRELEASE
QA Contact: Sasha Smolyak <ssmolyak>
Severity: high
Docs Contact:
Priority: unspecified
Version: 14.0 (Rocky)
CC: apetrich, bdobreli, beth.white, ccamacho, dvd, hbrock, jslagle, jstransk, mburns, mschuppe, nchandek
Target Milestone: ---
Keywords: Reopened, Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1719265
Environment:
Last Closed: 2019-07-17 10:58:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1719354    
Bug Blocks: 1719265    
Attachments:
Description                        Flags
stackrc_osp14_zaqarfail            none
instackenv.json_osp14_zaqarfail    none

Description Filip Hubík 2019-06-05 14:30:33 UTC
Description of problem:

OSP14 deployment steps fail on introspection:

$ source ~/stackrc
$ openstack overcloud node import --instance-boot-option=local /home/stack/instackenv.json
STDERR:
Could not establish a connection to the Zaqar websocket. The command was sent but the answer could not be read.
[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:618)

Version-Release number of selected component (if applicable):
OSP14, puddle 2019-05-15.1, 31.1

How reproducible:
Always

Steps to Reproduce:
1. Deploy OSP14 using InfraRed
2. Introspection fails

Additional info:
It could be related to the issues in https://bugzilla.redhat.com/show_bug.cgi?id=1702918, but we already have the newer python-tripleoclient-10.6.2-0.20190425150604.ba03c5e.el7ost.noarch installed, so I assume this is something new, likely introduced by a new respin of the containers.

Comment 1 Adriano Petrich 2019-06-05 16:18:39 UTC
That seems odd.

Could you verify that there's an OS_CACERT set in the stackrc?

Something like:
export OS_CACERT="/etc/pki/ca-trust/source/anchors/cm-local-ca.pem"

Do other actions work or is it just the "overcloud node import" that fails?

Comment 2 Adriano Petrich 2019-06-05 16:21:28 UTC
Also could you try restarting the mistral and zaqar processes to cover the low hanging fruit?

Comment 3 Filip Hubík 2019-06-06 10:12:16 UTC
Thanks to jschlueter's tip, I was able to work around this specific issue by downgrading to python-websocket-client-0.32.0-116.el7, though the nodes are still unable to reach the manageable state so far. Upgrading back to python-websocket-client-0.56.0-1.git3c25814.el7 reproduces the error.

As for OS_CACERT, it is not set; see the attached stackrc file. Restarting the mistral or zaqar containers has no effect.

As for other commands, I'm not sure what is meant here, but the ironic service seems to respond to "openstack baremetal node xyz" commands in both cases.

Comment 4 Filip Hubík 2019-06-06 10:12:49 UTC
Created attachment 1577836 [details]
stackrc_osp14_zaqarfail

Comment 5 Filip Hubík 2019-06-06 10:18:34 UTC
Created attachment 1577837 [details]
instackenv.json_osp14_zaqarfail

Comment 6 Adriano Petrich 2019-06-06 10:44:44 UTC
Thank you for the files and comment.

Comment 7 Adriano Petrich 2019-06-06 10:56:23 UTC
To make sense of this: RHEL 7.6 ships an updated python-websocket-client that is not compatible with our deployment as it stands.

https://bugzilla.redhat.com/show_bug.cgi?id=1702715#c12 has a workaround for the issue until the problem is fixed.

So I'm closing this bug as a duplicate of the main bug.

*** This bug has been marked as a duplicate of bug 1702715 ***

Comment 8 Filip Hubík 2019-06-06 10:56:43 UTC
Note that when I try setting OS_CACERT as suggested, the command fails earlier, at authentication:

stack@uc $ export OS_CACERT="/etc/pki/ca-trust/source/anchors/cm-local-ca.pem"
stack@uc $ openstack overcloud node import --instance-boot-option=local /home/stack/instackenv.json
Failed to discover available identity versions when contacting https://192.168.24.2:13000/. Attempting to parse version from URL.
Could not find versioned identity endpoints when attempting to authenticate. Please check that your auth_url is correct. SSL exception connecting to https://192.168.24.2:13000/: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:618)

Comment 9 Adriano Petrich 2019-06-06 10:58:34 UTC
What about export WEBSOCKET_CLIENT_CA_BUNDLE="/etc/pki/ca-trust/source/anchors/cm-local-ca.pem"? Could you try that, please?

Comment 10 Adriano Petrich 2019-06-06 10:59:17 UTC
Setting a more appropriate duplicate bug.

*** This bug has been marked as a duplicate of bug 1714205 ***

Comment 11 Adriano Petrich 2019-06-06 11:06:57 UTC
So the workaround works, but you have to use the rpm cert.

(undercloud) [stack@undercloud-0 ~]$ export WEBSOCKET_CLIENT_CA_BUNDLE=/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem 
(undercloud) [stack@undercloud-0 ~]$ openstack overcloud node import --instance-boot-option=local /home/stack/instackenv.json
Waiting for messages on queue 'tripleo' with no timeout.
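
A minimal sketch of the workaround above, not taken from the original thread: it exports WEBSOCKET_CLIENT_CA_BUNDLE only if a candidate bundle file actually exists. `pick_ca_bundle` is a hypothetical helper name, and the two candidate paths are simply the system bundle paths mentioned in this report; whether either one is the right bundle for a given undercloud is an assumption.

```shell
# Hypothetical helper (not part of tripleoclient or websocket-client):
# print the first argument that is a readable regular file.
pick_ca_bundle() {
    for candidate in "$@"; do
        if [ -f "$candidate" ] && [ -r "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Candidate paths are the ones quoted in this report; adjust as needed.
if bundle=$(pick_ca_bundle \
        /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem \
        /etc/pki/tls/certs/ca-bundle.crt); then
    export WEBSOCKET_CLIENT_CA_BUNDLE="$bundle"
    echo "WEBSOCKET_CLIENT_CA_BUNDLE=$bundle"
else
    echo "No CA bundle found; WEBSOCKET_CLIENT_CA_BUNDLE left unchanged" >&2
fi
```

This only sets the variable for the current shell session, so it would need to run before the "openstack overcloud node import" command in the same session.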

Comment 12 Filip Hubík 2019-06-07 09:56:33 UTC
Yes, the "export WEBSOCKET_CLIENT_CA_BUNDLE" trick (appended to the ~/stackrc file) works around the issue and the deployment passes (InfraRed workaround merged: https://review.gerrithub.io/c/redhat-openstack/infrared/+/457097). However, it looks like we are still missing related changes:

1) If this is python-websocket-client's fault, this BZ should be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1702715
2) If this is tripleo's fault, I assume we should have https://review.opendev.org/#/c/633024 included (Rocky), but so far I don't see these changes on the undercloud node itself. Maybe I am looking in the wrong place?

@Adriano, afaik this cannot be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1714205, since that one targets OSP13. For OSP14 fixes and package tracking we need a separate BZ (this one).

Considering there is no development in 1) (https://bugzilla.redhat.com/show_bug.cgi?id=1702715), which is CLOSED ERRATA, and there is significant development around 2), I assume this BZ should be reopened to track the required fixes specific to OSP14 only.
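
The stackrc step mentioned at the start of this comment could be made repeatable along the following lines. This is only a sketch: `persist_export` is a hypothetical helper, not part of InfraRed or tripleoclient, and the InfraRed change linked above may do this differently.

```shell
# Hypothetical helper: append a line to an rc file only if that exact line
# is not already present, so repeated runs do not duplicate it.
persist_export() {
    rc_file=$1
    line=$2
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF -- "$line" "$rc_file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$rc_file"
    fi
}

# Example call (commented out so that nothing is modified by accident):
# persist_export ~/stackrc 'export WEBSOCKET_CLIENT_CA_BUNDLE=/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem'
```

After the line has been added, "source ~/stackrc" picks up the variable in a new session.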

Comment 20 Carlos Camacho 2019-06-12 11:53:21 UTC
Hi folks, the final issue with this BZ is tracked in [1]. The culprit is the cert bundle checking: the new version dropped the default certs path.

The workaround is to set WEBSOCKET_CLIENT_CA_BUNDLE to the correct path, e.g. WEBSOCKET_CLIENT_CA_BUNDLE=/etc/pki/tls/certs/ca-bundle.crt.

An async delivery of this fix for several OSP versions is also tracked in [1].


[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1719354