Bug 1945760 - [OSP16.2] TLS-e live volume back migration failure, with error 'blockdev-add Cannot Read from TLS channel'
Summary: [OSP16.2] TLS-e live volume back migration failure, with error 'blockdev-add ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: beta
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: David Vallee Delisle
QA Contact: James Parker
URL:
Whiteboard:
Depends On: 1965124
Blocks:
Reported: 2021-04-01 20:29 UTC by James Parker
Modified: 2021-09-15 07:14 UTC

Fixed In Version: openstack-tripleo-heat-templates-11.5.1-2.20210430004816.cbef0f2.el8ost puppet-nova-15.7.1-2.20210423004733.43cd2b4.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-15 07:13:42 UTC
Target Upstream Version: Train
Embargoed:


Attachments
Source Migration libvirtd logs with debug level enabled (771.42 KB, text/plain)
2021-04-08 00:00 UTC, James Parker
Destination target node's libvirtd logs with debug level enabled (623.18 KB, text/plain)
2021-04-08 00:00 UTC, James Parker


Links
System | ID | Private | Priority | Status | Summary | Last Updated
OpenStack gerrit | 787249 | 0 | None | MERGED | Introducing default_tls_verify | 2021-04-26 08:46:54 UTC
OpenStack gerrit | 787979 | 0 | None | MERGED | Missing client certificate for live-migration with TLS | 2021-04-29 08:48:29 UTC
OpenStack gerrit | 788242 | 0 | None | MERGED | [train-only] QemuDefaultTLSVerify should be false | 2021-04-30 06:58:21 UTC
Red Hat Product Errata | RHEA-2021:3483 | 0 | None | None | None | 2021-09-15 07:14:03 UTC

Comment 3 James Parker 2021-04-08 00:00:03 UTC
Created attachment 1770070
Source Migration libvirtd logs with debug level enabled

Comment 4 James Parker 2021-04-08 00:00:38 UTC
Created attachment 1770071
Destination target node's libvirtd logs with debug level enabled

Comment 6 Kashyap Chamarthy 2021-04-08 12:37:53 UTC
Something is off in the TLS setup.  Can you please test with the following?

On both of your compute nodes, the following config attributes are
missing from /etc/libvirt/qemu.conf:

    default_tls_x509_cert_dir = "/etc/pki/qemu"
    default_tls_x509_verify = 1

(See: https://docs.openstack.org/nova/latest/admin/secure-live-migration-with-qemu-native-tls.html)

And this is what you have on your source and destination 'nova_libvirt'
containers:

    [root@compute-1 /]#  egrep -v '^$|^#' /etc/libvirt/qemu.conf
    max_files = 32768
    max_processes = 131072
    vnc_tls = 1
    vnc_tls_x509_verify = 1
    nbd_tls = 1
    migration_port_min = 61152
    migration_port_max = 61215

Note the missing default_tls_* attributes.
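
The check above can be sketched as a small script. This is only an illustration: it assumes the simple `key = value` lines shown in the listing, and the names `FLAGGED_KEYS`, `parse_qemu_conf` and `missing_tls_defaults` are made up for this example, not part of libvirt or TripleO.

```python
# Hypothetical helper: report which of the attributes flagged in this
# comment are absent from a qemu.conf-style text.
FLAGGED_KEYS = ("default_tls_x509_cert_dir", "default_tls_x509_verify")


def parse_qemu_conf(text):
    """Return a dict of the active (non-comment, non-blank) settings."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key = value line; ignore it
        settings[key.strip()] = value.strip().strip('"')
    return settings


def missing_tls_defaults(text):
    """List the flagged attributes that the config does not set."""
    settings = parse_qemu_conf(text)
    return [key for key in FLAGGED_KEYS if key not in settings]


# The active settings from the listing above.
OBSERVED = """\
max_files = 32768
max_processes = 131072
vnc_tls = 1
vnc_tls_x509_verify = 1
nbd_tls = 1
migration_port_min = 61152
migration_port_max = 61215
"""

print(missing_tls_defaults(OBSERVED))
```

Run against the listing above, it should name both default_tls_* attributes as missing.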

                - - -

Also, running `virt-pki-validate` fails on both source and destination
'nova_libvirt' containers:

    [root@compute-0 qemu]# virt-pki-validate
    Found /usr/bin/certtool
    Found CA certificate /etc/pki/CA/cacert.pem for Certificate Authority
    The CA certificate and the client certificate do not match
    CA organization: Certificate Authority
    Client organization: REDHAT.LOCAL
    Found client certificate /etc/pki/libvirt/clientcert.pem for compute-0.ctlplane.redhat.local
    Found client private key /etc/pki/libvirt/private/clientkey.pem
    The client private key need to be read by client tools
    as root do: chmod 644 /etc/pki/libvirt/private/clientkey.pem
    The CA certificate and the server certificate do not match
    CA organization: Certificate Authority
    Server organization: REDHAT.LOCAL
    The server certificate does not seem to match the host name
    hostname: "compute-0.redhat.local"
    Server certificate CN: "compute-0.ctlplane.redhat.local"
    Found server certificate /etc/pki/libvirt/servercert.pem for compute-0.ctlplane.redhat.local
    Found server private key /etc/pki/libvirt/private/serverkey.pem
    Make sure /etc/sysconfig/libvirtd is setup to listen to
    TCP/IP connections and restart the libvirtd service
    Make sure /etc/sysconfig/iptables is setup to allow
    incoming TCP/IP connections on port 16514 and
    restart the iptables service
    [root@compute-0 qemu]#

The other compute node shows similar output.

Comment 7 Kashyap Chamarthy 2021-04-08 12:42:12 UTC
@David: Please see my comment#6; it seems TripleO is not setting the required config attribute 'default_tls_x509_cert_dir'?

Comment 9 Kashyap Chamarthy 2021-04-08 14:34:19 UTC
Dan, any thoughts on this 'blockdev-add' failure that seems to be coming
from QEMU's I/O channels TLS driver?

Context: The failing OpenStack test live-migrates an instance with an
attached disk on non-shared storage, with TLS enabled (possibly
misconfigured).  So NBD is involved here.

And here's how 'blockdev-add' is erroring out:
-----------------------------------------------------------------------
...
2021-04-07 23:43:51.427+0000: 23837: debug : qemuMonitorJSONCheckErrorFull:404 : unable to execute QEMU command {"execute":"blockdev-add","arguments":{"driver":"nbd","server":{"type":"inet","host":"compute-0.ctlplane.redhat.local","port":"61153"},"export":"drive-virtio-disk0","tls-creds":"objlibvirt_migrate_tls0","node-name":"migration-vda-storage","read-only":false,"discard":"unmap"},"id":"libvirt-388"}: {"id":"libvirt-388","error":{"class":"GenericError","desc":"Failed to read option reply: Cannot read from TLS channel: Software caused connection abort"}}

2021-04-07 23:43:51.427+0000: 23837: error : qemuMonitorJSONCheckErrorFull:418 : internal error: unable to execute QEMU command 'blockdev-add': Failed to read option reply: Cannot read from TLS channel: Software caused connection abort
...
-----------------------------------------------------------------------
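
For anyone sifting through the attached debug logs, here is a rough sketch of pulling the QMP command and its error reply out of such a line. It assumes the "unable to execute QEMU command <command JSON>: <reply JSON>" layout seen above; the helper name `split_qmp_failure` is hypothetical, and the sample line is abridged from the log excerpt (some arguments dropped).

```python
import json

MARKER = "unable to execute QEMU command "


def split_qmp_failure(line):
    """Return (command, reply) dicts parsed from one libvirtd debug line."""
    decoder = json.JSONDecoder()
    start = line.index(MARKER) + len(MARKER)
    # raw_decode parses one JSON value and reports where it ended,
    # which lets us find the reply object that follows the command.
    command, end = decoder.raw_decode(line, start)
    reply, _ = decoder.raw_decode(line, line.index("{", end))
    return command, reply


# Abridged copy of the debug line quoted above.
LINE = (
    "2021-04-07 23:43:51.427+0000: 23837: debug : "
    "qemuMonitorJSONCheckErrorFull:404 : unable to execute QEMU command "
    '{"execute":"blockdev-add","arguments":{"driver":"nbd",'
    '"export":"drive-virtio-disk0","tls-creds":"objlibvirt_migrate_tls0"},'
    '"id":"libvirt-388"}: '
    '{"id":"libvirt-388","error":{"class":"GenericError",'
    '"desc":"Failed to read option reply: Cannot read from TLS channel: '
    'Software caused connection abort"}}'
)

command, reply = split_qmp_failure(LINE)
print(command["execute"], "->", reply["error"]["desc"])
```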

I see the "check TLS authorization" part from the QEMU I/O test
233.out matches the above signature:

    https://git.qemu.org/gitweb.cgi?p=qemu.git;a=blob;f=tests/qemu-iotests/233.out#l61

Based on that, I deduce that the TLS setup in this QE env is
broken.  The `virt-pki-validate` output in comment#6 seems to
indicate that too.

            - - -

Meanwhile, here are the config settings in qemu.conf on both source and
destination compute nodes:

    $> egrep -v '^$|^#' /etc/libvirt/qemu.conf
    max_files = 32768
    max_processes = 131072
    vnc_tls = 1
    vnc_tls_x509_verify = 1
    nbd_tls = 1
    migration_port_min = 61152
    migration_port_max = 61215
    # these two were added in a subsequent test; but even without these
    # the migration fails the same
    default_tls_x509_cert_dir = "/etc/pki/qemu"
    default_tls_x509_verify = 1


And libvirtd.conf from both source and destination:

    $> egrep -v '^$|^#' /etc/libvirt/libvirtd.conf
    listen_tls=1
    listen_tcp=0
    listen_addr="192.168.24.37"
    unix_sock_group="libvirt"
    unix_sock_ro_perms="0777"
    unix_sock_rw_perms="0770"
    auth_unix_ro="none"
    auth_unix_rw="none"
    auth_tls="sasl"
    tls_priority="NORMAL:-VERS-SSL3.0:-VERS-TLS-ALL:+VERS-TLS1.2"

Comment 10 Daniel Berrangé 2021-04-08 14:49:59 UTC
Note virt-pki-validate is validating *libvirt's* TLS setup.

The problem here is with *QEMU's* TLS setup, i.e. the files in /etc/pki/qemu.

There are only ca-cert.pem and server-cert.pem in /etc/pki/qemu.

The QMP command shown here is an NBD client connection failing.

I can't see how the NBD server is configured, but if the server is attempting to do client certificate validation, then it will fail, because there's no client-cert.pem for the client to send.

This is a plausible reason why you'd see this error message on the client, as the server will ungracefully drop the connection after the TLS handshake when it finds no client cert was sent.
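
A minimal sketch of that reasoning, assuming the usual file naming for a QEMU x509 credentials directory (ca-cert.pem plus a cert/key pair per role). The function name `missing_files` and the role tuples are illustrative only, and which files a given deployment actually needs depends on how verification is configured.

```python
# Assumed file names for a QEMU x509 credentials directory, by role.
CLIENT_FILES = ("ca-cert.pem", "client-cert.pem", "client-key.pem")
SERVER_FILES = ("ca-cert.pem", "server-cert.pem", "server-key.pem")


def missing_files(present, role):
    """Return the files the given role would need that are not present."""
    wanted = CLIENT_FILES if role == "client" else SERVER_FILES
    return [name for name in wanted if name not in present]


# Directory contents as described in this comment.
observed = {"ca-cert.pem", "server-cert.pem"}

# The NBD connection in the failing QMP command acts as a TLS client.
print(missing_files(observed, "client"))
```

With only the two files listed in this comment, the client role comes back without its certificate and key, which is consistent with a server that verifies client certificates dropping the connection.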

Comment 27 errata-xmlrpc 2021-09-15 07:13:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483

