Description of problem:
With TLS Everywhere and Instance HA, /var/lib/nova/instanceha/check-run-nova-compute fails with a stack trace due to a missing ca-cert.pem. Unless it's a misconfiguration in the customer templates, we might be missing something like this [1].

++ cat /run_command
+ CMD='/var/lib/nova/instanceha/check-run-nova-compute '
+ ARGS=
+ [[ ! -n '' ]]
+ . kolla_extend_start
++ [[ ! -d /var/log/kolla/nova ]]
+++ stat -c %a /var/log/kolla/nova
++ [[ 2755 != \7\5\5 ]]
++ chmod 755 /var/log/kolla/nova
++ . /usr/local/bin/kolla_nova_extend_start
+++ [[ ! -d /var/lib/nova/instances ]]
+ echo 'Running command: '\''/var/lib/nova/instanceha/check-run-nova-compute '\'''
+ exec /var/lib/nova/instanceha/check-run-nova-compute
Running command: '/var/lib/nova/instanceha/check-run-nova-compute '
Checking 439 migrations
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 457, in fire_timers
    timer()
  File "/usr/lib/python2.7/site-packages/eventlet/hubs/timer.py", line 58, in __call__
    cb(*args, **kw)
  File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 168, in _do_send
    waiter.switch(result)
  File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
    result = function(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/nova/utils.py", line 906, in context_wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7139, in _live_migration_operation
    LOG.error("Live Migration failure: %s", e, instance=instance)
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7133, in _live_migration_operation
    bandwidth=CONF.libvirt.live_migration_bandwidth)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 674, in migrate
    destination, params=params, flags=flags)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 186, in doit
    result = proxy_call(self._autowrap, f, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 144, in proxy_call
    rv = execute(f, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 125, in execute
    six.reraise(c, e, tb)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker
    rv = meth(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1779, in migrateToURI3
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: internal error: unable to execute QEMU command 'object-add': Unable to access credentials /etc/pki/qemu/ca-cert.pem: No such file or directory

[1] https://review.opendev.org/758572

Version-Release number of selected component (if applicable):
Latest

How reproducible:
This environment

Steps to Reproduce:
1. Do a minor update and try to live-migrate VMs to the newly updated computes.

Actual results:
nova_compute won't come up to a healthy state

Expected results:
It should come up to a healthy state

Additional info:
Oct 16 11:34:58 overcloud-compute-0.localdomain dockerd-current[433488]: 2020-10-16 11:34:58.124+0000: 560626: info : libvirt version: 4.5.0, package: 33.el7_8.1 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2020-03-23-10:21:01, x86-vm-28.build.eng.bos.redhat.com)
Oct 16 11:34:58 overcloud-compute-0.localdomain dockerd-current[433488]: 2020-10-16 11:34:58.124+0000: 560626: info : hostname: overcloud-compute-0.localdomain
Oct 16 11:34:58 overcloud-compute-0.localdomain dockerd-current[433488]: 2020-10-16 11:34:58.124+0000: 560626: debug : virLogParseOutputs:1760 : outputs=3:stderr
Oct 16 11:34:58 overcloud-compute-0.localdomain dockerd-current[433488]: 2020-10-16 11:34:58.124+0000: 560626: debug : virLogParseOutput:1588 : output=3:stderr
Oct 16 11:36:46 overcloud-compute-0.localdomain dockerd-current[433488]: 2020-10-16 11:36:46.624+0000: 560645: info : libvirt version: 4.5.0, package: 33.el7_8.1 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2020-03-23-10:21:01, x86-vm-28.build.eng.bos.redhat.com)
Oct 16 11:36:46 overcloud-compute-0.localdomain dockerd-current[433488]: 2020-10-16 11:36:46.624+0000: 560645: info : hostname: overcloud-compute-0.localdomain
Oct 16 11:36:46 overcloud-compute-0.localdomain dockerd-current[433488]: 2020-10-16 11:36:46.624+0000: 560645: error : qemuMonitorJSONCheckError:396 : internal error: unable to execute QEMU command 'object-del': object 'objlibvirt_migrate_tls0' not found
Oct 16 11:36:46 overcloud-compute-0.localdomain dockerd-current[433488]: 2020-10-16 11:36:46.626+0000: 560645: error : qemuMonitorJSONCheckError:396 : internal error: unable to execute QEMU command 'object-add': Unable to access credentials /etc/pki/qemu/ca-cert.pem: No such file or directory
Oct 16 11:36:46 overcloud-compute-0.localdomain dockerd-current[433488]: 2020-10-16 11:36:46.867+0000: 560645: error : virNetClientProgramDispatchError:174 : migration successfully aborted
Oct 16 11:36:46 overcloud-compute-0.localdomain dockerd-current[433488]: 2020-10-16 11:36:46.873+0000: 560645: error : qemuMonitorJSONCheckError:396 : internal error: unable to execute QEMU command 'object-del': object 'objlibvirt_migrate_tls0' not found
Oct 16 11:36:46 overcloud-compute-0.localdomain dockerd-current[433488]: 2020-10-16 11:36:46.874+0000: 560645: error : qemuMonitorJSONCheckError:396 : internal error: unable to execute QEMU command 'object-del': object 'libvirt_migrate-secret0' not found
Adding Martin Schuppert, author of the original upstream patch, for his comment.

Martin, does Dave's patch make sense here?
(In reply to Ade Lee from comment #9)
> Adding Martin Schuppert -- author of original patch upstream for his
> comment.
>
> Martin,
>
> Does Dave's patch make sense here?

I am not convinced it is the fix for the issue, as the libvirt container is the one talking to the qemu process running the instance, and that is where we should have the cert [1]. We will hopefully have a reproducer env soon to investigate.

Cheers,
Martin

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/nova/nova-libvirt-container-puppet.yaml#L781
We have reproduced the issue and it is now understood. The generation and configuration of the certificates is correct, which we can confirm because migration of instances newly created after the update works.

The issue is that the certificates for the TLS NBD block migration get created during the update. They did not exist in the libvirtd container when the existing instances were created. When the libvirt container is created, the certificates get merged into the container directory tree using the kolla_config mechanism. They are not a direct bind mount from the host. Therefore the qemu processes of the existing instances don't have that information and the nbd setup process fails with the error seen above, which we can also confirm by running strace on a qemu process of an instance created before the update during a live migration:

116406 stat("/etc/pki/qemu/ca-cert.pem", 0x7fff6f4ec390) = -1 ENOENT (No such file or directory)
116406 sendmsg(25, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="{\"id\": \"libvirt-2611\", \"error\": {\"class\": \"GenericError\", \"desc\": \"Unable to access credentials /etc/pki/qemu/ca-cert.pem: No such file or directory\"}}\r\n", iov_len=153}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 153

The immediate solution is to run an overcloud deploy and specify not to use TLS transport for nbd, which results in the same configuration as before the minor update:

parameter_defaults:
  UseTLSTransportForNbd: False

For a transition to "UseTLSTransportForNbd: True", we need to enhance the THT to support the following transition path:

1) Create the required nbd certificates also with "UseTLSTransportForNbd: False", or use bind mounts for the cert directories instead of merging them into the directory tree on container create. This would also have the benefit that no action is required when the nbd certs change.
2) Migrate all instances once, so that each qemu process runs in an environment which has all the certificate information.
3) Enable "UseTLSTransportForNbd: True" for the overcloud deployment.

After that, all instances have the required information to do live migration with "UseTLSTransportForNbd: True".
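To find which running instances are affected before attempting a live migration, something like the following might help. It is only a rough sketch, not part of any patch: it looks at each process through /proc/<pid>/root, which resolves paths the way that process sees them, and reports the ones where the file from the error message is absent. The function name, the "qemu" process-name prefix and the proc_root parameter are assumptions made for illustration; it needs to run as root on the compute host, and processes it cannot inspect are skipped rather than reported.

```python
import os

# Path taken from the error message in this bug.
CA_CERT = "/etc/pki/qemu/ca-cert.pem"


def processes_missing_file(path=CA_CERT, name_prefix="qemu", proc_root="/proc"):
    """Return sorted PIDs of matching processes that cannot see `path`.

    name_prefix is an assumption about how the qemu processes are named;
    proc_root is only a parameter so the function can be tried on a fake tree.
    """
    missing = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        pid_dir = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(pid_dir, "comm")) as fh:
                comm = fh.read().strip()
        except OSError:
            continue  # process exited, or not readable
        if not comm.startswith(name_prefix):
            continue
        # /proc/<pid>/root/<path> is <path> as seen by the process itself.
        seen = os.path.join(pid_dir, "root", path.lstrip("/"))
        try:
            os.stat(seen)
        except FileNotFoundError:
            missing.append(int(entry))
        except OSError:
            continue  # e.g. permission denied: unknown, do not report
    return sorted(missing)


if __name__ == "__main__":
    for pid in processes_missing_file():
        print("PID %d does not see %s" % (pid, CA_CERT))
```

Any PID it prints belongs to a process that would be expected to hit the same ENOENT as in the strace output, so the corresponding instance would need a fresh qemu process before it can use TLS for nbd.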
For completeness:

* The transition procedure mentioned above, which involves migrating all instances to get a fresh qemu process that has the required certificates in its tree, is not required if the instances can be switched off and on again.
* With 'UseTLSTransportForNbd: False', which sets live_migration_with_native_tls back to false, libvirtd still uses TLS encryption. It is the block migration stream which is not encrypted, as in previous releases.
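To confirm which mode a compute node ended up in after the deploy, the option can be read back from the nova configuration. A minimal sketch, assuming the option lives in the [libvirt] section of an INI-style nova.conf and that an absent option means false; the function name is made up for this example, and where nova.conf is located in a containerized deployment is left to the reader:

```python
import configparser


def native_tls_enabled(conf_text):
    """Return True if [libvirt]/live_migration_with_native_tls is set to true.

    strict=False tolerates repeated options; interpolation=None keeps
    literal '%' characters in other values from raising errors.
    """
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    return parser.getboolean(
        "libvirt", "live_migration_with_native_tls", fallback=False
    )


if __name__ == "__main__":
    sample = "[libvirt]\nlive_migration_with_native_tls = False\n"
    print(native_tls_enabled(sample))
```

With 'UseTLSTransportForNbd: False' applied, this would be expected to report false for the generated nova.conf.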
*** Bug 1883745 has been marked as a duplicate of this bug. ***
Patches are in progress. Until they land and we get them into a release, we have changed the default for UseTLSTransportForNbd to False with https://bugzilla.redhat.com/show_bug.cgi?id=1894892
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 13.0 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0932