Bug 1891014 - [Doc text] With TLS Everywhere live migration fails for existing instances due to missing ca-cert.pem
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z13
Target Release: 13.0 (Queens)
Assignee: RHOS Documentation Team
QA Contact: RHOS Documentation Team
URL:
Whiteboard:
Depends On: 1888951 1893113
Blocks:
Reported: 2020-10-23 14:30 UTC by Martin Schuppert
Modified: 2021-01-04 14:10 UTC (History)

Fixed In Version:
Doc Type: Known Issue
Doc Text:
There is currently a known issue with TLS Everywhere environments when live migrating instances during a minor update. With the introduction of support for full QEMU-native TLS encryption when live migrating (BZ1754791), instance live migration fails when performing a minor update on a RHOSP deployment that has running instances.

This is because the certificates for TLS NBD block migration, which do not already exist in the libvirtd container, are created during the update. The certificates are merged into the container directory tree during creation of the libvirt container, instead of being directly bind mounted from the host. Therefore, the QEMU processes of the instances that need to be migrated during the update do not get the new certificates automatically, and the NBD setup process fails with the following error:

    libvirtError: internal error: unable to execute QEMU command 'object-add': Unable to access credentials /etc/pki/qemu/ca-cert.pem: No such file or directory

Live migration works for instances created after the update.

Workaround: You can use one of the following options to work around this issue:

* Stop and start the instances that fail to live migrate after the update is complete, so that new QEMU processes are created by the libvirt container that has the certificate details.
* Add the following configuration to the overcloud to disable TLS transport encryption for NBD, and deploy the overcloud:

    parameter_defaults:
        UseTLSTransportForNbd: False
Clone Of: 1888951
Environment:
Last Closed: 2021-01-04 14:10:15 UTC
Target Upstream Version:
Embargoed:



Description Martin Schuppert 2020-10-23 14:30:52 UTC
+++ This bug was initially created as a clone of Bug #1888951 +++

Description of problem:
With TLS Everywhere and Instance HA, /var/lib/nova/instanceha/check-run-nova-compute fails with a stack trace due to a missing ca-cert.pem. Unless it is a misconfiguration in the customer templates, we might be missing something like this [1].

++ cat /run_command
+ CMD='/var/lib/nova/instanceha/check-run-nova-compute '
+ ARGS=
+ [[ ! -n '' ]]
+ . kolla_extend_start
++ [[ ! -d /var/log/kolla/nova ]]
+++ stat -c %a /var/log/kolla/nova
++ [[ 2755 != \7\5\5 ]]
++ chmod 755 /var/log/kolla/nova
++ . /usr/local/bin/kolla_nova_extend_start
+++ [[ ! -d /var/lib/nova/instances ]]
+ echo 'Running command: '\''/var/lib/nova/instanceha/check-run-nova-compute '\'''
+ exec /var/lib/nova/instanceha/check-run-nova-compute
Running command: '/var/lib/nova/instanceha/check-run-nova-compute '
Checking 439 migrations
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 457, in fire_timers
    timer()
  File "/usr/lib/python2.7/site-packages/eventlet/hubs/timer.py", line 58, in __call__
    cb(*args, **kw)
  File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 168, in _do_send
    waiter.switch(result)
  File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
    result = function(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/nova/utils.py", line 906, in context_wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7139, in _live_migration_operation
    LOG.error("Live Migration failure: %s", e, instance=instance)
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7133, in _live_migration_operation
    bandwidth=CONF.libvirt.live_migration_bandwidth)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 674, in migrate
    destination, params=params, flags=flags)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 186, in doit
    result = proxy_call(self._autowrap, f, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 144, in proxy_call
    rv = execute(f, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 125, in execute
    six.reraise(c, e, tb)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker
    rv = meth(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1779, in migrateToURI3
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: internal error: unable to execute QEMU command 'object-add': Unable to access credentials /etc/pki/qemu/ca-cert.pem: No such file or directory

[1] https://review.opendev.org/758572
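
When triaging logs for this failure, it can help to pull the missing credentials path out of the libvirt error text. The following is a minimal sketch, assuming the message has the same wording as the last line of the traceback above; the helper name missing_credentials_path is hypothetical and is not part of nova or libvirt.

```python
import re

# Matches the QEMU credentials error quoted in the traceback above.
# Assumption: other releases may word this message differently.
_MISSING_CRED = re.compile(
    r"Unable to access credentials (?P<path>\S+): No such file or directory"
)


def missing_credentials_path(message):
    """Return the credentials path named in the error text, or None."""
    match = _MISSING_CRED.search(message)
    return match.group("path") if match else None


if __name__ == "__main__":
    sample = (
        "internal error: unable to execute QEMU command 'object-add': "
        "Unable to access credentials /etc/pki/qemu/ca-cert.pem: "
        "No such file or directory"
    )
    print(missing_credentials_path(sample))
```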


Version-Release number of selected component (if applicable):
Latest

How reproducible:
This environment

Steps to Reproduce:
1. Do a minor update and try to live-migrate VMs to the newly updated computes.

Actual results:
nova_compute won't come up to a healthy state

Expected results:
It should come up to a healthy state

Additional info:

--- Additional comment from Martin Schuppert on 2020-10-22 08:29:21 UTC ---

We have reproduced the issue and it is now understood. The generation and configuration of the certificates is correct,
which we can confirm because migration of instances newly created after the update works.

The issue is that the certificates for the TLS NBD block migration get created during the update. They did not exist
in the libvirtd container when the existing instances were created. When the libvirt container is created, the certificates
get merged into the container directory tree using the kolla_config mechanism. They are not a direct bind mount from
the host. Therefore the qemu processes of the existing instances don't have that information and the nbd setup process
fails with the error seen above, which we can also confirm by running strace on the qemu process of an instance created
before the update during a live migration:

116406 stat("/etc/pki/qemu/ca-cert.pem", 0x7fff6f4ec390) = -1 ENOENT (No such file or directory)
116406 sendmsg(25, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="{\"id\": \"libvirt-2611\", \"error\": {\"class\": \"GenericError\", \"desc\": \"Unable to access credentials /etc/pki/qemu/ca-cert.pem: No such file or directory\"}}\r\n", iov_len=153}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 153
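
The same check can be approximated without strace by looking at the file through the process's own view of the filesystem. This is a rough sketch, assuming a Linux host where /proc/<pid>/root is readable (usually this needs root); the function name visible_to_process and the proc_root parameter are hypothetical and exist only for illustration.

```python
import os


def visible_to_process(pid, path, proc_root="/proc"):
    """Check whether `path` exists in the filesystem view of process `pid`.

    /proc/<pid>/root points at that process's root directory, so joining it
    with the path looks the file up as the process itself would see it.
    Note: os.path.exists also returns False when access is denied, so a
    False result without sufficient privileges is not conclusive.
    """
    candidate = os.path.join(proc_root, str(pid), "root", path.lstrip("/"))
    return os.path.exists(candidate)
```

For example, visible_to_process(116406, "/etc/pki/qemu/ca-cert.pem") would be expected to return False for a qemu process like the one in the strace output above, and True for the qemu process of an instance started after the update.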

The immediate solution is to run an overcloud deploy and specify not to use TLS transport for nbd, which restores
the same configuration as before the minor update:

parameter_defaults:
    UseTLSTransportForNbd: False

For a transition to "UseTLSTransportForNbd: True", we need to enhance the THT to support the following transition path:
1) create the required nbd certificates also with "UseTLSTransportForNbd: False", or use bind mounts for the
   cert directories instead of merging them into the directory tree on container create. This would also have
   the benefit that there is no action required when the nbd certs change.
2) all instances need to be migrated once, so that each qemu process runs in an environment which has all the
   certificate information
3) enable "UseTLSTransportForNbd: True" for the overcloud deployment

After that all instances have the required information to do live migration with "UseTLSTransportForNbd: True".
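
Step 2 of the transition path comes down to finding the instances whose qemu process predates the libvirt container that carries the certificates. The following is an illustrative sketch only: the instance names and timestamps are made up, and how the start times are collected is left open.

```python
from datetime import datetime


def instances_needing_fresh_qemu(qemu_start_times, container_recreated):
    """Return names of instances whose qemu process started before the
    libvirt container was recreated with the nbd certificates.

    qemu_start_times: dict mapping instance name -> datetime of qemu start
    container_recreated: datetime when the new libvirt container was created
    """
    return sorted(
        name
        for name, started in qemu_start_times.items()
        if started < container_recreated
    )


if __name__ == "__main__":
    # Hypothetical data for illustration.
    recreated = datetime(2020, 10, 20, 12, 0)
    starts = {
        "vm-old": datetime(2020, 9, 1, 8, 30),
        "vm-new": datetime(2020, 10, 21, 9, 0),
    }
    print(instances_needing_fresh_qemu(starts, recreated))
```

Instances in the returned list would need either one migration (once the certificates are visible to their qemu processes) or a stop and start before "UseTLSTransportForNbd: True" is enabled.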

--- Additional comment from Martin Schuppert on 2020-10-22 10:49:44 UTC ---

For completeness:

* the above-mentioned transition procedure, which involves migrating all instances to get a fresh
  qemu process which has the required certificates in its tree, is not required if the instances
  can be switched off and on again.

* with 'UseTLSTransportForNbd: False', which sets live_migration_with_native_tls back to false,
  libvirtd still uses tls encryption. It is the block migration stream which is not encrypted,
  like in previous releases.
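
To confirm which mode a compute node ended up in after the deploy, the resulting nova.conf can be inspected. This is a minimal sketch, assuming the option lives in the [libvirt] section and that the file content parses with Python's configparser (real nova.conf files can contain constructs it does not handle); treating an unset option as false is also an assumption, and the function name native_tls_enabled is hypothetical.

```python
import configparser


def native_tls_enabled(nova_conf_text):
    """Return True if [libvirt] live_migration_with_native_tls is enabled.

    Assumption: an unset option is treated as False.
    """
    # strict=False tolerates repeated options; interpolation=None keeps
    # literal '%' characters in other values from raising errors.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(nova_conf_text)
    return parser.getboolean(
        "libvirt", "live_migration_with_native_tls", fallback=False
    )


if __name__ == "__main__":
    sample = "[libvirt]\nlive_migration_with_native_tls = false\n"
    print(native_tls_enabled(sample))
```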

--- Additional comment from Martin Schuppert on 2020-10-23 07:33:42 UTC ---

Comment 5 Irina 2021-01-04 14:10:15 UTC
Known issue release note included in 13z13 release notes, available on the Customer Portal:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/release_notes/index#known_issues_10

