Bug 1963164

Summary: [update] 16.1 to 16.2 update fails with ping loss because VMs are stopped when libvirtd restarts.
Product: Red Hat OpenStack Reporter: Sofer Athlan-Guyot <sathlang>
Component: openstack-tripleo-heat-templates    Assignee: Sofer Athlan-Guyot <sathlang>
Status: CLOSED ERRATA QA Contact: Jason Grosso <jgrosso>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 16.2 (Train)    CC: apevec, bdobreli, berrange, ccamposr, chhu, dasmith, eglynn, jfrancoa, jgrosso, jhakimra, jlibosva, jpretori, jsuchane, kchamart, laine, lbezdick, lhh, lmen, lyarwood, majopela, mburns, mprivozn, omcgonag, phrdina, sbauza, scohen, sgordon, shrjoshi, smooney, spower, vgrosu, virt-maint, vromanso, xuzhang
Target Milestone: rc    Keywords: Triaged, UpgradeBlocker
Target Release: 16.2 (Train on RHEL 8.4)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-11.5.1-2.20210603174819.el8ost.6
Last Closed: 2021-09-15 07:15:18 UTC Type: Bug

Description Sofer Athlan-Guyot 2021-05-21 15:40:54 UTC
Description of problem:

During an update from osp16.1 to osp16.2 we experience some ping loss.

This happened during (or just before?) the compute update, after the controller and cephstorage nodes were updated (update run).

The problem is that we cannot see the lost pings in the ping output:

Result of ping loss is:

--- 10.0.0.213 ping statistics ---
1021 packets transmitted, 900 received, 11.8511% packet loss, time 1039194ms
rtt min/avg/max/mdev = 0.383/1.101/10.664/1.108 ms

But in the log file we have a perfect ICMP sequence from 1 to 900 with no errors, so I'm unsure where the pings were actually lost.

Here we are talking about 121 lost packets, so roughly a 2-minute cut, even if I don't know exactly when it happened.

The first entry in the ping log file is at 2021-05-19 16:49:19 and the last one is at 2021-05-19 17:04:35.

The Compute update started at 2021-05-19 16:49:22 and ended at 2021-05-19 17:06:38.
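
For future runs it would help to capture timestamps directly in the ping output, so the gap can be located without cross-referencing the update log. A minimal sketch (the target is the address from the statistics above; the log path is just an example):

    # -D prefixes every reply with a Unix timestamp (iputils ping on Linux),
    # so a ~2 minute hole shows up as a jump between consecutive lines
    ping -D -i 1 10.0.0.213 | tee /tmp/ping-10.0.0.213.log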

Comment 6 Artom Lifshitz 2021-05-26 15:11:08 UTC
// Quick notes from looking at this:

Right after OVN releases the port, nova gets the vif unplugged event:

2021-05-19 17:04:38.490 7 DEBUG nova.compute.manager [req-86a0a2cf-262d-4fec-a692-e3fcdc26cd2a eceec1b17bba449ca7ce1f2c9d12fd6f 86fef77304cd43bbb7bfd69ae7d02034 - default default] [instance: d4d21533-a246-42d6-aa3c-88eac7091d63] Received event network-vif-unplugged-f1be45c9-4c8f-4546-8689-2a5d5a961f0a external_instance_event /usr/lib/python3.6/site-packages/nova/compute/manager.py:9310

Looking at what happened before that, we see:

Looks like nova-compute loses the connection to libvirt right before the port gets unplugged:

2021-05-19 17:04:31.414 7 INFO nova.virt.libvirt.driver [req-e3e24a5f-0ddb-46d2-b7a0-a034eb193133 - - - - -] Connection event '0' reason 'Connection to libvirt lost: 1'

nova-compute also decides that it's disabled:

2021-05-19 17:04:31.509 7 DEBUG nova.compute.manager [req-e3e24a5f-0ddb-46d2-b7a0-a034eb193133 - - - - -] Adding trait COMPUTE_STATUS_DISABLED to compute node resource provider af025232-d58e-4325-96bf-1e6870e272ad in placement. update_compute_provider_status /usr/lib/python3.6/site-packages/nova/compute/manager.py:514
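
To confirm that the trait actually landed in placement during that window, one could query the resource provider directly. A sketch, assuming the osc-placement CLI plugin is available (the UUID is the resource provider from the log line above):

    # list the traits on the compute node's resource provider
    openstack resource provider trait list af025232-d58e-4325-96bf-1e6870e272ad | grep COMPUTE_STATUS_DISABLED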

Will download the full logs to look at the entire deployment.

Comment 7 Artom Lifshitz 2021-05-26 16:41:05 UTC
(In reply to Artom Lifshitz from comment #6)
> // Quick notes from looking at this:
> 
> Right after OVN releases the port, nova gets the vif unplugged event:
> 
> 2021-05-19 17:04:38.490 7 DEBUG nova.compute.manager
> [req-86a0a2cf-262d-4fec-a692-e3fcdc26cd2a eceec1b17bba449ca7ce1f2c9d12fd6f
> 86fef77304cd43bbb7bfd69ae7d02034 - default default] [instance:
> d4d21533-a246-42d6-aa3c-88eac7091d63] Received event
> network-vif-unplugged-f1be45c9-4c8f-4546-8689-2a5d5a961f0a
> external_instance_event
> /usr/lib/python3.6/site-packages/nova/compute/manager.py:9310
> 
> Looking at what happened before that, we see:
> 
> Looks like nova-compute loses the connection to libvirt right before the
> port gets unplugged:
> 
> 2021-05-19 17:04:31.414 7 INFO nova.virt.libvirt.driver
> [req-e3e24a5f-0ddb-46d2-b7a0-a034eb193133 - - - - -] Connection event '0'
> reason 'Connection to libvirt lost: 1'
> 
> nova-compute also decides that it's disabled:
> 
> 2021-05-19 17:04:31.509 7 DEBUG nova.compute.manager
> [req-e3e24a5f-0ddb-46d2-b7a0-a034eb193133 - - - - -] Adding trait
> COMPUTE_STATUS_DISABLED to compute node resource provider
> af025232-d58e-4325-96bf-1e6870e272ad in placement.
> update_compute_provider_status
> /usr/lib/python3.6/site-packages/nova/compute/manager.py:514

So that part is normal. We lose the connection to libvirt, and the libvirt driver handles that event by setting the node as down. Sean believes libvirt being restarted at [1] is causing it to remove the tap interface, because it's deleting the port first, then re-adding it, instead of using the `--may-exist` command line option. We'll get the libvirt folks to double-check this.

[1] 2021-05-19 17:04:36.043+0000: 195668: info : libvirt version: 7.0.0, package: 14.module+el8.4.0+10886+79296686 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2021-05-06-06:29:31, )

Comment 8 Artom Lifshitz 2021-05-26 17:08:40 UTC
Talking to the nice folks at #virt, we've agreed to move this BZ to libvirt.

tl;dr for anyone coming into this without context: libvirt is restarted as part of an OpenStack upgrade. When starting back up, it deletes and re-adds tap interfaces, causing network connectivity loss for the affected instance(s). We believe this can be fixed by changing how libvirt calls the ovs command line, specifically by starting to use the `--may-exist` option.

Port / instance in question are f1be45c9-4c8f-4546-8689-2a5d5a961f0a / d4d21533-a246-42d6-aa3c-88eac7091d63 respectively. Interface XML looks like:

    <interface type="bridge">
      <mac address="fa:16:3e:b6:8f:34"/>
      <model type="virtio"/>
      <driver name="vhost" rx_queue_size="512"/>
      <source bridge="br-int"/>
      <mtu size="1442"/>
      <target dev="tapf1be45c9-4c"/>
      <virtualport type="openvswitch">
        <parameters interfaceid="f1be45c9-4c8f-4546-8689-2a5d5a961f0a"/>
      </virtualport>
    </interface>

Comment 11 Laine Stump 2021-05-26 19:06:06 UTC
> tl;dr for anyone coming into this without context: libvirt is
> restarted as part of an OpenStack upgrade.
> When starting back up, it deletes and re-adds tap interfaces,
> causing networking connectivity loss for the affected instance(s).

The confusing part of this is that simply restarting libvirt would not cause any guest tap devices to be deleted and recreated. That would only happen if the guest itself was shut down (or if the network interfaces of the guest were unplugged and then plugged back in).

Are you doing all the updating with the guest(s) remaining on the host that's being updated? Or are you migrating the guests off during the update? (I ask because migration would result in the qemu process being stopped and then restarted, which would lead to the tap device being destroyed and then later created)

Something that *does* happen when libvirt is restarted - if it determines that a tap device isn't attached to the bridge (or OVS switch) that it's supposed to be attached to, then it will detach it from whatever it's currently attached to, and then attach it to the correct bridge/switch. But if it's already attached to the correct bridge, then nothing should happen.
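
A quick way to check that condition from the compute node is to ask OVS which bridge the tap is currently attached to. A sketch, using the tap name from comment 8:

    # if this prints br-int (the bridge from the interface XML), a libvirtd
    # restart should leave the connection alone
    ovs-vsctl port-to-br tapf1be45c9-4c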

Comment 13 Jakub Libosvar 2021-05-27 09:15:00 UTC
(In reply to Artom Lifshitz from comment #8)
> Talking to the nice folks at #virt, we've agreed to move this BZ to libvirt.
> 
> tl;dr for anyone coming into this without context: libvirt is restarted as
> part of an OpenStack upgrade. When starting back up, it deletes and re-adds
> tap interfaces, 

Just a comment, not sure if it will help in any way, but it seems the tap device was actually not added back: per the OVS logs, we only see the tap being deleted:

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-upgrades-updates-16.1-to-16.2-from-latest_cdn-HA-ipv4/83/compute-1/var/log/openvswitch/ovs-vswitchd.log.gz
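
One way to check that in the archived log (a sketch, assuming the file has been downloaded locally):

    # look for add/delete events for the tap port in the compressed vswitchd log
    zgrep 'tapf1be45c9-4c' ovs-vswitchd.log.gz | grep -Ei 'add|delet'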

Comment 15 Laine Stump 2021-06-07 21:21:01 UTC
(In reply to Artom Lifshitz from comment #8)
> We believe this can be fixed by changing how libivrt calls the
> ovs command line, specifically by starting to use the `--may-exist` option.

My only reservation about this is that re-using an existing OVS port could mean that libvirt is connecting the guest's tap device to a port that was originally created by "someone else" and is set up in a way that would hinder proper behavior (for example, it may have some filtering attached to it).

Also, does deleting and recreating a port really take the port offline for 120 seconds? At most, I would imagine that it would prevent traffic for the length of any STP delay, but a 120 second STP delay seems extremely excessive.

From the other comments in this BZ, I gather that the guest isn't migrated elsewhere, and since you're concerned about continuity of the network connection I can also assume the guest isn't being shut down. I also assume that OpenStack isn't hot-unplugging the network interface. Based on this, I think we can confidently say that the tap device itself is never deleted and re-added. And libvirt doesn't delete/add OVS ports unless it finds that a guest's tap device is currently connected to the wrong master (or to nothing at all); otherwise it just leaves the connection alone.

So is there something (outside libvirt, because libvirt doesn't do it) during the upgrade that would cause the tap device to be disconnected from the OVS port?

I just sent a patch to upstream libvirt that switches from preemptively deleting any existing port to instead using --may-exist (as suggested in Comment 8):

  https://listman.redhat.com/archives/libvir-list/2021-June/msg00178.html

but I'm skeptical that is actually the source of the problem (unless someone/something else is already disconnecting the tap from the OVS port; but in that case, there is already a packet loss happening regardless of what libvirt does).
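
For context, the difference boils down to which ovs-vsctl invocation is used when plugging the tap; roughly the following two shapes (a sketch, not the exact arguments libvirt passes):

    # old behaviour: drop any existing port, then re-create it
    ovs-vsctl --timeout=5 -- --if-exists del-port tapf1be45c9-4c -- add-port br-int tapf1be45c9-4c
    # patched behaviour: only create the port if it is not already there
    ovs-vsctl --timeout=5 -- --may-exist add-port br-int tapf1be45c9-4c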

Comment 17 Laine Stump 2021-06-07 21:37:41 UTC
One other thing - if you're able to collect libvirt logs, we can see whether or not libvirt is disconnecting and reconnecting the tap devices during libvirtd restart. If you follow the directions here:

  https://libvirt.org/kbase/debuglogs.html

while setting log_filters="1:util.netdevtap" (and maybe some other stuff as suggested on that page, but this one is enough to see the tap devices reconnecting), then you should get a pair of messages like this if libvirtd needs to reconnect an existing tap device:

Removing vnet20 from br0
Attaching vnet20 to br1

(the first line will be missing if the tap wasn't connected to anything)

If you don't see those messages, then the whole "--if-exists del-port" vs "--may-exist" thing is a red herring.
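
For reference, a minimal sketch of the /etc/libvirt/libvirtd.conf settings for this (the output path is just an example; adjust as described on the kbase page), plus a grep to run after the libvirtd restart:

    # log only the tap-device plumbing at debug level
    log_filters="1:util.netdevtap"
    log_outputs="1:file:/var/log/libvirt/libvirtd.log"

    # after the restart, look for the reconnect messages for the tap in question
    grep -E '(Removing|Attaching) tapf1be45c9-4c' /var/log/libvirt/libvirtd.log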

Comment 19 Laine Stump 2021-06-10 18:58:00 UTC
I pushed the patch I mentioned in Comment 15 to upstream libvirt; it will be in libvirt-7.5.0. Since I'm not sure where the libvirt you're using for the version of OpenStack you're testing comes from, I'm not sure how to get you a build that you could use in your tests. However, before even bothering with that, it would be useful to learn whether this particular bit of code is even being used during the upgrade (and if so, why). The "if" part of that question could be answered by proper libvirt logs (I looked in the location you provided above and couldn't find any log messages from libvirt except those for the guests themselves, which isn't what I was looking for).

Comment 20 Laine Stump 2021-06-11 02:27:21 UTC
This is the commit upstream:

commit 2a51ff7b40ac7ed81d9244120716e1fd38371572
Author: Laine Stump <laine>
Date:   Mon Jun 7 16:27:49 2021 -0400

    openvswitch: don't delete existing OVS port prior to recreating same port

Comment 27 Sofer Athlan-Guyot 2021-06-23 10:29:36 UTC
Hi Laine,

so after investigating a bit more, this is yet another issue [1]. We had to add a new kernel parameter, and that had the unexpected side effect of rebooting the node.

I'm re-triggering the update without the reboot (which is the cause of the ping failure here).

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1975240

Comment 28 Jaroslav Suchanek 2021-06-24 14:45:49 UTC
I am re-assigning this bz back based on comment 26 and comment 27 for further investigation.

Comment 30 Laine Stump 2021-06-25 06:04:02 UTC
> So we are using the new libvirt version, but then we have this issue:

>2021-06-24 17:47:44.832+0000: 396090: debug : virCommandRun:2489 : Result status 0, stdout: 'Usage: dnsmasq [options]

> Valid options are:
> -a, --listen-address=<ipaddr>                          Specify local address(es) to listen on.
> -A, --address=/<domain>/<ipaddr>                       Return ipaddr for all hosts in specified domains.
> -b, --bogus-priv                                       Fake reverse lookups for RFC1918 private address ranges.
> ...

This is *not* an issue/problem. It is normal behavior - libvirt is running "dnsmasq --help" and parsing the output to learn the version of dnsmasq, and whether or not it possesses certain features.

You say you aren't experiencing the ping loss anymore. Can you

1) collect the correct libvirt logs during the time of the upgrade, to see if libvirt really is reconnecting the tap devices (by checking for the presence of the "Removing ..." and "Attaching ..." log messages for the tap device in question)?

2) check whether the ping loss re-occurs if you downgrade to a libvirt without the patch?

If both of those are affirmative, then changing the method of attaching to the OVS bridge *does* fix the original problem, and we should move ahead with an official backport. If the ping loss doesn't re-occur with older libvirt, then it was caused by something else that's subsequently changed. If there are no "Removing ..." logs followed by "Attach ...", but the problem re-occurs when libvirt is downgraded, then either we're still looking at the wrong logs, or I'm completely baffled :-)

Comment 37 Sofer Athlan-Guyot 2021-07-09 14:04:14 UTC
Hi,

short summary of the debug session with Pavel this morning.

We identified the issue as being linked to the fact that in OSP16.1 we
run libvirtd in a container, and systemd-machined is not working there.

So for instance:

    podman exec -ti nova_libvirt bash
    ()[root@compute-0 /]# machinectl list
    Could not get machines: The name org.freedesktop.machine1 was not provided by any .service files

    ()[root@compute-0 /]# virsh list
     Id   Name                State
    -----------------------------------
     1    instance-00000067   running
    
meaning libvirtd created the VM without the help of systemd-machined.

When we upgrade to OSP16.2, we switch to RHEL 8.4, where systemd-machined
can work inside the container (and is installed on the host). So when the
libvirt container restarts (libvirt updated from 6 to 7), it asks
systemd-machined for the VM, receives an empty answer, and shuts the VM down.
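
One way to confirm which backend is managing the guest is to look at the qemu process's cgroup from the host. A sketch (the instance name is the one from the virsh output above):

    # machined-managed guests sit under machine.slice; libvirt's generic
    # cgroup backend does not use it
    cat /proc/$(pgrep -f guest=instance-00000067 | head -1)/cgroup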

Comment 38 Daniel Berrangé 2021-07-09 15:46:43 UTC
I did an interactive debugging session. Pre-upgrade we see

 * systemd-container and libvirt installed in nova_libvirt container
 * systemd-container and libvirt NOT installed in the host
 * machined cannot be used because systemd in the host cannot see the systemd-container in the nova_libvirt container
 * libvirt in nova_libvirt uses its generic cgroups backend, ignoring machined

After the upgrade we see

 * systemd-container and libvirt installed in nova_libvirt container
 * systemd-container and libvirt ALSO installed in the host
 * systemd in the host happily starts machined in the host
 * libvirt in nova_libvirt uses its machine backend because it connects to host dbus

IOW, the existence of systemd-container in the nova_libvirt container is irrelevant.  The problems arise from libvirt and systemd-container mistakenly getting added to the host during the upgrade, along with libguestfs.

So the cgroups behaviour problems aren't a libvirt bug, just an OSP deployment package-list mistake somewhere.
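
A quick way to confirm that state on a compute node (a sketch; the package names are the obvious candidates, not an exhaustive list):

    # on the host: these should normally NOT be installed in 16.x
    rpm -q systemd-container libvirt-daemon
    # inside the nova_libvirt container: these are expected to be present
    podman exec nova_libvirt rpm -q systemd-container libvirt-daemon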

Comment 40 Pavel Hrdina 2021-07-09 16:38:32 UTC
(In reply to Daniel Berrangé from comment #38)
> I did an interactive debugging session. Pre-upgrade we see
> 
>  * systemd-container and libvirt installed in nova_libvirt container
>  * systemd-container and libvirt NOT installed in the host
>  * machined cannot be used because systemd in the host cannot see the
> systemd-container in the nova_libvirt container
>  * libvirt in nova_libvirt uses its generic cgroups backend, ignoring
> machined
> 
> After the upgrade we see
> 
>  * systemd-container and libvirt installed in nova_libvirt container
>  * systemd-container and libvirt ALSO installed in the host
>  * systemd in the host happily starts machined in the host
>  * libvirt in nova_libvirt uses its machine backend because it connects to
> host dbus
> 
> IOW, the existence of systemd-container in the nova_libvirt container is
> irrelevant.  The problems arise from libvirt and systemd-container
> mistakenly getting added to the host during the upgrade, along with
> libguestfs.
> 
> So the cgroups behaviour problems aren't a libvirt bug, just a OSP
> deployment package list mistake somewhere

Thanks for this analysis. I'm new to OpenStack, so it was not clear to me
why machined suddenly works after the update; I was trying to figure out
what changed and how the libvirt container is started.

This sounds like the root cause of the issue reported here, so agreed that
this is not a libvirt bug.

Comment 50 Sofer Athlan-Guyot 2021-07-19 13:08:51 UTC
Hi,

@jfrancoa , I'm still not sold on the need to install all container packages. If your repositories are correctly set up (and you have registered to the right stream), then podman should be updated during yum upgrade.

Why we need to install all packages from the container stream just to get podman still eludes me; I think I'm going to open a separate bugzilla to discuss this.

@smooney, so I've tested this:

        - name: Ensure systemd-container is not installed on the host.
          package:
            name: systemd-container
            state: absent

and the update is then successful.

Note that this is not the "classical" CI update from 16.1 to 16.2.

To reproduce what the customer has, we need to:
 1. install 16.1;
 2. run an update of 16.1 - this is the step we don't usually do in the 16.1->16.2 update, so systemd-container is usually not installed on 16.1 during CI tests. Adding this stage installs systemd-container, but on RHEL 8.2 it doesn't have any impact (unlike on 8.4);
 3. reboot everything;
 4. update to 16.2 - with the systemd-container removal;
 5. everything works fine, even migration;
 6. reboot.

Point 4 works, but we have a strange situation:
 - machinectl list works before the removal: the machines have systemd-controlled cgroups;
 - we remove systemd-container, removing all the related binaries: fortunately the VMs keep running under machined "control" with the in-memory binary, so no cut happens: I guess we're lucky.

I'm going to run a test with the command provided and see if I can implement this workflow for 16.1->16.2 in the CI.

The overall good news is that the removal does work.

Comment 51 smooney 2021-07-19 13:32:52 UTC
- name: Ensure systemd-container is not installed on the host.
  package:
    name: systemd-container
    state: absent

is not really sufficient to close out this bz.
It will allow the update to proceed, but from a compute DFG perspective none of the virt modules should be installed on the host in 16.
So by removing only the conflicting package you are still leaving the host in an untested state.
We do all our testing without the virt module present, so please also remove the entire virt:av module on the host and test that instead.
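
Something along these lines (a sketch, to be adapted and tested; the exact module stream and profiles should be double-checked on the host, they are not taken from this BZ):

    # inspect what the virt module currently provides on the host
    dnf module list virt
    # remove the module's packages and reset its state
    dnf module remove virt
    dnf module reset virt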


In terms of the container packages, I'm not sure that all of them are needed, but that is outside my area of knowledge. I would assume just updating the installed packages would be sufficient, but perhaps there is a reason to install all of them, for example to provide a new tool. In that case I think it would generally be better to install them explicitly, if for no other reason than to document the requirement for the new dependency.

Comment 61 errata-xmlrpc 2021-09-15 07:15:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483