Bug 1979730 - Windows VM ends up with ghost NIC and missing secondary disks when machine type changes from pc-q35-rhel8.3.0 to pc-q35-rhel8.4.0
Summary: Windows VM ends up with ghost NIC and missing secondary disks machine type ch...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.4.9
Target Release: ---
Assignee: Lucia Jelinkova
QA Contact: Tamir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-06 20:17 UTC by Abhishekh Patil
Modified: 2022-12-07 15:26 UTC
12 users

Fixed In Version: ovirt-engine-4.4.9-1
Doc Type: Bug Fix
Doc Text:
Previously, when upgrading a cluster from cluster level 4.5 to 4.6, the emulated machine changed to a newer one. This caused problems on some Windows virtual machines, such as loss of static IP configuration or secondary disks going offline. In this release, the Webadmin shows a confirmation dialog during the cluster upgrade from cluster level 4.5 or lower to cluster level 4.6 or higher if there are any virtual machines that could be affected.
Clone Of:
Environment:
Last Closed: 2021-11-16 14:46:53 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 6179481 0 None None None 2021-07-12 14:04:01 UTC
Red Hat Product Errata RHSA-2021:4626 0 None None None 2021-11-16 14:47:14 UTC
oVirt gerrit 116764 0 master MERGED engine: Add warning on cluster upgrade 2021-10-03 09:39:59 UTC
oVirt gerrit 116919 0 ovirt-engine-4.4 MERGED engine: Add warning on cluster upgrade 2021-10-04 08:42:37 UTC

Description Abhishekh Patil 2021-07-06 20:17:33 UTC
Description of problem:

Windows VM ends up with a ghost NIC and missing secondary disks after upgrading cluster compatibility to 4.6 and restarting the VM. The vNIC goes offline and the secondary disks are missing.

Version-Release number of selected component (if applicable):
ovirt-engine-4.4.6

How reproducible:

Steps to Reproduce:
1. Prepare an RHV 4.4 environment, but do not update cluster compatibility to 4.6 yet.
2. Install a Windows VM with a minimum of 2 disks and one vNIC.
3. Update cluster compatibility to 4.6 and restart the VM.

Actual results:
The VM ends up with a ghost NIC and missing secondary disks.

Expected results:
The VM should continue to see the vNIC and all disks even after updating cluster compatibility to 4.6.

Additional info:

This can be resolved by re-configuring the vNIC and re-attaching the secondary disks.

Comment 23 Arik 2021-07-29 14:40:00 UTC
Further investigation revealed that:
1. When importing Windows 2016/2019 VMs to either cluster level 4.5 or 4.6, the secondary disks are set to offline.
2. When changing the emulated machine of the VM from pc-q35-rhel8.3.0 to pc-q35-rhel8.4.0, an additional NIC is discovered within the guest and the secondary disks go offline.

We relied on the fix for bz 1939546 to fix this, but #1 also happens with the latest virtio-win version that is supposed to include that fix (1.9.17-3).

We assume that after importing the VMs, the user set a static IP on the NIC and set the secondary disks online - which was expected.
The problem is that the network settings are lost and the disks go offline again when the emulated machine changes: a new NIC device that doesn't hold the previous settings is identified, and the secondary disks are switched back to offline.

AFAIK, that didn't happen before when changing the machine type, so I'll clone this to qemu in order to investigate why it happens now.

Comment 25 Arik 2021-09-14 13:59:58 UTC
Summing up an offline discussion on this bz:
As the platform team is unable to prevent this issue, we need to do something on the oVirt/RHV side.
The solution suggested by the platform team - setting a custom emulated machine for the VM - is a complex and risky change on our side.
So we agreed on providing a warning in the update cluster dialog when transitioning from a machine type < RHEL 8.4 to a machine type >= RHEL 8.4, with an explanation of the issue that may happen and what users can do to address it (fix their guests after the VM is upgraded to the new cluster level, set a custom compatibility version before the cluster upgrade, or avoid upgrading the cluster level).
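The agreed-on threshold can be sketched as a simple version comparison. The helper below is illustrative only - the function names and the parsing of the machine-type string are assumptions, not the actual ovirt-engine implementation:

```python
import re

# Hypothetical sketch of the warning condition described above: warn when
# a cluster upgrade would move VMs from a machine type older than RHEL 8.4
# to one at 8.4 or newer.
RHEL_8_4 = (8, 4)

def rhel_version(machine_type):
    """Extract (major, minor) from a machine type string such as
    'pc-q35-rhel8.3.0'. Returns None if no RHEL version is present."""
    m = re.search(r"rhel(\d+)\.(\d+)", machine_type)
    return (int(m.group(1)), int(m.group(2))) if m else None

def needs_upgrade_warning(old_machine_type, new_machine_type):
    """True only when crossing the RHEL 8.4 boundary from below."""
    old = rhel_version(old_machine_type)
    new = rhel_version(new_machine_type)
    if old is None or new is None:
        return False
    return old < RHEL_8_4 <= new

print(needs_upgrade_warning("pc-q35-rhel8.3.0", "pc-q35-rhel8.4.0"))  # True
print(needs_upgrade_warning("pc-q35-rhel8.4.0", "pc-q35-rhel8.5.0"))  # False
```

Note that an 8.4-to-8.5 transition does not trigger the warning: the problematic guest-visible change happens only when crossing the 8.4 boundary.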

Comment 26 Marina Kalinin 2021-09-15 03:09:14 UTC
Hi Arik,
Some follow up questions please:

1. What would be the actions required from the user? And once we have those, we should document them in a KCS article (or official documentation?).
2. When would this warning be given? After we change the CL to 4.6, or after upgrading the manager and before changing the CL? Does it matter at all?
3. To which VMs would this apply? All the VMs that had a machine type < pc-q35-rhel8.4.0?
4. When is the machine type set? When the VM is first created, or when the VM is started? I.e., would the machine type change if the VM is shut down and then started on a RHEL 8.4 host?


p.s. QEMU machine types in RHV:
https://access.redhat.com/articles/3229221

Comment 27 Arik 2021-09-19 17:17:47 UTC
Hi Marina,

> 1. What would be the actions required from the user? And once we have those,
> we should document it in KCS (or official documentation?).

It really depends on what the user chooses; the user has several options here:
1. Proceed with the upgrade of the cluster version and "fix" the guests afterwards (restore the static network configuration, bring the secondary disks online)
2. Avoid upgrading the cluster level, i.e., keep the previous compatibility version of the cluster
3. Set a custom compatibility level on the VMs that are most likely to be affected by this

We already documented this in the release notes and the upgrade guide when introducing cluster level 4.6 (https://bugzilla.redhat.com/show_bug.cgi?id=1940232#c26) - back then we didn't know about the issue with secondary disks, though, so we can update that text.

> 2. This warning would be given when? After we change the CL to 4.6 or after
> upgrading the manager and before changing the CL or? does it matter at all?

The latter - after upgrading the manager and before changing the cluster level, so users will have a chance to choose one of the aforementioned actions.

> 3. To what VMs would this apply? All the VMs that had machine type <
> pc-q35-rhel8.4.0?

That's still under review. I like the patch that was posted by Lucia, with two warnings:
1. Due to the emulated machine change, some Windows virtual machines might lose static IP configuration. This would need to be resolved manually. The following virtual machines might be affected:
2. Due to the emulated machine change, some Windows virtual machines might start with offline secondary disks. This would need to be resolved manually. The following virtual machines might be affected:
These are shown when the VMs' machine type changes from < 8.4 to >= 8.4.
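The selection behavior later confirmed by QE testing (comment 37) can be sketched as follows. This is an illustrative model only - the `Vm` shape and helper names are assumptions, not the ovirt-engine data model or code:

```python
from dataclasses import dataclass

# Hypothetical sketch of how the two warnings select VMs, per this bug:
# VMs with a custom compatibility version or custom emulated machine keep
# their machine type and are exempt; RHEL guests handle the change and are
# skipped; the offline-disks warning lists VMs with 2+ disks; the
# static-IP warning lists all remaining Windows/OtherOS VMs.

@dataclass
class Vm:
    name: str
    os_type: str                       # "Windows", "OtherOS", "RHEL", ...
    disk_count: int
    custom_compat_version: str = ""    # e.g. "4.5"
    custom_emulated_machine: str = ""  # e.g. "pc-q35-rhel8.3.0"

def machine_type_will_change(vm):
    # A custom setting pins the emulated machine, so the cluster upgrade
    # does not change it.
    return (vm.os_type != "RHEL"
            and not vm.custom_compat_version
            and not vm.custom_emulated_machine)

def offline_disks_warning(vms):
    return [vm.name for vm in vms
            if machine_type_will_change(vm) and vm.disk_count >= 2]

def static_ip_warning(vms):
    return [vm.name for vm in vms if machine_type_will_change(vm)]

vms = [
    Vm("Windows_2Disks_1Nic", "Windows", 2),
    Vm("Windows_1Disk_1Nic", "Windows", 1),
    Vm("Windows_2Disks_1Nic_CCV", "Windows", 2, custom_compat_version="4.5"),
    Vm("RHEL_2Disks_1Nic", "RHEL", 2),
]
print(offline_disks_warning(vms))  # ['Windows_2Disks_1Nic']
print(static_ip_warning(vms))      # ['Windows_2Disks_1Nic', 'Windows_1Disk_1Nic']
```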

> 4. When is the machine type set? When the VM is created first or when the VM
> is started? i.e. would the machine type change if the VM is shut down and is
> now started on a RHEL 8.4 host?

The machine type is set when the VM is created.

Comment 28 Germano Veit Michel 2021-09-21 02:38:54 UTC
(In reply to Arik from comment #27)
> Hi Marina,
> 
> > 1. What would be the actions required from the user? And once we have those,
> > we should document it in KCS (or official documentation?).
> 
> It really depends on what the user will choose, the user has several options
> here:
> 1. To proceed with the upgrade of the cluster version and "fix" the guests
> afterwards (restore the static network configuration, online secondary disks)
> 2. To avoid upgrading the cluster level, i.e., keeping the previous
> compatibility version of the cluster
> 3. To set the VMs that there's a better chance that would be affected by
> this with custom compatibility level

Could option (3) above be part of the cluster CL upgrade dialog? A button to make the engine automatically set a custom CL/machine type for the affected VMs, with values equal to those used at the current CL.
That would be handy for large deployments.

Comment 29 Arik 2021-09-29 12:22:24 UTC
(In reply to Germano Veit Michel from comment #28)
> Could this option (3) above be part of the Cluster CL level upgrade dialog?
> A button to make the engine automatically set custom CL/machine for the
> affected VMs with values equal to those used on the current CL level.
> Would be handy for large deployments.

I agree, and it was discussed, but we didn't reach an agreement on this.

Comment 33 Marina Kalinin 2021-09-30 21:01:15 UTC
Arik, Michal: I modified this KCS to reflect all the decisions. FYI: https://access.redhat.com/solutions/6179481

Comment 35 Arik 2021-10-04 08:46:07 UTC
(In reply to Marina Kalinin from comment #33)
> Arik, Michal: I modified this KCS to reflect all the decisions. FYI:
> https://access.redhat.com/solutions/6179481

Looks good, thanks

Comment 37 Qin Yuan 2021-11-01 12:59:25 UTC
Tested with:
ovirt-engine-4.4.9.2-0.6.el8ev.noarch

Steps:
1. Create a cluster with Compatibility Version 4.5
2. Create VMs according to the following matrix:
OS type   Disks number   Nics number   Custom Compatibility Version   Custom Emulated Machine
OtherOS   0              0             -                              -
OtherOS   0              1             -                              -
OtherOS   1              0             -                              -
OtherOS   1              1             -                              -
OtherOS   2              0             -                              -
OtherOS   2              1             -                              -
OtherOS   2              1             4.5                            -
OtherOS   2              1             -                              pc-q35-rhel8.3.0
Windows   0              0             -                              -
Windows   0              1             -                              -
Windows   1              0             -                              -
Windows   1              1             -                              -
Windows   2              0             -                              -
Windows   2              1             -                              -
Windows   2              1             4.5                            -
Windows   2              1             -                              pc-q35-rhel8.3.0   
RHEL 8.x  2              1             -                              -

3. Update the cluster compatibility version to 4.6


Results:
1. There is a confirmation dialog saying:
"
Due to the emulated machine change, some Windows virtual machines might start with offline secondary disks. This would need to be resolved manually. The following virtual machines might be affected:

    OtherOS_2Disks_0Nic
    OtherOS_2Disks_1Nic
    Windows_2Disks_0Nic
    Windows_2Disks_1Nic

Due to the emulated machine change, some Windows virtual machines might lose static IP configuration. This would need to be resolved manually. The following virtual machines might be affected:

    Windows_0Disk_1Nic [Windows 2012 x64]
    OtherOS_2Disks_0Nic [Other OS]
    OtherOS_1Disk_0Nic [Other OS]
    OtherOS_1Disk_1Nic [Other OS]
    Windows_0Disk_0Nic [Windows 10 x64]
    OtherOS_0Disk_0Nic [Other OS]
    Windows_1Disk_0Nic [Windows 2019 x64]
    Windows_1Disk_1Nic [Windows 2019 x64]
    OtherOS_0Disk_1Nic [Other OS]
    OtherOS_2Disks_1Nic [Other OS]
    Windows_2Disks_0Nic [Windows 2019 x64]
    Windows_2Disks_1Nic [Windows 2019 x64]

Are you sure you want to continue?
"

2. According to the info in the confirmation dialog:
 
- All OtherOS and Windows VMs with 2 disks, except the ones with a Custom Compatibility Version or Custom Emulated Machine configured, are listed in the "offline secondary disks" warning section.
- All OtherOS and Windows VMs with fewer than 2 disks are not listed in the "offline secondary disks" warning section.
- All OtherOS and Windows VMs, except the ones with a Custom Compatibility Version or Custom Emulated Machine configured, are listed in the "lose static IP configuration" warning section.
- The RHEL VM isn't listed in any warning section.

The VMs are listed in the warning message as expected, except that it would be better not to list OtherOS/Windows VMs without any NIC in the "lose static IP configuration" warning section.

Lucia, what do you think? Do we need to filter out OtherOS/Windows VMs without any NIC?

Comment 38 Arik 2021-11-01 15:43:50 UTC
Yeah, we could have filtered in only VMs with at least one NIC that is set with a NIC profile.
But what's the consequence of not doing that? We'll also display VMs with no NIC, or with no NIC that has a non-empty NIC profile.
After all, we say those VMs "might lose static IP configuration" - and considering that it's unlikely to have only such VMs, these VMs will typically be shown along with other VMs that the warning may apply to. I don't think we should block 4.4.9 on that. That's something we can improve on the master branch, though.
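The stricter filter discussed here can be sketched as a simple predicate. This is a hypothetical model - the `Vm`/`Nic` shapes are illustrative, not the ovirt-engine data model:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical sketch: list a VM in the static-IP warning only if it has
# at least one NIC attached to a non-empty NIC profile.

@dataclass
class Nic:
    profile: Optional[str] = None  # vNIC profile name, if any

@dataclass
class Vm:
    name: str
    nics: List[Nic] = field(default_factory=list)

def might_lose_static_ip(vm):
    """True if the VM has at least one NIC set with a NIC profile."""
    return any(nic.profile for nic in vm.nics)

vms = [
    Vm("Windows_0Disk_0Nic"),                      # no NIC at all
    Vm("Windows_1Disk_1Nic", [Nic("ovirtmgmt")]),  # NIC with a profile
    Vm("Windows_0Disk_1Nic", [Nic(None)]),         # NIC without a profile
]
affected = [vm.name for vm in vms if might_lose_static_ip(vm)]
print(affected)  # ['Windows_1Disk_1Nic']
```

As comment 38 notes, the unfiltered list is a superset of this one, which is why skipping the filter was deemed acceptable for 4.4.9.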

Comment 39 Qin Yuan 2021-11-02 00:58:22 UTC
(In reply to Arik from comment #38)
> Yeah, we could have filtered only VMs with at least one NIC that is set with
> a NIC-profile
> But what's the consequence of that? we'll display also VMs with no NIC or
> with no NIC that has a non-empty NIC profile
> After all, we say those VMs "might lose static IP configuration" -
> considering that it's unlikely to have only such VMs, these VMs will
> typically be shown along with other VMs that this warning may apply for. I
> don't think we should block 4.4.9 on that. That's something we can improve
> on the master branch though.

OK, this makes sense; moving this bug to VERIFIED.

Comment 43 errata-xmlrpc 2021-11-16 14:46:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) security update [ovirt-4.4.9]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4626

