Bug 1605172 - [downstream clone - 4.2.5] VM was destroyed on destination after successful migration due to missing the 'device' key on the lease device
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.1.9
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.2.5
Target Release: ---
Assignee: Francesco Romani
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On: 1590063
Blocks:
Reported: 2018-07-20 11:37 UTC by RHV bug bot
Modified: 2021-09-09 15:08 UTC (History)

Fixed In Version: vdsm v4.20.35
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1590063
Environment:
Last Closed: 2018-07-31 17:50:12 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-43557 0 None None None 2021-09-09 15:08:42 UTC
Red Hat Knowledge Base (Solution) 3523081 0 None None None 2018-07-20 11:39:28 UTC
Red Hat Product Errata RHEA-2018:2319 0 None None None 2018-07-31 17:50:26 UTC
oVirt gerrit 92160 0 master MERGED vm: lease: on hotplug, ensure the 'device' key 2018-07-20 11:39:28 UTC
oVirt gerrit 92197 0 ovirt-4.2 MERGED vm: lease: on hotplug, ensure the 'device' key 2018-07-20 11:39:28 UTC
oVirt gerrit 93046 0 master MERGED vm: lease: fix params on migration destination 2018-07-20 11:39:28 UTC
oVirt gerrit 93141 0 ovirt-4.2 MERGED vm: lease: fix params on migration destination 2018-07-20 11:39:28 UTC

Description RHV bug bot 2018-07-20 11:37:39 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1590063 +++
======================================================================

Description of problem:

A migration completed fine: the VM ran on the destination and shut down on the source.
But right after resuming on the destination, VDSM hit an exception and destroyed the VM.
So the VM was down on both source and destination, causing an outage.

On destination:

1. VM runs after completing the migration:
2018-06-10 15:27:50,811+0200 INFO  (libvirt/events) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') CPU running: onResume (vm:5119)

2. VDSM hits an exception in update_device_info() and kills the VM:
2018-06-10 15:27:50,821+0200 ERROR (vm/50cab403) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') The vm start process failed (vm:631)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 581, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/share/vdsm/virt/vm.py", line 3284, in _completeIncomingMigration
    self._domDependentInit()
  File "/usr/share/vdsm/virt/vm.py", line 1850, in _domDependentInit
    self._vmDependentInit()
  File "/usr/share/vdsm/virt/vm.py", line 1866, in _vmDependentInit
    self._getUnderlyingVmDevicesInfo()
  File "/usr/share/vdsm/virt/vm.py", line 1799, in _getUnderlyingVmDevicesInfo
    vmdevices.common.update_device_info(self, self._devices)
  File "/usr/share/vdsm/virt/vmdevices/common.py", line 87, in update_device_info
    core.Console.update_device_info(vm, devices[hwclass.CONSOLE])
  File "/usr/share/vdsm/virt/vmdevices/core.py", line 249, in update_device_info
    if dev['device'] == hwclass.CONSOLE and \
KeyError: 'device'

2018-06-10 15:27:50,828+0200 INFO  (vm/50cab403) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') Changed state to Down: 'device' (code=1) (vm:1259)

2018-06-10 15:27:50,838+0200 INFO  (jsonrpc/2) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') _destroyVmGraceful attempt #0 (vm:4334)

After parsing the device list received by vdsm from the source host, I could only find 1 device missing the 'device' key:

{"path": "/dev/7688a6ed-3742-4cee-82d3-233775112670/xleases", "sd_id": "7688a6ed-3742-4cee-82d3-233775112670", "type": "lease", "lease_id": "50cab403-bf5d-422f-9fac-d3ca3cc57a66", "offset": 3145728}

This is a lease device, and indeed it is missing the 'device' key.
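
The check described above can be sketched roughly as follows. This is an illustrative snippet, not code taken from vdsm: the helper name `devices_missing_key` and the console entry in the sample list are assumptions made for the example; only the lease dictionary is copied from this report.

```python
import json

# Device parameters as they might arrive from the source host.
# The lease entry is copied from this report; the console entry is a
# made-up example of a well-formed device (assumption).
DEVICES_JSON = """
[
  {"type": "console", "device": "console", "specParams": {}},
  {"path": "/dev/7688a6ed-3742-4cee-82d3-233775112670/xleases",
   "sd_id": "7688a6ed-3742-4cee-82d3-233775112670",
   "type": "lease",
   "lease_id": "50cab403-bf5d-422f-9fac-d3ca3cc57a66",
   "offset": 3145728}
]
"""


def devices_missing_key(devices, key="device"):
    """Return the device dicts that do not carry the given key."""
    return [dev for dev in devices if key not in dev]


devices = json.loads(DEVICES_JSON)
missing = devices_missing_key(devices)
for dev in missing:
    print(dev["type"], "device lacks the 'device' key")
```

Any code that later indexes `dev['device']` on such an entry raises `KeyError`, which matches the traceback shown earlier.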

I can partially reproduce this behavior in 4.2.3:
1) If I start the VM with the lease device already configured, VDSM reports the 'device' key for the device.
2) If I hot-plug the lease device, it is missing the 'device' key, see:

            {
                "path": "/rhev/data-center/mnt/10.64.24.33:_exports_data/9f076c60-1a5f-47f8-80ea-687498984df0/dom_md/xleases", 
                "sd_id": "9f076c60-1a5f-47f8-80ea-687498984df0", 
                "type": "lease", 
                "lease_id": "3f3267ab-b2df-47c0-9b98-c8fd45ba485b", 
                "offset": 6291456
            }

So the device list differs depending on whether the lease was hot-plugged or not. But I can't make the migration fail on the latest VDSM; I'm not sure what could have fixed it.

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.9.2-0.1.el7.noarch
vdsm-4.19.50-1.el7ev.x86_64

How reproducible:
Partially

Steps to Reproduce:
1. Configure lease with the VM up

Actual results:
'device' key is missing (?)

Expected results:
'device' key is present (? - not sure?)

(Originally by Germano Veit Michel)

Comment 3 RHV bug bot 2018-07-20 11:37:51 UTC
Arik, can you take a look? Is it something we've fixed?

(Originally by Yaniv Kaul)

Comment 4 RHV bug bot 2018-07-20 11:37:55 UTC
(In reply to Germano Veit Michel from comment #0)
> Description of problem:
> 
> A migration completed fine, the VM ran on destination and shutdown on the
> source.
> But right after resuming on destination, VDSM hit an exception in and
> destroyed the VM. 
> So the VM was down on both source and destination, causing an outage.
> 
> On destination:
> 
> 1. VM runs after completing the migration:
> 2018-06-10 15:27:50,811+0200 INFO  (libvirt/events) [virt.vm]
> (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') CPU running: onResume (vm:5119)
> 
> 2. VDSM hits an exception on update_device_info() and kills it
> 2018-06-10 15:27:50,821+0200 ERROR (vm/50cab403) [virt.vm]
> (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') The vm start process failed
> (vm:631)
> Traceback (most recent call last):
>   File "/usr/share/vdsm/virt/vm.py", line 581, in _startUnderlyingVm
>     self._completeIncomingMigration()
>   File "/usr/share/vdsm/virt/vm.py", line 3284, in _completeIncomingMigration
>     self._domDependentInit()
>   File "/usr/share/vdsm/virt/vm.py", line 1850, in _domDependentInit
>     self._vmDependentInit()
>   File "/usr/share/vdsm/virt/vm.py", line 1866, in _vmDependentInit
>     self._getUnderlyingVmDevicesInfo()
>   File "/usr/share/vdsm/virt/vm.py", line 1799, in
> _getUnderlyingVmDevicesInfo
>     vmdevices.common.update_device_info(self, self._devices)
>   File "/usr/share/vdsm/virt/vmdevices/common.py", line 87, in
> update_device_info
>     core.Console.update_device_info(vm, devices[hwclass.CONSOLE])
>   File "/usr/share/vdsm/virt/vmdevices/core.py", line 249, in
> update_device_info
>     if dev['device'] == hwclass.CONSOLE and \
> KeyError: 'device'
[...]
> So there is a difference in the devices list if the list was hot-plugged or
> not. But I can't make the migration fail on latest VDSM, not sure what could
> have fixed it.

A large factor is most likely the engine XML change, which happened between 4.1 and 4.2.

The most likely cause is indeed the missing 'device' key; we need to learn whether it happens only with a hotplugged lease or whether there are other cases.

(Originally by Francesco Romani)

Comment 5 RHV bug bot 2018-07-20 11:37:59 UTC
The code doesn't really look changed in 4.1. I suspect no one was really using hotplug lease in 4.1; it was a very late feature targeted at a very specific, narrow use case, originally without hotplugging. It's quite possible it slipped through testing.

Germano, since we do not plan any further 4.1.z, would it make sense to try to reproduce this in 4.2?

(Originally by michal.skrivanek)

Comment 6 RHV bug bot 2018-07-20 11:38:04 UTC
(In reply to Michal Skrivanek from comment #4)
> the code doesn't really look changed in 4.1. I suspect no one was really
> using hotplug lease in 4.1, it was a very late feature targeted for a very
> specific narrow usecase, originally without hotplugging. It's quite possible
> it slipped through testing.
> 
> Germano, since we do not plan any further 4.1.z, would it make sense to try
> to reproduce this in 4.2?

Hi Michal,

Sorry if I was not clear enough. I did reproduce this partially in 4.2.3. The 'device' key is missing on the lease device even in 4.2.3. But the VM migrates fine, it doesn't hit that exception.

(Originally by Germano Veit Michel)

Comment 7 RHV bug bot 2018-07-20 11:38:07 UTC
(In reply to Germano Veit Michel from comment #5)
> (In reply to Michal Skrivanek from comment #4)
> > the code doesn't really look changed in 4.1. I suspect no one was really
> > using hotplug lease in 4.1, it was a very late feature targeted for a very
> > specific narrow usecase, originally without hotplugging. It's quite possible
> > it slipped through testing.
> > 
> > Germano, since we do not plan any further 4.1.z, would it make sense to try
> > to reproduce this in 4.2?
> 
> Hi Michal,
> 
> Sorry if I was not clear enough. I did reproduce this partially in 4.2.3.
> The 'device' key is missing on the lease device even in 4.2.3. But the VM
> migrates fine, it doesn't hit that exception.

ah, sorry, missed that. Great, can you attach that separately as well?
Did you have serial console enabled? Can you try with/without it? It seems that’s what happened in VW, it’s possible console is the only place where it checks the device key

Also, what about the target release then? Since we do not plan a 4.1.z, is a 4.2 fix enough?


Separately, Arik, please take a look at why rerun was not triggered

(Originally by michal.skrivanek)

Comment 8 RHV bug bot 2018-07-20 11:38:11 UTC
(In reply to Germano Veit Michel from comment #5)
> (In reply to Michal Skrivanek from comment #4)
> > the code doesn't really look changed in 4.1. I suspect no one was really
> > using hotplug lease in 4.1, it was a very late feature targeted for a very
> > specific narrow usecase, originally without hotplugging. It's quite possible
> > it slipped through testing.
> > 
> > Germano, since we do not plan any further 4.1.z, would it make sense to try
> > to reproduce this in 4.2?
> 
> Hi Michal,
> 
> Sorry if I was not clear enough. I did reproduce this partially in 4.2.3.
> The 'device' key is missing on the lease device even in 4.2.3. But the VM
> migrates fine, it doesn't hit that exception.

From my initial investigation, it should work like this:

In 4.1:
- lease devices created at VM start look OK
- hotplugged lease devices lack the 'device' key, which causes this bug

In 4.2, using Engine XML:
- lease devices created at VM start look OK
- hotplugged lease devices lack the 'device' key, but in the migration flow we use the domain XML as authoritative, so we call the very same code as in the VM start flow, thus sidestepping this bug

In 4.2, using vm.conf (aka 4.1 compatibility mode):
- lease devices created at VM start look OK
- hotplugged lease devices lack the 'device' key, and in the migration flow we use the vm params as authoritative, so we hit the very same bug.
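
The three cases listed in this comment can be condensed into a small truth-table model. This is only a sketch of the reasoning, not vdsm code: the function name and both parameters are hypothetical and exist purely for illustration.

```python
def hits_missing_device_key(lease_hotplugged, uses_engine_xml):
    """Model of the three cases in this comment; not vdsm code.

    lease_hotplugged: the lease was added while the VM was running.
    uses_engine_xml: the destination treats the domain XML as
        authoritative (4.2 with Engine XML) instead of the vm params
        (4.1, or 4.2 in 4.1 compatibility mode).
    """
    if not lease_hotplugged:
        # Leases created at VM start carry the 'device' key.
        return False
    # Hotplugged leases lack the key. Only the domain XML path goes
    # through the VM start code again, which restores the key.
    return not uses_engine_xml


cases = {
    "4.1": hits_missing_device_key(True, uses_engine_xml=False),
    "4.2, Engine XML": hits_missing_device_key(True, uses_engine_xml=True),
    "4.2, vm.conf": hits_missing_device_key(True, uses_engine_xml=False),
}
for name, affected in cases.items():
    print(name, "-> affected" if affected else "-> not affected")
```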

We call the console code anyway: we are indeed looking in the VM device parameters (which we don't know upfront) to check if we have a console device to update. I don't think we can skip this check.

(Originally by Francesco Romani)

Comment 9 RHV bug bot 2018-07-20 11:38:16 UTC
(In reply to Francesco Romani from comment #7)
> We call the console code anyway: we are indeed looking in the VM device
> parameters (which we don't know upfront) to check if we have a console
> device to update. I don't think we can skip this check.

We indeed call it anyway, but we actually do logic only if we have a console device in the domain XML. So, it could work without configured console devices. It needs to be tested.

(Originally by Francesco Romani)

Comment 10 RHV bug bot 2018-07-20 11:38:21 UTC
(In reply to Francesco Romani from comment #8)
> (In reply to Francesco Romani from comment #7)
> > We call the console code anyway: we are indeed looking in the VM device
> > parameters (which we don't know upfront) to check if we have a console
> > device to update. I don't think we can skip this check.

we can definitely not raise in that case because you know that when that device key is missing it's not a console device:)
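
A minimal sketch of that idea, assuming the check only needs to know whether a device is a console: the constant `CONSOLE` stands in for `hwclass.CONSOLE` (its value here is an assumption), and `is_console` is a hypothetical helper, not the function in vmdevices/core.py.

```python
CONSOLE = "console"  # stand-in for hwclass.CONSOLE (assumed value)


def is_console(dev):
    """True only for console devices; a missing 'device' key means no."""
    # dict.get() returns None instead of raising KeyError, so a device
    # without the key is simply treated as "not a console".
    return dev.get("device") == CONSOLE


lease = {"type": "lease", "lease_id": "50cab403-bf5d-422f-9fac-d3ca3cc57a66"}
console = {"type": "console", "device": "console"}

print(is_console(lease))
print(is_console(console))
```

This only makes the console lookup tolerant; it does not add the missing key to the lease device itself.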

(Originally by michal.skrivanek)

Comment 11 RHV bug bot 2018-07-20 11:38:24 UTC
(no reason for private comments)

(Originally by michal.skrivanek)

Comment 12 RHV bug bot 2018-07-20 11:38:28 UTC
> Separately, Arik, please take a look at why rerun was not triggered

Germano, can you please provide the engine log?

(Originally by Arik Hadas)

Comment 14 RHV bug bot 2018-07-20 11:38:37 UTC
> Separately, Arik, please take a look at why rerun was not triggered

Steffen, thanks.

In order to initiate another attempt to migrate the VM, the VM must be running on the source host. However, in this case, the VM died on the source host:
2018-06-10 15:27:52,550+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-3) [] VM '50cab403-bf5d-422f-9fac-d3ca3cc57a66' was reported as Down on VDS 'b6732abc-9733-411b-bc6c-723315370d79'.

(Originally by Arik Hadas)

Comment 15 RHV bug bot 2018-07-20 11:38:41 UTC
And it gets even more interesting....

Test on engine 4.2.3 and vdsm 4.20.27:

1. Create VM without Lease but with Virtio Serial Console device
2. Start VM
3. Hot-plug a Lease
4. Migrate

Result: migration succeeds and device key "appears" on destination.

Pre-migration:
{
  "path": "/rhev/data-center/mnt/10.64.24.33:_exports_data/9f076c60-1a5f-47f8-80ea-687498984df0/dom_md/xleases", 
  "sd_id": "9f076c60-1a5f-47f8-80ea-687498984df0", 
  "type": "lease", 
  "lease_id": "e313e714-00c0-4819-9d83-a96c55fe0e1d", 
  "offset": 3145728
}

Post-migration:
{
  "lease_id": "e313e714-00c0-4819-9d83-a96c55fe0e1d", 
  "sd_id": "9f076c60-1a5f-47f8-80ea-687498984df0", 
  "offset": "3145728", 
  "device": "lease", 
  "path": "/rhev/data-center/mnt/10.64.24.33:_exports_data/9f076c60-1a5f-47f8-80ea-687498984df0/dom_md/xleases", 
  "type": "lease"
}
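
The difference between the two dumps can be checked mechanically. The snippet below is an illustrative sketch: the two dictionaries are copied from the dumps above, while `with_device_key` is a hypothetical helper showing one way to guarantee the key on hotplug; it is not the merged vdsm patch.

```python
PATH = ("/rhev/data-center/mnt/10.64.24.33:_exports_data/"
        "9f076c60-1a5f-47f8-80ea-687498984df0/dom_md/xleases")

pre = {
    "path": PATH,
    "sd_id": "9f076c60-1a5f-47f8-80ea-687498984df0",
    "type": "lease",
    "lease_id": "e313e714-00c0-4819-9d83-a96c55fe0e1d",
    "offset": 3145728,
}
post = {
    "lease_id": "e313e714-00c0-4819-9d83-a96c55fe0e1d",
    "sd_id": "9f076c60-1a5f-47f8-80ea-687498984df0",
    "offset": "3145728",
    "device": "lease",
    "path": PATH,
    "type": "lease",
}

# Keys that exist only after the migration.
added = sorted(set(post) - set(pre))
# Keys present in both dumps whose values differ (including type changes).
changed = sorted(k for k in pre if k in post and pre[k] != post[k])
print("added:", added)
print("changed:", changed)


def with_device_key(params):
    """Return a copy of the lease params that always carries 'device'.

    Hypothetical helper: falls back to the 'type' value, which is
    'lease' in both dumps.
    """
    normalized = dict(params)
    normalized.setdefault("device", normalized["type"])
    return normalized


print(with_device_key(pre)["device"])
```

Besides the new 'device' key, the comparison also flags 'offset', which is an integer before the migration and a string afterwards.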

(Originally by Germano Veit Michel)

Comment 16 RHV bug bot 2018-07-20 11:38:46 UTC
(In reply to Germano Veit Michel from comment #14)
> And it gets even more interesting....
> 
> Test on engine 4.2.3 and vdsm 4.20.27:
> 
> 1. Create VM without Lease but with Virtio Serial Console device
> 2. Start VM
> 3. Hot-plug a Lease
> 4. Migrate
> 
> Result: migration succeeds and device key "appears" on destination.
> 
> Pre-migration:
> {
>   "path":
> "/rhev/data-center/mnt/10.64.24.33:_exports_data/9f076c60-1a5f-47f8-80ea-
> 687498984df0/dom_md/xleases", 
>   "sd_id": "9f076c60-1a5f-47f8-80ea-687498984df0", 
>   "type": "lease", 
>   "lease_id": "e313e714-00c0-4819-9d83-a96c55fe0e1d", 
>   "offset": 3145728
> }
> 
> Post-migration:
> {
>   "lease_id": "e313e714-00c0-4819-9d83-a96c55fe0e1d", 
>   "sd_id": "9f076c60-1a5f-47f8-80ea-687498984df0", 
>   "offset": "3145728", 
>   "device": "lease", 
>   "path":
> "/rhev/data-center/mnt/10.64.24.33:_exports_data/9f076c60-1a5f-47f8-80ea-
> 687498984df0/dom_md/xleases", 
>   "type": "lease"
> }

This is expected and a (beneficial) side effect of the behaviour documented in
https://bugzilla.redhat.com/show_bug.cgi?id=1590063#c7

(Originally by Francesco Romani)

Comment 17 RHV bug bot 2018-07-20 11:38:50 UTC
(In reply to Francesco Romani from comment #15)
> This is expected and a (beneficial) side effect of the behaviour documented
> in
> https://bugzilla.redhat.com/show_bug.cgi?id=1590063#c7
Ahh, this:

(In reply to Francesco Romani from comment #7)
> In 4.2, using Engine XML
> - lease devices created at VM start looks OK
> - lease devices hotplugged lack the 'device' key, but on migration flow we
> use the domain XML as authoritative, thus we call the very same code as in
> the VM start flow, thus sidestepping this bug
OK, now I finally understood why I can't hit this on 4.2.

Do you need any more testing from me?

(Originally by Germano Veit Michel)

Comment 18 RHV bug bot 2018-07-20 11:38:55 UTC
(In reply to Germano Veit Michel from comment #16)
> (In reply to Francesco Romani from comment #15)
> > This is expected and a (beneficial) side effect of the behaviour documented
> > in
> > https://bugzilla.redhat.com/show_bug.cgi?id=1590063#c7
> Ahh, this:
> 
> (In reply to Francesco Romani from comment #7)
> > In 4.2, using Engine XML
> > - lease devices created at VM start looks OK
> > - lease devices hotplugged lack the 'device' key, but on migration flow we
> > use the domain XML as authoritative, thus we call the very same code as in
> > the VM start flow, thus sidestepping this bug
> OK, now I finally understood why I cant hit this on 4.2.
> 
> Do you need any more testing from me?

I think we're fine now, thanks. I posted a fix candidate for 4.2 - we need to support 4.1 compatibility mode in 4.2.z, so we will need a fix here.

(Originally by Francesco Romani)

Comment 19 RHV bug bot 2018-07-20 11:39:00 UTC
Steps to reproduce:

1. Set up a 4.1 cluster, but install Vdsm 4.20.z on it.
2. Create VM without Lease but with Virtio Serial Console device
3. Start VM
4. Hot-plug a Lease
5. Migrate

Without this fix, we get the behaviour documented in this bug.
With this fix, the migration works and the VM keeps running after the migration.

(Originally by Francesco Romani)

Comment 20 RHV bug bot 2018-07-20 11:39:05 UTC
This is an internal bug caused by an unexpected internal interaction, no doc_text

(Originally by Francesco Romani)

Comment 21 RHV bug bot 2018-07-20 11:39:09 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2.z': '?'}', ]

For more info please contact: rhv-devops

(Originally by rhv-bugzilla-bot)

Comment 22 RHV bug bot 2018-07-20 11:39:14 UTC
Please clone properly.

(Originally by Dusan Fodor)

Comment 24 Francesco Romani 2018-07-20 11:58:06 UTC
all relevant patches merged

Comment 25 Israel Pinto 2018-07-24 11:41:39 UTC
Verify with:
Host 4.2:
OS Version:RHEL - 7.5 - 8.el7
Kernel Version:3.10.0 - 862.9.1.el7.x86_64
KVM Version:2.10.0 - 21.el7_5.4
LIBVIRT Version:libvirt-3.9.0-14.el7_5.6
VDSM Version:vdsm-4.20.35-1.el7ev
Host 4.1:
vdsm-4.20.19-1.el7ev.x86_64

Steps on 4.1 cluster:
1. Set up a 4.1 cluster, with host vdsm 4.20.z on it.
2. Create VM without Lease but with Virtio Serial Console device
3. Start VM on host 4.1
4. Hot-plug a Lease
5. Migrate to host 4.2

Steps on 4.2 cluster:
1. Set up a 4.2 cluster
2. Create VM without Lease but with Virtio Serial Console device
3. Start VM on host 4.2
4. Hot-plug a Lease
5. Migrate

PASS

Comment 26 Israel Pinto 2018-07-24 11:42:40 UTC
(In reply to Israel Pinto from comment #25)
> Verify with:
> Host 4.2:
> OS Version:RHEL - 7.5 - 8.el7
> Kernel Version:3.10.0 - 862.9.1.el7.x86_64
> KVM Version:2.10.0 - 21.el7_5.4
> LIBVIRT Version:libvirt-3.9.0-14.el7_5.6
> VDSM Version:vdsm-4.20.35-1.el7ev
> Host 4.1:
> vdsm-4.20.19-1.el7ev.x86_64
> 
> Steps on 4.1 cluster:
> 1. Set up a 4.1 cluster, with host vdsm 4.20.z on it.
> 2. Create VM without Lease but with Virtio Serial Console device
> 3. Start VM on host 4.1
> 4. Hot-plug a Lease
> 5. Migrate to host 4.2
> 
> Steps on 4.2 cluster:
> 1. Set up a 4.2 cluster
> 2. Create VM without Lease but with Virtio Serial Console device
> 3. Start VM on host 4.2
> 4. Hot-plug a Lease
> 5. Migrate
> 
> PASS

Engine version: 
Software Version:4.2.5.2_SNAPSHOT-83.g4210c43.0.scratch.master.el7ev

Comment 27 Francesco Romani 2018-07-24 12:16:52 UTC
(In reply to Israel Pinto from comment #26)
> (In reply to Israel Pinto from comment #25)
> > Verify with:
> > Host 4.2:
> > OS Version:RHEL - 7.5 - 8.el7
> > Kernel Version:3.10.0 - 862.9.1.el7.x86_64
> > KVM Version:2.10.0 - 21.el7_5.4
> > LIBVIRT Version:libvirt-3.9.0-14.el7_5.6
> > VDSM Version:vdsm-4.20.35-1.el7ev
> > Host 4.1:
> > vdsm-4.20.19-1.el7ev.x86_64
> > 
> > Steps on 4.1 cluster:
> > 1. Set up a 4.1 cluster, with host vdsm 4.20.z on it.
> > 2. Create VM without Lease but with Virtio Serial Console device
> > 3. Start VM on host 4.1
> > 4. Hot-plug a Lease
> > 5. Migrate to host 4.2
> > 
> > Steps on 4.2 cluster:
> > 1. Set up a 4.2 cluster
> > 2. Create VM without Lease but with Virtio Serial Console device
> > 3. Start VM on host 4.2
> > 4. Hot-plug a Lease
> > 5. Migrate
> > 
> > PASS
> 
> Engine version: 
> Software Version:4.2.5.2_SNAPSHOT-83.g4210c43.0.scratch.master.el7ev

The steps look correct, provided that the "lease" in step 4 is a VM lease (not a disk lease).

Comment 28 Francesco Romani 2018-07-24 14:18:58 UTC
VM leases are automatically set by Engine when the VM is set to be HA, so it should be OK.

Comment 30 Israel Pinto 2018-07-25 12:14:39 UTC
See: 
https://bugzilla.redhat.com/show_bug.cgi?id=1605172#c27

Comment 32 errata-xmlrpc 2018-07-31 17:50:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2319

