Bug 1590063 - VM was destroyed on destination after successful migration due to missing the 'device' key on the lease device
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.1.9
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.3.1
Target Release: 4.3.0
Assignee: Francesco Romani
QA Contact: Polina
URL:
Whiteboard:
Depends On:
Blocks: 1605172
 
Reported: 2018-06-12 01:22 UTC by Germano Veit Michel
Modified: 2021-09-09 14:34 UTC (History)
15 users

Fixed In Version: v4.30.3
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1605172 (view as bug list)
Environment:
Last Closed: 2019-05-08 12:36:02 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1590068 0 medium CLOSED HA VM died on both source and destination during failed migration, but was not restarted. 2021-09-09 14:36:00 UTC
Red Hat Issue Tracker RHV-43519 0 None None None 2021-09-09 14:34:49 UTC
Red Hat Knowledge Base (Solution) 3523081 0 None None None 2018-07-09 01:26:40 UTC
Red Hat Product Errata RHBA-2019:1077 0 None None None 2019-05-08 12:36:24 UTC
oVirt gerrit 92160 0 master MERGED vm: lease: on hotplug, ensure the 'device' key 2020-07-02 06:42:10 UTC
oVirt gerrit 92197 0 ovirt-4.2 MERGED vm: lease: on hotplug, ensure the 'device' key 2020-07-02 06:42:10 UTC
oVirt gerrit 93046 0 master MERGED vm: lease: fix params on migration destination 2020-07-02 06:42:10 UTC
oVirt gerrit 93141 0 ovirt-4.2 MERGED vm: lease: fix params on migration destination 2020-07-02 06:42:10 UTC

Internal Links: 1590068

Description Germano Veit Michel 2018-06-12 01:22:29 UTC
Description of problem:

A migration completed fine: the VM ran on the destination and shut down on the source.
But right after resuming on the destination, VDSM hit an exception and destroyed the VM.
So the VM was down on both source and destination, causing an outage.

On destination:

1. VM runs after completing the migration:
2018-06-10 15:27:50,811+0200 INFO  (libvirt/events) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') CPU running: onResume (vm:5119)

2. VDSM hits an exception on update_device_info() and kills it
2018-06-10 15:27:50,821+0200 ERROR (vm/50cab403) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') The vm start process failed (vm:631)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 581, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/share/vdsm/virt/vm.py", line 3284, in _completeIncomingMigration
    self._domDependentInit()
  File "/usr/share/vdsm/virt/vm.py", line 1850, in _domDependentInit
    self._vmDependentInit()
  File "/usr/share/vdsm/virt/vm.py", line 1866, in _vmDependentInit
    self._getUnderlyingVmDevicesInfo()
  File "/usr/share/vdsm/virt/vm.py", line 1799, in _getUnderlyingVmDevicesInfo
    vmdevices.common.update_device_info(self, self._devices)
  File "/usr/share/vdsm/virt/vmdevices/common.py", line 87, in update_device_info
    core.Console.update_device_info(vm, devices[hwclass.CONSOLE])
  File "/usr/share/vdsm/virt/vmdevices/core.py", line 249, in update_device_info
    if dev['device'] == hwclass.CONSOLE and \
KeyError: 'device'

2018-06-10 15:27:50,828+0200 INFO  (vm/50cab403) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') Changed state to Down: 'device' (code=1) (vm:1259)

2018-06-10 15:27:50,838+0200 INFO  (jsonrpc/2) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') _destroyVmGraceful attempt #0 (vm:4334)

After parsing the device list received by vdsm from the source host, I could find only one device missing the 'device' key:

{"path": "/dev/7688a6ed-3742-4cee-82d3-233775112670/xleases", "sd_id": "7688a6ed-3742-4cee-82d3-233775112670", "type": "lease", "lease_id": "50cab403-bf5d-422f-9fac-d3ca3cc57a66", "offset": 3145728}

This is a lease device, and indeed it's missing the 'device' key.
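
For illustration, here is a minimal Python sketch (not the actual vdsm code; names are simplified) of why such a dict breaks the console update path shown in the traceback above:

CONSOLE = 'console'

def update_console_info(devices):
    # Same pattern as core.Console.update_device_info(): a plain dict
    # lookup that assumes every device dict carries a 'device' key.
    for dev in devices:
        if dev['device'] == CONSOLE:
            pass  # the real code would update alias/address here

devices = [
    {'device': 'console', 'type': 'console'},
    # hot-plugged lease as received from the source host: no 'device' key
    {'type': 'lease',
     'sd_id': '7688a6ed-3742-4cee-82d3-233775112670',
     'lease_id': '50cab403-bf5d-422f-9fac-d3ca3cc57a66',
     'offset': 3145728},
]

update_console_info(devices)  # raises KeyError: 'device'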

I can partially reproduce this behavior in 4.2.3:
1) If I start the VM with the lease device already configured, VDSM reports the 'device' key for the device.
2) If I hot-plug the lease device, it is missing the 'device' key, see:

            {
                "path": "/rhev/data-center/mnt/10.64.24.33:_exports_data/9f076c60-1a5f-47f8-80ea-687498984df0/dom_md/xleases", 
                "sd_id": "9f076c60-1a5f-47f8-80ea-687498984df0", 
                "type": "lease", 
                "lease_id": "3f3267ab-b2df-47c0-9b98-c8fd45ba485b", 
                "offset": 6291456
            }

So there is a difference in the devices list depending on whether the lease was hot-plugged or configured before start. But I can't make the migration fail on the latest VDSM; not sure what could have fixed it.

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.9.2-0.1.el7.noarch
vdsm-4.19.50-1.el7ev.x86_64

How reproducible:
Partially

Steps to Reproduce:
1. Configure lease with the VM up

Actual results:
'device' key is missing (?)

Expected results:
'device' key is present (? - not sure?)

Comment 2 Yaniv Kaul 2018-06-12 06:04:49 UTC
Arik, can you take a look? Is it something we've fixed?

Comment 3 Francesco Romani 2018-06-12 06:48:09 UTC
(In reply to Germano Veit Michel from comment #0)
> Description of problem:
> 
> A migration completed fine, the VM ran on destination and shutdown on the
> source.
> But right after resuming on destination, VDSM hit an exception in and
> destroyed the VM. 
> So the VM was down on both source and destination, causing an outage.
> 
> On destination:
> 
> 1. VM runs after completing the migration:
> 2018-06-10 15:27:50,811+0200 INFO  (libvirt/events) [virt.vm]
> (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') CPU running: onResume (vm:5119)
> 
> 2. VDSM hits an exception on update_device_info() and kills it
> 2018-06-10 15:27:50,821+0200 ERROR (vm/50cab403) [virt.vm]
> (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') The vm start process failed
> (vm:631)
> Traceback (most recent call last):
>   File "/usr/share/vdsm/virt/vm.py", line 581, in _startUnderlyingVm
>     self._completeIncomingMigration()
>   File "/usr/share/vdsm/virt/vm.py", line 3284, in _completeIncomingMigration
>     self._domDependentInit()
>   File "/usr/share/vdsm/virt/vm.py", line 1850, in _domDependentInit
>     self._vmDependentInit()
>   File "/usr/share/vdsm/virt/vm.py", line 1866, in _vmDependentInit
>     self._getUnderlyingVmDevicesInfo()
>   File "/usr/share/vdsm/virt/vm.py", line 1799, in
> _getUnderlyingVmDevicesInfo
>     vmdevices.common.update_device_info(self, self._devices)
>   File "/usr/share/vdsm/virt/vmdevices/common.py", line 87, in
> update_device_info
>     core.Console.update_device_info(vm, devices[hwclass.CONSOLE])
>   File "/usr/share/vdsm/virt/vmdevices/core.py", line 249, in
> update_device_info
>     if dev['device'] == hwclass.CONSOLE and \
> KeyError: 'device'
[...]
> So there is a difference in the devices list if the list was hot-plugged or
> not. But I can't make the migration fail on latest VDSM, not sure what could
> have fixed it.

A large factor is most likely the engine XML change, which happened between 4.1 and 4.2.

The most likely cause is indeed the missing 'device' key; we need to learn whether it happens only on lease hotplug or if there are other cases.

Comment 4 Michal Skrivanek 2018-06-12 06:56:22 UTC
the code doesn't really look changed in 4.1. I suspect no one was really using lease hotplug in 4.1; it was a very late feature targeted at a very specific, narrow use case, originally without hotplugging. It's quite possible it slipped through testing.

Germano, since we do not plan any further 4.1.z, would it make sense to try to reproduce this in 4.2?

Comment 5 Germano Veit Michel 2018-06-12 07:01:16 UTC
(In reply to Michal Skrivanek from comment #4)
> the code doesn't really look changed in 4.1. I suspect no one was really
> using hotplug lease in 4.1, it was a very late feature targeted for a very
> specific narrow usecase, originally without hotplugging. It's quite possible
> it slipped through testing.
> 
> Germano, since we do not plan any further 4.1.z, would it make sense to try
> to reproduce this in 4.2?

Hi Michal,

Sorry if I was not clear enough. I did reproduce this partially in 4.2.3. The 'device' key is missing on the lease device even in 4.2.3. But the VM migrates fine, it doesn't hit that exception.

Comment 6 Michal Skrivanek 2018-06-12 07:07:18 UTC
(In reply to Germano Veit Michel from comment #5)
> (In reply to Michal Skrivanek from comment #4)
> > the code doesn't really look changed in 4.1. I suspect no one was really
> > using hotplug lease in 4.1, it was a very late feature targeted for a very
> > specific narrow usecase, originally without hotplugging. It's quite possible
> > it slipped through testing.
> > 
> > Germano, since we do not plan any further 4.1.z, would it make sense to try
> > to reproduce this in 4.2?
> 
> Hi Michal,
> 
> Sorry if I was not clear enough. I did reproduce this partially in 4.2.3.
> The 'device' key is missing on the lease device even in 4.2.3. But the VM
> migrates fine, it doesn't hit that exception.

ah, sorry, missed that. Great, can you attach that separately as well?
Did you have serial console enabled? Can you try with/without it? It seems that’s what happened in VW; it’s possible the console is the only place where it checks the 'device' key.

Also, what about the target release then: since we do not plan 4.1.z, is a 4.2 fix enough?


Separately, Arik, please take a look at why rerun was not triggered

Comment 7 Francesco Romani 2018-06-12 07:12:12 UTC
(In reply to Germano Veit Michel from comment #5)
> (In reply to Michal Skrivanek from comment #4)
> > the code doesn't really look changed in 4.1. I suspect no one was really
> > using hotplug lease in 4.1, it was a very late feature targeted for a very
> > specific narrow usecase, originally without hotplugging. It's quite possible
> > it slipped through testing.
> > 
> > Germano, since we do not plan any further 4.1.z, would it make sense to try
> > to reproduce this in 4.2?
> 
> Hi Michal,
> 
> Sorry if I was not clear enough. I did reproduce this partially in 4.2.3.
> The 'device' key is missing on the lease device even in 4.2.3. But the VM
> migrates fine, it doesn't hit that exception.

From my initial investigation, it should work like that:

In 4.1:
- lease devices created at VM start look OK
- lease devices hotplugged lack the 'device' key, which causes this bug

In 4.2, using Engine XML:
- lease devices created at VM start look OK
- lease devices hotplugged lack the 'device' key, but on the migration flow we use the domain XML as authoritative, so we call the very same code as in the VM start flow, thus sidestepping this bug

In 4.2, using vm.conf (aka 4.1 compatibility mode):
- lease devices created at VM start look OK
- lease devices hotplugged lack the 'device' key, and on the migration flow we use the vm params as authoritative, thus we hit the very same bug.

We call the console code anyway: we are indeed looking in the VM device parameters (which we don't know upfront) to check if we have a console device to update. I don't think we can skip this check.
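
For reference, the merged patches in the Links section are titled "vm: lease: on hotplug, ensure the 'device' key". A hedged sketch of that general direction (the function and constant names below are illustrative, not vdsm's actual API): normalize the lease params at hotplug time so that later per-device code, and the vm.conf-based migration path, always find the key they expect.

LEASE = 'lease'

def normalize_lease_params(params):
    # Return a copy of the hotplug lease params with the keys that the
    # per-device code expects ('type' and 'device') always present.
    fixed = dict(params)
    fixed.setdefault('type', LEASE)
    fixed.setdefault('device', LEASE)
    return fixed

hotplugged = {
    'path': '/rhev/data-center/mnt/10.64.24.33:_exports_data/'
            '9f076c60-1a5f-47f8-80ea-687498984df0/dom_md/xleases',
    'sd_id': '9f076c60-1a5f-47f8-80ea-687498984df0',
    'lease_id': '3f3267ab-b2df-47c0-9b98-c8fd45ba485b',
    'offset': 6291456,
    'type': 'lease',
}

print(normalize_lease_params(hotplugged)['device'])  # -> 'lease'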

Comment 8 Francesco Romani 2018-06-12 08:47:39 UTC
(In reply to Francesco Romani from comment #7)
> We call the console code anyway: we are indeed looking in the VM device
> parameters (which we don't know upfront) to check if we have a console
> device to update. I don't think we can skip this check.

We indeed call it anyway, but we actually run the logic only if we have a console device in the domain XML. So it could work without configured console devices; it needs to be tested.

Comment 9 Michal Skrivanek 2018-06-12 10:29:24 UTC
(In reply to Francesco Romani from comment #8)
> (In reply to Francesco Romani from comment #7)
> > We call the console code anyway: we are indeed looking in the VM device
> > parameters (which we don't know upfront) to check if we have a console
> > device to update. I don't think we can skip this check.

we can definitely avoid raising in that case, because we know that when the 'device' key is missing it's not a console device :)
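
A minimal sketch of that idea (illustrative only, not the actual vdsm change): use dict.get() so device dicts without a 'device' key are simply skipped rather than raising and taking the VM down.

CONSOLE = 'console'

def is_console(dev):
    # dict.get() returns None when the key is absent, so a hot-plugged
    # lease dict lacking 'device' is ignored instead of raising KeyError.
    return dev.get('device') == CONSOLE

devices = [
    {'device': 'console', 'type': 'console'},
    {'type': 'lease',
     'lease_id': '50cab403-bf5d-422f-9fac-d3ca3cc57a66',
     'offset': 3145728},  # no 'device' key
]

print(sum(1 for dev in devices if is_console(dev)))  # -> 1, no exception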

Comment 10 Michal Skrivanek 2018-06-12 10:30:33 UTC
(no reason for private comments)

Comment 11 Arik 2018-06-12 10:41:02 UTC
> Separately, Arik, please take a look at why rerun was not triggered

Germano, can you please provide the engine log?

Comment 13 Arik 2018-06-12 14:01:35 UTC
> Separately, Arik, please take a look at why rerun was not triggered

Steffen, thanks.

In order to initiate another attempt to migrate the VM, the VM must be running on the source host. However, in this case, the VM died on the source host:
2018-06-10 15:27:52,550+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-3) [] VM '50cab403-bf5d-422f-9fac-d3ca3cc57a66' was reported as Down on VDS 'b6732abc-9733-411b-bc6c-723315370d79'.

Comment 14 Germano Veit Michel 2018-06-12 23:48:43 UTC
And it gets even more interesting....

Test on engine 4.2.3 and vdsm 4.20.27:

1. Create VM without Lease but with Virtio Serial Console device
2. Start VM
3. Hot-plug a Lease
4. Migrate

Result: migration succeeds and device key "appears" on destination.

Pre-migration:
{
  "path": "/rhev/data-center/mnt/10.64.24.33:_exports_data/9f076c60-1a5f-47f8-80ea-687498984df0/dom_md/xleases", 
  "sd_id": "9f076c60-1a5f-47f8-80ea-687498984df0", 
  "type": "lease", 
  "lease_id": "e313e714-00c0-4819-9d83-a96c55fe0e1d", 
  "offset": 3145728
}

Post-migration:
{
  "lease_id": "e313e714-00c0-4819-9d83-a96c55fe0e1d", 
  "sd_id": "9f076c60-1a5f-47f8-80ea-687498984df0", 
  "offset": "3145728", 
  "device": "lease", 
  "path": "/rhev/data-center/mnt/10.64.24.33:_exports_data/9f076c60-1a5f-47f8-80ea-687498984df0/dom_md/xleases", 
  "type": "lease"
}

Comment 15 Francesco Romani 2018-06-13 06:27:07 UTC
(In reply to Germano Veit Michel from comment #14)
> And it gets even more interesting....
> 
> Test on engine 4.2.3 and vdsm 4.20.27:
> 
> 1. Create VM without Lease but with Virtio Serial Console device
> 2. Start VM
> 3. Hot-plug a Lease
> 4. Migrate
> 
> Result: migration succeeds and device key "appears" on destination.
> 
> Pre-migration:
> {
>   "path":
> "/rhev/data-center/mnt/10.64.24.33:_exports_data/9f076c60-1a5f-47f8-80ea-
> 687498984df0/dom_md/xleases", 
>   "sd_id": "9f076c60-1a5f-47f8-80ea-687498984df0", 
>   "type": "lease", 
>   "lease_id": "e313e714-00c0-4819-9d83-a96c55fe0e1d", 
>   "offset": 3145728
> }
> 
> Post-migration:
> {
>   "lease_id": "e313e714-00c0-4819-9d83-a96c55fe0e1d", 
>   "sd_id": "9f076c60-1a5f-47f8-80ea-687498984df0", 
>   "offset": "3145728", 
>   "device": "lease", 
>   "path":
> "/rhev/data-center/mnt/10.64.24.33:_exports_data/9f076c60-1a5f-47f8-80ea-
> 687498984df0/dom_md/xleases", 
>   "type": "lease"
> }

This is expected and a (beneficial) side effect of the behaviour documented in
https://bugzilla.redhat.com/show_bug.cgi?id=1590063#c7

Comment 16 Germano Veit Michel 2018-06-13 06:34:47 UTC
(In reply to Francesco Romani from comment #15)
> This is expected and a (beneficial) side effect of the behaviour documented
> in
> https://bugzilla.redhat.com/show_bug.cgi?id=1590063#c7
Ahh, this:

(In reply to Francesco Romani from comment #7)
> In 4.2, using Engine XML
> - lease devices created at VM start looks OK
> - lease devices hotplugged lack the 'device' key, but on migration flow we
> use the domain XML as authoritative, thus we call the very same code as in
> the VM start flow, thus sidestepping this bug
OK, now I finally understood why I can't hit this on 4.2.

Do you need any more testing from me?

Comment 17 Francesco Romani 2018-06-13 06:49:33 UTC
(In reply to Germano Veit Michel from comment #16)
> (In reply to Francesco Romani from comment #15)
> > This is expected and a (beneficial) side effect of the behaviour documented
> > in
> > https://bugzilla.redhat.com/show_bug.cgi?id=1590063#c7
> Ahh, this:
> 
> (In reply to Francesco Romani from comment #7)
> > In 4.2, using Engine XML
> > - lease devices created at VM start looks OK
> > - lease devices hotplugged lack the 'device' key, but on migration flow we
> > use the domain XML as authoritative, thus we call the very same code as in
> > the VM start flow, thus sidestepping this bug
> OK, now I finally understood why I cant hit this on 4.2.
> 
> Do you need any more testing from me?

I think we're fine now, thanks. I posted a fix candidate for 4.2 - we need to support 4.1 compatibility mode in 4.2.z, so we will need a fix here.

Comment 18 Francesco Romani 2018-06-13 11:16:28 UTC
Steps to reproduce:

1. Set up a 4.1 cluster, but install Vdsm 4.20.z on it.
2. Create VM without Lease but with Virtio Serial Console device
3. Start VM
4. Hot-plug a Lease
5. Migrate

Without this fix, we will have the behaviour documented in this bug.
With this fix, the migration will work and the VM will run after the migration.

Comment 19 Francesco Romani 2018-06-18 07:50:05 UTC
This is an internal bug caused by an unexpected internal interaction, no doc_text

Comment 20 RHV bug bot 2018-07-02 15:34:26 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2.z': '?'}', ]

For more info please contact: rhv-devops

Comment 21 Dusan Fodor 2018-07-12 14:21:55 UTC
Please clone properly.

Comment 25 Pedut 2018-11-14 10:00:28 UTC
Hi Francesco,

I went over the thread, and before I start verifying this bug I want your approval to check this matrix:

Source host                     | Destination host                | cluster
4.1 with vdsm 4.19 and rhel 7.5 | 4.3 with vdsm 4.30 and rhel 7.6 | 4.1
4.2 with vdsm 4.20 and rhel 7.6 | 4.3 with vdsm 4.30 and rhel 7.6 | 4.2
4.2 with vdsm 4.20 and rhel 7.6 | 4.3 with vdsm 4.30 and rhel 7.6 | 4.1

Comment 26 Francesco Romani 2018-11-14 10:40:19 UTC
(In reply to Pedut from comment #25)
> Hi Francesco,
> 
> I go over the thread and before I'm starting to verify this bug I want your
> approval to check this matrix:
> 
> Source host                     | Destination host                | cluster
> 4.1 with vdsm 4.19 and rhel 7.5 | 4.3 with vdsm 4.30 and rhel 7.6 |4.1

OK, we should have both sides on 4.1 level

> 4.2 with vdsm 4.20 and rhel 7.6 | 4.3 with vdsm 4.30 and rhel 7.6 |4.2

OK, we should have both sides on 4.2 level

> 4.2 with vdsm 4.20 and rhel 7.6 | 4.3 with vdsm 4.30 and rhel 7.6 |4.1

OK, we should see the 4.1 -> 4.2 upgrade

So it looks OK.

Comment 27 Pedut 2018-11-28 13:12:16 UTC
Although the VM was not destroyed on the destination, it did not work as it should.

Source host                     | Destination host                | cluster
4.1 with vdsm 4.19 and rhel 7.5 | 4.3 with vdsm 4.30 and rhel 7.6 | 4.1

> Migration succeeded but the VM failed to retrieve Hosted Engine HA info

4.2 with vdsm 4.20 and rhel 7.6 | 4.3 with vdsm 4.30 and rhel 7.6 | 4.2

> Migration failed

4.2 with vdsm 4.20 and rhel 7.6 | 4.3 with vdsm 4.30 and rhel 7.6 | 4.1

> Migration failed

Comment 28 Francesco Romani 2018-11-28 13:43:04 UTC
(In reply to Pedut from comment #27)
> Although the VM was not destroyed in the destination it did not work as it
> should.
> 
> Source host                     | Destination host                | cluster
> 4.1 with vdsm 4.19 and rhel 7.5 | 4.3 with vdsm 4.30 and rhel 7.6 |4.1
> > Migration succedded but the vm failed to retrieve Hosted Engine HA info 

OK, but how's that relevant for this bug?

> 4.2 with vdsm 4.20 and rhel 7.6 | 4.3 with vdsm 4.30 and rhel 7.6 |4.2
> 
> > Migration failed

We need to see the reason (and the logs) to assess if it is relevant for this bug

> 4.2 with vdsm 4.20 and rhel 7.6 | 4.3 with vdsm 4.30 and rhel 7.6 |4.1
> 
> > Migration failed

same

Comment 29 Ryan Barry 2019-01-21 14:54:00 UTC
Re-targeting to 4.3.1 since it is missing a patch, an acked blocker flag, or both

Comment 30 Ryan Barry 2019-01-31 08:23:24 UTC
Francesco is fairly sure this is fixed.

Can you please test?

Comment 32 Polina 2019-02-13 14:47:13 UTC
verification version - ovirt-engine-dwh-4.3.1-0.0.master.20190110155219.el7.noarch

The following migration combinations were successfully tested.

The setup has a 4.1 cluster with three hosts:
host1 - rhel 7.5 & vdsm-4.19.51-1.el7ev.x86_64
host2 - rhel 7.6 & vdsm-4.20.46-1.el7ev.x86_64
host3 - rhel 7.6 & vdsm-4.30.6-19.git754a02d.el7.x86_64

An HA VM with a hot-plugged lease (the same as a non-HA VM) migrates successfully between all the combinations.

Also tested for cluster 4.2: migration between host2 <=> host3 (4.2 with vdsm 4.20 and rhel 7.6 | 4.3 with vdsm 4.30 and rhel 7.6 | 4.2, according to comment 25).

Comment 34 errata-xmlrpc 2019-05-08 12:36:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1077

Comment 35 Daniel Gur 2019-08-28 13:14:52 UTC
sync2jira

Comment 36 Daniel Gur 2019-08-28 13:19:55 UTC
sync2jira

