Bug 1522901 - VM migration 4.2 -> 4.1 fails with virtio-rng device
Summary: VM migration 4.2 -> 4.1 fails with virtio-rng device
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.20.9
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.2.0
Target Release: 4.20.9.1
Assignee: Milan Zamazal
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-12-06 17:27 UTC by Evgheni Dereveanchin
Modified: 2017-12-20 11:41 UTC
CC List: 5 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-12-20 11:41:40 UTC
oVirt Team: Virt
Embargoed:
ykaul: ovirt-4.2+
ykaul: blocker+


Attachments:


Links:
oVirt gerrit 85200 [master, MERGED]: virt: Create also device conf when making devices from XML (last updated 2021-02-02 23:54:15 UTC)
oVirt gerrit 85265 [ovirt-4.2.0, MERGED]: virt: Create also device conf when making devices from XML (last updated 2021-02-02 23:53:30 UTC)

Description Evgheni Dereveanchin 2017-12-06 17:27:31 UTC
Description of problem:
When trying to live-migrate a VM from a 4.2 host back to a 4.1 host, the migration fails and the VM crashes with the following error:


ERROR (jsonrpc/4) [virt.vm] (vmId='...') Alias not found for device type balloon during migration at destination host (vm:4631)
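
For readers following along: the check behind this message is essentially the destination host looking up, for each device in the incoming VM configuration, the libvirt alias recorded by the source. A minimal Python sketch of that kind of lookup (hypothetical function and key names, not the actual vdsm code):

    # Minimal sketch of a destination-side alias lookup (hypothetical names;
    # not the actual vdsm implementation).
    def find_device_alias(device_confs, dev_type):
        for conf in device_confs:
            if conf.get('type') == dev_type:
                alias = conf.get('alias')
                if alias:
                    return alias
        # This is the situation in the log line above: the 4.2 source built
        # its devices from the domain XML without filling the legacy conf,
        # so no alias (or specParams) reached the 4.1 destination.
        raise LookupError('Alias not found for device type %s' % dev_type)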


Version-Release number of selected component (if applicable):
source:      vdsm-4.20.9-1.el7.centos.x86_64
destination: vdsm-4.19.31-1.el7.centos.x86_64

How reproducible:

Steps to Reproduce:
1. Update a 4.1 host to 4.2
2. Live-migrate a VM from an existing 4.1 host to the 4.2 host
3. Live-migrate the VM back to the 4.1 host

Actual results:
VM crashes on second migration

Expected results:
The VM migrates fine both times, since both hosts use the same cluster compatibility version and the VM definitions must therefore match.

Comment 4 Michal Skrivanek 2017-12-06 22:04:28 UTC
Possible mis-initialization of the rng specParams on incoming migration (and possibly on create) in 4.2 when using the vmconf format. That is needed for later migration to <4.2 hosts.

The logs for the VM migration in step 2 are missing, as are the engine version and engine log. Evgheni, can you add those?

The workaround is to remove the virtio-rng device from the VM (which requires a restart), or to upgrade the destination host to 4.2. Decreasing severity.
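
For illustration only: the merged patch title ("virt: Create also device conf when making devices from XML") suggests that when a 4.2 host builds devices from the libvirt domain XML it must also populate the legacy conf entries, including the rng specParams, so a <4.2 destination can reconstruct the device. A hedged sketch of such a conversion for the rng device (the conf keys and the specParams source mapping are assumptions, not the exact vdsm code):

    # Hedged sketch: deriving a legacy-style rng device conf from domain XML.
    # The conf keys and the specParams 'source' mapping only approximate the
    # 4.1-era vmconf format.
    import xml.etree.ElementTree as ET

    RNG_XML = """
    <rng model='virtio'>
      <backend model='random'>/dev/urandom</backend>
    </rng>
    """

    def rng_conf_from_xml(xml_text):
        rng = ET.fromstring(xml_text)
        backend = rng.find('backend')
        path = (backend.text or '').strip()
        # Older hosts expect a symbolic source name in specParams, not a path.
        source = {'/dev/urandom': 'urandom', '/dev/hwrng': 'hwrng'}.get(path, 'random')
        return {
            'type': 'rng',
            'model': rng.get('model'),
            'specParams': {'source': source},
        }

    print(rng_conf_from_xml(RNG_XML))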

Comment 5 Yaniv Kaul 2017-12-07 08:12:36 UTC
(In reply to Michal Skrivanek from comment #4)
> Possible mis-initialization of rng specparam on incoming migration(and
> possibly create) in 4.2 using vmconf format. That is needed for migration to
> <4.2 hosts later on. 
> 
> Missing logs for VM migration in step 2 and engine version and log, Evgheni,
> can you add that?
> 
> Workaround is to remove virtio-rng device from VM (requires restart), or
> upgrade the destination host to 4.2. Decreasing Sev.

I don't think either workaround is valid. IIRC, both require downtime.

Comment 6 Evgheni Dereveanchin 2017-12-07 08:57:41 UTC
The engine is 4.1.7 and the cluster compatibility level is 4.1. I will provide the logs ASAP. In any case, I believe VMs crashing during live migration is a bug; this shouldn't happen. In the worst case, the migration should fail with the VM remaining on the source host.

Comment 10 Michal Skrivanek 2017-12-07 10:36:31 UTC
(In reply to Yaniv Kaul from comment #5)
> (In reply to Michal Skrivanek from comment #4)
> > Possible mis-initialization of rng specparam on incoming migration(and
> > possibly create) in 4.2 using vmconf format. That is needed for migration to
> > <4.2 hosts later on. 
> > 
> > Missing logs for VM migration in step 2 and engine version and log, Evgheni,
> > can you add that?
> > 
> > Workaround is to remove virtio-rng device from VM (requires restart), or
> > upgrade the destination host to 4.2. Decreasing Sev.
> 
> I don't think either workarounds are valid. IIRC, both require downtime.

I disagree. Urgent severity means existing workloads are affected in a significant way without any workaround. Disabling migrations while completing the relatively simple host upgrade on the rest of the cluster sounds feasible enough to me, and it doesn't require downtime.

Comment 11 Michal Skrivanek 2017-12-07 10:39:18 UTC
(In reply to Evgheni Dereveanchin from comment #6)
> Engine is 4.1.7, cluster compatibility level is 4.1. Will provide the logs
> ASAP. In any case, I believe VMs crashing during live migration is a bug,
> this shouldn't happen. Worst case the migration should fail with the VM
> remaining on source host.

Are you saying that on that migration attempt the source VM crashes? IIUC you "only" experience a failed migration while the VM still continues to run (at least somewhere).

Comment 12 Evgheni Dereveanchin 2017-12-07 10:51:20 UTC
Regarding comment #10 - what if there's a scheduling policy enabled on the cluster that auto-migrates VMs to load-balance it? Or ovirt-optimizer, if that still exists?

I, for example, ran into this issue after hitting bz1522878, when an attempt to evacuate VMs caused unexpected crashes.

As for the migration, the VM is running on the destination but is not visible to VDSM. I'm not sure what happens if I try to start it; hopefully VM reservations will kick in and not let the VM start (it will then be stuck in the "down" state on the engine). Alternatively, the VM will start somewhere else and disk corruption will probably occur.

Comment 13 Michal Skrivanek 2017-12-07 10:58:13 UTC
Evgheni, could you also please provide the earlier engine.log? It seems the tracking of that VM is also not entirely correct, so it's interesting to follow the logs from the original VM start time.
Was it started on -03 originally? Where did the engine think it was running when you triggered the migration? Can you confirm/check that at that time the expected original host was _not_ running that VM?

Comment 14 Evgheni Dereveanchin 2017-12-07 11:40:31 UTC
The test VM was created today, then started on ovirt-srv01 (4.2 - it was started there automatically as that host has fewer VMs). It was then successfully migrated to ovirt-srv02. I then migrated it back to ovirt-srv01, which succeeded but once again triggered bz1522878, with VDSM restarting on ovirt-srv01. Then there were several unsuccessful migration attempts until eventually the VM went into the "down" state on the engine.

Comment 16 Milan Zamazal 2017-12-07 12:41:08 UTC
(In reply to Evgheni Dereveanchin from comment #12)

> As for the migration, the VM is running on destination but not visible to
> VDSM

I can confirm that I've seen a situation where the VM was reported as down in the web UI while it was actually running on the destination. So it should be reproducible.

Comment 17 Evgheni Dereveanchin 2017-12-07 14:01:07 UTC
Migrating a VM without a Random Generator device seems to work without problems. From what I recall from bz1337101, RNG is enabled by default on all VMs created after 4.0.

As for VMs running in libvirt but invisible to VDSM, I have this situation right now and am unsure whether it's the same bug or a different one. Please advise if any more data is needed from my side to troubleshoot it.
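
As a side note, a quick way to check whether a running VM actually carries a virtio-rng device is to inspect its domain XML via the libvirt Python bindings. A small sketch (the VM name 'test-vm' is a placeholder):

    # Hedged sketch: check for a virtio-rng device in a running domain's XML
    # using the libvirt Python bindings ('test-vm' is a placeholder name).
    import libvirt
    import xml.etree.ElementTree as ET

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('test-vm')
    tree = ET.fromstring(dom.XMLDesc(0))
    rngs = tree.findall('./devices/rng')
    print('virtio-rng present:', any(r.get('model') == 'virtio' for r in rngs))
    conn.close()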

Comment 18 Evgheni Dereveanchin 2017-12-07 22:52:53 UTC
The patch was applied to the affected host. Migrating a VM after a VDSM restart now succeeds; however, in the admin portal the VM shows up as having no graphical console after migration. I will open a new bug for that after more tests, as the issue described in this BZ is now fixed.

Comment 19 Israel Pinto 2017-12-12 10:45:34 UTC
Status:

Engine: Software Version: 4.2.1-0.0.master.20171211205712.git7b1f4d1.el7.centos
vdsm 4.1: vdsm-4.19.42-1.el7ev.x86_64
vdsm 4.2: vdsm-4.20.9-30.gite026991.el7.centos

Steps:
The steps, as Milan provided them, are:
1. Start a VM on 4.1 with the configuration described in the case
2. Migrate it to 4.2
3. Restart VDSM on 4.2
4. Migrate the VM back to 4.1

Cases and Results: 
1. Run all the VMs below on the 4.2 host and set the host to maintenance - PASS
2. Migrate VM with snapshot from 4.1 to 4.2 - PASS
3. Migrate VM with snapshot from 4.2 to 4.1 - PASS
4. Migrate VM with RNG (urandom) from 4.1 to 4.2 - PASS
5. Migrate VM with RNG (urandom) from 4.1 to 4.2 - PASS
6. Migrate VM with RNG (hwrng) from 4.1 to 4.2 - PASS
7. Migrate VM with RNG (hwrng) from 4.2 to 4.1 - PASS
8. Migrate VM with hotplugged memory and CPU from 4.1 to 4.2 - PASS
9. Migrate VM with hotplugged memory and CPU from 4.2 to 4.1 - PASS
10. Migrate VM with SPICE from 4.1 to 4.2 - PASS
11. Migrate VM with SPICE from 4.2 to 4.1 - PASS
12. Migrate VM with VNC from 4.1 to 4.2 - PASS
13. Migrate VM with VNC from 4.2 to 4.1 - PASS
14. Migrate paused VM from 4.2 to 4.1 - PASS
15. Migrate paused VM from 4.1 to 4.2 - PASS
16. Migrate VM with SR-IOV NIC configured from 4.1 to 4.2 - PASS (reported by mburman)
17. Migrate VM with SR-IOV NIC configured from 4.2 to 4.1 - PASS (reported by mburman)
18. Migrate VM with Direct LUN based disk from 4.1 to 4.2 - Failed to start VM, BZ 1524941
19. Migrate VM with Direct LUN based disk from 4.2 to 4.1 - Failed to start VM, BZ 1524941
20. Migrate headless VM from 4.2 to 4.1 - PASS
21. Migrate headless VM from 4.1 to 4.2 - PASS
22. Migrate smartcard VM from 4.1 to 4.2 - PASS
23. Migrate smartcard VM from 4.1 to 4.2 - PASS

Comment 22 Israel Pinto 2017-12-13 08:47:37 UTC
Verify with:
https://bugzilla.redhat.com/show_bug.cgi?id=1522901#c19

Comment 23 Sandro Bonazzola 2017-12-20 11:41:40 UTC
This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be
resolved in oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

