Red Hat Bugzilla – Bug 1282581
VDSM failure on RNG device conf on VM migration after upgrade to oVirt 3.6
Last modified: 2016-02-23 04:19:01 EST
Created attachment 1095056 [details]
Description of problem:
Jason Keltz <email@example.com> wrote:
I installed 3.6 on the engine. I rebooted the engine.
The 3 hosts were still running vdsm from 3.5. I checked back in the yum log, and it was 4.16.26-0.el7.
On the first host upgrade (virt1), I made a mistake. After bringing in the 3.6 repo, I upgraded the packages with just "yum update". However, I know that I should have put the host into maintenance mode first. After the updates installed, I put the host into maintenance mode, and it migrated the VMs off, during which I saw more than one failed VM migration.
I'm willing to accept the failures there because I should have put the host into maintenance mode first. Live and learn!
I had two other hosts on which to do this right. For virt2 and virt3, I put the hosts into maintenance mode first. However, the same problem occurred with failed migrations. I proceeded anyway, brought the failed VMs back up elsewhere, applied the updates, and rebooted the hosts.
So now, 3.6 is installed on the engine and the 3 hosts, and they are all rebooted.
I tried another migration, and again, there were failures, so this isn't specifically related to just 3.6.
Martin Polednik <firstname.lastname@example.org> wrote:
The issue is that the 3.5 engine created the RNG device without sending the device key (which should have been 'rng', but this wasn't properly documented in the API, as fixed in https://gerrit.ovirt.org/#/c/43166/). This caused the getUnderlyingRngDevice method to fail to match the device (fixed in https://gerrit.ovirt.org/#/c/40095/), so it would be treated as an unknown device (for which the notion of 'source' isn't known). The 3.6 engine should handle it correctly: https://gerrit.ovirt.org/#/c/43165/.
The implication is that when a VM is created in a 3.5 environment and moved to a 3.6 environment, the matching will work, but there will be two RNG devices instead of the single one. The same goes for migration.
I'm not sure about the fix yet. To rescue the 3.6 VM, we would have to either remove the duplicate device without specParams (meaning that the address would be lost) or remove the original device but add its specParams to the new device. A temporary fix would be a hook that does this.
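The hook Martin mentions could look roughly like this. This is a minimal sketch, not the actual VDSM hook; the device conf dicts are simplified assumptions (real VDSM confs carry more keys), and merge_duplicate_rng is a hypothetical helper name:

```python
# Hypothetical sketch of the duplicate-RNG cleanup described above.
# Assumes simplified device conf dicts; not actual VDSM code.

def merge_duplicate_rng(devices):
    """Collapse two 'rng' entries in a VM device list into one:
    keep the entry that has an address, copy specParams (e.g.
    {'source': 'random'}) from the entry that has them, and drop
    the now-redundant duplicate."""
    rngs = [d for d in devices if d.get('type') == 'rng']
    if len(rngs) < 2:
        return devices  # nothing to fix

    # One entry carries the specParams, the other carries the address.
    with_spec = next((d for d in rngs if d.get('specParams')), None)
    with_addr = next((d for d in rngs if d.get('address')), None)
    if with_spec is None or with_addr is None or with_spec is with_addr:
        return devices  # shape doesn't match the bug; leave untouched

    # Preserve the address, rescue the specParams, drop the duplicate.
    with_addr['specParams'] = with_spec['specParams']
    return [d for d in devices if d is not with_spec]
```

This mirrors the second option above: the device holding the address survives, and the specParams from the other entry are grafted onto it before the duplicate is removed.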
Steps to Reproduce:
1. Install oVirt 3.5 and start the engine.
2. Create a cluster with one host and deploy VDSM on it. Make sure that VDSM version is 4.16.*.
3. Create a VM and start it.
4. Stop the engine.
5. Install oVirt 3.6.
6. Run engine-setup and link the new engine to the same DB, upgrade it.
7. Start the engine.
8. Add another host to the cluster and deploy VDSM on it. Make sure that VDSM version is 4.17.*.
9. Migrate the VM from the first host to the second, then back to the first and once again to the second.
Actual results:
The VM goes down during the migrations.
Expected results:
The migrations succeed.
Why isn't this a dup of bug 1233825 ?
(In reply to Yaniv Kaul from comment #1)
> Why isn't this a dup of bug 1233825 ?
This is a completely different problem. Bug 1233825 is about an incompatibility between the Engine and VDSM: the Engine didn't add the 'device' field that VDSM was relying on. That problem was fixed before the 3.6 release: now the Engine always puts the 'device' field in, and VDSM uses 'type' to distinguish RNG devices.
But in this bug, the VM with an RNG device was started on an old VDSM that is completely unaware of RNG devices. The old VDSM created an unknown device conf instead of an RNG device conf. When some of the hosts in the cluster were upgraded to the recent VDSM, the VM was migrated to an upgraded host; the recent VDSM treats this unknown device conf (which was preserved during migration) as an RNG device conf (because it contains 'type' == 'rng') and fails while trying to read the 'source' spec param, which is absent.
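The failure mode can be sketched in a few lines. This is illustrative only: the conf dict shape and the rng_source helper are simplified assumptions, not real VDSM code.

```python
# Illustrative sketch of the failure described above; the conf shape
# and this helper are assumptions, not actual VDSM code.

# Device conf as preserved from an old (3.5) VDSM: 'type' says rng,
# but the 'source' spec param the new VDSM expects was never written.
conf_from_old_vdsm = {'type': 'rng', 'specParams': {}}

def rng_source(conf):
    # New VDSM matches the device by 'type' == 'rng' and then reads
    # the 'source' spec param unconditionally -- this is where a
    # KeyError hits for a conf migrated from an old host.
    return conf['specParams']['source']
```

A conf created by a 3.6-aware engine would carry `{'specParams': {'source': 'random'}}` and read cleanly; the migrated conf above raises KeyError instead.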
(In reply to Shmuel Melamud from comment #2)
> (In reply to Yaniv Kaul from comment #1)
> > Why isn't this a dup of bug 1233825 ?
> This is completely different problem. Bug 1233825 is about incompatibility
> between the Engine and VDSM [...]
OK, I still think these are two sides of the same issue. But if this is the exact issue, please fix the title of the bug to reflect it. The title does not mention that the problem is with the RNG device, for example.
Still need to update flags (I don't have permissions for this) and create clone for 3.6.
Fixed bug tickets must have version flags set prior to fixing them. Please set the correct version flags and move the bugs back to the previous status after this is corrected.
Please use this same bug for the 3.6 backport.
It actually made it into 3.6.1.
Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.
- Build versions used before upgrading to RHEVM 3.6:
- Build version used for upgrading the engine:
- Build version of the host added after the migration:
1. Use rhevm-3.5, enable RNG (/dev/random source) on the cluster, and create a RHEL7 VM with RNG enabled (also /dev/random source).
2. Run the VM and verify that the host's /dev/random is attached to the VM properly:
On the host, fetch the qemu-kvm PID, then run:
# ps aux | grep <qemu-kvm PID>
Verify that the QEMU process is running with the following parameters:
-object rng-random,id=rng0,filename=/dev/random -device virtio-rng-pci,rng=rng0
3. Verify host and VM RNG functionality using the following commands:
# dd count=100 bs=100000 if=/dev/random of=/dev/stdout | hexdump -c
On the host, verify that entropy starvation does not occur:
# watch -d -n 1 cat /proc/sys/kernel/random/entropy_avail
4. Upgrade the engine to 3.6 using the following procedure:
# yum install http://bob.eng.lab.tlv.redhat.com/builds/latest_3.6/rhev-release-3.6.3-3-001.noarch.rpm
# yum update rhevm-setup
5. After the engine is up, validate VM and host RNG functionality (repeat the step 3 procedure).
Add another host to the cluster with the latest VDSM (vdsm-4.17.21-0.el7ev.noarch).
6. Migrate the VM to the new host; verify that the VM does not go down and that RNG functionality is not affected by the migration.
7. Migrate the VM back to the first host; verify that the VM does not go down and that RNG functionality is not affected by the migration.
8. Repeat steps 6-7 a few times.