Bug 1282581

Summary: VDSM failure on RNG device conf on VM migration after upgrade to oVirt 3.6
Product: [oVirt] vdsm
Component: Core
Version: 4.17.11
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Target Milestone: ovirt-3.6.1
Target Release: 4.17.12
Reporter: Shmuel Melamud <smelamud>
Assignee: Shmuel Melamud <smelamud>
QA Contact: Nisim Simsolo <nsimsolo>
Docs Contact:
CC: amureini, bugs, danken, gklein, jas, mgoldboi, michal.skrivanek, nsimsolo, philippe, smelamud
Flags: rule-engine: ovirt-3.6.z+
       mgoldboi: planning_ack+
       rule-engine: devel_ack+
       mavital: testing_ack+
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-23 09:19:01 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1285700

Attachments:
    oVirt logs (flags: none)

Description Shmuel Melamud 2015-11-16 19:26:49 UTC
Created attachment 1095056 [details]
oVirt logs

Description of problem:

Jason Keltz <jas.ca> wrote:
I installed 3.6 on the engine and rebooted the engine.
The 3 hosts were still running vdsm from 3.5; checking back in the yum log, it was 4.16.26-0.el7.
On the first host upgrade (virt1), I made a mistake. After bringing in the 3.6 repo, I upgraded the packages with just "yum update", although I know I should have put the host into maintenance mode first. After the updates installed, I put the host into maintenance mode, and it migrated the VMs off, during which I saw more than one failed VM migration.
I'm willing to accept the failures there because I should have put the host into maintenance mode first. Live and learn!
I still had two other hosts on which to do this right. For virt2 and virt3, I put the hosts into maintenance mode first. However, the same problem occurred with failed migrations. I proceeded anyway, brought the failed VMs back up elsewhere, applied the updates, and rebooted the hosts.
So now 3.6 is installed on the engine and the 3 hosts, and they have all been rebooted.
I tried another migration, and again there were failures, so this isn't specifically related to just the 3.6 upgrade.

Martin Polednik <mpolednik> wrote:
The issue is that the 3.5 engine created the RNG device without sending the 'device' key (which should have been 'rng', but this wasn't properly documented in the API; fixed in https://gerrit.ovirt.org/#/c/43166/). This caused the getUnderlyingRngDevice method to fail to match the device (fixed in https://gerrit.ovirt.org/#/c/40095/), so it was treated as an unknown device (for which the notion of 'source' isn't known). The 3.6 engine should handle it correctly: https://gerrit.ovirt.org/#/c/43165/.

The implication is that when a VM is created in a 3.5 environment and moved to a 3.6 environment, the matching will work, but there will be two RNG device entries for the single device. The same goes for migration.

I'm not sure about the fix yet. To rescue the 3.6 VM we would have to either remove the duplicate device that has no specParams (meaning the address would be lost) or remove the original device and add its specParams to the new device. A temporary fix would be a hook that does this.
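
To illustrate the duplicate-conf situation, here is a minimal, hypothetical sketch (Python; the helper and the conf dicts are invented for illustration and are not VDSM source). It only assumes that device confs are plain dicts matched by their 'device' key, as described above:

    # Hypothetical illustration only -- not actual VDSM code.

    def match_rng_confs(confs):
        """Match RNG device confs by their 'device' key."""
        return [c for c in confs if c.get('device') == 'rng']

    # Conf as a 3.5 engine might send it: the 'device' key is missing.
    engine_conf = {'type': 'rng', 'specParams': {'source': 'random'}}

    # Conf rebuilt from the running domain XML, where the key is present
    # but no specParams were recorded (toy example).
    discovered_conf = {'type': 'rng', 'device': 'rng'}

    confs = [engine_conf, discovered_conf]

    # Only the discovered conf is matched; the engine-supplied one is left
    # behind as an "unknown" device, so there are two entries for a single
    # RNG device.
    print(match_rng_confs(confs))   # [{'type': 'rng', 'device': 'rng'}]
    print(len(confs))               # 2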

How reproducible:
100%

Steps to Reproduce:
1. Install oVirt 3.5 and start the engine.
2. Create a cluster with one host and deploy VDSM on it. Make sure that the VDSM version is 4.16.*.
3. Create a VM and start it.
4. Stop the engine.
5. Install oVirt 3.6.
6. Run engine-setup, link the new engine to the same DB, and upgrade it.
7. Start the engine.
8. Add another host to the cluster and deploy VDSM on it. Make sure that the VDSM version is 4.17.*.
9. Migrate the VM from the first host to the second, then back to the first, and once again to the second.

Actual results:
The VM goes down during the migrations.

Expected results:
The migrations succeed.

Additional info:

Comment 1 Yaniv Kaul 2015-11-17 08:08:03 UTC
Why isn't this a dup of bug 1233825 ?

Comment 2 Shmuel Melamud 2015-11-17 09:43:05 UTC
(In reply to Yaniv Kaul from comment #1)
> Why isn't this a dup of bug 1233825 ?

This is a completely different problem. Bug 1233825 is about an incompatibility between the Engine and VDSM: the Engine didn't add the 'device' field that VDSM was relying on. That problem was fixed before the 3.6 release: the Engine now always sets the 'device' field, and VDSM uses 'type' to distinguish RNG devices.

In this bug, however, the VM with an RNG device was started on an old VDSM that is completely unaware of RNG devices. The old VDSM created an unknown device conf instead of an RNG device conf. When some of the hosts in the cluster were upgraded to the recent VDSM, the VM was migrated to an upgraded host, and the recent VDSM treats this unknown device conf (which was preserved during the migration) as an RNG device conf (because it contains 'type' == RNG) and fails while trying to read the 'source' spec param, which is absent.
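
As a minimal, hypothetical sketch of that failure mode (Python; invented names, not VDSM source, and it only assumes the conf is a plain dict), reading the absent 'source' spec param from a conf preserved by a 4.16 host fails, while a defensive lookup would not:

    # Hypothetical illustration only -- not actual VDSM code.

    # Device conf as preserved by a 4.16 host that knows nothing about RNG
    # devices: the type survives the migration, but no RNG-specific
    # specParams were ever recorded.
    migrated_conf = {'type': 'rng', 'device': 'rng', 'specParams': {}}

    def rng_source_strict(conf):
        # Mirrors the failing pattern: assumes 'source' is always present.
        return conf['specParams']['source']

    def rng_source_defensive(conf):
        # A tolerant reader falls back to a default instead of failing.
        return conf.get('specParams', {}).get('source', 'random')

    try:
        rng_source_strict(migrated_conf)
    except KeyError as err:
        print('strict lookup failed:', err)   # strict lookup failed: 'source'

    print('defensive lookup:', rng_source_defensive(migrated_conf))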

Comment 3 Yaniv Kaul 2015-11-17 11:11:11 UTC
(In reply to Shmuel Melamud from comment #2)
> (In reply to Yaniv Kaul from comment #1)
> > Why isn't this a dup of bug 1233825 ?
> 
> This is a completely different problem. Bug 1233825 is about an
> incompatibility between the Engine and VDSM: the Engine didn't add the
> 'device' field that VDSM was relying on. That problem was fixed before
> the 3.6 release: the Engine now always sets the 'device' field, and
> VDSM uses 'type' to distinguish RNG devices.
> 
> In this bug, however, the VM with an RNG device was started on an old
> VDSM that is completely unaware of RNG devices. The old VDSM created an
> unknown device conf instead of an RNG device conf. When some of the
> hosts in the cluster were upgraded to the recent VDSM, the VM was
> migrated to an upgraded host, and the recent VDSM treats this unknown
> device conf (which was preserved during the migration) as an RNG device
> conf (because it contains 'type' == RNG) and fails while trying to read
> the 'source' spec param, which is absent.

Ok, I still think it's two sides of the same issue. But if this is the exact issue, please fix the title of the bug to reflect it. The title does not mention, for example, that the problem is with the RNG device.

Comment 4 Shmuel Melamud 2015-11-17 13:38:14 UTC
Still need to update the flags (I don't have permissions for this) and create a clone for 3.6.

Comment 5 Red Hat Bugzilla Rules Engine 2015-11-17 13:38:17 UTC
Fixed bug tickets must have version flags set prior to fixing them. Please set the correct version flags and move the bugs back to the previous status after this is corrected.

Comment 6 Michal Skrivanek 2015-11-24 11:45:49 UTC
Please use this same bug for the 3.6 backport.

Comment 7 Michal Skrivanek 2015-12-07 13:49:28 UTC
It actually made it into 3.6.1.

Comment 8 Red Hat Bugzilla Rules Engine 2015-12-11 02:27:00 UTC
Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.

Comment 9 Nisim Simsolo 2016-02-18 15:07:05 UTC
Verified.
- Build versions used before upgrading to RHEVM 3.6:
rhevm-3.5.6.2-0.1.el6ev
host:
qemu-kvm-rhev-0.12.1.2-2.479.el6_7.3.x86_64
libvirt-client-0.10.2-54.el6_7.2.x86_64
vdsm-4.16.34-2.el6ev.x86_64
sanlock-2.8-2.el6_5.x86_64

- Build version used for upgrading engine:
rhevm-3.6.3.2-0.1.el6

- Build versions of the host added after the engine upgrade:
qemu-kvm-rhev-2.3.0-31.el7_2.7.x86_64
libvirt-client-1.2.17-13.el7_2.2.x86_64
vdsm-4.17.21-0.el7ev.noarch
sanlock-3.2.4-1.el7.x86_64

Verification scenario:
1. Using rhevm-3.5, enable RNG (/dev/random source) on the cluster and create a RHEL7 VM with RNG enabled (also with the /dev/random source).
2. Run the VM and verify that the host's /dev/random is attached to the VM properly.
   On the host, run the following commands:
   #lsof /dev/random
   (fetch the qemu-kvm PID)
   #ps aux | grep <qemu-kvm PID>
   Verify that the QEMU process is running with the following parameters:
   -object rng-random,id=rng0,filename=/dev/random -device virtio-rng-pci,rng=rng0
3. Verify host and VM RNG functionality using the following commands:
   On the VM:
   #dd count=100 bs=100000 if=/dev/random of=/dev/stdout | hexdump -c
   On the host, verify that entropy starvation does not occur:
   #watch -d -n 1 cat /proc/sys/kernel/random/entropy_avail
4. Upgrade the engine to 3.6 using the following procedure:
   #yum install http://bob.eng.lab.tlv.redhat.com/builds/latest_3.6/rhev-release-3.6.3-3-001.noarch.rpm
   #yum update rhevm-setup
   #engine-setup
5. After the engine is up, validate VM and host RNG functionality (step 3 procedure).
   Add another host to the cluster with the latest VDSM (vdsm-4.17.21-0.el7ev.noarch).
6. Migrate the VM to the new host; verify that the VM does not go down and that RNG functionality is not affected by the migration.
7. Migrate the VM back to the first host and verify that the VM does not go down and that RNG functionality is not affected by the migration.
8. Repeat the migration steps (6-7) a few times.