Bug 1419924

Summary: cluster level 4.1 adds Random Generator to all VMs while it may not be provided by the cluster
Product: [oVirt] ovirt-engine Reporter: Evgheni Dereveanchin <ederevea>
Component: BLL.Virt Assignee: jniederm
Status: CLOSED CURRENTRELEASE QA Contact: Nisim Simsolo <nsimsolo>
Severity: high Docs Contact:
Priority: high    
Version: 4.1.0.4 CC: bugs, ederevea, eedri, gklein, jniederm, mburman, michal.skrivanek, nsimsolo
Target Milestone: ovirt-4.1.1 Flags: gklein: ovirt-4.1+
gklein: blocker+
Target Release: 4.1.1.3   
Hardware: Unspecified   
OS: Linux   
Whiteboard: upgrade
Fixed In Version: Doc Type: Bug Fix
Doc Text:
If a VM was running during a 4.0 -> 4.1 cluster upgrade, had a /dev/random RNG device configured, and had no custom compatibility level set, then the RNG device was not updated from `random` to `urandom` on VM shutdown (or power off or restart). This left the VM with an incompatible RNG device, which prevented it from running. Workaround: remove and re-add the RNG device on the VM. Fix: running VMs now get the updated RNG device stored in their next-run configuration during the cluster update, so the RNG device is properly updated on VM shutdown.
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-04-21 09:54:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1337101, 1374227    
Attachments:
Description Flags
engine log
none
rhevh engine.log
none
reassign engine.log
none
reassign vdsm.log none

Description Evgheni Dereveanchin 2017-02-07 12:18:18 UTC
Description of problem:
When upgrading from 4.0 to 4.1 and setting the cluster compatibility level to 4.1, all VMs implicitly get "Random Generator" enabled. Clusters themselves, however, do not provide a random number generator source by default. This causes a situation where, after a reboot, VMs fail to boot with the error:

"Cannot run VM. Random Number Generator device is not supported in cluster"

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0.4-1.el7.centos

How reproducible:
always

Steps to Reproduce:
1. install oVirt 4.0 engine and hosts
2. create a VM and start it
3. upgrade to 4.1
4. change cluster and DC compatibility level to 4.1
5. Note the orange marker appears on the VM; reboot it to apply changes

Actual results:
VM fails to boot with:
"Cannot run VM. Random Number Generator device is not supported in cluster"

Expected results:
VM starts up fine. If a device is silently added to all VMs, the Engine must ensure the respective cluster offers this device and display a warning otherwise, as this can lead to unexpected outages due to VMs failing to start.

Additional info:
To work around this, either disable RNG on all VMs (there may be hundreds of VMs in the cluster) or enable the RNG source in the cluster settings (which may render hosts NonOperational).
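
To scope the impact, the VMs that carry an RNG device can in principle be listed straight from the engine database. A sketch only, assuming the default "engine" PostgreSQL database and that RNG devices live in the vm_device table with type 'rng' and the source kept in spec_params; verify the schema on your own installation first.

su - postgres -c "psql engine -c \"select vm_id, spec_params from vm_device where type = 'rng';\""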

Comment 1 Evgheni Dereveanchin 2017-02-07 12:34:04 UTC
This is probably related to bz#1337101, which enabled this by default, but on pre-existing hosts the flag is not set for some reason; otherwise the VM would start fine.

Comment 2 Michal Skrivanek 2017-02-07 14:54:56 UTC
If you have the engine.log from the time of the cluster level change, please attach it to the bug.

Comment 3 Evgheni Dereveanchin 2017-02-07 15:21:36 UTC
Created attachment 1248430 [details]
engine log

Engine log provided. The VM name is "vm1" and the first failed reboot is at timestamp 2017-02-06 13:56:36,561.

Cluster and DC updates happened a few minutes before that.

Comment 4 Tomas Jelinek 2017-02-08 07:55:51 UTC
*** Bug 1420213 has been marked as a duplicate of this bug. ***

Comment 5 Nisim Simsolo 2017-02-08 16:59:21 UTC
This bug is also relevant for RHEV-H:
2017-02-08 18:58:12,824+02 WARN  [org.ovirt.engine.core.bll.RunVmCommand] (default task-30) [21f2ce18-0f42-467a-b4d8-664538fed970] Validation of action 'RunVm' failed for user admin@internal-authz. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_RNG_SOURCE_NOT_SUPPORTED

engine.log attached

Comment 6 Nisim Simsolo 2017-02-08 17:01:33 UTC
Created attachment 1248645 [details]
rhevh engine.log

Comment 7 Michal Skrivanek 2017-02-10 12:32:04 UTC
A workaround is fairly simple: for all VMs using the virtio-rng "random" source in 4.0, do the following after the upgrade to 4.1: edit the VM, uncheck virtio-rng, and check it again.
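
A possible way to apply the same workaround in bulk is to update the device through the REST API instead of the UI. This is only a sketch and not verified against this exact version: I am assuming the v4 API exposes the device as the VM's rng_device element and that a PUT with the new source is equivalent to the uncheck/re-check in the UI; test it on a single VM first. ENGINE_FQDN, PASSWORD and VM_ID are placeholders.

curl -s -k -u admin@internal:PASSWORD -X PUT \
  -H 'Content-Type: application/xml' \
  -d '<vm><rng_device><source>urandom</source></rng_device></vm>' \
  'https://ENGINE_FQDN/ovirt-engine/api/vms/VM_ID'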

Comment 13 Nisim Simsolo 2017-02-19 14:27:06 UTC
Not fixed. Trying to run a VM after the cluster upgrade from 4.0 to 4.1 failed, and the following message is displayed in webadmin:
"Error while executing action: 
1111upgrade1:
Cannot run VM. Random Number Generator device is not supported in cluster."

engine.log shows the following WARN:
2017-02-19 16:18:48,705+02 WARN  [org.ovirt.engine.core.bll.RunVmCommand] (default task-23) [301d16a7-d572-4d22-90a1-73385eab3a7c] Validation of action 'RunVm' failed for user admin@internal-authz. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_RNG_SOURCE_NOT_SUPPORTED
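
For anyone triaging the same symptom, the failure is easy to spot with a grep on the default engine log location:

grep ACTION_TYPE_FAILED_RNG_SOURCE_NOT_SUPPORTED /var/log/ovirt-engine/engine.log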

Verification builds: 
ovirt-engine-4.1.1.2-0.1.el7
qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64
vdsm-4.19.6-1.el7ev.x86_64
libvirt-client-2.0.0-10.el7_3.5.x86_64
sanlock-3.4.0-1.el7.x86_64

Verification scenario: 
1. Create New DC and cluster with compatibility version 4.0, add host and storage to it.
2. Create new VM, install RHEL 7 on it and verify /dev/random functionality: 
dd count=1 bs=128 if=/dev/random of=/dev/stdout | xxd
3. Change cluster and DC compatibility version to 4.1 and verify delta icon appears on VM status tab.
4. Power off VM and try to run it.

engine.log and vdsm.log attached

Comment 14 Nisim Simsolo 2017-02-19 14:31:55 UTC
Created attachment 1255438 [details]
reassign engine.log

Comment 15 Nisim Simsolo 2017-02-19 14:32:26 UTC
Created attachment 1255439 [details]
reassign vdsm.log

Comment 16 Michal Skrivanek 2017-02-20 14:49:27 UTC
Hi Nisim, this was caused by an incorrect move to MODIFIED by automation. Sorry, this fix is not in 4.1.1.2.

Eyal, this is still happening; it confuses people and wastes their time. We need to be more careful about what gets moved to MODIFIED and ON_QA automatically.

Comment 17 Eyal Edri 2017-02-20 14:53:31 UTC
A bug is moved to MODIFIED if all its attached external trackers are in MERGED status.

The bot can't know if there will be future patches attached to the bug or not.
In the past, if this happened, we moved the status back to POST, but we got a request from PM to never move bug status backwards, so this behavior was removed.

If you choose to leave moving bugs from POST to MODIFIED as a manual process, we'll end up with many more bugs in POST than in MODIFIED, and we will deploy fixes to QE for bugs that are already fixed but weren't moved due to human error.


So unless you have better criteria or a solution for when to move to MODIFIED, I don't think we can do anything different from what we're doing now that will improve things w/o the penalties explained above.

Comment 18 Michal Skrivanek 2017-02-20 16:33:31 UTC
(In reply to Eyal Edri from comment #17)
> A bug is moved to MODIFIED if all its attached external trackers are in
> MERGED status.

Sorry, I wasn't clear; I meant that the MODIFIED->ON_QA move is problematic in this case.
 
> The bot can't know if there will be future patches attached to the bug or
> not.
> In the past, if this happened, we moved the status back to POST, but we got
> a request from PM to never move bug status backwards, so this behavior was
> removed.

Really? Hm, but that's exactly what a person will do manually now anyway.
 
> If you choose to leave moving bugs from POST to MODIFIED as a manual
> process, we'll end up with many more bugs in POST than in MODIFIED, and we
> will deploy fixes to QE for bugs that are already fixed but weren't moved
> due to human error.
> 
> So unless you have better criteria or a solution for when to move to
> MODIFIED, I don't think we can do anything different from what we're doing
> now that will improve things w/o the penalties explained above.

Well, I personally advocate for moving the bug manually. But the point here is really about the incorrect ON_QA: we should either verify automatically, or at least take the list of MODIFIED candidates at the time of tagging instead of at build time, which comes one or two days later. Everything merged in that window falls through the cracks and ends up in ON_QA without actually being in the build.

Comment 19 Eyal Edri 2017-02-20 20:04:02 UTC
OK, looking at the logs, the bug passed all verification, but it's still not part of the latest tag, so it shouldn't have been moved.
This might be a bug in our code that scans the bug's external trackers and verifies that the bug is in the tag.

I've opened [1] to track debugging this issue, thanks for reporting it.

[1] https://ovirt-jira.atlassian.net/browse/OVIRT-1168

Comment 20 Nisim Simsolo 2017-03-05 14:16:23 UTC
Verification builds: 
ovirt-engine-4.1.1.3-0.1.el7
qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64
vdsm-4.19.7-1.el7ev.x86_64
libvirt-client-2.0.0-10.el7_3.5.x86_64
sanlock-3.4.0-1.el7.x86_64

Verification scenario: 
1. Run some VMs
2. Upgrade cluster and DC with /dev/random enabled from 3.6 to 4.1
3. Verify delta icon (reboot required) appears on running VMs.
4. Power off -> run all VMs.
5. Verify from the VM XML in vdsm.log that the rng source is now: <backend model="random">/dev/urandom</backend> (see the virsh check after this list)
6. Repeat steps 1-5 but this time upgrade cluster from 4.0 to 4.1
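
The virsh check referenced in step 5: the RNG source of a running VM can also be read from the live domain XML on the host, without digging through vdsm.log. virsh is used read-only here; VM_NAME is a placeholder.

virsh -r dumpxml VM_NAME | grep -A2 '<rng'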

Comment 21 Nisim Simsolo 2017-03-06 13:29:29 UTC
Verified also with RHEV-H:
qemu-kvm-rhev-2.6.0-28.el7_3.3.x86_64
vdsm-4.19.4-1.el7ev.x86_64
libvirt-client-2.0.0-10.el7_3.4.x86_64
sanlock-3.4.0-1.el7.x86_64