Bug 1419924

Summary: cluster level 4.1 adds Random Generator to all VMs while it may not be provided by the cluster
Product: [oVirt] ovirt-engine Reporter: Evgheni Dereveanchin <ederevea>
Component: BLL.Virt Assignee: jniederm
Status: CLOSED CURRENTRELEASE QA Contact: Nisim Simsolo <nsimsolo>
Severity: high Docs Contact:
Priority: high    
Version: 4.1.0.4 CC: bugs, ederevea, eedri, gklein, jniederm, mburman, michal.skrivanek, nsimsolo
Target Milestone: ovirt-4.1.1 Flags: gklein: ovirt-4.1+
gklein: blocker+
Target Release: 4.1.1.3   
Hardware: Unspecified   
OS: Linux   
Whiteboard: upgrade
Fixed In Version: Doc Type: Bug Fix
Doc Text:
If a VM was running during a 4.0 -> 4.1 cluster upgrade, had a /dev/random RNG device configured, and had no custom compatibility level set, then the RNG device was not updated from `random` to `urandom` on VM shutdown (or power off or restart). This left the VM with an incompatible RNG device, which prevented it from running. Workaround: remove and re-add the RNG device on the VM. Fix: running VMs now get the updated RNG device stored in their next-run configuration during the cluster update, so the RNG device is properly updated on VM shutdown.
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-04-21 09:54:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1337101, 1374227    
Attachments:
Description Flags
engine log
none
rhevh engine.log
none
reassign engine.log
none
reassign vdsm.log none

Description Evgheni Dereveanchin 2017-02-07 12:18:18 UTC
Description of problem:
When upgrading from 4.0 to 4.1 and setting the cluster compatibility level to 4.1, all VMs implicitly get "Random Generator" enabled. Clusters themselves, however, do not provide a random number generator source by default. This causes a situation where, after a reboot, VMs fail to boot with the error:

"Cannot run VM. Random Number Generator device is not supported in cluster"

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0.4-1.el7.centos

How reproducible:
always

Steps to Reproduce:
1. install oVirt 4.0 engine and hosts
2. create a VM and start it
3. upgrade to 4.1
4. change cluster and DC compatibility level to 4.1
5. Note the orange marker appears on the VM; reboot it to apply changes

Actual results:
VM fails to boot with:
"Cannot run VM. Random Number Generator device is not supported in cluster"

Expected results:
VM starts up fine. If a device is silently added to all VMs, the Engine must ensure the respective cluster offers this device and display a warning otherwise, as this can lead to unexpected outages due to VMs failing to start.

Additional info:
To work around this, either disable RNG on all VMs (there may be hundreds of VMs in the cluster) or enable the RNG source in the cluster settings (which may render hosts NonOperational).
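
To scope the impact, the VMs that carry an RNG device can in principle be listed straight from the engine database. A sketch only, assuming the default "engine" PostgreSQL database and that RNG devices live in the vm_device table with type 'rng' and the source kept in spec_params; verify the schema on your own installation first.

su - postgres -c "psql engine -c \"select vm_id, spec_params from vm_device where type = 'rng';\""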

Comment 1 Evgheni Dereveanchin 2017-02-07 12:34:04 UTC
This is probably related to bz#1337101, which enabled this by default, but on pre-existing hosts the flag is not set for some reason; otherwise the VM would start fine.

Comment 2 Michal Skrivanek 2017-02-07 14:54:56 UTC
If you have the engine.log from the time of the cluster level change, please attach it to the bug.

Comment 3 Evgheni Dereveanchin 2017-02-07 15:21:36 UTC
Created attachment 1248430 [details]
engine log

Engine log provided. The VM name is "vm1" and the first failed reboot is at timestamp 2017-02-06 13:56:36,561.

Cluster and DC updates happened a few minutes before that.

Comment 4 Tomas Jelinek 2017-02-08 07:55:51 UTC
*** Bug 1420213 has been marked as a duplicate of this bug. ***

Comment 5 Nisim Simsolo 2017-02-08 16:59:21 UTC
This bug is also relevant for RHEV-H:
2017-02-08 18:58:12,824+02 WARN  [org.ovirt.engine.core.bll.RunVmCommand] (default task-30) [21f2ce18-0f42-467a-b4d8-664538fed970] Validation of action 'RunVm' failed for user admin@internal-authz. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_RNG_SOURCE_NOT_SUPPORTED

engine.log attached

Comment 6 Nisim Simsolo 2017-02-08 17:01:33 UTC
Created attachment 1248645 [details]
rhevh engine.log

Comment 7 Michal Skrivanek 2017-02-10 12:32:04 UTC
A workaround is fairly simple: for all VMs using the virtio-rng "random" source in 4.0, do the following after the upgrade to 4.1: edit the VM, uncheck virtio-rng, and check it again.
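
A possible way to apply the same workaround in bulk is to update the device through the REST API instead of the UI. This is only a sketch and not verified against this exact version: I am assuming the v4 API exposes the device as the VM's rng_device element and that a PUT with the new source is equivalent to the uncheck/re-check in the UI; test it on a single VM first. ENGINE_FQDN, PASSWORD and VM_ID are placeholders.

curl -s -k -u admin@internal:PASSWORD -X PUT \
  -H 'Content-Type: application/xml' \
  -d '<vm><rng_device><source>urandom</source></rng_device></vm>' \
  'https://ENGINE_FQDN/ovirt-engine/api/vms/VM_ID'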

Comment 13 Nisim Simsolo 2017-02-19 14:27:06 UTC
Not fixed. Trying to run a VM after the cluster upgrade from 4.0 to 4.1 failed, and the following message is displayed in webadmin:
"Error while executing action: 
1111upgrade1:
Cannot run VM. Random Number Generator device is not supported in cluster."

engine.log shows the following WARN:
2017-02-19 16:18:48,705+02 WARN  [org.ovirt.engine.core.bll.RunVmCommand] (default task-23) [301d16a7-d572-4d22-90a1-73385eab3a7c] Validation of action 'RunVm' failed for user admin@internal-authz. Reasons: VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_RNG_SOURCE_NOT_SUPPORTED
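
For anyone triaging the same symptom, the failure is easy to spot with a grep on the default engine log location:

grep ACTION_TYPE_FAILED_RNG_SOURCE_NOT_SUPPORTED /var/log/ovirt-engine/engine.log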

Verification builds: 
ovirt-engine-4.1.1.2-0.1.el7
qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64
vdsm-4.19.6-1.el7ev.x86_64
libvirt-client-2.0.0-10.el7_3.5.x86_64
sanlock-3.4.0-1.el7.x86_64

Verification scenario: 
1. Create New DC and cluster with compatibility version 4.0, add host and storage to it.
2. Create new VM, install RHEL 7 on it and verify /dev/random functionality: 
dd count=1 bs=128 if=/dev/random of=/dev/stdout | xxd
3. Change cluster and DC compatibility version to 4.1 and verify delta icon appears on VM status tab.
4. Power off VM and try to run it.

engine.log and vdsm.log attached

Comment 14 Nisim Simsolo 2017-02-19 14:31:55 UTC
Created attachment 1255438 [details]
reassign engine.log

Comment 15 Nisim Simsolo 2017-02-19 14:32:26 UTC
Created attachment 1255439 [details]
reassign vdsm.log

Comment 16 Michal Skrivanek 2017-02-20 14:49:27 UTC
Hi Nisim, this was caused by an incorrect move to MODIFIED by automation. Sorry, this fix is not in 4.1.1.2.

Eyal, this is still happening; it confuses people and wastes their time. We need to be more careful about what gets moved to MODIFIED and ON_QA automatically.

Comment 17 Eyal Edri 2017-02-20 14:53:31 UTC
A bug is moved to MODIFIED if all its attached external trackers are in MERGED status.

The bot can't know if there will be future patches attached to the bug or not.
In the past, if this happened, we moved the status back to POST, but we got a request from PM to never move bug status backwards, so this behavior was removed.

If you choose to leave moving bugs from POST to MODIFIED as a manual process, we'll end up with many more bugs in POST than in MODIFIED, and we will deploy fixes to QE for bugs that are already fixed but weren't moved due to human error.


So unless you have better criteria or a solution for when to move to MODIFIED, I don't think we can do anything different from what we're doing now that will improve things w/o the penalties explained above.

Comment 18 Michal Skrivanek 2017-02-20 16:33:31 UTC
(In reply to Eyal Edri from comment #17)
> A bug is moved to MODIFIED if all its attached external trackers are in
> MERGED status.

Sorry, I wasn't clear; I meant that the MODIFIED->ON_QA move is problematic in this case.
 
> The bot can't know if there will be future patches attached to the bug or
> not.
> In the past, if this happened, we moved the status back to POST, but we got
> a request from PM to never move bug status backwards, so this behavior was
> removed.

Really? Hm, but that's exactly what a person will do manually now anyway.
 
> If you choose to leave moving bugs from POST to MODIFIED as a manual
> process, we'll end up with many more bugs in POST than in MODIFIED, and we
> will deploy fixes to QE for bugs that are already fixed but weren't moved
> due to human error.
> 
> So unless you have better criteria or a solution for when to move to
> MODIFIED, I don't think we can do anything different from what we're doing
> now that will improve things w/o the penalties explained above.

Well, I personally advocate for moving the bug manually. But the point here is really about the incorrect ON_QA: we should either verify automatically, or at least take the list of MODIFIED candidates at the time of tagging instead of at build time, which comes one or two days later. Everything merged in that window falls through the cracks and ends up in ON_QA without actually being in the build.

Comment 19 Eyal Edri 2017-02-20 20:04:02 UTC
OK, looking at the logs, the bug passed all verification, but it's still not part of the latest tag, so it shouldn't have been moved.
This might be a bug in our code that scans the bug's external trackers and verifies that the bug is in the tag.

I've opened [1] to track debugging this issue, thanks for reporting it.

[1] https://ovirt-jira.atlassian.net/browse/OVIRT-1168

Comment 20 Nisim Simsolo 2017-03-05 14:16:23 UTC
Verification builds: 
ovirt-engine-4.1.1.3-0.1.el7
qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64
vdsm-4.19.7-1.el7ev.x86_64
libvirt-client-2.0.0-10.el7_3.5.x86_64
sanlock-3.4.0-1.el7.x86_64

Verification scenario: 
1. Run some VMs
2. Upgrade cluster and DC with /dev/random enabled from 3.6 to 4.1
3. Verify delta icon (reboot required) appears on running VMs.
4. Power off -> run all VMs.
5. Verify from the VM XML in vdsm.log that the rng source is now: <backend model="random">/dev/urandom</backend> (see the virsh check after this list)
6. Repeat steps 1-5 but this time upgrade cluster from 4.0 to 4.1
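
The virsh check referenced in step 5: the RNG source of a running VM can also be read from the live domain XML on the host, without digging through vdsm.log. virsh is used read-only here; VM_NAME is a placeholder.

virsh -r dumpxml VM_NAME | grep -A2 '<rng'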

Comment 21 Nisim Simsolo 2017-03-06 13:29:29 UTC
Verified also with RHEV-H:
qemu-kvm-rhev-2.6.0-28.el7_3.3.x86_64
vdsm-4.19.4-1.el7ev.x86_64
libvirt-client-2.0.0-10.el7_3.4.x86_64
sanlock-3.4.0-1.el7.x86_64