(I'm not sure whether the component or the team is correct; if not, please redirect it.)
I'm trying to run oVirt under nested virtualization with AMD's various Zen/Zen+ based CPUs (Ryzen, Threadripper, EPYC).
In nested virtualization mode, when I try to create or launch a VM, oVirt stops and complains that the "monitor" flag is missing.
Checking libvirt's domcapabilities shows that the monitor policy is indeed "disabled", which is correct (it matches other virtualization solutions), but oVirt doesn't respect the domcapabilities.
Could someone please disable the monitor flag check? The flag cannot be enabled, and this is not a bug in the CPU or in KVM.
*** Bug 1689361 has been marked as a duplicate of this bug. ***
Please attach logs (engine.log, libvirt logs, qemu logs). We don't directly check for flags outside of setting a CPU model. Is this coming from qemu?
Is vdsm-hook-nestedvt in use?
Created attachment 1544682 [details]
Created attachment 1544683 [details]
I don't see any libvirt or qemu logs. Where are they? I'm enclosing both the vdsm and engine logs, which show the error.
Exact message is: 6464: error : virCPUx86Compare:1731 : the CPU is incompatible with host CPU: Host CPU does not provide required features: monitor
Output of virsh domcapabilities:
# virsh domcapabilities | grep mon
<feature policy='disable' name='monitor'/>
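For reference, the policy value can be pulled out of the domcapabilities XML with a small pipeline. This is just a sketch: the sample line below is copied from the output above, and on a real host you would pipe `virsh domcapabilities` into the same `sed` instead of the `printf`.

```shell
# Sketch: extract the policy for the "monitor" feature from
# domcapabilities-style XML. The sample line mirrors the output above;
# on a real host, replace the printf with `virsh domcapabilities`.
sample="<feature policy='disable' name='monitor'/>"
printf '%s\n' "$sample" | sed -n "s/.*policy='\([^']*\)'.*name='monitor'.*/\1/p"
# prints: disable
```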
Forgot to mention: yes, I installed vdsm-hook-nestedvt and checked that it appears in the host hooks (it does).
So, that message comes directly from libvirt.
Libvirt and qemu logs will be on the host the VM was scheduled on before it failed (likely to be the same host as the vdsm logs). Both are under /var/log/
Is vdsm-hook-nestedvt installed?
Yes, vdsm-hook-nestedvt is installed and running.
/var/log/libvirt/qemu doesn't help much - it has the VM log, which only shows:
2019-03-16 00:59:38.003+0000: shutting down, reason=failed
2019-03-16 01:15:28.349+0000: shutting down, reason=failed
2019-03-16 01:28:57.159+0000: shutting down, reason=failed
2019-03-16 01:29:04.283+0000: shutting down, reason=failed
2019-03-16 01:29:51.729+0000: shutting down, reason=failed
2019-03-16 01:31:44.493+0000: shutting down, reason=failed
/var/log/qemu-ga is an empty directory.
Tailing journald when starting a VM shows:
Mar 16 03:52:44 localhost.localdomain vdsm: WARN Attempting to add an existing net user: ovirtmgmt/a3b4d8de-f2d3-4272-843c-fba78751f481
Mar 16 03:52:45 localhost.localdomain libvirtd: 2019-03-16 01:52:45.401+0000: 6466: error : virCPUx86Compare:1731 : the CPU is incompatible with host CPU: Host CPU does not provide required features: monitor
Mar 16 03:52:45 localhost.localdomain vdsm: WARN File: /var/lib/libvirt/qemu/channels/a3b4d8de-f2d3-4272-843c-fba78751f481.ovirt-guest-agent.0 already removed
Mar 16 03:52:45 localhost.localdomain vdsm: WARN Attempting to remove a non existing network: ovirtmgmt/a3b4d8de-f2d3-4272-843c-fba78751f481
Mar 16 03:52:45 localhost.localdomain vdsm: WARN Attempting to remove a non existing net user: ovirtmgmt/a3b4d8de-f2d3-4272-843c-fba78751f481
Mar 16 03:52:45 localhost.localdomain vdsm: WARN File: /var/lib/libvirt/qemu/channels/a3b4d8de-f2d3-4272-843c-fba78751f481.org.qemu.guest_agent.0 already removed
Please attach /proc/cpuinfo from the L0 host, along with its domcapabilities output. Then the same from your nested L1 host, plus its domain XML from libvirt. If you manage to start a nested guest manually, can you please also get the qemu command line and cpuinfo from the L2 guest?
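The requested data can be gathered with a short script like this one. It's only a sketch: the output directory name is arbitrary, and `virsh` will only produce output on hosts where libvirt is actually running.

```shell
# Sketch: collect the requested diagnostics into one directory.
# Run once on the L0 host and once on the nested L1 host.
outdir=$(mktemp -d)
cp /proc/cpuinfo "$outdir/cpuinfo.txt"
# virsh is only available where libvirt is installed; ignore failures here
virsh domcapabilities > "$outdir/domcapabilities.xml" 2>/dev/null || true
echo "diagnostics collected in $outdir"
```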
As requested, I'm including dumps of the L0 and L1 cpuinfo and domcapabilities.
I'm also including the ovirt-node1 dumpxml as well as the centos7 dumpxml.
I found something very interesting:
On the host (Fedora 29 with a Ryzen 7) I created a CentOS 7 nested guest and installed CentOS 7 inside it (so: Fedora host -> CentOS nested -> CentOS guest without nesting) - this works perfectly OK.
However - I launched oVirt Node (latest - 4.3.1) as a guest with nested virtualization and tried to launch a VM using virsh (a CentOS 7 guest, no nesting) - it stops with the CPU error about monitor.
So, it seems that the problem is related to the Node-NG appliance which I installed as ovirt-node-1. On a standard CentOS guest with nesting, everything works, no errors...
So, how can I find what causes this in Node-NG?
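One way to narrow it down might be to diff the `<cpu>` elements libvirt holds for the working and failing guests. This is a sketch only: the VM names `centos7` and `ovirt-node1` are assumptions taken from the attached dumpxml files, so substitute your own.

```shell
# Sketch: extract and compare the <cpu> element from the working CentOS
# guest vs. the failing Node-NG guest. VM names are assumptions from this
# thread; errors are ignored so the diff still runs if a name is wrong.
for vm in centos7 ovirt-node1; do
    virsh dumpxml "$vm" 2>/dev/null | sed -n '/<cpu/,/<\/cpu>/p' > "cpu-$vm.xml"
done
diff -u cpu-centos7.xml cpu-ovirt-node1.xml || true
```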
Created attachment 1544756 [details]
Dump XML as requested
Just to make myself clear - all VMs were created on the host (Fedora 29) using virt-manager.
After researching further, I found the following issue:
I installed CentOS as an L1 guest with nested virtualization, added the oVirt repo, and started the hosted-engine deployment.
It creates the HE VM and launches it, and it works well (I can access it on port 6900).
However, when it comes to the storage part, after I give it the NFS share and continue the deployment, it creates the new HE VM on the NFS storage, moves the data over, and then, when it tries to launch the new VM, the VM goes up and down repeatedly.
While it was cycling up and down, I manually mounted my virtual machines and tried to launch a VM with virsh (using virsh create).
And... surprise, surprise:
# virsh create nfs-server.xml
Please enter your authentication name: hetz
Please enter your password:
error: Failed to create domain from nfs-server.xml
error: the CPU is incompatible with host CPU: Host CPU does not provide required features: monitor
Prior to deploying the HE on this VM, KVM inside the guest OS worked perfectly well with virsh. After the failed deployment, I got the above error.
I'm enclosing all the VDSM logs, as the Ansible logs don't show anything relevant.
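For what it's worth, whether the guest CPU actually exposes the flag can be checked directly from inside the guest. A sketch; on nested AMD guests the expected result is that the flag is missing, which matches the libvirt error.

```shell
# Sketch: check whether the "monitor" CPU flag (MONITOR/MWAIT) is visible
# in the guest. libvirt refuses the host-model CPU when a required flag
# like this one is absent.
if grep -qw monitor /proc/cpuinfo; then
    echo "monitor flag present"
else
    echo "monitor flag missing"
fi
```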
Created attachment 1544820 [details]
VDSM logs after running hosted-engine deployment
Update #3: When running the HE as a standalone VM (not deploying with hosted-engine --deploy) and adding a nested VM as a "node", the same issue appears on the new "node".
Hope this helps...
Thanks, Hetz. I'll look at the logs tomorrow.
vdsm does try to do CPU detection and set a host model appropriately (including HE setups -- you would have been prompted for this as part of the deployment), but we may be missing something here...
Confirmed, and I know for sure that this doesn't happen with nested Intel CPUs, since I use them regularly.
Hello, any progress on this bug? I'm experiencing the same problem deploying a nested HostedEngine. If you need more logs or tests, just tell me.