Bug 1985670

Summary: virt-launcher fails to create v1 controller cpu for group: Read-only file system
Product: Container Native Virtualization (CNV) Reporter: Denis Ollier <dollierp>
Component: Virtualization Assignee: Itamar Holder <iholder>
Status: CLOSED ERRATA QA Contact: Israel Pinto <ipinto>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 4.9.0 CC: cnv-qe-bugs, mtessun, sgott
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-11-02 15:59:33 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Denis Ollier 2021-07-24 19:58:01 UTC
Description of problem
----------------------

Creation of VirtualMachines is failing due to a cgroups error in the virt-launcher Pod:

> server error. command SyncVMI failed: "LibvirtError(Code=38, Domain=54, Message='Failed to create v1 controller cpu for group: Read-only file system')"

Version
-------

OCP: v4.9.0-0.nightly-2021-07-24-064622
RHCOS: v49.84.202107220219-0
CNV: http://cnv-version-explorer.apps.cnv.engineering.redhat.com/BundleDetails?ver=v4.9.0-57

How reproducible
----------------

100%

Steps to Reproduce
------------------

Create a basic VM:

> ---
> kind: VirtualMachine
> apiVersion: kubevirt.io/v1
> metadata:
>   name: cirros
> spec:
>   template:
>     spec:
>       domain:
>         cpu:
>           cores: 1
>         devices:
>           disks:
>             - name: rootdisk
>               disk:
>                 bus: virtio
>         resources:
>           requests:
>             memory: '128Mi'
>       volumes:
>         - name: rootdisk
>           dataVolume:
>             name: cirros-rootdisk
>   running: true
>   dataVolumeTemplates:
>     - metadata:
>         name: cirros-rootdisk
>       spec:
>         source:
>           http:
>             url: http://cnv-qe-server.rhevdev.lab.eng.rdu2.redhat.com/files/cnv-tests/cirros-images/cirros-0.5.1-x86_64-disk.img
>         pvc:
>           accessModes:
>             - ReadWriteOnce
>           resources:
>             requests:
>               storage: '150Mi'

Actual results
--------------

The VirtualMachineInstance stays in the Scheduled phase because the virt-launcher Pod hits a cgroups error:

> {"component":"virt-launcher","level":"error","msg":"Failed to create v1 controller cpu for group: Read-only file system","pos":"virCgroupV1MakeGroup:675","subcomponent":"libvirt","thread":"34","timestamp":"2021-07-24T19:34:56.099000Z"}
> {"component":"virt-launcher","kind":"","level":"error","msg":"Failed to start VirtualMachineInstance with flags 0.","name":"cirros","namespace":"openshift-cnv","pos":"manager.go:827","reason":"virError(Code=38, Domain=54, Message='Failed to create v1 controller cpu for group: Read-only file system')","timestamp":"2021-07-24T19:34:56.305333Z","uid":"d6b381b1-367b-4882-a6a4-9fc1b84745b5"}
> {"component":"virt-launcher","kind":"","level":"error","msg":"Failed to sync vmi","name":"cirros","namespace":"openshift-cnv","pos":"server.go:184","reason":"virError(Code=38, Domain=54, Message='Failed to create v1 controller cpu for group: Read-only file system')","timestamp":"2021-07-24T19:34:56.305505Z","uid":"d6b381b1-367b-4882-a6a4-9fc1b84745b5"}

Expected results
----------------

The VirtualMachine should start properly.

Comment 2 Itamar Holder 2021-07-29 08:09:12 UTC
This bug's root cause has been found and is now fixed upstream with this PR: https://github.com/kubevirt/kubevirt/pull/6153.

As explained in the PR itself:


Very long story short:

Background:
Multiple libvirtd processes can run for multiple VMs; some of them can be root and some non-root. The QEMU configuration file path is different for root and non-root VMs.

For root VMs it's /etc/libvirt/qemu.conf
For non-root VMs it's /var/run/libvirt/qemu.conf.
(for more info: https://libvirt.org/manpages/libvirtd.html#when-run-as-non-root)
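
As a minimal sketch of the path selection described above (qemuConfPath is a hypothetical helper written for illustration, not the function KubeVirt actually uses), the choice could look like this in Go:

```go
package main

import "fmt"

// Paths quoted in this bug report.
const (
	rootQemuConf    = "/etc/libvirt/qemu.conf"
	nonRootQemuConf = "/var/run/libvirt/qemu.conf"
)

// qemuConfPath is a hypothetical helper (an assumption for illustration):
// it returns the QEMU configuration file that matches the way libvirtd runs.
func qemuConfPath(nonRoot bool) string {
	if nonRoot {
		return nonRootQemuConf
	}
	return rootQemuConf
}

func main() {
	fmt.Println(qemuConfPath(false)) // root VM
	fmt.Println(qemuConfPath(true))  // non-root VM
}
```

The bug described in this comment corresponds to the non-root path being returned even when the VM runs as root.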

In KubeVirt, we also add the string cgroup_controllers = [ ] to the configuration file (here: https://github.com/kubevirt/kubevirt/blob/main/pkg/virt-launcher/virtwrap/util/libvirt_helper.go#L454).

Bug root cause:
As this PR shows, the bug is that the wrong configuration file (the non-root one) is also chosen for root VMs.

Bug outcome:
The outcome is this bug. Deep in libvirt's code there is an if-else branch (in the virCgroupV1DetectControllers() function) that depends on the number of controllers defined in the QEMU config file. Previously it was 0, since we had cgroup_controllers = [ ] in the config file. Because this bug makes us write to the wrong config file (the non-root one), the config file that is actually read doesn't define cgroup_controllers at all, and therefore libvirt determines the number of controllers to be -1.

This change in libvirt's code path breaks KubeVirt and causes VMs to stay in the Scheduled phase until they fail.
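
To make the 0 versus -1 distinction concrete, here is a small Go illustration. It is not libvirt's parser (libvirt is written in C), and controllerCount is a hypothetical function; it only mimics the behavior described above, where an empty list yields 0 and a missing setting yields -1.

```go
package main

import (
	"fmt"
	"strings"
)

// controllerCount is an illustrative stand-in, not libvirt code.
// It returns the number of entries in cgroup_controllers,
// or -1 when the setting does not appear in the config text.
func controllerCount(conf string) int {
	for _, line := range strings.Split(conf, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "#") {
			continue // skip commented-out settings
		}
		key, value, found := strings.Cut(line, "=")
		if !found || strings.TrimSpace(key) != "cgroup_controllers" {
			continue
		}
		// Drop the surrounding brackets and count non-empty items.
		list := strings.Trim(strings.TrimSpace(value), "[]")
		count := 0
		for _, item := range strings.Split(list, ",") {
			if strings.TrimSpace(item) != "" {
				count++
			}
		}
		return count
	}
	return -1 // setting not defined at all
}

func main() {
	fmt.Println(controllerCount("cgroup_controllers = [ ]")) // empty list
	fmt.Println(controllerCount("# nothing relevant here"))  // setting absent
}
```

With the setting written to the wrong file, the file that is actually read looks like the second case.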

To fix this, we need to make sure the configuration file is set up correctly, as this PR does.


Thanks very much to @dollierp for helping me with this bug!

Comment 3 Denis Ollier 2021-08-02 16:13:27 UTC
It has been mitigated by modifying the default /etc/libvirt/qemu.conf file.

Removing blocker tags.

Comment 4 Denis Ollier 2021-08-06 20:21:05 UTC
Verified with http://cnv-version-explorer.apps.cnv.engineering.redhat.com/BundleDetails?ver=v4.9.0-79.

virt-launcher no longer creates the file /var/run/libvirt/qemu.conf for root VMs and overrides the file /etc/libvirt/qemu.conf instead.

Comment 7 errata-xmlrpc 2021-11-02 15:59:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.9.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4104