Bug 1881930

Summary: Failure to create tap device upon VM creation
Product: Container Native Virtualization (CNV)
Reporter: Yossi Segev <ysegev>
Component: Networking
Assignee: Miguel Duarte Barroso <mduarted>
Status: CLOSED ERRATA
QA Contact: Meni Yakove <myakove>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 2.5.0
CC: cnv-qe-bugs, danken, lbednar, ncredi, phoracek
Target Milestone: ---
Keywords: Regression, TestBlocker
Target Release: 2.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: virt-launcher-container-v2.5.0-56
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-11-17 13:24:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
vm-fedora.yaml none

Description Yossi Segev 2020-09-23 12:16:32 UTC
Created attachment 1716019
vm-fedora.yaml

Description of problem:
Starting a VM results in a warning in the VMI description about a failure to create a tap device.


Version-Release number of selected component (if applicable):
OCP version:
Client Version: 4.6.0-202009212020.p0-0a57069
Server Version: 4.6.0-fc.7
Kubernetes Version: v1.19.0+b4ffb45

CNV version: 2.5.0


How reproducible:
Always


Steps to Reproduce:
1. In an OCP 4.6/CNV 2.5 cluster - create a VM:
$ oc apply -f vm-fedora.yaml
virtualmachine.kubevirt.io/vm-fedora created

The Fedora VM spec YAML I used is attached.

2. Start the VM:
$ virtctl start vm-fedora
VM vm-fedora was scheduled to start

3. Wait for the VMI to get to Running state:
$ oc get vmi vm-fedora -w
NAME        AGE   PHASE        IP    NODENAME
vm-fedora   1s    Scheduling         
vm-fedora   5s    Scheduled          myakove-8ljbm-worker-0-xvkrt
vm-fedora   5s    Scheduled          myakove-8ljbm-worker-0-xvkrt
vm-fedora   7s    Scheduled          myakove-8ljbm-worker-0-xvkrt
vm-fedora   8s    Running      10.128.3.126   myakove-8ljbm-worker-0-xvkrt
vm-fedora   8s    Running      10.128.3.126   myakove-8ljbm-worker-0-xvkrt

4. Check the VMI description (specifically the Events section):
$ oc describe vmi vm-fedora


Actual results:
$ oc describe vmi vm-fedora
...
Events:
  Type     Reason            Age                   From                       Message
  ----     ------            ----                  ----                       -------
  Normal   SuccessfulCreate  3m31s                 virtualmachine-controller  Created virtual machine pod virt-launcher-vm-fedora-jskm4
  Warning  SyncFailed        3m25s                 virt-handler               server error. command SyncVMI failed: "LibvirtError(Code=38, Domain=0, Message='Unable to create tap device tap0: Permission denied')"
  Normal   Started           3m24s                 virt-handler               VirtualMachineInstance started.
  Normal   Created           48s (x11 over 3m24s)  virt-handler               VirtualMachineInstance defined.

<BUG> The warning message about failure to create tap device.


Expected results:
1. No such warning/error.
2. Tap device exists for every interface in the VM.
To verify this, dump the domain XML (domxml) from the virt-launcher pod of the VMI:
a. Find the virt-launcher pod:
$ oc get pod | grep launcher
virt-launcher-vm-fedora-n2v7p   2/2     Running   0          40m

b. Find the Id of the VMI domain:
[cnv-qe-jenkins@myakove-8ljbm-executor yossi]$ oc exec -it virt-launcher-vm-fedora-n2v7p -- virsh list
Defaulting container name to compute.
Use 'oc describe pod/virt-launcher-vm-fedora-n2v7p -n yoss-ns' to see all of the containers in this pod.
 Id   Name                State
-----------------------------------
 2    yoss-ns_vm-fedora   running

c. Dump the domxml for this domain (Id "2" in this example):
[cnv-qe-jenkins@myakove-8ljbm-executor yossi]$ oc exec -it virt-launcher-vm-fedora-n2v7p -- virsh dumpxml 2

d. Search for the ethernet interface entries. Each one should have a tap device defined as its target, for example:
    <interface type='ethernet'>
...
      <target dev='tap0' managed='no'/>
      <model type='virtio'/>
...
    </interface>
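
The check in step d can also be scripted. The following is only a minimal sketch, not part of the original verification: the function name `interfaces_missing_tap` and the `SAMPLE` document are hypothetical, and it assumes that tap device names start with "tap", as in the `tap0` example above.

```python
import xml.etree.ElementTree as ET


def interfaces_missing_tap(domxml):
    """Return the indexes of ethernet interfaces that have no tap target device."""
    root = ET.fromstring(domxml)
    missing = []
    interfaces = root.findall("./devices/interface[@type='ethernet']")
    for index, iface in enumerate(interfaces):
        target = iface.find("target")
        # Assumption: tap devices are named tap0, tap1, ... as in the report.
        if target is None or not target.get("dev", "").startswith("tap"):
            missing.append(index)
    return missing


# Hypothetical, heavily trimmed domain XML modeled on the snippet in step d.
SAMPLE = """
<domain type='kvm'>
  <devices>
    <interface type='ethernet'>
      <target dev='tap0' managed='no'/>
      <model type='virtio'/>
    </interface>
  </devices>
</domain>
"""

if __name__ == "__main__":
    print(interfaces_missing_tap(SAMPLE))
```

To use it against a real VMI, pass in the text printed by the virsh dumpxml command from step c. An empty list means every ethernet interface has a tap target.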


Additional info:
This error can also be found in the virt-handler and virt-launcher logs:
a. virt-handler:
$ oc get vmi vm-fedora
NAME        AGE   PHASE     IP             NODENAME
vm-fedora   49m   Running   10.128.3.126   myakove-8ljbm-worker-0-xvkrt

[cnv-qe-jenkins@myakove-8ljbm-executor yossi]$ oc get pods -n openshift-cnv -o wide | grep "virt-handler" | grep "myakove-8ljbm-worker-0-xvkrt"
virt-handler-8xlfq                                    1/1     Running   0          21h     10.128.2.4      myakove-8ljbm-worker-0-xvkrt   <none>           <none>

$ oc logs virt-handler-8xlfq -n openshift-cnv
...
{"component":"virt-handler","kind":"","level":"error","msg":"Synchronizing the VirtualMachineInstance failed.","name":"oper-test-vm-1600844833.408544","namespace":"cluster-addons-operator-test-network-addons-operator","pos":"vm.go:1328","reason":"server error. command SyncVMI failed: \"LibvirtError(Code=38, Domain=0, Message='Unable to create tap device tap0: Permission denied')\"","timestamp":"2020-09-23T07:07:56.769890Z","uid":"ce26416b-ee8b-4089-9fc7-1110acf49f92"}
...

b. virt-launcher:
$ oc get pods | grep "virt-launcher"
virt-launcher-vm-fedora-n2v7p   2/2     Running   0          52m

$ oc logs virt-launcher-vm-fedora-tpqk2 -c compute
...
{"component":"virt-launcher","kind":"","level":"error","msg":"Starting the VirtualMachineInstance failed.","name":"vm-fedora","namespace":"yoss-ns","pos":"manager.go:1245","reason":"virError(Code=38, Domain=0, Message='Unable to create tap device tap0: Permission denied')","timestamp":"2020-09-23T11:22:47.393442Z","uid":"cc269f09-1d8e-489d-84bd-f03f067089ff"}
{"component":"virt-launcher","kind":"","level":"error","msg":"Failed to sync vmi","name":"vm-fedora","namespace":"yoss-ns","pos":"server.go:161","reason":"virError(Code=38, Domain=0, Message='Unable to create tap device tap0: Permission denied')","timestamp":"2020-09-23T11:22:47.393609Z","uid":"cc269f09-1d8e-489d-84bd-f03f067089ff"}
...
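
Because both components log one JSON object per line, the relevant entries can be filtered out programmatically. This is a rough sketch under that assumption: the helper name `tap_errors` and the `SAMPLE_LOG` list are hypothetical, and the sample entry is a trimmed copy of the virt-launcher line above.

```python
import json

TAP_ERROR = "Unable to create tap device"


def tap_errors(log_lines):
    """Yield (timestamp, component, reason) for error entries about tap creation."""
    for line in log_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON object
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        reason = entry.get("reason", "")
        if entry.get("level") == "error" and TAP_ERROR in reason:
            yield entry.get("timestamp"), entry.get("component"), reason


# Hypothetical input: one trimmed entry from the log above plus a non-JSON line.
SAMPLE_LOG = [
    json.dumps({
        "component": "virt-launcher",
        "level": "error",
        "msg": "Failed to sync vmi",
        "name": "vm-fedora",
        "namespace": "yoss-ns",
        "reason": "virError(Code=38, Domain=0, Message='Unable to create "
                  "tap device tap0: Permission denied')",
        "timestamp": "2020-09-23T11:22:47.393609Z",
    }),
    "...",
]

if __name__ == "__main__":
    for timestamp, component, reason in tap_errors(SAMPLE_LOG):
        print(timestamp, component, reason)
```

The same function can be fed the lines of saved `oc logs` output from either virt-handler or virt-launcher. If it yields nothing, the error is absent from that log.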

Comment 1 Yossi Segev 2020-09-23 14:10:10 UTC
Miguel's investigation also found that this bug affects connectivity. When verifying the fix, check connectivity via the primary interface in addition to points 1 (no such warning/error) and 2 (a tap device exists for every interface in the VM) from the original bug description.

Comment 2 Miguel Duarte Barroso 2020-09-24 09:13:22 UTC
For whatever reason, the upstream and downstream build processes are entirely different: upstream uses Bazel, while downstream does not.

As a direct consequence, downstream (d/s) we end up compiling the KubeVirt binaries without SELinux support, which causes the SELinux stub to be used.
Using the stub makes it impossible for virt-handler to read the correct SELinux context of virt-launcher, and also makes it impossible for the handler to switch context.

Because the handler cannot switch context, the tap device is created with incorrect labels, which ultimately prevents libvirt from opening it.

Comment 3 Yossi Segev 2020-10-04 12:57:21 UTC
Verified on OCP 4.6.0-fc.9 / CNV v2.5.0 by checking the 3 expected results from the bug description and comment #1:
1. The warning does not appear in the VMI description, the virt-handler log, or the virt-launcher log.
2. A tap device exists for the single interface (default eth0) on the VM (checked in the virt-launcher's domxml).
3. Connectivity is valid (verified with ping) between 2 newly created VMs.

Comment 6 errata-xmlrpc 2020-11-17 13:24:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Virtualization 2.5.0 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:5127