Created attachment 1699464 [details]
vlan-nncp-.yaml

Description of problem:
A VM, which is intended to serve as a DHCP server for VLAN interfaces on the nodes, fails to start. When starting the VMI (virtctl start), the VMI does not reach the Running state.

Version-Release number of selected component (if applicable):
CNV 2.4.0

How reproducible:
Always

Steps to Reproduce:
1. Create a VLAN interface using the attached NNCP:
   $ oc apply -f vlan-nncp.yaml
   Change the base-iface and the interface name according to the available physical NIC on your nodes. Change the nodeSelector to one of the schedulable nodes in your cluster.
2. Create a bridge interface over this VLAN using the attached NNCP:
   $ oc apply -f bridge-nncp.yaml
   Change the port name to the VLAN interface name you set in vlan-nncp.yaml, and the bridge name accordingly.
3. Create a NetworkAttachmentDefinition using the attached NAD spec:
   $ oc apply -f bridge-nad.yaml
   Change the bridge name according to the bridge interface created in bridge-nncp.yaml.
4. Create a VM using the attached VM yaml:
   $ oc apply -f dhcp-server-vm.yaml
   Change networkName according to the name of the NAD in bridge-nad.yaml.
5. Start the VM:
   $ virtctl start dhcp-server-vm

Actual results:
The VMI enters a loop: a new VMI is created and enters the "Scheduling" state, a virt-launcher pod is created, and after a few seconds the pod stops, a different pod is created, and so on.

Expected results:
The VMI should reach the "Running" state.

Additional info:
1. When the VMI is gone and a new one is created, I see this output in the VMI describe:

$ oc describe vmi dhcp-server-vm
...
Events:
  Type     Reason            Age               From                                                         Message
  ----     ------            ----              ----                                                         -------
  Normal   SuccessfulCreate  10s               virtualmachine-controller                                    Created virtual machine pod virt-launcher-dhcp-server-vm-jt9s5
  Warning  SyncFailed        1s                virt-handler, master-2.cnvcl2.cnvqe.lab.eng.rdu2.redhat.com  unknown error encountered sending command SyncVMI: rpc error: code = Unavailable desc = transport is closing
  Warning  SyncFailed        1s (x11 over 1s)  virt-handler, master-2.cnvcl2.cnvqe.lab.eng.rdu2.redhat.com  failed to detect VMI pod: dial unix //pods/4701df26-d742-47da-85f4-b0770ec986f3/volumes/kubernetes.io~empty-dir/sockets/launcher-sock: connect: connection refused

2. On the hosting node, this is the output of "bridge vlan show", which I ran during the attempt to start the VMI:

[core@cnv-qe-10 ~]$ bridge vlan show
port              vlan ids
eno1
eno2
eno3
eno4
br10              1 PVID Egress Untagged
eno4.1001         1 PVID Egress Untagged
                  2 3 4 5 6 ... (7-998) 999 1000 1001 1002 ... (1003-4090) 4091 4092 4093 4094
eno4.1001-br      1 PVID Egress Untagged
veth0e856763      1 PVID Egress Untagged

Note that in this output, veth0e856763 is the VETH interface created for the bridge. It seems to change on every attempt to start the VMI - on each attempt I see a different name.

3. I tried to capture journalctl output during one of these attempts. In this snippet you can follow the life cycle of the VETH interface from the example above (veth0e856763):
http://pastebin.test.redhat.com/880012

4. I also tried to capture a relevant cycle of the nmstate pod log on the hosting node:
http://pastebin.test.redhat.com/880018

5. Resources used (files are attached; a hedged sketch of these manifests appears at the end of this description):
VLAN interface NNCP: vlan-nncp.yaml
Bridge NNCP: bridge-nncp.yaml
Bridge NAD: bridge-nad.yaml
VM spec: dhcp-server-vm.yaml

6. Please note that because everything was running in a quick loop, you might see some discrepancies between the outputs - for example, one virt-launcher pod name in the VMI describe output and a different pod name in the journalctl output.

7. If you remove the secondary interface (multus) from the VM, the VM starts successfully.

8. Although VLAN interfaces can fully function and receive a dynamic IP address from a DHCP server only on BM machines, this bug, which happens before a DHCP client is even enabled on the nodes, reproduces in the same way on both BM and PSI VMs.
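Since the attached manifests are not reproduced in this report, here is a minimal, hedged sketch of what they might look like, assuming the nmstate.io/v1alpha1 NNCP API and the cnv-bridge CNI type shipped with CNV 2.4. The interface names (eno4, eno4.1001, eno4.1001-br) mirror the "bridge vlan show" output above; the node selector, NAD name, and all other fields are illustrative assumptions, not the actual attachment contents:

# Hedged sketch only - not the actual attachments.
# vlan-nncp.yaml: VLAN interface on top of a physical NIC
apiVersion: nmstate.io/v1alpha1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: vlan-nncp
spec:
  nodeSelector:
    kubernetes.io/hostname: "master-2.cnvcl2.cnvqe.lab.eng.rdu2.redhat.com"  # one schedulable node (assumed)
  desiredState:
    interfaces:
      - name: eno4.1001        # VLAN interface name
        type: vlan
        state: up
        vlan:
          base-iface: eno4     # available physical NIC on the node
          id: 1001
---
# bridge-nncp.yaml: Linux bridge whose port is the VLAN interface above
apiVersion: nmstate.io/v1alpha1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: bridge-nncp
spec:
  nodeSelector:
    kubernetes.io/hostname: "master-2.cnvcl2.cnvqe.lab.eng.rdu2.redhat.com"
  desiredState:
    interfaces:
      - name: eno4.1001-br     # bridge name, as seen in "bridge vlan show"
        type: linux-bridge
        state: up
        bridge:
          options:
            stp:
              enabled: false
          port:
            - name: eno4.1001  # port = the VLAN interface from vlan-nncp.yaml
---
# bridge-nad.yaml: NetworkAttachmentDefinition pointing at that bridge
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: eno4.1001-br           # NAD name (assumed to mirror the bridge name)
spec:
  config: '{
    "cniVersion": "0.3.1",
    "name": "eno4.1001-br",
    "type": "cnv-bridge",
    "bridge": "eno4.1001-br"
  }'
---
# dhcp-server-vm.yaml (relevant part): secondary interface attached via the NAD
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  name: dhcp-server-vm
spec:
  running: false
  template:
    spec:
      domain:
        devices:
          interfaces:
            - name: default
              masquerade: {}
            - name: eno4.1001-br       # must match networks[].name below
              bridge: {}
      networks:
        - name: default
          pod: {}
        - name: eno4.1001-br
          multus:
            networkName: eno4.1001-br  # the NAD from bridge-nad.yaml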
Created attachment 1699465 [details] bridge-nncp.yaml
Created attachment 1699466 [details] bridge-nad.yaml
Created attachment 1699467 [details] dhco-server-vm.yaml
Got some logs:

oc logs --tail=1000 -n yoss-ns -l kubevirt.io=virt-launcher -c compute

{"component":"virt-launcher","level":"info","msg":"Found nameservers in /etc/resolv.conf: \ufffd\u001e\u0000\n","pos":"converter.go:1564","timestamp":"2020-07-01T11:19:21.072418Z"}
{"component":"virt-launcher","level":"info","msg":"Found search domains in /etc/resolv.conf: yoss-ns.svc.cluster.local svc.cluster.local cluster.local openstacklocal","pos":"converter.go:1565","timestamp":"2020-07-01T11:19:21.072457Z"}
{"component":"virt-launcher","level":"info","msg":"Starting SingleClientDHCPServer","pos":"dhcp.go:64","timestamp":"2020-07-01T11:19:21.072563Z"}
{"component":"virt-launcher","level":"info","msg":"/var/run/kubevirt/container-disks/disk_0.img backing file system does not support direct I/O","pos":"converter.go:168","timestamp":"2020-07-01T11:19:21.086636Z"}
{"component":"virt-launcher","level":"info","msg":"Driver cache mode for /var/run/kubevirt-ephemeral-disks/disk-data/containerdisk/disk.qcow2 set to none","pos":"converter.go:187","timestamp":"2020-07-01T11:19:21.086740Z"}
{"component":"virt-launcher","level":"info","msg":"Driver cache mode for /var/run/kubevirt-ephemeral-disks/cloud-init-data/yoss-ns/dhcp-server-vm/noCloud.iso set to none","pos":"converter.go:187","timestamp":"2020-07-01T11:19:21.086774Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Domain defined.","name":"dhcp-server-vm","namespace":"yoss-ns","pos":"manager.go:1204","timestamp":"2020-07-01T11:19:21.440929Z","uid":"ea6c40dd-6a12-4693-af95-0e97669de101"}
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 0 with reason 0 received","pos":"client.go:337","timestamp":"2020-07-01T11:19:21.441187Z"}

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x142a95c]
Stack trace:

goroutine 54 [running]:
kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap.(*LibvirtDomainManager).buildDevicesMetadata(0xc0019bc070, 0xc001ae0000, 0x19ee4a0, 0xc001b76218, 0xc0000d2eb0, 0xc001b7a490, 0xc001b7a494, 0xc000339310, 0x1541140)
	/go/src/kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/manager.go:1615 +0x27c
kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap.(*LibvirtDomainManager).generateCloudInitISO(0xc0019bc070, 0xc001ae0000, 0xc001af0bf0, 0x0, 0x0)
	/go/src/kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/manager.go:888 +0x173
kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap.(*LibvirtDomainManager).SyncVMI(0xc0019bc070, 0xc001ae0000, 0xc0019a6400, 0xc000010030, 0x0, 0x0, 0x0)
	/go/src/kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/manager.go:1230 +0x1116
kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/cmd-server.(*Launcher).SyncVirtualMachine(0xc0019480a0, 0x19c2940, 0xc000286270, 0xc000392520, 0xc0019480a0, 0xc000286270, 0xc00197db30)
	/go/src/kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/cmd-server/server.go:160 +0x9b
kubevirt.io/kubevirt/pkg/handler-launcher-com/cmd/v1._Cmd_SyncVirtualMachine_Handler(0x16c13a0, 0xc0019480a0, 0x19c2940, 0xc000286270, 0xc0001007e0, 0x0, 0x19c2940, 0xc000286270, 0xc000162000, 0xea6)
	/go/src/kubevirt.io/kubevirt/pkg/handler-launcher-com/cmd/v1/cmd.pb.go:556 +0x21a
kubevirt.io/kubevirt/vendor/google.golang.org/grpc.(*Server).processUnaryRPC(0xc0002df200, 0x19ddbe0, 0xc00012d200, 0xc001adc000, 0xc0002bc4e0, 0x27978a0, 0x0, 0x0, 0x0)
	/go/src/kubevirt.io/kubevirt/vendor/google.golang.org/grpc/server.go:1024 +0x4f4
kubevirt.io/kubevirt/vendor/google.golang.org/grpc.(*Server).handleStream(0xc0002df200, 0x19ddbe0, 0xc00012d200, 0xc001adc000, 0x0)
	/go/src/kubevirt.io/kubevirt/vendor/google.golang.org/grpc/server.go:1313 +0xd97
kubevirt.io/kubevirt/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc00192cc30, 0xc0002df200, 0x19ddbe0, 0xc00012d200, 0xc001adc000)
	/go/src/kubevirt.io/kubevirt/vendor/google.golang.org/grpc/server.go:722 +0xbb
created by kubevirt.io/kubevirt/vendor/google.golang.org/grpc.(*Server).serveStreams.func1
	/go/src/kubevirt.io/kubevirt/vendor/google.golang.org/grpc/server.go:720 +0xa1

{"component":"virt-launcher","level":"error","msg":"dirty virt-launcher shutdown","pos":"virt-launcher.go:510","reason":"exit status 2","timestamp":"2020-07-01T11:19:21.450017Z"}
I also tested a similar scenario, only this time I configured the bridge over one of the node's physical interfaces, rather than over a VLAN interface. This time the VMI started successfully, with the secondary NIC visible when entering the VM console.

[fedora@multus-vm ~]$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc fq_codel state UP group default qlen 1000
    link/ether 02:7f:3d:00:01:d4 brd ff:ff:ff:ff:ff:ff
    altname enp1s0
    inet 10.0.2.2/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
       valid_lft 86312782sec preferred_lft 86312782sec
    inet6 fe80::7f:3dff:fe00:1d4/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 02:7f:3d:00:01:d5 brd ff:ff:ff:ff:ff:ff
    altname enp2s0
    inet 10.200.1.1/24 brd 10.200.1.255 scope global noprefixroute eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::7f:3dff:fe00:1d5/64 scope link
       valid_lft forever preferred_lft forever
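For reference, a hedged sketch of this bridge-over-physical-NIC variant (policy name and all fields are assumptions, mirroring the sketch in the description):

# Hedged sketch only - bridge enslaving the physical NIC directly, no VLAN in between
apiVersion: nmstate.io/v1alpha1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: bridge-on-nic-nncp
spec:
  nodeSelector:
    kubernetes.io/hostname: "master-2.cnvcl2.cnvqe.lab.eng.rdu2.redhat.com"
  desiredState:
    interfaces:
      - name: br10             # bridge name (br10 also appears in the "bridge vlan show" output above)
        type: linux-bridge
        state: up
        bridge:
          options:
            stp:
              enabled: false
          port:
            - name: eno4       # the physical NIC itself is the bridge port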
Quique found what needed to be changed in order for this to work properly: the network name of the secondary interface in the VM spec could not be consumed. The original network name was "eno4.1001-br". When we changed it to "blah" (and in the corresponding interface entry, of course), the VMI managed to start successfully.

So now the issue is that if there's a problem with the naming, it must be reflected to the user as an error/event in the VMI description.

Quique - do we know which component has a problem consuming this name? Is it the virt-launcher pod, the multus CNI, or some other entity?
I have managed to pinpoint the problem in the network name - it's the dot. When I replace the dot with a dash, i.e. "eno4-1001-br" instead of "eno4.1001-br", the VMI starts running successfully. Hilarious...
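In VM-spec terms, the change that makes the VMI start looks roughly like this (field layout assumed; only the dot-to-dash rename is taken from this comment):

spec:
  template:
    spec:
      domain:
        devices:
          interfaces:
            - name: default
              masquerade: {}
            - name: eno4-1001-br       # was "eno4.1001-br"; with the dot the VMI never starts
              bridge: {}
      networks:
        - name: default
          pod: {}
        - name: eno4-1001-br           # must match the interface name above
          multus:
            networkName: eno4.1001-br  # NAD reference; whether this too had to change is not stated here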
And another point worth emphasizing, although it seems quite obvious now: this has nothing to do with the VM being intended to serve as a DHCP server, nor with the bridge's port being a VLAN interface. I verified that by testing the same setup I used in comment#6, where the bridge is configured over a physical node interface (and not over a VLAN interface). The VMI failed to start when the network name was "eno4.1001-br", and ran successfully with "eno4-1001-br".
Another observation about this: the issue also occurs when the default network name includes a dot (for example, "d.efault"). So I assume the problem is not in multus, but elsewhere (k8s, OpenShift, CNV...?).
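A minimal illustration of that case (layout assumed): even a plain pod network hits the problem when its logical name contains a dot:

spec:
  template:
    spec:
      domain:
        devices:
          interfaces:
            - name: d.efault     # default (pod) network, renamed to contain a dot
              masquerade: {}
      networks:
        - name: d.efault         # no multus/NAD involved, yet the VMI still fails to start
          pod: {}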
Closing this one; I opened a more general and more relevant ticket for this issue (https://bugzilla.redhat.com/show_bug.cgi?id=1853911).