Bug 1852786
| Summary: | VM fails to start: launcher-sock: connect: connection refused | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Yossi Segev <ysegev> |
| Component: | Networking | Assignee: | Petr Horáček <phoracek> |
| Status: | CLOSED DEFERRED | QA Contact: | Meni Yakove <myakove> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 2.4.0 | CC: | cnv-qe-bugs, ellorent, myakove, phoracek |
| Target Milestone: | --- | | |
| Target Release: | 2.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-07-05 14:11:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | Attachments: | |
Description
Yossi Segev
2020-07-01 10:18:04 UTC
Created attachment 1699465 [details]
bridge-nncp.yaml
Created attachment 1699466 [details]
bridge-nad.yaml
Created attachment 1699467 [details]
dhcp-server-vm.yaml
Relevant logs from the virt-launcher compute container, gathered with:
oc logs --tail=1000 -n yoss-ns -l kubevirt.io=virt-launcher -c compute
{"component":"virt-launcher","level":"info","msg":"Found nameservers in /etc/resolv.conf: \ufffd\u001e\u0000\n","pos":"converter.go:1564","timestamp":"2020-07-01T11:19:21.072418Z"}
{"component":"virt-launcher","level":"info","msg":"Found search domains in /etc/resolv.conf: yoss-ns.svc.cluster.local svc.cluster.local cluster.local openstacklocal","pos":"converter.go:1565","timestamp":"2020-07-01T11:19:21.072457Z"}
{"component":"virt-launcher","level":"info","msg":"Starting SingleClientDHCPServer","pos":"dhcp.go:64","timestamp":"2020-07-01T11:19:21.072563Z"}
{"component":"virt-launcher","level":"info","msg":"/var/run/kubevirt/container-disks/disk_0.img backing file system does not support direct I/O","pos":"converter.go:168","timestamp":"2020-07-01T11:19:21.086636Z"}
{"component":"virt-launcher","level":"info","msg":"Driver cache mode for /var/run/kubevirt-ephemeral-disks/disk-data/containerdisk/disk.qcow2 set to none","pos":"converter.go:187","timestamp":"2020-07-01T11:19:21.086740Z"}
{"component":"virt-launcher","level":"info","msg":"Driver cache mode for /var/run/kubevirt-ephemeral-disks/cloud-init-data/yoss-ns/dhcp-server-vm/noCloud.iso set to none","pos":"converter.go:187","timestamp":"2020-07-01T11:19:21.086774Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Domain defined.","name":"dhcp-server-vm","namespace":"yoss-ns","pos":"manager.go:1204","timestamp":"2020-07-01T11:19:21.440929Z","uid":"ea6c40dd-6a12-4693-af95-0e97669de101"}
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 0 with reason 0 received","pos":"client.go:337","timestamp":"2020-07-01T11:19:21.441187Z"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x142a95c]
Stack trace
goroutine 54 [running]:
kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap.(*LibvirtDomainManager).buildDevicesMetadata(0xc0019bc070, 0xc001ae0000, 0x19ee4a0, 0xc001b76218, 0xc0000d2eb0, 0xc001b7a490, 0xc001b7a494, 0xc000339310, 0x1541140)
/go/src/kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/manager.go:1615 +0x27c
kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap.(*LibvirtDomainManager).generateCloudInitISO(0xc0019bc070, 0xc001ae0000, 0xc001af0bf0, 0x0, 0x0)
/go/src/kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/manager.go:888 +0x173
kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap.(*LibvirtDomainManager).SyncVMI(0xc0019bc070, 0xc001ae0000, 0xc0019a6400, 0xc000010030, 0x0, 0x0, 0x0)
/go/src/kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/manager.go:1230 +0x1116
kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/cmd-server.(*Launcher).SyncVirtualMachine(0xc0019480a0, 0x19c2940, 0xc000286270, 0xc000392520, 0xc0019480a0, 0xc000286270, 0xc00197db30)
/go/src/kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/cmd-server/server.go:160 +0x9b
kubevirt.io/kubevirt/pkg/handler-launcher-com/cmd/v1._Cmd_SyncVirtualMachine_Handler(0x16c13a0, 0xc0019480a0, 0x19c2940, 0xc000286270, 0xc0001007e0, 0x0, 0x19c2940, 0xc000286270, 0xc000162000, 0xea6)
/go/src/kubevirt.io/kubevirt/pkg/handler-launcher-com/cmd/v1/cmd.pb.go:556 +0x21a
kubevirt.io/kubevirt/vendor/google.golang.org/grpc.(*Server).processUnaryRPC(0xc0002df200, 0x19ddbe0, 0xc00012d200, 0xc001adc000, 0xc0002bc4e0, 0x27978a0, 0x0, 0x0, 0x0)
/go/src/kubevirt.io/kubevirt/vendor/google.golang.org/grpc/server.go:1024 +0x4f4
kubevirt.io/kubevirt/vendor/google.golang.org/grpc.(*Server).handleStream(0xc0002df200, 0x19ddbe0, 0xc00012d200, 0xc001adc000, 0x0)
/go/src/kubevirt.io/kubevirt/vendor/google.golang.org/grpc/server.go:1313 +0xd97
kubevirt.io/kubevirt/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc00192cc30, 0xc0002df200, 0x19ddbe0, 0xc00012d200, 0xc001adc000)
/go/src/kubevirt.io/kubevirt/vendor/google.golang.org/grpc/server.go:722 +0xbb
created by kubevirt.io/kubevirt/vendor/google.golang.org/grpc.(*Server).serveStreams.func1
/go/src/kubevirt.io/kubevirt/vendor/google.golang.org/grpc/server.go:720 +0xa1
{"component":"virt-launcher","level":"error","msg":"dirty virt-launcher shutdown","pos":"virt-launcher.go:510","reason":"exit status 2","timestamp":"2020-07-01T11:19:21.450017Z"}
I also tested a similar scenario, except that this time I configured the bridge over one of the node's physical interfaces rather than over a VLAN interface.
This time the VMI started successfully, with the secondary NIC seen when entering the VM console.
[fedora@multus-vm ~]$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc fq_codel state UP group default qlen 1000
link/ether 02:7f:3d:00:01:d4 brd ff:ff:ff:ff:ff:ff
altname enp1s0
inet 10.0.2.2/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
valid_lft 86312782sec preferred_lft 86312782sec
inet6 fe80::7f:3dff:fe00:1d4/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 02:7f:3d:00:01:d5 brd ff:ff:ff:ff:ff:ff
altname enp2s0
inet 10.200.1.1/24 brd 10.200.1.255 scope global noprefixroute eth1
valid_lft forever preferred_lft forever
inet6 fe80::7f:3dff:fe00:1d5/64 scope link
valid_lft forever preferred_lft forever
Quique found what needed to be changed for this to work properly: the network name of the secondary interface in the VM spec could not be consumed. The original network name was "eno4.1001-br". When we changed it to "blah" (and in the corresponding interface entry, of course), the VMI managed to start successfully. So the issue now is that when there is a problem with the naming, it must be surfaced to the user as an error/event in the VMI description. Quique - do we know which component fails to consume this name? Is it the virt-launcher pod, the Multus CNI, or some other entity?

I have managed to pinpoint the problem in the network name - it's the dot. When I replace the dot with a dash, i.e. "eno4-1001-br" instead of "eno4.1001-br", the VMI starts running successfully. Hilarious...

Another point worth emphasizing, although it seems quite obvious now: this has nothing to do with the VM being intended to serve as a DHCP server, nor with the bridge's port being a VLAN interface. I verified that by testing the same setup I used in comment #6, where the bridge is configured over a physical node interface (and not over a VLAN interface). The VMI failed to start when the network name was "eno4.1001-br", and ran successfully with "eno4-1001-br".

One more finding: this issue also occurs when the default network name includes a dot (for example, "d.efault"). So I assume the problem is not in Multus, but elsewhere (Kubernetes, OpenShift, CNV...?).

Closing this one; I opened a more general and relevant ticket for this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1853911
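A plausible explanation (an assumption, not confirmed by this bug) is that the network name ends up somewhere that is validated or consumed as an RFC 1123 DNS label, which permits lowercase alphanumerics and dashes but not dots; DNS-1123 subdomains, by contrast, do allow dots, which would explain why the name passed initial validation but broke later. The sketch below illustrates the label rule with the names from this report; `is_valid_dns1123_label` is a hypothetical helper, not a KubeVirt or Kubernetes API:

```python
import re

# RFC 1123 DNS label: lowercase alphanumerics and dashes only, must start
# and end with an alphanumeric, at most 63 characters. Note: no dots.
DNS1123_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")

def is_valid_dns1123_label(name: str) -> bool:
    """Return True if name is a valid RFC 1123 DNS label."""
    return len(name) <= 63 and bool(DNS1123_LABEL.match(name))

# The names from this bug report:
print(is_valid_dns1123_label("eno4.1001-br"))  # False - the dot is rejected
print(is_valid_dns1123_label("eno4-1001-br"))  # True - dash instead of dot
print(is_valid_dns1123_label("d.efault"))      # False - same failure mode
```

This matches the observed behavior exactly: every failing name contains a dot, and replacing the dot with a dash makes the same VMI start successfully.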