Bug 1834253 - VMs are stuck in Starting state
Summary: VMs are stuck in Starting state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Networking
Version: 2.4.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 2.4.0
Assignee: Alona Kaplan
QA Contact: Ofir Nash
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-11 12:01 UTC by Radim Hrazdil
Modified: 2020-07-28 19:10 UTC (History)
8 users

Fixed In Version: virt-operator-container-v2.4.0-41
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-28 19:10:05 UTC
Target Upstream Version:
Embargoed:


Attachments
VM manifests (2.65 KB, application/zip)
2020-05-11 12:01 UTC, Radim Hrazdil
no flags Details
vmi (4.03 KB, text/plain)
2020-05-11 14:18 UTC, Radim Hrazdil
no flags Details
virt-handler-qpb9t.log (13.58 MB, text/plain)
2020-05-17 19:14 UTC, Lukas Bednar
no flags Details
virt-handler-n6k5j.log (569.68 KB, application/gzip)
2020-05-17 19:16 UTC, Lukas Bednar
no flags Details
virt-handler-h4knk.log (10.98 MB, text/plain)
2020-05-17 19:17 UTC, Lukas Bednar
no flags Details
virt-handler.log (12.63 KB, text/plain)
2020-05-21 17:05 UTC, Petr Horáček
no flags Details
virt-launcher.log (2.65 KB, text/plain)
2020-05-21 17:06 UTC, Petr Horáček
no flags Details
virt-handler.prev.log (4.24 KB, text/plain)
2020-05-21 17:09 UTC, Petr Horáček
no flags Details


Links
Red Hat Product Errata RHSA-2020:3194 (last updated 2020-07-28 19:10:21 UTC)

Description Radim Hrazdil 2020-05-11 12:01:37 UTC
Created attachment 1687282 [details]
VM manifests

Description of problem:
VMs created with any provisioning method always get stuck in the Starting state and never become Running.


Version-Release number of selected component (if applicable):
CNV 2.4
4.5.0-0.nightly-2020-05-04-113741

How reproducible:
100%

Steps to Reproduce:
1. create any VM (example VM will do)
2. Start the VM
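
For reference, the reproduction can be scripted along these lines (a sketch; the manifest file name is a placeholder for the attached VM manifests):

$ oc apply -f vm-example.yaml      # any of the attached example VM manifests
$ virtctl start vm-example
$ oc get vmi vm-example -w         # the phase stays Scheduled and never reaches Running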

Actual results:
VM is in Starting

Expected results:
VM becomes Running

Additional info:

Comment 1 Radim Hrazdil 2020-05-11 14:16:36 UTC
After discussion with Tomas, I'm moving this to CNV Virt.
The issue seems similar to BZ1832179, except that the VMIs are in the Scheduled phase, not Running, and I'm hitting this issue without setting runStrategy=Always.

$ oc get vm,vmi -A
NAMESPACE    NAME                                                    AGE    RUNNING   VOLUME
default      virtualmachine.kubevirt.io/test                         3m7s   true      
default      virtualmachine.kubevirt.io/vm-example                   130m   true      
test-vfwta   virtualmachine.kubevirt.io/vm-list-actions-test-vfwta   120m   true      

NAMESPACE    NAME                                                            AGE    PHASE       IP             NODENAME
default      virtualmachineinstance.kubevirt.io/test                         3m5s   Scheduled   10.128.2.106   rhrazdil24-bs7br-worker-hl2cm
default      virtualmachineinstance.kubevirt.io/vm-example                   107s   Scheduled   10.129.2.72    rhrazdil24-bs7br-worker-dljvg
test-vfwta   virtualmachineinstance.kubevirt.io/vm-list-actions-test-vfwta   107s   Scheduled   10.129.2.71    rhrazdil24-bs7br-worker-dljvg

Indeed, the VM doesn't seem to be running:
virtctl vnc vm-example
Can't access VMI vm-example: Unable to connect to VirtualMachineInstance because phase is Scheduled instead of Running


The status in the UI alternates between Starting and Stopping, depending on the virt-launcher pod status - the pod intermittently goes into the Error state and is recreated, which causes the VM status changes.

Comment 2 Radim Hrazdil 2020-05-11 14:18:32 UTC
Created attachment 1687354 [details]
vmi

Comment 3 sgott 2020-05-11 15:04:58 UTC
This looks like your VMs are really stuck in scheduling. Please provide must-gather output in order to help us diagnose the root cause.
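
For reference, must-gather output is typically collected along these lines (the CNV-specific image reference is left as a placeholder):

$ oc adm must-gather                                   # default cluster collection
$ oc adm must-gather --image=<cnv-must-gather-image>   # CNV-specific collection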

Comment 4 Radim Hrazdil 2020-05-11 16:12:24 UTC
Running must-gather failed; further inspection revealed that two master nodes are in the NotReady state:

NAME                            STATUS     ROLES    AGE   VERSION
rhrazdil24-bs7br-master-0       NotReady   master   31h   v1.18.0-rc.1
rhrazdil24-bs7br-master-1       Ready      master   31h   v1.18.0-rc.1
rhrazdil24-bs7br-master-2       NotReady   master   31h   v1.18.0-rc.1
rhrazdil24-bs7br-worker-dljvg   Ready      worker   31h   v1.18.0-rc.1
rhrazdil24-bs7br-worker-hl2cm   Ready      worker   31h   v1.18.0-rc.1
rhrazdil24-bs7br-worker-xvsd9   Ready      worker   31h   v1.18.0-rc.1

Moreover, several components were crashlooping: kube-apiserver, virt-handler, and nmstate-handler.
Apparently the cluster is not in a healthy state, which may have been caused by the PSI networking issue.

Apologies for the false positive; I confused this with BZ1832179 because of the similarity in behaviour.
Closing as notabug.

Comment 5 sgott 2020-05-12 16:23:30 UTC
Re-opening. This BZ is reproducible. Radim re-deployed the cluster and ran into the same issue.

virt-handler will crash after a VMI is deployed with this traceback:

{"component":"virt-handler","level":"info","msg":"Processing event default/vm-example","pos":"vm.go:860","timestamp":"2020-05-12T15:39:41.279009Z"}
{"component":"virt-handler","kind":"","level":"info","msg":"VMI is in phase: Scheduled\n","name":"vm-example","namespace":"default","pos":"vm.go:862","timestamp":"2020-05-12T15:39:41.279337Z
","uid":"15d575dd-2e81-49ec-ba43-95baae6609cf"}
{"component":"virt-handler","level":"info","msg":"Domain does not exist","pos":"vm.go:869","timestamp":"2020-05-12T15:39:41.279395Z"}
{"component":"virt-handler","level":"error","msg":"failed to create nat rules for vm error: failed to load nftable ipv4-nat error ","pos":"podinterface.go:582","timestamp":"2020-05-12T15:39:
41.510071Z"}
{"component":"virt-handler","level":"fatal","msg":"failed to prepare pod networking","pos":"podinterface.go:109","reason":"failed to load nftable ipv4-nat error ","timestamp":"2020-05-12T15:
39:41.510140Z"}
panic: failed to load nftable ipv4-nat error 

goroutine 668 [running, locked to thread]:
kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/network.(*PodInterface).PlugPhase1(0x2c3e9b8, 0xc000ae2400, 0xc000d16990, 0xc0011c8240, 0x19b97a3, 0x4, 0x3b196d, 0xc000c9af78, 0xc000d16990)
        /go/src/kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/network/podinterface.go:110 +0x903
kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/network.SetupNetworkInterfacesPhase1(0xc000ae2400, 0x3b196d, 0xc001333ce0, 0x21)
        /go/src/kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/network/network.go:111 +0x267
kubevirt.io/kubevirt/pkg/virt-handler.(*VirtualMachineController).setPodNetworkPhase1.func1(0x0, 0x0)
        /go/src/kubevirt.io/kubevirt/pkg/virt-handler/vm.go:277 +0x3a
kubevirt.io/kubevirt/pkg/virt-handler/isolation.(*realIsolationResult).DoNetNS.func1(0x1c82320, 0xc000ccc1c0, 0x0, 0x0)
        /go/src/kubevirt.io/kubevirt/pkg/virt-handler/isolation/isolation.go:254 +0x26
github.com/containernetworking/plugins/pkg/ns.(*netNS).Do.func1(0x1c82320, 0xc000ccc1c0, 0x0, 0x0)
        /go/src/kubevirt.io/kubevirt/vendor/github.com/containernetworking/plugins/pkg/ns/ns_linux.go:183 +0x2c1
github.com/containernetworking/plugins/pkg/ns.(*netNS).Do.func2(0xc0005a0b70, 0xc000e15f60, 0x1c82320, 0xc000ccc1c0, 0xc000ccc1d0)
        /go/src/kubevirt.io/kubevirt/vendor/github.com/containernetworking/plugins/pkg/ns/ns_linux.go:200 +0x76
created by github.com/containernetworking/plugins/pkg/ns.(*netNS).Do
        /go/src/kubevirt.io/kubevirt/vendor/github.com/containernetworking/plugins/pkg/ns/ns_linux.go:197 +0x224

Comment 6 Guohua Ouyang 2020-05-13 05:35:46 UTC
I see the same issue on 2.4; the logs are a bit different from those above but very similar to bz 1743250:

$ oc logs virt-launcher-vm-cirros-hdg5h -c compute
{"component":"virt-launcher","level":"info","msg":"Collected all requested hook sidecar sockets","pos":"manager.go:68","timestamp":"2020-05-13T05:16:30.676549Z"}
{"component":"virt-launcher","level":"info","msg":"Sorted all collected sidecar sockets per hook point based on their priority and name: map[]","pos":"manager.go:71","timestamp":"2020-05-13T05:16:30.676865Z"}
{"component":"virt-launcher","level":"info","msg":"Watchdog file created at /var/run/kubevirt/watchdog-files/default_vm-cirros","pos":"virt-launcher.go:155","timestamp":"2020-05-13T05:16:30.680029Z"}
{"component":"virt-launcher","level":"info","msg":"Connecting to libvirt daemon: qemu:///system","pos":"libvirt.go:355","timestamp":"2020-05-13T05:16:30.682169Z"}
{"component":"virt-launcher","level":"info","msg":"Connecting to libvirt daemon failed: virError(Code=38, Domain=7, Message='Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory')","pos":"libvirt.go:363","timestamp":"2020-05-13T05:16:30.683294Z"}
{"component":"virt-launcher","level":"info","msg":"libvirt version: 6.0.0, package: 19.module+el8.2.1+6538+c148631f (Red Hat, Inc. \u003chttp://bugzilla.redhat.com/bugzilla\u003e, 2020-05-07-17:48:29, )","subcomponent":"libvirt","thread":"44","timestamp":"2020-05-13T05:16:30.734000Z"}
{"component":"virt-launcher","level":"info","msg":"hostname: vm-cirros","subcomponent":"libvirt","thread":"44","timestamp":"2020-05-13T05:16:30.734000Z"}
{"component":"virt-launcher","level":"error","msg":"internal error: Child process (/usr/sbin/dmidecode -q -t 0,1,2,3,4,17) unexpected exit status 1: /dev/mem: No such file or directory","pos":"virCommandWait:2713","subcomponent":"libvirt","thread":"44","timestamp":"2020-05-13T05:16:30.734000Z"}
{"component":"virt-launcher","level":"info","msg":"Connected to libvirt daemon","pos":"libvirt.go:371","timestamp":"2020-05-13T05:16:31.194884Z"}
{"component":"virt-launcher","level":"info","msg":"Registered libvirt event notify callback","pos":"client.go:315","timestamp":"2020-05-13T05:16:31.230093Z"}
{"component":"virt-launcher","level":"info","msg":"Marked as ready","pos":"virt-launcher.go:73","timestamp":"2020-05-13T05:16:31.230580Z"}
panic: timed out waiting for domain to be defined

goroutine 1 [running]:
main.waitForDomainUUID(0x45d964b800, 0xc0000f2fc0, 0xc001a7a000, 0x19ca940, 0xc0000c4370, 0xc0000f2fc0)
	/go/src/kubevirt.io/kubevirt/cmd/virt-launcher/virt-launcher.go:238 +0x3de
main.main()
	/go/src/kubevirt.io/kubevirt/cmd/virt-launcher/virt-launcher.go:449 +0x115f
{"component":"virt-launcher","level":"error","msg":"dirty virt-launcher shutdown","pos":"virt-launcher.go:531","reason":"exit status 2","timestamp":"2020-05-13T05:21:31.250249Z"}

Comment 7 sgott 2020-05-13 12:30:02 UTC
The traceback in Comment #6 is an after-effect of the traceback in Comment #5. Because virt-handler panics and CrashLoops, it never sends the domain to virt-launcher, which results in the observed timeout. It is worth noting that a second panic can occur in virt-handler, but it is likely masking the first one. No matter what happens with a VMI, virt-handler should not panic.

------

{"component":"virt-handler","level":"info","msg":"Starting device plugin controller","pos":"device_controller.go:92","timestamp":"2020-05-12T15:31:05.598088Z"}
{"component":"virt-handler","level":"info","msg":"kvm device plugin started","pos":"generic_device.go:157","timestamp":"2020-05-12T15:31:05.609547Z"}
{"component":"virt-handler","level":"info","msg":"vhost-net device plugin started","pos":"generic_device.go:157","timestamp":"2020-05-12T15:31:05.610529Z"}
{"component":"virt-handler","level":"info","msg":"tun device plugin started","pos":"generic_device.go:157","timestamp":"2020-05-12T15:31:05.611529Z"}
{"component":"virt-handler","level":"info","msg":"Processing event default/vm-example","pos":"vm.go:860","timestamp":"2020-05-12T15:31:05.698813Z"}
{"component":"virt-handler","kind":"VirtualMachineInstance","level":"info","msg":"VMI is in phase: Scheduled\n","name":"vm-example","namespace":"default","pos":"vm.go:862","timestamp":"2020-
05-12T15:31:05.698987Z","uid":"46331abb-248b-4567-8b05-bfb3304350f8"}
{"component":"virt-handler","level":"info","msg":"Domain does not exist","pos":"vm.go:869","timestamp":"2020-05-12T15:31:05.699085Z"}
{"component":"virt-handler","level":"info","msg":"Updating cluster config to resource version '40015'","pos":"config-map.go:443","timestamp":"2020-05-12T15:31:05.748644Z"}
{"component":"virt-handler","level":"error","msg":"failed to create an interface: k6t-eth0-nic","pos":"podinterface.go:558","reason":"file exists","timestamp":"2020-05-12T15:31:05.765450Z"}
{"component":"virt-handler","level":"fatal","msg":"failed to prepare pod networking","pos":"podinterface.go:109","reason":"file exists","timestamp":"2020-05-12T15:31:05.765517Z"}
panic: file exists

goroutine 611 [running, locked to thread]:
kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/network.(*PodInterface).PlugPhase1(0x2c3e9b8, 0xc00244a000, 0xc00221d200, 0xc0027cc180, 0x19b97a3, 0x4, 0x39b930, 0xc00240ef78, 0xc00221d200)
        /go/src/kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/network/podinterface.go:110 +0x903
kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/network.SetupNetworkInterfacesPhase1(0xc00244a000, 0x39b930, 0xc00258d380, 0x21)
        /go/src/kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/network/network.go:111 +0x267
kubevirt.io/kubevirt/pkg/virt-handler.(*VirtualMachineController).setPodNetworkPhase1.func1(0x0, 0x0)
        /go/src/kubevirt.io/kubevirt/pkg/virt-handler/vm.go:277 +0x3a
kubevirt.io/kubevirt/pkg/virt-handler/isolation.(*realIsolationResult).DoNetNS.func1(0x1c82320, 0xc001a6fa10, 0x0, 0x0)
        /go/src/kubevirt.io/kubevirt/pkg/virt-handler/isolation/isolation.go:254 +0x26
github.com/containernetworking/plugins/pkg/ns.(*netNS).Do.func1(0x1c82320, 0xc001a6fa10, 0x0, 0x0)
        /go/src/kubevirt.io/kubevirt/vendor/github.com/containernetworking/plugins/pkg/ns/ns_linux.go:183 +0x2c1
github.com/containernetworking/plugins/pkg/ns.(*netNS).Do.func2(0xc0010172b0, 0xc001a729c0, 0x1c82320, 0xc001a6fa10, 0xc001a6fa20)
        /go/src/kubevirt.io/kubevirt/vendor/github.com/containernetworking/plugins/pkg/ns/ns_linux.go:200 +0x76
created by github.com/containernetworking/plugins/pkg/ns.(*netNS).Do
        /go/src/kubevirt.io/kubevirt/vendor/github.com/containernetworking/plugins/pkg/ns/ns_linux.go:197 +0x224
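
To retrieve the earlier, masked panic from a crashlooping virt-handler, the previous container logs are the usual place to look (namespace, label and pod name below are assumptions):

$ oc get pods -n openshift-cnv -l kubevirt.io=virt-handler
$ oc logs -n openshift-cnv <virt-handler-pod> --previous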

Comment 8 Petr Horáček 2020-05-14 15:08:52 UTC
Thanks, Stu, for the close analysis. I think comment #5 takes us really close to the source of the issue. Appreciated!

Comment 9 Alona Kaplan 2020-05-14 19:30:01 UTC
(In reply to sgott from comment #5)
> Re-opening. This BZ is reproducible. Radim re-deployed the cluster and ran
> into the same issue.
> 
> virt-handler will crash after a VMI is deployed with this traceback:
> 
> {"component":"virt-handler","level":"info","msg":"Processing event
> default/vm-example","pos":"vm.go:860","timestamp":"2020-05-12T15:39:41.
> 279009Z"}
> {"component":"virt-handler","kind":"","level":"info","msg":"VMI is in phase:
> Scheduled\n","name":"vm-example","namespace":"default","pos":"vm.go:862",
> "timestamp":"2020-05-12T15:39:41.279337Z
> ","uid":"15d575dd-2e81-49ec-ba43-95baae6609cf"}
> {"component":"virt-handler","level":"info","msg":"Domain does not
> exist","pos":"vm.go:869","timestamp":"2020-05-12T15:39:41.279395Z"}
> {"component":"virt-handler","level":"error","msg":"failed to create nat
> rules for vm error: failed to load nftable ipv4-nat error


There is something strange here. When ipv6 support was introduced, this error message was changed from "failed to create nat rules for vm error" to "failed to create ipv4 nat rules for vm error".
Are you sure this is a nightly build of the 2.4 cluster? (A quick way to check the image actually running is sketched after the quoted trace below.)

> ","pos":"podinterface.go:582","timestamp":"2020-05-12T15:39:
> 41.510071Z"}
> {"component":"virt-handler","level":"fatal","msg":"failed to prepare pod
> networking","pos":"podinterface.go:109","reason":"failed to load nftable
> ipv4-nat error ","timestamp":"2020-05-12T15:
> 39:41.510140Z"}
> panic: failed to load nftable ipv4-nat error 
> 
> goroutine 668 [running, locked to thread]:
> kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/network.(*PodInterface).
> PlugPhase1(0x2c3e9b8, 0xc000ae2400, 0xc000d16990, 0xc0011c8240, 0x19b97a3,
> 0x4, 0x3b196d, 0xc000c9af78, 0xc000d16990)
>        
> /go/src/kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/network/podinterface.
> go:110 +0x903
> kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/network.
> SetupNetworkInterfacesPhase1(0xc000ae2400, 0x3b196d, 0xc001333ce0, 0x21)
>        
> /go/src/kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/network/network.go:
> 111 +0x267
> kubevirt.io/kubevirt/pkg/virt-handler.(*VirtualMachineController).
> setPodNetworkPhase1.func1(0x0, 0x0)
>         /go/src/kubevirt.io/kubevirt/pkg/virt-handler/vm.go:277 +0x3a
> kubevirt.io/kubevirt/pkg/virt-handler/isolation.(*realIsolationResult).
> DoNetNS.func1(0x1c82320, 0xc000ccc1c0, 0x0, 0x0)
>        
> /go/src/kubevirt.io/kubevirt/pkg/virt-handler/isolation/isolation.go:254
> +0x26
> github.com/containernetworking/plugins/pkg/ns.(*netNS).Do.func1(0x1c82320,
> 0xc000ccc1c0, 0x0, 0x0)
>        
> /go/src/kubevirt.io/kubevirt/vendor/github.com/containernetworking/plugins/
> pkg/ns/ns_linux.go:183 +0x2c1
> github.com/containernetworking/plugins/pkg/ns.(*netNS).Do.func2(0xc0005a0b70,
> 0xc000e15f60, 0x1c82320, 0xc000ccc1c0, 0xc000ccc1d0)
>        
> /go/src/kubevirt.io/kubevirt/vendor/github.com/containernetworking/plugins/
> pkg/ns/ns_linux.go:200 +0x76
> created by github.com/containernetworking/plugins/pkg/ns.(*netNS).Do
>        
> /go/src/kubevirt.io/kubevirt/vendor/github.com/containernetworking/plugins/
> pkg/ns/ns_linux.go:197 +0x224

Comment 14 Petr Horáček 2020-05-16 08:19:50 UTC
I collected virt-handler logs from a recent run. It now seems to panic for a different reason, though still in networking. It looks like an idempotency issue in our code, and I don't see how it could be related to the previous nftables issue. Maybe we failed due to the nftables issue, that restarted virt-handler, and on the second run we hit this problem? A quick check for a leftover interface is sketched after the traceback below.

{"component":"virt-handler","level":"info","msg":"Starting device plugin controller","pos":"device_controller.go:92","timestamp":"2020-05-16T01:51:03.470937Z"}
{"component":"virt-handler","level":"info","msg":"vhost-net device plugin started","pos":"generic_device.go:157","timestamp":"2020-05-16T01:51:03.487833Z"}
{"component":"virt-handler","level":"info","msg":"kvm device plugin started","pos":"generic_device.go:157","timestamp":"2020-05-16T01:51:03.488828Z"}
{"component":"virt-handler","level":"info","msg":"tun device plugin started","pos":"generic_device.go:157","timestamp":"2020-05-16T01:51:03.489232Z"}
{"component":"virt-handler","level":"info","msg":"Processing event kubevirt-test-alternative/testvmijfjpzl6hknw9wssq8r799ht7rsrjf4pgzsshrck2tdmhgkrv","pos":"vm.go:860","timestamp":"2020-05-16T01:51:03.572725Z"}
{"component":"virt-handler","kind":"VirtualMachineInstance","level":"info","msg":"VMI is in phase: Scheduled\n","name":"testvmijfjpzl6hknw9wssq8r799ht7rsrjf4pgzsshrck2tdmhgkrv","namespace":"kubevirt-test-alternative","pos":"vm.go:862","timestamp":"2020-05-16T01:51:03.572857Z","uid":"9a277791-de2b-4fd1-afaa-e24f03ee8e4f"}
{"component":"virt-handler","level":"info","msg":"Domain does not exist","pos":"vm.go:869","timestamp":"2020-05-16T01:51:03.572958Z"}
{"component":"virt-handler","level":"info","msg":"Processing event kubevirt-test-default/testvmi6zqfspm9c7h7c4f8579b8222ghrb4fslzk57kr8w5glqc77w","pos":"vm.go:860","timestamp":"2020-05-16T01:51:03.572602Z"}
{"component":"virt-handler","kind":"VirtualMachineInstance","level":"info","msg":"VMI is in phase: Scheduled\n","name":"testvmi6zqfspm9c7h7c4f8579b8222ghrb4fslzk57kr8w5glqc77w","namespace":"kubevirt-test-default","pos":"vm.go:862","timestamp":"2020-05-16T01:51:03.573823Z","uid":"6a565390-12ce-44b3-b669-3471d3bed4f5"}
{"component":"virt-handler","level":"info","msg":"Domain does not exist","pos":"vm.go:869","timestamp":"2020-05-16T01:51:03.573870Z"}
{"component":"virt-handler","level":"error","msg":"failed to create an interface: k6t-eth0-nic","pos":"podinterface.go:558","reason":"file exists","timestamp":"2020-05-16T01:51:03.629576Z"}
{"component":"virt-handler","level":"fatal","msg":"failed to prepare pod networking","pos":"podinterface.go:109","reason":"file exists","timestamp":"2020-05-16T01:51:03.630434Z"}
panic: file exists

goroutine 659 [running, locked to thread]:
kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/network.(*PodInterface).PlugPhase1(0x2c3e9b8, 0xc0015fc000, 0xc000ac9d40, 0xc001c3e9f0, 0x19b97a3, 0x4, 0xe5d3f, 0xc001ae2348, 0xc000ac9d40)
        /go/src/kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/network/podinterface.go:110 +0x903
kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/network.SetupNetworkInterfacesPhase1(0xc0015fc000, 0xe5d3f, 0xc000a51760, 0x1f)
        /go/src/kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/network/network.go:111 +0x267
kubevirt.io/kubevirt/pkg/virt-handler.(*VirtualMachineController).setPodNetworkPhase1.func1(0x0, 0x0)
        /go/src/kubevirt.io/kubevirt/pkg/virt-handler/vm.go:277 +0x3a
kubevirt.io/kubevirt/pkg/virt-handler/isolation.(*realIsolationResult).DoNetNS.func1(0x1c82320, 0xc0013f36a0, 0x0, 0x0)
        /go/src/kubevirt.io/kubevirt/pkg/virt-handler/isolation/isolation.go:254 +0x26
github.com/containernetworking/plugins/pkg/ns.(*netNS).Do.func1(0x1c82320, 0xc0013f36a0, 0x0, 0x0)
        /go/src/kubevirt.io/kubevirt/vendor/github.com/containernetworking/plugins/pkg/ns/ns_linux.go:183 +0x2c1
github.com/containernetworking/plugins/pkg/ns.(*netNS).Do.func2(0xc00112a380, 0xc000c57880, 0x1c82320, 0xc0013f36a0, 0xc0013f36b0)
        /go/src/kubevirt.io/kubevirt/vendor/github.com/containernetworking/plugins/pkg/ns/ns_linux.go:200 +0x76
created by github.com/containernetworking/plugins/pkg/ns.(*netNS).Do
        /go/src/kubevirt.io/kubevirt/vendor/github.com/containernetworking/plugins/pkg/ns/ns_linux.go:197 +0x224
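
A quick way to check for a leftover interface from a previous attempt (a sketch; namespace and pod name are assumptions, and it assumes the iproute tools are available in the compute container):

$ oc exec -n kubevirt-test-default <virt-launcher-pod> -c compute -- ip link show
  # if k6t-eth0-nic is already listed, the previous PlugPhase1 run left it behind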

Comment 15 Alona Kaplan 2020-05-17 12:39:39 UTC
Petr, the log you attached seems to be the log from the second run. It fails since "k6t-eth0-nic" already exists (was created on the first run).

Anyway, PR https://github.com/kubevirt/kubevirt/pull/3128 removes the 'panic' from those two failures (the nftables error and the already existing nic).


Looking at the log from the first run (attached below), the failed nftables load caused the panic. But virt-handler wasn't 'healthy' even before this error.
Since the panic is already gone from the code, I think we should put our effort into understanding why we have those "context deadline exceeded" errors. The nftables load is not the real issue here.

BTW, Radim, do you have the same failures when running a VM with bridge binding (there we don't use iptables/nftables)?

{"component":"virt-handler","level":"error","msg":"failed to connect to cmd client socket","pos":"prometheus.go:469","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:24.568976Z"}
{"component":"virt-handler","level":"info","msg":"failed to dial cmd socket: /var/run/kubevirt/sockets/3895ca29-d8a3-48c8-8c2b-51bc008eedf0_sock","pos":"client.go:135","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:24.
824147Z"}
{"component":"virt-handler","level":"error","msg":"failed to connect to cmd client socket","pos":"cache.go:131","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:24.824497Z"}
{"component":"virt-handler","level":"info","msg":"failed to dial cmd socket: /var/run/kubevirt/sockets/49c851a2-d732-4560-ad4d-1041da663b32_sock","pos":"client.go:135","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:26.
825351Z"}
{"component":"virt-handler","level":"error","msg":"failed to connect to cmd client socket","pos":"cache.go:131","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:26.825498Z"}
{"component":"virt-handler","level":"info","msg":"failed to dial cmd socket: /var/run/kubevirt/sockets/590e13e6-1f35-4517-8066-6b67d78e5493_sock","pos":"client.go:135","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:28.
825771Z"}
{"component":"virt-handler","level":"error","msg":"failed to connect to cmd client socket","pos":"cache.go:131","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:28.826051Z"}
{"component":"virt-handler","level":"info","msg":"failed to dial cmd socket: /var/run/kubevirt/sockets/684cac8a-af3f-4777-beb5-bb9c28914abf_sock","pos":"client.go:135","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:30.
826483Z"}
{"component":"virt-handler","level":"error","msg":"failed to connect to cmd client socket","pos":"cache.go:131","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:30.826685Z"}
{"component":"virt-handler","level":"info","msg":"failed to dial cmd socket: /var/run/kubevirt/sockets/86145cd0-b69c-4923-ab5e-355ca3d9a599_sock","pos":"client.go:135","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:32.
828264Z"}
{"component":"virt-handler","level":"error","msg":"failed to connect to cmd client socket","pos":"cache.go:131","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:32.829149Z"}
{"component":"virt-handler","level":"info","msg":"failed to dial cmd socket: /var/run/kubevirt/sockets/921d19ff-e436-40b2-bc46-f9e2cf6e1099_sock","pos":"client.go:135","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:34.
829939Z"}
{"component":"virt-handler","level":"error","msg":"failed to connect to cmd client socket","pos":"cache.go:131","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:34.830359Z"}
{"component":"virt-handler","kind":"Domain","level":"warning","msg":"detected expired watchdog for domain","name":"vm-example","namespace":"default","pos":"cache.go:114","timestamp":"2020-05-14T20:20:35.823281Z","uid":""}
{"component":"virt-handler","kind":"Domain","level":"warning","msg":"detected expired watchdog for domain","name":"vmi-masquerade","namespace":"default","pos":"cache.go:114","timestamp":"2020-05-14T20:20:35.823524Z","uid":""}
{"component":"virt-handler","level":"info","msg":"failed to dial cmd socket: /var/run/kubevirt/sockets/9e6ff55c-81fc-4284-a015-6ecac35ade0a_sock","pos":"client.go:135","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:36.
831055Z"}
{"component":"virt-handler","level":"error","msg":"failed to connect to cmd client socket","pos":"cache.go:131","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:36.831300Z"}
{"component":"virt-handler","level":"info","msg":"failed to dial cmd socket: /var/run/kubevirt/sockets/9fc7297f-00c7-4ee5-bf1a-7fe5d71649d3_sock","pos":"client.go:135","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:38.
832277Z"}
{"component":"virt-handler","level":"error","msg":"failed to connect to cmd client socket","pos":"cache.go:131","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:38.832391Z"}
{"component":"virt-handler","level":"info","msg":"failed to dial cmd socket: /var/run/kubevirt/sockets/daaf555e-b65c-49ce-9dd3-65fdb2052934_sock","pos":"client.go:135","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:40.
832680Z"}
{"component":"virt-handler","level":"error","msg":"failed to connect to cmd client socket","pos":"cache.go:131","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:40.832844Z"}
{"component":"virt-handler","level":"info","msg":"Starting device plugin controller","pos":"device_controller.go:92","timestamp":"2020-05-14T20:20:40.921636Z"}
{"component":"virt-handler","level":"info","msg":"kvm device plugin started","pos":"generic_device.go:157","timestamp":"2020-05-14T20:20:40.933315Z"}
{"component":"virt-handler","level":"info","msg":"tun device plugin started","pos":"generic_device.go:157","timestamp":"2020-05-14T20:20:40.934935Z"}
{"component":"virt-handler","level":"info","msg":"vhost-net device plugin started","pos":"generic_device.go:157","timestamp":"2020-05-14T20:20:40.938356Z"}
{"component":"virt-handler","level":"info","msg":"Processing event default/vmi-masquerade","pos":"vm.go:860","timestamp":"2020-05-14T20:20:41.022274Z"}
{"component":"virt-handler","kind":"VirtualMachineInstance","level":"info","msg":"VMI is in phase: Failed\n","name":"vmi-masquerade","namespace":"default","pos":"vm.go:862","timestamp":"2020-05-14T20:20:41.022463Z","uid":"9e6ff55c-81fc-42
84-a015-6ecac35ade0a"}
{"component":"virt-handler","level":"info","msg":"Domain does not exist","pos":"vm.go:869","timestamp":"2020-05-14T20:20:41.022636Z"}
{"component":"virt-handler","level":"info","msg":"Updating cluster config to resource version '353569'","pos":"config-map.go:443","timestamp":"2020-05-14T20:20:41.145806Z"}
{"component":"virt-handler","level":"info","msg":"failed to dial cmd socket: /var/run/kubevirt/sockets/9e6ff55c-81fc-4284-a015-6ecac35ade0a_sock","pos":"client.go:135","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:42.
923121Z"}
{"component":"virt-handler","level":"error","msg":"failed to connect to cmd client socket","pos":"prometheus.go:469","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:42.923216Z"}
{"component":"virt-handler","kind":"Domain","level":"warning","msg":"detected expired watchdog for domain","name":"vm-example","namespace":"default","pos":"cache.go:114","timestamp":"2020-05-14T20:20:50.823080Z","uid":""}
{"component":"virt-handler","kind":"Domain","level":"warning","msg":"detected expired watchdog for domain","name":"vmi-masquerade","namespace":"default","pos":"cache.go:114","timestamp":"2020-05-14T20:20:50.823253Z","uid":""}
{"component":"virt-handler","level":"info","msg":"failed to dial cmd socket: /var/run/kubevirt/sockets/9e6ff55c-81fc-4284-a015-6ecac35ade0a_sock","pos":"client.go:135","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:54.508496Z"}
{"component":"virt-handler","level":"error","msg":"failed to connect to cmd client socket","pos":"prometheus.go:469","reason":"context deadline exceeded","timestamp":"2020-05-14T20:20:54.509665Z"}
{"component":"virt-handler","level":"info","msg":"Processing event default/vm-example","pos":"vm.go:860","timestamp":"2020-05-14T20:20:58.961270Z"}
{"component":"virt-handler","kind":"","level":"info","msg":"VMI is in phase: Scheduled\n","name":"vm-example","namespace":"default","pos":"vm.go:862","timestamp":"2020-05-14T20:20:58.961412Z","uid":"92478755-d503-4cd9-aa8f-42b2e99b7251"}
{"component":"virt-handler","level":"info","msg":"Domain does not exist","pos":"vm.go:869","timestamp":"2020-05-14T20:20:58.961480Z"}
{"component":"virt-handler","level":"error","msg":"failed to create nat rules for vm error: failed to load nftable ipv4-nat error ","pos":"podinterface.go:582","timestamp":"2020-05-14T20:20:59.075115Z"}
{"component":"virt-handler","level":"fatal","msg":"failed to prepare pod networking","pos":"podinterface.go:109","reason":"failed to load nftable ipv4-nat error ","timestamp":"2020-05-14T20:20:59.075215Z"}
panic: failed to load nftable ipv4-nat error 

goroutine 544 [running, locked to thread]:
kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/network.(*PodInterface).PlugPhase1(0x2c3e9b8, 0xc001d28c00, 0xc00155c5a0, 0xc001bd1aa0, 0x19b97a3, 0x4, 0x30d0fe, 0xc001614278, 0xc00155c5a0)
        /go/src/kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/network/podinterface.go:110 +0x903
kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/network.SetupNetworkInterfacesPhase1(0xc001d28c00, 0x30d0fe, 0xc001ae57a0, 0x21)
        /go/src/kubevirt.io/kubevirt/pkg/virt-launcher/virtwrap/network/network.go:111 +0x267
kubevirt.io/kubevirt/pkg/virt-handler.(*VirtualMachineController).setPodNetworkPhase1.func1(0x0, 0x0)
        /go/src/kubevirt.io/kubevirt/pkg/virt-handler/vm.go:277 +0x3a
kubevirt.io/kubevirt/pkg/virt-handler/isolation.(*realIsolationResult).DoNetNS.func1(0x1c82320, 0xc0016121e0, 0x0, 0x0)
        /go/src/kubevirt.io/kubevirt/pkg/virt-handler/isolation/isolation.go:254 +0x26
github.com/containernetworking/plugins/pkg/ns.(*netNS).Do.func1(0x1c82320, 0xc0016121e0, 0x0, 0x0)
        /go/src/kubevirt.io/kubevirt/vendor/github.com/containernetworking/plugins/pkg/ns/ns_linux.go:183 +0x2c1
github.com/containernetworking/plugins/pkg/ns.(*netNS).Do.func2(0xc000fa78b0, 0xc001bdf840, 0x1c82320, 0xc0016121e0, 0xc0016121f0)
        /go/src/kubevirt.io/kubevirt/vendor/github.com/containernetworking/plugins/pkg/ns/ns_linux.go:200 +0x76
created by github.com/containernetworking/plugins/pkg/ns.(*netNS).Do
        /go/src/kubevirt.io/kubevirt/vendor/github.com/containernetworking/plugins/pkg/ns/ns_linux.go:197 +0x224

Comment 16 Nelly Credi 2020-05-17 13:18:01 UTC
With the latest build, we see:
{"component":"virt-launcher","level":"info","msg":"Collected all requested hook sidecar sockets","pos":"manager.go:68","timestamp":"2020-05-16T01:53:29.347653Z"}
{"component":"virt-launcher","level":"info","msg":"Sorted all collected sidecar sockets per hook point based on their priority and name: map[]","pos":"manager.go:71","timestamp":"2020-05-16T01:53:29.347987Z"}
{"component":"virt-launcher","level":"info","msg":"Watchdog file created at /var/run/kubevirt/watchdog-files/kubevirt-test-default_testvmi7w54k772bfhd5wmp4xh2vxpg25bx8jsdw9r9jqct88x4bmlq","pos":"virt-launcher.go:155","timestamp":"2020-05-16T01:53:29.350645Z"}
{"component":"virt-launcher","level":"info","msg":"Connecting to libvirt daemon: qemu:///system","pos":"libvirt.go:355","timestamp":"2020-05-16T01:53:29.368118Z"}
{"component":"virt-launcher","level":"info","msg":"Connecting to libvirt daemon failed: virError(Code=38, Domain=7, Message='Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory')","pos":"libvirt.go:363","timestamp":"2020-05-16T01:53:29.373286Z"}
{"component":"virt-launcher","level":"info","msg":"libvirt version: 6.0.0, package: 17.module+el8.2.0+6257+0d066c28 (Red Hat, Inc. \u003chttp://bugzilla.redhat.com/bugzilla\u003e, 2020-04-08-16:41:56, )","subcomponent":"libvirt","thread":"43","timestamp":"2020-05-16T01:53:29.484000Z"}
{"component":"virt-launcher","level":"info","msg":"hostname: testvmi7w54k772bfhd5wmp4xh2vxpg25bx8jsdw9r9jqct88x4bmlq","subcomponent":"libvirt","thread":"43","timestamp":"2020-05-16T01:53:29.484000Z"}
{"component":"virt-launcher","level":"error","msg":"internal error: Child process (/usr/sbin/dmidecode -q -t 0,1,2,3,4,17) unexpected exit status 1: /dev/mem: No such file or directory","pos":"virCommandWait:2713","subcomponent":"libvirt","thread":"43","timestamp":"2020-05-16T01:53:29.484000Z"}
{"component":"virt-launcher","level":"info","msg":"Connected to libvirt daemon","pos":"libvirt.go:371","timestamp":"2020-05-16T01:53:29.885439Z"}
{"component":"virt-launcher","level":"info","msg":"Registered libvirt event notify callback","pos":"client.go:315","timestamp":"2020-05-16T01:53:29.976887Z"}
{"component":"virt-launcher","level":"info","msg":"Marked as ready","pos":"virt-launcher.go:73","timestamp":"2020-05-16T01:53:29.977635Z"}
panic: timed out waiting for domain to be defined

goroutine 1 [running]:
main.waitForDomainUUID(0x45d964b800, 0xc000254c60, 0xc0000aa480, 0x19ca940, 0xc0000e4870, 0xc000254c60)
	/go/src/kubevirt.io/kubevirt/cmd/virt-launcher/virt-launcher.go:238 +0x3de
main.main()
	/go/src/kubevirt.io/kubevirt/cmd/virt-launcher/virt-launcher.go:449 +0x115f
{"component":"virt-launcher","level":"error","msg":"dirty virt-launcher shutdown","pos":"virt-launcher.go:531","reason":"exit status 2","timestamp":"2020-05-16T01:58:30.015708Z"}

Comment 17 Alona Kaplan 2020-05-17 13:56:55 UTC
Nelly, can you please attach the virt-handler log?

Comment 18 Lukas Bednar 2020-05-17 19:14:57 UTC
Created attachment 1689452 [details]
virt-handler-qpb9t.log

Comment 19 Lukas Bednar 2020-05-17 19:16:27 UTC
Created attachment 1689453 [details]
virt-handler-n6k5j.log

Comment 20 Lukas Bednar 2020-05-17 19:17:38 UTC
Created attachment 1689454 [details]
virt-handler-h4knk.log

Comment 21 Lukas Bednar 2020-05-17 19:19:07 UTC
I uploaded logs from all virt-handlers.
FYI: all are plain text except virt-handler-n6k5j.log, which was too large, so I had to compress it with tar.gz.

Comment 22 Radim Hrazdil 2020-05-18 06:18:24 UTC
Hello Alona,
I can confirm that a VM with a bridge-binding NIC starts without issue.
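
For context, the two bindings differ only in the interface definition of the VMI spec. Below is a minimal sketch of the bridge variant that works here (the VMI name and container disk image are assumptions; replacing "bridge: {}" with "masquerade: {}" exercises the failing nftables/NAT path):

$ cat <<'EOF' | oc apply -f -
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  name: vmi-bridge-test
spec:
  domain:
    devices:
      disks:
      - name: containerdisk
        disk:
          bus: virtio
      interfaces:
      - name: default
        bridge: {}          # works here; masquerade: {} hits the failing path
    resources:
      requests:
        memory: 64Mi
  networks:
  - name: default
    pod: {}
  volumes:
  - name: containerdisk
    containerDisk:
      image: kubevirt/cirros-container-disk-demo
EOF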

Comment 27 Petr Horáček 2020-05-21 17:00:15 UTC
I deployed 2.4 via IPI a couple of minutes ago and my VM started successfully. I will leave it to QE to double-check, but it seems the issue may be gone.

Comment 28 Petr Horáček 2020-05-21 17:04:56 UTC
It started successfully, but a couple of minutes later it completely crashed. Attaching logs.

panic: timed out waiting for domain to be defined

goroutine 1 [running]:
main.waitForDomainUUID(0x45d964b800, 0xc0003fb5c0, 0xc00197f380, 0x19ca940, 0xc00056c140, 0xc0003fb5c0)
        /go/src/kubevirt.io/kubevirt/cmd/virt-launcher/virt-launcher.go:238 +0x3de
main.main()
        /go/src/kubevirt.io/kubevirt/cmd/virt-launcher/virt-launcher.go:449 +0x115f
{"component":"virt-launcher","level":"error","msg":"dirty virt-launcher shutdown","pos":"virt-launcher.go:531","reason":"exit status 2","timestamp":"2020-05-21T16:58:22.218450Z"}

Comment 29 Petr Horáček 2020-05-21 17:05:59 UTC
Created attachment 1690725 [details]
virt-handler.log

Comment 30 Petr Horáček 2020-05-21 17:06:24 UTC
Created attachment 1690726 [details]
virt-launcher.log

Comment 31 Petr Horáček 2020-05-21 17:09:11 UTC
Created attachment 1690727 [details]
virt-handler.prev.log

It seems to be the same issue as before. Maybe we don't have the latest U/S version yet.

Comment 35 sgott 2020-05-27 20:39:04 UTC
The root cause has been identified: packages were missing from the virt-handler container after the iptables/sysctl responsibility was shifted to virt-handler. The missing packages were:

iptables
procps-ng
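
A quick way to verify that a given virt-handler image actually ships these packages (a sketch; namespace, pod and container names are assumptions):

$ oc exec -n openshift-cnv <virt-handler-pod> -c virt-handler -- rpm -q iptables procps-ng
$ oc exec -n openshift-cnv <virt-handler-pod> -c virt-handler -- sh -c 'command -v iptables sysctl'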

Comment 38 errata-xmlrpc 2020-07-28 19:10:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3194

