Bug 1843456 - virt-launcher goes from running to error state due to panic: timed out waiting for domain to be defined
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 2.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 2.4.0
Assignee: Igor Bezukh
QA Contact: vsibirsk
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-03 10:52 UTC by Geetika Kapoor
Modified: 2020-07-28 19:10 UTC
CC: 7 users

Fixed In Version: hco-bundle-registry-container-v2.3.0-461 virt-operator-container-v2.4.0-53
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-28 19:10:09 UTC
Target Upstream Version:
Embargoed:


Attachments
vm_config (1.53 KB, text/plain), 2020-06-03 10:52 UTC, Geetika Kapoor
virt-handler-logs (384.33 KB, application/gzip), 2020-06-03 12:50 UTC, Geetika Kapoor
virt-handler-logs-kedar (2.58 MB, application/gzip), 2020-06-04 10:43 UTC, Geetika Kapoor
Setup and VM Information (8.17 KB, text/plain), 2020-06-09 10:46 UTC, Kedar Bidarkar
Attached virt-handler logs (506.00 KB, application/gzip), 2020-06-09 10:47 UTC, Kedar Bidarkar
VM spec with DV (2.54 KB, text/plain), 2020-06-10 11:37 UTC, Kedar Bidarkar


Links
Red Hat Product Errata RHSA-2020:3194, last updated 2020-07-28 19:10:26 UTC

Description Geetika Kapoor 2020-06-03 10:52:02 UTC
Created attachment 1694803 [details]
vm_config

Description of problem:

This is seen on a bare-metal environment; not sure about PSI.

I have set up an environment with two VMs, and one of the two goes from Running into the Error state with the error below.

{"component":"virt-launcher","level":"info","msg":"Connected to libvirt daemon","pos":"libvirt.go:371","timestamp":"2020-06-03T10:11:30.236995Z"}
{"component":"virt-launcher","level":"info","msg":"Registered libvirt event notify callback","pos":"client.go:294","timestamp":"2020-06-03T10:11:30.243378Z"}
{"component":"virt-launcher","level":"info","msg":"Marked as ready","pos":"virt-launcher.go:73","timestamp":"2020-06-03T10:11:30.243966Z"}
panic: timed out waiting for domain to be defined

goroutine 1 [running]:
main.waitForDomainUUID(0x45d964b800, 0xc000804a20, 0xc001b36d80, 0x1a23600, 0xc000806460, 0xc000804a20)
	/go/src/kubevirt.io/kubevirt/cmd/virt-launcher/virt-launcher.go:226 +0x41a
main.main()
	/go/src/kubevirt.io/kubevirt/cmd/virt-launcher/virt-launcher.go:434 +0x1031
{"component":"virt-launcher","level":"error","msg":"dirty virt-launcher shutdown","pos":"virt-launcher.go:516","reason":"exit status 2","timestamp":"2020-06-03T10:16:30.250797Z"}
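(Note: the timeout argument in the stack trace, 0x45d964b800 ns, is exactly 300 s, which matches the five minutes between "Marked as ready" at 10:11:30 and the panic at 10:16:30. Below is a minimal illustrative Go sketch, using hypothetical names rather than KubeVirt's actual virt-launcher code, of how a bounded wait like waitForDomainUUID ends in exactly this panic when the domain is never defined.)

// Illustrative sketch only, not KubeVirt's actual virt-launcher code.
package main

import "time"

// waitForDomainDefined blocks until a domain UUID arrives on events, or panics
// once the timeout elapses (the fatal path seen in this bug).
func waitForDomainDefined(timeout time.Duration, events <-chan string) string {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case uuid := <-events:
		return uuid
	case <-timer.C:
		panic("timed out waiting for domain to be defined")
	}
}

func main() {
	// The channel is never written to, simulating a domain that is never defined.
	events := make(chan string)
	// virt-launcher's timeout is 300 s; a short value is used here so the
	// sketch panics quickly when run.
	waitForDomainDefined(3*time.Second, events)
}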


Version-Release number of selected component (if applicable):

$ oc get csv -n openshift-cnv | awk ' { print $4 } ' | tail -n1
2.3.0


How reproducible:

always

Steps to Reproduce:
1. Create a namespace: kubevirt-test-default.
2. Set up 2 VMs on the worker nodes using the attached config files.
Check the status of the pods:
[cnv-qe-jenkins@cnv-executor-bm-rhcos gkapoor]$ oc get pods -n kubevirt-test-default
NAME                           READY   STATUS    RESTARTS   AGE
virt-launcher-testvmi1-s6tz8   2/2     Running   0          63s
virt-launcher-testvmi2-nqfqm   2/2     Running   0          9s

3. Check the status of the pods again after some time.
$ oc get pods -n kubevirt-test-default
NAME                           READY   STATUS    RESTARTS   AGE
virt-launcher-testvmi1-s6tz8   2/2     Running   0          11m
virt-launcher-testvmi2-nqfqm   0/2     Error     0          10m

Actual results:

Out of the 2 VMs, one failed with the panic error.

Expected results:

All VMs should be in the Running state.

Additional info:

Comment 1 Kedar Bidarkar 2020-06-03 12:17:36 UTC
@Geetika, could you upload the logs here, in particular the virt-handler logs?

Comment 2 Kedar Bidarkar 2020-06-03 12:20:07 UTC
Can you try this with 1 GB of memory and see if you are able to reproduce the issue?

Comment 3 Geetika Kapoor 2020-06-03 12:49:01 UTC
Attached is the log file for virt-handler.
I have set up another test VM and will wait a while to see how it behaves.

Comment 4 Geetika Kapoor 2020-06-03 12:50:28 UTC
Created attachment 1694814 [details]
virt-handler-logs

Comment 5 Geetika Kapoor 2020-06-03 12:53:07 UTC
The third VM also went from Running into the Error state, even after changing the memory request from 256M to 1G.

{"component":"virt-launcher","level":"info","msg":"Registered libvirt event notify callback","pos":"client.go:294","timestamp":"2020-06-03T12:46:25.959351Z"}
{"component":"virt-launcher","level":"info","msg":"Marked as ready","pos":"virt-launcher.go:73","timestamp":"2020-06-03T12:46:25.959727Z"}
panic: timed out waiting for domain to be defined

goroutine 1 [running]:
main.waitForDomainUUID(0x45d964b800, 0xc001be22a0, 0xc00080a420, 0x1a23600, 0xc0004025f0, 0xc001be22a0)
	/go/src/kubevirt.io/kubevirt/cmd/virt-launcher/virt-launcher.go:226 +0x41a
main.main()
	/go/src/kubevirt.io/kubevirt/cmd/virt-launcher/virt-launcher.go:434 +0x1031
{"component":"virt-launcher","level":"error","msg":"dirty virt-launcher shutdown","pos":"virt-launcher.go:516","reason":"exit status 2","timestamp":"2020-06-03T12:51:25.965774Z"}


<changes>

spec:
  domain:
    resources:
      requests:
        cpu: 100m
        memory: 1G

</changes>

Comment 6 Geetika Kapoor 2020-06-03 13:32:41 UTC
Changed the image from 'kubevirt/fedora-cloud-container-disk-demo:v0.26.5' to 'kubevirt/fedora-cloud-container-disk-demo:v0.29.0', but I am hitting the same issue again.

Comment 9 Geetika Kapoor 2020-06-04 10:43:14 UTC
Created attachment 1695018 [details]
virt-handler-logs-kedar

Comment 10 Kedar Bidarkar 2020-06-09 10:46:48 UTC
Created attachment 1696268 [details]
Setup and VM Information

The attached logs show the fetched VM spec file, the virt-operator version, and the VM entering the Error state after almost 3 hours 22 minutes.

Comment 11 Kedar Bidarkar 2020-06-09 10:47:29 UTC
Created attachment 1696269 [details]
Attached virt-handler logs

Comment 12 sgott 2020-06-09 12:58:47 UTC
The logs from Comment #11 are large, so I've filtered out the chatter. Here are the log lines from when the VMI transitions to Failed.


{"component":"virt-handler","level":"info","msg":"Processing event default/vm-fedora-cdisk","pos":"vm.go:807","timestamp":"2020-06-08T17:23:44.880387Z"}
{"component":"virt-handler","kind":"","level":"info","msg":"VMI is in phase: Running\n","name":"vm-fedora-cdisk","namespace":"default","pos":"vm.go:809","timestamp":"2020-06-08T17:23:44.880498Z","uid":"3a5d632f-cc23-4df4-a733-a45eb8449fab"}
{"component":"virt-handler","kind":"Domain","level":"info","msg":"Domain status: Running, reason: Unknown\n","name":"vm-fedora-cdisk","namespace":"default","pos":"vm.go:814","timestamp":"2020-06-08T17:23:44.880541Z","uid":"3a5d632f-cc23-4df4-a733-a45eb8449fab"}
{"component":"virt-handler","level":"info","msg":"Processing event default/vm-fedora-cdisk","pos":"vm.go:807","timestamp":"2020-06-08T21:03:13.929438Z"}
{"component":"virt-handler","kind":"","level":"info","msg":"VMI is in phase: Running\n","name":"vm-fedora-cdisk","namespace":"default","pos":"vm.go:809","timestamp":"2020-06-08T21:03:13.929582Z","uid":"3a5d632f-cc23-4df4-a733-a45eb8449fab"}
{"component":"virt-handler","kind":"Domain","level":"info","msg":"Domain status: Running, reason: Unknown\n","name":"vm-fedora-cdisk","namespace":"default","pos":"vm.go:814","timestamp":"2020-06-08T21:03:13.929645Z","uid":"3a5d632f-cc23-4df4-a733-a45eb8449fab"}
{"component":"virt-handler","kind":"Domain","level":"info","msg":"Domain deleted","name":"vm-fedora-cdisk","namespace":"default","pos":"vm.go:1586","timestamp":"2020-06-08T21:03:36.297144Z","uid":""}
{"component":"virt-handler","level":"info","msg":"Processing event default/vm-fedora-cdisk","pos":"vm.go:807","timestamp":"2020-06-08T21:03:36.297272Z"}
{"component":"virt-handler","kind":"","level":"info","msg":"VMI is in phase: Running\n","name":"vm-fedora-cdisk","namespace":"default","pos":"vm.go:809","timestamp":"2020-06-08T21:03:36.297311Z","uid":"3a5d632f-cc23-4df4-a733-a45eb8449fab"}
{"component":"virt-handler","level":"info","msg":"Domain does not exist","pos":"vm.go:816","timestamp":"2020-06-08T21:03:36.297328Z"}
{"component":"virt-handler","level":"info","msg":"Processing event default/vm-fedora-cdisk","pos":"vm.go:807","timestamp":"2020-06-08T21:03:36.327816Z"}
{"component":"virt-handler","kind":"","level":"info","msg":"VMI is in phase: Failed\n","name":"vm-fedora-cdisk","namespace":"default","pos":"vm.go:809","timestamp":"2020-06-08T21:03:36.327925Z","uid":"3a5d632f-cc23-4df4-a733-a45eb8449fab"}
{"component":"virt-handler","level":"info","msg":"Domain does not exist","pos":"vm.go:816","timestamp":"2020-06-08T21:03:36.327984Z"}
{"component":"virt-handler","level":"info","msg":"Processing event default/vm-fedora-cdisk","pos":"vm.go:807","timestamp":"2020-06-08T21:03:36.336643Z"}
{"component":"virt-handler","kind":"","level":"info","msg":"VMI is in phase: Failed\n","name":"vm-fedora-cdisk","namespace":"default","pos":"vm.go:809","timestamp":"2020-06-08T21:03:36.336716Z","uid":"3a5d632f-cc23-4df4-a733-a45eb8449fab"}

Comment 14 Kedar Bidarkar 2020-06-10 11:36:24 UTC
I can see the VM error even when the VM uses a DV/PVC.

I had created a VM with a RHEL 7 DV; it ran for almost 4 hours 22 minutes and then the pod entered the "Error" state.

NAME                           READY   STATUS    RESTARTS   AGE     IP           NODE               NOMINATED NODE   READINESS GATES
virt-launcher-vm-rhel7-bhwqt   1/1     Running   0          4h23m   10.129.3.0   host-172-16-0-20   <none>           <none>
NAME                           READY   STATUS   RESTARTS   AGE     IP           NODE               NOMINATED NODE   READINESS GATES
virt-launcher-vm-rhel7-bhwqt   0/1     Error    0          4h24m   10.129.3.0   host-172-16-0-20   <none>           <none>
NAME                           READY   STATUS   RESTARTS   AGE     IP           NODE               NOMINATED NODE   READINESS GATES
virt-launcher-vm-rhel7-bhwqt   0/1     Error    0          4h24m   10.129.3.0   host-172-16-0-20   <none>           <none>

I am finding it hard to locate the logs corresponding to this exact event; I will check and update here soon.

Comment 15 Kedar Bidarkar 2020-06-10 11:37:12 UTC
Created attachment 1696474 [details]
VM spec with DV

Comment 16 Igor Bezukh 2020-06-16 07:39:38 UTC
I've followed the reproduction steps and cannot reproduce the issue. I am using OCP 4.4 and CNV 2.3 with the VMI spec attached to the ticket.
I need more information about the reproduction.

Comment 17 Igor Bezukh 2020-06-16 07:43:47 UTC
The VMI pods have remained stable for the last 19 hours:

» oc get pods
NAME                           READY     STATUS    RESTARTS   AGE
virt-launcher-testvmi1-9hr29   2/2       Running   0          19h
virt-launcher-testvmi2-rnv8d   2/2       Running   0          19h


@Geetika can you please provide us with information about the nodes (oc describe node)?

Comment 18 Geetika Kapoor 2020-06-24 13:15:15 UTC
Worked with Igor. We are able to see and reproduce both of the issues reported above:

1. The VM goes into the Error state after some time.
2. The VM panics and then goes into the Error state within <= 5 minutes.

Comment 21 Igor Bezukh 2020-06-29 09:41:16 UTC
I've found the root cause of the issue. The solution is that we should reject creation of objects that already carry "kubevirt.io" labels, as is the case in this bug.
The VMI YAML file contains a manually added "kubevirt.io/nodeName: host-172-16-0-16" label, and that causes chaos :)
This label should be added only automatically by virt-controller and then acted on by virt-handler; it is how virt-handler knows that a VMI is scheduled on the node it is responsible for. Adding the label manually triggers virt-handler before the pod has even been scheduled. This scenario revealed two panics:
1. A panic in virt-controller. It happens when virt-handler kicks in right after object creation and moves the VMI phase to Scheduled. This causes a race condition in the VMI phase transition, so the VMI pod is never created and the VMI enters the Failed state because of the virt-handler watchdog timeout. When trying to delete such a VMI, we hit a panic in the virt-controller
   pod disruption budget controller code. The PR is simple; I will post it as soon as GitHub is back up.
2. A panic in virt-launcher. Sometimes the phase race condition does manage to create a virt-launcher pod, but the virt-handler on the host where the virt-launcher pod
   was created ignores it, because the node name in the label points to a different host. As a result, virt-handler never initializes the container disk
   and networking for the virt-launcher pod's VM, libvirt fails to create the domain, and virt-launcher panics on the timeout while waiting to retrieve the domain UUID from libvirtd.
The solution for 2 is to prevent manual addition of kubevirt.io labels on any KubeVirt-related objects.
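
For illustration, here is a minimal sketch of the kind of admission-time check described above: rejecting a VMI whose metadata carries reserved kubevirt.io/ labels. The function and variable names are hypothetical and this is not the code from the PR; the actual check lives in KubeVirt's validation webhooks.

// Hypothetical sketch of rejecting reserved kubevirt.io/ labels at creation time.
// Not the actual PR change; names are illustrative.
package main

import (
	"fmt"
	"strings"
)

// reservedLabelPrefixes marks label keys that only KubeVirt components may set.
var reservedLabelPrefixes = []string{"kubevirt.io/"}

// validateVMILabels returns an error listing any user-supplied labels that use
// a reserved prefix, so the create request can be rejected.
func validateVMILabels(labels map[string]string) error {
	var offending []string
	for key := range labels {
		for _, prefix := range reservedLabelPrefixes {
			if strings.HasPrefix(key, prefix) {
				offending = append(offending, key)
			}
		}
	}
	if len(offending) > 0 {
		return fmt.Errorf("metadata.labels: creation of the following reserved kubevirt.io/ labels on a VMI object is prohibited: %s",
			strings.Join(offending, ", "))
	}
	return nil
}

func main() {
	// The manually added label from this bug would be rejected up front.
	labels := map[string]string{
		"kubevirt.io/nodeName": "host-172-16-0-16",
		"app":                  "test",
	}
	if err := validateVMILabels(labels); err != nil {
		fmt.Println("rejected:", err)
	}
}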

Comment 22 Igor Bezukh 2020-07-02 16:32:50 UTC
Created PR https://github.com/kubevirt/kubevirt/pull/3738

Comment 23 zhe peng 2020-07-20 12:06:58 UTC
Verified with build:

virt-operator-container-v2.4.0-54
% virtctl version
Server Version: version.Info{GitVersion:"v0.30.4", GitCommit:"7f69b0f26cbf310f6e089ed6e489551de445ee47", GitTreeState:"clean", BuildDate:"2020-07-13T18:38:32Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
$ oc version
Client Version: 4.4.0-0.nightly-2020-02-17-022408
Server Version: 4.5.1
Kubernetes Version: v1.18.3+8b0a82f

Steps:
1. Create the VM with the spec attached in the description:
$ oc create -f vm.yaml
The request is invalid: metadata.labels: creation of the following reserved kubevirt.io/ labels on a VMI object is prohibited

Also tested with node affinity:
....
spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - zpeng-ocp45-kkttp-worker-kwwc5
....
Created two VMs on different nodes and left them running for a long time:
$ oc get pods
NAME                             READY   STATUS    RESTARTS   AGE
virt-launcher-vm-cirros-crzs2    2/2     Running   0          128m
virt-launcher-vm-cirros2-2qbvr   2/2     Running   0          114m

All VMs worked; also tested the VM lifecycle (restart and stop).
Moving to VERIFIED.

Comment 26 errata-xmlrpc 2020-07-28 19:10:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3194

