Created attachment 1694803 [details]
vm_config

Description of problem:
This is seen on a bare-metal environment; not sure about PSI. I set up an environment with two VMs, and one of the two goes from Running into the Error state with the error below.

{"component":"virt-launcher","level":"info","msg":"Connected to libvirt daemon","pos":"libvirt.go:371","timestamp":"2020-06-03T10:11:30.236995Z"}
{"component":"virt-launcher","level":"info","msg":"Registered libvirt event notify callback","pos":"client.go:294","timestamp":"2020-06-03T10:11:30.243378Z"}
{"component":"virt-launcher","level":"info","msg":"Marked as ready","pos":"virt-launcher.go:73","timestamp":"2020-06-03T10:11:30.243966Z"}
panic: timed out waiting for domain to be defined

goroutine 1 [running]:
main.waitForDomainUUID(0x45d964b800, 0xc000804a20, 0xc001b36d80, 0x1a23600, 0xc000806460, 0xc000804a20)
	/go/src/kubevirt.io/kubevirt/cmd/virt-launcher/virt-launcher.go:226 +0x41a
main.main()
	/go/src/kubevirt.io/kubevirt/cmd/virt-launcher/virt-launcher.go:434 +0x1031
{"component":"virt-launcher","level":"error","msg":"dirty virt-launcher shutdown","pos":"virt-launcher.go:516","reason":"exit status 2","timestamp":"2020-06-03T10:16:30.250797Z"}

Version-Release number of selected component (if applicable):
$ oc get csv -n openshift-cnv | awk '{ print $4 }' | tail -n1
2.3.0

How reproducible:
Always

Steps to Reproduce:
1. Create a namespace: kubevirt-test-default.
2. Set up two VMs, one on each worker node, using the attached config files, and check the status of the pods.

[cnv-qe-jenkins@cnv-executor-bm-rhcos gkapoor]$ oc get pods -n kubevirt-test-default
NAME                           READY   STATUS    RESTARTS   AGE
virt-launcher-testvmi1-s6tz8   2/2     Running   0          63s
virt-launcher-testvmi2-nqfqm   2/2     Running   0          9s

3. Recheck the status of the pods after some time.

$ oc get pods -n kubevirt-test-default
NAME                           READY   STATUS    RESTARTS   AGE
virt-launcher-testvmi1-s6tz8   2/2     Running   0          11m
virt-launcher-testvmi2-nqfqm   0/2     Error     0          10m

Actual results:
Of the two VMs, one failed with the panic above.

Expected results:
All VMs should be in the Running state.

Additional info:
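For reference, a minimal VMI spec reconstructed from details mentioned later in this report (the API version and the disk/volume layout are assumptions; the attached vm_config is authoritative) illustrates the setup. Note the manually added kubevirt.io/nodeName label, which turns out to be central to this bug (see the root-cause analysis below):

apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  name: testvmi1
  namespace: kubevirt-test-default
  labels:
    # Manually added; only virt-controller should ever set this.
    kubevirt.io/nodeName: host-172-16-0-16
spec:
  domain:
    devices:
      disks:
      - name: containerdisk
        disk:
          bus: virtio
    resources:
      requests:
        cpu: 100m
        memory: 256M
  volumes:
  - name: containerdisk
    containerDisk:
      image: 'kubevirt/fedora-cloud-container-disk-demo:v0.26.5'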
@Geetika, could you update the logs here, in particular those from virt-handler?
Can you try this with 1GB of memory and see if you are able to reproduce the issue?
Attached is the log file for virt-handler. I have set up another test VM and will wait a while to see how it behaves.
Created attachment 1694814 [details]
virt-handler-logs
For the third VM as well, I could see it going from Running into the Error state after changing the memory from 256M to 1G.

{"component":"virt-launcher","level":"info","msg":"Registered libvirt event notify callback","pos":"client.go:294","timestamp":"2020-06-03T12:46:25.959351Z"}
{"component":"virt-launcher","level":"info","msg":"Marked as ready","pos":"virt-launcher.go:73","timestamp":"2020-06-03T12:46:25.959727Z"}
panic: timed out waiting for domain to be defined

goroutine 1 [running]:
main.waitForDomainUUID(0x45d964b800, 0xc001be22a0, 0xc00080a420, 0x1a23600, 0xc0004025f0, 0xc001be22a0)
	/go/src/kubevirt.io/kubevirt/cmd/virt-launcher/virt-launcher.go:226 +0x41a
main.main()
	/go/src/kubevirt.io/kubevirt/cmd/virt-launcher/virt-launcher.go:434 +0x1031
{"component":"virt-launcher","level":"error","msg":"dirty virt-launcher shutdown","pos":"virt-launcher.go:516","reason":"exit status 2","timestamp":"2020-06-03T12:51:25.965774Z"}

<changes>
spec:
  domain:
    resources:
      requests:
        cpu: 100m
        memory: 1G
</changes>
Changed the image from 'kubevirt/fedora-cloud-container-disk-demo:v0.26.5' to 'kubevirt/fedora-cloud-container-disk-demo:v0.29.0', but I am hitting the same issue again.
Created attachment 1695018 [details]
virt-handler-logs-kedar
Created attachment 1696268 [details]
Setup and VM information

Logs which show: the fetched VM spec file, the virt-operator version, and the VM entering the Error state after almost 3 hrs 22 min.
Created attachment 1696269 [details]
virt-handler logs
The logs from Comment #11 are large, so I've filtered out the log chatter. Here are the entries from when the VMI transitions to Failed:

{"component":"virt-handler","level":"info","msg":"Processing event default/vm-fedora-cdisk","pos":"vm.go:807","timestamp":"2020-06-08T17:23:44.880387Z"}
{"component":"virt-handler","kind":"","level":"info","msg":"VMI is in phase: Running\n","name":"vm-fedora-cdisk","namespace":"default","pos":"vm.go:809","timestamp":"2020-06-08T17:23:44.880498Z","uid":"3a5d632f-cc23-4df4-a733-a45eb8449fab"}
{"component":"virt-handler","kind":"Domain","level":"info","msg":"Domain status: Running, reason: Unknown\n","name":"vm-fedora-cdisk","namespace":"default","pos":"vm.go:814","timestamp":"2020-06-08T17:23:44.880541Z","uid":"3a5d632f-cc23-4df4-a733-a45eb8449fab"}
{"component":"virt-handler","level":"info","msg":"Processing event default/vm-fedora-cdisk","pos":"vm.go:807","timestamp":"2020-06-08T21:03:13.929438Z"}
{"component":"virt-handler","kind":"","level":"info","msg":"VMI is in phase: Running\n","name":"vm-fedora-cdisk","namespace":"default","pos":"vm.go:809","timestamp":"2020-06-08T21:03:13.929582Z","uid":"3a5d632f-cc23-4df4-a733-a45eb8449fab"}
{"component":"virt-handler","kind":"Domain","level":"info","msg":"Domain status: Running, reason: Unknown\n","name":"vm-fedora-cdisk","namespace":"default","pos":"vm.go:814","timestamp":"2020-06-08T21:03:13.929645Z","uid":"3a5d632f-cc23-4df4-a733-a45eb8449fab"}
{"component":"virt-handler","kind":"Domain","level":"info","msg":"Domain deleted","name":"vm-fedora-cdisk","namespace":"default","pos":"vm.go:1586","timestamp":"2020-06-08T21:03:36.297144Z","uid":""}
{"component":"virt-handler","level":"info","msg":"Processing event default/vm-fedora-cdisk","pos":"vm.go:807","timestamp":"2020-06-08T21:03:36.297272Z"}
{"component":"virt-handler","kind":"","level":"info","msg":"VMI is in phase: Running\n","name":"vm-fedora-cdisk","namespace":"default","pos":"vm.go:809","timestamp":"2020-06-08T21:03:36.297311Z","uid":"3a5d632f-cc23-4df4-a733-a45eb8449fab"}
{"component":"virt-handler","level":"info","msg":"Domain does not exist","pos":"vm.go:816","timestamp":"2020-06-08T21:03:36.297328Z"}
{"component":"virt-handler","level":"info","msg":"Processing event default/vm-fedora-cdisk","pos":"vm.go:807","timestamp":"2020-06-08T21:03:36.327816Z"}
{"component":"virt-handler","kind":"","level":"info","msg":"VMI is in phase: Failed\n","name":"vm-fedora-cdisk","namespace":"default","pos":"vm.go:809","timestamp":"2020-06-08T21:03:36.327925Z","uid":"3a5d632f-cc23-4df4-a733-a45eb8449fab"}
{"component":"virt-handler","level":"info","msg":"Domain does not exist","pos":"vm.go:816","timestamp":"2020-06-08T21:03:36.327984Z"}
{"component":"virt-handler","level":"info","msg":"Processing event default/vm-fedora-cdisk","pos":"vm.go:807","timestamp":"2020-06-08T21:03:36.336643Z"}
{"component":"virt-handler","kind":"","level":"info","msg":"VMI is in phase: Failed\n","name":"vm-fedora-cdisk","namespace":"default","pos":"vm.go:809","timestamp":"2020-06-08T21:03:36.336716Z","uid":"3a5d632f-cc23-4df4-a733-a45eb8449fab"}
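For reference, a filter along these lines would isolate the relevant entries (the virt-handler pod name here is hypothetical):

$ oc logs -n openshift-cnv virt-handler-xxxxx | grep 'vm-fedora-cdisk'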
I can see the VM error even when the VM uses a DV/PVC. I created a VM with a rhel7 DV; it ran for almost 4 hrs 22 mins, and then the pod entered the Error state.

NAME                           READY   STATUS    RESTARTS   AGE     IP           NODE               NOMINATED NODE   READINESS GATES
virt-launcher-vm-rhel7-bhwqt   1/1     Running   0          4h23m   10.129.3.0   host-172-16-0-20   <none>           <none>

NAME                           READY   STATUS    RESTARTS   AGE     IP           NODE               NOMINATED NODE   READINESS GATES
virt-launcher-vm-rhel7-bhwqt   0/1     Error     0          4h24m   10.129.3.0   host-172-16-0-20   <none>           <none>

NAME                           READY   STATUS    RESTARTS   AGE     IP           NODE               NOMINATED NODE   READINESS GATES
virt-launcher-vm-rhel7-bhwqt   0/1     Error     0          4h24m   10.129.3.0   host-172-16-0-20   <none>           <none>

I am finding it hard to locate the logs corresponding to this exact event; I will check and update here soon.
Created attachment 1696474 [details]
VM spec with DV
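For readers without access to the attachment, a DataVolume-backed VM spec is roughly of this shape (a sketch only; the names, storage size, and source URL are hypothetical, not the attached spec):

apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  name: vm-rhel7
spec:
  running: true
  dataVolumeTemplates:
  - metadata:
      name: rhel7-dv
    spec:
      pvc:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
      source:
        http:
          # Hypothetical image location.
          url: http://example.com/rhel7.qcow2
  template:
    spec:
      domain:
        devices:
          disks:
          - name: datavolumedisk
            disk:
              bus: virtio
        resources:
          requests:
            memory: 1G
      volumes:
      - name: datavolumedisk
        dataVolume:
          name: rhel7-dv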
I've followed the reproduction steps and I cannot reproduce the issue. I am using OCP 4.4 and CNV 2.3, with the VMI spec as attached in the ticket. We need more information about the reproduction.
The VMI pods have remained stable for the last 19 hours:

» oc get pods
NAME                           READY   STATUS    RESTARTS   AGE
virt-launcher-testvmi1-9hr29   2/2     Running   0          19h
virt-launcher-testvmi2-rnv8d   2/2     Running   0          19h

@Geetika, can you please provide us with information about the nodes (oc describe node)?
Worked with Igor. We are seeing, and are able to reproduce, both of the issues reported above:
1. The VM going into the Error state after some time.
2. The VM hitting the panic and then going to Error within <=5 minutes.
I've found the root cause of the issue. The fix is to reject the creation of objects that already carry "kubevirt.io" labels, as is the case in this bug. The VMI YAML file contains a manually added "kubevirt.io/nodeName: host-172-16-0-16" label, and that causes chaos :)

This label should only ever be added automatically by virt-controller and then acted on by virt-handler: it is how virt-handler knows that a VMI has been scheduled onto the node it is responsible for. Creating the label manually triggers virt-handler before the pod has even been scheduled.

This scenario revealed two panics:

1. A panic in virt-controller. It happens when virt-handler kicks in right after object creation and moves the VMI phase to Scheduled. This causes a race condition in the VMI phase transition, so the VMI pod is never created and the VMI enters the Failed state because of the virt-handler watchdog timeout. When trying to delete such a VMI, we got a panic in virt-controller's pod disruption budget controller code. The PR is simple; I will open it as soon as GitHub is back up.

2. A panic in virt-launcher. The phase race condition sometimes does manage to create a virt-launcher pod, but the virt-handler residing on the host where that pod was created ignores it, because the node name in the label points to a different host. As a result, virt-handler never initializes the container disk and networking for the virt-launcher pod's VM, libvirt fails to create the domain, and virt-launcher panics on the timeout while retrieving the domain's UUID from libvirtd.

The fix for 2 is to prevent manual addition of kubevirt.io labels on any KubeVirt-related objects.
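To make the failure mode concrete, the offending part of the VMI spec is just this:

metadata:
  labels:
    kubevirt.io/nodeName: host-172-16-0-16

Conceptually, the fix is a validation step at VMI creation time. A minimal sketch in Go of what such a check could look like (an illustration only, not the actual PR code; the real admission webhook may reject a specific list of reserved labels rather than the whole prefix):

package main

import (
	"fmt"
	"strings"
)

// reservedPrefix is the label namespace that only KubeVirt components
// themselves are allowed to set on a VMI.
const reservedPrefix = "kubevirt.io"

// validateNoReservedLabels rejects a VMI whose user-supplied labels fall
// under the reserved kubevirt.io namespace, preventing scenarios like a
// manually set kubevirt.io/nodeName triggering virt-handler prematurely.
func validateNoReservedLabels(labels map[string]string) error {
	for key := range labels {
		if key == reservedPrefix || strings.HasPrefix(key, reservedPrefix+"/") {
			return fmt.Errorf("creation of the reserved label %q on a VMI object is prohibited", key)
		}
	}
	return nil
}

func main() {
	// Simulate the label from this bug's VMI spec.
	labels := map[string]string{"kubevirt.io/nodeName": "host-172-16-0-16"}
	if err := validateNoReservedLabels(labels); err != nil {
		fmt.Println("admission denied:", err)
	}
}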
Created PR https://github.com/kubevirt/kubevirt/pull/3738
Verified with build: virt-operator-container-v2.4.0-54

% virtctl version
Server Version: version.Info{GitVersion:"v0.30.4", GitCommit:"7f69b0f26cbf310f6e089ed6e489551de445ee47", GitTreeState:"clean", BuildDate:"2020-07-13T18:38:32Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}

$ oc version
Client Version: 4.4.0-0.nightly-2020-02-17-022408
Server Version: 4.5.1
Kubernetes Version: v1.18.3+8b0a82f

Steps:
1. Create a VM from the spec attached in the description.

$ oc create -f vm.yaml
The request is invalid: metadata.labels: creation of the following reserved kubevirt.io/ labels on a VMI object is prohibited

2. Also tested with node affinity:

....
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - zpeng-ocp45-kkttp-worker-kwwc5
....

Created two VMs on different nodes and left them running for a long time:

$ oc get pods
NAME                             READY   STATUS    RESTARTS   AGE
virt-launcher-vm-cirros-crzs2    2/2     Running   0          128m
virt-launcher-vm-cirros2-2qbvr   2/2     Running   0          114m

All VMs worked; also tested the VM lifecycle (restart & stop).

Moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3194