Description of problem:
If a virt-handler's domain notify server receives a client error from any VMI on the host, virt-handler is no longer able to receive notifications from any VMI currently running.

Version-Release number of selected component (if applicable):
2.4

How reproducible:
100%

Steps to Reproduce:
1. Start two VMIs on a host:
   oc create -f vmi1.yaml
   oc create -f vmi2.yaml
2. Cycle libvirtd in one VMI's pod (which causes a domain Error event to be sent to virt-handler):
   k exec -it -c compute virt-launcher-vmi1-cnjm6 -- /usr/bin/killall /usr/sbin/libvirtd
3. Both VMIs remain running, but both lose the ability to migrate or to shut down gracefully.

Additional info:
*edited to update PR number*
There's a PR upstream to address this: https://github.com/kubevirt/kubevirt/pull/3830

Domain event errors can occur for many reasons. Sometimes it's timing related; sometimes the libvirt event connection is reset. When we get one of these errors from a client, it indicates that the local virt-handler domain informer needs to re-sync. The issue here is that during that re-sync we restart the gRPC server, which broke some assumptions about how we tunnel these connections between namespaces.

While it would be a nice optimization not to restart the gRPC domain notify server when a client sends an error, all of this logic needs to be resilient to reconnects regardless, so that's what the fix in PR 3830 focuses on. Eventually we need to replace this domain notify logic entirely with the work Roman has started here: https://github.com/kubevirt/kubevirt/pull/3311. That is too invasive a change to get into 2.4 at this point.
This has a significant impact on usability, and once this condition is triggered there's no workaround. I'm on the fence about whether to treat this as a data-loss risk: losing the ability to live-migrate means a VM must be shut down to be moved to a new node. I'm leaning toward treating this as a blocker-level bug.
The PR moved to https://github.com/kubevirt/kubevirt/pull/3830. Note that in that PR it was discovered that this requires both a client-side and a server-side fix. The client stops sending notifications if the libvirt connection is reset, which can happen for many reasons. Basically, if the libvirt connection resets, both the client and the server stop working. We thought the issue was isolated to the server, but a new targeted functional test proved otherwise. PR 3830 addresses both the server-side and client-side issues.
First tested VMI migration after killing the libvirtd process.

NOTE:
1) There is no killall in the downstream virt-launcher pod.
2) Trying to kill via pkill does not work either; as seen below, the libvirtd process ID stays the same. (pkill matches against the process name by default, not the full path, so "pkill libvirtd" or "pkill -f /usr/sbin/libvirtd" would have matched.)
3) So I decided to use "kill -9 <libvirtd PID>" to cycle the libvirtd daemon.

[kbidarka@kbidarka-host final]$ oc rsh -c compute virt-launcher-vm-rhel8-migration-9spz6
sh-4.4# ls -l /usr/bin/killall
ls: cannot access '/usr/bin/killall': No such file or directory
sh-4.4# ls -l /usr/bin/kill
-rwxr-xr-x. 1 root root 38280 Dec 17  2019 /usr/bin/kill
sh-4.4# ps -ef | grep libvirtd
root       17224      18  0 13:29 ?        00:00:00 /usr/sbin/libvirtd
root       25310   24528  0 13:50 pts/0    00:00:00 grep libvirtd
sh-4.4# /usr/bin/pkill /usr/sbin/libvirtd
sh-4.4# ps -ef | grep libvirtd
root       17224      18  0 13:29 ?        00:00:00 /usr/sbin/libvirtd
root       25419   24528  0 13:50 pts/0    00:00:00 grep libvirtd
sh-4.4# ps -ef | grep libvirtd
root          34      20  0 14:16 ?        00:00:00 /usr/sbin/libvirtd
root        1932    1409  0 14:20 pts/0    00:00:00 grep libvirtd
sh-4.4# /usr/bin/kill -9 34
sh-4.4# ps -ef | grep libvirtd
root        2388      20  1 14:21 ?        00:00:00 /usr/sbin/libvirtd
root        2433    1409  0 14:22 pts/0    00:00:00 grep libvirtd

Killing the libvirtd process ID resulted in the entries below in the virt-handler pod log on the node where the VMI was running.
{"component":"virt-handler","kind":"","level":"info","msg":"Received Domain Event of type ERROR","name":"","pos":"server.go:77","timestamp":"2020-07-20T14:22:57.866040Z","uid":""}
{"component":"virt-handler","kind":"","level":"error","msg":"Domain error event with message: Libvirt reconnect","name":"","pos":"server.go:86","timestamp":"2020-07-20T14:22:57.866301Z","uid":""}
{"component":"virt-handler","level":"info","msg":"stopping notify server","pos":"server.go:158","timestamp":"2020-07-20T14:22:57.866670Z"}
{"component":"virt-handler","kind":"","level":"info","msg":"gracefully closed notify pipe connection for vmi","name":"vm-rhel8-migration","namespace":"domain-notify-errors","pos":"vm.go:246","timestamp":"2020-07-20T14:22:57.867435Z","uid":"f9b37564-03d4-4e77-b009-90f19ca68692"}
{"component":"virt-handler","level":"error","msg":"error connecting to domain-notify.sock for proxy connection","pos":"vm.go:224","reason":"dial unix /var/run/kubevirt/domain-notify.sock: connect: no such file or directory","timestamp":"2020-07-20T14:22:57.868303Z"}
{"component":"virt-handler","kind":"","level":"info","msg":"Accepted new notify pipe connection for vmi","name":"vm-rhel8-migration","namespace":"domain-notify-errors","pos":"vm.go:229","timestamp":"2020-07-20T14:22:58.870724Z","uid":"f9b37564-03d4-4e77-b009-90f19ca68692"}
{"component":"virt-handler","level":"info","msg":"Processing event domain-notify-errors/vm-rhel8-migration","pos":"vm.go:1068","timestamp":"2020-07-20T14:22:58.879384Z"}
{"component":"virt-handler","kind":"","level":"info","msg":"VMI is in phase: Running\n","name":"vm-rhel8-migration","namespace":"domain-notify-errors","pos":"vm.go:1070","timestamp":"2020-07-20T14:22:58.879440Z","uid":"f9b37564-03d4-4e77-b009-90f19ca68692"}
{"component":"virt-handler","kind":"Domain","level":"info","msg":"Domain status: Running, reason: Unknown\n","name":"vm-rhel8-migration","namespace":"domain-notify-errors","pos":"vm.go:1075","timestamp":"2020-07-20T14:22:58.879462Z","uid":""}

More details in next comment.
As seen below, migration was successful after cycling the libvirtd daemon in the VMI pod.

[kbidarka@kbidarka-host final]$ oc get vmi
NAME                 AGE   PHASE       IP            NODENAME
vm-rhel8-migration   16s   Scheduled   10.129.2.45   kbid24c-z9j88-worker-crzzl

[kbidarka@kbidarka-host final]$ oc get vmi
NAME                 AGE   PHASE     IP            NODENAME
vm-rhel8-migration   21s   Running   10.129.2.45   kbid24c-z9j88-worker-crzzl

[kbidarka@kbidarka-host final]$ oc get pods
NAME                                     READY   STATUS    RESTARTS   AGE
virt-launcher-vm-rhel8-migration-brr87   1/1     Running   0          24s

[kbidarka@kbidarka-host final]$ virtctl console vm-rhel8-migration
Successfully connected to vm-rhel8-migration console. The escape sequence is ^]

Red Hat Enterprise Linux 8.2 (Ootpa)
Kernel 4.18.0-193.14.1.el8_2.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket

vm-rhel8-migration login: cloud-user
Password:
Last login: Mon Jul 20 10:15:08 on ttyS0
[cloud-user@vm-rhel8-migration ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)
[cloud-user@vm-rhel8-migration ~]$
[kbidarka@kbidarka-host final]$
----------------------------------
[kbidarka@kbidarka-host final]$ oc get pods
NAME                                     READY   STATUS    RESTARTS   AGE
virt-launcher-vm-rhel8-migration-brr87   1/1     Running   0          3m9s

[kbidarka@kbidarka-host final]$ oc rsh -c compute virt-launcher-vm-rhel8-migration-brr87
sh-4.4# ps -ef | grep libvirtd
root          34      20  0 14:16 ?        00:00:00 /usr/sbin/libvirtd
root        1854    1409  0 14:20 pts/0    00:00:00 grep libvirtd
sh-4.4# /usr/bin/kill -9 34
sh-4.4# ps -ef | grep libvirtd
root        2388      20  1 14:21 ?        00:00:00 /usr/sbin/libvirtd
root        2433    1409  0 14:22 pts/0    00:00:00 grep libvirtd
sh-4.4# virsh list
 Id   Name                                      State
---------------------------------------------------------
 1    domain-notify-errors_vm-rhel8-migration   running

sh-4.4# exit
exit
-------------------------------
[kbidarka@kbidarka-host final]$ oc get vmi -o wide
NAME                 AGE     PHASE     IP            NODENAME                     LIVE-MIGRATABLE
vm-rhel8-migration   9m25s   Running   10.129.2.45   kbid24c-z9j88-worker-crzzl   True

[kbidarka@kbidarka-host final]$ cat migration-job2-rhel8-dv.yaml
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstanceMigration
metadata:
  creationTimestamp: null
  name: job-rhel8-dv
spec:
  vmiName: vm-rhel8-migration
status: {}

[kbidarka@kbidarka-host final]$ oc apply -f migration-job2-rhel8-dv.yaml
virtualmachineinstancemigration.kubevirt.io/job-rhel8-dv created

[kbidarka@kbidarka-host final]$ oc get pods
NAME                                     READY   STATUS            RESTARTS   AGE
virt-launcher-vm-rhel8-migration-2p8dg   0/1     PodInitializing   0          12s
virt-launcher-vm-rhel8-migration-brr87   1/1     Running           0          10m

[kbidarka@kbidarka-host final]$ oc get pods
NAME                                     READY   STATUS    RESTARTS   AGE
virt-launcher-vm-rhel8-migration-2p8dg   0/1     Running   0          15s
virt-launcher-vm-rhel8-migration-brr87   1/1     Running   0          10m

[kbidarka@kbidarka-host final]$ oc get vmi -o wide
NAME                 AGE   PHASE     IP            NODENAME                     LIVE-MIGRATABLE
vm-rhel8-migration   10m   Running   10.129.2.45   kbid24c-z9j88-worker-ddfc8   True

[kbidarka@kbidarka-host final]$ virtctl console vm-rhel8-migration
Successfully connected to vm-rhel8-migration console. The escape sequence is ^]

[cloud-user@vm-rhel8-migration ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)
As seen below, shutdown was successful after cycling the libvirtd daemon in the VMI pod.

[kbidarka@kbidarka-host final]$ oc get vmi -o wide
NAME                 AGE   PHASE     IP            NODENAME                     LIVE-MIGRATABLE
vm-rhel8-migration   10m   Running   10.129.2.45   kbid24c-z9j88-worker-ddfc8   True

[kbidarka@kbidarka-host final]$ virtctl console vm-rhel8-migration
Successfully connected to vm-rhel8-migration console. The escape sequence is ^]

[cloud-user@vm-rhel8-migration ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)

[kbidarka@kbidarka-host final]$ oc get pods
NAME                                     READY   STATUS      RESTARTS   AGE
virt-launcher-vm-rhel8-migration-2p8dg   1/1     Running     0          8m57s
virt-launcher-vm-rhel8-migration-brr87   0/1     Completed   0          18m

[kbidarka@kbidarka-host final]$ oc rsh -c compute virt-launcher-vm-rhel8-migration-2p8dg
sh-4.4# virsh list
 Id   Name                                      State
---------------------------------------------------------
 1    domain-notify-errors_vm-rhel8-migration   running

sh-4.4# ps -ef | grep libvirtd
root          30      20  0 14:25 ?        00:00:00 /usr/sbin/libvirtd
root        3852    3649  0 14:35 pts/0    00:00:00 grep libvirtd
sh-4.4# kill -9 30
sh-4.4# ps -ef | grep libvirtd
root        3913      20  0 14:35 ?        00:00:00 /usr/sbin/libvirtd
root        4754    3649  0 14:37 pts/0    00:00:00 grep libvirtd
sh-4.4# kill -9 3913
sh-4.4# exit
exit

[kbidarka@kbidarka-host final]$ oc get vmi
NAME                 AGE   PHASE     IP            NODENAME
vm-rhel8-migration   22m   Running   10.129.2.45   kbid24c-z9j88-worker-ddfc8

[kbidarka@kbidarka-host final]$ virtctl stop vm-rhel8-migration
VM vm-rhel8-migration was scheduled to stop

[kbidarka@kbidarka-host final]$ oc get vmi
No resources found in domain-notify-errors namespace.

[kbidarka@kbidarka-host final]$ oc get vm
NAME                 AGE    VOLUME
vm-rhel8-migration   150m
Created attachment 1701778 [details]
Logs seen when we cycle the libvirtd daemon and shut down the VM
Created attachment 1701801 [details] with runstrategy halted
Created attachment 1701802 [details] with runstrategy manual
Created attachment 1701803 [details] with runstrategy always
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3194