Bug 1856979
| Summary: | Domain notify errors break VMI migrations and graceful shutdown | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | David Vossel <dvossel> |
| Component: | Virtualization | Assignee: | sgott |
| Status: | CLOSED ERRATA | QA Contact: | Kedar Bidarkar <kbidarka> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 2.4.0 | CC: | cnv-qe-bugs, fdeutsch, ipinto |
| Target Milestone: | --- | | |
| Target Release: | 2.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | virt-operator-container-v2.4.0-55 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-07-28 19:10:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1859661 | | |
| Attachments: | | | |
Description
David Vossel
2020-07-14 20:23:17 UTC
This has a significant impact on usability. Once this condition is triggered, there is no workaround. I'm on the fence when thinking in terms of data-loss risk, but losing the ability to do a live migration means a VM must be shut down to be moved to a new node, so I'm leaning toward treating this as a blocker-level bug.

The PR moved to https://github.com/kubevirt/kubevirt/pull/3830. Note that in that PR it was discovered that this requires both a client-side and a server-side fix. The client stops sending notifications if the libvirt connection is reset, which can happen for many reasons. In short, if the libvirt connection resets, both the client and the server stop working. We thought the problem was isolated to the server, but a new targeted functional test proved otherwise. PR 3830 addresses both the server-side and client-side issues.

First tested the VMI migration after killing the libvirtd process.
NOTE:
1) There is no killall binary in the downstream virt-launcher pod.
2) Trying to kill libvirtd via pkill also did not work; as seen below, the libvirtd process ID remains the same.
3) So we decided to use "kill -9 <libvirtd process id>" to cycle the libvirtd daemon (a consolidated sketch of these steps follows this note).
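For reference, a minimal non-interactive sketch of the cycling step above, run with oc exec instead of an interactive oc rsh session. The pod name is the one from the transcript below, and pgrep is assumed to be present in the compute container (it ships alongside the pkill seen above, which likely failed only because it was given the full path rather than the short process name):

POD=virt-launcher-vm-rhel8-migration-9spz6                        # the virt-launcher pod of the VMI under test
LIBVIRTD_PID=$(oc exec -c compute "$POD" -- pgrep -x libvirtd)    # pgrep matches the bare process name "libvirtd"
oc exec -c compute "$POD" -- kill -9 "$LIBVIRTD_PID"              # cycle the daemon
oc exec -c compute "$POD" -- pgrep -x libvirtd                    # confirm libvirtd is back with a new PID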
[kbidarka@kbidarka-host final]$ oc rsh -c compute virt-launcher-vm-rhel8-migration-9spz6
sh-4.4# ls -l /usr/bin/killall
ls: cannot access '/usr/bin/killall': No such file or directory
sh-4.4# ls -l /usr/bin/kill
-rwxr-xr-x. 1 root root 38280 Dec 17 2019 /usr/bin/kill
sh-4.4# ps -ef | grep libvirtd
root 17224 18 0 13:29 ? 00:00:00 /usr/sbin/libvirtd
root 25310 24528 0 13:50 pts/0 00:00:00 grep libvirtd
sh-4.4# /usr/bin/pkill /usr/sbin/libvirtd
sh-4.4# ps -ef | grep libvirtd
root 17224 18 0 13:29 ? 00:00:00 /usr/sbin/libvirtd
root 25419 24528 0 13:50 pts/0 00:00:00 grep libvirtd
sh-4.4# ps -ef | grep libvirtd
root 34 20 0 14:16 ? 00:00:00 /usr/sbin/libvirtd
root 1932 1409 0 14:20 pts/0 00:00:00 grep libvirtd
sh-4.4# /usr/bin/kill -9 34
sh-4.4# ps -ef | grep libvirtd
root 2388 20 1 14:21 ? 00:00:00 /usr/sbin/libvirtd
root 2433 1409 0 14:22 pts/0 00:00:00 grep libvirtd
----
Killing the libvirtd process resulted in the entries below in the log of the virt-handler pod running on the node where the VMI was running (a sketch for locating that log follows the excerpt).
{"component":"virt-handler","kind":"","level":"info","msg":"Received Domain Event of type ERROR","name":"","pos":"server.go:77","timestamp":"2020-07-20T14:22:57.866040Z","uid":""}
{"component":"virt-handler","kind":"","level":"error","msg":"Domain error event with message: Libvirt reconnect","name":"","pos":"server.go:86","timestamp":"2020-07-20T14:22:57.866301Z","uid":""}
{"component":"virt-handler","level":"info","msg":"stopping notify server","pos":"server.go:158","timestamp":"2020-07-20T14:22:57.866670Z"}
{"component":"virt-handler","kind":"","level":"info","msg":"gracefully closed notify pipe connection for vmi","name":"vm-rhel8-migration","namespace":"domain-notify-errors","pos":"vm.go:246","timestamp":"2020-07-20T14:22:57.867435Z","uid":"f9b37564-03d4-4e77-b009-90f19ca68692"}
{"component":"virt-handler","level":"error","msg":"error connecting to domain-notify.sock for proxy connection","pos":"vm.go:224","reason":"dial unix /var/run/kubevirt/domain-notify.sock: connect: no such file or directory","timestamp":"2020-07-20T14:22:57.868303Z"}
{"component":"virt-handler","kind":"","level":"info","msg":"Accepted new notify pipe connection for vmi","name":"vm-rhel8-migration","namespace":"domain-notify-errors","pos":"vm.go:229","timestamp":"2020-07-20T14:22:58.870724Z","uid":"f9b37564-03d4-4e77-b009-90f19ca68692"}
{"component":"virt-handler","level":"info","msg":"Processing event domain-notify-errors/vm-rhel8-migration","pos":"vm.go:1068","timestamp":"2020-07-20T14:22:58.879384Z"}
{"component":"virt-handler","kind":"","level":"info","msg":"VMI is in phase: Running\n","name":"vm-rhel8-migration","namespace":"domain-notify-errors","pos":"vm.go:1070","timestamp":"2020-07-20T14:22:58.879440Z","uid":"f9b37564-03d4-4e77-b009-90f19ca68692"}
{"component":"virt-handler","kind":"Domain","level":"info","msg":"Domain status: Running, reason: Unknown\n","name":"vm-rhel8-migration","namespace":"domain-notify-errors","pos":"vm.go:1075","timestamp":"2020-07-20T14:22:58.879462Z","uid":""}
More details in next comment.
As seen below, migration was successful after cycling the libvirtd daemon in the VMI pod.
[kbidarka@kbidarka-host final]$ oc get vmi
NAME AGE PHASE IP NODENAME
vm-rhel8-migration 16s Scheduled 10.129.2.45 kbid24c-z9j88-worker-crzzl
[kbidarka@kbidarka-host final]$ oc get vmi
NAME AGE PHASE IP NODENAME
vm-rhel8-migration 21s Running 10.129.2.45 kbid24c-z9j88-worker-crzzl
[kbidarka@kbidarka-host final]$ oc get pods
NAME READY STATUS RESTARTS AGE
virt-launcher-vm-rhel8-migration-brr87 1/1 Running 0 24s
[kbidarka@kbidarka-host final]$ virtctl console vm-rhel8-migration
Successfully connected to vm-rhel8-migration console. The escape sequence is ^]
Red Hat Enterprise Linux 8.2 (Ootpa)
Kernel 4.18.0-193.14.1.el8_2.x86_64 on an x86_64
Activate the web console with: systemctl enable --now cockpit.socket
vm-rhel8-migration login: cloud-user
Password:
Last login: Mon Jul 20 10:15:08 on ttyS0
[cloud-user@vm-rhel8-migration ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)
[cloud-user@vm-rhel8-migration ~]$ [kbidarka@kbidarka-host final]$
----------------------------------
[kbidarka@kbidarka-host final]$ oc get pods
NAME READY STATUS RESTARTS AGE
virt-launcher-vm-rhel8-migration-brr87 1/1 Running 0 3m9s
[kbidarka@kbidarka-host final]$ oc rsh -c compute virt-launcher-vm-rhel8-migration-brr87
sh-4.4# ps -ef | grep libvirtd
root 34 20 0 14:16 ? 00:00:00 /usr/sbin/libvirtd
root 1854 1409 0 14:20 pts/0 00:00:00 grep libvirtd
sh-4.4# /usr/bin/kill -9 34
sh-4.4# ps -ef | grep libvirtd
root 2388 20 1 14:21 ? 00:00:00 /usr/sbin/libvirtd
root 2433 1409 0 14:22 pts/0 00:00:00 grep libvirtd
sh-4.4# virsh list
Id Name State
---------------------------------------------------------
1 domain-notify-errors_vm-rhel8-migration running
sh-4.4# exit
exit
-------------------------------
[kbidarka@kbidarka-host final]$ oc get vmi -o wide
NAME AGE PHASE IP NODENAME LIVE-MIGRATABLE
vm-rhel8-migration 9m25s Running 10.129.2.45 kbid24c-z9j88-worker-crzzl True
[kbidarka@kbidarka-host final]$ cat migration-job2-rhel8-dv.yaml
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstanceMigration
metadata:
creationTimestamp: null
name: job-rhel8-dv
spec:
vmiName: vm-rhel8-migration
status: {}
[kbidarka@kbidarka-host final]$ oc apply -f migration-job2-rhel8-dv.yaml
virtualmachineinstancemigration.kubevirt.io/job-rhel8-dv created
[kbidarka@kbidarka-host final]$ oc get pods
NAME READY STATUS RESTARTS AGE
virt-launcher-vm-rhel8-migration-2p8dg 0/1 PodInitializing 0 12s
virt-launcher-vm-rhel8-migration-brr87 1/1 Running 0 10m
[kbidarka@kbidarka-host final]$ oc get pods
NAME READY STATUS RESTARTS AGE
virt-launcher-vm-rhel8-migration-2p8dg 0/1 Running 0 15s
virt-launcher-vm-rhel8-migration-brr87 1/1 Running 0 10m
[kbidarka@kbidarka-host final]$ oc get vmi -o wide
NAME AGE PHASE IP NODENAME LIVE-MIGRATABLE
vm-rhel8-migration 10m Running 10.129.2.45 kbid24c-z9j88-worker-ddfc8 True
[kbidarka@kbidarka-host final]$ virtctl console vm-rhel8-migration
Successfully connected to vm-rhel8-migration console. The escape sequence is ^]
[cloud-user@vm-rhel8-migration ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)
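As a side note, the same migration and shutdown checks can also be driven from virtctl instead of applying the VirtualMachineInstanceMigration manifest above, assuming a virtctl build that includes the migrate subcommand; a minimal sketch with the VMI name used throughout this report:

virtctl migrate vm-rhel8-migration        # equivalent to applying the manifest above
oc get vmi vm-rhel8-migration -o wide     # watch NODENAME change once the migration completes
virtctl stop vm-rhel8-migration           # graceful-shutdown check, as in the next comment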
As seen below, shutdown was successful after cycling the libvirtd daemon in the VMI pod.
[kbidarka@kbidarka-host final]$ oc get vmi -o wide
NAME AGE PHASE IP NODENAME LIVE-MIGRATABLE
vm-rhel8-migration 10m Running 10.129.2.45 kbid24c-z9j88-worker-ddfc8 True
[kbidarka@kbidarka-host final]$ virtctl console vm-rhel8-migration
Successfully connected to vm-rhel8-migration console. The escape sequence is ^]
[cloud-user@vm-rhel8-migration ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)
[kbidarka@kbidarka-host final]$ oc get pods
NAME READY STATUS RESTARTS AGE
virt-launcher-vm-rhel8-migration-2p8dg 1/1 Running 0 8m57s
virt-launcher-vm-rhel8-migration-brr87 0/1 Completed 0 18m
[kbidarka@kbidarka-host final]$ oc rsh -c compute virt-launcher-vm-rhel8-migration-2p8dg
sh-4.4# virsh list
 Id Name State
---------------------------------------------------------
 1 domain-notify-errors_vm-rhel8-migration running
sh-4.4# ps -ef | grep libvirtd
root 30 20 0 14:25 ? 00:00:00 /usr/sbin/libvirtd
root 3852 3649 0 14:35 pts/0 00:00:00 grep libvirtd
sh-4.4# kill -9 30
sh-4.4# ps -ef | grep libvirtd
root 3913 20 0 14:35 ? 00:00:00 /usr/sbin/libvirtd
root 4754 3649 0 14:37 pts/0 00:00:00 grep libvirtd
sh-4.4# kill -9 3913
sh-4.4# exit
exit
[kbidarka@kbidarka-host final]$ oc get vmi
NAME AGE PHASE IP NODENAME
vm-rhel8-migration 22m Running 10.129.2.45 kbid24c-z9j88-worker-ddfc8
[kbidarka@kbidarka-host final]$ virtctl stop vm-rhel8-migration
VM vm-rhel8-migration was scheduled to stop
[kbidarka@kbidarka-host final]$ oc get vmi
No resources found in domain-notify-errors namespace.
[kbidarka@kbidarka-host final]$ oc get vm
NAME AGE VOLUME
vm-rhel8-migration 150m
Created attachment 1701778 [details]
logs seen when we cycle libvirtd daemon and shutdown the VM
Created attachment 1701801 [details]
with runstrategy halted
Created attachment 1701802 [details]
with runstrategy manual
Created attachment 1701803 [details]
with runstrategy always
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2020:3194