Some background:
-------------------------
I'm running a mini-scale OpenShift setup with 30 OpenShift nodes. The CNV build is the latest nightly, 4.12.0-745, and I'm currently running 1500 Fedora VMs. I have been doing some migration testing; unfortunately, after triggering 1000 migrations, 161 VMs are stuck in an Error state, e.g.:

virt-launcher-fedora-vm1421-7qjfj   0/2   Error   0   98m
virt-launcher-fedora-vm1422-plnpg   0/2   Error   0   64m
virt-launcher-fedora-vm1423-vw5nv   0/2   Error   0   81m
virt-launcher-fedora-vm1424-85bh7   0/2   Error   0   15m
virt-launcher-fedora-vm1425-vvflp   0/2   Error   0   48m
virt-launcher-fedora-vm1427-xmzls   0/2   Error   0   31m
virt-launcher-fedora-vm1428-8c7cf   0/2   Error   0   15m
virt-launcher-fedora-vm1430-wm56l   0/2   Error   0   15m

logs show:
-----------
{ "component": "virt-launcher", "level": "info", "msg": "Collected all requested hook sidecar sockets", "pos": "manager.go:86", "timestamp": "2022-12-01T12:41:01.973479Z" }
{ "component": "virt-launcher", "level": "info", "msg": "Sorted all collected sidecar sockets per hook point based on their priority and name: map[]", "pos": "manager.go:89", "timestamp": "2022-12-01T12:41:01.973531Z" }
{ "component": "virt-launcher", "level": "info", "msg": "Connecting to libvirt daemon: qemu+unix:///session?socket=/var/run/libvirt/libvirt-sock", "pos": "libvirt.go:497", "timestamp": "2022-12-01T12:41:01.974885Z" }
{ "component": "virt-launcher", "level": "info", "msg": "Connecting to libvirt daemon failed: virError(Code=38, Domain=7, Message='Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory')", "pos": "libvirt.go:505", "timestamp": "2022-12-01T12:41:01.975225Z" }
{ "component": "virt-launcher", "level": "info", "msg": "libvirt version: 8.0.0, package: 5.module+el8.6.0+14495+7194fa43 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2022-03-16-19:03:54, )", "subcomponent": "libvirt", "thread": "46", "timestamp": "2022-12-01T12:41:02.286000Z" }
{ "component": "virt-launcher", "level": "info", "msg": "hostname: virt-launcher-fedora-vm1430-wm56l", "subcomponent": "libvirt", "thread": "46", "timestamp": "2022-12-01T12:41:02.286000Z" }
{ "component": "virt-launcher", "level": "error", "msg": "internal error: Unable to get session bus connection: Cannot autolaunch D-Bus without X11 $DISPLAY", "pos": "virGDBusGetSessionBus:128", "subcomponent": "libvirt", "thread": "46", "timestamp": "2022-12-01T12:41:02.286000Z" }
{ "component": "virt-launcher", "level": "error", "msg": "internal error: Unable to get system bus connection: Could not connect: No such file or directory", "pos": "virGDBusGetSystemBus:101", "subcomponent": "libvirt", "thread": "46", "timestamp": "2022-12-01T12:41:02.286000Z" }
{ "component": "virt-launcher", "level": "info", "msg": "Connected to libvirt daemon", "pos": "libvirt.go:513", "timestamp": "2022-12-01T12:41:02.476806Z" }
{ "component": "virt-launcher", "level": "info", "msg": "Registered libvirt event notify callback", "pos": "client.go:510", "timestamp": "2022-12-01T12:41:02.479265Z" }
{ "component": "virt-launcher", "level": "info", "msg": "Marked as ready", "pos": "virt-launcher.go:74", "timestamp": "2022-12-01T12:41:02.479412Z" }
parse error: Invalid numeric literal at line 12, column 6

This is my migration config:
----------------------------
  liveMigrationConfig:
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 20
    parallelOutboundMigrationsPerNode: 4
    progressTimeout: 150
  workloads: {}
--------------------

Versions of all relevant components:
CNV 4.12.0-745
OCP 4.11.4

CNV must-gather:
-----------------
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/pods_crah_after_migration.tar.gz
Note that this issue is easily reproduced; however, it can also be avoided by reducing the number of VM migrations triggered in parallel.
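As a rough sketch of that workaround (not something I have actually applied), the parallelism could be lowered in the HyperConverged CR; the CR name/namespace and the reduced values below are assumptions for illustration only:
----------------------------
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged     # assumed default CR name
  namespace: openshift-cnv          # assumed default namespace
spec:
  liveMigrationConfig:
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 5          # e.g. down from 20
    parallelOutboundMigrationsPerNode: 2     # e.g. down from 4
    progressTimeout: 150
----------------------------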
I just had the same issue on OpenShift Virtualization 4.11.1: after triggering 1000 migrations, 816 VMs got stuck in a Pending state, and later on 580/816 of them ended up in a Failed state.

Pending:
========
kubevirt-migrate-vm-zpf8r   Pending   fedora-vm0342
kubevirt-migrate-vm-zq7rs   Pending   fedora-vm0653
kubevirt-migrate-vm-zqcgk   Pending   fedora-vm0714
kubevirt-migrate-vm-zqgtg   Pending   fedora-vm0738
kubevirt-migrate-vm-zsgzf   Pending   fedora-vm0664
kubevirt-migrate-vm-zshhg   Pending   fedora-vm0878
kubevirt-migrate-vm-zsspf   Pending   fedora-vm0703
kubevirt-migrate-vm-ztstq   Pending   fedora-vm0662

Failed:
=======
kubevirt-migrate-vm-zp4fz   Failed   fedora-vm0753
kubevirt-migrate-vm-zpf8r   Failed   fedora-vm0342
kubevirt-migrate-vm-zq7rs   Failed   fedora-vm0653
kubevirt-migrate-vm-zqgtg   Failed   fedora-vm0738
kubevirt-migrate-vm-zsspf   Failed   fedora-vm0703
kubevirt-migrate-vm-ztstq   Failed   fedora-vm0662
kubevirt-migrate-vm-zwpgp   Failed   fedora-vm0846

Versions of all relevant components:
CNV 4.11.1
OCP 4.11.4

CNV must-gather:
-----------------
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/migrating_vms_stuck_on_pending.tar.gz
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/migrating_vms_failed.tar.gz
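For reference, each kubevirt-migrate-vm-* entry above is a VirtualMachineInstanceMigration object, one per migrated VM; a minimal sketch of the kind of object that triggers such a migration (the namespace and VMI name here are illustrative only, taken from the list above):
----------------------------
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  generateName: kubevirt-migrate-vm-
  namespace: default            # illustrative namespace
spec:
  vmiName: fedora-vm0342        # illustrative VMI name
----------------------------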
Note that on my last run I had the following log verbosity set:

  logVerbosityConfig:
    kubevirt:
      virtAPI: 9
      virtController: 9
      virtHandler: 9
      virtLauncher: 9
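For context, that verbosity snippet sits under the HyperConverged CR spec; a minimal sketch, assuming the default CR name and namespace:
----------------------------
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged     # assumed default CR name
  namespace: openshift-cnv          # assumed default namespace
spec:
  logVerbosityConfig:
    kubevirt:
      virtAPI: 9
      virtController: 9
      virtHandler: 9
      virtLauncher: 9
----------------------------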
Igor, can you take a look at this to understand what's happening? One hypothesis we have is that doing too many migrations at once overloads the network bandwidth or the controllers, so the target VMs never actually come online; if they eventually time out, they simply crash. Again, this is just a hunch; a proper analysis is still required.
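If the bandwidth theory holds, one way to probe it would be to cap per-migration bandwidth alongside the existing liveMigrationConfig; this is a sketch only, with the limit value picked purely for illustration, not a recommended setting:
----------------------------
  liveMigrationConfig:
    bandwidthPerMigration: 64Mi              # illustrative cap for the experiment
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 20
    parallelOutboundMigrationsPerNode: 4
    progressTimeout: 150
----------------------------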
Boaz, can you check whether you saw any network saturation?
Deferring to 4.13.1 due to capacity.
@bbenshab unfortunately, CNV must-gathers are just useless noise... Is there any chance you have (or could grab) the logs of a failed virt-launcher target pod? Also, I'm really intrigued by that "parse error: Invalid numeric literal at line 12, column 6" in the bug description. Is that something that happened in CNV or is it just from some unhappy log parser? Thank you!
Moving this to 4.13.4, as we still need to gather logs from the failed virt-launcher pod and investigate.
I was able to reproduce this issue on a large-scale setup:
================================================================================
NAME                        PHASE    VMI
kubevirt-migrate-vm-62xxg   Failed   rhel92-vm1425
kubevirt-migrate-vm-6qs9w   Failed   rhel92-vm1040
kubevirt-migrate-vm-fw2zm   Failed   rhel92-vm0900

Looking at rhel92-vm1040:
================================================================================
virt-launcher-rhel92-vm1040-jm48n   1/1   Running   0   8h
virt-launcher-rhel92-vm1040-q6nr6   0/1   Error     0   5h19m
================================================================================

Failed target pod logs:
================================================================================
{"component":"virt-launcher","level":"info","msg":"Collected all requested hook sidecar sockets","pos":"manager.go:86","timestamp":"2023-06-26T11:25:15.793749Z"}
{"component":"virt-launcher","level":"info","msg":"Sorted all collected sidecar sockets per hook point based on their priority and name: map[]","pos":"manager.go:89","timestamp":"2023-06-26T11:25:15.793796Z"}
{"component":"virt-launcher","level":"info","msg":"Connecting to libvirt daemon: qemu+unix:///session?socket=/var/run/libvirt/virtqemud-sock","pos":"libvirt.go:496","timestamp":"2023-06-26T11:25:15.794715Z"}
{"component":"virt-launcher","level":"info","msg":"Connecting to libvirt daemon failed: virError(Code=38, Domain=7, Message='Failed to connect socket to '/var/run/libvirt/virtqemud-sock': No such file or directory')","pos":"libvirt.go:504","timestamp":"2023-06-26T11:25:15.795012Z"}
{"component":"virt-launcher","level":"info","msg":"libvirt version: 9.0.0, package: 10.el9_2 (Red Hat, Inc. \u003chttp://bugzilla.redhat.com/bugzilla\u003e, 2023-03-22-05:53:56, )","subcomponent":"libvirt","thread":"46","timestamp":"2023-06-26T11:25:15.820000Z"}
{"component":"virt-launcher","level":"info","msg":"hostname: rhel92-vm1040","subcomponent":"libvirt","thread":"46","timestamp":"2023-06-26T11:25:15.820000Z"}
{"component":"virt-launcher","level":"error","msg":"internal error: Unable to get session bus connection: Cannot spawn a message bus without a machine-id: Invalid machine ID in /var/lib/dbus/machine-id or /etc/machine-id","pos":"virGDBusGetSessionBus:126","subcomponent":"libvirt","thread":"46","timestamp":"2023-06-26T11:25:15.820000Z"}
{"component":"virt-launcher","level":"error","msg":"internal error: Unable to get system bus connection: Could not connect: No such file or directory","pos":"virGDBusGetSystemBus:99","subcomponent":"libvirt","thread":"46","timestamp":"2023-06-26T11:25:15.820000Z"}
{"component":"virt-launcher","level":"info","msg":"Connected to libvirt daemon","pos":"libvirt.go:512","timestamp":"2023-06-26T11:25:16.296266Z"}
{"component":"virt-launcher","level":"info","msg":"Registered libvirt event notify callback","pos":"client.go:537","timestamp":"2023-06-26T11:25:16.299548Z"}
{"component":"virt-launcher","level":"info","msg":"Marked as ready","pos":"virt-launcher.go:75","timestamp":"2023-06-26T11:25:16.299734Z"}
panic: timed out waiting for domain to be defined

goroutine 1 [running]:
main.waitForDomainUUID(0xc00029a600?, 0xc00029a4e0, 0xc00054e300, {0x1c4f248, 0xc0002fc4e0})
        /remote-source/app/cmd/virt-launcher/virt-launcher.go:246 +0x434
main.main()
        /remote-source/app/cmd/virt-launcher/virt-launcher.go:466 +0x1155
{"component":"virt-launcher-monitor","level":"info","msg":"Reaped pid 12 with status 512","pos":"virt-launcher-monitor.go:124","timestamp":"2023-06-26T11:29:28.305094Z"}
{"component":"virt-launcher-monitor","level":"error","msg":"dirty virt-launcher shutdown: exit-code 2","pos":"virt-launcher-monitor.go:142","timestamp":"2023-06-26T11:29:28.305199Z"}
================================================================================

Pod description:
================================================================================
Name:             virt-launcher-rhel92-vm1040-q6nr6
Namespace:        default
Priority:         0
Service Account:  default
Node:             worker084/192.168.216.97
Start Time:       Mon, 26 Jun 2023 11:25:09 +0000
Labels:           kubevirt.io=virt-launcher
                  kubevirt.io/created-by=d032413e-b344-4e78-b7fa-351e07375db6
                  kubevirt.io/migrationJobUID=55362155-f700-419f-819f-0aab0abcba0b
                  kubevirt.io/nodeName=worker074
                  kubevirt.io/vm=rhel92-vm1040-root-disk
                  vm.kubevirt.io/name=rhel92-vm1040
Annotations:      k8s.ovn.org/pod-networks:
                    {"default":{"ip_addresses":["10.131.47.187/23"],"mac_address":"0a:58:0a:83:2f:bb","gateway_ips":["10.131.46.1"],"ip_address":"10.131.47.18...
                  k8s.v1.cni.cncf.io/network-status:
                    [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "10.131.47.187" ], "mac": "0a:58:0a:83:2f:bb", "default": true, "dns": {} },{ "name": "default/br-vlan", "interface": "net1", "mac": "02:4a:b4:00:05:6e", "dns": {} }]
                  k8s.v1.cni.cncf.io/networks: [{"interface":"net1","mac":"02:4a:b4:00:05:6e","name":"br-vlan","namespace":"default"}]
                  kubectl.kubernetes.io/default-container: compute
                  kubevirt.io/domain: rhel92-vm1040
                  kubevirt.io/migrationJobName: kubevirt-migrate-vm-6qs9w
                  kubevirt.io/migrationTransportUnix: true
                  kubevirt.io/nonroot: true
                  post.hook.backup.velero.io/command: ["/usr/bin/virt-freezer", "--unfreeze", "--name", "rhel92-vm1040", "--namespace", "default"]
                  post.hook.backup.velero.io/container: compute
                  pre.hook.backup.velero.io/command: ["/usr/bin/virt-freezer", "--freeze", "--name", "rhel92-vm1040", "--namespace", "default"]
                  pre.hook.backup.velero.io/container: compute
Status:           Failed
IP:               10.131.47.187
IPs:
  IP:             10.131.47.187
Controlled By:    VirtualMachineInstance/rhel92-vm1040
Containers:
  compute:
    Container ID:  cri-o://c280816f7c8a0966a38f35bbdf1db73872b5a1bd823524883853732daca67bc6
    Image:         registry.redhat.io/container-native-virtualization/virt-launcher-rhel9@sha256:0ab71be49e0a6a7eff00264c72da8a9ef0fa3fbca6b2381de0d15540527cce97
    Image ID:      registry.redhat.io/container-native-virtualization/virt-launcher-rhel9@sha256:0ab71be49e0a6a7eff00264c72da8a9ef0fa3fbca6b2381de0d15540527cce97
    Port:          <none>
    Host Port:     <none>
    Command:       /usr/bin/virt-launcher-monitor --qemu-timeout 252s --name rhel92-vm1040 --uid d032413e-b344-4e78-b7fa-351e07375db6 --namespace default --kubevirt-share-dir /var/run/kubevirt --ephemeral-disk-dir /var/run/kubevirt-ephemeral-disks --container-disk-dir /var/run/kubevirt/container-disks --grace-period-seconds 15 --hook-sidecars 0 --ovmf-path /usr/share/OVMF --run-as-nonroot
    State:          Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 26 Jun 2023 11:25:15 +0000
      Finished:     Mon, 26 Jun 2023 11:29:28 +0000
    Ready:          False
    Restart Count:  0
    Limits:
      devices.kubevirt.io/kvm:        1
      devices.kubevirt.io/tun:        1
      devices.kubevirt.io/vhost-net:  1
    Requests:
      cpu:                            1
      devices.kubevirt.io/kvm:        1
      devices.kubevirt.io/tun:        1
      devices.kubevirt.io/vhost-net:  1
      ephemeral-storage:              50M
      memory:                         2257661642
    Environment:
      XDG_CACHE_HOME:                /var/run/kubevirt-private
      XDG_CONFIG_HOME:               /var/run/kubevirt-private
      XDG_RUNTIME_DIR:               /var/run
      KUBEVIRT_RESOURCE_NAME_nic-0:
      POD_NAME:                      virt-launcher-rhel92-vm1040-q6nr6 (v1:metadata.name)
    Mounts:
      /var/run/kubevirt from public (rw)
      /var/run/kubevirt-ephemeral-disks from ephemeral-disks (rw)
      /var/run/kubevirt-private from private (rw)
      /var/run/kubevirt/container-disks from container-disks (rw)
      /var/run/kubevirt/hotplug-disks from hotplug-disks (rw)
      /var/run/kubevirt/sockets from sockets (rw)
      /var/run/libvirt from libvirt-runtime (rw)
    Devices:
      /dev/rhel92-vm1040-root-disk from rhel92-vm1040-root-disk
Readiness Gates:
  Type                                   Status
  kubevirt.io/virtual-machine-unpaused   <none>
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  private:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  public:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  sockets:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  virt-bin-share-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  libvirt-runtime:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  ephemeral-disks:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  container-disks:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  rhel92-vm1040-root-disk:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  rhel92-vm1040-root-disk
    ReadOnly:   false
  hotplug-disks:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
QoS Class:       Burstable
Node-Selectors:  cpu-feature.node.kubevirt.io/amd-ssbd=true
                 cpu-feature.node.kubevirt.io/amd-stibp=true
                 cpu-feature.node.kubevirt.io/arch-capabilities=true
                 cpu-feature.node.kubevirt.io/avx512ifma=true
                 cpu-feature.node.kubevirt.io/ds=true
                 cpu-feature.node.kubevirt.io/dtes64=true
                 cpu-feature.node.kubevirt.io/fsrm=true
                 cpu-feature.node.kubevirt.io/hypervisor=true
                 cpu-feature.node.kubevirt.io/ibpb=true
                 cpu-feature.node.kubevirt.io/ibrs=true
                 cpu-feature.node.kubevirt.io/ibrs-all=true
                 cpu-feature.node.kubevirt.io/invtsc=true
                 cpu-feature.node.kubevirt.io/md-clear=true
                 cpu-feature.node.kubevirt.io/mds-no=true
                 cpu-feature.node.kubevirt.io/pdcm=true
                 cpu-feature.node.kubevirt.io/pschange-mc-no=true
                 cpu-feature.node.kubevirt.io/rdctl-no=true
                 cpu-feature.node.kubevirt.io/rdpid=true
                 cpu-feature.node.kubevirt.io/sha-ni=true
                 cpu-feature.node.kubevirt.io/skip-l1dfl-vmentry=true
                 cpu-feature.node.kubevirt.io/ss=true
                 cpu-feature.node.kubevirt.io/stibp=true
                 cpu-feature.node.kubevirt.io/tsc_adjust=true
                 cpu-feature.node.kubevirt.io/tsx-ctrl=true
                 cpu-feature.node.kubevirt.io/vmx=true
                 cpu-feature.node.kubevirt.io/xsaves=true
                 cpu-model-migration.node.kubevirt.io/Icelake-Server=true
                 kubevirt.io/schedulable=true
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>
================================================================================
*** This bug has been marked as a duplicate of bug 2218435 ***