Bug 2149960
| Summary: | CNV 4.12.0-745 (dev nightly) & 4.11.1 \| miniscale testing - VMs are crashing during migration / migration is failing. | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Boaz <bbenshab> |
| Component: | Virtualization | Assignee: | Jed Lejosne <jlejosne> |
| Status: | CLOSED DUPLICATE | QA Contact: | Kedar Bidarkar <kbidarka> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 4.12.0 | CC: | acardace, akamra, fdeutsch, jhopper, sgott |
| Target Milestone: | --- | ||
| Target Release: | 4.13.4 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-06-29 07:19:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Note that this issue is easily reproduced, but it can also be avoided by reducing the number of VM migrations triggered in parallel.

I just hit the same issue on OpenShift Virtualization 4.11.1: after triggering 1000 migrations, 816 VMs got stuck in the Pending state, and later 580 of those 816 ended up in the Failed state.

Pending:
========
kubevirt-migrate-vm-zpf8r   Pending   fedora-vm0342
kubevirt-migrate-vm-zq7rs   Pending   fedora-vm0653
kubevirt-migrate-vm-zqcgk   Pending   fedora-vm0714
kubevirt-migrate-vm-zqgtg   Pending   fedora-vm0738
kubevirt-migrate-vm-zsgzf   Pending   fedora-vm0664
kubevirt-migrate-vm-zshhg   Pending   fedora-vm0878
kubevirt-migrate-vm-zsspf   Pending   fedora-vm0703
kubevirt-migrate-vm-ztstq   Pending   fedora-vm0662

Failed:
=======
kubevirt-migrate-vm-zp4fz   Failed   fedora-vm0753
kubevirt-migrate-vm-zpf8r   Failed   fedora-vm0342
kubevirt-migrate-vm-zq7rs   Failed   fedora-vm0653
kubevirt-migrate-vm-zqgtg   Failed   fedora-vm0738
kubevirt-migrate-vm-zsspf   Failed   fedora-vm0703
kubevirt-migrate-vm-ztstq   Failed   fedora-vm0662
kubevirt-migrate-vm-zwpgp   Failed   fedora-vm0846

Versions of all relevant components:
CNV 4.11.1
OCP 4.11.4

CNV must-gather:
-----------------
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/migrating_vms_stuck_on_pending.tar.gz
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/migrating_vms_failed.tar.gz

Note that on my last instance I had the following log verbosity set:
logVerbosityConfig:
kubevirt:
virtAPI: 9
virtController: 9
virtHandler: 9
virtLauncher: 9
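
(For reference, a minimal sketch of where this verbosity block would sit, assuming it is set on the HyperConverged CR with the default name and namespace; the surrounding fields are an assumption, only the logVerbosityConfig values above come from this report:)

  apiVersion: hco.kubevirt.io/v1beta1
  kind: HyperConverged
  metadata:
    name: kubevirt-hyperconverged
    namespace: openshift-cnv
  spec:
    logVerbosityConfig:
      kubevirt:
        virtAPI: 9
        virtController: 9
        virtHandler: 9
        virtLauncher: 9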
Igor, can you take a look at this to understand what's happening? One hypothesis we have is that doing too many migrations at once overloads the network bandwidth or the controllers, so the target VMs never actually come online; if they eventually time out, they simply crash. Again, this is just a hunch and a proper analysis is still required. Boaz, can you check whether you saw network saturation?

Deferring to 4.13.1 due to capacity.

@bbenshab unfortunately, CNV must-gathers are just useless noise... Is there any chance you have (or could grab) the logs of a failed virt-launcher target pod? Also, I'm really intrigued by that "parse error: Invalid numeric literal at line 12, column 6" in the bug description. Is that something that happened in CNV, or is it just from some unhappy log parser? Thank you!

Moving this to 4.13.4 as we still need to gather logs from the virt-launcher pod and investigate further.
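For reference, a minimal sketch of how the requested failed-target-pod logs could be collected (the namespace and pod name are illustrative, not a confirmed reproduction step):

  # list virt-launcher pods that ended up in a Failed phase
  oc get pods -n default -l kubevirt.io=virt-launcher --field-selector=status.phase=Failed
  # dump the compute container log of one failed target pod
  oc logs -n default virt-launcher-fedora-vm1421-7qjfj -c compute > failed-target-pod.log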
I was able to reproduce this issue on a large-scale setup:
==================================================================================================================================================================================================================
NAME PHASE VMI
kubevirt-migrate-vm-62xxg Failed rhel92-vm1425
kubevirt-migrate-vm-6qs9w Failed rhel92-vm1040
kubevirt-migrate-vm-fw2zm Failed rhel92-vm0900
looking at rhel92-vm1040:
==================================================================================================================================================================================================================
virt-launcher-rhel92-vm1040-jm48n 1/1 Running 0 8h
virt-launcher-rhel92-vm1040-q6nr6 0/1 Error 0 5h19m
==================================================================================================================================================================================================================
failed target pod logs:
==================================================================================================================================================================================================================
{"component":"virt-launcher","level":"info","msg":"Collected all requested hook sidecar sockets","pos":"manager.go:86","timestamp":"2023-06-26T11:25:15.793749Z"}
{"component":"virt-launcher","level":"info","msg":"Sorted all collected sidecar sockets per hook point based on their priority and name: map[]","pos":"manager.go:89","timestamp":"2023-06-26T11:25:15.793796Z"}
{"component":"virt-launcher","level":"info","msg":"Connecting to libvirt daemon: qemu+unix:///session?socket=/var/run/libvirt/virtqemud-sock","pos":"libvirt.go:496","timestamp":"2023-06-26T11:25:15.794715Z"}
{"component":"virt-launcher","level":"info","msg":"Connecting to libvirt daemon failed: virError(Code=38, Domain=7, Message='Failed to connect socket to '/var/run/libvirt/virtqemud-sock': No such file or directory')","pos":"libvirt.go:504","timestamp":"2023-06-26T11:25:15.795012Z"}
{"component":"virt-launcher","level":"info","msg":"libvirt version: 9.0.0, package: 10.el9_2 (Red Hat, Inc. \u003chttp://bugzilla.redhat.com/bugzilla\u003e, 2023-03-22-05:53:56, )","subcomponent":"libvirt","thread":"46","timestamp":"2023-06-26T11:25:15.820000Z"}
{"component":"virt-launcher","level":"info","msg":"hostname: rhel92-vm1040","subcomponent":"libvirt","thread":"46","timestamp":"2023-06-26T11:25:15.820000Z"}
{"component":"virt-launcher","level":"error","msg":"internal error: Unable to get session bus connection: Cannot spawn a message bus without a machine-id: Invalid machine ID in /var/lib/dbus/machine-id or /etc/machine-id","pos":"virGDBusGetSessionBus:126","subcomponent":"libvirt","thread":"46","timestamp":"2023-06-26T11:25:15.820000Z"}
{"component":"virt-launcher","level":"error","msg":"internal error: Unable to get system bus connection: Could not connect: No such file or directory","pos":"virGDBusGetSystemBus:99","subcomponent":"libvirt","thread":"46","timestamp":"2023-06-26T11:25:15.820000Z"}
{"component":"virt-launcher","level":"info","msg":"Connected to libvirt daemon","pos":"libvirt.go:512","timestamp":"2023-06-26T11:25:16.296266Z"}
{"component":"virt-launcher","level":"info","msg":"Registered libvirt event notify callback","pos":"client.go:537","timestamp":"2023-06-26T11:25:16.299548Z"}
{"component":"virt-launcher","level":"info","msg":"Marked as ready","pos":"virt-launcher.go:75","timestamp":"2023-06-26T11:25:16.299734Z"}
panic: timed out waiting for domain to be defined
goroutine 1 [running]:
main.waitForDomainUUID(0xc00029a600?, 0xc00029a4e0, 0xc00054e300, {0x1c4f248, 0xc0002fc4e0})
/remote-source/app/cmd/virt-launcher/virt-launcher.go:246 +0x434
main.main()
/remote-source/app/cmd/virt-launcher/virt-launcher.go:466 +0x1155
{"component":"virt-launcher-monitor","level":"info","msg":"Reaped pid 12 with status 512","pos":"virt-launcher-monitor.go:124","timestamp":"2023-06-26T11:29:28.305094Z"}
{"component":"virt-launcher-monitor","level":"error","msg":"dirty virt-launcher shutdown: exit-code 2","pos":"virt-launcher-monitor.go:142","timestamp":"2023-06-26T11:29:28.305199Z"}
==================================================================================================================================================================================================================
pod description:
==================================================================================================================================================================================================================
Name: virt-launcher-rhel92-vm1040-q6nr6
Namespace: default
Priority: 0
Service Account: default
Node: worker084/192.168.216.97
Start Time: Mon, 26 Jun 2023 11:25:09 +0000
Labels: kubevirt.io=virt-launcher
kubevirt.io/created-by=d032413e-b344-4e78-b7fa-351e07375db6
kubevirt.io/migrationJobUID=55362155-f700-419f-819f-0aab0abcba0b
kubevirt.io/nodeName=worker074
kubevirt.io/vm=rhel92-vm1040-root-disk
vm.kubevirt.io/name=rhel92-vm1040
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["10.131.47.187/23"],"mac_address":"0a:58:0a:83:2f:bb","gateway_ips":["10.131.46.1"],"ip_address":"10.131.47.18...
k8s.v1.cni.cncf.io/network-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"10.131.47.187"
],
"mac": "0a:58:0a:83:2f:bb",
"default": true,
"dns": {}
},{
"name": "default/br-vlan",
"interface": "net1",
"mac": "02:4a:b4:00:05:6e",
"dns": {}
}]
k8s.v1.cni.cncf.io/networks: [{"interface":"net1","mac":"02:4a:b4:00:05:6e","name":"br-vlan","namespace":"default"}]
kubectl.kubernetes.io/default-container: compute
kubevirt.io/domain: rhel92-vm1040
kubevirt.io/migrationJobName: kubevirt-migrate-vm-6qs9w
kubevirt.io/migrationTransportUnix: true
kubevirt.io/nonroot: true
post.hook.backup.velero.io/command: ["/usr/bin/virt-freezer", "--unfreeze", "--name", "rhel92-vm1040", "--namespace", "default"]
post.hook.backup.velero.io/container: compute
pre.hook.backup.velero.io/command: ["/usr/bin/virt-freezer", "--freeze", "--name", "rhel92-vm1040", "--namespace", "default"]
pre.hook.backup.velero.io/container: compute
Status: Failed
IP: 10.131.47.187
IPs:
IP: 10.131.47.187
Controlled By: VirtualMachineInstance/rhel92-vm1040
Containers:
compute:
Container ID: cri-o://c280816f7c8a0966a38f35bbdf1db73872b5a1bd823524883853732daca67bc6
Image: registry.redhat.io/container-native-virtualization/virt-launcher-rhel9@sha256:0ab71be49e0a6a7eff00264c72da8a9ef0fa3fbca6b2381de0d15540527cce97
Image ID: registry.redhat.io/container-native-virtualization/virt-launcher-rhel9@sha256:0ab71be49e0a6a7eff00264c72da8a9ef0fa3fbca6b2381de0d15540527cce97
Port: <none>
Host Port: <none>
Command:
/usr/bin/virt-launcher-monitor
--qemu-timeout
252s
--name
rhel92-vm1040
--uid
d032413e-b344-4e78-b7fa-351e07375db6
--namespace
default
--kubevirt-share-dir
/var/run/kubevirt
--ephemeral-disk-dir
/var/run/kubevirt-ephemeral-disks
--container-disk-dir
/var/run/kubevirt/container-disks
--grace-period-seconds
15
--hook-sidecars
0
--ovmf-path
/usr/share/OVMF
--run-as-nonroot
State: Terminated
Reason: Error
Exit Code: 2
Started: Mon, 26 Jun 2023 11:25:15 +0000
Finished: Mon, 26 Jun 2023 11:29:28 +0000
Ready: False
Restart Count: 0
Limits:
devices.kubevirt.io/kvm: 1
devices.kubevirt.io/tun: 1
devices.kubevirt.io/vhost-net: 1
Requests:
cpu: 1
devices.kubevirt.io/kvm: 1
devices.kubevirt.io/tun: 1
devices.kubevirt.io/vhost-net: 1
ephemeral-storage: 50M
memory: 2257661642
Environment:
XDG_CACHE_HOME: /var/run/kubevirt-private
XDG_CONFIG_HOME: /var/run/kubevirt-private
XDG_RUNTIME_DIR: /var/run
KUBEVIRT_RESOURCE_NAME_nic-0:
POD_NAME: virt-launcher-rhel92-vm1040-q6nr6 (v1:metadata.name)
Mounts:
/var/run/kubevirt from public (rw)
/var/run/kubevirt-ephemeral-disks from ephemeral-disks (rw)
/var/run/kubevirt-private from private (rw)
/var/run/kubevirt/container-disks from container-disks (rw)
/var/run/kubevirt/hotplug-disks from hotplug-disks (rw)
/var/run/kubevirt/sockets from sockets (rw)
/var/run/libvirt from libvirt-runtime (rw)
Devices:
/dev/rhel92-vm1040-root-disk from rhel92-vm1040-root-disk
Readiness Gates:
Type Status
kubevirt.io/virtual-machine-unpaused <none>
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
private:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
public:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
sockets:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
virt-bin-share-dir:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
libvirt-runtime:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
ephemeral-disks:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
container-disks:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
rhel92-vm1040-root-disk:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: rhel92-vm1040-root-disk
ReadOnly: false
hotplug-disks:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
QoS Class: Burstable
Node-Selectors: cpu-feature.node.kubevirt.io/amd-ssbd=true
cpu-feature.node.kubevirt.io/amd-stibp=true
cpu-feature.node.kubevirt.io/arch-capabilities=true
cpu-feature.node.kubevirt.io/avx512ifma=true
cpu-feature.node.kubevirt.io/ds=true
cpu-feature.node.kubevirt.io/dtes64=true
cpu-feature.node.kubevirt.io/fsrm=true
cpu-feature.node.kubevirt.io/hypervisor=true
cpu-feature.node.kubevirt.io/ibpb=true
cpu-feature.node.kubevirt.io/ibrs=true
cpu-feature.node.kubevirt.io/ibrs-all=true
cpu-feature.node.kubevirt.io/invtsc=true
cpu-feature.node.kubevirt.io/md-clear=true
cpu-feature.node.kubevirt.io/mds-no=true
cpu-feature.node.kubevirt.io/pdcm=true
cpu-feature.node.kubevirt.io/pschange-mc-no=true
cpu-feature.node.kubevirt.io/rdctl-no=true
cpu-feature.node.kubevirt.io/rdpid=true
cpu-feature.node.kubevirt.io/sha-ni=true
cpu-feature.node.kubevirt.io/skip-l1dfl-vmentry=true
cpu-feature.node.kubevirt.io/ss=true
cpu-feature.node.kubevirt.io/stibp=true
cpu-feature.node.kubevirt.io/tsc_adjust=true
cpu-feature.node.kubevirt.io/tsx-ctrl=true
cpu-feature.node.kubevirt.io/vmx=true
cpu-feature.node.kubevirt.io/xsaves=true
cpu-model-migration.node.kubevirt.io/Icelake-Server=true
kubevirt.io/schedulable=true
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
==================================================================================================================================================================================================================
*** This bug has been marked as a duplicate of bug 2218435 ***
Some background:
-------------------------
I'm running a mini-scale OpenShift setup with 30 OpenShift nodes. The CNV build is the latest nightly, 4.12.0-745, and I'm currently running 1500 Fedora VMs. I have been doing some migration testing; unfortunately, after triggering 1000 migrations, 161 VMs are stuck in an Error state, e.g.:

virt-launcher-fedora-vm1421-7qjfj   0/2   Error   0   98m
virt-launcher-fedora-vm1422-plnpg   0/2   Error   0   64m
virt-launcher-fedora-vm1423-vw5nv   0/2   Error   0   81m
virt-launcher-fedora-vm1424-85bh7   0/2   Error   0   15m
virt-launcher-fedora-vm1425-vvflp   0/2   Error   0   48m
virt-launcher-fedora-vm1427-xmzls   0/2   Error   0   31m
virt-launcher-fedora-vm1428-8c7cf   0/2   Error   0   15m
virt-launcher-fedora-vm1430-wm56l   0/2   Error   0   15m

logs show:
-----------
{"component":"virt-launcher","level":"info","msg":"Collected all requested hook sidecar sockets","pos":"manager.go:86","timestamp":"2022-12-01T12:41:01.973479Z"}
{"component":"virt-launcher","level":"info","msg":"Sorted all collected sidecar sockets per hook point based on their priority and name: map[]","pos":"manager.go:89","timestamp":"2022-12-01T12:41:01.973531Z"}
{"component":"virt-launcher","level":"info","msg":"Connecting to libvirt daemon: qemu+unix:///session?socket=/var/run/libvirt/libvirt-sock","pos":"libvirt.go:497","timestamp":"2022-12-01T12:41:01.974885Z"}
{"component":"virt-launcher","level":"info","msg":"Connecting to libvirt daemon failed: virError(Code=38, Domain=7, Message='Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory')","pos":"libvirt.go:505","timestamp":"2022-12-01T12:41:01.975225Z"}
{"component":"virt-launcher","level":"info","msg":"libvirt version: 8.0.0, package: 5.module+el8.6.0+14495+7194fa43 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2022-03-16-19:03:54, )","subcomponent":"libvirt","thread":"46","timestamp":"2022-12-01T12:41:02.286000Z"}
{"component":"virt-launcher","level":"info","msg":"hostname: virt-launcher-fedora-vm1430-wm56l","subcomponent":"libvirt","thread":"46","timestamp":"2022-12-01T12:41:02.286000Z"}
{"component":"virt-launcher","level":"error","msg":"internal error: Unable to get session bus connection: Cannot autolaunch D-Bus without X11 $DISPLAY","pos":"virGDBusGetSessionBus:128","subcomponent":"libvirt","thread":"46","timestamp":"2022-12-01T12:41:02.286000Z"}
{"component":"virt-launcher","level":"error","msg":"internal error: Unable to get system bus connection: Could not connect: No such file or directory","pos":"virGDBusGetSystemBus:101","subcomponent":"libvirt","thread":"46","timestamp":"2022-12-01T12:41:02.286000Z"}
{"component":"virt-launcher","level":"info","msg":"Connected to libvirt daemon","pos":"libvirt.go:513","timestamp":"2022-12-01T12:41:02.476806Z"}
{"component":"virt-launcher","level":"info","msg":"Registered libvirt event notify callback","pos":"client.go:510","timestamp":"2022-12-01T12:41:02.479265Z"}
{"component":"virt-launcher","level":"info","msg":"Marked as ready","pos":"virt-launcher.go:74","timestamp":"2022-12-01T12:41:02.479412Z"}
parse error: Invalid numeric literal at line 12, column 6

this is my migration config:
----------------------------
liveMigrationConfig:
  completionTimeoutPerGiB: 800
  parallelMigrationsPerCluster: 20
  parallelOutboundMigrationsPerNode: 4
  progressTimeout: 150
workloads: {}

Versions of all relevant components:
CNV 4.12.0-745
OCP 4.11.4

CNV must-gather:
-----------------
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/pods_crah_after_migration.tar.gz
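
As noted in the comments above, the issue can be avoided by reducing the number of parallel migrations. A minimal sketch of such a workaround, assuming the liveMigrationConfig lives on the default HyperConverged CR (the values 5 and 2 are only examples, not validated recommendations):

  oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type=merge \
    -p '{"spec":{"liveMigrationConfig":{"parallelMigrationsPerCluster":5,"parallelOutboundMigrationsPerNode":2}}}'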