Bug 2149960 - CNV 4.12.0-745 (dev nightly) & 4.11.1 | miniscale testing - VMs are crashing during migration / migration is failing.
Keywords:
Status: CLOSED DUPLICATE of bug 2218435
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.12.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.13.4
Assignee: Jed Lejosne
QA Contact: Kedar Bidarkar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-12-01 13:06 UTC by Boaz
Modified: 2023-06-29 07:19 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-29 07:19:57 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links:
Red Hat Issue Tracker CNV-23056 (last updated 2022-12-01 13:16:02 UTC)

Description Boaz 2022-12-01 13:06:05 UTC
Some background:
-------------------------
I'm running a mini-scale OpenShift setup with 30 OpenShift nodes. The CNV build is the latest nightly, 4.12.0-745, and I'm currently running 1500 Fedora VMs. I have been doing some migration testing, and unfortunately, after triggering 1000 migrations, 161 VMs are stuck in an Error state, e.g.:

virt-launcher-fedora-vm1421-7qjfj   0/2     Error             0          98m
virt-launcher-fedora-vm1422-plnpg   0/2     Error             0          64m
virt-launcher-fedora-vm1423-vw5nv   0/2     Error             0          81m
virt-launcher-fedora-vm1424-85bh7   0/2     Error             0          15m
virt-launcher-fedora-vm1425-vvflp   0/2     Error             0          48m
virt-launcher-fedora-vm1427-xmzls   0/2     Error             0          31m
virt-launcher-fedora-vm1428-8c7cf   0/2     Error             0          15m
virt-launcher-fedora-vm1430-wm56l   0/2     Error             0          15m
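
These can be tallied with something like the following (assuming the VMs run in the default namespace and the usual kubevirt.io=virt-launcher pod label):

oc get pods -n default -l kubevirt.io=virt-launcher \
  --field-selector status.phase=Failed --no-headers | wc -l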


logs show:
-----------
{
  "component": "virt-launcher",
  "level": "info",
  "msg": "Collected all requested hook sidecar sockets",
  "pos": "manager.go:86",
  "timestamp": "2022-12-01T12:41:01.973479Z"
}
{
  "component": "virt-launcher",
  "level": "info",
  "msg": "Sorted all collected sidecar sockets per hook point based on their priority and name: map[]",
  "pos": "manager.go:89",
  "timestamp": "2022-12-01T12:41:01.973531Z"
}
{
  "component": "virt-launcher",
  "level": "info",
  "msg": "Connecting to libvirt daemon: qemu+unix:///session?socket=/var/run/libvirt/libvirt-sock",
  "pos": "libvirt.go:497",
  "timestamp": "2022-12-01T12:41:01.974885Z"
}
{
  "component": "virt-launcher",
  "level": "info",
  "msg": "Connecting to libvirt daemon failed: virError(Code=38, Domain=7, Message='Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory')",
  "pos": "libvirt.go:505",
  "timestamp": "2022-12-01T12:41:01.975225Z"
}
{
  "component": "virt-launcher",
  "level": "info",
  "msg": "libvirt version: 8.0.0, package: 5.module+el8.6.0+14495+7194fa43 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2022-03-16-19:03:54, )",
  "subcomponent": "libvirt",
  "thread": "46",
  "timestamp": "2022-12-01T12:41:02.286000Z"
}
{
  "component": "virt-launcher",
  "level": "info",
  "msg": "hostname: virt-launcher-fedora-vm1430-wm56l",
  "subcomponent": "libvirt",
  "thread": "46",
  "timestamp": "2022-12-01T12:41:02.286000Z"
}
{
  "component": "virt-launcher",
  "level": "error",
  "msg": "internal error: Unable to get session bus connection: Cannot autolaunch D-Bus without X11 $DISPLAY",
  "pos": "virGDBusGetSessionBus:128",
  "subcomponent": "libvirt",
  "thread": "46",
  "timestamp": "2022-12-01T12:41:02.286000Z"
}
{
  "component": "virt-launcher",
  "level": "error",
  "msg": "internal error: Unable to get system bus connection: Could not connect: No such file or directory",
  "pos": "virGDBusGetSystemBus:101",
  "subcomponent": "libvirt",
  "thread": "46",
  "timestamp": "2022-12-01T12:41:02.286000Z"
}
{
  "component": "virt-launcher",
  "level": "info",
  "msg": "Connected to libvirt daemon",
  "pos": "libvirt.go:513",
  "timestamp": "2022-12-01T12:41:02.476806Z"
}
{
  "component": "virt-launcher",
  "level": "info",
  "msg": "Registered libvirt event notify callback",
  "pos": "client.go:510",
  "timestamp": "2022-12-01T12:41:02.479265Z"
}
{
  "component": "virt-launcher",
  "level": "info",
  "msg": "Marked as ready",
  "pos": "virt-launcher.go:74",
  "timestamp": "2022-12-01T12:41:02.479412Z"
}
parse error: Invalid numeric literal at line 12, column 6
  


this is my migration config:
----------------------------
  liveMigrationConfig:
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 20
    parallelOutboundMigrationsPerNode: 4
    progressTimeout: 150
  workloads: {}
--------------------
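
This block appears to come from the HyperConverged CR; assuming the default name and namespace, it can be read back with:

oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -o json \
  | jq .spec.liveMigrationConfig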

Versions of all relevant components:
CNV	4.12.0-745
OCP     4.11.4


CNV must-gather:
-----------------
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/pods_crah_after_migration.tar.gz

Comment 1 Boaz 2022-12-04 11:23:40 UTC
Note that this issue is easily reproduced; however, it can also be avoided by reducing the number of VM migrations triggered in parallel.
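
For example, a sketch of dialing the cluster-wide parallelism back down via the HyperConverged CR (default name/namespace assumed; 5 is the usual default):

oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type merge \
  -p '{"spec":{"liveMigrationConfig":{"parallelMigrationsPerCluster":5}}}'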

Comment 2 Boaz 2022-12-05 09:42:25 UTC
I just had the same issue on OpenShift Virtualization 4.11.1:
after triggering 1000 migrations, 816 VMs got stuck in the Pending state, and later on, 580 of those 816 ended up in a Failed state.


Pending:
========
kubevirt-migrate-vm-zpf8r   Pending     fedora-vm0342
kubevirt-migrate-vm-zq7rs   Pending     fedora-vm0653
kubevirt-migrate-vm-zqcgk   Pending     fedora-vm0714
kubevirt-migrate-vm-zqgtg   Pending     fedora-vm0738
kubevirt-migrate-vm-zsgzf   Pending     fedora-vm0664
kubevirt-migrate-vm-zshhg   Pending     fedora-vm0878
kubevirt-migrate-vm-zsspf   Pending     fedora-vm0703
kubevirt-migrate-vm-ztstq   Pending     fedora-vm0662



Failed:
=======
kubevirt-migrate-vm-zp4fz   Failed            fedora-vm0753
kubevirt-migrate-vm-zpf8r   Failed            fedora-vm0342
kubevirt-migrate-vm-zq7rs   Failed            fedora-vm0653
kubevirt-migrate-vm-zqgtg   Failed            fedora-vm0738
kubevirt-migrate-vm-zsspf   Failed            fedora-vm0703
kubevirt-migrate-vm-ztstq   Failed            fedora-vm0662
kubevirt-migrate-vm-zwpgp   Failed            fedora-vm0846
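
A quick way to tally the migration objects by phase (assuming the vmim short name for VirtualMachineInstanceMigration and jq on the path):

oc get vmim -A -o json | jq -r '.items[].status.phase' | sort | uniq -c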

  




Versions of all relevant components:
CNV	4.11.1
OCP     4.11.4


CNV must-gather:
-----------------
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/migrating_vms_stuck_on_pending.tar.gz
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/migrating_vms_failed.tar.gz

Comment 3 Boaz 2022-12-05 09:43:55 UTC
Note that for my last run I had the following log verbosity settings:

  logVerbosityConfig:
    kubevirt:
      virtAPI: 9
      virtController: 9
      virtHandler: 9
      virtLauncher: 9
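
With that verbosity, the virt-handler logs for one of the affected VMIs can be pulled per node with something like the following sketch (the VMI name below is just an example from the lists above):

for p in $(oc get pods -n openshift-cnv -l kubevirt.io=virt-handler -o name); do
  oc logs -n openshift-cnv "$p" | grep fedora-vm0342
done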

Comment 4 sgott 2022-12-07 13:17:48 UTC
Igor, can you take a look at this to understand what's happening?

One hypothesis we have is that doing too many migrations at once overloads the network bandwidth or the controllers, so the target VMs don't actually come online. If they eventually time out, they will simply crash. Again, this is just a hunch; a proper analysis is still required.

Comment 6 Fabian Deutsch 2022-12-07 16:45:34 UTC
Boaz, can you take a look and check whether there was network saturation?
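
For example (a sketch, assuming the standard cluster monitoring stack and access to the thanos-querier route), per-node NIC transmit rates over the migration window could be pulled with:

TOKEN=$(oc whoami -t)
HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer $TOKEN" \
  --data-urlencode 'query=sum by (instance) (rate(node_network_transmit_bytes_total{device!="lo"}[5m]))' \
  "https://$HOST/api/v1/query" | jq .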

Comment 10 Antonio Cardace 2023-03-03 15:26:18 UTC
Deferring to 4.13.1 due to capacity.

Comment 11 Jed Lejosne 2023-05-30 17:55:16 UTC
@bbenshab unfortunately, CNV must-gathers are just useless noise... Is there any chance you have (or could grab) the logs of a failed virt-launcher target pod?
Also, I'm really intrigued by that "parse error: Invalid numeric literal at line 12, column 6" in the bug description. Is that something that happened in CNV or is it just from some unhappy log parser?
Thank you!
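
For example, something along these lines would capture what's needed (pod and namespace are placeholders):

oc logs <failed-target-pod> -n <vm-namespace> -c compute > virt-launcher-target.log
oc describe pod <failed-target-pod> -n <vm-namespace> > virt-launcher-target-describe.txt

As for the parse error: that message looks like what a JSON log filter (jq, for instance) prints when it hits a line that isn't JSON, such as a Go panic. If so, a tolerant filter along these lines would skip the offending lines:

oc logs <failed-target-pod> -n <vm-namespace> -c compute \
  | jq -R 'fromjson? | select(.level=="error")'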

Comment 12 Kedar Bidarkar 2023-06-21 12:17:25 UTC
Moving this to 4.13.4, as we still need to gather logs from the virt-launcher pod and investigate.

Comment 13 Boaz 2023-06-26 16:50:45 UTC
I was able to reproduce this issue on a large-scale setup:

==================================================================================================================================================================================================================

NAME                        PHASE       VMI
kubevirt-migrate-vm-62xxg   Failed      rhel92-vm1425
kubevirt-migrate-vm-6qs9w   Failed      rhel92-vm1040
kubevirt-migrate-vm-fw2zm   Failed      rhel92-vm0900

looking at rhel92-vm1040:
==================================================================================================================================================================================================================

virt-launcher-rhel92-vm1040-jm48n   1/1     Running     0          8h
virt-launcher-rhel92-vm1040-q6nr6   0/1     Error       0          5h19m
==================================================================================================================================================================================================================

failed target pod logs:

==================================================================================================================================================================================================================

{"component":"virt-launcher","level":"info","msg":"Collected all requested hook sidecar sockets","pos":"manager.go:86","timestamp":"2023-06-26T11:25:15.793749Z"}
{"component":"virt-launcher","level":"info","msg":"Sorted all collected sidecar sockets per hook point based on their priority and name: map[]","pos":"manager.go:89","timestamp":"2023-06-26T11:25:15.793796Z"}
{"component":"virt-launcher","level":"info","msg":"Connecting to libvirt daemon: qemu+unix:///session?socket=/var/run/libvirt/virtqemud-sock","pos":"libvirt.go:496","timestamp":"2023-06-26T11:25:15.794715Z"}
{"component":"virt-launcher","level":"info","msg":"Connecting to libvirt daemon failed: virError(Code=38, Domain=7, Message='Failed to connect socket to '/var/run/libvirt/virtqemud-sock': No such file or directory')","pos":"libvirt.go:504","timestamp":"2023-06-26T11:25:15.795012Z"}
{"component":"virt-launcher","level":"info","msg":"libvirt version: 9.0.0, package: 10.el9_2 (Red Hat, Inc. \u003chttp://bugzilla.redhat.com/bugzilla\u003e, 2023-03-22-05:53:56, )","subcomponent":"libvirt","thread":"46","timestamp":"2023-06-26T11:25:15.820000Z"}
{"component":"virt-launcher","level":"info","msg":"hostname: rhel92-vm1040","subcomponent":"libvirt","thread":"46","timestamp":"2023-06-26T11:25:15.820000Z"}
{"component":"virt-launcher","level":"error","msg":"internal error: Unable to get session bus connection: Cannot spawn a message bus without a machine-id: Invalid machine ID in /var/lib/dbus/machine-id or /etc/machine-id","pos":"virGDBusGetSessionBus:126","subcomponent":"libvirt","thread":"46","timestamp":"2023-06-26T11:25:15.820000Z"}
{"component":"virt-launcher","level":"error","msg":"internal error: Unable to get system bus connection: Could not connect: No such file or directory","pos":"virGDBusGetSystemBus:99","subcomponent":"libvirt","thread":"46","timestamp":"2023-06-26T11:25:15.820000Z"}
{"component":"virt-launcher","level":"info","msg":"Connected to libvirt daemon","pos":"libvirt.go:512","timestamp":"2023-06-26T11:25:16.296266Z"}
{"component":"virt-launcher","level":"info","msg":"Registered libvirt event notify callback","pos":"client.go:537","timestamp":"2023-06-26T11:25:16.299548Z"}
{"component":"virt-launcher","level":"info","msg":"Marked as ready","pos":"virt-launcher.go:75","timestamp":"2023-06-26T11:25:16.299734Z"}
panic: timed out waiting for domain to be defined

goroutine 1 [running]:
main.waitForDomainUUID(0xc00029a600?, 0xc00029a4e0, 0xc00054e300, {0x1c4f248, 0xc0002fc4e0})
	/remote-source/app/cmd/virt-launcher/virt-launcher.go:246 +0x434
main.main()
	/remote-source/app/cmd/virt-launcher/virt-launcher.go:466 +0x1155
{"component":"virt-launcher-monitor","level":"info","msg":"Reaped pid 12 with status 512","pos":"virt-launcher-monitor.go:124","timestamp":"2023-06-26T11:29:28.305094Z"}
{"component":"virt-launcher-monitor","level":"error","msg":"dirty virt-launcher shutdown: exit-code 2","pos":"virt-launcher-monitor.go:142","timestamp":"2023-06-26T11:29:28.305199Z"}
==================================================================================================================================================================================================================


pod description:
==================================================================================================================================================================================================================

Name:             virt-launcher-rhel92-vm1040-q6nr6
Namespace:        default
Priority:         0
Service Account:  default
Node:             worker084/192.168.216.97
Start Time:       Mon, 26 Jun 2023 11:25:09 +0000
Labels:           kubevirt.io=virt-launcher
                  kubevirt.io/created-by=d032413e-b344-4e78-b7fa-351e07375db6
                  kubevirt.io/migrationJobUID=55362155-f700-419f-819f-0aab0abcba0b
                  kubevirt.io/nodeName=worker074
                  kubevirt.io/vm=rhel92-vm1040-root-disk
                  vm.kubevirt.io/name=rhel92-vm1040
Annotations:      k8s.ovn.org/pod-networks:
                    {"default":{"ip_addresses":["10.131.47.187/23"],"mac_address":"0a:58:0a:83:2f:bb","gateway_ips":["10.131.46.1"],"ip_address":"10.131.47.18...
                  k8s.v1.cni.cncf.io/network-status:
                    [{
                        "name": "ovn-kubernetes",
                        "interface": "eth0",
                        "ips": [
                            "10.131.47.187"
                        ],
                        "mac": "0a:58:0a:83:2f:bb",
                        "default": true,
                        "dns": {}
                    },{
                        "name": "default/br-vlan",
                        "interface": "net1",
                        "mac": "02:4a:b4:00:05:6e",
                        "dns": {}
                    }]
                  k8s.v1.cni.cncf.io/networks: [{"interface":"net1","mac":"02:4a:b4:00:05:6e","name":"br-vlan","namespace":"default"}]
                  kubectl.kubernetes.io/default-container: compute
                  kubevirt.io/domain: rhel92-vm1040
                  kubevirt.io/migrationJobName: kubevirt-migrate-vm-6qs9w
                  kubevirt.io/migrationTransportUnix: true
                  kubevirt.io/nonroot: true
                  post.hook.backup.velero.io/command: ["/usr/bin/virt-freezer", "--unfreeze", "--name", "rhel92-vm1040", "--namespace", "default"]
                  post.hook.backup.velero.io/container: compute
                  pre.hook.backup.velero.io/command: ["/usr/bin/virt-freezer", "--freeze", "--name", "rhel92-vm1040", "--namespace", "default"]
                  pre.hook.backup.velero.io/container: compute
Status:           Failed
IP:               10.131.47.187
IPs:
  IP:           10.131.47.187
Controlled By:  VirtualMachineInstance/rhel92-vm1040
Containers:
  compute:
    Container ID:  cri-o://c280816f7c8a0966a38f35bbdf1db73872b5a1bd823524883853732daca67bc6
    Image:         registry.redhat.io/container-native-virtualization/virt-launcher-rhel9@sha256:0ab71be49e0a6a7eff00264c72da8a9ef0fa3fbca6b2381de0d15540527cce97
    Image ID:      registry.redhat.io/container-native-virtualization/virt-launcher-rhel9@sha256:0ab71be49e0a6a7eff00264c72da8a9ef0fa3fbca6b2381de0d15540527cce97
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/virt-launcher-monitor
      --qemu-timeout
      252s
      --name
      rhel92-vm1040
      --uid
      d032413e-b344-4e78-b7fa-351e07375db6
      --namespace
      default
      --kubevirt-share-dir
      /var/run/kubevirt
      --ephemeral-disk-dir
      /var/run/kubevirt-ephemeral-disks
      --container-disk-dir
      /var/run/kubevirt/container-disks
      --grace-period-seconds
      15
      --hook-sidecars
      0
      --ovmf-path
      /usr/share/OVMF
      --run-as-nonroot
    State:          Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 26 Jun 2023 11:25:15 +0000
      Finished:     Mon, 26 Jun 2023 11:29:28 +0000
    Ready:          False
    Restart Count:  0
    Limits:
      devices.kubevirt.io/kvm:        1
      devices.kubevirt.io/tun:        1
      devices.kubevirt.io/vhost-net:  1
    Requests:
      cpu:                            1
      devices.kubevirt.io/kvm:        1
      devices.kubevirt.io/tun:        1
      devices.kubevirt.io/vhost-net:  1
      ephemeral-storage:              50M
      memory:                         2257661642
    Environment:
      XDG_CACHE_HOME:                /var/run/kubevirt-private
      XDG_CONFIG_HOME:               /var/run/kubevirt-private
      XDG_RUNTIME_DIR:               /var/run
      KUBEVIRT_RESOURCE_NAME_nic-0:
      POD_NAME:                      virt-launcher-rhel92-vm1040-q6nr6 (v1:metadata.name)
    Mounts:
      /var/run/kubevirt from public (rw)
      /var/run/kubevirt-ephemeral-disks from ephemeral-disks (rw)
      /var/run/kubevirt-private from private (rw)
      /var/run/kubevirt/container-disks from container-disks (rw)
      /var/run/kubevirt/hotplug-disks from hotplug-disks (rw)
      /var/run/kubevirt/sockets from sockets (rw)
      /var/run/libvirt from libvirt-runtime (rw)
    Devices:
      /dev/rhel92-vm1040-root-disk from rhel92-vm1040-root-disk
Readiness Gates:
  Type                                   Status
  kubevirt.io/virtual-machine-unpaused   <none>
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  private:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  public:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  sockets:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  virt-bin-share-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  libvirt-runtime:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  ephemeral-disks:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  container-disks:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  rhel92-vm1040-root-disk:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  rhel92-vm1040-root-disk
    ReadOnly:   false
  hotplug-disks:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:   <unset>
QoS Class:       Burstable
Node-Selectors:  cpu-feature.node.kubevirt.io/amd-ssbd=true
                 cpu-feature.node.kubevirt.io/amd-stibp=true
                 cpu-feature.node.kubevirt.io/arch-capabilities=true
                 cpu-feature.node.kubevirt.io/avx512ifma=true
                 cpu-feature.node.kubevirt.io/ds=true
                 cpu-feature.node.kubevirt.io/dtes64=true
                 cpu-feature.node.kubevirt.io/fsrm=true
                 cpu-feature.node.kubevirt.io/hypervisor=true
                 cpu-feature.node.kubevirt.io/ibpb=true
                 cpu-feature.node.kubevirt.io/ibrs=true
                 cpu-feature.node.kubevirt.io/ibrs-all=true
                 cpu-feature.node.kubevirt.io/invtsc=true
                 cpu-feature.node.kubevirt.io/md-clear=true
                 cpu-feature.node.kubevirt.io/mds-no=true
                 cpu-feature.node.kubevirt.io/pdcm=true
                 cpu-feature.node.kubevirt.io/pschange-mc-no=true
                 cpu-feature.node.kubevirt.io/rdctl-no=true
                 cpu-feature.node.kubevirt.io/rdpid=true
                 cpu-feature.node.kubevirt.io/sha-ni=true
                 cpu-feature.node.kubevirt.io/skip-l1dfl-vmentry=true
                 cpu-feature.node.kubevirt.io/ss=true
                 cpu-feature.node.kubevirt.io/stibp=true
                 cpu-feature.node.kubevirt.io/tsc_adjust=true
                 cpu-feature.node.kubevirt.io/tsx-ctrl=true
                 cpu-feature.node.kubevirt.io/vmx=true
                 cpu-feature.node.kubevirt.io/xsaves=true
                 cpu-model-migration.node.kubevirt.io/Icelake-Server=true
                 kubevirt.io/schedulable=true
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>
==================================================================================================================================================================================================================

Comment 14 Boaz 2023-06-29 07:19:57 UTC

*** This bug has been marked as a duplicate of bug 2218435 ***

