CNV cluster with 24+ nodes and 850 virtual machines. Windows 10 VMs appear to fall offline; opening the UI console shows only a blank screen.

On some of the affected guests, the Windows event log shows:

  Reset to device, \Device\RaidPort2, was issued.

The corresponding pods show:

  error killing pod: [failed to "KillContainer" for "compute" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "<sandbox_id>" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container for pod sandbox <sandbox_id>: failed to stop container k8s_compute_virt-launcher-<pod>.virtualmachines_<container_id>: context deadline exceeded"]

This seemed to start after a mass Windows update. The guests were Windows 10 with all updates applied, and then these patches were installed on the Windows VMs:

  KB5005700
  KB5005566

After this, 150 out of 700 VMs went rogue and showed the symptoms described above.

Sample Windows VM YAML:

---
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/vm: <$VM>
  name: <$VM>
  namespace: virtualmachines
spec:
  dataVolumeTemplates:
  - metadata:
      name: <$VM>
    spec:
      pvc:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 100Gi
        storageClassName: ocs-storagecluster-ceph-rbd
        volumeMode: Block
      source:
        blank: {}
    status: {}
  running: false
  template:
    metadata:
      creationTimestamp: null
      labels:
        kubevirt.io/vm: <$VM>
    spec:
      domain:
        clock:
          timer:
            hpet:
              present: false
            hyperv: {}
            pit:
              tickPolicy: delay
            rtc:
              tickPolicy: catchup
          utc: {}
        cpu:
          cores: 1
          model: host-model
          sockets: 2
        devices:
          disks:
          - bootOrder: 2
            disk:
              bus: virtio
              pciAddress: "0000:00:02.0"
            name: os-disk
          interfaces:
          - bootOrder: 1
            bridge: {}
            macAddress: <$MAC>
            name: vnic0
            pciAddress: "0000:00:03.0"
          networkInterfaceMultiqueue: true
        features:
          acpi: {}
          apic: {}
          hyperv:
            evmcs: {}
            frequencies: {}
            ipi: {}
            reenlightenment: {}
            relaxed: {}
            reset: {}
            runtime: {}
            spinlocks:
              spinlocks: 8191
            synic: {}
            synictimer: {}
            tlbflush: {}
            vapic: {}
            vpindex: {}
        firmware:
          uuid: <$UUID>
        resources:
          requests:
            cpu: 1500m
            memory: 11Gi
      networks:
      - multus:
          networkName: <$VLAN_ID>
        name: vnic0
      terminationGracePeriodSeconds: 30
      evictionStrategy: LiveMigrate
      volumes:
      - dataVolume:
          name: <$VOL_NAME>
        name: os-disk
status: {}
Vadim: Does this feel the same as https://github.com/virtio-win/kvm-guest-drivers-windows/issues/623 ?
Jonathan, what virtio drivers does the customer use? Also: "This seemed to happen after a mass windows update", wasn't it fixed after _Rebooting_ the VMs?
Cleaning up this bug. Research on a different case (03148335) revealed that the customer was using a hyperv flag (evmcs) which was affected by a bug, leading to VM crashes. The issue was resolved by removing that hyperv flag from the customer's VM definition; at the same time, bug fixes are staged in RHEL to address the known evmcs issue. This bug is attached to a different case, but we strongly suspect that its root cause (and that of the attached case) is the same as for case 03148335. Closing as deferred, since the root cause will be addressed with rhbz #1940837.
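For reference, the workaround described above (removing the evmcs enlightenment from the VM definition) could be applied to a stopped VM along the lines of the following sketch. The VM name and namespace are placeholders taken from the sample YAML in this bug; this is an illustrative command, not the exact remediation used on the customer case.

```shell
# Hypothetical sketch: drop the evmcs hyperv enlightenment from a VM spec.
# <$VM> and the namespace are placeholders; the JSON-Pointer path matches
# the sample VirtualMachine manifest attached to this bug.
oc patch vm <$VM> -n virtualmachines --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/domain/features/hyperv/evmcs"}]'
```

The VM would then need to be restarted for the change to take effect in the guest's libvirt domain.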
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days