Bug 2010485 - Windows VMs offline after update
Summary: Windows VMs offline after update
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.8.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Igor Bezukh
QA Contact: Kedar Bidarkar
URL:
Whiteboard:
Depends On: 2013976 2028000
Blocks:
Reported: 2021-10-04 19:10 UTC by Jonathan Edwards
Modified: 2025-04-04 13:17 UTC (History)
CC List: 23 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-11 12:30:45 UTC
Target Upstream Version:
Embargoed:
ibezukh: needinfo+
ibezukh: needinfo+


Links
System                 ID         Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker  CNV-14352  0        None      None    None     2023-01-20 10:34:22 UTC

Description Jonathan Edwards 2021-10-04 19:10:33 UTC
CNV cluster with 24+ nodes, 850 virtual machines

Windows 10 VMs seem to fall offline. When viewed through the UI console, the screen is blank.

In some of the Windows guests, the event log shows:
"Reset to device, \Device\RaidPort2, was issued."

The pods also show:
error killing pod: [failed to "KillContainer" for "compute" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "<sandbox_id>" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container for pod sandbox <sandbox_id>: failed to stop container k8s_compute_virt-launcher-<pod>.virtualmachines_<container_id>: context deadline exceeded"]

This seemed to happen after a mass Windows update.

The guests were running Windows 10 with all updates applied.
Then these patches were applied to the Windows VMs:
KB5005700
KB5005566
After this, 150 out of 700 VMs went rogue and showed the symptoms described above.

Sample Windows VM YAML:
---
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/vm: <$VM>
  name: <$VM>
  namespace: virtualmachines
spec:
  dataVolumeTemplates:
  - metadata:
      name: <$VM>
    spec:
      pvc:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 100Gi
        storageClassName: ocs-storagecluster-ceph-rbd
        volumeMode: Block
      source:
        blank: {}
    status: {}
  running: false
  template:
    metadata:
      creationTimestamp: null
      labels:
        kubevirt.io/vm: <$VM>
    spec:
      domain:
        clock:
          timer:
            hpet:
              present: false
            hyperv: {}
            pit:
              tickPolicy: delay
            rtc:
              tickPolicy: catchup
          utc: {}
        cpu:
          cores: 1
          model: host-model
          sockets: 2
        devices:
          disks:
          - bootOrder: 2
            disk:
              bus: virtio
              pciAddress: "0000:00:02.0"
            name: os-disk
          interfaces:
          - bootOrder: 1
            bridge: {}
            macAddress: <$MAC>
            name: vnic0
            pciAddress: "0000:00:03.0"
          networkInterfaceMultiqueue: true
        features:
          acpi: {}
          apic: {}
          hyperv:
            evmcs: {}
            frequencies: {}
            ipi: {}
            reenlightenment: {}
            relaxed: {}
            reset: {}
            runtime: {}
            spinlocks:
              spinlocks: 8191
            synic: {}
            synictimer: {}
            tlbflush: {}
            vapic: {}
            vpindex: {}
        firmware:
          uuid: <$UUID>
        resources:
          requests:
            cpu: 1500m
            memory: 11Gi
      networks:
      - multus:
          networkName: <$VLAN_ID>
        name: vnic0
      terminationGracePeriodSeconds: 30
      evictionStrategy: LiveMigrate
      volumes:
      - dataVolume:
          name: <$VOL_NAME>
        name: os-disk
status: {}

Comment 1 Dr. David Alan Gilbert 2021-10-05 10:40:23 UTC
Vadim: Does this feel the same as https://github.com/virtio-win/kvm-guest-drivers-windows/issues/623 ?

Comment 6 Fabian Deutsch 2021-10-06 12:25:16 UTC
Jonathan, what virtio drivers does the customer use?
Also: "This seemed to happen after a mass windows update", wasn't it fixed after _Rebooting_ the VMs?

Comment 30 Igor Bezukh 2021-10-21 10:45:50 UTC
Hi,

@

Comment 41 Fabian Deutsch 2022-05-11 12:30:45 UTC
Cleaning up this bug.

Research on a different case (03148335) revealed that the customer was using a hyperv flag (evmcs) that was affected by a bug, leading to VM crashes. The issue was resolved by removing the flag from the customer's VM definition. In parallel, bug fixes are staged in RHEL to address the known evmcs issue.
This bug is attached to a different case, but we strongly suspect that its root cause (and that of the attached case) is the same as for case 03148335.

Closing as deferred as the root cause will be addressed with rhbz #1940837.
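
For illustration only, the workaround described above (removing the evmcs flag from the VM definition) could be sketched as follows. This is a minimal sketch and not something taken from either case: the helper name drop_hyperv_flag is hypothetical, and the manifest path is assumed to match the layout of the sample VM YAML in the description. It works on an already parsed manifest (a plain dict) and also returns the equivalent JSON Patch operations.

```python
import copy
import json

# Location of the Hyper-V enlightenments in a KubeVirt VirtualMachine
# manifest, assumed to match the sample VM YAML in the description.
HYPERV_PATH = ("spec", "template", "spec", "domain", "features", "hyperv")


def drop_hyperv_flag(vm, flag="evmcs"):
    """Return (new_vm, json_patch).

    new_vm is a deep copy of the manifest without `flag`; json_patch is the
    list of JSON Patch operations that would make the same change.
    Hypothetical helper, not part of any KubeVirt tooling.
    """
    new_vm = copy.deepcopy(vm)
    node = new_vm
    for key in HYPERV_PATH:
        node = node.get(key) if isinstance(node, dict) else None
        if node is None:
            return new_vm, []  # no hyperv section, nothing to remove
    if flag not in node:
        return new_vm, []  # flag is not set on this VM
    del node[flag]
    path = "/" + "/".join(HYPERV_PATH) + "/" + flag
    return new_vm, [{"op": "remove", "path": path}]


if __name__ == "__main__":
    # Reduced manifest containing only the part that matters here.
    vm = {"spec": {"template": {"spec": {"domain": {"features": {
        "hyperv": {"evmcs": {}, "relaxed": {}, "vapic": {}}}}}}}}
    fixed, patch = drop_hyperv_flag(vm)
    print(json.dumps(patch))
```

The resulting patch list is in the form accepted by a JSON-type patch (for example via oc patch with --type=json). A running VM would presumably need to be restarted before the changed definition takes effect.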

Comment 42 Red Hat Bugzilla 2023-09-18 04:26:42 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

