Bug 2010485

Summary: Windows VMs offline after update
Product: Container Native Virtualization (CNV)
Reporter: Jonathan Edwards <joedward>
Component: Virtualization
Assignee: Igor Bezukh <ibezukh>
Status: CLOSED DEFERRED
QA Contact: Kedar Bidarkar <kbidarka>
Severity: high
Priority: high
Docs Contact:
Version: 4.8.2
CC: ailan, cnv-qe-bugs, coli, dgilbert, dholler, fdeutsch, giridhar.ramaraju, guchen, ibezukh, ipinto, jhopper, kbidarka, lijin, mdean, mkedzier, mprivozn, owasserm, phou, qizhu, sgott, vrozenfe, xiagao, zhguo
Target Milestone: ---
Keywords: TestCannotAutomate, TestOnly
Target Release: 4.11.0
Flags: ibezukh: needinfo+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-11 12:30:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2013976, 2028000
Bug Blocks:

Description Jonathan Edwards 2021-10-04 19:10:33 UTC
CNV cluster with 24+ nodes, 850 virtual machines

Windows 10 VMs seem to go offline. When using the UI console, the screen is blank.

In some of the Windows guest logs, the event log shows: "Reset to device, \Device\RaidPort2, was issued."

The pods are also showing:
error killing pod: [failed to "KillContainer" for "compute" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "<sandbox_id>" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container for pod sandbox <sandbox_id>: failed to stop container k8s_compute_virt-launcher-<pod>.virtualmachines_<container_id>: context deadline exceeded"]

This seemed to happen after a mass Windows update:

The guests were Windows 10 with all updates applied.
Then these patches were applied to the Windows VMs:
KB5005700
KB5005566
After this, 150 out of 700 went rogue and had the symptoms described above.

Sample Windows VM YAML:
---
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/vm: <$VM>
  name: <$VM>
  namespace: virtualmachines
spec:
  dataVolumeTemplates:
  - metadata:
      name: <$VM>
    spec:
      pvc:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 100Gi
        storageClassName: ocs-storagecluster-ceph-rbd
        volumeMode: Block
      source:
        blank: {}
    status: {}
  running: false
  template:
    metadata:
      creationTimestamp: null
      labels:
        kubevirt.io/vm: <$VM>
    spec:
      domain:
        clock:
          timer:
            hpet:
              present: false
            hyperv: {}
            pit:
              tickPolicy: delay
            rtc:
              tickPolicy: catchup
          utc: {}
        cpu:
          cores: 1
          model: host-model
          sockets: 2
        devices:
          disks:
          - bootOrder: 2
            disk:
              bus: virtio
              pciAddress: "0000:00:02.0"
            name: os-disk
          interfaces:
          - bootOrder: 1
            bridge: {}
            macAddress: <$MAC>
            name: vnic0
            pciAddress: "0000:00:03.0"
          networkInterfaceMultiqueue: true
        features:
          acpi: {}
          apic: {}
          hyperv:
            evmcs: {}
            frequencies: {}
            ipi: {}
            reenlightenment: {}
            relaxed: {}
            reset: {}
            runtime: {}
            spinlocks:
              spinlocks: 8191
            synic: {}
            synictimer: {}
            tlbflush: {}
            vapic: {}
            vpindex: {}
        firmware:
          uuid: <$UUID>
        resources:
          requests:
            cpu: 1500m
            memory: 11Gi
      networks:
      - multus:
          networkName: <$VLAN_ID>
        name: vnic0
      terminationGracePeriodSeconds: 30
      evictionStrategy: LiveMigrate
      volumes:
      - dataVolume:
          name: <$VOL_NAME>
        name: os-disk
status: {}
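
A minimal sketch of how a manifest like this is typically instantiated, assuming the <$VM> and other placeholders have been substituted, the manifest is saved as windows-vm.yaml (hypothetical file name), and the virtctl client is available:

# Create the VirtualMachine object; it stays powered off because running: false
kubectl apply -f windows-vm.yaml

# Start the VM; KubeVirt then schedules a virt-launcher pod for it
virtctl start <$VM> -n virtualmachines

# Watch the VirtualMachineInstance come up
kubectl get vmi -n virtualmachines -w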

Comment 1 Dr. David Alan Gilbert 2021-10-05 10:40:23 UTC
Vadim: Does this feel the same as https://github.com/virtio-win/kvm-guest-drivers-windows/issues/623 ?

Comment 6 Fabian Deutsch 2021-10-06 12:25:16 UTC
Jonathan, what virtio drivers does the customer use?
Also: "This seemed to happen after a mass windows update", wasn't it fixed after _Rebooting_ the VMs?

Comment 30 Igor Bezukh 2021-10-21 10:45:50 UTC
Hi,

@

Comment 41 Fabian Deutsch 2022-05-11 12:30:45 UTC
Cleaning up this bug.

Research on a different case (03148335) revealed that the customer was using a hyperv flag (evmcs) that was affected by a bug, leading to VM crashes. The issue was resolved by removing the hyperv flag from the customer's VM definition; at the same time, bug fixes are staged in RHEL to address the known evmcs issue.
This bug is attached to a different case, but we strongly suspect that the root cause for this bug (and the attached case) is the same as for case 03148335.

Closing as DEFERRED, as the root cause will be addressed with rhbz #1940837.
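
For illustration, a sketch of what removing the evmcs flag could look like against a VM definition shaped like the sample above (the exact JSON path depends on the actual manifest):

# Drop the evmcs hyperv enlightenment from the VM spec
kubectl patch vm <$VM> -n virtualmachines --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/domain/features/hyperv/evmcs"}]'

# Restart the VM so the updated domain definition takes effect
virtctl restart <$VM> -n virtualmachines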

Comment 42 Red Hat Bugzilla 2023-09-18 04:26:42 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.