CNV cluster with 24+ nodes and 850 virtual machines. Windows 10 VMs appear to fall offline; opening the UI console shows only a blank screen.

On some of the affected guests, the Windows event log shows:

  Reset to device, \Device\RaidPort2, was issued.

The corresponding pods show:

  error killing pod: [failed to "KillContainer" for "compute" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "<sandbox_id>" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container for pod sandbox <sandbox_id>: failed to stop container k8s_compute_virt-launcher-<pod>.virtualmachines_<container_id>: context deadline exceeded"]

This seemed to start after a mass Windows update. The guests were Windows 10 with all updates applied, and then these patches were installed on the Windows VMs:

  KB5005700
  KB5005566

After this, 150 out of 700 VMs went rogue and showed the symptoms described above.

Sample Windows VM YAML:

---
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/vm: <$VM>
  name: <$VM>
  namespace: virtualmachines
spec:
  dataVolumeTemplates:
  - metadata:
      name: <$VM>
    spec:
      pvc:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 100Gi
        storageClassName: ocs-storagecluster-ceph-rbd
        volumeMode: Block
      source:
        blank: {}
    status: {}
  running: false
  template:
    metadata:
      creationTimestamp: null
      labels:
        kubevirt.io/vm: <$VM>
    spec:
      domain:
        clock:
          timer:
            hpet:
              present: false
            hyperv: {}
            pit:
              tickPolicy: delay
            rtc:
              tickPolicy: catchup
          utc: {}
        cpu:
          cores: 1
          model: host-model
          sockets: 2
        devices:
          disks:
          - bootOrder: 2
            disk:
              bus: virtio
              pciAddress: "0000:00:02.0"
            name: os-disk
          interfaces:
          - bootOrder: 1
            bridge: {}
            macAddress: <$MAC>
            name: vnic0
            pciAddress: "0000:00:03.0"
          networkInterfaceMultiqueue: true
        features:
          acpi: {}
          apic: {}
          hyperv:
            evmcs: {}
            frequencies: {}
            ipi: {}
            reenlightenment: {}
            relaxed: {}
            reset: {}
            runtime: {}
            spinlocks:
              spinlocks: 8191
            synic: {}
            synictimer: {}
            tlbflush: {}
            vapic: {}
            vpindex: {}
        firmware:
          uuid: <$UUID>
        resources:
          requests:
            cpu: 1500m
            memory: 11Gi
      networks:
      - multus:
          networkName: <$VLAN_ID>
        name: vnic0
      terminationGracePeriodSeconds: 30
      evictionStrategy: LiveMigrate
      volumes:
      - dataVolume:
          name: <$VOL_NAME>
        name: os-disk
status: {}
Vadim: Does this feel the same as https://github.com/virtio-win/kvm-guest-drivers-windows/issues/623 ?
Jonathan, what virtio drivers does the customer use? Also: "This seemed to happen after a mass windows update", wasn't it fixed after _Rebooting_ the VMs?
Cleaning up this bug. Research on a different case (03148335) revealed that the customer was using a hyperv flag (evmcs) which was affected by a bug, leading to VM crashes. The issue was resolved by removing that hyperv flag from the customer's VM definition; at the same time, bug fixes are staged in RHEL to address the known evmcs issue. This bug is attached to a different case, but we strongly suspect that its root cause (and that of the attached case) is the same as for case 03148335. Closing as deferred, since the root cause will be addressed with rhbz #1940837.
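For reference, the workaround described above (removing the evmcs enlightenment from the VM definition) could be applied to a stopped VM along the lines of the following sketch. The VM name and namespace are placeholders taken from the sample YAML in this bug; this is an illustrative command, not the exact remediation used on the customer case.

```shell
# Hypothetical sketch: drop the evmcs hyperv enlightenment from a VM spec.
# <$VM> and the namespace are placeholders; the JSON-Pointer path matches
# the sample VirtualMachine manifest attached to this bug.
oc patch vm <$VM> -n virtualmachines --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/domain/features/hyperv/evmcs"}]'
```

The VM would then need to be restarted for the change to take effect in the guest's libvirt domain.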
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days