Bug 1975840
| Summary: | Windows guest hangs after updating and restarting from the guest OS | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Marian Jankular <mjankula> |
| Component: | qemu-kvm | Assignee: | Paolo Bonzini <pbonzini> |
| qemu-kvm sub component: | General | QA Contact: | liunana <nanliu> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | abpatil, ailan, chayang, coli, dgilbert, dholler, fdeutsch, gveitmic, jfindysz, jhopper, jinzhao, josgutie, jsaucier, juzhang, knoel, lijin, lmiksik, lrotenbe, mdean, menli, michal.skrivanek, mkedzier, nanliu, pbonzini, pelauter, qinwang, qizhu, raldaz, rhodain, sfroemer, shipatil, virt-maint, vkuznets, vrozenfe, xfu, xiagao, yama, ycui, zhguo |
| Version: | 8.4 | Keywords: | Triaged, ZStream |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Windows | | |
| Whiteboard: | | | |
| Fixed In Version: | qemu-kvm-6.2.0-11.module+el8.6.0+14707+5aa4b42d | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| Clones: | 2070417 2074737 2074738 (view as bug list) | Environment: | |
| Last Closed: | 2022-05-10 13:18:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | 7.0 |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2070417, 2074737, 2074738 | | |
Description
Marian Jankular
2021-06-24 14:40:13 UTC
QE cannot reproduce this with qemu-kvm-core-4.2.0-34.module+el8.3.0+7976+077be4ec.x86_64; tested with win2016-64 and win2012-64r2 guests. Could you provide your qemu command line and guest name? Thanks. These are my steps:

1. qemu command line:

```
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1' \
    -sandbox on \
    -machine pc \
    -nodefaults \
    -device VGA,bus=pci.0,addr=0x2 \
    -device i6300esb,bus=pci.0,addr=0x3 \
    -watchdog-action reset \
    -device pci-bridge,id=pci_bridge,bus=pci.0,addr=0x4,chassis_nr=1 \
    -m 4096 \
    -object memory-backend-file,size=4G,mem-path=/dev/shm,share=yes,id=mem-mem1 \
    -smp 10,maxcpus=10,cores=5,threads=1,dies=1,sockets=2 \
    -numa node,memdev=mem-mem1,nodeid=0 \
    -cpu 'Cascadelake-Server-noTSX',hv_stimer,hv_synic,hv_vpindex,hv_relaxed,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_frequencies,hv_runtime,hv_tlbflush,hv_reenlightenment,hv_stimer_direct,hv_ipi,+kvm_pv_unhalt \
    -device intel-hda,bus=pci.0,addr=0x5 \
    -device hda-duplex \
    -device ich9-usb-ehci1,id=usb1,addr=0x1d.0x7,multifunction=on,bus=pci.0 \
    -device ich9-usb-uhci1,id=usb1.0,multifunction=on,masterbus=usb1.0,addr=0x1d.0x0,firstport=0,bus=pci.0 \
    -device ich9-usb-uhci2,id=usb1.1,multifunction=on,masterbus=usb1.0,addr=0x1d.0x2,firstport=2,bus=pci.0 \
    -device ich9-usb-uhci3,id=usb1.2,multifunction=on,masterbus=usb1.0,addr=0x1d.0x4,firstport=4,bus=pci.0 \
    -device qemu-xhci,id=usb2,bus=pci.0,addr=0x7 \
    -device usb-tablet,id=usb-tablet1,bus=usb2.0,port=1 \
    -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/win2016-64-virtio.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,write-cache=on,bus=pci.0,addr=0x8 \
    -device virtio-net-pci,mac=9a:41:63:d8:a7:38,id=idX1csiZ,netdev=idtIArqE,bus=pci.0,addr=0x9 \
    -netdev tap,id=idtIArqE,vhost=on \
    -blockdev node-name=file_cd1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/iso/windows/winutils.iso,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_cd1,driver=raw,read-only=on,cache.direct=on,cache.no-flush=off,file=file_cd1 \
    -device ide-cd,id=cd1,drive=drive_cd1,bootindex=1,write-cache=on,bus=ide.0,unit=0 \
    -blockdev node-name=file_virtio,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/iso/windows/virtio-win-prewhql-0.1-202.iso,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_virtio,driver=raw,read-only=on,cache.direct=on,cache.no-flush=off,file=file_virtio \
    -device ide-cd,id=virtio,drive=drive_virtio,bootindex=2,write-cache=on,bus=ide.0,unit=1 \
    -vnc :0 \
    -rtc base=localtime,clock=host,driftfix=slew \
    -boot menu=off,order=cdn,once=c,strict=off \
    -no-hpet \
    -enable-kvm \
    -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0xa \
    -monitor stdio \
    -vnc :1
```

2. Run "Windows update" inside the guest.

3. Restart from inside the guest.

Handing this to Meirav to assign, since it has been with virt-maint for longer than expected for untriaged cases.

@Menli, as this bz may be related to Hyper-V, could you also have a look at it from the QE side? Thanks. Xiaoling

Bulk update: move RHEL-AV bugs to RHEL 8.

Hi Menli,

Could you also check the event log according to https://bugzilla.redhat.com/show_bug.cgi?id=2010485#c21 if you hit the system hang?

Thanks,
Xiaoling

(In reply to xiagao from comment #59)
> Hi Menli,
> Could you also check the event log according to
> https://bugzilla.redhat.com/show_bug.cgi?id=2010485#c21 if you hit the
> system hang?
>
> Thanks,
> Xiaoling

I checked the previous image and can also see Event ID 129 there.
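For reference, the Event ID 129 entries mentioned above (typically a storport device-reset warning) can be listed from inside the guest. A minimal sketch, assuming the stock wevtutil tool in an elevated Windows command prompt; the entry count of 20 is arbitrary:

```
rem Query the System log for the 20 most recent Event ID 129 entries,
rem newest first, as plain text (run inside the Windows guest).
wevtutil qe System /q:"*[System[(EventID=129)]]" /f:text /c:20 /rd:true
```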
Roman, hi,

Based on the above comments, could you check the Windows event log on the guest for Event ID 129 at the time the issue happened? If yes, it may be the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=2010485.

Thanks,
Xiaoling

I'm not sure if tlbflush is used by the customer. And the problem is reproducibility: currently it's roughly 0.4% (~1 out of 232). But thanks for bringing it up. @jhopper, do you happen to know if they are using tlbflush?

Also: https://bugzilla.redhat.com/show_bug.cgi?id=1868572#c142 says removing Hyper-V altogether does not fix the issue either. Thoughts?

(In reply to Fabian Deutsch from comment #66)
> I'm not sure if tlbflush is used by the customer.
>
> And the problem is reproducibility: currently it's roughly 0.4% (~1 out of 232).
>
> But thanks for bringing it up. @jhopper, do you happen to know if they are
> using tlbflush?

The CNV default Windows templates include this feature:

```
tlbflush: {}
```

which translates to the libvirt XML:

```
<hyperv>
  <tlbflush state='on'/>
</hyperv>
```

Yeah, I also looked it up in the templates. Vitaly, would you generally recommend not using tlbflush? If so, in CNV we could change the default Windows templates to no longer include this flag. Or are we saying we will have a fix for the known issues soon?

@dholler FYI

(In reply to Fabian Deutsch from comment #70)
> Yeah, I also looked it up in the templates.
> Vitaly, would you generally recommend not using tlbflush?
>
> If so, then in CNV we could change the default Windows templates to not
> include this flag anymore.
> Or are we saying we will have a fix for the known issues soon?

No, generally hv-tlbflush is a good one; it should improve performance, especially in CPU-overcommitted environments (if the target vCPU is not running, we can postpone flushing it instead of waiting until it comes back online). It's just that I've found a bug in its implementation which in theory can result in sporadic crashes and maybe hangs. I hope it's also the root cause of BZ#1868572.
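For reference, whether a given VM is actually running with this enlightenment can be confirmed on the host. A minimal sketch, assuming a libvirt-managed guest; the domain name "win-guest" is a placeholder:

```
# List the Hyper-V enlightenments libvirt configured for the domain
# ("win-guest" is a placeholder name).
virsh dumpxml win-guest | grep -A 12 '<hyperv'

# Or grep the live qemu process for the flag (spelled hv_tlbflush or
# hv-tlbflush depending on how the process was launched):
ps -ef | grep -oE 'hv[-_]tlbflush' | sort -u
```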
Okay, then we'll stick with tlbflush for now; however, know that there are some improvements in the pipe.

> In 4.4.9, with the rebase to RHEL 8.5, we got a new major version, QEMU 6.0.

Just a guess, but was there perhaps something missed in major version 6 of QEMU that had been fixed in version 5.2?

No, there are no minor/major versions; the first number of the QEMU version is simply bumped every year. Are we 100% sure that 4.4.7 works? If so, would it be possible to try either qemu 6.0 or a -348 kernel on a 4.4.{7,8} image?
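A minimal sketch for recording the relevant component builds on each image under test, using standard RHEL tooling (package and binary names as in the reproducer above):

```
# Record the qemu and kernel builds on the host under test, e.g. to
# confirm whether an image carries qemu 6.0 or a -348 kernel.
rpm -q qemu-kvm qemu-kvm-core
/usr/libexec/qemu-kvm --version
uname -r
```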
*** Bug 2061442 has been marked as a duplicate of this bug. ***

Requesting blocker to give QE more time for testing.

> 1. What is the scope of harm if this BZ is not resolved in this release?
> Reviewers want to know which RHEL features or customers are affected and if
> it will impact any Layered Product or Hardware partner plans.

This impacts all virtualization layered products (the customer is using RHV, but CNV and OpenStack are affected too).

> 2. What are the risks associated with resolving this BZ? Reviewers want to
> know the scope of retesting, potential regressions.

The fix covers a specific path (reboot) which can be tested with automated tests. The fix also makes the VM behave in a way that is similar to bare metal, so the probability of regressions is considered low.

> 3. Provide any other details that meet blocker criteria or should be weighed
> in making a decision (other releases affected, upstream status, business
> impacts, etc.).

With respect to business impact, this is an important customer escalation.

Based on comment 151, setting this to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: virt:rhel and virt-devel:rhel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1759
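For readers checking whether a host already carries the fix, a minimal sketch comparing the installed build against the Fixed In Version field above:

```
# The fix ships in qemu-kvm-6.2.0-11.module+el8.6.0+14707+5aa4b42d or later;
# compare the installed build against it.
rpm -q qemu-kvm
# Update from the virt:rhel module stream if the installed build is older:
dnf update qemu-kvm
```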