Bug 1975840 - Windows guest hangs after updating and restarting from the guest OS
Summary: Windows guest hangs after updating and restarting from the guest OS
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: qemu-kvm
Version: 8.4
Hardware: Unspecified
OS: Windows
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Paolo Bonzini
QA Contact: liunana
URL:
Whiteboard:
Duplicates: 2061442 (view as bug list)
Depends On:
Blocks: 2070417 2074737 2074738
 
Reported: 2021-06-24 14:40 UTC by Marian Jankular
Modified: 2024-12-20 20:19 UTC (History)
CC List: 39 users

Fixed In Version: qemu-kvm-6.2.0-11.module+el8.6.0+14707+5aa4b42d
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2070417 2074737 2074738 (view as bug list)
Environment:
Last Closed: 2022-05-10 13:18:42 UTC
Type: Bug
Target Upstream Version: 7.0
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Gitlab redhat/rhel/src/qemu-kvm qemu-kvm merge_requests 137 0 None None None 2022-03-29 14:12:06 UTC
Red Hat Knowledge Base (Solution) 6836351 0 None None None 2022-03-23 04:41:17 UTC
Red Hat Product Errata RHSA-2022:1759 0 None None None 2022-05-10 13:19:52 UTC

Description Marian Jankular 2021-06-24 14:40:13 UTC
Description of problem:
Windows guest hangs after updating and restarting from the guest OS

Version-Release number of selected component (if applicable):
qemu-kvm-5.1.0-21.module+el8.3.1+10464+8ad18d1a.x86_64
redhat-release-virtualization-host-4.4.5-4.el8ev.x86_64

How reproducible:
very often

Steps to Reproduce:
1. Apply Windows patches
2. Reboot the OS from within the Windows guest
3.

Actual results:
Windows guest gets stuck booting

Expected results:
Windows guest boots normally

Additional info:
Powering off (stopping the qemu process) and powering up again works around the issue.
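For reference, a minimal sketch of that workaround on a libvirt-managed host (the domain name below is illustrative, not from this report):

# Hard power-off: kills the QEMU process, equivalent to pulling the power
virsh destroy win2019-guest
# Cold boot the guest again; after this the guest comes up normally
virsh start win2019-guest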

Comment 5 FuXiangChun 2021-07-01 03:00:49 UTC
QE cannot reproduce it with qemu-kvm-core-4.2.0-34.module+el8.3.0+7976+077be4ec.x86_64; tested win2016-64 and win2012-64r2 guests. Can you provide me with the qemu CLI and guest name? Thanks.

These are my steps.
1. qemu cli:

/usr/libexec/qemu-kvm \
-name 'avocado-vt-vm1'  \
-sandbox on  \
-machine pc  \
-nodefaults \
-device VGA,bus=pci.0,addr=0x2 \
-device i6300esb,bus=pci.0,addr=0x3 \
-watchdog-action reset \
-device pci-bridge,id=pci_bridge,bus=pci.0,addr=0x4,chassis_nr=1 \
-m 4096 \
-object memory-backend-file,size=4G,mem-path=/dev/shm,share=yes,id=mem-mem1  \
-smp 10,maxcpus=10,cores=5,threads=1,dies=1,sockets=2  \
-numa node,memdev=mem-mem1,nodeid=0  \
-cpu 'Cascadelake-Server-noTSX',hv_stimer,hv_synic,hv_vpindex,hv_relaxed,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_frequencies,hv_runtime,hv_tlbflush,hv_reenlightenment,hv_stimer_direct,hv_ipi,+kvm_pv_unhalt \
-device intel-hda,bus=pci.0,addr=0x5 \
-device hda-duplex \
-device ich9-usb-ehci1,id=usb1,addr=0x1d.0x7,multifunction=on,bus=pci.0 \
-device ich9-usb-uhci1,id=usb1.0,multifunction=on,masterbus=usb1.0,addr=0x1d.0x0,firstport=0,bus=pci.0 \
-device ich9-usb-uhci2,id=usb1.1,multifunction=on,masterbus=usb1.0,addr=0x1d.0x2,firstport=2,bus=pci.0 \
-device ich9-usb-uhci3,id=usb1.2,multifunction=on,masterbus=usb1.0,addr=0x1d.0x4,firstport=4,bus=pci.0 \
-device qemu-xhci,id=usb2,bus=pci.0,addr=0x7 \
-device usb-tablet,id=usb-tablet1,bus=usb2.0,port=1 \
-blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/win2016-64-virtio.qcow2,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
-device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,write-cache=on,bus=pci.0,addr=0x8 \
-device virtio-net-pci,mac=9a:41:63:d8:a7:38,id=idX1csiZ,netdev=idtIArqE,bus=pci.0,addr=0x9  \
-netdev tap,id=idtIArqE,vhost=on \
-blockdev node-name=file_cd1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/iso/windows/winutils.iso,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_cd1,driver=raw,read-only=on,cache.direct=on,cache.no-flush=off,file=file_cd1 \
-device ide-cd,id=cd1,drive=drive_cd1,bootindex=1,write-cache=on,bus=ide.0,unit=0 \
-blockdev node-name=file_virtio,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/iso/windows/virtio-win-prewhql-0.1-202.iso,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_virtio,driver=raw,read-only=on,cache.direct=on,cache.no-flush=off,file=file_virtio \
-device ide-cd,id=virtio,drive=drive_virtio,bootindex=2,write-cache=on,bus=ide.0,unit=1  \
-vnc :0  \
-rtc base=localtime,clock=host,driftfix=slew  \
-boot menu=off,order=cdn,once=c,strict=off  \
-no-hpet \
-enable-kvm \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0xa \
-monitor stdio \
-vnc :1

2. "Windows update" inside guest.

3. "restart" inside guest.

Comment 11 John Ferlan 2021-07-07 18:14:35 UTC
Assigned to Meirav to assign further, since it's been with virt-maint for longer than expected for untriaged cases.

Comment 33 xiagao 2021-08-18 01:17:00 UTC
@Menli, as this bz may be related to hyper-v, could you also have a look at it from the QE side?

Thanks.
Xiaoling

Comment 39 John Ferlan 2021-09-08 19:09:19 UTC
Bulk update: Move RHEL-AV bugs to RHEL8

Comment 59 xiagao 2021-10-19 02:01:50 UTC
Hi Menli,
Could you also check the event log according to https://bugzilla.redhat.com/show_bug.cgi?id=2010485#c21 if you hit the system hang?

Thanks,
Xiaoling

Comment 60 menli@redhat.com 2021-10-20 02:44:18 UTC
(In reply to xiagao from comment #59)
> Hi Menli,
> Could you also check event log according to
> https://bugzilla.redhat.com/show_bug.cgi?id=2010485#c21 if you hit system
> hang?
> 
> Thanks,
> Xiaoling

I checked the previous image and can also see Event ID 129.

Comment 61 xiagao 2021-10-26 01:45:20 UTC
Roman hi,
Based on the above comments, could you check the Windows event log on the guest to see whether 'Event ID 129' was logged at the time the issue happened?
If yes, it may be the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=2010485

Thanks
Xiaoling

Comment 66 Fabian Deutsch 2021-12-09 12:53:59 UTC
I'm not sure if tlbflush is used by the customer.

And the problem is reproducibility: currently it's roughly 0.4% (~1 out of 232)

But thanks for bringing it up. @jhopper do you happen to know if they are using tlbflush?

Comment 67 Fabian Deutsch 2021-12-09 12:56:28 UTC
Also: https://bugzilla.redhat.com/show_bug.cgi?id=1868572#c142 - says removing hyperv altogether also does not fix the issue. Thoughts?

Comment 69 Jenifer Abrams 2021-12-09 16:22:28 UTC
(In reply to Fabian Deutsch from comment #66)
> I'm not sure if tlbflsh is used by the customer.
> 
> And the proble is reproducibility: Currently it's roughly 0,4% (~1 out of
> 232)
> 
> But thanks for bringing it up. @jhopper do you happen to know if
> they are using tlbflush?

CNV default Win templates include this feature:
tlbflush: {}

which translates to libvirt xml:
    <hyperv>
      <tlbflush state='on'/>
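For anyone who needs to confirm whether a given guest actually has the enlightenment enabled (as asked in comment 66), a small sketch; the domain name is illustrative and this assumes a libvirt-managed host:

# Show the <hyperv> feature block of the domain definition
virsh dumpxml win2019-guest | grep -A 15 '<hyperv'
# Or check whether the running QEMU process was started with the tlbflush flag
ps -ef | grep -o 'hv[-_]tlbflush' | sort -u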

Comment 70 Fabian Deutsch 2021-12-09 21:10:32 UTC
Yeah, I also looked it up in the templates.
Vitaly, would you generally recommend to not use tlbflush?

If so, then in CNV we could change the default Windows templates to not include this flag anymore.
Or are we saying we will have a fix for the known issues soon?

@dholler FYI

Comment 71 Vitaly Kuznetsov 2021-12-09 21:29:03 UTC
(In reply to Fabian Deutsch from comment #70)
> Yeah, I also looked it up in the templates.
> Vitaly, would you generally recommend to not use tlbflush?
> 
> if so, then in CNV we could change the default Windows templates to not
> include this flag anymore.
> Or are we saying we will have a fix for the known issues soon?

No, generally hv-tlbflush is a good one; it should improve performance,
especially in CPU-overcommitted environments (in case the target vCPU is not
running we can postpone flushing it instead of waiting until it comes
back online). It's just that I've found a bug in its implementation which
in theory can result in sporadic crashes and maybe hangs. Hope it's also
the root cause of BZ#1868572.
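For anyone who nevertheless wants to try a guest without the enlightenment while the fix is pending, a minimal sketch (domain name illustrative; not a recommendation, per the comment above):

# Edit the domain XML and turn the hv-tlbflush enlightenment off
virsh edit win2019-guest
#   under <features><hyperv>, change <tlbflush state='on'/> to <tlbflush state='off'/>
# Then fully power-cycle the guest (shut down and start) for the change to take effect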

Comment 72 Fabian Deutsch 2021-12-10 12:34:42 UTC
Okay, then we'll stick with tlbflush for now; however, note that there are some improvements in the pipeline.

Comment 112 Paolo Bonzini 2022-03-02 09:42:45 UTC
> In 4.4.9, with rebase to RHEL-8.5 we got new major version of QEMU-6.0. Just a guess, but probably something which was missed to include in major version 6 of QEMU, which was fixed in version 5.2?

No, there are no minor/major versions. The first number of the version is simply bumped every year.  Are we 100% sure that 4.4.7 works?  If so, would it be possible to try either qemu 6.0 or a -348 kernel on a 4.4.{7,8} image?

Comment 144 Paolo Bonzini 2022-04-01 07:17:58 UTC
*** Bug 2061442 has been marked as a duplicate of this bug. ***

Comment 145 Paolo Bonzini 2022-04-02 15:33:06 UTC
Requesting blocker to give QE more time for testing.

Comment 148 Paolo Bonzini 2022-04-04 10:10:24 UTC
> 1. What is the scope of harm if this BZ is not resolved in this release?  Reviewers want to know which RHEL
> features or customers are affected and if it will impact any Layered Product or Hardware partner plans.

This impacts all virtualization layered products (customer is using RHV, but CNV and OpenStack are
affected too).

> 2. What are the risks associated with resolving this BZ?  Reviewers want to know the scope of retesting, potential regressions

The fix covers a specific path (reboot) which can be tested with automated tests.
The fix also makes the VM behave in a way that is similar to bare metal, so the
probability of regressions is considered low.

> 3. Provide any other details that meet blocker criteria or should be weighed in making a decision (Other releases affected, upstream status, business impacts, etc).

With respect to business impact, this is an important customer escalation.

Comment 155 Yanhui Ma 2022-04-07 01:54:06 UTC
Based on comment 151, setting it to verified.

Comment 161 errata-xmlrpc 2022-05-10 13:18:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: virt:rhel and virt-devel:rhel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1759
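For anyone verifying a host against this advisory, a quick check that the fixed build (or newer) is installed; the NVR comes from the "Fixed In Version" field above:

# Compare the installed qemu-kvm build against the fixed version
rpm -q qemu-kvm
# expected: qemu-kvm-6.2.0-11.module+el8.6.0+14707+5aa4b42d or later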

