Bug 1975840
| Summary: | Windows guest hangs after updating and restarting from the guest OS | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Marian Jankular <mjankula> |
| Component: | qemu-kvm | Assignee: | Paolo Bonzini <pbonzini> |
| qemu-kvm sub component: | General | QA Contact: | liunana <nanliu> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | abpatil, ailan, chayang, coli, dgilbert, dholler, fdeutsch, gveitmic, jfindysz, jhopper, jinzhao, josgutie, jsaucier, juzhang, knoel, lijin, lmiksik, lrotenbe, mdean, menli, michal.skrivanek, mkedzier, nanliu, pbonzini, pelauter, qinwang, qizhu, raldaz, rhodain, sfroemer, shipatil, virt-maint, vkuznets, vrozenfe, xfu, xiagao, yama, ycui, zhguo |
| Version: | 8.4 | Keywords: | Triaged, ZStream |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Windows | | |
| Whiteboard: | | | |
| Fixed In Version: | qemu-kvm-6.2.0-11.module+el8.6.0+14707+5aa4b42d | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| Clones: | 2070417 2074737 2074738 (view as bug list) | Environment: | |
| Last Closed: | 2022-05-10 13:18:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | 7.0 |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2070417, 2074737, 2074738 | | |
Description
Marian Jankular
2021-06-24 14:40:13 UTC
QE cannot reproduce this with qemu-kvm-core-4.2.0-34.module+el8.3.0+7976+077be4ec.x86_64; tested with win2016-64 and win2012-64r2 guests. Could you provide your qemu command line and guest name? Thanks. These are my steps:

1. qemu command line:

```
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1' \
    -sandbox on \
    -machine pc \
    -nodefaults \
    -device VGA,bus=pci.0,addr=0x2 \
    -device i6300esb,bus=pci.0,addr=0x3 \
    -watchdog-action reset \
    -device pci-bridge,id=pci_bridge,bus=pci.0,addr=0x4,chassis_nr=1 \
    -m 4096 \
    -object memory-backend-file,size=4G,mem-path=/dev/shm,share=yes,id=mem-mem1 \
    -smp 10,maxcpus=10,cores=5,threads=1,dies=1,sockets=2 \
    -numa node,memdev=mem-mem1,nodeid=0 \
    -cpu 'Cascadelake-Server-noTSX',hv_stimer,hv_synic,hv_vpindex,hv_relaxed,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_frequencies,hv_runtime,hv_tlbflush,hv_reenlightenment,hv_stimer_direct,hv_ipi,+kvm_pv_unhalt \
    -device intel-hda,bus=pci.0,addr=0x5 \
    -device hda-duplex \
    -device ich9-usb-ehci1,id=usb1,addr=0x1d.0x7,multifunction=on,bus=pci.0 \
    -device ich9-usb-uhci1,id=usb1.0,multifunction=on,masterbus=usb1.0,addr=0x1d.0x0,firstport=0,bus=pci.0 \
    -device ich9-usb-uhci2,id=usb1.1,multifunction=on,masterbus=usb1.0,addr=0x1d.0x2,firstport=2,bus=pci.0 \
    -device ich9-usb-uhci3,id=usb1.2,multifunction=on,masterbus=usb1.0,addr=0x1d.0x4,firstport=4,bus=pci.0 \
    -device qemu-xhci,id=usb2,bus=pci.0,addr=0x7 \
    -device usb-tablet,id=usb-tablet1,bus=usb2.0,port=1 \
    -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/win2016-64-virtio.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,write-cache=on,bus=pci.0,addr=0x8 \
    -device virtio-net-pci,mac=9a:41:63:d8:a7:38,id=idX1csiZ,netdev=idtIArqE,bus=pci.0,addr=0x9 \
    -netdev tap,id=idtIArqE,vhost=on \
    -blockdev node-name=file_cd1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/iso/windows/winutils.iso,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_cd1,driver=raw,read-only=on,cache.direct=on,cache.no-flush=off,file=file_cd1 \
    -device ide-cd,id=cd1,drive=drive_cd1,bootindex=1,write-cache=on,bus=ide.0,unit=0 \
    -blockdev node-name=file_virtio,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/iso/windows/virtio-win-prewhql-0.1-202.iso,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_virtio,driver=raw,read-only=on,cache.direct=on,cache.no-flush=off,file=file_virtio \
    -device ide-cd,id=virtio,drive=drive_virtio,bootindex=2,write-cache=on,bus=ide.0,unit=1 \
    -vnc :0 \
    -rtc base=localtime,clock=host,driftfix=slew \
    -boot menu=off,order=cdn,once=c,strict=off \
    -no-hpet \
    -enable-kvm \
    -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0xa \
    -monitor stdio \
    -vnc :1
```

2. Run "Windows update" inside the guest.

3. Restart from inside the guest.

Handing this to Meirav to assign, since it has been with virt-maint for longer than expected for untriaged cases.

@Menli, as this bz may be related to Hyper-V, could you also have a look at it from the QE side? Thanks. Xiaoling

Bulk update: move RHEL-AV bugs to RHEL 8.

Hi Menli,

Could you also check the event log according to https://bugzilla.redhat.com/show_bug.cgi?id=2010485#c21 if you hit the system hang?

Thanks,
Xiaoling

(In reply to xiagao from comment #59)
> Hi Menli,
> Could you also check the event log according to
> https://bugzilla.redhat.com/show_bug.cgi?id=2010485#c21 if you hit the
> system hang?
>
> Thanks,
> Xiaoling

I checked the previous image and can also see Event ID 129 there.
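For reference, the Event ID 129 entries mentioned above (typically a storport device-reset warning) can be listed from inside the guest. A minimal sketch, assuming the stock wevtutil tool in an elevated Windows command prompt; the entry count of 20 is arbitrary:

```
rem Query the System log for the 20 most recent Event ID 129 entries,
rem newest first, as plain text (run inside the Windows guest).
wevtutil qe System /q:"*[System[(EventID=129)]]" /f:text /c:20 /rd:true
```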
Roman, hi,

Based on the above comments, could you check the Windows event log on the guest for Event ID 129 at the time the issue happened? If yes, it may be the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=2010485.

Thanks,
Xiaoling

I'm not sure if tlbflush is used by the customer. And the problem is reproducibility: currently it's roughly 0.4% (~1 out of 232). But thanks for bringing it up. @jhopper, do you happen to know if they are using tlbflush?

Also: https://bugzilla.redhat.com/show_bug.cgi?id=1868572#c142 says removing Hyper-V altogether does not fix the issue either. Thoughts?

(In reply to Fabian Deutsch from comment #66)
> I'm not sure if tlbflush is used by the customer.
>
> And the problem is reproducibility: currently it's roughly 0.4% (~1 out of 232).
>
> But thanks for bringing it up. @jhopper, do you happen to know if they are
> using tlbflush?

The CNV default Windows templates include this feature:

```
tlbflush: {}
```

which translates to the libvirt XML:

```
<hyperv>
  <tlbflush state='on'/>
</hyperv>
```

Yeah, I also looked it up in the templates. Vitaly, would you generally recommend not using tlbflush? If so, in CNV we could change the default Windows templates to no longer include this flag. Or are we saying we will have a fix for the known issues soon?

@dholler FYI

(In reply to Fabian Deutsch from comment #70)
> Yeah, I also looked it up in the templates.
> Vitaly, would you generally recommend not using tlbflush?
>
> If so, then in CNV we could change the default Windows templates to not
> include this flag anymore.
> Or are we saying we will have a fix for the known issues soon?

No, generally hv-tlbflush is a good one; it should improve performance, especially in CPU-overcommitted environments (if the target vCPU is not running, we can postpone flushing it instead of waiting until it comes back online). It's just that I've found a bug in its implementation which in theory can result in sporadic crashes and maybe hangs. I hope it's also the root cause of BZ#1868572.
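For reference, whether a given VM is actually running with this enlightenment can be confirmed on the host. A minimal sketch, assuming a libvirt-managed guest; the domain name "win-guest" is a placeholder:

```
# List the Hyper-V enlightenments libvirt configured for the domain
# ("win-guest" is a placeholder name).
virsh dumpxml win-guest | grep -A 12 '<hyperv'

# Or grep the live qemu process for the flag (spelled hv_tlbflush or
# hv-tlbflush depending on how the process was launched):
ps -ef | grep -oE 'hv[-_]tlbflush' | sort -u
```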
Okay, then we'll stick with tlbflush for now; however, know that there are some improvements in the pipe.

> In 4.4.9, with the rebase to RHEL 8.5, we got a new major version, QEMU 6.0.

Just a guess, but was there perhaps something missed in major version 6 of QEMU that had been fixed in version 5.2?

No, there are no minor/major versions; the first number of the QEMU version is simply bumped every year. Are we 100% sure that 4.4.7 works? If so, would it be possible to try either qemu 6.0 or a -348 kernel on a 4.4.{7,8} image?
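A minimal sketch for recording the relevant component builds on each image under test, using standard RHEL tooling (package and binary names as in the reproducer above):

```
# Record the qemu and kernel builds on the host under test, e.g. to
# confirm whether an image carries qemu 6.0 or a -348 kernel.
rpm -q qemu-kvm qemu-kvm-core
/usr/libexec/qemu-kvm --version
uname -r
```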
*** Bug 2061442 has been marked as a duplicate of this bug. ***

Requesting blocker to give QE more time for testing.

> 1. What is the scope of harm if this BZ is not resolved in this release?
> Reviewers want to know which RHEL features or customers are affected and if
> it will impact any Layered Product or Hardware partner plans.

This impacts all virtualization layered products (the customer is using RHV, but CNV and OpenStack are affected too).

> 2. What are the risks associated with resolving this BZ? Reviewers want to
> know the scope of retesting, potential regressions.

The fix covers a specific path (reboot) which can be tested with automated tests. The fix also makes the VM behave in a way that is similar to bare metal, so the probability of regressions is considered low.

> 3. Provide any other details that meet blocker criteria or should be weighed
> in making a decision (other releases affected, upstream status, business
> impacts, etc.).

With respect to business impact, this is an important customer escalation.

Based on comment 151, setting this to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: virt:rhel and virt-devel:rhel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1759
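For readers checking whether a host already carries the fix, a minimal sketch comparing the installed build against the Fixed In Version field above:

```
# The fix ships in qemu-kvm-6.2.0-11.module+el8.6.0+14707+5aa4b42d or later;
# compare the installed build against it.
rpm -q qemu-kvm
# Update from the virt:rhel module stream if the installed build is older:
dnf update qemu-kvm
```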