Bug 1056982
| Summary: | win2008.x86_64 guest BSOD on AMD (error code: 0x19, BAD_POOL_HEADER) | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | CongLi <coli> |
| Component: | kernel | Assignee: | Radim Krčmář <rkrcmar> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 7.0 | CC: | acathrow, bcao, coli, drjones, hhuang, juzhang, knoel, marcel, michen, mst, pbonzini, rkrcmar, shuang, svenkatr, virt-bugs, virt-maint, vrozenfe, xhan, xwei, yvugenfi |
| Target Milestone: | rc | Keywords: | Regression |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | kernel-3.10.0-111.el7 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-06-13 12:06:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1069309 | | |
| Attachments: | | | |

Description
CongLi
2014-01-23 09:53:06 UTC
1. qemu-img check /home/staf-kvm-devel/autotest-devel/client/tests/virt/shared/data/images/win2008-64-virtio.qcow2
No errors were found on the image.
166193/491520 = 33.81% allocated, 4.31% fragmented, 0.00% compressed clusters
Image end offset: 10893787136
2. qemu-img info /home/staf-kvm-devel/autotest-devel/client/tests/virt/shared/data/images/win2008-64-virtio.qcow2
image: /home/staf-kvm-devel/autotest-devel/client/tests/virt/shared/data/images/win2008-64-virtio.qcow2
file format: qcow2
virtual size: 30G (32212254720 bytes)
disk size: 10G
cluster_size: 65536
Format specific information:
compat: 0.10
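
The cluster counts printed by qemu-img check can be cross-checked against the sizes printed by qemu-img info. The following is a minimal sketch using only the numbers shown above; it assumes (the output does not say so explicitly) that the allocated/total counts are expressed in units of cluster_size. The helper name `cluster_bytes` is made up for illustration.

```python
# Figures copied from the qemu-img check / qemu-img info output above.
CLUSTER_SIZE = 65536              # cluster_size
ALLOCATED_CLUSTERS = 166193       # numerator of "166193/491520"
TOTAL_CLUSTERS = 491520           # denominator of "166193/491520"
VIRTUAL_SIZE_BYTES = 32212254720  # "virtual size: 30G (32212254720 bytes)"

GIB = 1024 ** 3


def cluster_bytes(clusters, cluster_size=CLUSTER_SIZE):
    """Convert a cluster count to bytes (assumes counts are in cluster_size units)."""
    return clusters * cluster_size


total_bytes = cluster_bytes(TOTAL_CLUSTERS)
allocated_bytes = cluster_bytes(ALLOCATED_CLUSTERS)
allocated_pct = 100 * ALLOCATED_CLUSTERS / TOTAL_CLUSTERS

# Compare the total cluster count with the reported virtual size.
print("total matches virtual size:", total_bytes == VIRTUAL_SIZE_BYTES)
print(f"total:     {total_bytes / GIB:.2f} GiB")
print(f"allocated: {allocated_bytes / GIB:.2f} GiB ({allocated_pct:.2f}%)")
```

Under that assumption, 491520 clusters correspond to the 30G virtual size, and the 166193 allocated clusters come to roughly 10 GiB, which is in line with the reported disk size of 10G and the image end offset.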
Hi,

It is not clear what the disk size is. It says 10G, but also "166193/491520 = 33.81% allocated". 491,520 does not match 10G. And if the memory size is 30G, then it is recommended to have a larger disk. Also, q35 is in tech-preview for 7.0.

We need to isolate which of those (if any) is the issue.

Thanks.

Hi Cong,

Could you have a look at comment 4 and update the testing result?

Best Regards,
Junyi

Created attachment 861292 [details]
QEMU CML
Since the BSOD changes from crash to crash, I have a feeling that Windows goes off into the weeds, and there would be nothing to gain from wading through a bunch of assembly code trying to find out what the guest was doing at the time. We'll need to add tracing to the host side in order to find the problem. To do that, we need to reduce the amount of code we trace. To get that reduction we need a very minimal guest config that still reproduces the issue.

A while back I asked that we create a bare minimal qemu command line that we can use as a basis for config option isolation. This command line could possibly even configure a guest that has no disk and no nic, just a cdrom to boot the installer off (which of course would eventually complain about the lack of a disk, but for early boot problems we don't care). Do we have that super minimal command line constructed yet? If so, does this problem reproduce with such a minimal command line? If the guest can boot with a minimal command line, can we add each option, one at a time, until we find the device that introduces this bug? Once we know what that is, we can start tracing qemu and kvm with respect to the code paths that device uses.

(In reply to Andrew Jones from comment #10)

Hi Andrew,

From our mail discussion, here are the questions that need to be tested:

1. Was another exactly identical host (Opteron_G?), but not the same exact machine, ever tried?
2. Are all 5-6 bugs on the same exact host? do non win2008 guests work?
3. The CML is too large and complex. We need to reduce the command line down to a minimal reproducer. See my comment https://bugzilla.redhat.com/show_bug.cgi?id=1056982#c10
4. is this a regression? or, conversely, does the problem still reproduce using the latest kernel and qemu? (probably not, because rhel7 isn't far off the latest, but it's still worth trying if the issue isn't a regression).

I will test the above questions, and if there are other questions to test, feel free to ask me.

Thanks,
Cong

(In reply to CongLi from comment #11)

To make sure the image is good, I booted the win2008.x86_64 guest on an Intel machine and ran 'system_reset' many times; the test result is good.

> 1. Was another exactly identical host (Opteron_G?), but not the same
> exact machine, ever tried?

Yes, it can be reproduced on another host exactly identical to the one in comment 0.
It can also be reproduced on this machine:
processor : 3
vendor_id : AuthenticAMD
cpu family : 21
model : 16
model name : AMD A10-5800K APU with Radeon(tm) HD Graphics
but not all AMD hosts hit this problem.

> 2. Are all 5-6 bugs on the same exact host? do non win2008 guests work?

2.1 Not sure whether all these bugs are on the same host, since some weren't reported by me and there is no host info in those bugs, but the ones I reported are on the same host.
2.2 Didn't hit this bug with guests win2008.i386, win7 (i386, x86_64), win8.0 (i386, x86_64), win8.1 (i386, x86_64).

> 3. The CML is too large and complex. We need to reduce the command line
> down to a minimal reproducer. See my comment
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1056982#c10

This bug can be reproduced w/ the following CML:
/home/staf-kvm-devel/autotest-devel/client/tests/virt/qemu/qemu \
-drive id=drive_image1,if=none,cache=none,snapshot=off,aio=native,file=/home/staf-kvm-devel/autotest-devel/client/tests/virt/shared/data/images/win2008-64-virtio.qcow2 \
-device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=05 \
-monitor stdio \
-vnc :0 \
-m 4096 \

> 4. is this a regression?
> or, conversely, does the problem still reproduce using
> the latest kernel and qemu? (probably not, because rhel7
> isn't far off the latest, but it's still worth trying if the
> issue isn't a regression).

4.1 No, it's not a regression.
I have downgraded qemu to qemu-kvm-1.5.3-10.el7.x86_64 and the kernel to kernel-3.10.0-35.el7.x86_64, and still hit this bug.
4.2 Yes, this problem still exists on the following versions, and the above results were obtained with them:
kernel-3.10.0-85.el7.x86_64
qemu-kvm-1.5.3-45.el7.x86_64

Thanks,
Cong

(In reply to CongLi from comment #12)
> (In reply to CongLi from comment #11)

Thanks for the quick reply.

>
> To make sure the image is good, I have boot and done many times'
> 'system_reset' to the win2008.x86_64 guest on a intel machine, the test
> result is good.
>
> > 1. Was another exactly identical host (Opteron_G?), but not the same
> > exact machine, ever tried?
>
> Yes, it can be reproduced on another exactly identical host as comment 0.
> And also can be reproduced on machine:
> processor : 3
> vendor_id : AuthenticAMD
> cpu family : 21
> model : 16
> model name : AMD A10-5800K APU with Radeon(tm) HD Graphics
> but not all AMD hosts can hit this problem.

OK, good to know.

>
> > 2. Are all 5-6 bugs on the same exact host? do non win2008 guests work?
>
> 2.1 Not sure whether all these bugs are on the same host for some aren't
> reported by me and there is no host info in the bug, but those which I
> reported are the same host.
> 2.2 didn't hit this bug with guest win2008.i386, win7(i386, x86_64),
> win8.0(i386, x86_64), win8.1(i386, x86_64)

Also good to know.

>
> > 3. The CML is too large and complex. We need to reduce the command line
> > down to a minimal reproducer. See my comment
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=1056982#c10
>
> This bug can be reproduced w/ the following CML:
> /home/staf-kvm-devel/autotest-devel/client/tests/virt/qemu/qemu \
> -drive
> id=drive_image1,if=none,cache=none,snapshot=off,aio=native,file=/home/staf-
> kvm-devel/autotest-devel/client/tests/virt/shared/data/images/win2008-64-
> virtio.qcow2 \
> -device
> virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=05 \
> -monitor stdio \
> -vnc :0 \
> -m 4096 \

This is a nicely reduced command line. Thanks for this.

>
> > 4. is this a regression?
> > or, conversely, does the problem still reproduce using
> > the latest kernel and qemu? (probably not, because rhel7
> > isn't far off the latest, but it's still worth trying if the
> > issue isn't a regression).
>
> 4.1 No, it's not a regression.
> I have downgraded qemu to qemu-kvm-1.5.3-10.el7.x86_64, kernel to
> kernel-3.10.0-35.el7.x86_64, still hit this bug.

I was actually thinking about checking for a regression since rhel6. The version numbers tested here don't have a very big delta.

>
> 4.2 Yes, this problem still existed on the following version, and the
> above results are tested on the version:
> kernel-3.10.0-85.el7.x86_64
> qemu-kvm-1.5.3-45.el7.x86_64
>

I was thinking about testing the absolute latest, but as I said before the chance that it will have recently been magically fixed is pretty small, so no need for that test.

I can now begin reproducing the bug with some tracing enabled on the host side. Can I get access to a host where it reproduces and has a windows image that it reproduces with?

Thanks,
drew

(In reply to Andrew Jones from comment #13)
> > > 4. is this a regression?
> > > or, conversely, does the problem still reproduce using
> > > the latest kernel and qemu? (probably not, because rhel7
> > > isn't far off the latest, but it's still worth trying if the
> > > issue isn't a regression).
> >
> > 4.1 No, it's not a regression.
> > I have downgraded qemu to qemu-kvm-1.5.3-10.el7.x86_64, kernel to
> > kernel-3.10.0-35.el7.x86_64, still hit this bug.
>
> I was actually thinking about checking for a regression since rhel6. The
> version numbers tested here don't have a very big delta.

Did not hit this bug on a RHEL 6.6 host, using the same machine as in comment 0:
kernel-2.6.32-444.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.420.el6.x86_64

Tested on the following version:
kernel-3.10.0-95.el7.x86_64
seabios-1.7.2.2-11.el7.x86_64

1. qemu-kvm-1.5.3-2.el7 --> fail
2. qemu-kvm-1.5.2-1.el7 --> pass
   qemu-kvm-1.5.2-4.el7 --> pass

The guest that hits the BSOD on qemu-kvm-1.5.3-2.el7 boots successfully with qemu-kvm-1.5.2-4.el7, and system_reset works as well.

I can't get qemu-kvm-1.5.3-1.el7 because it has been deleted from brew, but I guess this bug was introduced in qemu-kvm-1.5.3, and I found that qemu-kvm-1.5.3.1 has a rebase.

QEMU CML:
/usr/libexec/qemu-kvm \
-drive id=drive_image1,if=none,cache=none,snapshot=off,aio=native,file=/home/win2008-64-virtio.qcow2 \
-device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=05 \
-monitor stdio \
-vnc :0 \
-m 4096 \

Created attachment 869957 [details]
list of commits from 1.5.2-4 to 1.5.3-2

CongLi, how can you be sure that qemu-kvm-1.5.2 works if the bug is only reproduced sometimes?

In any case, I rebuilt qemu-kvm-1.5.3-1.el7 here:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=7132551

Please test with it.

I attach the list of patches between 1.5.2-4 and 1.5.3-2. The only ones that remotely could cause the failure are:

pc: Haswell doesn't have rdtscp on rhel6.x
pc: SandyBridge rhel6.x compat fixes
pc: Remove PCLMULQDQ from Westmere on rhel6.x machine-types
pc: set compat CPUID[0x80000001].EDX bits on Westmere for rhel6.x
pc: rhel6.x has x2apic present on Conroe/Penryn/Nehalem CPU models
pc: Remove incorrect rhel6.x compat "model" value for Conroe/Penryn/Nehalem
pc: set level/xlevel correctly on 486/qemu32 CPU models for rhel6.x

(In reply to Paolo Bonzini from comment #18)
> In any case, I rebuilt qemu-kvm-1.5.3-1.el7 here:
> http://brewweb.devel.redhat.com/brew/taskinfo?taskID=7132551
>
> Please test with it.

qemu-kvm-1.5.3-1.el7 --> pass
qemu-kvm-1.5.3-2.el7 --> fail

As Radim said, this bug depends on the '-machine kernel_irqchip=on|off' setting.

qemu-kvm-1.5.3-2.el7:
1. -machine kernel_irqchip=on --> fail (BSOD)
2. -machine kernel_irqchip=off --> pass

CML:
/home/staf-kvm-devel/autotest-devel/client/tests/virt/qemu/qemu \
-drive id=drive_image1,if=none,cache=none,snapshot=off,aio=native,file=win2008-64-virtio.qcow2 \
-device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=05 \
-monitor stdio \
-vnc :0 \
-m 4096 \
-machine kernel_irqchip=on

Will attach the strace logs later.
1. -machine kernel_irqchip=on: quit directly when the BSOD was hit
2. -machine kernel_irqchip=off: guest booted successfully

Created attachment 870243 [details]
strace-kernel_irqchip_off.txt
Created attachment 870244 [details]
strace-kernel_irqchip_on.txt
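
For anyone repeating the kernel_irqchip comparison, the three variants (machine default, on, off) can be generated from the reduced command line rather than edited by hand. This is only a sketch: the flags and values are the ones quoted in this bug, the binary path /usr/libexec/qemu-kvm and the relative image path are taken from the comments, and the helper name `build_command` is hypothetical. The script prints the command lines and does not launch qemu.

```python
import shlex

# Reduced command line from this bug, without any -machine option.
BASE_ARGS = [
    "/usr/libexec/qemu-kvm",
    "-drive",
    "id=drive_image1,if=none,cache=none,snapshot=off,aio=native,"
    "file=win2008-64-virtio.qcow2",
    "-device",
    "virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=05",
    "-monitor", "stdio",
    "-vnc", ":0",
    "-m", "4096",
]


def build_command(kernel_irqchip=None):
    """Return argv for one variant; None leaves the machine default untouched."""
    if kernel_irqchip not in (None, "on", "off"):
        raise ValueError("kernel_irqchip must be None, 'on' or 'off'")
    args = list(BASE_ARGS)  # copy so the base list is never modified
    if kernel_irqchip is not None:
        args += ["-machine", f"kernel_irqchip={kernel_irqchip}"]
    return args


# Print the three variants as shell-quoted command lines.
for setting in (None, "on", "off"):
    print(shlex.join(build_command(setting)))
```

Keeping everything except the -machine option identical means a pass/fail difference between the variants can be attributed to kernel_irqchip alone, which is the isolation approach asked for in comment 10.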
Posted upstream: http://comments.gmane.org/gmane.linux.kernel/1664960

Patch(es) available on kernel-3.10.0-111.el7

*** Bug 1038594 has been marked as a duplicate of this bug. ***

*** Bug 1038902 has been marked as a duplicate of this bug. ***

*** Bug 1049800 has been marked as a duplicate of this bug. ***

*** Bug 1049823 has been marked as a duplicate of this bug. ***

Tested on:
qemu-kvm-rhev-1.5.3-53.el7.x86_64

1. Reproduce this bug on kernel-3.10.0-110.el7.x86_64
default=on --> fail
-machine kernel_irqchip=on --> fail
-machine kernel_irqchip=off --> pass

2. Verify this bug on kernel-3.10.0-112.el7.x86_64
default=on --> pass
-machine kernel_irqchip=on --> pass
-machine kernel_irqchip=off --> pass

CML:
/usr/libexec/qemu-kvm \
-drive id=drive_image1,if=none,cache=none,snapshot=off,aio=native,file=win2008-64-virtio.qcow2 \
-device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=05 \
-monitor stdio \
-vnc :0 \
-m 4096 \

Based on the above results, we can set the status to 'VERIFIED'.

*** Bug 1003751 has been marked as a duplicate of this bug. ***

This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.
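
The reproduce/verify results above can also be written down as data, which makes the VERIFIED criterion explicit. This is an illustrative sketch only: the pass/fail values are copied from the verification comment, while the structure and the function names `failing_settings` and `is_verified` are made up here and are not part of any Red Hat tooling.

```python
# Results copied from the verification comment in this bug.
# Keys: (kernel build, kernel_irqchip setting); value: True means pass.
RESULTS = {
    ("kernel-3.10.0-110.el7.x86_64", "default"): False,
    ("kernel-3.10.0-110.el7.x86_64", "on"): False,
    ("kernel-3.10.0-110.el7.x86_64", "off"): True,
    ("kernel-3.10.0-112.el7.x86_64", "default"): True,
    ("kernel-3.10.0-112.el7.x86_64", "on"): True,
    ("kernel-3.10.0-112.el7.x86_64", "off"): True,
}


def failing_settings(kernel):
    """Return the sorted kernel_irqchip settings that failed on the given kernel."""
    return sorted(
        setting
        for (build, setting), passed in RESULTS.items()
        if build == kernel and not passed
    )


def is_verified(broken_kernel, fixed_kernel):
    """True if the bug reproduces on broken_kernel and no setting fails on fixed_kernel."""
    return bool(failing_settings(broken_kernel)) and not failing_settings(fixed_kernel)


print(failing_settings("kernel-3.10.0-110.el7.x86_64"))
print(failing_settings("kernel-3.10.0-112.el7.x86_64"))
print(is_verified("kernel-3.10.0-110.el7.x86_64", "kernel-3.10.0-112.el7.x86_64"))
```

Requiring a reproduction on the older kernel as well as a clean run on the newer one guards against declaring the fix verified on a host that never hit the bug in the first place, which matters here because not all AMD hosts reproduce it.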