Bug 1095099
| Summary: | RHEL7.0 guest hang during kdump with qxl shared irq |
|---|---|
| Product: | Red Hat Enterprise Linux 7 |
| Component: | kernel |
| kernel sub component: | Graphics |
| Status: | CLOSED ERRATA |
| Severity: | medium |
| Priority: | high |
| Version: | 7.0 |
| Target Milestone: | rc |
| Hardware: | x86_64 |
| OS: | Linux |
| Reporter: | huiqingding <huding> |
| Assignee: | jason wang <jasowang> |
| QA Contact: | Virtualization Bugs <virt-bugs> |
| CC: | djasa, huding, jasowang, juli, juzhang, knoel, mkrcmari, rbalakri, tpelka, virt-maint, xfu |
| Fixed In Version: | kernel-3.10.0-143.el7 |
| Doc Type: | Bug Fix |
| Type: | Bug |
| Last Closed: | 2015-03-05 12:05:10 UTC |
Description
huiqingding
2014-05-07 06:54:45 UTC
Created attachment 893124 [details]
call trace log of do sysrq
This issue reproduces only with e1000 and -vga qxl. The command line is as follows:
    # /usr/libexec/qemu-kvm -M pc -cpu Westmere,hv_relaxed -enable-kvm \
      -m 4096 -smp 4,sockets=2,cores=2,threads=1 \
      -nodefconfig -nodefaults -monitor stdio \
      -name test-all-qemu-kvm-option \
      -drive file=/home/rhel7-64.qcow2,if=none,id=drive-virtio-disk,format=qcow2,cache=none,aio=native,werror=stop,rerror=stop,media=disk,snapshot=off,bus=1,unit=1 \
      -device virtio-blk-pci,scsi=off,drive=drive-virtio-disk,id=virtio-disk,bus=pci.0,addr=0x7,bootindex=1 \
      -netdev tap,id=hostnet1,vhost=off,script=/etc/qemu-ifup,ifname=fuxc-net1 \
      -device e1000,netdev=hostnet1,id=virtio-net-pci1,mac=00:01:02:03:04:06,bus=pci.0,addr=0xa,multifunction=off \
      -netdev tap,id=hostnet2,vhost=off,script=/etc/qemu-ifup,ifname=fuxc-net-rtl8139 \
      -device e1000,netdev=hostnet2,id=virtio-net-pci2,mac=00:01:02:03:04:07,bus=pci.0,addr=0xb,multifunction=off \
      -serial unix:/tmp/monitor2,server,nowait \
      -vga qxl \
      -vnc :1

When the RHEL 7.0 guest is booted with an e1000 card and "-vga cirrus" or "-vga std", the vmcore is generated successfully.

(In reply to huiqingding from comment #3)
This probably means something is wrong in qxl.

Looking at the qxl interrupt handler: it always returns IRQ_HANDLED. This means that when qxl shares an irq with another device in the kdump kernel and that other device has a pending irq, there will be an infinite loop of irq processing. Since qxl returned IRQ_HANDLED, note_interrupt() won't treat it as a spurious interrupt, so the irq won't be masked.

Something like this is needed:
diff --git a/drivers/gpu/drm/qxl/qxl_irq.c b/drivers/gpu/drm/qxl/qxl_irq.c
index 21393dc..f4b6b89 100644
--- a/drivers/gpu/drm/qxl/qxl_irq.c
+++ b/drivers/gpu/drm/qxl/qxl_irq.c
@@ -33,6 +33,9 @@ irqreturn_t qxl_irq_handler(DRM_IRQ_ARGS)
 
 	pending = xchg(&qdev->ram_header->int_pending, 0);
 
+	if (!pending)
+		return IRQ_NONE;
+
 	atomic_inc(&qdev->irq_received);
 
 	if (pending & QXL_INTERRUPT_DISPLAY) {
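
To illustrate why the handler's return value matters here, the following is a minimal userspace sketch, not kernel code. It assumes a simplified model of the spurious-interrupt accounting done by note_interrupt(): interrupts are counted in windows of 100000, and the line is masked when more than 99900 of them in one window were reported as unhandled. Those thresholds are an assumption made for illustration, and the names fake_qxl, qxl_handler_old, qxl_handler_fixed, and simulate_storm are hypothetical. The scenario modeled is the one described above: another device on the shared line keeps it asserted, its driver is not loaded yet, and qxl itself has nothing pending.

```c
#include <assert.h>
#include <stdint.h>

/* Same value names as the kernel's irqreturn_t used in the patch. */
typedef enum { IRQ_NONE = 0, IRQ_HANDLED = 1 } irqreturn_t;

/* Hypothetical stand-in for qdev->ram_header->int_pending. */
struct fake_qxl {
    uint32_t int_pending;
};

/* Assumed accounting window and threshold (simplified model). */
enum { IRQ_WINDOW = 100000, UNHANDLED_LIMIT = 99900 };

/* Pre-patch behavior: claims every interrupt, even with nothing pending. */
irqreturn_t qxl_handler_old(struct fake_qxl *q)
{
    q->int_pending = 0;
    return IRQ_HANDLED;
}

/* Patched behavior: reports IRQ_NONE when qxl has nothing pending. */
irqreturn_t qxl_handler_fixed(struct fake_qxl *q)
{
    uint32_t pending = q->int_pending;

    q->int_pending = 0;
    if (!pending)
        return IRQ_NONE;
    return IRQ_HANDLED;
}

/*
 * Deliver up to max_irqs interrupts caused by the other device on the
 * shared line. Returns how many interrupts were delivered before the
 * line would be masked, or -1 if it is never masked within max_irqs.
 */
long simulate_storm(irqreturn_t (*handler)(struct fake_qxl *), long max_irqs)
{
    struct fake_qxl q = { 0 };
    long irq_count = 0;
    long unhandled = 0;

    assert(max_irqs > 0);
    for (long n = 1; n <= max_irqs; n++) {
        if (handler(&q) == IRQ_NONE)
            unhandled++;
        if (++irq_count < IRQ_WINDOW)
            continue;
        if (unhandled > UNHANDLED_LIMIT)
            return n;           /* the line would be masked here */
        irq_count = 0;          /* start a new accounting window */
        unhandled = 0;
    }
    return -1;
}
```

Under this model, simulate_storm(qxl_handler_old, 1000000) returns -1 because no interrupt is ever counted as unhandled, while simulate_storm(qxl_handler_fixed, 1000000) returns 100000, the end of the first window, at which point the storm would be cut off.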

(In reply to huiqingding from comment #9)
> Created attachment 894528 [details]
> serial log after do sysrq

Thanks for the testing. Those call traces are expected. Since the crash kernel cannot reset the devices, you may see them if an irq was injected before 8139 or e1000 was initialized.

I will post the patch upstream first.

Patch(es) available on kernel-3.10.0-143.el7

Jason, Huiqing,

When booting a 7.0 release guest (with kernel -123 as in the original report, so without the fix) on a 7.1 host, I can successfully generate a backtrace using the "echo c > /proc/sysrq-trigger" method. I'll try 7.0 on 7.0, but for now I'm not sure the reproducer is as reliable as claimed. I'll attach the generated qemu cli and the libvirt domain xml.

Created attachment 982907 [details]
qemu cli and libvirt xml
(In reply to David Jaša from comment #20)
Hi David:

I don't see e1000 or 8139 in your cli. Please make sure qxl is sharing an irq with another device (e.g. 8139 or e1000). You can check this by running "cat /proc/interrupts" in the guest. Virtio-net does not share an irq with qxl, so you probably won't reproduce the issue.

Thanks

(In reply to jason wang from comment #22)
Point taken. I changed the network device to the e1000 type, plugged it into a bridge network, and fed the interface with WOL packets from outside (to ensure that IRQs keep coming after IP deconfiguration). That did not make the bug occur on the 7.1 host, however; only the 7.0.z host was affected.

Here are the details:
I reused the same fresh 7.0 guest for testing and ran it on a 7.0.z host system and on 7.1. The domain xml is the same on both hosts and /proc/interrupts looks the same (apart from the per-CPU counts). The 7.1 host required adding an "allow br0" entry to /etc/qemu-kvm/bridge.conf so that a VM in the user session could use the <interface type='bridge'> setting.

Created attachment 983399 [details]
rhel70.xml
Created attachment 983400 [details]
/proc/interrupts on 7.0 host
Created attachment 983401 [details]
/proc/interrupts on 7.1 host
Created attachment 983404 [details]
qemu log on 7.1 host
Created attachment 983405 [details]
qemu log on 7.0 host
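
Earlier in the thread, Jason suggests checking /proc/interrupts in the guest to confirm that qxl shares its irq with another device. A rough sketch of automating that check follows. It assumes that handlers sharing one line are listed comma-separated at the end of that line and that the qxl handler appears under the name "qxl"; both are assumptions about the guest's output rather than something shown in this report (a NIC is typically listed under its interface name). The function names line_shares_irq and find_shared_irq are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * True if one /proc/interrupts line mentions `name` and lists more than
 * one handler (assumed to show up as a comma in the line).
 */
bool line_shares_irq(const char *line, const char *name)
{
    return strstr(line, name) != NULL && strchr(line, ',') != NULL;
}

/*
 * Scan an open /proc/interrupts-style stream and return the number of
 * the first irq that `name` shares with another handler, or -1 if no
 * such line is found.
 */
int find_shared_irq(FILE *fp, const char *name)
{
    char line[1024];

    assert(fp != NULL);
    while (fgets(line, sizeof line, fp) != NULL) {
        if (line_shares_irq(line, name))
            return (int)strtol(line, NULL, 10); /* leading irq number */
    }
    return -1;
}
```

In the guest, find_shared_irq(fopen("/proc/interrupts", "r"), "qxl") could be used after checking that fopen succeeded; a result of -1 would suggest that the reproducer's precondition (qxl sharing an irq) is not met.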
The WOL packets were sent by this loop, invoked on the host right before issuing c to /proc/sysrq-trigger in the guest:
    i=0 ; while true ; do echo $i; ether-wake -i <bridge> <guest_mac_address> ; sleep 0.01 ; i=$(($i+1)) ; done

Thanks for the testing, David.
In RHEL7.1 there's a kernel-side fix which may make the issue a little bit harder to reproduce:
commit f008d31b1c680230d934a18207a6909c97337af4
Author: John Snow <jsnow>
Date: Fri Nov 14 23:32:36 2014 -0500
[virt] kvm/ioapic: conditionally delay irq delivery duringeoi broadcast
Please try to reproduce with a host kernel version lower than 205.
Thanks.

Tested this bug using the following versions:
kernel-3.10.0-227.el7.x86_64
qemu-kvm-rhev-2.1.2-23.el7.x86_64

The guest kernel is kernel-3.10.0-227.el7.x86_64.

Steps to test:
1. Boot a RHEL 7.1 guest:
    /usr/libexec/qemu-kvm \
    -M pc \
    -cpu Opteron_G3 \
    -enable-kvm \
    -m 4096 -smp 4,sockets=2,cores=2,threads=1 \
    -nodefconfig \
    -nodefaults \
    -monitor stdio \
    -name rhel7 \
    -device virtio-balloon-pci,id=ballooning,bus=pci.0,addr=0x5,indirect_desc=on,event_idx=on,multifunction=on,rombar=100 \
    -usbdevice tablet \
    -drive file=/mnt/rhel7_1_1222.qcow2,if=none,id=drive-scsi-disk,format=qcow2,cache=writethrough,werror=stop,rerror=stop \
    -device virtio-scsi-pci,id=scsi1,addr=0x13 \
    -device scsi-hd,drive=drive-scsi-disk,bus=scsi1.0,id=data-disk2,bootindex=1 \
    -netdev tap,id=hostnet1,vhost=off,script=/etc/qemu-ifup,ifname=fuxc-net1 \
    -device e1000,netdev=hostnet1,id=virtio-net-pci1,mac=00:01:02:03:04:06,bus=pci.0,addr=0xa,multifunction=off \
    -serial unix:/tmp/monitor2,server,nowait \
    -vga qxl \
    -vnc :1
2. Inside the guest, trigger a crash via sysrq:
    # echo c > /proc/sysrq-trigger

Actual results:
The guest reboots automatically and the vmcore is generated.

Based on the above results, I think this bug has been fixed.

According to comment 35, KVM QE plans to set this issue as verified. If Desktop QE has more testing to do, feel free to update the bug and its status accordingly.

Best Regards,
Junyi

(In reply to huiqingding from comment #35)
See comment #30. It would be better to verify this bug on a host kernel lower than 205 or on a RHEL6 host.

Thanks

Hi, Jason,

Thanks for the reminder.

I also tested a RHEL7.1 guest on a RHEL6 host:
kernel-2.6.32-524.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.451.el6.x86_64

The guest kernel is kernel-3.10.0-227.el7.x86_64.

The test steps are the same as in comment 35 and the result is OK: the guest reboots automatically and the vmcore is generated.

Jason, could you help confirm whether this bug has been fixed?

Best regards
Huiqing

(In reply to huiqingding from comment #38)
Yes, I confirm.

Thanks

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0290.html