Bug 1095099 - RHEL7.0 guest hang during kdump with qxl shared irq
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.0
Hardware: x86_64 Linux
Priority: high
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: jason wang
QA Contact: Virtualization Bugs
Depends On:
Blocks:
Reported: 2014-05-07 02:54 EDT by huiqingding
Modified: 2015-03-05 07:05 EST

See Also:
Fixed In Version: kernel-3.10.0-143.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-03-05 07:05:10 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
call trace log of do sysrq (35.64 KB, text/plain)
2014-05-07 02:55 EDT, huiqingding
qemu cli and libvirt xml (8.29 KB, text/plain)
2015-01-22 10:57 EST, David Jaša
rhel70.xml (5.37 KB, text/plain)
2015-01-23 10:30 EST, David Jaša
/proc/interrupts on 7.0 host (2.41 KB, text/plain)
2015-01-23 10:31 EST, David Jaša
/proc/interrupts on 7.1 host (2.41 KB, text/plain)
2015-01-23 10:32 EST, David Jaša
qemu log on 7.1 host (12.14 KB, text/plain)
2015-01-23 10:34 EST, David Jaša
qemu log on 7.0 host (17.64 KB, text/plain)
2015-01-23 10:36 EST, David Jaša


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2015:0290 normal SHIPPED_LIVE Important: kernel security, bug fix, and enhancement update 2015-03-05 11:13:58 EST

Description huiqingding 2014-05-07 02:54:45 EDT
Description of problem:
Boot a RHEL7.0 guest with virtio-balloon, a USB tablet and an e1000/rtl8139 NIC, then trigger a sysrq crash inside the guest. The guest hangs and no vmcore file is generated.

Version-Release number of selected component (if applicable):
kernel-3.10.0-123.el7.x86_64
qemu-kvm-1.5.3-60.el7_0.1.x86_64

The guest kernel is kernel-3.10.0-123.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. boot a rhel7.0 guest with virtio-balloon, usb tablet and e1000/rtl8139 nic
# /usr/libexec/qemu-kvm \
  -M pc \
  -cpu Westmere \
  -enable-kvm  \
  -m 4096 -smp 4,sockets=2,cores=2,threads=1 \
  -nodefconfig \
  -nodefaults \
  -monitor stdio \
  -name rhel7 \
  -device virtio-balloon-pci,id=ballooning,bus=pci.0,addr=0x5,indirect_desc=on,event_idx=on,multifunction=on,rombar=100 \
  -usbdevice tablet \
  -drive file=/home/rhel7-64.qcow2,if=none,id=drive-virtio-disk,format=qcow2,cache=none,aio=native,werror=stop,rerror=stop,media=disk,snapshot=off,bus=1,unit=1 \
  -device virtio-blk-pci,scsi=off,drive=drive-virtio-disk,id=virtio-disk,bus=pci.0,addr=0x7,bootindex=1,physical_block_size=512,logical_block_size=512,multifunction=on,scsi=on,event_idx=on,indirect_desc=on,vectors=32,x-data-plane=off,ioeventfd=on,serial=fuxc,discard_granularity=1,min_io_size=4096,opt_io_size=4096 \
  -netdev tap,id=hostnet1,vhost=off,script=/etc/ovs-ifup,downscript=/etc/ovs-ifdown,ifname=fuxc-net1 \
  -device e1000,netdev=hostnet1,id=virtio-net-pci1,mac=00:01:02:03:04:06,bus=pci.0,addr=0xa,multifunction=off \
  -serial unix:/tmp/monitor2,server,nowait \
  -vga qxl \
  -vnc :1
2. inside guest, do sysrq
# echo c > /proc/sysrq-trigger
3.

Actual results:
After step 2, no vmcore is generated; the serial log is in the attached file.

Expected results:
A vmcore should be generated and the guest should reboot automatically.

Additional info:
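A note on step 2: a vmcore can only be written if a crash kernel is loaded in the guest, so it may be worth ruling that out before attributing a hang to this bug. The sketch below reads /sys/kernel/kexec_crash_loaded, which the kernel sets to 1 when a crash kernel is loaded. The function name crash_kernel_loaded and its optional file argument are hypothetical helpers, not part of the original report.

```shell
# Succeed (exit status 0) when a kdump crash kernel is loaded.
# $1: file to read; defaults to the kernel's sysfs flag. The argument exists
#     only so the check can be exercised against a sample file (an assumption
#     of this sketch, not something from the report).
crash_kernel_loaded() {
    flag_file=${1:-/sys/kernel/kexec_crash_loaded}
    [ "$(cat "$flag_file" 2>/dev/null)" = "1" ]
}
```

Inside the guest, something like `crash_kernel_loaded && echo c > /proc/sysrq-trigger` would then trigger the crash only when kdump is armed.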
Comment 1 huiqingding 2014-05-07 02:55:59 EDT
Created attachment 893124 [details]
call trace log of do sysrq
Comment 3 huiqingding 2014-05-07 04:08:37 EDT
The issue reproduces only with e1000 together with -vga qxl. The command line is as follows:
# /usr/libexec/qemu-kvm -M pc -cpu Westmere,hv_relaxed -enable-kvm -m 4096 -smp 4,sockets=2,cores=2,threads=1 -nodefconfig -nodefaults -monitor stdio -name test-all-qemu-kvm-option -drive file=/home/rhel7-64.qcow2,if=none,id=drive-virtio-disk,format=qcow2,cache=none,aio=native,werror=stop,rerror=stop,media=disk,snapshot=off,bus=1,unit=1 -device virtio-blk-pci,scsi=off,drive=drive-virtio-disk,id=virtio-disk,bus=pci.0,addr=0x7,bootindex=1 -netdev tap,id=hostnet1,vhost=off,script=/etc/qemu-ifup,ifname=fuxc-net1 -device e1000,netdev=hostnet1,id=virtio-net-pci1,mac=00:01:02:03:04:06,bus=pci.0,addr=0xa,multifunction=off -netdev tap,id=hostnet2,vhost=off,script=/etc/qemu-ifup,ifname=fuxc-net-rtl8139 -device e1000,netdev=hostnet2,id=virtio-net-pci2,mac=00:01:02:03:04:07,bus=pci.0,addr=0xb,multifunction=off -serial unix:/tmp/monitor2,server,nowait -vga qxl -vnc :1

When the rhel7.0 guest is booted with an e1000 card and "-vga cirrus" or "-vga std", the vmcore is generated successfully.
Comment 4 jason wang 2014-05-09 05:33:59 EDT
(In reply to huiqingding from comment #3)
> Only with e1000 and -vga qxl can reproduce this issue, the command line as
> following:
> # /usr/libexec/qemu-kvm -M pc -cpu Westmere,hv_relaxed -enable-kvm -m 4096
> -smp 4,sockets=2,cores=2,threads=1 -nodefconfig -nodefaults -monitor stdio
> -name test-all-qemu-kvm-option -drive
> file=/home/rhel7-64.qcow2,if=none,id=drive-virtio-disk,format=qcow2,
> cache=none,aio=native,werror=stop,rerror=stop,media=disk,snapshot=off,bus=1,
> unit=1 -device
> virtio-blk-pci,scsi=off,drive=drive-virtio-disk,id=virtio-disk,bus=pci.0,
> addr=0x7,bootindex=1 -netdev
> tap,id=hostnet1,vhost=off,script=/etc/qemu-ifup,ifname=fuxc-net1 -device
> e1000,netdev=hostnet1,id=virtio-net-pci1,mac=00:01:02:03:04:06,bus=pci.0,
> addr=0xa,multifunction=off -netdev
> tap,id=hostnet2,vhost=off,script=/etc/qemu-ifup,ifname=fuxc-net-rtl8139
> -device
> e1000,netdev=hostnet2,id=virtio-net-pci2,mac=00:01:02:03:04:07,bus=pci.0,
> addr=0xb,multifunction=off -serial unix:/tmp/monitor2,server,nowait -vga qxl
> -vnc :1
> 
> Boot rhel7.0 guest with e1000 card and "-vga cirrus/std", vmcore can be
> generated successfully.

This probably means something is wrong in qxl.
Comment 5 jason wang 2014-05-09 05:38:55 EDT
Looking at the qxl interrupt handler: it always returns IRQ_HANDLED. This means that when it shares an irq with another device in the kdump kernel and there is a pending irq in the other device, irq processing loops forever. Since qxl returns IRQ_HANDLED, note_interrupt() does not treat the interrupt as spurious, so the irq is never masked.

Something like this is needed:

diff --git a/drivers/gpu/drm/qxl/qxl_irq.c b/drivers/gpu/drm/qxl/qxl_irq.c
index 21393dc..f4b6b89 100644
--- a/drivers/gpu/drm/qxl/qxl_irq.c
+++ b/drivers/gpu/drm/qxl/qxl_irq.c
@@ -33,6 +33,9 @@ irqreturn_t qxl_irq_handler(DRM_IRQ_ARGS)
 
        pending = xchg(&qdev->ram_header->int_pending, 0);
 
+       if (!pending)
+               return IRQ_NONE;
+
        atomic_inc(&qdev->irq_received);
 
        if (pending & QXL_INTERRUPT_DISPLAY) {
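For readers who want to see why the return value matters, here is a toy model (not kernel code) of the spurious-interrupt accounting on one shared, level-triggered line. It assumes the situation described above: the other device keeps the line asserted, its driver is not loaded yet, and only the qxl handler runs. The function name simulate_shared_irq is hypothetical, and the 100000/99900 figures are meant to mirror the thresholds used by the kernel's spurious-IRQ detection; treat them as assumptions of this sketch.

```shell
# Toy model of spurious-IRQ accounting on one shared interrupt line.
# $1: what the qxl handler returns when it has nothing pending: HANDLED or NONE
# $2: number of interrupts to simulate
# Prints "disabled" if the line would be masked, "storm" if it never is.
simulate_shared_irq() {
    qxl_ret=$1
    total=$2
    window=100000   # interrupts per accounting window (assumed threshold)
    limit=99900     # unhandled count above which the line is disabled (assumed)
    count=0
    unhandled=0
    i=0
    while [ "$i" -lt "$total" ]; do
        i=$((i + 1))
        # Only an interrupt that no handler claims counts as unhandled.
        if [ "$qxl_ret" = NONE ]; then
            unhandled=$((unhandled + 1))
        fi
        count=$((count + 1))
        if [ "$count" -ge "$window" ]; then
            if [ "$unhandled" -gt "$limit" ]; then
                echo disabled
                return 0
            fi
            # Window ended without tripping the limit: start a new one.
            count=0
            unhandled=0
        fi
    done
    echo storm
}
```

With these assumptions, `simulate_shared_irq HANDLED 100000` prints "storm" (the unhandled counter never moves, so the line is never masked), while `simulate_shared_irq NONE 100000` prints "disabled", which corresponds to the behaviour the patch is after.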
Comment 10 jason wang 2014-05-11 22:48:19 EDT
(In reply to huiqingding from comment #9)
> Created attachment 894528 [details]
> serial log after do sysrq

Thanks for the testing.

Those call traces are expected. Since the crash kernel cannot reset the devices, you may see them if an irq is injected before 8139 or e1000 is initialized.

Will post the patch upstream first.
Comment 12 Jarod Wilson 2014-08-07 16:54:46 EDT
Patch(es) available on kernel-3.10.0-143.el7
Comment 20 David Jaša 2015-01-22 10:49:50 EST
Jason, Huiqing,

When booting a 7.0 release guest (with kernel -123 as in the original report, so without the fix) on a 7.1 host, I can successfully generate a backtrace using the "echo c > /proc/sysrq-trigger" method. I'll try 7.0 on 7.0, but for now I'm not sure the reproducer is as reliable as claimed. I'll attach the generated qemu cli and the libvirt domain xml.
Comment 21 David Jaša 2015-01-22 10:57:03 EST
Created attachment 982907 [details]
qemu cli and libvirt xml
Comment 22 jason wang 2015-01-22 21:51:24 EST
(In reply to David Jaša from comment #20)
> Jason, Huiqing,
> 
> When booting 7.0 release guest (with kernel -123 as in original report so
> without the fix) on 7.1 host, I can successfully generate backtrace using
> "echo c > /proc/sysrq-trigger" method. I'll try 7.0 on 7.0 but for now, I'm
> not sure if the reproducer is as reliable as claimed. I'll attach the
> generated qemu cli and the libvirt domain xml.

Hi David:

I don't see e1000 or 8139 in your cli. Please make sure qxl is sharing an irq with another device (e.g. 8139 or e1000). You can check this by running "cat /proc/interrupts" in the guest. Virtio-net does not allow sharing an irq with qxl, so you probably won't reproduce the issue.

Thanks
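As a rough way to do the /proc/interrupts check, the sketch below prints only the interrupt lines on which qxl appears together with at least one other device. It relies on /proc/interrupts listing devices that share a line as a comma-separated list in the last column; the function name shared_qxl_irqs and its optional file argument are hypothetical. Note that a NIC normally shows up under its interface name (for example eth0) rather than under the driver name.

```shell
# Print the /proc/interrupts lines where qxl shares the IRQ with another device.
# $1: file in /proc/interrupts format; defaults to the live file. The argument
#     is an assumption of this sketch so that saved copies can be checked too.
shared_qxl_irqs() {
    # A line qualifies when it mentions qxl and lists more than one device
    # (devices on the same line are separated by commas).
    awk '/qxl/ && /,/ { print }' "${1:-/proc/interrupts}"
}
```

Run `shared_qxl_irqs` inside the guest; empty output suggests qxl has the line to itself, in which case the hang is unlikely to reproduce. The same function can be pointed at a saved copy, e.g. `shared_qxl_irqs interrupts-7.0.txt` (file name made up for illustration).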
Comment 23 David Jaša 2015-01-23 10:28:25 EST
(In reply to jason wang from comment #22)
> ...
> I don't see e1000 or 8139 in your cli. Please make sure qxl is sharing irq
> with other device (e.g 8139 or e1000). You can check this through doing "cat
> /proc/interrupts" in guest. Virtio-net does not allow share irq with qxl, so
> you probably won't reproduce the issue.
> 
> Thanks

Point taken. I changed the network device to the e1000 type, plugged it into a bridge network and fed the interface with WOL packets from outside (to ensure IRQs keep coming after IP deconfiguration). This did not make the bug occur on the 7.1 host, however; only the 7.0.z host was affected.

Here are the details: I reused the same fresh 7.0 guest for testing and ran it on a 7.0.z host system and on 7.1. The domain xml is the same on both hosts and /proc/interrupts seems the same (apart from per-CPU counts). The 7.1 host required the addition of an "allow br0" entry to /etc/qemu-kvm/bridge.conf so that a VM in a user session could use the <interface type='bridge'> setting.
Comment 24 David Jaša 2015-01-23 10:30:47 EST
Created attachment 983399 [details]
rhel70.xml
Comment 25 David Jaša 2015-01-23 10:31:35 EST
Created attachment 983400 [details]
/proc/interrupts on 7.0 host
Comment 26 David Jaša 2015-01-23 10:32:03 EST
Created attachment 983401 [details]
/proc/interrupts on 7.1 host
Comment 27 David Jaša 2015-01-23 10:34:01 EST
Created attachment 983404 [details]
qemu log on 7.1 host
Comment 28 David Jaša 2015-01-23 10:36:20 EST
Created attachment 983405 [details]
qemu log on 7.0 host
Comment 29 David Jaša 2015-01-23 10:42:28 EST
The WOL packets were sent by this loop, invoked on the host right before issuing c to /proc/sysrq-trigger in the guest:
i=0 ; while true ; do echo $i; ether-wake -i <bridge> <guest_mac_address> ; sleep 0.01 ; i=$(($i+1)) ; done
Comment 30 jason wang 2015-01-25 22:01:33 EST
Thanks for the testing David.

In RHEL7.1 there's a kernel-side fix which may make it a little bit harder to reproduce:

commit f008d31b1c680230d934a18207a6909c97337af4
Author: John Snow <jsnow@redhat.com>
Date:   Fri Nov 14 23:32:36 2014 -0500

    [virt] kvm/ioapic: conditionally delay irq delivery duringeoi broadcast

Please try a host kernel version lower than 205 to reproduce.

Thanks.
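To check a host against that threshold, the RHEL build number can be pulled out of the kernel release string (the "123" in 3.10.0-123.el7.x86_64). This is a minimal sketch: the helper names rhel_kernel_build and kernel_build_below are hypothetical, and it assumes the release string has the usual version-build.dist.arch layout.

```shell
# Print the build number of a RHEL kernel release string,
# e.g. "3.10.0-123.el7.x86_64" -> "123".
rhel_kernel_build() {
    build=${1#*-}       # drop everything up to the first "-": 123.el7.x86_64
    build=${build%%.*}  # drop everything from the first ".": 123
    printf '%s\n' "$build"
}

# Succeed when the build number of release string $1 is lower than $2.
kernel_build_below() {
    [ "$(rhel_kernel_build "$1")" -lt "$2" ]
}
```

On the host, `kernel_build_below "$(uname -r)" 205 && echo "suitable for reproducing"` would then flag kernels older than the build mentioned above.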
Comment 35 huiqingding 2015-01-29 21:08:34 EST
Tested this bug using the following versions:
kernel-3.10.0-227.el7.x86_64
qemu-kvm-rhev-2.1.2-23.el7.x86_64

The guest kernel is kernel-3.10.0-227.el7.x86_64

Steps to test:
1. boot a rhel7.1 guest:
/usr/libexec/qemu-kvm \
  -M pc \
  -cpu Opteron_G3 \
  -enable-kvm  \
  -m 4096 -smp 4,sockets=2,cores=2,threads=1 \
  -nodefconfig \
  -nodefaults \
  -monitor stdio \
  -name rhel7 \
  -device virtio-balloon-pci,id=ballooning,bus=pci.0,addr=0x5,indirect_desc=on,event_idx=on,multifunction=on,rombar=100 \
  -usbdevice tablet \
  -drive file=/mnt/rhel7_1_1222.qcow2,if=none,id=drive-scsi-disk,format=qcow2,cache=writethrough,werror=stop,rerror=stop \
  -device virtio-scsi-pci,id=scsi1,addr=0x13 \
  -device scsi-hd,drive=drive-scsi-disk,bus=scsi1.0,id=data-disk2,bootindex=1 \
  -netdev tap,id=hostnet1,vhost=off,script=/etc/qemu-ifup,ifname=fuxc-net1 \
  -device e1000,netdev=hostnet1,id=virtio-net-pci1,mac=00:01:02:03:04:06,bus=pci.0,addr=0xa,multifunction=off \
  -serial unix:/tmp/monitor2,server,nowait \
  -vga qxl \
  -vnc :1

2. inside guest, do sysrq
# echo c > /proc/sysrq-trigger
3.

Actual results:
The guest reboots automatically and a vmcore is generated.

Based on the above results, I think this bug has been fixed.
Comment 36 juzhang 2015-01-29 21:12:09 EST
According to comment 35, KVM QE plans to set this issue as verified. If Desktop QE has more testing to do, feel free to update the bug and its status accordingly.

Best Regards,
Junyi
Comment 37 jason wang 2015-01-29 21:50:36 EST
(In reply to huiqingding from comment #35)
> Test this bug using the following version:
> kernel-3.10.0-227.el7.x86_64
> qemu-kvm-rhev-2.1.2-23.el7.x86_64
> 
> The guest kernel is kernel-3.10.0-227.el7.x86_64
> 
> Steps to test:
> 1. boot a rhel7.1 guest:
> /usr/libexec/qemu-kvm \
>   -M pc \
>   -cpu Opteron_G3 \
>   -enable-kvm  \
>   -m 4096 -smp 4,sockets=2,cores=2,threads=1 \
>   -nodefconfig \
>   -nodefaults \
>   -monitor stdio \
>   -name rhel7 \
>   -device
> virtio-balloon-pci,id=ballooning,bus=pci.0,addr=0x5,indirect_desc=on,
> event_idx=on,multifunction=on,rombar=100 \
>   -usbdevice tablet \
>   -drive
> file=/mnt/rhel7_1_1222.qcow2,if=none,id=drive-scsi-disk,format=qcow2,
> cache=writethrough,werror=stop,rerror=stop \
>   -device virtio-scsi-pci,id=scsi1,addr=0x13 \
>   -device
> scsi-hd,drive=drive-scsi-disk,bus=scsi1.0,id=data-disk2,bootindex=1 \
>   -netdev tap,id=hostnet1,vhost=off,script=/etc/qemu-ifup,ifname=fuxc-net1 \
>   -device
> e1000,netdev=hostnet1,id=virtio-net-pci1,mac=00:01:02:03:04:06,bus=pci.0,
> addr=0xa,multifunction=off \
>   -serial unix:/tmp/monitor2,server,nowait \
>   -vga qxl \
>   -vnc :1
> 
> 2. inside guest, do sysrq
> # echo c > /proc/sysrq-trigger
> 3.
> 
> Actual results:
> the guest can reboot automatically and vmcore can be generated.
> 
> Based on the above results, I think this bug has been fixed.

See comment #30. It would be better to verify this bug on a host kernel lower than 205 or on a RHEL6 host.

Thanks
Comment 38 huiqingding 2015-01-29 22:08:15 EST
Hi, Jason,

Thanks for the reminder.

I also tested a RHEL7.1 guest on a RHEL6 host:
kernel-2.6.32-524.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.451.el6.x86_64

The guest kernel is kernel-3.10.0-227.el7.x86_64

The test steps are the same as in comment 35 and the result is OK: the guest reboots automatically and a vmcore is generated.

Jason, could you help to confirm whether this bug has been fixed?

Best regards
Huiqing
Comment 39 jason wang 2015-01-30 00:06:53 EST
(In reply to huiqingding from comment #38)
> Hi, Jason,
> 
> Thanks for reminding.
> 
> I also test RHEL7.1 guest on RHEL6 host:
> kernel-2.6.32-524.el6.x86_64
> qemu-kvm-rhev-0.12.1.2-2.451.el6.x86_64
> 
> The guest kernel is kernel-3.10.0-227.el7.x86_64
> 
> The test steps are same as comment 35, the result is ok, the guest can
> reboot automatically and vmcore can be generated.
> 
> Jason, will you help to confirm whether this bug has been fixed?
> 
> Best regards
> Huiqing

Yes, I confirm.

Thanks
Comment 42 errata-xmlrpc 2015-03-05 07:05:10 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0290.html
