1095099 – RHEL7.0 guest hang during kdump with qxl shared irq

Summary: RHEL7.0 guest hang during kdump with qxl shared irq

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	7.0
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	jason wang
QA Contact:	Virtualization Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2014-05-07 06:54 UTC by huiqingding
Modified:	2015-03-05 12:05 UTC
CC List:	11 users
Fixed In Version:	kernel-3.10.0-143.el7
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-03-05 12:05:10 UTC
Target Upstream Version:
Embargoed:

Attachments
call trace log of do sysrq (35.64 KB, text/plain) 2014-05-07 06:55 UTC, huiqingding	no flags	Details
qemu cli and libvirt xml (8.29 KB, text/plain) 2015-01-22 15:57 UTC, David Jaša	no flags	Details
rhel70.xml (5.37 KB, text/plain) 2015-01-23 15:30 UTC, David Jaša	no flags	Details
/proc/interrupts on 7.0 host (2.41 KB, text/plain) 2015-01-23 15:31 UTC, David Jaša	no flags	Details
/proc/interrupts on 7.1 host (2.41 KB, text/plain) 2015-01-23 15:32 UTC, David Jaša	no flags	Details
qemu log on 7.1 host (12.14 KB, text/plain) 2015-01-23 15:34 UTC, David Jaša	no flags	Details
qemu log on 7.0 host (17.64 KB, text/plain) 2015-01-23 15:36 UTC, David Jaša	no flags	Details

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1031488	0	high	CLOSED	Restore the mask bit correctly in eoi_ioapic_irq()	2021-02-22 00:41:40 UTC
Red Hat Product Errata	RHSA-2015:0290	0	normal	SHIPPED_LIVE	Important: kernel security, bug fix, and enhancement update	2015-03-05 16:13:58 UTC

Internal Links: 1031488

Description huiqingding 2014-05-07 06:54:45 UTC

Description of problem:
Boot a RHEL7.0 guest with virtio-balloon, a USB tablet, and an e1000/rtl8139 NIC, then trigger a crash via sysrq inside the guest. The guest hangs and no vmcore file is generated.

Version-Release number of selected component (if applicable):
kernel-3.10.0-123.el7.x86_64
qemu-kvm-1.5.3-60.el7_0.1.x86_64

The guest kernel is kernel-3.10.0-123.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. boot a rhel7.0 guest with virtio-balloon, usb tablet and e1000/rtl8139 nic
# /usr/libexec/qemu-kvm \
  -M pc \
  -cpu Westmere \
  -enable-kvm  \
  -m 4096 -smp 4,sockets=2,cores=2,threads=1 \
  -nodefconfig \
  -nodefaults \
  -monitor stdio \
  -name rhel7 \
  -device virtio-balloon-pci,id=ballooning,bus=pci.0,addr=0x5,indirect_desc=on,event_idx=on,multifunction=on,rombar=100 \
  -usbdevice tablet \
  -drive file=/home/rhel7-64.qcow2,if=none,id=drive-virtio-disk,format=qcow2,cache=none,aio=native,werror=stop,rerror=stop,media=disk,snapshot=off,bus=1,unit=1 \
  -device virtio-blk-pci,scsi=off,drive=drive-virtio-disk,id=virtio-disk,bus=pci.0,addr=0x7,bootindex=1,physical_block_size=512,logical_block_size=512,multifunction=on,scsi=on,event_idx=on,indirect_desc=on,vectors=32,x-data-plane=off,ioeventfd=on,serial=fuxc,discard_granularity=1,min_io_size=4096,opt_io_size=4096 \
  -netdev tap,id=hostnet1,vhost=off,script=/etc/ovs-ifup,downscript=/etc/ovs-ifdown,ifname=fuxc-net1 \
  -device e1000,netdev=hostnet1,id=virtio-net-pci1,mac=00:01:02:03:04:06,bus=pci.0,addr=0xa,multifunction=off \
  -serial unix:/tmp/monitor2,server,nowait \
  -vga qxl \
  -vnc :1
2. inside guest, do sysrq
# echo c > /proc/sysrq-trigger
3.

Actual results:
After step 2, no vmcore is generated; the serial log is in the attached file.

Expected results:
A vmcore should be generated and the guest should reboot automatically.
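
One rough way to check the expected result once the guest is back up is to look for a vmcore under the kdump dump directory. The sketch below assumes the RHEL default location /var/crash (the "path" setting in /etc/kdump.conf may point elsewhere); list_vmcores is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# List vmcore files written by kdump.
# Assumption: dumps land in <crash_dir>/<host>-<timestamp>/vmcore, with
# crash_dir defaulting to /var/crash (the "path" option in /etc/kdump.conf
# can change this).
list_vmcores() {
    crash_dir=${1:-/var/crash}
    # A missing directory simply means nothing was dumped.
    [ -d "$crash_dir" ] || return 0
    find "$crash_dir" -type f -name vmcore
}
```

Before step 2, "cat /sys/kernel/kexec_crash_loaded" printing 1 indicates that a crash kernel is loaded. After the reboot, an empty result from list_vmcores would match the failure described in this report.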

Additional info:

Comment 1 huiqingding 2014-05-07 06:55:59 UTC

Created attachment 893124 [details]
call trace log of do sysrq

Comment 3 huiqingding 2014-05-07 08:08:37 UTC

The issue reproduces only with e1000 together with -vga qxl. The command line is as follows:
# /usr/libexec/qemu-kvm -M pc -cpu Westmere,hv_relaxed -enable-kvm -m 4096 -smp 4,sockets=2,cores=2,threads=1 -nodefconfig -nodefaults -monitor stdio -name test-all-qemu-kvm-option -drive file=/home/rhel7-64.qcow2,if=none,id=drive-virtio-disk,format=qcow2,cache=none,aio=native,werror=stop,rerror=stop,media=disk,snapshot=off,bus=1,unit=1 -device virtio-blk-pci,scsi=off,drive=drive-virtio-disk,id=virtio-disk,bus=pci.0,addr=0x7,bootindex=1 -netdev tap,id=hostnet1,vhost=off,script=/etc/qemu-ifup,ifname=fuxc-net1 -device e1000,netdev=hostnet1,id=virtio-net-pci1,mac=00:01:02:03:04:06,bus=pci.0,addr=0xa,multifunction=off -netdev tap,id=hostnet2,vhost=off,script=/etc/qemu-ifup,ifname=fuxc-net-rtl8139 -device e1000,netdev=hostnet2,id=virtio-net-pci2,mac=00:01:02:03:04:07,bus=pci.0,addr=0xb,multifunction=off -serial unix:/tmp/monitor2,server,nowait -vga qxl -vnc :1

When the rhel7.0 guest is booted with an e1000 card and "-vga cirrus" or "-vga std", the vmcore is generated successfully.

Comment 4 jason wang 2014-05-09 09:33:59 UTC

(In reply to huiqingding from comment #3)
> Only with e1000 and -vga qxl can reproduce this issue, the command line as
> following:
> # /usr/libexec/qemu-kvm -M pc -cpu Westmere,hv_relaxed -enable-kvm -m 4096
> -smp 4,sockets=2,cores=2,threads=1 -nodefconfig -nodefaults -monitor stdio
> -name test-all-qemu-kvm-option -drive
> file=/home/rhel7-64.qcow2,if=none,id=drive-virtio-disk,format=qcow2,
> cache=none,aio=native,werror=stop,rerror=stop,media=disk,snapshot=off,bus=1,
> unit=1 -device
> virtio-blk-pci,scsi=off,drive=drive-virtio-disk,id=virtio-disk,bus=pci.0,
> addr=0x7,bootindex=1 -netdev
> tap,id=hostnet1,vhost=off,script=/etc/qemu-ifup,ifname=fuxc-net1 -device
> e1000,netdev=hostnet1,id=virtio-net-pci1,mac=00:01:02:03:04:06,bus=pci.0,
> addr=0xa,multifunction=off -netdev
> tap,id=hostnet2,vhost=off,script=/etc/qemu-ifup,ifname=fuxc-net-rtl8139
> -device
> e1000,netdev=hostnet2,id=virtio-net-pci2,mac=00:01:02:03:04:07,bus=pci.0,
> addr=0xb,multifunction=off -serial unix:/tmp/monitor2,server,nowait -vga qxl
> -vnc :1
> 
> Boot rhel7.0 guest with e1000 card and "-vga cirrus/std", vmcore can be
> generated successfully.

This probably means something is wrong in qxl.

Comment 5 jason wang 2014-05-09 09:38:55 UTC

Looking at the qxl interrupt handler: it always returns IRQ_HANDLED. This means that when qxl shares an IRQ with another device in the kdump kernel and that other device has a pending interrupt, there will be an infinite loop of IRQ processing. Since qxl returned IRQ_HANDLED, note_interrupt() won't treat the interrupt as spurious, so the line is never masked.

Something like this is needed:

diff --git a/drivers/gpu/drm/qxl/qxl_irq.c b/drivers/gpu/drm/qxl/qxl_irq.c
index 21393dc..f4b6b89 100644
--- a/drivers/gpu/drm/qxl/qxl_irq.c
+++ b/drivers/gpu/drm/qxl/qxl_irq.c
@@ -33,6 +33,9 @@ irqreturn_t qxl_irq_handler(DRM_IRQ_ARGS)
 
        pending = xchg(&qdev->ram_header->int_pending, 0);
 
+       if (!pending)
+               return IRQ_NONE;
+
        atomic_inc(&qdev->irq_received);
 
        if (pending & QXL_INTERRUPT_DISPLAY) {

Comment 10 jason wang 2014-05-12 02:48:19 UTC

(In reply to huiqingding from comment #9)
> Created attachment 894528 [details]
> serial log after do sysrq

Thanks for the testing.

Those call traces are expected. Since the crash kernel cannot reset the devices, you may see them if an IRQ is injected before 8139 or e1000 is initialized.

Will post the patch upstream first.

Comment 12 Jarod Wilson 2014-08-07 20:54:46 UTC

Patch(es) available on kernel-3.10.0-143.el7

Comment 20 David Jaša 2015-01-22 15:49:50 UTC

Jason, Huiqing,

When booting a 7.0 release guest (with kernel -123 as in the original report, so without the fix) on a 7.1 host, I can successfully generate a backtrace using the "echo c > /proc/sysrq-trigger" method. I'll try 7.0 on 7.0, but for now I'm not sure the reproducer is as reliable as claimed. I'll attach the generated qemu cli and the libvirt domain xml.

Comment 21 David Jaša 2015-01-22 15:57:03 UTC

Created attachment 982907 [details]
qemu cli and libvirt xml

Comment 22 jason wang 2015-01-23 02:51:24 UTC

(In reply to David Jaša from comment #20)
> Jason, Huiqing,
> 
> When booting 7.0 release guest (with kernel -123 as in original report so
> without the fix) on 7.1 host, I can successfully generate backtrace using
> "echo c > /proc/sysrq-trigger" method. I'll try 7.0 on 7.0 but for now, I'm
> not sure if the reproducer is as reliable as claimed. I'll attach the
> generated qemu cli and the libvirt domain xml.

Hi David:

I don't see e1000 or 8139 in your cli. Please make sure qxl is sharing an IRQ with another device (e.g. 8139 or e1000). You can check this by running "cat /proc/interrupts" in the guest. Virtio-net does not share an IRQ with qxl, so you probably won't reproduce the issue with it.

Thanks
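
The /proc/interrupts check suggested above can be sketched as a small filter. This assumes that handlers sharing a line are listed comma-separated at the end of each row; shared_irqs is a hypothetical helper name, and because the exact handler name registered by the qxl driver can vary by kernel version, the match is by prefix rather than by exact name.

```shell
#!/bin/sh
# Print the IRQ numbers on which a handler whose name starts with "$1"
# shares a line with at least one other handler.
# Reads /proc/interrupts-formatted text on stdin.
# Assumption: shared handlers appear comma-separated at the end of the row,
# e.g. " 10:  1234  0  IO-APIC-fasteoi  virtio0, qxl".
shared_irqs() {
    awk -v dev="$1" '
        /^ *[0-9]+:/ && /, / {
            n = split($0, names, /, /)
            # The first chunk still carries the counters; keep its last word.
            m = split(names[1], head, /[ \t]+/)
            names[1] = head[m]
            for (i = 1; i <= n; i++) {
                if (index(names[i], dev) == 1) {
                    irq = $1
                    sub(/:$/, "", irq)
                    print irq
                    break
                }
            }
        }'
}
```

In the guest this would be run as "shared_irqs qxl < /proc/interrupts". No output suggests qxl is not sharing a line with any other device, in which case the hang is unlikely to reproduce.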

Comment 23 David Jaša 2015-01-23 15:28:25 UTC

(In reply to jason wang from comment #22)
> ...
> I don't see e1000 or 8139 in your cli. Please make sure qxl is sharing irq
> with other device (e.g 8139 or e1000). You can check this through doing "cat
> /proc/interrupts" in guest. Virtio-net does not allow share irq with qxl, so
> you probably won't reproduce the issue.
> 
> Thanks

Point taken. I changed the network device to e1000, plugged it into a bridged network, and fed the interface with WOL packets from outside (to ensure IRQs keep arriving after IP deconfiguration). This did not make the bug occur on the 7.1 host, however; only the 7.0.z host was affected.

Here are the details: I reused the same fresh 7.0 guest for testing and ran it on a 7.0.z host and on a 7.1 host. The domain xml is the same on both hosts and /proc/interrupts looks the same (apart from per-CPU counts). The 7.1 host required adding an "allow br0" entry to /etc/qemu-kvm/bridge.conf so that a VM in a user session could use the <interface type='bridge'> setting.

Comment 24 David Jaša 2015-01-23 15:30:47 UTC

Created attachment 983399 [details]
rhel70.xml

Comment 25 David Jaša 2015-01-23 15:31:35 UTC

Created attachment 983400 [details]
/proc/interrupts on 7.0 host

Comment 26 David Jaša 2015-01-23 15:32:03 UTC

Created attachment 983401 [details]
/proc/interrupts on 7.1 host

Comment 27 David Jaša 2015-01-23 15:34:01 UTC

Created attachment 983404 [details]
qemu log on 7.1 host

Comment 28 David Jaša 2015-01-23 15:36:20 UTC

Created attachment 983405 [details]
qemu log on 7.0 host

Comment 29 David Jaša 2015-01-23 15:42:28 UTC

The WOL packets were sent by this loop, invoked on the host right before issuing c to /proc/sysrq-trigger in the guest:
i=0 ; while true ; do echo $i; ether-wake -i <bridge> <guest_mac_address> ; sleep 0.01 ; i=$(($i+1)) ; done

Comment 30 jason wang 2015-01-26 03:01:33 UTC

Thanks for the testing David.

In RHEL7.1 there's a kernel-side fix which may make it a little bit harder to reproduce:

commit f008d31b1c680230d934a18207a6909c97337af4
Author: John Snow <jsnow>
Date:   Fri Nov 14 23:32:36 2014 -0500

    [virt] kvm/ioapic: conditionally delay irq delivery during eoi broadcast

Please try a host kernel version lower than 205 to reproduce.

Thanks.
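
A quick way to tell whether a given host kernel falls below that build can be sketched as follows. It assumes the release string has the usual RHEL 7 shape printed by "uname -r" (for example 3.10.0-143.el7.x86_64); el7_build and predates_build_205 are hypothetical helper names, and the comparison is only meaningful for 3.10.0-*.el7 kernels.

```shell
#!/bin/sh
# Extract the build number from a RHEL 7 kernel release string such as
# "3.10.0-143.el7.x86_64" (assumed format, as printed by `uname -r`).
el7_build() {
    rel=${1#*-}                  # drop "3.10.0-"  -> "143.el7.x86_64"
    printf '%s\n' "${rel%%.*}"   # drop ".el7..."  -> "143"
}

# Succeed if the given release is older than build 205, the threshold
# named in this comment. Not meaningful for non-el7 kernels.
predates_build_205() {
    [ "$(el7_build "$1")" -lt 205 ]
}
```

On the host, "predates_build_205 "$(uname -r)" && echo older" would print "older" for a kernel such as 3.10.0-123.el7, which by the reasoning above is the kind of host where the hang should still be reproducible.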

Comment 35 huiqingding 2015-01-30 02:08:34 UTC

Tested this bug using the following versions:
kernel-3.10.0-227.el7.x86_64
qemu-kvm-rhev-2.1.2-23.el7.x86_64

The guest kernel is kernel-3.10.0-227.el7.x86_64

Steps to test:
1. boot a rhel7.1 guest:
/usr/libexec/qemu-kvm \
  -M pc \
  -cpu Opteron_G3 \
  -enable-kvm  \
  -m 4096 -smp 4,sockets=2,cores=2,threads=1 \
  -nodefconfig \
  -nodefaults \
  -monitor stdio \
  -name rhel7 \
  -device virtio-balloon-pci,id=ballooning,bus=pci.0,addr=0x5,indirect_desc=on,event_idx=on,multifunction=on,rombar=100 \
  -usbdevice tablet \
  -drive file=/mnt/rhel7_1_1222.qcow2,if=none,id=drive-scsi-disk,format=qcow2,cache=writethrough,werror=stop,rerror=stop \
  -device virtio-scsi-pci,id=scsi1,addr=0x13 \
  -device scsi-hd,drive=drive-scsi-disk,bus=scsi1.0,id=data-disk2,bootindex=1 \
  -netdev tap,id=hostnet1,vhost=off,script=/etc/qemu-ifup,ifname=fuxc-net1 \
  -device e1000,netdev=hostnet1,id=virtio-net-pci1,mac=00:01:02:03:04:06,bus=pci.0,addr=0xa,multifunction=off \
  -serial unix:/tmp/monitor2,server,nowait \
  -vga qxl \
  -vnc :1

2. inside guest, do sysrq
# echo c > /proc/sysrq-trigger
3.

Actual results:
The guest reboots automatically and a vmcore is generated.

Based on the above results, I think this bug has been fixed.

Comment 36 juzhang 2015-01-30 02:12:09 UTC

Per comment 35, KVM QE plans to set this issue as verified. If Desktop QE has more testing, feel free to add it and update the status accordingly.

Best Regards,
Junyi

Comment 37 jason wang 2015-01-30 02:50:36 UTC

(In reply to huiqingding from comment #35)
> Test this bug using the following version:
> kernel-3.10.0-227.el7.x86_64
> qemu-kvm-rhev-2.1.2-23.el7.x86_64
> 
> The guest kernel is kernel-3.10.0-227.el7.x86_64
> 
> Steps to test:
> 1. boot a rhel7.1 guest:
> /usr/libexec/qemu-kvm \
>   -M pc \
>   -cpu Opteron_G3 \
>   -enable-kvm  \
>   -m 4096 -smp 4,sockets=2,cores=2,threads=1 \
>   -nodefconfig \
>   -nodefaults \
>   -monitor stdio \
>   -name rhel7 \
>   -device
> virtio-balloon-pci,id=ballooning,bus=pci.0,addr=0x5,indirect_desc=on,
> event_idx=on,multifunction=on,rombar=100 \
>   -usbdevice tablet \
>   -drive
> file=/mnt/rhel7_1_1222.qcow2,if=none,id=drive-scsi-disk,format=qcow2,
> cache=writethrough,werror=stop,rerror=stop \
>   -device virtio-scsi-pci,id=scsi1,addr=0x13 \
>   -device
> scsi-hd,drive=drive-scsi-disk,bus=scsi1.0,id=data-disk2,bootindex=1 \
>   -netdev tap,id=hostnet1,vhost=off,script=/etc/qemu-ifup,ifname=fuxc-net1 \
>   -device
> e1000,netdev=hostnet1,id=virtio-net-pci1,mac=00:01:02:03:04:06,bus=pci.0,
> addr=0xa,multifunction=off \
>   -serial unix:/tmp/monitor2,server,nowait \
>   -vga qxl \
>   -vnc :1
> 
> 2. inside guest, do sysrq
> # echo c > /proc/sysrq-trigger
> 3.
> 
> Actual results:
> the guest can reboot automatically and vmcore can be generated.
> 
> Based on the above results, I think this bug has been fixed.

See comment #30. It is better to verify this bug on a host kernel lower than 205 or on a RHEL6 host.

Thanks

Comment 38 huiqingding 2015-01-30 03:08:15 UTC

Hi, Jason,

Thanks for the reminder.

I also tested a RHEL7.1 guest on a RHEL6 host:
kernel-2.6.32-524.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.451.el6.x86_64

The guest kernel is kernel-3.10.0-227.el7.x86_64

The test steps are the same as in comment 35, and the result is OK: the guest reboots automatically and a vmcore is generated.

Jason, could you help confirm whether this bug has been fixed?

Best regards
Huiqing

Comment 39 jason wang 2015-01-30 05:06:53 UTC

(In reply to huiqingding from comment #38)
> Hi, Jason,
> 
> Thanks for reminding.
> 
> I also test RHEL7.1 guest on RHEL6 host:
> kernel-2.6.32-524.el6.x86_64
> qemu-kvm-rhev-0.12.1.2-2.451.el6.x86_64
> 
> The guest kernel is kernel-3.10.0-227.el7.x86_64
> 
> The test steps are same as comment 35, the result is ok, the guest can
> reboot automatically and vmcore can be generated.
> 
> Jason, will you help to confirm whether this bug has been fixed?
> 
> Best regards
> Huiqing

Yes, I confirm.

Thanks

Comment 42 errata-xmlrpc 2015-03-05 12:05:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0290.html
