Bug 1686705
Summary: | [usb][pci-ohci][fast-train]Boot guest with 38 usb disks on the same pci-ohci usb controller,when ohci is stressed,the guest generates a calltrace | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux Advanced Virtualization | Reporter: | Minjia Cai <micai> |
Component: | qemu-kvm | Assignee: | Laurent Vivier <lvivier> |
Status: | CLOSED ERRATA | QA Contact: | Minjia Cai <micai> |
Severity: | low | Docs Contact: | |
Priority: | low | ||
Version: | 8.1 | CC: | ddepaula, dgibson, knoel, kraxel, lvivier, mdeng, micai, ngu, qzhang, rbalakri, virt-maint |
Target Milestone: | rc | ||
Target Release: | 8.1 | ||
Hardware: | ppc64le | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | qemu-kvm-4.0.0-3.module+el8.1.0+3265+26c4ed71 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2019-11-06 07:13:36 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Minjia Cai
2019-03-08 04:45:33 UTC
I'll test it on rhel7.6 next week. Confirm for regression. thanks According to comment0 above and rhel7.6(host and guest) did the same operation, the guest has also had the following Calltrace: [ 173.562420] blk_update_request: I/O error, dev sdz, sector 128 [ 173.562424] Buffer I/O error on dev sdz, logical block 1, lost async page write [ 173.668945] irq 18: nobody cared (try booting with the "irqpoll" option) [ 173.668951] CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Not tainted 3.10.0-957.10.1.el7.ppc64 #1 [ 173.668952] Call Trace: [ 173.669139] [c00000007ffd3690] [c00000000001cca8] .show_stack+0x88/0x330 (unreliable) [ 173.669241] [c00000007ffd3750] [c000000000ac4d08] .dump_stack+0x28/0x3c [ 173.669276] [c00000007ffd37c0] [c00000000020393c] .note_interrupt+0x2ec/0x460 [ 173.669278] [c00000007ffd3870] [c0000000001ff860] .handle_irq_event+0xa0/0x100 [ 173.669280] [c00000007ffd3900] [c000000000205294] .handle_fasteoi_irq+0xd4/0x240 [ 173.669282] [c00000007ffd3980] [c0000000001fe684] .generic_handle_irq+0x54/0x80 [ 173.669283] [c00000007ffd3a00] [c000000000015e6c] .__do_irq+0x8c/0x1b0 [ 173.669285] [c00000007ffd3a80] [c00000000001608c] .do_IRQ+0xfc/0x110 [ 173.669287] [c00000007ffd3b30] [c00000000000ab00] restore_check_irq_replay+0x2c/0x70 [ 173.669289] --- Exception: 501 at .arch_local_irq_restore+0xf0/0x140 LR = .arch_local_irq_restore+0xf0/0x140 [ 173.669357] [c00000007ffd3e20] [c000000001702280] jiffies+0x0/0x80 (unreliable) [ 173.669359] [c00000007ffd3e90] [c0000000000feb84] .__do_softirq+0xf4/0x390 [ 173.669360] [c00000007ffd3f90] [c00000000002bf8c] .call_do_softirq+0x14/0x24 [ 173.669362] [c00000007eeb3740] [c0000000000162a0] .do_softirq+0x120/0x170 [ 173.669363] [c00000007eeb37d0] [c0000000000ff174] .irq_exit+0x1e4/0x1f0 [ 173.669365] [c00000007eeb3860] [c000000000025b2c] .timer_interrupt+0x9c/0xd0 [ 173.669367] [c00000007eeb38e0] [c00000000000ab28] restore_check_irq_replay+0x54/0x70 [ 173.669397] --- Exception: 901 at .plpar_hcall_norets+0x84/0xd4 LR = .shared_cede_loop+0x48/0x90 [ 173.669399] [c00000007eeb3bd0] [c000000000864484] .shared_cede_loop+0x34/0x90 (unreliable) [ 173.669402] [c00000007eeb3c50] [c000000000861a58] .cpuidle_idle_call+0x128/0x430 [ 173.669403] [c00000007eeb3d30] [c0000000000aa4b8] .pseries_lpar_idle+0x18/0x60 [ 173.669405] [c00000007eeb3da0] [c00000000001d68c] .arch_cpu_idle+0x6c/0x160 [ 173.669407] [c00000007eeb3e20] [c00000000018acf0] .cpu_startup_entry+0x170/0x1e0 [ 173.669409] [c00000007eeb3ed0] [c000000000056860] .start_secondary+0x310/0x340 [ 173.669410] [c00000007eeb3f90] [c000000000009b6c] start_secondary_prolog+0x10/0x14 [ 173.669411] handlers: [ 173.669413] [<c0000000016a98c8>] .usb_hcd_irq [ 173.669414] Disabling IRQ #18 So this is not a regression bug,thanks Minjia, Thanks for the confirmation that this isn't a regression. This is probably a bug under load in the OHCI qemu device, which isn't very widely used or tested. Since we're now encouraging people to use XHCI most of the time, this will remain low priority and may eventually be closed WONTFIX. I think the stress condition can cause an overflow on the link counter: qemu/hw/usb/hcd-ohci.c: ... 44 45 #define ED_LINK_LIMIT 32 46 ... 1182 /* Service an endpoint list. Returns nonzero if active TD were found. */ 1183 static int ohci_service_ed_list(OHCIState *ohci, uint32_t head, int completion) 1184 { ... 1204 if (++link_cnt > ED_LINK_LIMIT) { 1205 ohci_die(ohci); 1206 return 0; 1207 } ... This limit has been indroduced by: commit 95ed56939eb2eaa4e2f349fe6dcd13ca4edfd8fb Author: Li Qiang <liqiang6-s> Date: Tue Feb 7 02:23:33 2017 -0800 usb: ohci: limit the number of link eds The guest may builds an infinite loop with link eds. This patch limit the number of linked ed to avoid this. Signed-off-by: Li Qiang <liqiang6-s> Message-id: 5899a02e.45ca240a.6c373.93c1.com Signed-off-by: Gerd Hoffmann <kraxel> and increased from 4 to 32 by: commit ab6b1105a2259c7072905887f71caa850ce63190 Author: Gerd Hoffmann <kraxel> Date: Tue Mar 7 09:40:18 2017 +0100 ohci: relax link check The strict td link limit added by commit "95ed569 usb: ohci: limit the number of link eds" causes problems with macos guests. Lets raise the limit. Reported-by: Programmingkid <programmingkidx> Reported-by: Howard Spoelstra <hsp.cat7> Signed-off-by: Gerd Hoffmann <kraxel> Reviewed-by: BALATON Zoltan <balaton.hu> Reviewed-by: John Arbuckle <programmingkidx> Message-id: 1488876018-31576-1-git-send-email-kraxel I have downloaded comment6 package, test for 10 times, inside the guest did not produce calltrace. But I have found in dmesg once this error(Then I did not see this error again): [ 18.263280] sd 39:0:0:0: Power-on or device reset occurred [ 18.274607] sd 39:0:0:0: [sdal] 20480 512-byte logical blocks: (10.5 MB/10.0 MiB) [ 18.286287] sd 39:0:0:0: [sdal] Write Protect is off [ 18.286297] sd 39:0:0:0: [sdal] Mode Sense: 63 00 00 08 [ 18.292713] sd 39:0:0:0: [sdal] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 18.369250] sd 39:0:0:0: [sdal] Attached SCSI removable disk [ 525.094612] usb 1-3.8.8.8.8.2: reset full-speed USB device number 38 using ohci-pci [ 526.342169] usb 1-3.8.8.8.8-port2: cannot reset (err = -110) [ 527.404688] usb 1-3.8.8.8.8-port2: cannot reset (err = -110) [ 528.415348] usb 1-3.8.8.8.8-port2: cannot reset (err = -110) [ 529.456521] usb 1-3.8.8.8.8-port2: cannot reset (err = -110) [ 534.064617] usb 1-3.8.8.8.3: reset full-speed USB device number 31 using ohci-pci [ 534.334606] usb 1-3.8.8.4: reset full-speed USB device number 24 using ohci-pci [ 534.604605] usb 1-3.8.8.6: reset full-speed USB device number 26 using ohci-pci [ 534.874601] usb 1-3.8.2: reset full-speed USB device number 14 using ohci-pci [ 534.981245] ohci-pci 0000:00:01.0: leak ed 0000000028eab42d (#00) state 2 [ 535.114603] usb 1-3.8.8.8.8.8: reset full-speed USB device number 44 using ohci-pci [ 535.238339] ohci-pci 0000:00:01.0: leak ed 0000000022370ecd (#00) state 2 host: [root@ibm-p8-garrison-03 ~]# rpm -qa|grep qemu qemu-kvm-block-curl-3.1.0-18.el8.BZ1686705.ppc64le qemu-img-3.1.0-18.el8.BZ1686705.ppc64le qemu-guest-agent-debuginfo-3.1.0-18.el8.BZ1686705.ppc64le qemu-kvm-block-rbd-debuginfo-3.1.0-18.el8.BZ1686705.ppc64le qemu-kvm-tests-debuginfo-3.1.0-18.el8.BZ1686705.ppc64le qemu-kvm-debugsource-3.1.0-18.el8.BZ1686705.ppc64le qemu-kvm-common-3.1.0-18.el8.BZ1686705.ppc64le qemu-kvm-block-iscsi-3.1.0-18.el8.BZ1686705.ppc64le qemu-kvm-block-ssh-3.1.0-18.el8.BZ1686705.ppc64le qemu-kvm-core-3.1.0-18.el8.BZ1686705.ppc64le qemu-kvm-tests-3.1.0-18.el8.BZ1686705.ppc64le qemu-img-debuginfo-3.1.0-18.el8.BZ1686705.ppc64le qemu-kvm-block-iscsi-debuginfo-3.1.0-18.el8.BZ1686705.ppc64le qemu-kvm-block-ssh-debuginfo-3.1.0-18.el8.BZ1686705.ppc64le qemu-kvm-core-debuginfo-3.1.0-18.el8.BZ1686705.ppc64le qemu-guest-agent-3.1.0-18.el8.BZ1686705.ppc64le ipxe-roms-qemu-20181214-1.git133f4c47.el8.noarch qemu-kvm-debuginfo-3.1.0-18.el8.BZ1686705.ppc64le qemu-kvm-block-rbd-3.1.0-18.el8.BZ1686705.ppc64le qemu-kvm-3.1.0-18.el8.BZ1686705.ppc64le qemu-kvm-block-curl-debuginfo-3.1.0-18.el8.BZ1686705.ppc64le qemu-kvm-common-debuginfo-3.1.0-18.el8.BZ1686705.ppc64le [root@ibm-p8-garrison-03 ~]# rpm -qa|grep SLOF SLOF-20180702-3.git9b7ab2f.module+el8+2717+98011079.noarch [root@ibm-p8-garrison-03 ~]# UNAME -R -bash: UNAME: command not found [root@ibm-p8-garrison-03 ~]# uname -r 4.18.0-80.el8.ppc64le (In reply to Minjia Cai from comment #8) > I have downloaded comment6 package, test for 10 times, inside the guest did > not produce calltrace. b This seems not ppc64le specific. I don't think we can increase the ED_LINK_LIMIT indefinitely, but what could be the good value? Gerd, do you think we should increase ED_LINK_LIMIT? > But I have found in dmesg once this error(Then I did not see this error > again): > [ 18.263280] sd 39:0:0:0: Power-on or device reset occurred > [ 18.274607] sd 39:0:0:0: [sdal] 20480 512-byte logical blocks: (10.5 > MB/10.0 MiB) > [ 18.286287] sd 39:0:0:0: [sdal] Write Protect is off > [ 18.286297] sd 39:0:0:0: [sdal] Mode Sense: 63 00 00 08 > [ 18.292713] sd 39:0:0:0: [sdal] Write cache: enabled, read cache: > enabled, doesn't support DPO or FUA > [ 18.369250] sd 39:0:0:0: [sdal] Attached SCSI removable disk > [ 525.094612] usb 1-3.8.8.8.8.2: reset full-speed USB device number 38 > using ohci-pci > [ 526.342169] usb 1-3.8.8.8.8-port2: cannot reset (err = -110) > [ 527.404688] usb 1-3.8.8.8.8-port2: cannot reset (err = -110) > [ 528.415348] usb 1-3.8.8.8.8-port2: cannot reset (err = -110) > [ 529.456521] usb 1-3.8.8.8.8-port2: cannot reset (err = -110) This is a timeout. I think this is caused by the stress test. I don't think we can do anything for that. Instead of raising the limit we could try to not ohci_dir(), but just stop processing the descriptor list. The next frame timer tick should resume things. Something like this (untested): diff --git a/hw/usb/hcd-ohci.c b/hw/usb/hcd-ohci.c index 196a9f72002d..81cf5ab7a5a7 100644 --- a/hw/usb/hcd-ohci.c +++ b/hw/usb/hcd-ohci.c @@ -1200,7 +1200,7 @@ static int ohci_service_ed_list(OHCIState *ohci, uint32_t head, int completion) if (head == 0) return 0; - for (cur = head; cur; cur = next_ed) { + for (cur = head; cur && link_cnt++ < ED_LINK_LIMIT; cur = next_ed) { if (ohci_read_ed(ohci, cur, &ed)) { trace_usb_ohci_ed_read_error(cur); ohci_die(ohci); @@ -1209,11 +1209,6 @@ static int ohci_service_ed_list(OHCIState *ohci, uint32_t head, int completion) next_ed = ed.next & OHCI_DPTR_MASK; - if (++link_cnt > ED_LINK_LIMIT) { - ohci_die(ohci); - return 0; - } - if ((ed.head & OHCI_ED_H) || (ed.flags & OHCI_ED_K)) { uint32_t addr; /* Cancel pending packets for ED that have been paused. */ Note on comment 9: > This seems not ppc64le specific. It's not ppc specific from a code or upstream point of view, but it is from a downstream testing point of view, since only POWER uses the ohci device. Patch sent upstream: ohci: don't die on ED_LINK_LIMIT overflow https://patchwork.ozlabs.org/patch/1059900/ Patch is now merged: commit ab8789987f5e8d09b71e7425f3980b259967e17c Author: Laurent Vivier <lvivier> Date: Thu Mar 21 09:52:12 2019 +0100 ohci: don't die on ED_LINK_LIMIT overflow Stop processing the descriptor list instead. The next frame timer tick will resume the work Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1686705 Suggested-by: Gerd Hoffmann <kraxel> Signed-off-by: Laurent Vivier <lvivier> Message-id: 20190321085212.10796-1-lvivier Signed-off-by: Gerd Hoffmann <kraxel> I've downloaded the latest qemu package and tested it several times using the comment0 test steps, Check guest dmesg, didn't find any error and calltrace. HOST: [root@ibm-p9b-04 kar]# rpm -qa|grep qemu qemu-kvm-common-4.0.0-3.module+el8.1.0+3265+26c4ed71.ppc64le qemu-kvm-block-rbd-4.0.0-3.module+el8.1.0+3265+26c4ed71.ppc64le ipxe-roms-qemu-20181214-1.git133f4c47.el8.noarch qemu-kvm-core-4.0.0-3.module+el8.1.0+3265+26c4ed71.ppc64le libvirt-daemon-driver-qemu-5.4.0-1.module+el8.1.0+3304+7eb41d5f.ppc64le qemu-img-4.0.0-3.module+el8.1.0+3265+26c4ed71.ppc64le qemu-kvm-block-curl-4.0.0-3.module+el8.1.0+3265+26c4ed71.ppc64le qemu-kvm-block-iscsi-4.0.0-3.module+el8.1.0+3265+26c4ed71.ppc64le qemu-kvm-block-ssh-4.0.0-3.module+el8.1.0+3265+26c4ed71.ppc64le qemu-kvm-4.0.0-3.module+el8.1.0+3265+26c4ed71.ppc64le (workspace) [root@ibm-p9b-04 kar]# uname -r 4.18.0-103.el8.ppc64le guest dmesg: [ 15.168178] usb 1-3.8.8.8.8.8: New USB device found, idVendor=46f4, idProduct=0001, bcdDevice= 0.00 [ 15.168181] usb 1-3.8.8.8.8.8: New USB device strings: Mfr=1, Product=2, SerialNumber=3 [ 15.168183] usb 1-3.8.8.8.8.8: Product: QEMU USB HARDDRIVE [ 15.168185] usb 1-3.8.8.8.8.8: Manufacturer: QEMU [ 15.168186] usb 1-3.8.8.8.8.8: SerialNumber: 1-pci@800000020000000:01.0-3.8.8.8.8.8 [ 15.172185] usb-storage 1-3.8.8.8.8.8:1.0: USB Mass Storage device detected [ 15.203352] scsi host39: usb-storage 1-3.8.8.8.8.8:1.0 [ 15.224119] sd 35:0:0:0: [sdah] Attached SCSI removable disk [ 15.370129] scsi 36:0:0:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5 [ 15.370392] sd 36:0:0:0: Attached scsi generic sg34 type 0 [ 15.384369] sd 36:0:0:0: Power-on or device reset occurred [ 15.393156] sd 36:0:0:0: [sdai] 20480 512-byte logical blocks: (10.5 MB/10.0 MiB) [ 15.400644] sd 36:0:0:0: [sdai] Write Protect is off [ 15.400647] sd 36:0:0:0: [sdai] Mode Sense: 63 00 00 08 [ 15.406540] sd 36:0:0:0: [sdai] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 15.469261] sd 36:0:0:0: [sdai] Attached SCSI removable disk [ 15.690631] scsi 37:0:0:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5 [ 15.693029] sd 37:0:0:0: Attached scsi generic sg35 type 0 [ 15.700360] sd 37:0:0:0: Power-on or device reset occurred [ 15.710477] sd 37:0:0:0: [sdaj] 20480 512-byte logical blocks: (10.5 MB/10.0 MiB) [ 15.716494] sd 37:0:0:0: [sdaj] Write Protect is off [ 15.716496] sd 37:0:0:0: [sdaj] Mode Sense: 63 00 00 08 [ 15.722338] sd 37:0:0:0: [sdaj] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 15.782166] sd 37:0:0:0: [sdaj] Attached SCSI removable disk [ 15.935216] scsi 38:0:0:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5 [ 15.935443] sd 38:0:0:0: Attached scsi generic sg36 type 0 [ 15.945187] sd 38:0:0:0: Power-on or device reset occurred [ 15.955426] sd 38:0:0:0: [sdak] 20480 512-byte logical blocks: (10.5 MB/10.0 MiB) [ 15.962447] sd 38:0:0:0: [sdak] Write Protect is off [ 15.962450] sd 38:0:0:0: [sdak] Mode Sense: 63 00 00 08 [ 15.967142] sd 38:0:0:0: [sdak] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 16.032559] sd 38:0:0:0: [sdak] Attached SCSI removable disk [ 16.248164] scsi 39:0:0:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5 [ 16.248390] sd 39:0:0:0: Attached scsi generic sg37 type 0 [ 16.258167] sd 39:0:0:0: Power-on or device reset occurred [ 16.268181] sd 39:0:0:0: [sdal] 20480 512-byte logical blocks: (10.5 MB/10.0 MiB) [ 16.274258] sd 39:0:0:0: [sdal] Write Protect is off [ 16.274260] sd 39:0:0:0: [sdal] Mode Sense: 63 00 00 08 [ 16.280202] sd 39:0:0:0: [sdal] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 16.341233] sd 39:0:0:0: [sdal] Attached SCSI removable disk [ 20.121915] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready [ 25.023824] fuse init (API version 7.27) I think this bug can be verified. Please feel free to contact me if you have any questions Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:3723 |