| Summary: | BUG() hit after patch set 582435 | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Ross Walker <rswwalker> |
| Component: | kernel | Assignee: | Red Hat Kernel Manager <kernel-mgr> |
| Status: | CLOSED WONTFIX | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 5.6 | CC: | florian, jmoyer, rwheeler |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2011-04-12 18:48:01 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Hi Ross,

Have you seen this upstream? If so, which kernels?

Thanks!

First, drbd isn't supported. Second, your module is modifying object lifetimes of the very data structure that is causing problems in that BUG(). So, the onus is on you to prove that your module isn't the problem. I'm going to close this as NOTABUG. If you can somehow prove that this is a more generic problem, I'll be happy to revisit.

(In reply to comment #3)
> First, drbd isn't supported.

Please, allow me to clarify here (for people hitting this bug via a Bugzilla search). While DRBD isn't supported by Red Hat as part of RHEL, LINBIT is a Red Hat Ready ISV with an existing support procedure for mutual customers. So customers on both LINBIT and Red Hat support contracts will not be forced to forgo Red Hat support just because they are using DRBD.

More information is available on the Partner Locator page at https://www.redhat.com/wapps/partnerlocator/web/home.html#productId=60400

(In reply to comment #3)
> First, drbd isn't supported. Second, your module is modifying object lifetimes
> of the very data structure that is causing problems in that BUG(). So, the
> onus is on you to prove that your module isn't the problem. I'm going to
> close this as NOTABUG. If you can somehow prove that this is a more generic
> problem, I'll be happy to revisit.

As far as LINBIT's DRBD not being supported, they are an authorized partner: https://www.redhat.com/wapps/partnerlocator/web/home.html#productId=60400

Besides that, though, I will see if I can have the user reproduce the problem without drbd in the stack. I don't get the BUG() on ANY OTHER kernel but yours, and IET is run on ALL distributions as well as several embedded systems. The BUG() didn't occur on RH kernels before the patch set, and now it occurs. Is there an outside authority I could bring in on this conversation to validate my claims?

-Ross

Just for clarity, Red Hat bugzilla is not meant to be a tool for escalation from the field or partners.
Please do file issues via your normal support contact (LINBIT via the partner channel, those with subscriptions via RH support).

(In reply to comment #2)
> Hi Ross,
> Have you seen this upstream? If so, which kernels?
> Thanks!

I haven't seen this on any other upstream kernels and never saw it on any RH kernels prior to 2.6.18-238. I "suspect" that removing the spin_locks around cfq_exit_lock in bug 582435 may have inadvertently removed an implicit memory barrier as a side effect.

If Red Hat is to come out and officially say that RHEL5 does NOT support shared io contexts of any kind, then at least I will have that to go with, and I can remove that code for RHEL5 and put a notice in IET that RHEL5 users must use 'deadline' to achieve any adequate performance.

-Ross

OK, I had a look back through the RHEL 5 sources (getting quite long in the tooth already). You are right, io_context sharing is not supported by RHEL 5. I had looked into implementing this support some time ago, but it was too invasive, and it definitely broke the kABI.

Now, I wouldn't write off the performance of CFQ without testing. There is support for merging cfqq's, so long as processes are interleaving sequential I/O. Of course, for hardware other than SATA disks, I would recommend deadline anyway.

I'll change the state of this bug to WONTFIX, to hopefully at least give you a moral victory. ;-) Sorry for the inconvenience.

Cheers!

(In reply to comment #8)
> OK, I had a look back through the RHEL 5 sources (getting quite long in the
> tooth already). You are right, io_context sharing is not supported by RHEL 5.
> I had looked into implementing this support some time ago, but it was too
> invasive, and definitely broke the kABI.
> Now, I wouldn't write off the performance of CFQ without testing. There is
> support for merging cfqq's, so long as processes are interleaving sequential
> I/O. Of course, for hardware other than SATA disks, I would recommend
> deadline, anyway.
> I'll change the state of this bug to WONTFIX, to hopefully at least give you a
> moral victory. ;-) Sorry for the inconvenience.
> Cheers!

OK, I just looked back in bugzilla and saw bug #427709, which supposedly fixes the performance issues with CFQ and cooperating threads/processes. I'll retract the ioc sharing for kernels < 2.6.24 in the code, and put a caveat in the readmes saying that those running RHEL5 kernels < 5.5 should consider the 'deadline' scheduler over CFQ due to bug #427709.

And thanks, I'll take any small victory.

-Ross
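Ross's hypothesis above, that dropping the spin_lock/unlock pair around cfq_exit_lock also dropped an implicit memory barrier, can be illustrated with a small userspace sketch. This is NOT the RHEL 5 kernel code; it is a C11 analogy in which a release store / acquire load pair plays the role of the ordering a lock acquire/release would provide, and the names (`publish`, `try_consume`) are invented for illustration.

```c
/* Userspace C11 sketch of the "implicit barrier" hypothesis, not the
 * actual cfq code.  A spin_lock/unlock pair orders memory accesses much
 * like the release/acquire pairing below; removing the lock removes
 * that ordering, so a reader may observe the flag without the data. */
#include <assert.h>
#include <stdatomic.h>

static int payload;        /* data published by the writer */
static _Atomic int ready;  /* publication flag */

/* Writer: the release store guarantees the payload write is visible
 * before 'ready' can be observed as 1 (a lock's unlock does the same). */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Reader: the acquire load pairs with the release store above; if it
 * sees ready == 1, it is guaranteed to also see the payload.  With
 * plain accesses (as after removing the lock), no such pairing exists. */
int try_consume(int *out)
{
    if (atomic_load_explicit(&ready, memory_order_acquire)) {
        *out = payload;
        return 1;
    }
    return 0;
}
```

Whether the RHEL 5 change actually lost such a pairing is exactly what this bug left unresolved; the sketch only shows what "implicit barrier as a side effect" means.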
Description of problem:
An iSCSI Enterprise Target user reports hitting a kernel BUG() at block/cfq-iosched.c:1883 with kernel 2.6.18-238.5.1.el5, but not with 5.5 kernels.

Version-Release number of selected component (if applicable):
Kernel 2.6.18-238.5.1.el5

How reproducible:
Very

Steps to Reproduce:
1. Start the IET service
2. Perform IO to a target
3. While IO is being performed, restart the IET service

Actual results:

```
Kernel BUG at block/cfq-iosched.c:1883
invalid opcode: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:0a.0/0000:05:00.1/irq
CPU 4
Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat ip_nat xt_state ip_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_addr iscsi_tcp bnx2i cnic uio cxgb3i iw_cxgb3 ib_core cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi iscsi_trgt(U) ipmi_devintf ipmi_si ipmi_msghandler bonding drbd(U) autofs4 ipv6 xfrm_nalgo crypto_api video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport ksm(U) kvm_intel(U) kvm(U) sg ata_piix libata i7core_edac bnx2 edac_mc tpm_tis tpm tpm_bios serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_round_robin dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_dh dm_snapshot dm_zero dm_mirror dm_log dm_mod shpchp megaraid_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 17366, comm: istiod2 Tainted: G    2.6.18-238.5.1.el5 #1
RIP: 0010:[<ffffffff8001c850>]  [<ffffffff8001c850>] cfq_set_request+0x12f/0x454
RSP: 0018:ffff810037b51880  EFLAGS: 00010246
RAX: ffff81012e317400 RBX: ffff810121e744a8 RCX: 0000000000000000
RDX: ffff81011b1fca90 RSI: ffff8101186ed728 RDI: ffff810121e74528
RBP: ffff81011b1fca60 R08: ffff81012e0a5c00 R09: 0000000000000000
R10: ffff810121e744a8 R11: 0000000000000080 R12: ffff810121e74508
R13: ffff810121e744a8 R14: ffff81012e317400 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff8101043a4240(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000011d01890 CR3: 000000010d0d1000 CR4: 00000000000026e0
Process istiod2 (pid: 17366, threadinfo ffff810037b50000, task ffff81012c5cf040)
Stack:  0000000000000010 000000102dc34cc0 ffff8100379ec5f8 ffff81012e31d928
 0000000040000001 ffff81012c5cf040 0000000000000000 000000001777a010
 6e016f0169017401 ffff81012e31d948 ffff81012e31d928 ffff8100379ec5f8
Call Trace:
 [<ffffffff800135dd>] get_request+0x220/0x2f2
 [<ffffffff80028d3b>] get_request_wait+0x2a/0x130
 [<ffffffff8000c1ea>] __make_request+0x406/0x4ca
 [<ffffffff8005c2de>] cache_alloc_refill+0x86/0x186
 [<ffffffff8001c414>] generic_make_request+0x211/0x228
 [<ffffffff880e03f2>] :dm_mod:__map_bio+0x4e/0x125
 [<ffffffff880e0f39>] :dm_mod:__split_bio+0x176/0x3b0
 [<ffffffff880e1919>] :dm_mod:dm_request+0x115/0x124
 [<ffffffff8001c414>] generic_make_request+0x211/0x228
 [<ffffffff8001ac96>] bio_alloc_bioset+0x89/0xd9
 [<ffffffff8847ebf0>] :drbd:drbd_make_request_common+0xcd8/0xd0c
 [<ffffffff8847f63c>] :drbd:drbd_make_request_26+0xa18/0xaa8
 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8005c35e>] cache_alloc_refill+0x106/0x186
 [<ffffffff8001c414>] generic_make_request+0x211/0x228
 [<ffffffff800233cc>] mempool_alloc+0x31/0xe7
 [<ffffffff80033550>] submit_bio+0xe4/0xeb
 [<ffffffff885166ff>] :iscsi_trgt:blockio_make_request+0x1a6/0x213
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff88514731>] :iscsi_trgt:build_write_response+0x9d/0xd7
 [<ffffffff8850f3fc>] :iscsi_trgt:send_scsi_rsp+0xb/0x4a
 [<ffffffff88514458>] :iscsi_trgt:disk_execute_cmnd+0x103/0x25a
 [<ffffffff88511307>] :iscsi_trgt:worker_thread+0x136/0x2af
 [<ffffffff8008e40a>] default_wake_function+0x0/0xe
 [<ffffffff885111d1>] :iscsi_trgt:worker_thread+0x0/0x2af
 [<ffffffff80032af3>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a269c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800329f5>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Code: 0f 0b 68 a2 36 2c 80 c2 5b 07 48 89 f0 48 8b 32 48 85 f6 75
RIP  [<ffffffff8001c850>] cfq_set_request+0x12f/0x454
 RSP <ffff810037b51880>
<0>Kernel panic - not syncing: Fatal exception
```

Expected results:
No OOPS :-)

Additional info:
iSCSI Enterprise Target attempts to use shared IO contexts to avoid performance issues with CFQ in conjunction with cooperating IO streams coming from kernel space (much like nfsiod).

It creates/assigns the shared io context before the thread starts:

```c
if (info->wthread_ioc &&
    atomic_inc_not_zero(&info->wthread_ioc->refcount)) {
	put_io_context(current->io_context);
	current->io_context = info->wthread_ioc;
} else
	info->wthread_ioc = get_io_context(GFP_KERNEL);
```

And removes the shared io context when the thread finishes:

```c
if (current->io_context) {
	struct io_context *ioc;

	task_lock(current);
	ioc = current->io_context;
	current->io_context = NULL;
	task_unlock(current);
	put_io_context(ioc);
}
```

The BUG() only happens if there is active IO during the restart. On shutdown, IET halts all IO before it tears down its threads, clears their iocs, and closes their block devices, but some IO may still be occurring on a sub-layer of a layered block device (i.e. drbd). This, of course, only happens with CFQ.

I have told my user base that the deadline scheduler is an adequate work-around until the problem can be solved.

-Ross
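The attach/detach snippets above hinge on the io_context refcount. The following is a minimal userspace model of that lifetime rule, not the RHEL 5 kernel API; `struct ioc`, `ioc_get_shared`, and `ioc_put` are invented names standing in for the refcount, `atomic_inc_not_zero()`, and `put_io_context()` respectively.

```c
/* Userspace model of the shared io_context lifetime.  Invented names;
 * this is an illustration of the refcount contract, not kernel code. */
#include <assert.h>
#include <stdatomic.h>

struct ioc {
	_Atomic int refcount;
	int freed;  /* stands in for the real teardown of the object */
};

/* Mirrors atomic_inc_not_zero(): take a reference only while the object
 * is still live; returns 0 if the refcount already reached zero. */
int ioc_get_shared(struct ioc *ioc)
{
	int old = atomic_load(&ioc->refcount);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&ioc->refcount, &old, old + 1))
			return 1;
	}
	return 0;
}

/* Mirrors put_io_context(): drop a reference, tear down on the last put. */
void ioc_put(struct ioc *ioc)
{
	if (atomic_fetch_sub(&ioc->refcount, 1) == 1)
		ioc->freed = 1;
}
```

Under this contract, the crash scenario described above fits a lower layer (drbd) still issuing IO against a context whose last reference the restart path has already dropped: any late user must take its own `ioc_get_shared()` before the final `ioc_put()`, or it is touching freed memory, which is exactly the object-lifetime concern raised in comment #3.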