456047 – Kernel Panic at end_bio_bh_io_sync+44

Bug 456047 - Kernel Panic at end_bio_bh_io_sync+44

Summary: Kernel Panic at end_bio_bh_io_sync+44

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.6.z
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Milan Broz
QA Contact:	Gris Ge
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	461304 583726
TreeView+	depends on / blocked

Reported:	2008-07-21 08:32 UTC by Qian Cai
Modified:	2018-11-14 15:58 UTC (History)
CC List:	17 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-02-16 15:20:33 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2011:0263	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 4.9 kernel security and bug fix update	2011-02-16 15:14:55 UTC

Description Qian Cai 2008-07-21 08:32:14 UTC

Description of problem:
While running some stress testing for a 4.6.z Errata Kernel, one of x86_64
machine (ibm-x3250m2-01.rhts.bos.redhat.com) failed,

http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=3675069

general protection fault: 0000 [1] SMP 
CPU 0 
Modules linked in: md5 ipv6 parport_pc lp parport autofs4 sunrpc ds yenta_socket
pcmcia_core loop button battery ac uhci_hcd ehci_hcd tg3 floppy sr_mod
dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod ata_piix libata mptscsih mptsas
mptspi mptscsi mptbase sd_mod scsi_mod
Pid: 14782, comm: mkswap Not tainted 2.6.9-67.0.22.ELsmp
RIP: 0010:[<ffffffff8017d699>] <ffffffff8017d699>{end_bio_bh_io_sync+44}
RSP: 0018:ffffffff80460788  EFLAGS: 00010202
RAX: 0000000000000001 RBX: 00000100444e8e80 RCX: 0000010040fa1fa8
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000010040fa1fa8
RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000fffffffa
R10: ffffffffa00802d2 R11: ffffffff8017d66d R12: 00000100412eb900
R13: 0000000000000000 R14: 00000100412f1fa0 R15: 00000100013c0e00
FS:  0000002a959a4da0(0000) GS:ffffffff804f3680(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000005c8458 CR3: 0000000000101000 CR4: 00000000000006e0
Process mkswap (pid: 14782, threadinfo 0000010040b88000, task 000001007c82d030)
Stack: 0000010041389840 ffffffffa008044f 0000010041309468 0000000000001000 
       00000100412eb900 0000000000000000 0000010041309468 000000000001b000 
       000000000000d000 ffffffff8025465d 
Call Trace:<IRQ> <ffffffffa008044f>{:dm_mod:clone_endio+381}
<ffffffff8025465d>{__end_that_request_first+238} 
       <ffffffffa0007088>{:scsi_mod:scsi_end_request+40}
<ffffffffa000739d>{:scsi_mod:scsi_io_completion+497} 
       <ffffffffa0002d45>{:scsi_mod:scsi_softirq+213}
<ffffffff8013d000>{__do_softirq+88} 
       <ffffffff8013d0a9>{do_softirq+49} <ffffffff80113257>{do_IRQ+328} 
       <ffffffff8011083f>{ret_from_intr+0}  <EOI> <ffffffff80310833>{_read_lock+28} 
       <ffffffffa00801e7>{:dm_mod:dm_get_table+18}
<ffffffffa0080700>{:dm_mod:__split_bio+25} 
       <ffffffff803104a4>{__down_read+52}
<ffffffffa0080d17>{:dm_mod:dm_request+388} 
       <ffffffff8025414d>{generic_make_request+355}
<ffffffff80136030>{autoremove_wake_function+0} 
       <ffffffff80254259>{submit_bio+247} <ffffffff8017fa7c>{bio_alloc+288} 
       <ffffffff8017d7a7>{submit_bh+255}
<ffffffff8017e772>{__block_write_full_page+440} 
       <ffffffffa00ab444>{:ext3:ext3_get_block+0}
<ffffffffa00a9b76>{:ext3:ext3_ordered_writepage+245} 
       <ffffffff8019bd91>{mpage_writepages+555}
<ffffffff8015d305>{__generic_file_aio_write_nolock+731} 
       <ffffffffa00a9a81>{:ext3:ext3_ordered_writepage+0} 
       <ffffffff8015d5a3>{generic_file_aio_write_nolock+32} 
       <ffffffff8015d66d>{generic_file_aio_write+126}
<ffffffffa00a7f01>{:ext3:ext3_file_write+22} 
       <ffffffff8015af28>{__filemap_fdatawrite_range+95}
<ffffffff8017c818>{sys_fsync+137} 
       <ffffffff80110276>{system_call+126} 

Code: ff 51 38 48 89 df e8 46 24 00 00 31 c0 5b c3 41 55 45 31 ed 
RIP <ffffffff8017d699>{end_bio_bh_io_sync+44} RSP <ffffffff80460788>
 <0>Kernel panic - not syncing: Oops
 

Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-67.0.22.EL

Comment 1 Milan Broz 2008-07-21 13:34:45 UTC

hm, it seems that scsi layer is trying to use bio after dm unallocated it...strange

end_bio_bh_io_sync(struct bio *bio, unsigned int bytes_done, int err)
{
	struct buffer_head *bh = bio->bi_private;

	if (bio->bi_size)
		return 1;

	if (err == -EOPNOTSUPP) {
		set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
->>
...

<ffffffff8017d699>{end_bio_bh_io_sync+44}
<IRQ>

static int clone_endio(struct bio *bio, unsigned int done, int error)
...
	bio->bi_private = md->bs;
	bio_put(bio);
->>
	free_tio(md, tio);
	return r;

<ffffffffa008044f>{:dm_mod:clone_endio+381}
<ffffffff8025465d>{__end_that_request_first+238} 
<ffffffffa0007088>{:scsi_mod:scsi_end_request+40}
<ffffffffa000739d>{:scsi_mod:scsi_io_completion+497} 
<ffffffffa0002d45>{:scsi_mod:scsi_softirq+213}
<ffffffff8013d000>{__do_softirq+88} 
<ffffffff8013d0a9>{do_softirq+49}
<ffffffff80113257>{do_IRQ+328} 
<ffffffff8011083f>{ret_from_intr+0}
<EOI>
...

Comment 2 Linda Wang 2008-07-25 04:36:46 UTC

Cai, is this the first time you run into this issue?
Have you seen this issue on other systems?

Comment 4 Qian Cai 2008-07-25 04:42:04 UTC

Yes, this is the first time I saw this, and it is the first time I started to
run this stress test for a 4.6.z Kernel. I have only seen it on this particular
machine so far.

Comment 5 Milan Broz 2008-07-25 07:53:17 UTC

(In reply to comment #1)
Hm, seems that this my analysis in quite incorrect (wrong line).
Anyway, The problem is with original bio (or buffer_head).
Device-mapper never touch original bio (just keeps reference), it creates
clone(s) of bio, keep reference count and call end_bio on original bio only if
all clones are finished (dm_mod:clone_endio).

So problem is problably somewhere else - I expect some possible corruption (or
race in deallocation) in scsi layer - see that scsi_io_completion is called
through softirq...

If you can reproduce it, and still suspect device-mapper problem, please send me
some crashdump for analysis.

Comment 6 Jeff Burke 2008-07-28 01:08:28 UTC

I have run several jobs using this system. I have not been able to reproduce the
reported problem.

Comment 7 RHEL Program Management 2008-09-03 13:08:40 UTC

Updating PM score.

Comment 11 Milan Broz 2009-03-24 19:29:51 UTC

I expect that this bug is probably the same problem as I recently fixed upstream here
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=b35f8caa0890169000fec22902290d9a15274cbd

With adding some busy wait between various places, I was able to kill the system with very similar call trace - so I expect that the reported OOps just accesses another part/pointer of the affected structures.

(here it runs in parallel simple device creation and in another thread mkwap on that device.)

Note that this problem is in all kernel for very long time - it is very hard to trigger it (because clone_endio runs in interrupt context).

I'll prepare a patch for that.

Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP:
<ffffffff8015e9be>{mempool_free+13}
PML4 0
Oops: 0000 [1] SMP
CPU 2
Modules linked in: parport_pc(U) lp(U) parport(U) autofs4(U) sunrpc(U) ds(U) yenta_socket(U) pcmcia_core(U) cpufreq_powersave(U) ib_srp(U) ib_sdp(U) ib_ipoib(U) inet_lro(U) md5(U) ipv6(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) dm_mirror(U) dm_mod(U) button(U) battery(U) ac(U) uhci_hcd(U) ehci_hcd(U) i5000_edac(U) edac_mc(U) hw_random(U) tg3(U) ext3(U) jbd(U) qla2400(U) qla2xxx(U) scsi_transport_fc(U) ata_piix(U) mptscsih(U) mptsas(U) mptspi(U) mptscsi(U) mptbase(U) ahci(U) libata(U) sd_mod(U) scsi_mod(U)
Pid: 0, comm: swapper Not tainted 2.6.9-prep
RIP: 0010:[<ffffffff8015e9be>] <ffffffff8015e9be>{mempool_free+13}
RSP: 0018:00000101aa153df8  EFLAGS: 00010246
RAX: 00000101a9c4ec00 RBX: 0000000000000000 RCX: 000001000100afb0
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000101a2f04bb0
RBP: 0000000000000000 R08: 000001000100afb0 R09: 00000000fffffffa
R10: 0000000300000000 R11: ffffffff8015ea32 R12: 00000101a2f04bb0
R13: 0000000000000000 R14: 00000101a2ff3970 R15: 00000101a9c4ec00
FS:  0000000000000000(0000) GS:ffffffff80503980(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 00000001aa13e000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo 00000101aa13c000, task 0000010005d2c7f0)
Stack: 00000101a2f04bb0 0000000000000000 00000101a5695580 ffffffffa018249d
       00000101a9065ca8 0000000000000200 00000101a5695580 0000000000000000
       00000101a9065ca8 0000000000000a00
Call Trace:<IRQ> <ffffffffa018249d>{:dm_mod:clone_endio+459} <ffffffff80256b2f>{__end_that_request_first+238}
       <ffffffffa00070cc>{:scsi_mod:scsi_end_request+40} <ffffffffa00073e1>{:scsi_mod:scsi_io_completion+497}
       <ffffffffa0002d45>{:scsi_mod:scsi_softirq+213} <ffffffff8013d5a8>{__do_softirq+88}
       <ffffffff8013d651>{do_softirq+49} <ffffffff801132ef>{do_IRQ+328}
       <ffffffff801108bf>{ret_from_intr+0}  <EOI> <ffffffff8010e88c>{mwait_idle+86}
       <ffffffff8010e81c>{cpu_idle+26}

Code: 8b 46 08 39 46 0c 7d 4d 48 89 f7 e8 e2 5f 1b 00 48 89 c5 8b
RIP <ffffffff8015e9be>{mempool_free+13} RSP <00000101aa153df8>
CR2: 0000000000000008
 <0>Kernel panic - not syncing: Oops


Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP:
<ffffffffa0188fb8>{:dm_mod:bio_free+40}
PML4 0
Oops: 0000 [1] SMP
CPU 3
Modules linked in: dm_mod(U) parport_pc(U) lp(U) parport(U) autofs4(U) sunrpc(U) ds(U) yenta_socket(U) pcmcia_core(U) cpufreq_powersave(U) ib_srp(U) ib_sdp(U) ib_ipoib(U) inet_lro(U) md5(U) ipv6(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) button(U) battery(U) ac(U) uhci_hcd(U) ehci_hcd(U) i5000_edac(U) edac_mc(U) hw_random(U) tg3(U) ext3(U) jbd(U) qla2400(U) qla2xxx(U) scsi_transport_fc(U) ata_piix(U) mptscsih(U) mptsas(U) mptspi(U) mptscsi(U) mptbase(U) ahci(U) libata(U) sd_mod(U) scsi_mod(U)
Pid: 0, comm: swapper Not tainted 2.6.9-prep
RIP: 0010:[<ffffffffa0188fb8>] <ffffffffa0188fb8>{:dm_mod:bio_free+40}
RSP: 0018:0000010037e07df8  EFLAGS: 00010293
RAX: 0000000000000000 RBX: 000001019bf92900 RCX: 00000100010086d0
RDX: 000001019bf92900 RSI: 0000000000000000 RDI: 00000101a70cd3f0
RBP: 0000000000000000 R08: 00000100010086d0 R09: 00000000fffffffa
R10: 0000000300000000 R11: ffffffffa018268a R12: 000001019bf92900
R13: 0000000000000000 R14: 00000101a1c00310 R15: 0000010037c20a00
FS:  0000000000000000(0000) GS:ffffffff80503a00(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 00000001aa174000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo 00000101aa172000, task 0000010005d2c030)
Stack: 000001019bf92900 0000010168addef8 0000000000000000 ffffffffa0182499
       00000101a9045468 0000000000000200 000001019bf92900 0000000000000000
       00000101a9045468 0000000000000a00
Call Trace:<IRQ> <ffffffffa0182499>{:dm_mod:clone_endio+455} <ffffffff80256b2f>{__end_that_request_first+238}
       <ffffffffa00070cc>{:scsi_mod:scsi_end_request+40} <ffffffffa00073e1>{:scsi_mod:scsi_io_completion+497}
       <ffffffffa0002d45>{:scsi_mod:scsi_softirq+213} <ffffffff8013d5a8>{__do_softirq+88}
       <ffffffff8013d651>{do_softirq+49} <ffffffff801132ef>{do_IRQ+328}
       <ffffffff801108bf>{ret_from_intr+0}  <EOI> <ffffffff8010e88c>{mwait_idle+86}
       <ffffffff8010e81c>{cpu_idle+26}

Code: 48 8b 74 c5 08 e8 ef 59 fd df 48 8b 75 00 48 89 df 58 5b 5d
RIP <ffffffffa0188fb8>{:dm_mod:bio_free+40} RSP <0000010037e07df8>
CR2: 0000000000000008
 <0>Kernel panic - not syncing: Oops

Comment 13 RHEL Program Management 2009-04-10 16:28:29 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 18 manuel.desbonnet 2009-07-01 09:33:37 UTC

Hi,

Any estimate for when this patch might be included in a maintenance release ?

Thanks.

Comment 19 RHEL Program Management 2010-04-20 13:18:26 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 21 Vivek Goyal 2010-05-07 17:00:27 UTC

Committed in 89.25.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Comment 26 Gris Ge 2011-01-12 03:13:01 UTC

Code reviewed.
linux-2.6.9-dm-fix-oops-in-mempool_free-when-device-removed.patch was found and applied into kernel-2.6.9-95.EL

As this issue is hard to trigger, set as verify SanityOnly.

The basic function test of device-mapper after this change will be tested in device-mapper errata.

Comment 27 errata-xmlrpc 2011-02-16 15:20:33 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html

Note You need to log in before you can comment on or make changes to this bug.