Bug 164630
| Summary: | panic while running fsstress to a filesystem on a mirror | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | kernel | Assignee: | Jonathan Earl Brassow <jbrassow> |
| Status: | CLOSED ERRATA | Severity: | medium |
| Priority: | medium | Version: | 4.0 |
| Hardware: | All | OS: | Linux |
| Fixed In Version: | RHSA-2005-514 | Doc Type: | Bug Fix |
| CC: | agk, jbrassow, kanderso | Bug Blocks: | 156323 |
| Last Closed: | 2005-10-05 13:46:44 UTC | | |
Description (Corey Marthaler, 2005-07-29 14:41:49 UTC)
I've seen this twice now while running just two fsstress instances against an ext3 filesystem on top of a mirror. It also happens on a linear volume while running fsstress and doing pvmoves at the same time.

This bug has been around for a long time... probably since the first time the dm-raid1.c file was included in the kernel. There is a function, __rh_alloc(), which expects to be called while a lock is held. However, it will unlock (read) and relock (write) in order to do allocation. This creates many races. I've changed the way the functions are called (and the arguments to one of them) so that allocation can happen without unlocking. This change closes two different race conditions.

Devel ACK - needs to get fixed in order for mirrors to be useful during U2 beta on SMP machines.

All of this chasing around for race conditions could have been avoided if I had realized that what was meant by 'create_workqueue' was 'create_singlethread_workqueue'... This eliminates the races and (as best I can tell) is what the original author envisioned. The new patch is a couple-line change. Patch submitted to rhkernel. It has been tested by QA and has successfully solved the problem.

I saw this once this morning with the latest kernel and GFS build while running an I/O load to a mirror with GFS on top. I have since been attempting to reproduce it again but have not been able to. Will continue to try...

```
[root@link-08 mirror]# uname -ar
Linux link-08 2.6.9-15.ELsmp #1 SMP Fri Aug 5 19:00:35 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux

Aug 9 04:39:31 link-08 kernel: GFS 2.6.9-38.0 (built Aug 8 2005 15:52:32) installed
Aug 9 04:39:50 link-08 kernel: GFS: Trying to join cluster "lock_nolock", ""
Aug 9 04:39:50 link-08 kernel: Lock_Nolock 2.6.9-38.0 (built Aug 8 2005 15:52:12) installed
Aug 9 04:39:50 link-08 kernel: GFS: fsid=dm-5.0: Joined cluster. Now mounting FS...
Aug 9 04:39:50 link-08 kernel: GFS: fsid=dm-5.0: jid=0: Trying to acquire journal lock...
Aug 9 04:39:50 link-08 kernel: GFS: fsid=dm-5.0: jid=0: Looking at journal...
Aug 9 04:39:50 link-08 kernel: GFS: fsid=dm-5.0: jid=0: Done

Unable to handle kernel NULL pointer dereference at 0000000000000038
RIP: <ffffffffa00e3704>{:dm_mirror:mirror_end_io+84}
PML4 12bf7067 PGD 12bfa067 PMD 0
Oops: 0002 [1] SMP
CPU 0
Modules linked in: lock_nolock(U) gfs(U) lock_harness(U) md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core button battery ac ohci_hcd hw_random tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc mptscsih mptbase sd_mod scsi_mod
Pid: 4083, comm: fsstress Tainted: G M 2.6.9-15.ELsmp
RIP: 0010:[<ffffffffa00e3704>] <ffffffffa00e3704>{:dm_mirror:mirror_end_io+84}
RSP: 0018:ffffffff8044a6e8 EFLAGS: 00010216
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffff0000274640
RDX: ffffff0000274640 RSI: 0000000000001795 RDI: 000001001f136c30
RBP: 000001001f136c18 R08: ffffff0000273000 R09: 00000000000007ff
R10: 000001001f52ac00 R11: ffffffffa009a0c2 R12: 000001001f136c30
R13: 000001001f136c00 R14: 0000010036bc7350 R15: 0000000000000000
FS: 0000002a95576b00(0000) GS:ffffffff804d3280(005b) knlGS:00000000f7fdd6c0
CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000000038 CR3: 0000000000101000 CR4: 00000000000006e0
Process fsstress (pid: 4083, threadinfo 00000100366e4000, task 000001001f11b7f0)
Stack: ffffffffa00e36b0 000001003466ab38 0000000000000000 0000010036bc7340
       000001002e1c4d80 0000000000000000 0000000000007000 ffffffffa009a122
       000001001f136c00 000001002e1c4d80
Call Trace: <IRQ>
 <ffffffffa00e36b0>{:dm_mirror:mirror_end_io+0}
 <ffffffffa009a122>{:dm_mod:clone_endio+96}
 <ffffffffa00e2428>{:dm_mirror:write_callback+314}
 <ffffffffa009ee83>{:dm_mod:endio+184}
 <ffffffffa009a18f>{:dm_mod:clone_endio+205}
 <ffffffff80249ae2>{__end_that_request_first+238}
 <ffffffffa0006d7d>{:scsi_mod:scsi_end_request+40}
 <ffffffffa0007092>{:scsi_mod:scsi_io_completion+497}
 <ffffffffa0002d21>{:scsi_mod:scsi_softirq+213}
 <ffffffff8013b774>{__do_softirq+88}
 <ffffffff8013b81d>{do_softirq+49}
 <ffffffff80112f77>{do_IRQ+328}
 <ffffffff8011061b>{ret_from_intr+0} <EOI>
 <ffffffff801cbb04>{dummy_inode_permission+0}
 <ffffffff80184d24>{open_namei+289}
 <ffffffff80184d10>{open_namei+269}
 <ffffffff801761c4>{filp_open+39}
 <ffffffff801e9c85>{strncpy_from_user+74}
 <ffffffff801762cd>{get_unused_fd+230}
 <ffffffff801763bc>{sys_open+57}
 <ffffffff801250ff>{cstar_do_call+27}
Code: f0 ff 4b 38 0f 94 c0 84 c0 0f 84 4b 01 00 00 49 8d 7d 50 e8
RIP <ffffffffa00e3704>{:dm_mirror:mirror_end_io+84} RSP <ffffffff8044a6e8>
CR2: 0000000000000038
<0>Kernel panic - not syncing: Oops
```

After a reboot and another mount attempt:

```
Lock_Harness 2.6.9-38.0 (built Aug 8 2005 15:52:12) installed
GFS 2.6.9-38.0 (built Aug 8 2005 15:52:32) installed
GFS: Trying to join cluster "lock_nolock", ""
Lock_Nolock 2.6.9-38.0 (built Aug 8 2005 15:52:12) installed
GFS: fsid=dm-5.0: Joined cluster. Now mounting FS...
GFS: fsid=dm-5.0: jid=0: Trying to acquire journal lock...
GFS: fsid=dm-5.0: jid=0: Looking at journal...
GFS: fsid=dm-5.0: jid=0: Acquiring the transaction lock...
GFS: fsid=dm-5.0: jid=0: Replaying journal...
```
```
GFS: fsid=dm-5.0: fatal: filesystem consistency error
GFS: fsid=dm-5.0: function = foreach_descriptor
GFS: fsid=dm-5.0: file = /usr/src/build/600098-x86_64/BUILD/gfs-kernel-2.6.9-38/smp/src/gfs/recovery.c, line = 382
GFS: fsid=dm-5.0: time = 1123582218
GFS: fsid=dm-5.0: about to withdraw from the cluster
GFS: fsid=dm-5.0: waiting for outstanding I/O
GFS: fsid=dm-5.0: telling LM to withdraw
GFS: fsid=dm-5.0: withdrawn
GFS: fsid=dm-5.0: start = 13122448
mh_magic = 0x01161970
mh_type = 8
mh_generation = 0
mh_format = 800
mh_incarn = 0
lh_flags = 0x00000000
lh_pad = 0
lh_first = 13122384
lh_sequence = 33296
lh_tail = 13122000
lh_last_dump = 13119488
lh_reserved =
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
GFS: fsid=dm-5.0: jid=0: Failed
GFS: fsid=dm-5.0: error recovering journal 0: -5
```

The same messages were repeated via syslog at Aug 9 05:10:17-18, followed by:

```
Aug 9 05:10:19 link-08 hald[2847]: Timed out waiting for hotplug event 308. Rebasing to 309
Aug 9 05:10:32 link-08 hal.hotplug[3642]: timout(10000 ms) waiting for /block/diapered_dm-5
```

The error handling in mirror_end_io looks rather harsh to me. How is EWOULDBLOCK handled with read-ahead? Or EOPNOTSUPP?
Also, ms->lock appears to be used from both interrupt and non-interrupt context without suspending interrupts. And map_context->ptr can still get referenced after it's been freed, which would be my guess at the cause of this particular crash.

I take that back - I was confused by the non-simple flow through the function. The rw == READ test is always true?

Speculation is the way to the dark side. :)

Keying in on the important Oops statements:

```
Code: f0 ff 4b 38 0f 94 c0 84 c0 0f 84 4b 01 00 00 49 8d 7d 50 e8
RIP <ffffffffa00e3704>{:dm_mirror:mirror_end_io+84} RSP <ffffffff8044a6e8>
CR2: 0000000000000038
```

Doing objdump -D on the mirror module:

```
00000000000026b0 <mirror_end_io>:
<snip>
    26f4: e8 64 e5 ff ff        callq  c5d <__rh_lookup>
    26f9: 4c 89 e7              mov    %r12,%rdi
    26fc: 48 89 c3              mov    %rax,%rbx
    26ff: e8 00 00 00 00        callq  2704 <mirror_end_io+0x54>
    2704: f0 ff 4b 38           lock decl 0x38(%rbx)
```

Add +84 (+0x54) to the start of the mirror_end_io function to get 2704. Go to offset 2704 and verify the machine code - yup, this is exactly the instruction we are looking for. Translate that to real code to get:

```c
static void rh_dec(struct region_hash *rh, region_t region)
{
	unsigned long flags;
	struct region *reg;
	int should_wake = 0;

	read_lock(&rh->hash_lock);
	reg = __rh_lookup(rh, region);
	read_unlock(&rh->hash_lock);

	if (atomic_dec_and_test(&reg->pending)) {   /* <--- right here */
```

Could it be our 'reg' variable? One more check... if it is, 'pending' must be at offset 0x38 from reg. Indeed! 'pending' is at offset 0x38 in the region structure (assuming a 64-bit machine will pad 4 bytes between state and hash_list - which it will)!

We can be reasonably certain that our bug is a known race that we have thought of in theory but have been trying to reproduce... Could the following patch fix it?
```diff
diff -purN linux-2.6.9-12.EL-01/drivers/md/dm-raid1.c linux-2.6.9-12.EL-02/drivers/md/dm-raid1.c
--- linux-2.6.9-12.EL-01/drivers/md/dm-raid1.c	2005-08-03 15:28:40.031213364 -0500
+++ linux-2.6.9-12.EL-02/drivers/md/dm-raid1.c	2005-08-03 15:30:24.220524638 -0500
@@ -378,6 +378,11 @@ static void rh_inc(struct region_hash *r
 	read_lock(&rh->hash_lock);
 	reg = __rh_find(rh, region);
+
+	spin_lock_irq(&rh->region_lock);
+	atomic_inc(&reg->pending);
+	spin_unlock_irq(&rh->region_lock);
+
 	if (reg->state == RH_CLEAN) {
 		rh->log->type->mark_region(rh->log, reg->key);
@@ -387,7 +392,6 @@ static void rh_inc(struct region_hash *r
 		spin_unlock_irq(&rh->region_lock);
 	}
 
-	atomic_inc(&reg->pending);
 	read_unlock(&rh->hash_lock);
 }
 
@@ -409,17 +413,17 @@ static void rh_dec(struct region_hash *r
 	reg = __rh_lookup(rh, region);
 	read_unlock(&rh->hash_lock);
 
+	spin_lock_irqsave(&rh->region_lock, flags);
 	if (atomic_dec_and_test(&reg->pending)) {
-		spin_lock_irqsave(&rh->region_lock, flags);
 		if (reg->state == RH_RECOVERING) {
 			list_add_tail(&reg->list, &rh->quiesced_regions);
 		} else {
 			reg->state = RH_CLEAN;
 			list_add(&reg->list, &rh->clean_regions);
 		}
-		spin_unlock_irqrestore(&rh->region_lock, flags);
 		should_wake = 1;
 	}
+	spin_unlock_irqrestore(&rh->region_lock, flags);
 
 	if (should_wake)
 		wake();
```

Reproduced this again while attempting to hit 165242. This was on an x86 UP machine running a 2.6.9-15.EL kernel compiled without linux-2.6.9-dm-mirroring.patch. I was running ext3 on a 25M linear volume with 6 iterations of fsstress while doing pvmoves at the same time to simulate mirroring. It had run for around 1.5 - 2 hours before tripping.
```
Unable to handle kernel NULL pointer dereference at virtual address 00000020
 printing eip:
e01aa6b2
*pde = 1a1c7067
Oops: 0002 [#1]
Modules linked in: md5(U) ipv6(U) parport_pc(U) lp(U) parport(U) autofs4(U) i2c_dev(U) i2c_core(U) sunrpc(U) dm_mirror(U) button(U) battery(U) ac(U) uhci_hcd(U) ehci_hcd(U) e1000(U) floppy(U) ext3(U) jbd(U) dm_mod(U) qla2300(U) qla2xxx(U) scsi_transport_fc(U) sd_mod(U) scsi_mod(U)
CPU: 0
EIP: 0060:[<e01aa6b2>] Not tainted VLI
EFLAGS: 00010286 (2.6.9-prep)
EIP is at rh_dec+0xc/0x124 [dm_mirror]
eax: 00000000 ebx: df50628c ecx: 00000000 edx: 00000000
esi: df506280 edi: d6a499fc ebp: d459eb80 esp: c03e4eb0
ds: 007b es: 007b ss: 0068
Process swapper (pid: 0, threadinfo=c03e4000 task=c0358be0)
Stack: 00000000 df506280 d6a499fc e01abaf2 d67d29fc 00000000 e0060187 d6a49a04
       e007e080 e01abacf 00000000 d459eb80 00000400 e0060122 00000000 c016df35
       d459eb80 df506280 e01aaef7 db207900 e01aaf3c 00000000 00000000 d459eb80
Call Trace:
 [<e01abaf2>] mirror_end_io+0x23/0x28 [dm_mirror]
 [<e0060187>] clone_endio+0x65/0xee [dm_mod]
 [<e01abacf>] mirror_end_io+0x0/0x28 [dm_mirror]
 [<e0060122>] clone_endio+0x0/0xee [dm_mod]
 [<c016df35>] bio_endio+0x4f/0x54
 [<e01aaef7>] write_callback+0x0/0x49 [dm_mirror]
 [<e01aaf3c>] write_callback+0x45/0x49 [dm_mirror]
 [<e00653c6>] dec_count+0x43/0x48 [dm_mod]
 [<e0065499>] endio+0x39/0x46 [dm_mod]
 [<e0065460>] endio+0x0/0x46 [dm_mod]
 [<c016df35>] bio_endio+0x4f/0x54
 [<c0250dcc>] __end_that_request_first+0xea/0x1ab
 [<e0046dcc>] scsi_end_request+0x1b/0x174 [scsi_mod]
 [<e0047263>] scsi_io_completion+0x20b/0x417 [scsi_mod]
 [<e0042001>] scsi_finish_command+0xad/0xb1 [scsi_mod]
 [<e0041f26>] scsi_softirq+0xba/0xc2 [scsi_mod]
 [<c012677d>] __do_softirq+0x35/0x79
 [<c0109338>] do_softirq+0x46/0x4d
 =======================
 [<c01088fc>] do_IRQ+0x2b3/0x2bf
 [<c030fb6c>] common_interrupt+0x18/0x20
 [<c010403b>] default_idle+0x23/0x26
 [<c010408c>] cpu_idle+0x1f/0x34
 [<c03b86b9>] start_kernel+0x214/0x216
Code: d3 ed a8 20 0f 45 fd 0f 45 ea 89 e9 89 fa 89 f0 e8 b2 fe ff ff 8b 5b 08 eb d3 5b 5e 5f 5d c3 57 56 53 89 c3 e8 3f f9 ff ff 89 c1 <ff> 48 20 0f 94 c0 84 c0 0f 84 06 01 00 00 9c 5f fa 81 7b 20 3c
<0>Kernel panic - not syncing: Fatal exception in interrupt
```

Exactly the same spot (except on x86):

```
drivers/md/dm-raid1.c:400
    16a6: 57                    push   %edi
    16a7: 56                    push   %esi
    16a8: 53                    push   %ebx
    16a9: 89 c3                 mov    %eax,%ebx
drivers/md/dm-raid1.c:406
    16ab: e8 3f f9 ff ff        call   fef <__rh_lookup>
    16b0: 89 c1                 mov    %eax,%ecx
include/asm/atomic.h:135
    16b2: ff 48 20              decl   0x20(%eax)   <--- right here
```

16a6 + c = 16b2. Translated to C code:

```c
static void rh_dec(struct region_hash *rh, region_t region)
{
	unsigned long flags;
	struct region *reg;
	int should_wake = 0;

	read_lock(&rh->hash_lock);
	reg = __rh_lookup(rh, region);
	read_unlock(&rh->hash_lock);

	if (atomic_dec_and_test(&reg->pending)) {   /* <--- right here */
```

Note that in this case the offset is 0x20 rather than 0x38. This is because pointers are four bytes on x86, so 'pending' sits 0x20 bytes from the top of the region structure there.

By "6 iterations of fsstress" do you mean 'fsstress -p 6'?
Taking a look at the following code:

```c
static void rh_inc(struct region_hash *rh, region_t region)
{
	struct region *reg;

	read_lock(&rh->hash_lock);
	reg = __rh_find(rh, region);
	if (reg->state == RH_CLEAN) {
		rh->log->type->mark_region(rh->log, reg->key);

		spin_lock_irq(&rh->region_lock);
		reg->state = RH_DIRTY;
		list_del_init(&reg->list);	/* take off the clean list */
		spin_unlock_irq(&rh->region_lock);
	}

	atomic_inc(&reg->pending);
	read_unlock(&rh->hash_lock);
}

static void rh_inc_pending(struct region_hash *rh, struct bio_list *bios)
{
	struct bio *bio;

	for (bio = bios->head; bio; bio = bio->bi_next)
		rh_inc(rh, bio_to_region(rh, bio));
}

static void rh_dec(struct region_hash *rh, region_t region)
{
	unsigned long flags;
	struct region *reg;
	int should_wake = 0;

	read_lock(&rh->hash_lock);
	reg = __rh_lookup(rh, region);
	read_unlock(&rh->hash_lock);

	if (atomic_dec_and_test(&reg->pending)) {
		spin_lock_irqsave(&rh->region_lock, flags);
		if (reg->state == RH_RECOVERING) {
			list_add_tail(&reg->list, &rh->quiesced_regions);
		} else {
			reg->state = RH_CLEAN;
			list_add(&reg->list, &rh->clean_regions);
		}
		spin_unlock_irqrestore(&rh->region_lock, flags);
		should_wake = 1;
	}

	if (should_wake)
		wake();
}
```

It would be possible for a process to be in rh_inc, just after failing the (reg->state == RH_CLEAN) test, and then be interrupted by rh_dec. rh_dec (if there is only one write outstanding) would place the region on the clean list. Control then returns to rh_inc, which increments pending; but the region is still on the clean list. The write (which caused the rh_inc) proceeds, but rh_update_states removes the region because it is on the clean list. When the new write completes, the region lookup returns NULL - causing the Oops. This is an _extremely_ small window of opportunity. Could it be this?
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-514.html