Description of problem:
I created an ext3 filesystem on top of a mirror and then ran a fsstress/genesis/accordion I/O load while at the same time doing pvmoves from PVs in the VG to others in the VG which had been extended later.

Unable to handle kernel NULL pointer dereference at 0000000000000038
RIP: <ffffffffa00b7f71>{:dm_mirror:mirror_end_io+67}
PML4 1ab77067 PGD 3ebc9067 PMD 0
Oops: 0002 [1] SMP
CPU 1
Modules linked in: md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core button battery ac ohci_hcd hw_random tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Pid: 0, comm: swapper Tainted: G M 2.6.9-11.28.ELsmp
RIP: 0010:[<ffffffffa00b7f71>] <ffffffffa00b7f71>{:dm_mirror:mirror_end_io+67}
RSP: 0018:0000010037e0bd98 EFLAGS: 00010216
RAX: 0000000000000000 RBX: 0000010037e3dc30 RCX: ffffff00000545e0
RDX: ffffff00000545e0 RSI: 0000000000000a81 RDI: 0000010037e3dc30
RBP: 0000010037e3dc00 R08: ffffff0000054000 R09: 00000000000001ff
R10: 000001003f1c0800 R11: ffffffffa006f0c2 R12: 0000000000000000
R13: 0000000000000a81 R14: 0000000000000000 R15: 0000000000001000
FS: 0000002a95576b00(0000) GS:ffffffff804cf400(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018
CR0: 000000008005003b CR2: 0000000000000038 CR3: 000000003ff38000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo 000001003ff34000, task 000001003ff0b030)
Stack: ffffffffa00b7f2e 0000010023a1c118 0000000000000000 000001001d7d7be0
       000001002a3b2680 ffffffffa006f122 000001002a3b2680 0000010037e3dc00
       0000010019ee5250 000001001f2d5100
Call Trace:<IRQ> <ffffffffa00b7f2e>{:dm_mirror:mirror_end_io+0}
       <ffffffffa006f122>{:dm_mod:clone_endio+96}
       <ffffffffa00b6ede>{:dm_mirror:write_callback+70}
       <ffffffffa0073dbf>{:dm_mod:endio+184}
       <ffffffffa006f18f>{:dm_mod:clone_endio+205}
       <ffffffff80247532>{__end_that_request_first+238}
       <ffffffffa0006d7d>{:scsi_mod:scsi_end_request+40}
       <ffffffffa0007092>{:scsi_mod:scsi_io_completion+497}
       <ffffffffa0002d21>{:scsi_mod:scsi_softirq+213}
       <ffffffff8013b284>{__do_softirq+88}
       <ffffffff8013b32d>{do_softirq+49}
       <ffffffff80112f7b>{do_IRQ+328}
       <ffffffff8011061f>{ret_from_intr+0} <EOI>
       <ffffffff8010e5c9>{default_idle+0}
       <ffffffff8010e5e9>{default_idle+32}
       <ffffffff8010e65c>{cpu_idle+26}
Code: f0 41 ff 4c 24 38 0f 94 c0 84 c0 74 76 48 8d 5d 50 48 89 df
RIP <ffffffffa00b7f71>{:dm_mirror:mirror_end_io+67} RSP <0000010037e0bd98>
CR2: 0000000000000038
<0>Kernel panic - not syncing: Oops

[root@link-02 ~]# pvscan
  PV /dev/hda6   VG VolGroup00   lvm2 [17.53 GB / 64.00 MB free]
  PV /dev/sda1   VG VG           lvm2 [135.66 GB / 110.66 GB free]
  PV /dev/sdb1   VG VG           lvm2 [135.66 GB / 110.66 GB free]
  PV /dev/sdc1   VG VG           lvm2 [135.66 GB / 135.65 GB free]
  PV /dev/sdd1   VG VG           lvm2 [135.66 GB / 135.66 GB free]
  PV /dev/sdf1   VG VG           lvm2 [135.66 GB / 135.66 GB free]
  PV /dev/sdg1                   lvm2 [135.66 GB]
[root@link-02 ~]# vgscan
  Reading all physical volumes. This may take a while...
  Found volume group "VolGroup00" using metadata type lvm2
  Found volume group "VG" using metadata type lvm2
[root@link-02 ~]# lvscan
  ACTIVE   '/dev/VolGroup00/LogVol00' [15.53 GB] inherit
  ACTIVE   '/dev/VolGroup00/LogVol01' [1.94 GB] inherit
  ACTIVE   '/dev/VG/mirror' [25.00 GB] inherit
  ACTIVE   '/dev/VG/mirror_mlog' [4.00 MB] inherit
  ACTIVE   '/dev/VG/mirror_mimage_0' [25.00 GB] inherit
  ACTIVE   '/dev/VG/mirror_mimage_1' [25.00 GB] inherit

*Note* There were I/O errors on /dev/sde last night; however, that device was not being used at all in this scenario, and the errors occurred well before this problem happened.

Version-Release number of selected component (if applicable):
[root@link-02 ~]# lvcreate --version
  LVM version:     2.01.13 (2005-07-13)
  Library version: 1.01.03 (2005-06-13)
  Driver version:  4.4.0
I've seen this twice now while running just two fsstress instances against an ext3 filesystem on top of a mirror.
This also happens on a linear volume while running fsstress and doing pvmoves at the same time.
This bug has been around for a long time... probably since the dm-raid1.c file was first included in the kernel. There is a function, __rh_alloc(), which expects to be called with a lock held. However, it drops the read lock and retakes it as a write lock in order to do the allocation, which opens up several races. I've changed the way the functions are called (and the arguments to one of them) so that allocation can happen without unlocking. This change closes two different race conditions.
Devel ACK - needs to get fixed in order for mirrors to be useful during U2 beta on SMP machines.
All of this chasing around after race conditions could have been avoided if I had realized that what was meant by 'create_workqueue' was 'create_singlethread_workqueue'... This eliminates the races and (as best I can tell) is what the original author envisioned. The new patch is a couple-line change.
Patch submitted to rhkernel. It has been tested by QA and has successfully solved the problem.
I saw this once this morning with the latest kernel and GFS builds while running an I/O load to a mirror with GFS on top. I have since been attempting to reproduce it again but have not been able to. Will continue to try...

[root@link-08 mirror]# uname -ar
Linux link-08 2.6.9-15.ELsmp #1 SMP Fri Aug 5 19:00:35 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux

Aug 9 04:39:31 link-08 kernel: GFS 2.6.9-38.0 (built Aug 8 2005 15:52:32) installed
Aug 9 04:39:50 link-08 kernel: GFS: Trying to join cluster "lock_nolock", ""
Aug 9 04:39:50 link-08 kernel: Lock_Nolock 2.6.9-38.0 (built Aug 8 2005 15:52:12) installed
Aug 9 04:39:50 link-08 kernel: GFS: fsid=dm-5.0: Joined cluster. Now mounting FS...
Aug 9 04:39:50 link-08 kernel: GFS: fsid=dm-5.0: jid=0: Trying to acquire journal lock...
Aug 9 04:39:50 link-08 kernel: GFS: fsid=dm-5.0: jid=0: Looking at journal...
Aug 9 04:39:50 link-08 kernel: GFS: fsid=dm-5.0: jid=0: Done

Unable to handle kernel NULL pointer dereference at 0000000000000038
RIP: <ffffffffa00e3704>{:dm_mirror:mirror_end_io+84}
PML4 12bf7067 PGD 12bfa067 PMD 0
Oops: 0002 [1] SMP
CPU 0
Modules linked in: lock_nolock(U) gfs(U) lock_harness(U) md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core button battery ac ohci_hcd hw_random tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc mptscsih mptbase sd_mod scsi_mod
Pid: 4083, comm: fsstress Tainted: G M 2.6.9-15.ELsmp
RIP: 0010:[<ffffffffa00e3704>] <ffffffffa00e3704>{:dm_mirror:mirror_end_io+84}
RSP: 0018:ffffffff8044a6e8 EFLAGS: 00010216
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffff0000274640
RDX: ffffff0000274640 RSI: 0000000000001795 RDI: 000001001f136c30
RBP: 000001001f136c18 R08: ffffff0000273000 R09: 00000000000007ff
R10: 000001001f52ac00 R11: ffffffffa009a0c2 R12: 000001001f136c30
R13: 000001001f136c00 R14: 0000010036bc7350 R15: 0000000000000000
FS: 0000002a95576b00(0000) GS:ffffffff804d3280(005b) knlGS:00000000f7fdd6c0
CS: 0010 DS: 002b ES: 002b
CR0: 000000008005003b CR2: 0000000000000038 CR3: 0000000000101000 CR4: 00000000000006e0
Process fsstress (pid: 4083, threadinfo 00000100366e4000, task 000001001f11b7f0)
Stack: ffffffffa00e36b0 000001003466ab38 0000000000000000 0000010036bc7340
       000001002e1c4d80 0000000000000000 0000000000007000 ffffffffa009a122
       000001001f136c00 000001002e1c4d80
Call Trace:<IRQ> <ffffffffa00e36b0>{:dm_mirror:mirror_end_io+0}
       <ffffffffa009a122>{:dm_mod:clone_endio+96}
       <ffffffffa00e2428>{:dm_mirror:write_callback+314}
       <ffffffffa009ee83>{:dm_mod:endio+184}
       <ffffffffa009a18f>{:dm_mod:clone_endio+205}
       <ffffffff80249ae2>{__end_that_request_first+238}
       <ffffffffa0006d7d>{:scsi_mod:scsi_end_request+40}
       <ffffffffa0007092>{:scsi_mod:scsi_io_completion+497}
       <ffffffffa0002d21>{:scsi_mod:scsi_softirq+213}
       <ffffffff8013b774>{__do_softirq+88}
       <ffffffff8013b81d>{do_softirq+49}
       <ffffffff80112f77>{do_IRQ+328}
       <ffffffff8011061b>{ret_from_intr+0} <EOI>
       <ffffffff801cbb04>{dummy_inode_permission+0}
       <ffffffff80184d24>{open_namei+289}
       <ffffffff80184d10>{open_namei+269}
       <ffffffff801761c4>{filp_open+39}
       <ffffffff801e9c85>{strncpy_from_user+74}
       <ffffffff801762cd>{get_unused_fd+230}
       <ffffffff801763bc>{sys_open+57}
       <ffffffff801250ff>{cstar_do_call+27}
Code: f0 ff 4b 38 0f 94 c0 84 c0 0f 84 4b 01 00 00 49 8d 7d 50 e8
RIP <ffffffffa00e3704>{:dm_mirror:mirror_end_io+84} RSP <ffffffff8044a6e8>
CR2: 0000000000000038
<0>Kernel panic - not syncing: Oops

After a reboot and another mount attempt:

Lock_Harness 2.6.9-38.0 (built Aug 8 2005 15:52:12) installed
GFS 2.6.9-38.0 (built Aug 8 2005 15:52:32) installed
GFS: Trying to join cluster "lock_nolock", ""
Lock_Nolock 2.6.9-38.0 (built Aug 8 2005 15:52:12) installed
GFS: fsid=dm-5.0: Joined cluster. Now mounting FS...
GFS: fsid=dm-5.0: jid=0: Trying to acquire journal lock...
GFS: fsid=dm-5.0: jid=0: Looking at journal...
GFS: fsid=dm-5.0: jid=0: Acquiring the transaction lock...
GFS: fsid=dm-5.0: jid=0: Replaying journal...
GFS: fsid=dm-5.0: fatal: filesystem consistency error
GFS: fsid=dm-5.0:   function = foreach_descriptor
GFS: fsid=dm-5.0:   file = /usr/src/build/600098-x86_64/BUILD/gfs-kernel-2.6.9-38/smp/src/gfs/recovery.c, line = 382
GFS: fsid=dm-5.0:   time = 1123582218
GFS: fsid=dm-5.0: about to withdraw from the cluster
GFS: fsid=dm-5.0: waiting for outstanding I/O
GFS: fsid=dm-5.0: telling LM to withdraw
GFS: fsid=dm-5.0: withdrawn
GFS: fsid=dm-5.0: start = 13122448
  mh_magic = 0x01161970
  mh_type = 8
  mh_generation = 0
  mh_format = 800
  mh_incarn = 0
  lh_flags = 0x00000000
  lh_pad = 0
  lh_first = 13122384
  lh_sequence = 33296
  lh_tail = 13122000
  lh_last_dump = 13119488
  lh_reserved =
  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
GFS: fsid=dm-5.0: jid=0: Failed
GFS: fsid=dm-5.0: error recovering journal 0: -5

Aug 9 05:10:19 link-08 hald[2847]: Timed out waiting for hotplug event 308. Rebasing to 309
Aug 9 05:10:32 link-08 hal.hotplug[3642]: timout(10000 ms) waiting for /block/diapered_dm-5
The error handling in mirror_end_io looks rather harsh to me. How is EWOULDBLOCK handled with readahead? Or EOPNOTSUPP?
Also, ms->lock appears to be used from both interrupt and non-interrupt context without disabling interrupts.
And map_context->ptr can still get referenced after it's been freed, which would be my guess at the cause of this particular crash.
I take that back, confused by the non-simple flow through the function. The rw == READ test is always true?
Speculation is the way to the dark side. :)

Keying in on the important Oops statements:

Code: f0 ff 4b 38 0f 94 c0 84 c0 0f 84 4b 01 00 00 49 8d 7d 50 e8
RIP <ffffffffa00e3704>{:dm_mirror:mirror_end_io+84} RSP <ffffffff8044a6e8>
CR2: 0000000000000038

Doing objdump -D on the mirror module:

00000000000026b0 <mirror_end_io>:
<snip>
    26f4:  e8 64 e5 ff ff     callq  c5d <__rh_lookup>
    26f9:  4c 89 e7           mov    %r12,%rdi
    26fc:  48 89 c3           mov    %rax,%rbx
    26ff:  e8 00 00 00 00     callq  2704 <mirror_end_io+0x54>
    2704:  f0 ff 4b 38        lock decl 0x38(%rbx)

Add +84 (+0x54) to the start of mirror_end_io to get 2704. Go to offset 2704 and verify the machine code - yup, this is exactly the instruction we are looking for. Translate that to real code to get:

static void rh_dec(struct region_hash *rh, region_t region)
{
	unsigned long flags;
	struct region *reg;
	int should_wake = 0;

	read_lock(&rh->hash_lock);
	reg = __rh_lookup(rh, region);
	read_unlock(&rh->hash_lock);

	if (atomic_dec_and_test(&reg->pending)) {   <--- right here

Could it be our 'reg' variable? One more check... if it is, 'pending' must be at offset 0x38 from reg. Indeed! 'pending' is at offset 0x38 in the region structure (assuming a 64-bit machine pads 4 bytes between state and hash_list - which it does)! We can be reasonably certain that our bug is a known race that we had thought of in theory but had been trying to reproduce... Could the following patch fix it?
diff -purN linux-2.6.9-12.EL-01/drivers/md/dm-raid1.c linux-2.6.9-12.EL-02/drivers/md/dm-raid1.c
--- linux-2.6.9-12.EL-01/drivers/md/dm-raid1.c	2005-08-03 15:28:40.031213364 -0500
+++ linux-2.6.9-12.EL-02/drivers/md/dm-raid1.c	2005-08-03 15:30:24.220524638 -0500
@@ -378,6 +378,11 @@ static void rh_inc(struct region_hash *r
 	read_lock(&rh->hash_lock);
 	reg = __rh_find(rh, region);
+
+	spin_lock_irq(&rh->region_lock);
+	atomic_inc(&reg->pending);
+	spin_unlock_irq(&rh->region_lock);
+
 	if (reg->state == RH_CLEAN) {
 		rh->log->type->mark_region(rh->log, reg->key);
@@ -387,7 +392,6 @@ static void rh_inc(struct region_hash *r
 		spin_unlock_irq(&rh->region_lock);
 	}
 
-	atomic_inc(&reg->pending);
 	read_unlock(&rh->hash_lock);
 }
@@ -409,17 +413,17 @@ static void rh_dec(struct region_hash *r
 	reg = __rh_lookup(rh, region);
 	read_unlock(&rh->hash_lock);
 
+	spin_lock_irqsave(&rh->region_lock, flags);
 	if (atomic_dec_and_test(&reg->pending)) {
-		spin_lock_irqsave(&rh->region_lock, flags);
 		if (reg->state == RH_RECOVERING) {
 			list_add_tail(&reg->list, &rh->quiesced_regions);
 		} else {
 			reg->state = RH_CLEAN;
 			list_add(&reg->list, &rh->clean_regions);
 		}
-		spin_unlock_irqrestore(&rh->region_lock, flags);
 		should_wake = 1;
 	}
+	spin_unlock_irqrestore(&rh->region_lock, flags);
 
 	if (should_wake)
 		wake();
Reproduced this again while attempting to hit 165242. This was on an x86 UP machine running a 2.6.9-15.EL kernel compiled without linux-2.6.9-dm-mirroring.patch. I was running ext3 on a 25M linear with 6 iterations of fsstress while doing pvmoves at the same time to simulate mirroring. It had run for around 1.5 - 2 hours before tripping.

Unable to handle kernel NULL pointer dereference at virtual address 00000020
 printing eip:
e01aa6b2
*pde = 1a1c7067
Oops: 0002 [#1]
Modules linked in: md5(U) ipv6(U) parport_pc(U) lp(U) parport(U) autofs4(U) i2c_dev(U) i2c_core(U) sunrpc(U) dm_mirror(U) button(U) battery(U) ac(U) uhci_hcd(U) ehci_hcd(U) e1000(U) floppy(U) ext3(U) jbd(U) dm_mod(U) qla2300(U) qla2xxx(U) scsi_transport_fc(U) sd_mod(U) scsi_mod(U)
CPU:    0
EIP:    0060:[<e01aa6b2>]    Not tainted VLI
EFLAGS: 00010286   (2.6.9-prep)
EIP is at rh_dec+0xc/0x124 [dm_mirror]
eax: 00000000   ebx: df50628c   ecx: 00000000   edx: 00000000
esi: df506280   edi: d6a499fc   ebp: d459eb80   esp: c03e4eb0
ds: 007b   es: 007b   ss: 0068
Process swapper (pid: 0, threadinfo=c03e4000 task=c0358be0)
Stack: 00000000 df506280 d6a499fc e01abaf2 d67d29fc 00000000 e0060187 d6a49a04
       e007e080 e01abacf 00000000 d459eb80 00000400 e0060122 00000000 c016df35
       d459eb80 df506280 e01aaef7 db207900 e01aaf3c 00000000 00000000 d459eb80
Call Trace:
 [<e01abaf2>] mirror_end_io+0x23/0x28 [dm_mirror]
 [<e0060187>] clone_endio+0x65/0xee [dm_mod]
 [<e01abacf>] mirror_end_io+0x0/0x28 [dm_mirror]
 [<e0060122>] clone_endio+0x0/0xee [dm_mod]
 [<c016df35>] bio_endio+0x4f/0x54
 [<e01aaef7>] write_callback+0x0/0x49 [dm_mirror]
 [<e01aaf3c>] write_callback+0x45/0x49 [dm_mirror]
 [<e00653c6>] dec_count+0x43/0x48 [dm_mod]
 [<e0065499>] endio+0x39/0x46 [dm_mod]
 [<e0065460>] endio+0x0/0x46 [dm_mod]
 [<c016df35>] bio_endio+0x4f/0x54
 [<c0250dcc>] __end_that_request_first+0xea/0x1ab
 [<e0046dcc>] scsi_end_request+0x1b/0x174 [scsi_mod]
 [<e0047263>] scsi_io_completion+0x20b/0x417 [scsi_mod]
 [<e0042001>] scsi_finish_command+0xad/0xb1 [scsi_mod]
 [<e0041f26>] scsi_softirq+0xba/0xc2 [scsi_mod]
 [<c012677d>] __do_softirq+0x35/0x79
 [<c0109338>] do_softirq+0x46/0x4d
 =======================
 [<c01088fc>] do_IRQ+0x2b3/0x2bf
 [<c030fb6c>] common_interrupt+0x18/0x20
 [<c010403b>] default_idle+0x23/0x26
 [<c010408c>] cpu_idle+0x1f/0x34
 [<c03b86b9>] start_kernel+0x214/0x216
Code: d3 ed a8 20 0f 45 fd 0f 45 ea 89 e9 89 fa 89 f0 e8 b2 fe ff ff 8b 5b 08 eb d3 5b 5e 5f 5d c3 57 56 53 89 c3 e8 3f f9 ff ff 89 c1 <ff> 48 20 0f 94 c0 84 c0 0f 84 06 01 00 00 9c 5f fa 81 7b 20 3c
<0>Kernel panic - not syncing: Fatal exception in interrupt
Exactly the same spot (except on x86):

drivers/md/dm-raid1.c:400
    16a6:  57              push   %edi
    16a7:  56              push   %esi
    16a8:  53              push   %ebx
    16a9:  89 c3           mov    %eax,%ebx
drivers/md/dm-raid1.c:406
    16ab:  e8 3f f9 ff ff  call   fef <__rh_lookup>
    16b0:  89 c1           mov    %eax,%ecx
include/asm/atomic.h:135
    16b2:  ff 48 20        decl   0x20(%eax)    <--- right here

16a6 + c = 16b2

Translate to C code:

static void rh_dec(struct region_hash *rh, region_t region)
{
	unsigned long flags;
	struct region *reg;
	int should_wake = 0;

	read_lock(&rh->hash_lock);
	reg = __rh_lookup(rh, region);
	read_unlock(&rh->hash_lock);

	if (atomic_dec_and_test(&reg->pending)) {   <--- right here

Note that in this case the offset is 0x20. This is because pointers are four bytes on x86... We can easily see that 'pending' is 0x20 from the top of the region structure on x86.
by "6 iterations of fsstress" do you mean 'fsstress -p 6' ?
Taking a look at the following code:

######################################################################
static void rh_inc(struct region_hash *rh, region_t region)
{
	struct region *reg;

	read_lock(&rh->hash_lock);
	reg = __rh_find(rh, region);
	if (reg->state == RH_CLEAN) {
		rh->log->type->mark_region(rh->log, reg->key);

		spin_lock_irq(&rh->region_lock);
		reg->state = RH_DIRTY;
		list_del_init(&reg->list);	/* take off the clean list */
		spin_unlock_irq(&rh->region_lock);
	}

	atomic_inc(&reg->pending);
	read_unlock(&rh->hash_lock);
}

static void rh_inc_pending(struct region_hash *rh, struct bio_list *bios)
{
	struct bio *bio;

	for (bio = bios->head; bio; bio = bio->bi_next)
		rh_inc(rh, bio_to_region(rh, bio));
}

static void rh_dec(struct region_hash *rh, region_t region)
{
	unsigned long flags;
	struct region *reg;
	int should_wake = 0;

	read_lock(&rh->hash_lock);
	reg = __rh_lookup(rh, region);
	read_unlock(&rh->hash_lock);

	if (atomic_dec_and_test(&reg->pending)) {
		spin_lock_irqsave(&rh->region_lock, flags);
		if (reg->state == RH_RECOVERING) {
			list_add_tail(&reg->list, &rh->quiesced_regions);
		} else {
			reg->state = RH_CLEAN;
			list_add(&reg->list, &rh->clean_regions);
		}
		spin_unlock_irqrestore(&rh->region_lock, flags);

		should_wake = 1;
	}

	if (should_wake)
		wake();
}
######################################################################

It would be possible for a process to be in rh_inc, just after failing the (reg->state == RH_CLEAN) test, and then be interrupted by rh_dec. rh_dec (if there is only one write outstanding) would place the region on the clean list. Control then returns to rh_inc, which increments pending; but the region is still on the clean list. The write (which caused the rh_inc) proceeds, but rh_update_states removes the region because it is on the clean list. When the new write completes, the region lookup returns NULL - causing the Oops. This is an _extremely_ small window of opportunity. Could it be this?
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-514.html