Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

For bugs related to Red Hat Enterprise Linux 4 product line. The current stable release is 4.9. For Red Hat Enterprise Linux 6 and above, please visit Red Hat JIRA https://issues.redhat.com/secure/CreateIssue!default.jspa?pid=12332745 to report new issues.

Bug 164630

Summary:	panic while running fsstress to a filesystem on a mirror
Product:	Red Hat Enterprise Linux 4	Reporter:	Corey Marthaler <cmarthal>
Component:	kernel	Assignee:	Jonathan Earl Brassow <jbrassow>
Status:	CLOSED ERRATA	QA Contact:
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4.0	CC:	agk, jbrassow, kanderso
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	RHSA-2005-514	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2005-10-05 13:46:44 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	156323

Description Corey Marthaler 2005-07-29 14:41:49 UTC

Description of problem:
I created an ext3 filesystem on top of a mirror and then ran a
fsstress/genesis/accordion I/O load while at the same time doing pvmoves from
pvs in the VG to others in the VG which had been extended later. 


Unable to handle kernel NULL pointer dereference at 0000000000000038 RIP:
<ffffffffa00b7f71>{:dm_mirror:mirror_end_io+67}
PML4 1ab77067 PGD 3ebc9067 PMD 0
Oops: 0002 [1] SMP
CPU 1
Modules linked in: md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core
sunrpc ds yenta_socket pcmcia_core button battery ac ohci_hcd hw_random tg3
floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx
scsi_transport_fc sd_mod scsi_mod
Pid: 0, comm: swapper Tainted: G   M  2.6.9-11.28.ELsmp
RIP: 0010:[<ffffffffa00b7f71>] <ffffffffa00b7f71>{:dm_mirror:mirror_end_io+67}
RSP: 0018:0000010037e0bd98  EFLAGS: 00010216
RAX: 0000000000000000 RBX: 0000010037e3dc30 RCX: ffffff00000545e0
RDX: ffffff00000545e0 RSI: 0000000000000a81 RDI: 0000010037e3dc30
RBP: 0000010037e3dc00 R08: ffffff0000054000 R09: 00000000000001ff
R10: 000001003f1c0800 R11: ffffffffa006f0c2 R12: 0000000000000000
R13: 0000000000000a81 R14: 0000000000000000 R15: 0000000000001000
FS:  0000002a95576b00(0000) GS:ffffffff804cf400(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000038 CR3: 000000003ff38000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo 000001003ff34000, task 000001003ff0b030)
Stack: ffffffffa00b7f2e 0000010023a1c118 0000000000000000 000001001d7d7be0
       000001002a3b2680 ffffffffa006f122 000001002a3b2680 0000010037e3dc00
       0000010019ee5250 000001001f2d5100
Call Trace:<IRQ> <ffffffffa00b7f2e>{:dm_mirror:mirror_end_io+0}
<ffffffffa006f122>{:dm_mod:clone_endio+96}
       <ffffffffa00b6ede>{:dm_mirror:write_callback+70}
<ffffffffa0073dbf>{:dm_mod:endio+184}
       <ffffffffa006f18f>{:dm_mod:clone_endio+205}
<ffffffff80247532>{__end_that_request_first+238}
       <ffffffffa0006d7d>{:scsi_mod:scsi_end_request+40}
<ffffffffa0007092>{:scsi_mod:scsi_io_completion+497}
       <ffffffffa0002d21>{:scsi_mod:scsi_softirq+213}
<ffffffff8013b284>{__do_softirq+88}
       <ffffffff8013b32d>{do_softirq+49} <ffffffff80112f7b>{do_IRQ+328}
       <ffffffff8011061f>{ret_from_intr+0}  <EOI> <ffffffff8010e5c9>{default_idle+0}
       <ffffffff8010e5e9>{default_idle+32} <ffffffff8010e65c>{cpu_idle+26}


Code: f0 41 ff 4c 24 38 0f 94 c0 84 c0 74 76 48 8d 5d 50 48 89 df
RIP <ffffffffa00b7f71>{:dm_mirror:mirror_end_io+67} RSP <0000010037e0bd98>
CR2: 0000000000000038
 <0>Kernel panic - not syncing: Oops


[root@link-02 ~]# pvscan
  PV /dev/hda6   VG VolGroup00   lvm2 [17.53 GB / 64.00 MB free]
  PV /dev/sda1   VG VG           lvm2 [135.66 GB / 110.66 GB free]
  PV /dev/sdb1   VG VG           lvm2 [135.66 GB / 110.66 GB free]
  PV /dev/sdc1   VG VG           lvm2 [135.66 GB / 135.65 GB free]
  PV /dev/sdd1   VG VG           lvm2 [135.66 GB / 135.66 GB free]
  PV /dev/sdf1   VG VG           lvm2 [135.66 GB / 135.66 GB free]
  PV /dev/sdg1                   lvm2 [135.66 GB]
[root@link-02 ~]# vgscan
  Reading all physical volumes.  This may take a while...
  Found volume group "VolGroup00" using metadata type lvm2
  Found volume group "VG" using metadata type lvm2
[root@link-02 ~]# lvscan
  ACTIVE            '/dev/VolGroup00/LogVol00' [15.53 GB] inherit
  ACTIVE            '/dev/VolGroup00/LogVol01' [1.94 GB] inherit
  ACTIVE            '/dev/VG/mirror' [25.00 GB] inherit
  ACTIVE            '/dev/VG/mirror_mlog' [4.00 MB] inherit
  ACTIVE            '/dev/VG/mirror_mimage_0' [25.00 GB] inherit
  ACTIVE            '/dev/VG/mirror_mimage_1' [25.00 GB] inherit


*Note* There were I/O errors on /dev/sde last night, however that device was not
being used at all in this senario, and the errors occured well before this
problem happened.


Version-Release number of selected component (if applicable):
[root@link-02 ~]# lvcreate --version
  LVM version:     2.01.13 (2005-07-13)
  Library version: 1.01.03 (2005-06-13)
  Driver version:  4.4.0

Comment 1 Corey Marthaler 2005-07-29 15:27:28 UTC

I've seen this twice now while only running two fsstress instances alone to an
ext3 filesystem on top of a mirror.

Comment 2 Corey Marthaler 2005-07-29 16:19:27 UTC

This also happens with a linear running fsstress and while doing pvmoves.

Comment 3 Jonathan Earl Brassow 2005-07-29 20:03:02 UTC

This bug has been around for a long time... probably since the first time the
dm-raid1.c file was included in the kernel.

There is a function, __rh_alloc(), which expects to be called while a lock is
held.  However, it will unlock(read) and relock(write) in order to do
allocation.  This creates many races.

I've changed the way the functions are called (and the arguments to one of them)
so that allocation can happen without unlocking.

This change closes 2 different race conditions.

Comment 4 Kiersten (Kerri) Anderson 2005-08-01 15:35:27 UTC

Devel ACK - needs to get fixed in order for mirrors to be useful during U2 beta
on SMP machines.

Comment 5 Jonathan Earl Brassow 2005-08-02 21:47:12 UTC

All of this chasing around for race conditions could have been avoided if I
would have realized that what was meant by 'create_workqueue' was
'create_singlethread_workqueue'...  This eliminates the races and (as best I can
tell) is what the original author envisioned.

The new patch is a couple line change.

Comment 6 Jonathan Earl Brassow 2005-08-02 22:10:44 UTC

Patch submitted to rhkernel.

It has been tested by QA and has successfully solved the problem.

Comment 10 Corey Marthaler 2005-08-09 18:12:31 UTC

I saw this once this morning with the latest kern and GFS build while running an
I/O load to a mirror with GFS on top. I have since been attempting to reproduce
again but have not been able to. Will continue to try...

[root@link-08 mirror]# uname -ar
Linux link-08 2.6.9-15.ELsmp #1 SMP Fri Aug 5 19:00:35 EDT 2005 x86_64 x86_64
x86_64 GNU/Linux

Aug  9 04:39:31 link-08 kernel: GFS 2.6.9-38.0 (built Aug  8 2005 15:52:32)
installed
Aug  9 04:39:50 link-08 kernel: GFS: Trying to join cluster "lock_nolock", ""
Aug  9 04:39:50 link-08 kernel: Lock_Nolock 2.6.9-38.0 (built Aug  8 2005
15:52:12) installed
Aug  9 04:39:50 link-08 kernel: GFS: fsid=dm-5.0: Joined cluster. Now mounting FS...
Aug  9 04:39:50 link-08 kernel: GFS: fsid=dm-5.0: jid=0: Trying to acquire
journal lock...
Aug  9 04:39:50 link-08 kernel: GFS: fsid=dm-5.0: jid=0: Looking at journal...
Aug  9 04:39:50 link-08 kernel: GFS: fsid=dm-5.0: jid=0: Done
Unable to handle kernel NULL pointer dereference at 0000000000000038 RIP:
<ffffffffa00e3704>{:dm_mirror:mirror_end_io+84}
PML4 12bf7067 PGD 12bfa067 PMD 0
Oops: 0002 [1] SMP
CPU 0
Modules linked in: lock_nolock(U) gfs(U) lock_harness(U) md5 ipv6 parport_pc lp
parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core button
battery ac ohci_hcd hw_random tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd
dm_mod qla2300 qla2xxx scsi_transport_fc mptscsih mptbase sd_mod scsi_mod
Pid: 4083, comm: fsstress Tainted: G   M  2.6.9-15.ELsmp
RIP: 0010:[<ffffffffa00e3704>] <ffffffffa00e3704>{:dm_mirror:mirror_end_io+84}
RSP: 0018:ffffffff8044a6e8  EFLAGS: 00010216
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffff0000274640
RDX: ffffff0000274640 RSI: 0000000000001795 RDI: 000001001f136c30
RBP: 000001001f136c18 R08: ffffff0000273000 R09: 00000000000007ff
R10: 000001001f52ac00 R11: ffffffffa009a0c2 R12: 000001001f136c30
R13: 000001001f136c00 R14: 0000010036bc7350 R15: 0000000000000000
FS:  0000002a95576b00(0000) GS:ffffffff804d3280(005b) knlGS:00000000f7fdd6c0
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000000038 CR3: 0000000000101000 CR4: 00000000000006e0
Process fsstress (pid: 4083, threadinfo 00000100366e4000, task 000001001f11b7f0)
Stack: ffffffffa00e36b0 000001003466ab38 0000000000000000 0000010036bc7340
       000001002e1c4d80 0000000000000000 0000000000007000 ffffffffa009a122
       000001001f136c00 000001002e1c4d80
Call Trace:<IRQ> <ffffffffa00e36b0>{:dm_mirror:mirror_end_io+0}
<ffffffffa009a122>{:dm_mod:clone_endio+96}
       <ffffffffa00e2428>{:dm_mirror:write_callback+314}
<ffffffffa009ee83>{:dm_mod:endio+184}
       <ffffffffa009a18f>{:dm_mod:clone_endio+205}
<ffffffff80249ae2>{__end_that_request_first+238}
       <ffffffffa0006d7d>{:scsi_mod:scsi_end_request+40}
<ffffffffa0007092>{:scsi_mod:scsi_io_completion+497}
       <ffffffffa0002d21>{:scsi_mod:scsi_softirq+213}
<ffffffff8013b774>{__do_softirq+88}
       <ffffffff8013b81d>{do_softirq+49} <ffffffff80112f77>{do_IRQ+328}
       <ffffffff8011061b>{ret_from_intr+0}  <EOI>
<ffffffff801cbb04>{dummy_inode_permission+0}
       <ffffffff80184d24>{open_namei+289} <ffffffff80184d10>{open_namei+269}
       <ffffffff801761c4>{filp_open+39} <ffffffff801e9c85>{strncpy_from_user+74}
       <ffffffff801762cd>{get_unused_fd+230} <ffffffff801763bc>{sys_open+57}
       <ffffffff801250ff>{cstar_do_call+27}

Code: f0 ff 4b 38 0f 94 c0 84 c0 0f 84 4b 01 00 00 49 8d 7d 50 e8
RIP <ffffffffa00e3704>{:dm_mirror:mirror_end_io+84} RSP <ffffffff8044a6e8>
CR2: 0000000000000038
 <0>Kernel panic - not syncing: Oops


After a reboot and another mount attempt:

Lock_Harness 2.6.9-38.0 (built Aug  8 2005 15:52:12) installed
GFS 2.6.9-38.0 (built Aug  8 2005 15:52:32) installed
GFS: Trying to join cluster "lock_nolock", ""
Lock_Nolock 2.6.9-38.0 (built Aug  8 2005 15:52:12) installed
GFS: fsid=dm-5.0: Joined cluster. Now mounting FS...
GFS: fsid=dm-5.0: jid=0: Trying to acquire journal lock...
GFS: fsid=dm-5.0: jid=0: Looking at journal...
GFS: fsid=dm-5.0: jid=0: Acquiring the transaction lock...
GFS: fsid=dm-5.0: jid=0: Replaying journal...
GFS: fsid=dm-5.0: fatal: filesystem consistency error
GFS: fsid=dm-5.0:   function = foreach_descriptor
GFS: fsid=dm-5.0:   file =
/usr/src/build/600098-x86_64/BUILD/gfs-kernel-2.6.9-38/smp/src/gfs/recovery.c,
line = 382
GFS: fsid=dm-5.0:   time = 1123582218
GFS: fsid=dm-5.0: about to withdraw from the cluster
GFS: fsid=dm-5.0: waiting for outstanding I/O
GFS: fsid=dm-5.0: telling LM to withdraw
GFS: fsid=dm-5.0: withdrawn
GFS: fsid=dm-5.0: start = 13122448
  mh_magic = 0x01161970
  mh_type = 8
  mh_generation = 0
  mh_format = 800
  mh_incarn = 0
  lh_flags = 0x00000000
  lh_pad = 0
  lh_first = 13122384
  lh_sequence = 33296
  lh_tail = 13122000
  lh_last_dump = 13119488
  lh_reserved =
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
GFS: fsid=dm-5.0: jid=0: Failed
GFS: fsid=dm-5.0: error recovering journal 0: -5
Aug  9 05:10:17 link-08 kernel: Lock_Harness 2.6.9-38.0 (built Aug  8 2005
15:52:12) installed
Aug  9 05:10:17 link-08 kernel: GFS 2.6.9-38.0 (built Aug  8 2005 15:52:32)
installed
Aug  9 05:10:17 link-08 kernel: GFS: Trying to join cluster "lock_nolock", ""
Aug  9 05:10:17 link-08 kernel: Lock_Nolock 2.6.9-38.0 (built Aug  8 2005
15:52:12) installed
Aug  9 05:10:17 link-08 kernel: GFS: fsid=dm-5.0: Joined cluster. Now mounting FS...
Aug  9 05:10:17 link-08 kernel: GFS: fsid=dm-5.0: jid=0: Trying to acquire
journal lock...
Aug  9 05:10:17 link-08 kernel: GFS: fsid=dm-5.0: jid=0: Looking at journal...
Aug  9 05:10:17 link-08 kernel: GFS: fsid=dm-5.0: jid=0: Acquiring the
transaction lock...
Aug  9 05:10:17 link-08 kernel: GFS: fsid=dm-5.0: jid=0: Replaying journal...
Aug  9 05:10:18 link-08 kernel: GFS: fsid=dm-5.0: fatal: filesystem consistency
error
Aug  9 05:10:18 link-08 kernel: GFS: fsid=dm-5.0:   function = foreach_descriptor
Aug  9 05:10:18 link-08 kernel: GFS: fsid=dm-5.0:   file =
/usr/src/build/600098-x86_64/BUILD/gfs-kernel-2.6.9-38/smp/src/gfs/recovery.c,
line = 382
Aug  9 05:10:18 link-08 kernel: GFS: fsid=dm-5.0:   time = 1123582218
Aug  9 05:10:18 link-08 kernel: GFS: fsid=dm-5.0: about to withdraw from the cluster
Aug  9 05:10:18 link-08 kernel: GFS: fsid=dm-5.0: waiting for outstanding I/O
Aug  9 05:10:18 link-08 kernel: GFS: fsid=dm-5.0: telling LM to withdraw
Aug  9 05:10:18 link-08 kernel: GFS: fsid=dm-5.0: withdrawn
Aug  9 05:10:18 link-08 kernel: GFS: fsid=dm-5.0: start = 13122448
Aug  9 05:10:18 link-08 kernel:   mh_magic = 0x01161970
Aug  9 05:10:18 link-08 kernel:   mh_type = 8
Aug  9 05:10:18 link-08 kernel:   mh_generation = 0
Aug  9 05:10:18 link-08 kernel:   mh_format = 800
Aug  9 05:10:18 link-08 kernel:   mh_incarn = 0
Aug  9 05:10:18 link-08 kernel:   lh_flags = 0x00000000
Aug  9 05:10:18 link-08 kernel:   lh_pad = 0
Aug  9 05:10:18 link-08 kernel:   lh_first = 13122384
Aug  9 05:10:18 link-08 kernel:   lh_sequence = 33296
Aug  9 05:10:18 link-08 kernel:   lh_tail = 13122000
Aug  9 05:10:18 link-08 kernel:   lh_last_dump = 13119488
Aug  9 05:10:18 link-08 kernel:   lh_reserved =
Aug  9 05:10:18 link-08 kernel: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Aug  9 05:10:18 link-08 last message repeated 3 times
Aug  9 05:10:18 link-08 kernel: GFS: fsid=dm-5.0: jid=0: Failed
Aug  9 05:10:18 link-08 kernel: GFS: fsid=dm-5.0: error recovering journal 0: -5
Aug  9 05:10:19 link-08 hald[2847]: Timed out waiting for hotplug event 308.
Rebasing to 309
Aug  9 05:10:32 link-08 hal.hotplug[3642]: timout(10000 ms) waiting for
/block/diapered_dm-5

Comment 11 Alasdair Kergon 2005-08-09 18:37:55 UTC

The error handling in mirror_end_io looks rather harsh to me.

How is EWOULDBLOCK handled with read ahead?  Or EOPNOTSUPP ?

Comment 12 Alasdair Kergon 2005-08-09 18:46:48 UTC

Also, ms->lock appears to be used from both interrupt and non-interrupt context
without suspending interrupts.

Comment 13 Alasdair Kergon 2005-08-09 18:57:19 UTC

And map_context->ptr can still get referenced after it's been freed, which would
be my guess at the cause of this particular crash.

Comment 14 Alasdair Kergon 2005-08-09 19:10:48 UTC

I take that back, confused by the non-simple flow through the function.
The  rw == READ  test is always true?

Comment 15 Jonathan Earl Brassow 2005-08-09 20:23:16 UTC

Speculation is the way to the dark side.  :)

Keying in on the important Oops statements:
Code: f0 ff 4b 38 0f 94 c0 84 c0 0f 84 4b 01 00 00 49 8d 7d 50 e8
RIP <ffffffffa00e3704>{:dm_mirror:mirror_end_io+84} RSP <ffffffff8044a6e8>
CR2: 0000000000000038

Doing objdump -D on the mirror module:
00000000000026b0 <mirror_end_io>:
<snip>
    26f4:       e8 64 e5 ff ff          callq  c5d <__rh_lookup>
    26f9:       4c 89 e7                mov    %r12,%rdi
    26fc:       48 89 c3                mov    %rax,%rbx
    26ff:       e8 00 00 00 00          callq  2704 <mirror_end_io+0x54>
    2704:       f0 ff 4b 38             lock decl 0x38(%rbx)

Add +84 (+0x54) to the mirror_end_io function to get 2704.  Goto line 2704, and
verify the machine code - yup, this is exactly the line we are looking for...

Translate that to real code to get:
static void rh_dec(struct region_hash *rh, region_t region)
{
	unsigned long flags;
	struct region *reg;
	int should_wake = 0;

	read_lock(&rh->hash_lock);
	reg = __rh_lookup(rh, region);
	read_unlock(&rh->hash_lock);

	if (atomic_dec_and_test(&reg->pending)) {  <--- right here

Could it be our 'reg' variable?  One more check... if it is, pending would must
be offset from reg by 0x38.  Indeed!  'pending' is at 0x38 offset in the region
structure (assuming a 64-bit machine will pad 4 bytes between state and
hash_list - which it will)!

We can be reasonably certain that our bug is a know race that we have thought of
in theory but have been trying to reproduce...

Could the following patch fix it?

diff -purN linux-2.6.9-12.EL-01/drivers/md/dm-raid1.c linux-2.6.9-12.EL-02/drive
rs/md/dm-raid1.c
--- linux-2.6.9-12.EL-01/drivers/md/dm-raid1.c  2005-08-03 15:28:40.031213364 -0
500
+++ linux-2.6.9-12.EL-02/drivers/md/dm-raid1.c  2005-08-03 15:30:24.220524638 -0
500
@@ -378,6 +378,11 @@ static void rh_inc(struct region_hash *r

        read_lock(&rh->hash_lock);
        reg = __rh_find(rh, region);
+
+       spin_lock_irq(&rh->region_lock);
+       atomic_inc(&reg->pending);
+       spin_unlock_irq(&rh->region_lock);
+
        if (reg->state == RH_CLEAN) {
                rh->log->type->mark_region(rh->log, reg->key);

@@ -387,7 +392,6 @@ static void rh_inc(struct region_hash *r
                spin_unlock_irq(&rh->region_lock);
        }

-       atomic_inc(&reg->pending);
        read_unlock(&rh->hash_lock);
 }

@@ -409,17 +413,17 @@ static void rh_dec(struct region_hash *r
        reg = __rh_lookup(rh, region);
        read_unlock(&rh->hash_lock);

+       spin_lock_irqsave(&rh->region_lock, flags);
        if (atomic_dec_and_test(&reg->pending)) {
-               spin_lock_irqsave(&rh->region_lock, flags);
                if (reg->state == RH_RECOVERING) {
                        list_add_tail(&reg->list, &rh->quiesced_regions);
                } else {
                        reg->state = RH_CLEAN;
                        list_add(&reg->list, &rh->clean_regions);
                }
-               spin_unlock_irqrestore(&rh->region_lock, flags);
                should_wake = 1;
        }
+       spin_unlock_irqrestore(&rh->region_lock, flags);

        if (should_wake)
                wake();

Comment 16 Corey Marthaler 2005-08-09 21:41:25 UTC

Reproduced this again while attempting to hit 165242. 

This was on an x86 UP machine running a 2.6.9-15.EL kernel compiled without
linux-2.6.9-dm-mirroring.patch. I was running ext3 to a 25M linear and running 6
iterations of fsstress and while doing pvmoves at the same time to simulate
mirroring. This had ran around 1.5 - 2 hours before triping.


Unable to handle kernel NULL pointer dereference at virtual address 00000020
 printing eip:
e01aa6b2
*pde = 1a1c7067
Oops: 0002 [#1]
Modules linked in: md5(U) ipv6(U) parport_pc(U) lp(U) parport(U) autofs4(U)
i2c_dev(U) i2c_core(U) sunrpc(U) dm_mirror(U) button(U) battery(U) ac(U)
uhci_hcd(U) ehci_hcd(U) e1000(U) floppy(U) ext3(U) jbd(U) dm_mod(U)
qla2300(U)qla2xxx(U) scsi_transport_fc(U) sd_mod(U) scsi_mod(U)
CPU:    0
EIP:    0060:[<e01aa6b2>]    Not tainted VLI
EFLAGS: 00010286   (2.6.9-prep)
EIP is at rh_dec+0xc/0x124 [dm_mirror]
eax: 00000000   ebx: df50628c   ecx: 00000000   edx: 00000000
esi: df506280   edi: d6a499fc   ebp: d459eb80   esp: c03e4eb0
ds: 007b   es: 007b   ss: 0068
Process swapper (pid: 0, threadinfo=c03e4000 task=c0358be0)
Stack: 00000000 df506280 d6a499fc e01abaf2 d67d29fc 00000000 e0060187 d6a49a04
       e007e080 e01abacf 00000000 d459eb80 00000400 e0060122 00000000 c016df35
       d459eb80 df506280 e01aaef7 db207900 e01aaf3c 00000000 00000000 d459eb80
Call Trace:
 [<e01abaf2>] mirror_end_io+0x23/0x28 [dm_mirror]
 [<e0060187>] clone_endio+0x65/0xee [dm_mod]
 [<e01abacf>] mirror_end_io+0x0/0x28 [dm_mirror]
 [<e0060122>] clone_endio+0x0/0xee [dm_mod]
 [<c016df35>] bio_endio+0x4f/0x54
 [<e01aaef7>] write_callback+0x0/0x49 [dm_mirror]
 [<e01aaf3c>] write_callback+0x45/0x49 [dm_mirror]
 [<e00653c6>] dec_count+0x43/0x48 [dm_mod]
 [<e0065499>] endio+0x39/0x46 [dm_mod]
 [<e0065460>] endio+0x0/0x46 [dm_mod]
 [<c016df35>] bio_endio+0x4f/0x54
 [<c0250dcc>] __end_that_request_first+0xea/0x1ab
 [<e0046dcc>] scsi_end_request+0x1b/0x174 [scsi_mod]
 [<e0047263>] scsi_io_completion+0x20b/0x417 [scsi_mod]
 [<e0042001>] scsi_finish_command+0xad/0xb1 [scsi_mod]
 [<e0041f26>] scsi_softirq+0xba/0xc2 [scsi_mod]
 [<c012677d>] __do_softirq+0x35/0x79
 [<c0109338>] do_softirq+0x46/0x4d
 =======================
 [<c01088fc>] do_IRQ+0x2b3/0x2bf
 [<c030fb6c>] common_interrupt+0x18/0x20
 [<c010403b>] default_idle+0x23/0x26
 [<c010408c>] cpu_idle+0x1f/0x34
 [<c03b86b9>] start_kernel+0x214/0x216
Code: d3 ed a8 20 0f 45 fd 0f 45 ea 89 e9 89 fa 89 f0 e8 b2 fe ff ff 8b 5b 08 eb
d3 5b 5e 5f 5d c3 57 56 53 89 c3e8 3f f9 ff ff 89 c1 <ff> 48 20 0f 94 c0 84 c0
0f 84 06 01 00 00 9c 5f fa 81 7b 20 3c
 <0>Kernel panic - not syncing: Fatal exception in interrupt

Comment 17 Jonathan Earl Brassow 2005-08-10 03:54:11 UTC

Exactly the same spot (except on x86):

drivers/md/dm-raid1.c:400
    16a6:       57                      push   %edi
    16a7:       56                      push   %esi
    16a8:       53                      push   %ebx
    16a9:       89 c3                   mov    %eax,%ebx
drivers/md/dm-raid1.c:406
    16ab:       e8 3f f9 ff ff          call   fef <__rh_lookup>
    16b0:       89 c1                   mov    %eax,%ecx
include/asm/atomic.h:135
    16b2:       ff 48 20                decl   0x20(%eax)  <--- right here

16a6+c = 16b2

Translate to C code:

static void rh_dec(struct region_hash *rh, region_t region)
{
        unsigned long flags;
        struct region *reg;
        int should_wake = 0;

        read_lock(&rh->hash_lock);
        reg = __rh_lookup(rh, region);
        read_unlock(&rh->hash_lock);

        if (atomic_dec_and_test(&reg->pending)) {  <--- right here

Note that in this case the offset is 0x20.  This is because pointers are four
bytes on x86...  We can easily see that 'pending' is 0x20 from the top of the
region structure on x86.

Comment 18 Jonathan Earl Brassow 2005-08-10 05:42:17 UTC

by "6 iterations of fsstress" do you mean 'fsstress -p 6' ?

Comment 19 Jonathan Earl Brassow 2005-08-10 07:09:34 UTC

Taking a look at the following code:
######################################################################
        static void rh_inc(struct region_hash *rh, region_t region)
	{
		struct region *reg;
	
		read_lock(&rh->hash_lock);
		reg = __rh_find(rh, region);
		if (reg->state == RH_CLEAN) {
			rh->log->type->mark_region(rh->log, reg->key);
	
			spin_lock_irq(&rh->region_lock);
			reg->state = RH_DIRTY;
			list_del_init(&reg->list);	/* take off the clean list */
			spin_unlock_irq(&rh->region_lock);
		}
	
		atomic_inc(&reg->pending);
		read_unlock(&rh->hash_lock);
	}
	
	static void rh_inc_pending(struct region_hash *rh, struct bio_list *bios)
	{
		struct bio *bio;
	
		for (bio = bios->head; bio; bio = bio->bi_next)
			rh_inc(rh, bio_to_region(rh, bio));
	}
	
	static void rh_dec(struct region_hash *rh, region_t region)
	{
		unsigned long flags;
		struct region *reg;
		int should_wake = 0;
	
		read_lock(&rh->hash_lock);
		reg = __rh_lookup(rh, region);
		read_unlock(&rh->hash_lock);
	
		if (atomic_dec_and_test(&reg->pending)) {
			spin_lock_irqsave(&rh->region_lock, flags);
			if (reg->state == RH_RECOVERING) {
				list_add_tail(&reg->list, &rh->quiesced_regions);
			} else {
				reg->state = RH_CLEAN;
				list_add(&reg->list, &rh->clean_regions);
			}
			spin_unlock_irqrestore(&rh->region_lock, flags);
			should_wake = 1;
		}
	
		if (should_wake)
			wake();
	}
######################################################################

It would be possible for the process to be in rh_inc, just after failing the
(reg-state == RH_CLEAN), then be interrupted by rh_dec.  rh_dec (if there is
only one write outstanding) would place the region on the clean list.  process
control returns to rh_inc, which increments pending; but the region is still on
the clean list.  The write (which caused the rh_inc) proceeds, but
rh_update_states removes the region because it is on the clean list.  When the
new write completes, the region lookup returns NULL - causing the Oops.

This is an _extremely_ small window of opportunity.  Could it be this?

Comment 22 Red Hat Bugzilla 2005-10-05 13:46:44 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-514.html