Bug 165242
| Summary: | mirrors possibly reporting invalid blocks to the filesystem | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | kernel | Assignee: | Jonathan Earl Brassow <jbrassow> |
| Status: | CLOSED ERRATA | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.0 | CC: | agk |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | RHSA-2005-514 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2005-10-05 13:48:56 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 156323 | | |
Description (Corey Marthaler, 2005-08-05 19:24:04 UTC)

lvms:

```
[root@link-12 ~]# pvscan
  PV /dev/sdb1   VG coreyvg   lvm2 [40.71 GB / 216.00 MB free]
  PV /dev/sdb2   VG coreyvg   lvm2 [40.71 GB / 216.00 MB free]
  PV /dev/sdb3   VG coreyvg   lvm2 [40.71 GB / 40.71 GB free]
  PV /dev/sdb5   VG coreyvg   lvm2 [40.71 GB / 40.71 GB free]
  PV /dev/sda1   VG gfs       lvm2 [270.96 GB / 988.00 MB free]
  Total: 5 [433.81 GB] / in use: 5 [433.81 GB] / in no VG: 0 [0]
[root@link-12 ~]# vgscan
  Reading all physical volumes. This may take a while...
  Found volume group "coreyvg" using metadata type lvm2
  Found volume group "gfs" using metadata type lvm2
[root@link-12 ~]# lvscan
  ACTIVE   '/dev/coreyvg/coreymirror' [40.50 GB] inherit
  ACTIVE   '/dev/coreyvg/coreymirror_mlog' [4.00 MB] inherit
  ACTIVE   '/dev/coreyvg/coreymirror_mimage_0' [40.50 GB] inherit
  ACTIVE   '/dev/coreyvg/coreymirror_mimage_1' [40.50 GB] inherit
  ACTIVE   '/dev/gfs/gfs' [270.00 GB] inherit
[root@link-12 ~]# lvs --all
  LV                     VG      Attr   LSize   Origin Snap%  Move Log              Copy%
  coreymirror            coreyvg mwi-a-  40.50G                    coreymirror_mlog  1.96
  [coreymirror_mimage_0] coreyvg iwi-ao  40.50G
  [coreymirror_mimage_1] coreyvg iwi-ao  40.50G
  [coreymirror_mlog]     coreyvg lwi-ao   4.00M
  gfs                    gfs     -wi-a- 270.00G
```

I was just playing around with four mirrors and ext3 and saw errors, and this time a hang of the fs.

```
[...]
Aug  5 10:36:36 link-12 kernel: device-mapper: All replicated volumes dead, failing I/O
Aug  5 10:36:36 link-12 kernel: device-mapper: A read failure occurred on a mirror device.
Aug  5 10:36:36 link-12 kernel: device-mapper: incrementing error_count on 253:7
Aug  5 10:36:36 link-12 kernel: device-mapper: All mirror devices are dead. Unable to choose mirror.
Aug  5 10:36:36 link-12 kernel: device-mapper: All replicated volumes dead, failing I/O
Aug  5 10:36:36 link-12 kernel: device-mapper: All mirror devices are dead. Unable to choose mirror.
Aug  5 10:36:36 link-12 kernel: EXT2-fs error (device dm-8): read_inode_bitmap: Cannot read inode bitmap - block_group = 0, inode_bitmap = 1028
[...]
```
```
[root@link-12 ~]# lvs --all
  LV                      VG      Attr   LSize   Origin Snap%  Move Log               Copy%
  coreymirror0            coreyvg mwi-ao  20.00G                    coreymirror0_mlog 17.79
  [coreymirror0_mimage_0] coreyvg iwi-ao  20.00G
  [coreymirror0_mimage_1] coreyvg iwi-ao  20.00G
  [coreymirror0_mlog]     coreyvg lwi-ao   4.00M
  coreymirror1            coreyvg mwi-ao  20.00G                    coreymirror1_mlog 17.42
  [coreymirror1_mimage_0] coreyvg iwi-ao  20.00G
  [coreymirror1_mimage_1] coreyvg iwi-ao  20.00G
  [coreymirror1_mlog]     coreyvg lwi-ao   4.00M
  coreymirror2            coreyvg mwi-ao  20.00G                    coreymirror2_mlog 13.67
  [coreymirror2_mimage_0] coreyvg iwi-ao  20.00G
  [coreymirror2_mimage_1] coreyvg iwi-ao  20.00G
  [coreymirror2_mlog]     coreyvg lwi-ao   4.00M
  coreymirror3            coreyvg mwi-ao  20.00G                    coreymirror3_mlog 12.68
  [coreymirror3_mimage_0] coreyvg iwi-ao  20.00G
  [coreymirror3_mimage_1] coreyvg iwi-ao  20.00G
  [coreymirror3_mlog]     coreyvg lwi-ao   4.00M
  gfs                     gfs     -wi-a- 270.00G
[root@link-12 bin]# df -h
Filesystem                        Size  Used Avail Use% Mounted on
/dev/hda2                          36G  2.6G   32G   8% /
/dev/hda1                          99M   16M   79M  17% /boot
none                              248M     0  248M   0% /dev/shm
/dev/mapper/coreyvg-coreymirror0   20G   48M   19G   1% /mnt/mirror0
/dev/mapper/coreyvg-coreymirror1   20G   45M   19G   1% /mnt/mirror1
/dev/mapper/coreyvg-coreymirror2   20G   45M   19G   1% /mnt/mirror2
/dev/mapper/coreyvg-coreymirror3   20G   49M   19G   1% /mnt/mirror3
[root@link-12 bin]# touch /mnt/mirror0/sdkfjs
[root@link-12 bin]# touch /mnt/mirror1/sdkfjs
[hang]
```

Try using something other than ext[23] - just to make sure it isn't the FS. Also, in addition to 'lvs --all', do a 'dmsetup status'. Are you sure your drives aren't spitting out any errors?

Exactly which kernel was this with? (If it didn't include the recently-added patches like linux-2.6.9-dm-raid1-race.patch and linux-2.6.9-bio-clone.patch then please retest.)

Indeed, that kernel would have preceded the bio_clone fix. 2.6.9-15.EL has these fixes. I've asked Corey to retest with these.
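The advice above (check 'lvs --all' and 'dmsetup status' when the mirror misbehaves) can be scripted so that state is captured with timestamps while a test runs. A minimal sketch, not from this bug thread: the log path, interval, and iteration count are arbitrary defaults, and either command is simply skipped if the LVM tools are not installed.

```shell
#!/bin/sh
# Periodically record mirror state so the configuration at failure time
# can be reconstructed afterwards. LOG/INTERVAL/COUNT are placeholder
# defaults for this sketch; a real run would loop for much longer.
LOG=${LOG:-/tmp/mirror-status.log}
INTERVAL=${INTERVAL:-1}
COUNT=${COUNT:-2}

i=0
while [ "$i" -lt "$COUNT" ]; do
    {
        date '+%Y-%m-%d %H:%M:%S'
        # Both commands appear in the thread; skip them when unavailable.
        command -v lvs >/dev/null 2>&1 && lvs --all 2>&1
        command -v dmsetup >/dev/null 2>&1 && dmsetup status 2>&1
        echo '---'
    } >>"$LOG"
    i=$((i + 1))
    [ "$i" -lt "$COUNT" ] && sleep "$INTERVAL"
done
```

With captures bracketing a failure, the timestamps show which mirror legs were still active when the kernel started logging errors.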
Hit this after about 50 minutes of running the fsstress load on a gfs filesystem with no_lock:

```
Aug  8 10:16:58 link-08 kernel: GFS: fsid=dm-5.0: fatal: invalid metadata block
Aug  8 10:16:58 link-08 kernel: GFS: fsid=dm-5.0: bh = 8050939 (magic)
Aug  8 10:16:58 link-08 kernel: GFS: fsid=dm-5.0: function = gfs_rgrp_read
Aug  8 10:16:58 link-08 kernel: GFS: fsid=dm-5.0: file = /usr/src/build/600062-x86_64/BUILD/gfs-kernel-2.6.9-37/smp/src/gfs/rgrp.c, line = 830
Aug  8 10:16:58 link-08 kernel: GFS: fsid=dm-5.0: time = 1123514218
Aug  8 10:16:58 link-08 kernel: GFS: fsid=dm-5.0: about to withdraw from the cluster
Aug  8 10:16:58 link-08 kernel: GFS: fsid=dm-5.0: waiting for outstanding I/O
Aug  8 10:16:58 link-08 kernel: GFS: fsid=dm-5.0: telling LM to withdraw
Aug  8 10:16:58 link-08 kernel: GFS: fsid=dm-5.0: withdrawn
```

```
[root@link-08 ~]# df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00   52G  5.4G   45G  11% /
/dev/hda5                         99M   35M   59M  38% /boot
none                             501M     0  501M   0% /dev/shm
df: `/mnt/mirror': Input/output error
[root@link-08 mirror]# uname -ar
Linux link-08 2.6.9-15.ELsmp #1 SMP Fri Aug 5 19:00:35 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux
```

GFS 2.6.9-37.1 (built Aug 8 2005 13:59:40) installed

I reproduced this as well with a smaller ext fs (25M) on a single-CPU machine running a UP kernel.

Next we probably need to build one-off kernels with some of the recent patches removed to see if it goes away or not. Using older kernels is unlikely to help as they include other bugs we fixed. Could also try using Linus's latest 2.6.13* kernel. [Again, stick with UP 25Meg for further testing.]

Corey is now working on 2.6.9-15.EL with linux-2.6.9-dm-mirroring.patch removed. This will tell us if it was a pre-existing condition.
When reproducing, please list the following in your comment:

1) kernel + any changes (like the removal of linux-2.6.9-dm-mirroring.patch)
2) UP or SMP
3) size of FS if one is used
4) tests run
5) time to reproduce

See also bug 164630: Is this because readahead that returns EWOULDBLOCK is treated as a hard error and causes the device to be disabled?

Testing update of this afternoon's activities:

- x86 UP machine (link-12) running a 2.6.9-15.EL kernel compiled without linux-2.6.9-dm-mirroring.patch. Running ext3 on a 28M linear and running 6 iterations of fsstress while doing pvmoves at the same time to simulate mirroring. This ran around 1.5 - 2 hours before tripping 164630.
- x86 UP machine (link-12) running a 2.6.9-15.EL kernel compiled without linux-2.6.9-dm-mirroring.patch. Running ext3 on a bigger 25G linear and running 6 iterations of fsstress while doing pvmoves at the same time to simulate mirroring. This continues to run.
- x86 UP machine (link-11) running a regular 2.6.9-15.EL kernel. Running ext3 on a 28M mirror and running 6 iterations of fsstress. This _always_ trips 165242 instantaneously.
- x86 UP machine (link-11) running a regular 2.6.9-15.EL kernel. Running b_iogen/b_doio (a block-level I/O generator/write verifier) straight to a 28M mirror. b_iogen/b_doio are this time doing overlapping I/O to random offsets with random transfer sizes between 1k and 4m. This continues to run.
- x86_64 SMP machine (link-08) running a regular 2.6.9-15.EL kernel. Running GFS on a 500M mirror and doing 8 iterations of fsstress. This continues to run but is causing the following errors every minute or so:

```
Aug  9 11:43:37 link-08 kernel: GFS: fsid=dm-5.0: quotad: error = -28
Aug  9 11:48:41 link-08 kernel: GFS: fsid=dm-5.0: quotad: error = -28
```

Reproduced this finally after 2 hours of running a much bigger I/O load to a much bigger filesystem.
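The 28M-mirror case that trips this bug instantly can be sketched as a script. This is not from the bug report: the volume group, LV name, and mount point are placeholders, and the fsstress options are an assumption (they vary by build), so by default the script only prints what it would do. Set DRYRUN= (empty) and run as root on a disposable test machine to execute it.

```shell
#!/bin/sh
# Hypothetical reproduction sketch for the small-mirror case. All names
# are placeholders; DRYRUN=echo (the default) prints the plan instead of
# running privileged LVM commands.
DRYRUN=${DRYRUN:-echo}
VG=${VG:-coreyvg}
LV=${LV:-repromirror}
MNT=${MNT:-/mnt/repro}

repro_plan() {
    $DRYRUN lvcreate -m1 -L 28M -n "$LV" "$VG"   # 2-way mirror, as in the tests above
    $DRYRUN mkfs.ext3 "/dev/$VG/$LV"
    $DRYRUN mkdir -p "$MNT"
    $DRYRUN mount "/dev/$VG/$LV" "$MNT"
    $DRYRUN fsstress -d "$MNT" -n 1000 -p 6      # fsstress options are an assumption
}
repro_plan
```

When filing results from a run like this, the five items in the checklist above (kernel, UP/SMP, FS size, tests, time to reproduce) all fall out of the script's parameters.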
This is with the latest patches from Jon:

```
patch -p1 < /pub/164630-1.patch
patch -p1 < /pub/165242-2.patch
patch -p1 < /pub/spinlock_cleanup-3.patch
```

On x86_64 (still have yet to reproduce on x86). The filesystem/mirror size was 10G. The I/O load consisted of many fsstress/genesis/accordion processes along with pvmoves going on. With these latest patches, this bug is definitely much harder to see, as it used to happen seconds after starting I/O (and, as stated earlier, I still haven't seen it on x86).

Which patches, and applied to which kernel? Please attach patches to this bug so we can see what you've tested! And is this a true reproduction - exactly the same error messages as in comment #1? If not, please be more specific and quote the new messages.

164630-1.patch:

```
diff -purN linux-2.6.9-12.EL-01/drivers/md/dm-raid1.c linux-2.6.9-12.EL-02/drivers/md/dm-raid1.c
--- linux-2.6.9-12.EL-01/drivers/md/dm-raid1.c  2005-08-03 15:28:40.031213364 -0500
+++ linux-2.6.9-12.EL-02/drivers/md/dm-raid1.c  2005-08-03 15:30:24.220524638 -0500
@@ -378,6 +378,11 @@ static void rh_inc(struct region_hash *r
 	read_lock(&rh->hash_lock);
 	reg = __rh_find(rh, region);
+
+	spin_lock_irq(&rh->region_lock);
+	atomic_inc(&reg->pending);
+	spin_unlock_irq(&rh->region_lock);
+
 	if (reg->state == RH_CLEAN) {
 		rh->log->type->mark_region(rh->log, reg->key);
@@ -387,7 +392,6 @@ static void rh_inc(struct region_hash *r
 		spin_unlock_irq(&rh->region_lock);
 	}
-	atomic_inc(&reg->pending);
 	read_unlock(&rh->hash_lock);
 }
@@ -409,17 +413,17 @@ static void rh_dec(struct region_hash *r
 	reg = __rh_lookup(rh, region);
 	read_unlock(&rh->hash_lock);
+	spin_lock_irqsave(&rh->region_lock, flags);
 	if (atomic_dec_and_test(&reg->pending)) {
-		spin_lock_irqsave(&rh->region_lock, flags);
 		if (reg->state == RH_RECOVERING) {
 			list_add_tail(&reg->list, &rh->quiesced_regions);
 		} else {
 			reg->state = RH_CLEAN;
 			list_add(&reg->list, &rh->clean_regions);
 		}
-		spin_unlock_irqrestore(&rh->region_lock, flags);
 		should_wake = 1;
 	}
+	spin_unlock_irqrestore(&rh->region_lock, flags);
 	if (should_wake)
 		wake();
```

165242-2.patch:

```
diff -purN linux-2.6.9-15.EL-race/drivers/md/dm-raid1.c linux-2.6.9-15.EL-01/drivers/md/dm-raid1.c
--- linux-2.6.9-15.EL-race/drivers/md/dm-raid1.c  2005-08-10 10:53:04.267420165 -0500
+++ linux-2.6.9-15.EL-01/drivers/md/dm-raid1.c    2005-08-10 10:58:45.888278366 -0500
@@ -1371,7 +1371,7 @@ static int mirror_map(struct dm_target *
 	 */
 	m = choose_mirror(ms, NULL);
 	if (likely(m)) {
-		bmi = mempool_alloc(bio_map_info_pool, GFP_KERNEL);
+		bmi = mempool_alloc(bio_map_info_pool, GFP_NOIO);
 		if (likely(bmi)) {
 			/* without this, a read is not retryable */
@@ -1408,6 +1408,7 @@ static int mirror_end_io(struct dm_targe
 	int rw = bio_rw(bio);
 	struct mirror_set *ms = (struct mirror_set *) ti->private;
 	struct mirror *m = NULL;
+	struct dm_bio_details *bd = NULL;
 	/*
 	 * We need to dec pending if this was a write.
@@ -1417,9 +1418,13 @@ static int mirror_end_io(struct dm_targe
 		return error;
 	}
-	if (unlikely(error)) {
-		struct dm_bio_details *bd = NULL;
+	if (error == -EOPNOTSUPP)
+		goto out;
+
+	if ((error == -EWOULDBLOCK) && bio_rw_ahead(bio))
+		goto out;
+
+	if (unlikely(error)) {
 		DMERR("A read failure occurred on a mirror device.");
 		if (!map_context->ptr) {
 			/*
@@ -1437,7 +1442,7 @@ static int mirror_end_io(struct dm_targe
 			 * to the daemon for another shot to
 			 * one (if any) intact mirrors.
 			 */
-			if (rw == READ && default_ok(m)) {
+			if (default_ok(m)) {
 				bd = &(((struct bio_map_info *)map_context->ptr)->bmi_bd);
 				DMWARN("Trying different device.");
@@ -1450,6 +1455,7 @@ static int mirror_end_io(struct dm_targe
 		DMERR("All replicated volumes dead, failing I/O");
 	}
+ out:
 	if (map_context->ptr)
 		mempool_free(map_context->ptr, bio_map_info_pool);
```

spinlock_cleanup-3.patch:

```
diff -purN linux-2.6.9-15.EL-01/drivers/md/dm-raid1.c linux-2.6.9-15.EL-02/drivers/md/dm-raid1.c
--- linux-2.6.9-15.EL-01/drivers/md/dm-raid1.c  2005-08-10 10:58:45.888278366 -0500
+++ linux-2.6.9-15.EL-02/drivers/md/dm-raid1.c  2005-08-10 11:21:48.808704093 -0500
@@ -1067,12 +1067,12 @@ static void do_mirror(struct mirror_set
 {
 	struct bio_list reads, writes;
-	spin_lock(&ms->lock);
+	spin_lock_irq(&ms->lock);
 	reads = ms->reads;
 	writes = ms->writes;
 	bio_list_init(&ms->reads);
 	bio_list_init(&ms->writes);
-	spin_unlock(&ms->lock);
+	spin_unlock_irq(&ms->lock);
 	rh_update_states(&ms->rh);
 	do_recovery(ms);
@@ -1320,14 +1320,15 @@ static void mirror_dtr(struct dm_target
 static void queue_bio(struct mirror_set *ms, struct bio *bio, int rw)
 {
+	unsigned long flags;
 	int should_wake = 0;
 	struct bio_list *bl;
 	bl = (rw == WRITE) ? &ms->writes : &ms->reads;
-	spin_lock(&ms->lock);
+	spin_lock_irqsave(&ms->lock, flags);
 	should_wake = !(bl->head);
 	bio_list_add(bl, bio);
-	spin_unlock(&ms->lock);
+	spin_unlock_irqrestore(&ms->lock, flags);
 	if (should_wake)
 		wake();
```

I could probably never be positive that this is an exact reproduction, but it sure seems similar if not the same:

```
Aug 10 09:23:18 link-08 kernel: attempt to access beyond end of device
Aug 10 09:23:18 link-08 kernel: dm-5: rw=0, want=16304032720, limit=20971520
Aug 10 09:23:18 link-08 kernel: attempt to access beyond end of device
Aug 10 09:23:18 link-08 kernel: dm-5: rw=0, want=16304032720, limit=20971520
Aug 10 09:23:18 link-08 kernel: attempt to access beyond end of device
Aug 10 09:23:18 link-08 kernel: dm-5: rw=0, want=16304032720, limit=20971520
Aug 10 09:37:17 link-08 kernel: attempt to access beyond end of device
Aug 10 09:37:17 link-08 kernel: dm-5: rw=0, want=16304032720, limit=20971520
Aug 10 09:45:37 link-08 kernel: attempt to access beyond end of device
Aug 10 09:45:37 link-08 kernel: dm-5: rw=0, want=16304032720, limit=20971520
Aug 10 09:45:37 link-08 kernel: EXT3-fs error (device dm-5): ext3_free_blocks: Freeing blocks not in datazone - block = 2038004089, count = 1
Aug 10 09:45:37 link-08 kernel: Aborting journal on device dm-5.
Aug 10 09:45:37 link-08 kernel: ext3_abort called.
```
```
Aug 10 09:45:37 link-08 kernel: EXT3-fs error (device dm-5): ext3_journal_start_sb: Detected aborted journal
Aug 10 09:45:37 link-08 kernel: Remounting filesystem read-only
Aug 10 09:45:37 link-08 kernel: EXT3-fs error (device dm-5) in ext3_reserve_inode_write: Journal has aborted
Aug 10 09:45:37 link-08 kernel: EXT3-fs error (device dm-5) in ext3_reserve_inode_write: Journal has aborted
Aug 10 09:45:37 link-08 kernel: EXT3-fs error (device dm-5) in ext3_orphan_del: Journal has aborted
Aug 10 09:45:37 link-08 kernel: EXT3-fs error (device dm-5) in ext3_truncate: Journal has aborted
Aug 10 09:45:37 link-08 kernel: EXT3-fs error (device dm-5) in ext3_reserve_inode_write: Journal has aborted
Aug 10 09:45:37 link-08 kernel: EXT3-fs error (device dm-5) in ext3_dirty_inode: Journal has aborted
Aug 10 09:45:37 link-08 kernel: journal commit I/O error
Aug 10 09:45:37 link-08 kernel: EXT3-fs error (device dm-5) in start_transaction: Journal has aborted
Aug 10 09:45:37 link-08 kernel: journal commit I/O error
Aug 10 09:45:37 link-08 kernel: EXT3-fs error (device dm-5) in start_transaction: Journal has aborted
Aug 10 09:45:37 link-08 kernel: journal commit I/O error
Aug 10 09:45:37 link-08 kernel: EXT3-fs error (device dm-5) in start_transaction: Journal has aborted
Aug 10 09:45:48 link-08 kernel: EXT3-fs error (device dm-5): ext3_free_blocks: Freeing blocks not in datazone - block = 2038004089, count = 1
Aug 10 09:45:48 link-08 last message repeated 859 times
Aug 10 09:45:48 link-08 kernel: EXT3-fs error (device dm-5) in ext3_reserve_inode_write: Journal has aborted
Aug 10 09:45:48 link-08 kernel: EXT3-fs error (device dm-5) in ext3_reserve_inode_write: Journal has aborted
Aug 10 09:45:48 link-08 kernel: EXT3-fs error (device dm-5) in ext3_orphan_del: Journal has aborted
Aug 10 09:45:48 link-08 kernel: EXT3-fs error (device dm-5) in ext3_truncate: Journal has aborted
Aug 10 09:45:48 link-08 kernel: __journal_remove_journal_head: freeing b_committed_data
Aug 10 09:46:07 link-08 kernel: attempt to access beyond end of device
Aug 10 09:46:07 link-08 kernel: dm-5: rw=0, want=16304032720, limit=20971520
Aug 10 09:46:07 link-08 kernel: attempt to access beyond end of device
Aug 10 09:46:07 link-08 kernel: dm-5: rw=0, want=16304032720, limit=20971520
Aug 10 09:46:07 link-08 kernel: attempt to access beyond end of device
Aug 10 09:46:07 link-08 kernel: dm-5: rw=0, want=16304032720, limit=20971520
Aug 10 09:46:07 link-08 kernel: attempt to access beyond end of device
[...]
```

But curiously no device-mapper error messages in there this time. And UP 2.6.9-15.EL still, I presume? That makes it a different problem from the original one. And can you confirm that the *first* error in the log was:

```
Aug 10 09:23:18 link-08 kernel: attempt to access beyond end of device
Aug 10 09:23:18 link-08 kernel: dm-5: rw=0, want=16304032720, limit=20971520
```

If so, next time it happens please dump the configuration of the device corresponding to the error message at the time of the error. dm-5 means the minor number is 5, so something like 'dmsetup table -m 5 -j 253' should do this (if 253 is the dm major number). [If that's difficult to do, you could run 'dmsetup table' in a loop with a sleep in between, capturing timestamped output, so you can work it out after the event.]

This was on link-08, a machine with one processor and 2.6.9-15.EL, correct. Yes, that was the very first message as things started going south:

```
Aug 10 09:23:18 link-08 kernel: attempt to access beyond end of device
Aug 10 09:23:18 link-08 kernel: dm-5: rw=0, want=16304032720, limit=20971520
```

If this is a different problem, should I be opening a new bug then?

Yes, please open a new bug. This bug is dedicated to the problem that I wasn't handling READA responses correctly. Thus, all the devices would go missing - giving the problems described. The bug you have after the latest patches has nothing to do with the mirror marking all devices as disabled. Rather, for some reason ext[23] is asking for blocks way outside your range... could be corruption, not sure.

Submitted 165717 for comments #15 - #23.

Which means that after applying those patches, this bug has not been seen in about 16 hours of testing.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-514.html
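As a footnote, the "attempt to access beyond end of device" messages quoted in this bug can be checked mechanically: each pairs the requested sector (want=) against the device size in sectors (limit=). A hypothetical log filter, not part of the bug or its fix; the field layout it assumes matches the syslog lines shown above.

```shell
#!/bin/sh
# Scan a kernel log for dm "want=N, limit=M" lines and report how far past
# the end of the device each request was. Assumes lines shaped like:
#   Aug 10 09:23:18 link-08 kernel: dm-5: rw=0, want=16304032720, limit=20971520
scan_beyond_eod() {
    awk '
        /want=[0-9]+, limit=[0-9]+/ {
            dev = "?"; want = 0; limit = 0
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^dm-[0-9]+:?$/) { dev = $i; sub(/:$/, "", dev) }
                if ($i ~ /^want=/)  { split($i, w, "="); want  = w[2] + 0 }
                if ($i ~ /^limit=/) { split($i, l, "="); limit = l[2] + 0 }
            }
            # %.0f avoids integer truncation of 64-bit sector numbers in awk
            if (want > limit)
                printf "%s: %.0f sectors past end (want=%.0f, limit=%.0f)\n",
                       dev, want - limit, want, limit
        }
    ' "$@"
}
```

Usage would be something like `scan_beyond_eod /var/log/messages`; on the log above it would show dm-5 being asked for a sector roughly 16 billion sectors past a 20971520-sector device, which is what pointed at corruption rather than an off-by-one.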