Bug 498532 - RHEL5 cmirror tracker: multiple device failure of fully synced cmirror causes corruption
RHEL5 cmirror tracker: multiple device failure of fully synced cmirror causes...
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel (Show other bugs)
5.3
All Linux
high Severity high
: rc
: ---
Assigned To: Jonathan Earl Brassow
Red Hat Kernel QE team
:
Depends On:
Blocks: 533192 525215
  Show dependency treegraph
 
Reported: 2009-04-30 17:33 EDT by Corey Marthaler
Modified: 2010-03-30 03:17 EDT (History)
11 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-03-30 03:17:06 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
log from taft-01 (21.46 KB, text/plain)
2009-04-30 17:38 EDT, Corey Marthaler
no flags Details
log from taft-02 (4.43 KB, text/plain)
2009-04-30 17:39 EDT, Corey Marthaler
no flags Details
log from taft-03 (4.55 KB, text/plain)
2009-04-30 17:39 EDT, Corey Marthaler
no flags Details
log from taft-04 (6.95 KB, text/plain)
2009-04-30 17:40 EDT, Corey Marthaler
no flags Details
dump from taft-01 (212.91 KB, text/plain)
2009-05-01 11:52 EDT, Corey Marthaler
no flags Details
dump from taft-02 (197.92 KB, text/plain)
2009-05-01 11:52 EDT, Corey Marthaler
no flags Details
dump from taft-03 (200.70 KB, text/plain)
2009-05-01 11:54 EDT, Corey Marthaler
no flags Details
dump from taft-04 (198.12 KB, text/plain)
2009-05-01 11:54 EDT, Corey Marthaler
no flags Details

  None (edit)
Description Corey Marthaler 2009-04-30 17:33:34 EDT
Description of problem:
Helter_skelter failed both the primary leg and log device at the same time and this resulted in a corrupted volume.

================================================================================
Iteration 0.1 started at Thu Apr 30 15:54:10 CDT 2009                           
================================================================================
Scenario: Kill primary leg and disk log of synced 2 leg mirror(s)               

****** Mirror hash info for this scenario ******
* name:         syncd_primary_log_2legs         
* sync:         1                               
* num mirrors:  1                               
* disklog:      /dev/sdh1                       
* failpv(s):    /dev/sde1 /dev/sdh1             
* leg devices:  /dev/sde1 /dev/sdf1             
************************************************

Creating mirror(s) on taft-02-bond...
taft-02-bond: lvcreate -m 1 -n syncd_primary_log_2legs_1 -L 600M helter_skelter /dev/sde1:0-1000 /dev/sdf1:0-1000 /dev/sdh1:0-150

Waiting until all mirrors become fully syncd...
        0/1 mirror(s) are fully synced: ( 1=2.75% )
        0/1 mirror(s) are fully synced: ( 1=36.75% )
        0/1 mirror(s) are fully synced: ( 1=36.75% )
        0/1 mirror(s) are fully synced: ( 1=41.50% )
        0/1 mirror(s) are fully synced: ( 1=41.50% )
        0/1 mirror(s) are fully synced: ( 1=46.67% )
        0/1 mirror(s) are fully synced: ( 1=52.17% )
        0/1 mirror(s) are fully synced: ( 1=80.75% )
        1/1 mirror(s) are fully synced: ( 1=100.00% )

Creating gfs on top of mirror(s) on taft-01-bond...
Mounting mirrored gfs filesystems on taft-01-bond...
Mounting mirrored gfs filesystems on taft-02-bond...
Mounting mirrored gfs filesystems on taft-03-bond...
Mounting mirrored gfs filesystems on taft-04-bond...

Writing verification files (checkit) to mirror(s) on...
        ---- taft-01-bond ----                         
checkit starting with:                                 
CREATE                                                 
Num files:          100                                
Random Seed:        12027                              
Verify XIOR Stream: /tmp/checkit_syncd_primary_log_2legs_1
Working dir:        /mnt/syncd_primary_log_2legs_1/checkit

        ---- taft-02-bond ----
checkit starting with:        
CREATE                        
Num files:          100       
Random Seed:        11738     
Verify XIOR Stream: /tmp/checkit_syncd_primary_log_2legs_1
Working dir:        /mnt/syncd_primary_log_2legs_1/checkit

        ---- taft-03-bond ----
checkit starting with:        
CREATE                        
Num files:          100       
Random Seed:        11625     
Verify XIOR Stream: /tmp/checkit_syncd_primary_log_2legs_1
Working dir:        /mnt/syncd_primary_log_2legs_1/checkit

        ---- taft-04-bond ----
checkit starting with:        
CREATE                        
Num files:          100       
Random Seed:        11611     
Verify XIOR Stream: /tmp/checkit_syncd_primary_log_2legs_1
Working dir:        /mnt/syncd_primary_log_2legs_1/checkit


<start name="taft-01-bond_1" pid="6572" time="Thu Apr 30 15:55:49 2009" type="cmd" />
<start name="taft-02-bond_1" pid="6574" time="Thu Apr 30 15:55:49 2009" type="cmd" />
<start name="taft-03-bond_1" pid="6576" time="Thu Apr 30 15:55:49 2009" type="cmd" />
<start name="taft-04-bond_1" pid="6578" time="Thu Apr 30 15:55:49 2009" type="cmd" />
Sleeping 10 seconds to get some outsanding GFS I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
        ---- taft-01-bond ----
checkit starting with:
VERIFY
Verify XIOR Stream: /tmp/checkit_syncd_primary_log_2legs_1
Working dir:        /mnt/syncd_primary_log_2legs_1/checkit

        ---- taft-02-bond ----
checkit starting with:
VERIFY
Verify XIOR Stream: /tmp/checkit_syncd_primary_log_2legs_1
Working dir:        /mnt/syncd_primary_log_2legs_1/checkit

        ---- taft-03-bond ----
checkit starting with:
VERIFY
Verify XIOR Stream: /tmp/checkit_syncd_primary_log_2legs_1
Working dir:        /mnt/syncd_primary_log_2legs_1/checkit

        ---- taft-04-bond ----
checkit starting with:
VERIFY
Verify XIOR Stream: /tmp/checkit_syncd_primary_log_2legs_1
Working dir:        /mnt/syncd_primary_log_2legs_1/checkit


Disabling device sde on taft-01-bond
Disabling device sdh on taft-01-bond
Disabling device sde on taft-02-bond
Disabling device sdh on taft-02-bond
Disabling device sde on taft-03-bond
Disabling device sdh on taft-03-bond
Disabling device sde on taft-04-bond
Disabling device sdh on taft-04-bond

Attempting I/O to cause mirror down conversion(s) on taft-02-bond
[HANG]

[root@taft-04 ~]# lvs -a -o +devices
  /dev/sde1: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
[HANG]

[root@taft-01 ~]# ls /mnt/syncd_primary_log_2legs_1
ls: /mnt/syncd_primary_log_2legs_1: Input/output error

Apr 30 15:56:39 taft-01 clogd[6672]: [vqQS1eu0] rw_log:  write failure: Input/output error
Apr 30 15:56:39 taft-01 clogd[6672]: [vqQS1eu0] Error writing to disk log
Apr 30 15:56:39 taft-01 lvm[8622]: Log device, 253:2, has failed.
Apr 30 15:56:39 taft-01 lvm[8622]: Device failure in helter_skelter-syncd_primary_log_2legs_1
Apr 30 15:56:39 taft-01 lvm[8622]: WARNING: dev_open(/etc/lvm/lvm.conf) called while suspended
Apr 30 15:56:39 taft-01 kernel: sd 1:0:0:4: rejecting I/O to offline device
Apr 30 15:56:39 taft-01 kernel: device-mapper: raid1: Primary mirror device has failed while mirror is not in-sync
Apr 30 15:56:39 taft-01 kernel: device-mapper: raid1: Unable to choose alternative primary device
Apr 30 15:56:39 taft-01 kernel: sd 1:0:0:4: rejecting I/O to offline device
Apr 30 15:56:39 taft-01 kernel: GFS: fsid=TAFT:gfs1.0: fatal: I/O error
Apr 30 15:56:39 taft-01 kernel: GFS: fsid=TAFT:gfs1.0:   block = 63440
Apr 30 15:56:39 taft-01 kernel: GFS: fsid=TAFT:gfs1.0:   function = gfs_logbh_wait
Apr 30 15:56:39 taft-01 kernel: GFS: fsid=TAFT:gfs1.0:   file = /builddir/build/BUILD/gfs-kmod-0.1.31/_kmod_build_/src/gfs/dio.c, line = 925
Apr 30 15:56:39 taft-01 kernel: GFS: fsid=TAFT:gfs1.0:   time = 1241124999


I'll attach the logs from each of the 4 nodes as well as include a kern dump of the running processes.

Version-Release number of selected component (if applicable):
2.6.18-128.el5

lvm2-2.02.40-6.el5    BUILT: Fri Oct 24 07:37:33 CDT 2008
lvm2-cluster-2.02.40-7.el5    BUILT: Wed Nov 26 07:19:19 CST 2008
device-mapper-1.02.28-2.el5    BUILT: Fri Sep 19 02:50:32 CDT 2008
Comment 1 Corey Marthaler 2009-04-30 17:38:29 EDT
Created attachment 342012 [details]
log from taft-01
Comment 2 Corey Marthaler 2009-04-30 17:39:00 EDT
Created attachment 342013 [details]
log from taft-02
Comment 3 Corey Marthaler 2009-04-30 17:39:31 EDT
Created attachment 342014 [details]
log from taft-03
Comment 4 Corey Marthaler 2009-04-30 17:40:27 EDT
Created attachment 342015 [details]
log from taft-04
Comment 5 Corey Marthaler 2009-05-01 11:43:55 EDT
This is reproducible. Attaching kern dumps...
Comment 6 Corey Marthaler 2009-05-01 11:52:13 EDT
Created attachment 342103 [details]
dump from taft-01
Comment 7 Corey Marthaler 2009-05-01 11:52:54 EDT
Created attachment 342105 [details]
dump from taft-02
Comment 8 Corey Marthaler 2009-05-01 11:54:16 EDT
Created attachment 342106 [details]
dump from taft-03
Comment 9 Corey Marthaler 2009-05-01 11:54:59 EDT
Created attachment 342108 [details]
dump from taft-04
Comment 10 Jonathan Earl Brassow 2009-05-14 09:43:40 EDT
I don't see any corruption.

What I do see is that GFS got an I/O error from the mirror and withdrew.  The data should still be fine, but we need to find the reason for the I/O error.
Comment 11 Corey Marthaler 2009-06-05 16:39:45 EDT
All operations to the mirror get stuck at the point of failure. As far as I can tell, there is no way to get at the data without a reboot. 

After rebooting my entire cluster, the mirror was unable to be activated which is expected. I then attempted to down convert, which was probably a mistake, because it's going to attempt to down convert to the failed primary leg, at which point the data gets lost. What would have been the proper way to fix this issue and kept the data?

# After a reboot:

  syncd_primary_log_2legs_1            helter_skelter mwi-a- 600.00M                    syncd_primary_log_2legs_1_mlog   0.00         syncd_p                      rimary_log_2legs_1_mimage_0(0),syncd_primary_log_2legs_1_mimage_1(0)
  [syncd_primary_log_2legs_1_mimage_0] helter_skelter Iwi-ao 600.00M                                                                                                                                                                   
  [syncd_primary_log_2legs_1_mimage_1] helter_skelter Iwi-ao 600.00M                                                                  /dev/sd                      d1(0)                                                               
  [syncd_primary_log_2legs_1_mlog]     helter_skelter lwi-ao   4.00M                                                                                                                                                                   

[root@taft-01 ~]# lvconvert -m 0 helter_skelter/syncd_primary_log_2legs_1
  Logical volume syncd_primary_log_2legs_1 converted.

[root@taft-01 ~]# lvs -a -o +devices
  LV                        VG             Attr   LSize   Origin Snap%  Move Log Copy%  Convert Devices        
  LogVol00                  VolGroup00     -wi-ao  58.38G                                       /dev/sda2(0)   
  LogVol01                  VolGroup00     -wi-ao   9.75G                                       /dev/sda2(1868)
  syncd_primary_log_2legs_1 helter_skelter vwi-a- 600.00M                                                      

[root@taft-01 ~]# mount /dev/helter_skelter/syncd_primary_log_2legs_1 /mnt/taft
mount: you must specify the filesystem type


printk: 25 messages suppressed.
Buffer I/O error on device dm-5, logical block 153584
Buffer I/O error on device dm-5, logical block 153584
Buffer I/O error on device dm-5, logical block 0
Buffer I/O error on device dm-5, logical block 1
Buffer I/O error on device dm-5, logical block 2
Buffer I/O error on device dm-5, logical block 3
Buffer I/O error on device dm-5, logical block 4
Buffer I/O error on device dm-5, logical block 5
Buffer I/O error on device dm-5, logical block 6
Buffer I/O error on device dm-5, logical block 7
hfs: unable to find HFS+ superblock
Comment 12 Corey Marthaler 2009-06-05 17:01:23 EDT
The second time I attempted this today, the VG didn't even show up after the reboot.

[root@taft-04 ~]# service cman start
Starting cluster: 
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
                                                           [  OK  ]
[root@taft-04 ~]# service cmirror start
Loading clustered mirror log module:                       [  OK  ]
Starting clustered mirror log server:                      [  OK  ]
[root@taft-04 ~]# service clvmd start
Starting clvmd:                                            [  OK  ]
Activating VGs:   /dev/sdd1: Checksum error
  /dev/sde1: Checksum error
  /dev/sdh1: Checksum error
  2 logical volume(s) in volume group "VolGroup00" now active
                                                           [  OK  ]
[root@taft-04 ~]# lvs -a -o +devices
  /dev/sdd1: Checksum error
  /dev/sde1: Checksum error
  /dev/sdh1: Checksum error
  LV       VG         Attr   LSize  Origin Snap%  Move Log Copy%  Convert Devices        
  LogVol00 VolGroup00 -wi-ao 58.38G                                       /dev/sda2(0)   
  LogVol01 VolGroup00 -wi-ao  9.75G                                       /dev/sda2(1868)
Comment 14 Jonathan Earl Brassow 2009-07-21 14:59:39 EDT
Comment #12 sounds like a completely different bug to me.  The volume group metadata seems to be corrupted in that instance.  The cluster mirror components know nothing about LVM metadata and cannot even address those portions of the disk.  A valid bug for sure - especially if you can reproduce it, but I don't think it is related to the original post.

The problem being seen in the original post is:
- failure of primary (or any other device) write causes the mirror to be "out-of-sync"
- the primary cannot be switched due to the out-of-sync nature of the mirror
- a read results in an I/O error because the mirror can't be sure the back-up leg is up-to-date

Since the logic failure is in drivers/md/dm-raid1.c, this should be reproducible on single machine mirrors also.

Can the bug be reproduced by just failing the primary leg?
Can the bug be reproduced in single machine mode?

I think there may be a workaround, but I would have to try it on a setup with a reproduction of the problem.

Changing the logic to fix the problem in drivers/md/dm-raid1.c has the potential to introduce other regressions.  I will be checking to see if there is a minimally intrusive way to resolve the issue - so for now, I'm adding the devel-ack.
Comment 15 Corey Marthaler 2009-07-23 12:07:02 EDT
Jon and I tested and concluded that this doesn't happen on single machine mirrors. We're working on device failure testing in a cluster mode to get a better handle on this bug.
Comment 16 Jonathan Earl Brassow 2009-07-23 16:45:31 EDT
I had thought the bug could be recreated by:
1) create cluster mirror and wait for sync
2) kill primary device and log
3) write to the mirror on a machine that has not yet reported "in-sync"

This method for reproducing the bug has not worked.  We are currently running tests with a modified kernel that we are hoping will provide us enough information to create a flow chart of how the bug is happening.  Waiting to reproduce....
Comment 17 Jonathan Earl Brassow 2009-07-28 10:27:52 EDT
We are pretty sure we have discovered the cause.  From my pending upstream patch:

<patch>
This patch fixes a bug which was triggering a case where
the primary leg could not be changed (even when the mirror was
in-sync) on failure.

The case involves the failure of the primary device along with
the transient failure of the log device.  The problem is that
bios can be put on the 'failures' list - due to log failure -
before 'fail_mirror' is called due to the primary device failure.
Normally, this is fine, but if the log device failure is transient,
a subsequent iteration of the work thread, 'do_mirror', will
reset 'log_failure'.  The 'do_failures' function then resets
the 'in_sync' variable when processing bios on the failures list.
The 'in_sync' variable is what is used to determine if the
primary device can be switched in the event of a failure.  Since
this has been reset, the primary device is incorrectly assumed
to be not switchable.

The case has been seen in the cluster mirror context, where one
machine realizes the log device is dead before the other machines.
As the responsibilities of the server migrate from one node to
another (because the mirror is being reconfigured due to the failure),
the new server may think for a moment that the log device is fine -
thus resetting the 'log_failure' variable.

In any case, it is inappropiate for us to reset the 'log_failure'
variable.  The above bug simply illustrates that it can actually
hurt us.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>

Index: linux-2.6/drivers/md/dm-raid1.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-raid1.c
+++ linux-2.6/drivers/md/dm-raid1.c
@@ -647,7 +647,13 @@ static void do_writes(struct mirror_set
         */
        dm_rh_inc_pending(ms->rh, &sync);
        dm_rh_inc_pending(ms->rh, &nosync);
-       ms->log_failure = dm_rh_flush(ms->rh) ? 1 : 0;
+
+       /*
+        * If the flush fails on a previous call and succeeds here,
+        * we must not reset the log_failure variable.  We need
+        * userspace interaction to do that.
+        */
+       ms->log_failure = dm_rh_flush(ms->rh) ? 1 : ms->log_failure;

        /*
         * Dispatch io.
</patch>

It seems to me that the bug has pretty low priority, given what it takes to reproduce it.  OTOH, the patch changes one line and already has some testing mileage on it.
Comment 18 Jonathan Earl Brassow 2009-07-28 12:03:57 EDT
file affected is: drivers/md/dm-raid1.c
Comment 19 Don Zickus 2009-11-23 10:32:13 EST
in kernel-2.6.18-175.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 23 errata-xmlrpc 2010-03-30 03:17:06 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html

Note You need to log in before you can comment on or make changes to this bug.