Bug 461498

Summary: RHEL5 cmirror tracker: log failure on 3-way mirror leads to I/O errors
Product: Red Hat Enterprise Linux 5 Reporter: Corey Marthaler <cmarthal>
Component: cmirrorAssignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.3CC: agk, bstevens, ccaulfie, dwysocha, edamato, heinzm, mbroz, syeghiay
Target Milestone: beta   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-01-20 21:26:11 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
log from taft-01 none

Description Corey Marthaler 2008-09-08 16:01:21 UTC
Description of problem:
This is possibly related to bz 446255, though unlike bz 446255, this time there were issues after the log failure, instead of after the restoration of the log device.


Senario: Kill disk log of non synced 3 leg mirror(s)

****** Mirror hash info for this scenario ******
* name:      nonsyncd_log_3legs
* sync:      0
* mirrors:   3
* disklog:   1
* failpv:    /dev/sdh1
* legs:      3
* pvs:       /dev/sdg1 /dev/sde1 /dev/sdf1 /dev/sdh1
************************************************

Creating mirror(s) on taft-01...
taft-01: lvcreate -m 2 -n nonsyncd_log_3legs_1 -L 600M helter_skelter /dev/sdg1:0-1000 /dev/sde1:0-1000 /dev/sdf1:0-1000 /dev/sdh1:0-150
taft-01: lvcreate -m 2 -n nonsyncd_log_3legs_2 -L 600M helter_skelter /dev/sdg1:0-1000 /dev/sde1:0-1000 /dev/sdf1:0-1000 /dev/sdh1:0-150
taft-01: lvcreate -m 2 -n nonsyncd_log_3legs_3 -L 600M helter_skelter /dev/sdg1:0-1000 /dev/sde1:0-1000 /dev/sdf1:0-1000 /dev/sdh1:0-150

Continuing on without fully syncd mirrors, currently at...
        ( 1=10.67% 2=4.75% 3=0.33% )

Creating gfs on top of mirror(s) on taft-01...
Mounting mirrored gfs filesystems on taft-01...
Mounting mirrored gfs filesystems on taft-02...
Mounting mirrored gfs filesystems on taft-03...
Mounting mirrored gfs filesystems on taft-04...

Writing verification files (checkit) to mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Sleeping 10 seconds to get some outsanding GFS I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Disabling device sdh on taft-01
Disabling device sdh on taft-02
Disabling device sdh on taft-03
Disabling device sdh on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-01
dd: opening `/mnt/nonsyncd_log_3legs_1/ddfile': Input/output error
couldn't write to mirror
run_iteration() method failed at /home/msp/cmarthal/work/rhel5/sts-root/lib/FI_engine.pm line 21.
<fail name="taft-01_1" pid="30004" time="Sun Sep  7 12:39:28 2008" type="cmd" duration="96" ec="1" />
ALL STOP!


The log did appear to be properly removed from the mirrors, however the
I/O errors still occurred.

[root@taft-01 ~]# lvs -a -o +devices
  /dev/sdh1: open failed: No such device or address
  LV                              VG             Attr   LSize   Origin Snap%  Move Log Copy%  Convert Devices                                               
  LogVol00                        VolGroup00     -wi-ao  66.19G                                       /dev/sda2(0)                                          
  LogVol01                        VolGroup00     -wi-ao   1.94G                                       /dev/sda2(2118)                                       
  nonsyncd_log_3legs_1            helter_skelter mwi-ao 600.00M                        100.00         nonsyncd_log_3legs_1_mimage_0(_log_3legs_1_mimage_2(0)
  [nonsyncd_log_3legs_1_mimage_0] helter_skelter iwi-ao 600.00M                                       /dev/sdg1(0)                                          
  [nonsyncd_log_3legs_1_mimage_1] helter_skelter iwi-ao 600.00M                                       /dev/sde1(0)                                          
  [nonsyncd_log_3legs_1_mimage_2] helter_skelter iwi-ao 600.00M                                       /dev/sdf1(0)                                          
  nonsyncd_log_3legs_2            helter_skelter mwi-ao 600.00M                        100.00         nonsyncd_log_3legs_2_mimage_0(_log_3legs_2_mimage_2(0)
  [nonsyncd_log_3legs_2_mimage_0] helter_skelter iwi-ao 600.00M                                       /dev/sdg1(150)                                        
  [nonsyncd_log_3legs_2_mimage_1] helter_skelter iwi-ao 600.00M                                       /dev/sde1(150)                                        
  [nonsyncd_log_3legs_2_mimage_2] helter_skelter iwi-ao 600.00M                                       /dev/sdf1(150)                                        
  nonsyncd_log_3legs_3            helter_skelter mwi-ao 600.00M                        100.00         nonsyncd_log_3legs_3_mimage_0(_log_3legs_3_mimage_2(0)
  [nonsyncd_log_3legs_3_mimage_0] helter_skelter iwi-ao 600.00M                                       /dev/sdg1(300)                                        
  [nonsyncd_log_3legs_3_mimage_1] helter_skelter iwi-ao 600.00M                                       /dev/sde1(300)                                        
  [nonsyncd_log_3legs_3_mimage_2] helter_skelter iwi-ao 600.00M                                       /dev/sdf1(300)  


Log:
Sep  7 12:37:59 taft-01 qarshd[9024]: Running cmdline: echo offline > /sys/block/sdh/device/state
Sep  7 12:37:59 taft-01 xinetd[6638]: EXIT: qarsh status=0 pid=9024 duration=0(sec)
Sep  7 12:37:59 taft-01 clogd[6162]: [mSnPP2Pa] rw_log:  write failure: Input/output error
Sep  7 12:37:59 taft-01 clogd[6162]: [mSnPP2Pa] Error writing to disk log
Sep  7 12:37:59 taft-01 kernel: sd 1:0:0:7: rejecting I/O to offline device
Sep  7 12:37:59 taft-01 clogd[6162]: [4A2Z7in7] rw_log:  write failure: Input/output error
Sep  7 12:37:59 taft-01 clogd[6162]: [4A2Z7in7] Error writing to disk log

[...]

Sep  7 12:38:04 taft-01 lvm[6259]: WARNING: dev_open(/dev/sdh1) called while suspended
Sep  7 12:38:04 taft-01 kernel: sd 1:0:0:7: rejecting I/O to offline device
Sep  7 12:38:04 taft-01 xinetd[6638]: START: qarsh pid=9027 from=10.15.80.47
Sep  7 12:38:04 taft-01 qarshd[9027]: Talking to peer 10.15.80.47:43072
Sep  7 12:38:04 taft-01 qarshd[9027]: Running cmdline: dd if=/dev/zero of=/mnt/nonsyncd_log_3legs_1/ddfile count=10 bs=4M
Sep  7 12:38:05 taft-01 kernel: sd 1:0:0:7: rejecting I/O to offline device
Sep  7 12:38:05 taft-01 last message repeated 39 times
Sep  7 12:38:05 taft-01 clogd[6162]: [mSnPP2Pa] rw_log:  write failure: Input/output error
Sep  7 12:38:05 taft-01 clogd[6162]: [mSnPP2Pa] Error writing to disk log
Sep  7 12:38:05 taft-01 kernel: GFS: fsid=TAFT:gfs1.0: fatal: I/O error
Sep  7 12:38:05 taft-01 kernel: GFS: fsid=TAFT:gfs1.0:   block = 61696
Sep  7 12:38:05 taft-01 kernel: GFS: fsid=TAFT:gfs1.0:   function = gfs_logbh_wait
Sep  7 12:38:05 taft-01 kernel: GFS: fsid=TAFT:gfs1.0:   file = /builddir/build/BUILD/gfs-kmod-0.1.24/_kmod_build_/src/gfs/dio.c, line = 925
Sep  7 12:38:05 taft-01 kernel: GFS: fsid=TAFT:gfs1.0:   time = 1220809085
Sep  7 12:38:05 taft-01 kernel: GFS: fsid=TAFT:gfs1.0: about to withdraw from the cluster
Sep  7 12:38:05 taft-01 kernel: GFS: fsid=TAFT:gfs1.0: telling LM to withdraw
Sep  7 12:38:06 taft-01 kernel: GFS: fsid=TAFT:gfs1.0: withdrawn

              
I'll attach the full log from taft-01. There doesn't appear to be anything of value in the other logs.

Also, since this has to do with 3-way mirroring, I will not add this to the 5.3 blocker list unless it's reproduced with 2-way mirroring.

Version-Release number of selected component (if applicable):
2.6.18-98.el5

lvm2-2.02.39-2.el5    BUILT: Wed Jul  9 07:26:29 CDT 2008
lvm2-cluster-2.02.39-1.el5    BUILT: Thu Jul  3 09:31:57 CDT 2008
device-mapper-1.02.27-1.el5    BUILT: Thu Jul  3 03:22:29 CDT 2008
cmirror-1.1.22-1.el5    BUILT: Thu Jul 24 15:59:03 CDT 2008
kmod-cmirror-0.1.13-2.el5    BUILT:  Thu 24 Jul 2008 04:00:48 PM CDT

Comment 1 Corey Marthaler 2008-09-08 16:07:27 UTC
Created attachment 316097 [details]
log from taft-01

Comment 2 Jonathan Earl Brassow 2008-09-29 21:51:37 UTC
Modified by the following check-in:

commit 85d1423ec47e48ab844088ebaf4157327b928ae9
Author: Jonathan Brassow <jbrassow>
Date:   Fri Sep 19 16:19:02 2008 -0500

    dm-log-clustered/clogd: Fix off-by-one error and compilation errors

    Needed to tweek included header files to make dm-log-clustered compile
    again.

    Found an off-by-one error that was causing mirror corruption in the
    case where the primary mirror device was killed in a mirror.


This off-by-one error will manifest itself anytime you are doing I/O while the
mirror is sync'ing.  This could be during the initial sync or a resync after a
failure.

Another reason to suspect that the above is the appropriate fix, is the timing of the last log I/O error and the GFS corruption message that followed - they happen in the same second.  If the log fails, no write I/O will be allowed until a new 'core log' mirror is in place.  So, the write I/O that caused the corruption must have happened previously... leading me to believe the above is correct.

Another possibility would be that the last error message is from one mirror, while the GFS error is from another entirely.  In this event, there could be a small chance of the error being somewhere else.

Comment 6 Corey Marthaler 2008-12-02 22:54:20 UTC
Have not seen this issue since the fix was checked in. Marking verified, will reopen if seen again.

Comment 8 errata-xmlrpc 2009-01-20 21:26:11 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0158.html