Bug 441970

Summary: RHEL5 cmirror tracker: filesystem is missing after 'successful' device failure iteration
Product: Red Hat Enterprise Linux 5
Reporter: Corey Marthaler <cmarthal>
Component: cmirror
Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: high
Priority: high
Version: 5.2
CC: agk, bstevens, ccaulfie, dwysocha, edamato, heinzm, jbrassow, mbroz, syeghiay
Target Milestone: rc
Keywords: TestBlocker
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-01-20 21:25:43 UTC
Bug Blocks: 444983

Description Corey Marthaler 2008-04-10 21:52:33 UTC
Description of problem:
After executing a successful device failure iteration on a 3 legged mirror, my
script was unable to umount the gfs filesystem because it appeared to be missing. 

Stopping the io load (collie/xdoio) on mirror(s)
Unmounting gfs and removing mnt point on taft-01...
/sbin/umount.gfs: there isn't a GFS filesystem on
/dev/mapper/helter_skelter-syncd_primary_3legs_1
/sbin/umount.gfs: there isn't a GFS filesystem on
/dev/mapper/helter_skelter-syncd_primary_3legs_1
couldn't umount /mnt/syncd_primary_3legs_1 on taft-01

[root@taft-02 tmp]# gfs_tool sb /dev/helter_skelter/syncd_primary_3legs_1 all
gfs_tool: there isn't a GFS filesystem on /dev/helter_skelter/syncd_primary_3legs_1

[root@taft-02 tmp]# lvs -a -o +devices
  LV                               VG             Attr   LSize   Origin Snap% Move Log                        Copy%  Convert Devices
  LogVol00                         VolGroup00     -wi-ao  66.19G                                                             /dev/sda2(0)
  LogVol01                         VolGroup00     -wi-ao   1.94G                                                             /dev/sda2(2118)
  syncd_primary_3legs_1            helter_skelter mwi-ao 800.00M                   syncd_primary_3legs_1_mlog 100.00         syncd_primary_3legs_1_mimage_0(0),syncd_primary_3legs_1_mimage_1(0),syncd_primary_3legs_1_mimage_2(0)
  [syncd_primary_3legs_1_mimage_0] helter_skelter iwi-ao 800.00M                                                             /dev/sdg1(0)
  [syncd_primary_3legs_1_mimage_1] helter_skelter iwi-ao 800.00M                                                             /dev/sdh1(0)
  [syncd_primary_3legs_1_mimage_2] helter_skelter iwi-ao 800.00M                                                             /dev/sde1(0)
  [syncd_primary_3legs_1_mlog]     helter_skelter lwi-ao   4.00M                                                             /dev/sdf1(0)
[root@taft-02 tmp]# dmsetup ls
helter_skelter-syncd_primary_3legs_1_mimage_0   (253, 3)
helter_skelter-syncd_primary_3legs_1_mlog       (253, 2)
helter_skelter-syncd_primary_3legs_1    (253, 6)
VolGroup00-LogVol01     (253, 1)
VolGroup00-LogVol00     (253, 0)
helter_skelter-syncd_primary_3legs_1_mimage_2   (253, 5)
helter_skelter-syncd_primary_3legs_1_mimage_1   (253, 4)


I'll try to reproduce this and gather more info, as this bug, if true, could be
very serious. This may be a GFS issue rather than a cmirror one.
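The "there isn't a GFS filesystem" message means the superblock check in gfs_tool/umount.gfs is failing. The sketch below shows the general idea of such a check, run against a scratch file, assuming a GFS1 superblock at a 64 KiB offset with big-endian magic 0x01161970; both the offset and the exact check gfs-utils performs are assumptions, not taken from its source.

```shell
# Sketch of a superblock magic check (offset and magic are assumptions).
IMG=/tmp/fake_gfs.img
dd if=/dev/zero of="$IMG" bs=1024 count=128 2>/dev/null
# Stamp the assumed magic 0x01161970 at the assumed 64 KiB offset.
printf '\001\026\031\160' | dd of="$IMG" bs=1 seek=65536 conv=notrunc 2>/dev/null
# Read the 4 magic bytes back and compare; a mirror leg that scribbled
# over this region would fail the test just like the devices above did.
MAGIC=$(od -An -tx1 -j 65536 -N 4 "$IMG" | tr -d ' ')
if [ "$MAGIC" = "01161970" ]; then
    echo "GFS magic present on $IMG"
else
    echo "there isn't a GFS filesystem on $IMG"
fi
```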


Version-Release number of selected component (if applicable):
2.6.18-88.el5
lvm2-2.02.32-3.el5
lvm2-cluster-2.02.32-4.el5
openais-0.80.3-15.el5
gfs-utils-0.1.16-2.el5
kmod-gfs-0.1.23-3.el5

Comment 1 Corey Marthaler 2008-04-22 14:00:15 UTC
I was able to reproduce this issue.

[...]
Stopping the io load (collie/xdoio) on mirror(s)
Unmounting gfs and removing mnt point on taft-01...
/sbin/umount.gfs: there isn't a GFS filesystem on
/dev/mapper/helter_skelter-syncd_primary_2legs_1
/sbin/umount.gfs: there isn't a GFS filesystem on
/dev/mapper/helter_skelter-syncd_primary_2legs_1
couldn't umount /mnt/syncd_primary_2legs_1 on taft-01

[root@taft-01 tmp]# gfs_tool sb /dev/helter_skelter/syncd_primary_2legs_1 all
gfs_tool: there isn't a GFS filesystem on /dev/helter_skelter/syncd_primary_2legs_1

2.6.18-90.el5
lvm2-2.02.32-4.el5
lvm2-cluster-2.02.32-4.el5
openais-0.80.3-15.el5
gfs-utils-0.1.17-1.el5
kmod-gfs-0.1.23-3.el5

Comment 2 Jonathan Earl Brassow 2008-05-02 14:32:06 UTC
"After executing a successful device failure iteration on a 3 legged mirror, my
script was unable to umount the gfs filesystem because it appeared to be missing."

... but the rest of your data in comment #1 shows that there are still 3 legs
and a log to the mirror.  What gives?  Wouldn't one of the legs be missing if
the fault handling was successful?


Comment 3 Corey Marthaler 2008-05-02 18:19:00 UTC
Jon, that is because by that time in the test case, everything had been put
back together the way it was originally. IOW, the failed device was once again
pvcreated, extended back into the VG, and the cmirror converted back to the
way it was. At that point the test is tearing everything down in order to
create a new cmirror set and try it all again, and it's that clean-up that
spots the "wth, where did my gfs go?"

I'll try to add a couple of gfs checks earlier in the test case, before the
tear down, to verify that the filesystem isn't disappearing earlier, though
the test already does gfs I/O checks before and after most operations, so...
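
The rebuild-and-retry cycle described above can be sketched roughly as follows. This is a hypothetical reconstruction based on Corey's description; the device and VG names come from the log earlier in this report, and the exact commands the helter_skelter harness runs are assumptions.

```shell
# Hypothetical recovery phase (names from the log above; commands assumed):
pvcreate /dev/sdg1                                   # re-initialize the failed leg
vgextend helter_skelter /dev/sdg1                    # return it to the volume group
lvconvert -m 2 helter_skelter/syncd_primary_3legs_1  # convert back to a 3-way mirror

# ...then the tear-down for the next iteration, which is where the
# missing GFS superblock is first noticed:
umount /mnt/syncd_primary_3legs_1
lvremove -f helter_skelter/syncd_primary_3legs_1
```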

Comment 4 Corey Marthaler 2008-07-23 21:10:51 UTC
*** Bug 444983 has been marked as a duplicate of this bug. ***

Comment 5 Corey Marthaler 2008-07-25 15:24:34 UTC
Just a note that this is still reproducible using the same helter_skelter test
case, Scenario: Kill primary leg of synced 2 leg mirror(s).

Comment 6 Corey Marthaler 2008-08-22 14:35:34 UTC
Another note that I'm still seeing this issue; however, it takes running helter_skelter for quite a few iterations before seeing this.

There is probably no reason for this to block beta, however it should still be fixed for the RC.

Comment 7 Kiersten (Kerri) Anderson 2008-09-19 14:31:46 UTC
Adding blocker flag for rc.

Comment 8 Corey Marthaler 2008-09-22 16:06:23 UTC
Just an FYI that this issue still appears in the latest cmirror rpms:

2.6.18-110.el5

lvm2-2.02.39-2.el5    BUILT: Wed Jul  9 07:26:29 CDT 2008
lvm2-cluster-2.02.39-1.el5    BUILT: Thu Jul  3 09:31:57 CDT 2008
device-mapper-1.02.27-1.el5    BUILT: Thu Jul  3 03:22:29 CDT 2008
cmirror-1.1.25-1.el5    BUILT: Fri Sep 19 16:27:46 CDT 2008
kmod-cmirror-0.1.17-1.el5    BUILT: Fri Sep 19 16:27:33 CDT 2008

Comment 10 Jonathan Earl Brassow 2008-09-29 21:33:25 UTC
The following commit should have fixed this issue:

commit 85d1423ec47e48ab844088ebaf4157327b928ae9
Author: Jonathan Brassow <jbrassow>
Date:   Fri Sep 19 16:19:02 2008 -0500

    dm-log-clustered/clogd: Fix off-by-one error and compilation errors

    Needed to tweek included header files to make dm-log-clustered compile
    again.

    Found an off-by-one error that was causing mirror corruption in the
    case where the primary mirror device was killed in a mirror.

Assuming the build date on the RPMs means that this check-in was included, I need to examine the possibility of some of the other scenarios I put forward for corruption.
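
Purely as an illustration of the class of bug the commit message names (the numbers here are made up; the real error lived in clogd's clustered-log code, not in this exact calculation), an off-by-one in dirty-region arithmetic looks like this: truncating division drops the final partial region, so writes landing there are never tracked, which is exactly the kind of thing that corrupts a mirror when the primary leg is killed.

```shell
# Illustrative off-by-one in dirty-region counting (made-up numbers):
DEV_SECTORS=1638401   # device size, not an exact multiple of the region size
REGION_SIZE=1024      # sectors per region

# Truncating division silently loses the trailing partial region...
BUGGY_COUNT=$(( DEV_SECTORS / REGION_SIZE ))
# ...while rounding up covers it, so writes to the tail get tracked.
FIXED_COUNT=$(( (DEV_SECTORS + REGION_SIZE - 1) / REGION_SIZE ))

echo "buggy=$BUGGY_COUNT fixed=$FIXED_COUNT"
```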

Comment 13 errata-xmlrpc 2009-01-20 21:25:43 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0158.html