Bug 596318

Summary: Basic cmirror device failure (with I/O running) is broken
Product: Red Hat Enterprise Linux 6
Reporter: Corey Marthaler <cmarthal>
Component: lvm2
Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED CURRENTRELEASE
QA Contact: Corey Marthaler <cmarthal>
Severity: urgent
Docs Contact:
Priority: high
Version: 6.0
CC: agk, antillon.maurizio, dwysocha, heinzm, jbrassow, joe.thornber, mbroz, msnitzer, pkrul, prajnoha, prockai
Target Milestone: rc
Keywords: Regression, TestBlocker
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: lvm2-2.02.72-6.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-11-10 21:07:58 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 599016
Attachments:
Description         Flags
log from taft-01    none
log from taft-02    none
log from taft-03    none
log from taft-04    none

Description Corey Marthaler 2010-05-26 15:28:01 UTC
Description of problem:
In the following case, the secondary leg should have been removed. That removal failed and left the mirror corrupted. I'll gather more info, such as whether this also occurs on local (non-clustered) LVM.

Scenario: Kill secondary leg of synced core log 2 leg mirror(s)                                                                                         

********* Mirror hash info for this scenario *********
* names:              syncd_secondary_core_2legs_1    
* sync:               1                               
* disklog:            0                               
* failpv(s):          /dev/sdg1                       
* failnode(s):        taft-01 taft-02 taft-03 taft-04 
* leg devices:        /dev/sdc1 /dev/sdg1             
* leg fault policy:   remove                          
* log fault policy:   allocate                        
******************************************************

Creating mirror(s) on taft-04...
taft-04: lvcreate --corelog -m 1 -n syncd_secondary_core_2legs_1 -L 600M helter_skelter /dev/sdc1:0-1000 /dev/sdg1:0-1000

PV=/dev/sdg1
        syncd_secondary_core_2legs_1_mimage_1: 6:
PV=/dev/sdg1                                     
        syncd_secondary_core_2legs_1_mimage_1: 6:

Waiting until all mirrors become fully syncd...
   0/1 mirror(s) are fully synced: ( 43.33% )  
   0/1 mirror(s) are fully synced: ( 79.92% )  
   1/1 mirror(s) are fully synced: ( 100.00% ) 

Creating gfs2 on top of mirror(s) on taft-01...
Mounting mirrored gfs2 filesystems on taft-01...
Mounting mirrored gfs2 filesystems on taft-02...
Mounting mirrored gfs2 filesystems on taft-03...
Mounting mirrored gfs2 filesystems on taft-04...

Writing verification files (checkit) to mirror(s) on...
        ---- taft-01 ----                              
        ---- taft-02 ----                              
        ---- taft-03 ----                              
        ---- taft-04 ----                              

Sleeping 10 seconds to get some outsanding GFS I/O locks before the failure                                 
Verifying files (checkit) on mirror(s) on...                                                                
        ---- taft-01 ----                                                                                   
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Disabling device sdg on taft-01
Disabling device sdg on taft-02
Disabling device sdg on taft-03
Disabling device sdg on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-01
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.361142 s, 116 MB/s
Verifying current sanity of lvm after the failure
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdg1: read failed after 0 of 2048 at 0: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 629080064: Input/output error
  [...]
  Couldn't find device with uuid OL4s9P-QZXm-Gezs-RbjI-9mUu-6mQy-EEYHT0.
Verifying FAILED device /dev/sdg1 is *NOT* in the volume(s)
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdg1: read failed after 0 of 2048 at 0: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 629080064: Input/output error
  [...]
  Couldn't find device with uuid OL4s9P-QZXm-Gezs-RbjI-9mUu-6mQy-EEYHT0.
Verifying LEG device /dev/sdc1 *IS* in the volume(s)
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdg1: read failed after 0 of 2048 at 0: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 629080064: Input/output error
  [...]
  Couldn't find device with uuid OL4s9P-QZXm-Gezs-RbjI-9mUu-6mQy-EEYHT0.
verify the dm devices associated with /dev/sdg1 have been removed as expected
Checking REMOVAL of syncd_secondary_core_2legs_1_mimage_1 on:  taft-01 taft-02 taft-03 taft-04
syncd_secondary_core_2legs_1_mimage_1 on taft-04 should no longer be there


[root@taft-01 ~]# lvs -a -o +devices
  Couldn't find device with uuid OL4s9P-QZXm-Gezs-RbjI-9mUu-6mQy-EEYHT0.
  LV                                    VG             Attr   LSize   Log Copy%  Devices
  syncd_secondary_core_2legs_1          helter_skelter -wi-ao 600.00m            /dev/sdc1(0)
  syncd_secondary_core_2legs_1_mimage_0 helter_skelter vwi-a- 600.00m
  syncd_secondary_core_2legs_1_mimage_1 helter_skelter -wi--- 600.00m            unknown device(0)
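The leftover state in the lvs output above can be checked mechanically. The following is only a minimal sketch, assuming the column layout shown above (LV name in the first column, devices at the end of the line). The function name find_stale_mirror_images is hypothetical and is not part of LVM or of the test harness.

```python
import re

# Matches a mirror image sub-LV name, with or without the [brackets]
# that lvs uses for hidden LVs.
MIMAGE = re.compile(r"^\[?(\S+_mimage_\d+)\]?$")

def find_stale_mirror_images(lvs_output):
    """Return (lv_name, on_unknown_device) for every *_mimage_N LV listed.

    Assumption: after a 2-leg mirror is down-converted to linear,
    no *_mimage_N sub-LVs should remain in the listing.
    """
    stale = []
    for line in lvs_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        match = MIMAGE.match(fields[0])
        if match:
            stale.append((match.group(1), "unknown device" in line))
    return stale

if __name__ == "__main__":
    # Sample text copied from the lvs output in this report.
    sample = """\
  Couldn't find device with uuid OL4s9P-QZXm-Gezs-RbjI-9mUu-6mQy-EEYHT0.
  LV                                    VG             Attr   LSize   Log Copy%  Devices
  syncd_secondary_core_2legs_1          helter_skelter -wi-ao 600.00m            /dev/sdc1(0)
  syncd_secondary_core_2legs_1_mimage_0 helter_skelter vwi-a- 600.00m
  syncd_secondary_core_2legs_1_mimage_1 helter_skelter -wi--- 600.00m            unknown device(0)
"""
    for name, unknown in find_stale_mirror_images(sample):
        print(name, "(on unknown device)" if unknown else "")
```

Run against the output above, this reports both mimage_0 and mimage_1 as still present, with mimage_1 flagged as sitting on the unknown device.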


Version-Release number of selected component (if applicable):
2.6.32-28.el6bz590851_v1.x86_64

lvm2-2.02.65-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
lvm2-libs-2.02.65-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
lvm2-cluster-2.02.65-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
device-mapper-1.02.48-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
device-mapper-libs-1.02.48-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
device-mapper-event-1.02.48-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
device-mapper-event-libs-1.02.48-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
cmirror-2.02.65-1.el6    BUILT: Wed May 19 11:19:57 CDT 2010

Comment 1 Corey Marthaler 2010-05-26 15:31:34 UTC
I'll attach the full logs, but here's the bit about the repair:

taft-01:
May 26 15:15:38 taft-01 lvm[19586]: Couldn't find device with uuid OL4s9P-QZXm-Gezs-RbjI-9mUu-6mQy-EEYHT0.
May 26 15:15:39 taft-01 lvm[19586]: Repair of mirrored LV helter_skelter/syncd_secondary_core_2legs_1 finished successfully.

taft-02:
May 26 10:14:05 taft-02 lvm[17402]: Error locking on node taft-04: LV helter_skelter/syncd_secondary_core_2legs_1_mimage_1 in use: not deactivating
May 26 10:14:05 taft-02 lvm[17402]: Repair of mirrored LV helter_skelter/syncd_secondary_core_2legs_1 failed.
May 26 10:14:05 taft-02 lvm[17402]: Failed to remove faulty devices in helter_skelter-syncd_secondary_core_2legs_1.
May 26 10:14:07 taft-02 lvm[17402]: No longer monitoring mirror device helter_skelter-syncd_secondary_core_2legs_1 for events.

taft-03:

taft-04:
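
The disagreement between nodes (taft-01 reports a successful repair, taft-02 reports a failure) can be pulled out of the attached logs with a small script. This is a rough sketch only: the regular expression is based on the message wording in the excerpt above and may not match other lvm2 versions, and the function name repair_outcomes is hypothetical.

```python
import re

# Syslog line format as seen in the excerpt above:
#   "<Mon> <day> <time> <node> lvm[<pid>]: Repair of mirrored LV <vg/lv> <result>."
REPAIR = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<node>\S+)\s+lvm\[\d+\]:\s+"
    r"Repair of mirrored LV (?P<lv>\S+) (?P<result>finished successfully|failed)\.$"
)

def repair_outcomes(log_lines):
    """Map (node, lv) -> True if the repair succeeded, False if it failed."""
    outcomes = {}
    for line in log_lines:
        match = REPAIR.match(line.strip())
        if match:
            key = (match.group("node"), match.group("lv"))
            outcomes[key] = match.group("result") != "failed"
    return outcomes

if __name__ == "__main__":
    # Lines copied from the excerpt in this comment.
    lines = [
        "May 26 15:15:39 taft-01 lvm[19586]: Repair of mirrored LV helter_skelter/syncd_secondary_core_2legs_1 finished successfully.",
        "May 26 10:14:05 taft-02 lvm[17402]: Repair of mirrored LV helter_skelter/syncd_secondary_core_2legs_1 failed.",
    ]
    for (node, lv), ok in sorted(repair_outcomes(lines).items()):
        print(node, lv, "OK" if ok else "FAILED")
```

A mixed result for the same LV across nodes, as seen here, is the pattern to look for in the full logs.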

Comment 2 Corey Marthaler 2010-05-26 15:37:25 UTC
Created attachment 416882 [details]
log from taft-01

Comment 3 Corey Marthaler 2010-05-26 15:37:55 UTC
Created attachment 416883 [details]
log from taft-02

Comment 4 Corey Marthaler 2010-05-26 15:38:37 UTC
Created attachment 416884 [details]
log from taft-03

Comment 5 Corey Marthaler 2010-05-26 15:39:09 UTC
Created attachment 416885 [details]
log from taft-04

Comment 6 Corey Marthaler 2010-05-26 16:51:49 UTC
This appears to be a cluster mirror issue only. Local machine mirrors "work": there are other issues, such as bug 596367, but the basic functionality is there.

Comment 8 Jonathan Earl Brassow 2010-06-11 12:59:42 UTC
Corey, please try again without udev running - we think udev is getting in the way. Once we know whose fault this is, we can proceed with a fix.

Comment 9 Corey Marthaler 2010-06-17 19:03:55 UTC
cmirror creation doesn't appear to work without udev running, so I'm not sure how to tell if udev is the problem here.

Comment 10 Corey Marthaler 2010-06-18 18:56:05 UTC
I tried this same simple fault injection case with the latest patched build and saw the exact same results, both without killing udev before the failure and with killing udev before the failure. Not sure where to go from here...

2.6.32-25.el6.x86_64

lvm2-2.02.67-1.6.el6    BUILT: Thu Jun 17 10:54:32 CDT 2010
lvm2-libs-2.02.67-1.6.el6    BUILT: Thu Jun 17 10:54:32 CDT 2010
lvm2-cluster-2.02.67-1.6.el6    BUILT: Thu Jun 17 10:54:32 CDT 2010
device-mapper-1.02.49-1.6.el6    BUILT: Thu Jun 17 10:54:32 CDT 2010
device-mapper-libs-1.02.49-1.6.el6    BUILT: Thu Jun 17 10:54:32 CDT 2010
device-mapper-event-1.02.49-1.6.el6    BUILT: Thu Jun 17 10:54:32 CDT 2010
device-mapper-event-libs-1.02.49-1.6.el6    BUILT: Thu Jun 17 10:54:32 CDT 2010
cmirror-2.02.67-1.6.el6    BUILT: Thu Jun 17 10:54:32 CDT 2010

Comment 11 Corey Marthaler 2010-06-28 21:26:34 UTC
FYI - if I run this testcase w/o any I/O load (the only I/O being a dd in order to force the repair) then cmirror device failure works.

Comment 12 Petr Rockai 2010-08-09 15:14:41 UTC
I *think* this is the same problem as bug 596453 and friends (from looking at the logs, although to confirm this I would need to have more of the logs). Jon, if you disagree please flip this back to ASSIGNED.

Comment 14 Corey Marthaler 2010-08-13 18:20:57 UTC
There is now a basic level of device failure functionality wrt cmirrors in the latest build. Other less basic device failure bugs still exist however. 

Marking this bug verified.

2.6.32-59.1.el6.x86_64

lvm2-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
lvm2-libs-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
lvm2-cluster-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
udev-147-2.22.el6    BUILT: Fri Jul 23 07:21:33 CDT 2010
device-mapper-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-libs-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-event-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-event-libs-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
cmirror-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010

Comment 15 releng-rhel@redhat.com 2010-11-10 21:07:58 UTC
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.