Bug 596318 - Basic cmirror device failure (with I/O running) is broken
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2
Version: 6.0
Hardware: All Linux
Priority: high, Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Jonathan Earl Brassow
QA Contact: Corey Marthaler
Keywords: Regression, TestBlocker
Depends On:
Blocks: 599016
Reported: 2010-05-26 11:28 EDT by Corey Marthaler
Modified: 2010-11-10 16:07 EST
Fixed In Version: lvm2-2.02.72-6.el6
Doc Type: Bug Fix
Last Closed: 2010-11-10 16:07:58 EST

Attachments
log from taft-01 (21.04 KB, text/plain)
2010-05-26 11:37 EDT, Corey Marthaler
log from taft-02 (59.13 KB, text/plain)
2010-05-26 11:37 EDT, Corey Marthaler
log from taft-03 (19.53 KB, text/plain)
2010-05-26 11:38 EDT, Corey Marthaler
log from taft-04 (49.07 KB, text/plain)
2010-05-26 11:39 EDT, Corey Marthaler

Description Corey Marthaler 2010-05-26 11:28:01 EDT
Description of problem:
In the following case, the secondary leg should have been removed. That removal failed and resulted in a corrupted mirror. I'll gather more info, such as whether this also occurs on local LVM.

Scenario: Kill secondary leg of synced core log 2 leg mirror(s)                                                                                         

********* Mirror hash info for this scenario *********
* names:              syncd_secondary_core_2legs_1    
* sync:               1                               
* disklog:            0                               
* failpv(s):          /dev/sdg1                       
* failnode(s):        taft-01 taft-02 taft-03 taft-04 
* leg devices:        /dev/sdc1 /dev/sdg1             
* leg fault policy:   remove                          
* log fault policy:   allocate                        
******************************************************

Creating mirror(s) on taft-04...
taft-04: lvcreate --corelog -m 1 -n syncd_secondary_core_2legs_1 -L 600M helter_skelter /dev/sdc1:0-1000 /dev/sdg1:0-1000

PV=/dev/sdg1
        syncd_secondary_core_2legs_1_mimage_1: 6:
PV=/dev/sdg1                                     
        syncd_secondary_core_2legs_1_mimage_1: 6:

Waiting until all mirrors become fully syncd...
   0/1 mirror(s) are fully synced: ( 43.33% )  
   0/1 mirror(s) are fully synced: ( 79.92% )  
   1/1 mirror(s) are fully synced: ( 100.00% ) 

Creating gfs2 on top of mirror(s) on taft-01...
Mounting mirrored gfs2 filesystems on taft-01...
Mounting mirrored gfs2 filesystems on taft-02...
Mounting mirrored gfs2 filesystems on taft-03...
Mounting mirrored gfs2 filesystems on taft-04...

Writing verification files (checkit) to mirror(s) on...
        ---- taft-01 ----                              
        ---- taft-02 ----                              
        ---- taft-03 ----                              
        ---- taft-04 ----                              

Sleeping 10 seconds to get some outsanding GFS I/O locks before the failure                                 
Verifying files (checkit) on mirror(s) on...                                                                
        ---- taft-01 ----                                                                                   
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Disabling device sdg on taft-01
Disabling device sdg on taft-02
Disabling device sdg on taft-03
Disabling device sdg on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-01
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.361142 s, 116 MB/s
Verifying current sanity of lvm after the failure
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdg1: read failed after 0 of 2048 at 0: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 629080064: Input/output error
  [...]
  Couldn't find device with uuid OL4s9P-QZXm-Gezs-RbjI-9mUu-6mQy-EEYHT0.
Verifying FAILED device /dev/sdg1 is *NOT* in the volume(s)
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdg1: read failed after 0 of 2048 at 0: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 629080064: Input/output error
  [...]
  Couldn't find device with uuid OL4s9P-QZXm-Gezs-RbjI-9mUu-6mQy-EEYHT0.
Verifying LEG device /dev/sdc1 *IS* in the volume(s)
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdg1: read failed after 0 of 2048 at 0: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 629080064: Input/output error
  [...]
  Couldn't find device with uuid OL4s9P-QZXm-Gezs-RbjI-9mUu-6mQy-EEYHT0.
verify the dm devices associated with /dev/sdg1 have been removed as expected
Checking REMOVAL of syncd_secondary_core_2legs_1_mimage_1 on:  taft-01 taft-02 taft-03 taft-04
syncd_secondary_core_2legs_1_mimage_1 on taft-04 should no longer be there


[root@taft-01 ~]# lvs -a -o +devices
  Couldn't find device with uuid OL4s9P-QZXm-Gezs-RbjI-9mUu-6mQy-EEYHT0.
  LV                                    VG             Attr   LSize   Log Copy%  Devices
  syncd_secondary_core_2legs_1          helter_skelter -wi-ao 600.00m            /dev/sdc1(0)
  syncd_secondary_core_2legs_1_mimage_0 helter_skelter vwi-a- 600.00m
  syncd_secondary_core_2legs_1_mimage_1 helter_skelter -wi--- 600.00m            unknown device(0)
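
The check that fails above (the _mimage_1 LV should no longer be listed after the repair) can be approximated with a small filter over the lvs output. This is only a sketch: lv_state is a hypothetical helper, not part of the test harness, and it assumes the LV name appears unbracketed in the first column, as it does in the listing above.

```shell
# Hypothetical helper: reads `lvs -a -o +devices` output on stdin and prints
# "present" if an LV with exactly the given name is listed, else "removed".
# Assumption: the LV name is the first whitespace-separated field and is not
# wrapped in [brackets], as in the listing in this report.
lv_state() {
    awk -v lv="$1" '
        $1 == lv { found = 1 }
        END      { print (found ? "present" : "removed") }
    '
}

# Example use on a cluster node (requires LVM, so left commented out):
#   lvs -a -o +devices helter_skelter | lv_state syncd_secondary_core_2legs_1_mimage_1
```

With the listing above as input, the helper would report the _mimage_1 LV as still present, which is the failure this bug describes.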


Version-Release number of selected component (if applicable):
2.6.32-28.el6bz590851_v1.x86_64

lvm2-2.02.65-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
lvm2-libs-2.02.65-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
lvm2-cluster-2.02.65-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
device-mapper-1.02.48-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
device-mapper-libs-1.02.48-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
device-mapper-event-1.02.48-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
device-mapper-event-libs-1.02.48-1.el6    BUILT: Tue May 18 04:46:06 CDT 2010
cmirror-2.02.65-1.el6    BUILT: Wed May 19 11:19:57 CDT 2010
Comment 1 Corey Marthaler 2010-05-26 11:31:34 EDT
I'll attach the full logs, but here's the bit about the repair:

taft-01:
May 26 15:15:38 taft-01 lvm[19586]: Couldn't find device with uuid OL4s9P-QZXm-Gezs-RbjI-9mUu-6mQy-EEYHT0.
May 26 15:15:39 taft-01 lvm[19586]: Repair of mirrored LV helter_skelter/syncd_secondary_core_2legs_1 finished successfully.

taft-02:
May 26 10:14:05 taft-02 lvm[17402]: Error locking on node taft-04: LV helter_skelter/syncd_secondary_core_2legs_1_mimage_1 in use: not deactivating
May 26 10:14:05 taft-02 lvm[17402]: Repair of mirrored LV helter_skelter/syncd_secondary_core_2legs_1 failed.
May 26 10:14:05 taft-02 lvm[17402]: Failed to remove faulty devices in helter_skelter-syncd_secondary_core_2legs_1.
May 26 10:14:07 taft-02 lvm[17402]: No longer monitoring mirror device helter_skelter-syncd_secondary_core_2legs_1 for events.

taft-03:

taft-04:
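
To pull the same repair outcome out of each node's full log, something like the following may help. It is a sketch only: repair_results is a hypothetical helper, and it assumes the standard syslog layout seen above (month, day, time, host, then the message).

```shell
# Hypothetical helper: reads syslog lines on stdin and prints "<host> <result>"
# for every "Repair of mirrored LV" message, where <result> is "ok" or "failed".
# Assumption: the host name is the fourth whitespace-separated field, as in
# the excerpts above.
repair_results() {
    awk '/Repair of mirrored LV/ {
        result = ($0 ~ /finished successfully\.$/) ? "ok" : "failed"
        print $4, result
    }'
}

# Example use on a node (the log path is an assumption, so left commented out):
#   repair_results < /var/log/messages
```

For the excerpts above this would give "taft-01 ok" and "taft-02 failed", and nothing for taft-03 and taft-04, which logged no repair message.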
Comment 2 Corey Marthaler 2010-05-26 11:37:25 EDT
Created attachment 416882
log from taft-01
Comment 3 Corey Marthaler 2010-05-26 11:37:55 EDT
Created attachment 416883
log from taft-02
Comment 4 Corey Marthaler 2010-05-26 11:38:37 EDT
Created attachment 416884
log from taft-03
Comment 5 Corey Marthaler 2010-05-26 11:39:09 EDT
Created attachment 416885
log from taft-04
Comment 6 Corey Marthaler 2010-05-26 12:51:49 EDT
This appears to be a cluster mirror issue only. Local machine mirrors "work": there are other issues, such as bug 596367, but the basic functionality is there.
Comment 8 Jonathan Earl Brassow 2010-06-11 08:59:42 EDT
Corey, please try again without udev running; we think udev is getting in the way. Once we know whose fault this is, we can proceed with a fix.
Comment 9 Corey Marthaler 2010-06-17 15:03:55 EDT
cmirror creation doesn't appear to work without udev running, so I'm not sure how to tell if udev is the problem here.
Comment 10 Corey Marthaler 2010-06-18 14:56:05 EDT
I tried this same simple fault injection case with the latest patched build and saw the exact same results, both with and without killing udev before the failure. Not sure where to go from here...

2.6.32-25.el6.x86_64

lvm2-2.02.67-1.6.el6    BUILT: Thu Jun 17 10:54:32 CDT 2010
lvm2-libs-2.02.67-1.6.el6    BUILT: Thu Jun 17 10:54:32 CDT 2010
lvm2-cluster-2.02.67-1.6.el6    BUILT: Thu Jun 17 10:54:32 CDT 2010
device-mapper-1.02.49-1.6.el6    BUILT: Thu Jun 17 10:54:32 CDT 2010
device-mapper-libs-1.02.49-1.6.el6    BUILT: Thu Jun 17 10:54:32 CDT 2010
device-mapper-event-1.02.49-1.6.el6    BUILT: Thu Jun 17 10:54:32 CDT 2010
device-mapper-event-libs-1.02.49-1.6.el6    BUILT: Thu Jun 17 10:54:32 CDT 2010
cmirror-2.02.67-1.6.el6    BUILT: Thu Jun 17 10:54:32 CDT 2010
Comment 11 Corey Marthaler 2010-06-28 17:26:34 EDT
FYI - if I run this testcase w/o any I/O load (the only I/O being a dd in order to force the repair) then cmirror device failure works.
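
For reference, the dd summary in the description (10 records, 41943040 bytes) corresponds to ten 4 MiB blocks. A rough sketch of an equivalent write follows; force_mirror_io is a hypothetical name, and both the /dev/zero source and the target path are assumptions, since the harness's actual dd command line is not shown in this report.

```shell
# Hypothetical helper: write 10 records of 4 MiB (10 * 4194304 = 41943040
# bytes, matching the dd summary in the description) to the given target.
# Assumption: /dev/zero is used as the data source; the original dd command
# line is not included in the report.
force_mirror_io() {
    dd if=/dev/zero of="$1" bs=4194304 count=10
}

# Example use against a file on the mirrored filesystem (hypothetical path,
# so left commented out):
#   force_mirror_io /mnt/syncd_secondary_core_2legs_1/ddfile
```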
Comment 12 Petr Rockai 2010-08-09 11:14:41 EDT
I *think* this is the same problem as bug 596453 and friends (from looking at the logs, although to confirm this I would need to have more of the logs). Jon, if you disagree please flip this back to ASSIGNED.
Comment 14 Corey Marthaler 2010-08-13 14:20:57 EDT
There is now a basic level of device failure functionality wrt cmirrors in the latest build. Other, less basic device failure bugs still exist, however.

Marking this bug verified.

2.6.32-59.1.el6.x86_64

lvm2-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
lvm2-libs-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
lvm2-cluster-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
udev-147-2.22.el6    BUILT: Fri Jul 23 07:21:33 CDT 2010
device-mapper-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-libs-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-event-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
device-mapper-event-libs-1.02.53-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
cmirror-2.02.72-7.el6    BUILT: Wed Aug 11 17:12:24 CDT 2010
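
To check whether a node already has a build at or after the Fixed In Version (lvm2-2.02.72-6.el6), a comparison along these lines could be used. It is a sketch: version_at_least is a hypothetical helper, and GNU sort -V only approximates RPM's own version ordering, although it agrees for the builds listed in this bug.

```shell
# Hypothetical helper: succeeds if version string $1 sorts at or after $2
# under GNU `sort -V` (an approximation of RPM version comparison).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example use on an RPM-based node (requires rpm, so left commented out):
#   installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' lvm2)
#   version_at_least "$installed" 2.02.72-6.el6 && echo "fix should be present"
```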
Comment 15 releng-rhel@redhat.com 2010-11-10 16:07:58 EST
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.
