Bug 606931 - repair for failed local mirror devices is broken
repair for failed local mirror devices is broken
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2 (Show other bugs)
6.0
All Linux
high Severity high
: rc
: ---
Assigned To: Petr Rockai
Corey Marthaler
: Regression, TestBlocker
Depends On: 596453
Blocks: 607316
  Show dependency treegraph
 
Reported: 2010-06-22 14:48 EDT by Corey Marthaler
Modified: 2010-11-10 16:08 EST (History)
10 users (show)

See Also:
Fixed In Version: lvm2-2.02.72-6.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-11-10 16:08:08 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
log from taft-01 (120.15 KB, text/plain)
2010-06-29 12:25 EDT, Corey Marthaler
no flags Details

  None (edit)
Description Corey Marthaler 2010-06-22 14:48:35 EDT
Description of problem:
Scenario: Kill secondary leg of non synced 3 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              nonsyncd_secondary_3legs_1      
* sync:               0                               
* leg devices:        /dev/sdd1 /dev/sde1 /dev/sdb1   
* log devices:        /dev/sdg1                       
* failpv(s):          /dev/sde1                       
* failnode(s):        taft-01                         
* leg fault policy:   remove                          
* log fault policy:   allocate                        
******************************************************

Creating mirror(s) on taft-01...
taft-01: lvcreate -m 2 -n nonsyncd_secondary_3legs_1 -L 600M helter_skelter /dev/sdd1:0-1000 /dev/sde1:0-1000 /dev/sdb1:0-1000 /dev/sdg1:0-150
PV=/dev/sde1
        nonsyncd_secondary_3legs_1_mimage_1: 6
PV=/dev/sde1                                  
        nonsyncd_secondary_3legs_1_mimage_1: 6

Continuing on without fully syncd mirrors, currently at...
        ( 5.00% )                                         

Creating ext on top of mirror(s) on taft-01...
mke2fs 1.41.10 (10-Feb-2009)                  
Mounting mirrored ext filesystems on taft-01...

Writing verification files (checkit) to mirror(s) on...
        ---- taft-01 ----                              

<start name="taft-01_nonsyncd_secondary_3legs_1" pid="31025" time="Tue Jun 22 12:08:24 2010" type="cmd" />
Sleeping 10 seconds to get some outsanding EXT I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
        ---- taft-01 ----

Attempting I/O to cause mirror down conversion(s) on taft-01
10+0 records in                                             
10+0 records out                                            
41943040 bytes (42 MB) copied, 0.229205 s, 183 MB/s         
Verifying current sanity of lvm after the failure           
  /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
  /dev/sde1: read failed after 0 of 2048 at 0: Input/output error
  [...]
  Couldn't find device with uuid SJ9AJw-OJ0f-yR1d-Cq0w-r5gO-sRtJ-lbj3AY.
Verifying FAILED device /dev/sde1 is *NOT* in the volume(s)
  /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
  /dev/sde1: read failed after 0 of 2048 at 0: Input/output error
  [...]
  Couldn't find device with uuid SJ9AJw-OJ0f-yR1d-Cq0w-r5gO-sRtJ-lbj3AY.
Verifying LOG device(s) /dev/sdg1 *ARE* in the mirror(s)
  /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
  /dev/sde1: read failed after 0 of 2048 at 0: Input/output error
  [...]
  Couldn't find device with uuid SJ9AJw-OJ0f-yR1d-Cq0w-r5gO-sRtJ-lbj3AY.
Verifying LEG device /dev/sdd1 *IS* in the volume(s)
  /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
  /dev/sde1: read failed after 0 of 2048 at 0: Input/output error
  [...]
  Couldn't find device with uuid SJ9AJw-OJ0f-yR1d-Cq0w-r5gO-sRtJ-lbj3AY.
Verifying LEG device /dev/sdb1 *IS* in the volume(s)
  /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
  /dev/sde1: read failed after 0 of 2048 at 0: Input/output error
  [...]
  Couldn't find device with uuid SJ9AJw-OJ0f-yR1d-Cq0w-r5gO-sRtJ-lbj3AY.
verify the dm devices associated with /dev/sde1 have been removed as expected
Checking REMOVAL of nonsyncd_secondary_3legs_1_mimage_1 on:  taft-01
nonsyncd_secondary_3legs_1_mimage_1 on taft-01 should no longer be there


from console:
Jun 22 17:09:12 taft-01 kernel: sd 3:0:0:4: rejecting I/O to offline device
Jun 22 17:09:12 taft-01 lvm[3042]: LV helter_skelter/nonsyncd_secondary_3legs_1_mimage_1 in use: not deactivating
Jun 22 17:09:12 taft-01 lvm[3042]: Repair of mirrored LV helter_skelter/nonsyncd_secondary_3legs_1 failed.
Jun 22 17:09:12 taft-01 lvm[3042]: Failed to remove faulty devices in helter_skelter-nonsyncd_secondary_3legs_1.
Jun 22 17:09:12 taft-01 lvm[3042]: helter_skelter-nonsyncd_secondary_3legs_1 is now in-sync.
Jun 22 17:09:12 taft-01 kernel: sd 3:0:0:4: rejecting I/O to offline device

[root@taft-01 ~]# lvs -a -o +devices
  /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
  /dev/sde1: read failed after 0 of 2048 at 0: Input/output error
  [...]
  Couldn't find device with uuid SJ9AJw-OJ0f-yR1d-Cq0w-r5gO-sRtJ-lbj3AY.
  LV                                    VG             Attr   LSize   Log                             Copy%  Devices
  nonsyncd_secondary_3legs_1            helter_skelter mwi-ao 600.00m nonsyncd_secondary_3legs_1_mlog 100.00 nonsyncd_secondary_3legs_1_mimage_0(0),nonsyncd_secondary_3legs_1_mimage_2(0)
  [nonsyncd_secondary_3legs_1_mimage_0] helter_skelter iwi-ao 600.00m                                        /dev/sdd1(0)
  nonsyncd_secondary_3legs_1_mimage_1   helter_skelter -wi-a- 600.00m                                        unknown device(0)
  [nonsyncd_secondary_3legs_1_mimage_2] helter_skelter iwi-ao 600.00m                                        /dev/sdb1(0)
  [nonsyncd_secondary_3legs_1_mlog]     helter_skelter lwi-ao   4.00m                                        /dev/sdg1(0)

[root@taft-01 ~]# dmsetup ls
vg_taft01-lv_home       (253, 2)
helter_skelter-nonsyncd_secondary_3legs_1_mlog  (253, 3)
vg_taft01-lv_swap       (253, 1)
helter_skelter-nonsyncd_secondary_3legs_1_mimage_2      (253, 6)
vg_taft01-lv_root       (253, 0)
helter_skelter-nonsyncd_secondary_3legs_1_mimage_1      (253, 5)
helter_skelter-nonsyncd_secondary_3legs_1_mimage_0      (253, 4)
helter_skelter-nonsyncd_secondary_3legs_1       (253, 7)

Version-Release number of selected component (if applicable):
Version 2.02.67+ (custom build by Brassow)
Comment 1 Corey Marthaler 2010-06-23 12:59:15 EDT
This is happening for even the basic cases. This appears to be a major regression since it was working in lvm2-2.02.66-3.

Scenario: Kill primary leg of synced 2 leg mirror(s)                                                                 

********* Mirror hash info for this scenario *********
* names:              syncd_primary_2legs_1           
* sync:               1                               
* leg devices:        /dev/sdh1 /dev/sdc1             
* log devices:        /dev/sdb1                       
* failpv(s):          /dev/sdh1                       
* failnode(s):        taft-01                         
* leg fault policy:   remove                          
* log fault policy:   allocate                        
******************************************************

Creating mirror(s) on taft-01...
taft-01: lvcreate -m 1 -n syncd_primary_2legs_1 -L 600M helter_skelter /dev/sdh1:0-1000 /dev/sdc1:0-1000 /dev/sdb1:0-150

PV=/dev/sdh1
        syncd_primary_2legs_1_mimage_0: 6
PV=/dev/sdh1                             
        syncd_primary_2legs_1_mimage_0: 6

Waiting until all mirrors become fully syncd...
   0/1 mirror(s) are fully synced: ( 65.08% )  
   1/1 mirror(s) are fully synced: ( 100.00% ) 

Creating ext on top of mirror(s) on taft-01...
mke2fs 1.41.10 (10-Feb-2009)                  
Mounting mirrored ext filesystems on taft-01...

Writing verification files (checkit) to mirror(s) on...
        ---- taft-01 ----                              

<start name="taft-01_syncd_primary_2legs_1" pid="6493" time="Wed Jun 23 11:21:18 2010" type="cmd" />
Sleeping 10 seconds to get some outsanding EXT I/O locks before the failure                         
Verifying files (checkit) on mirror(s) on...                                                        
        ---- taft-01 ----

Disabling device sdh on taft-01

Attempting I/O to cause mirror down conversion(s) on taft-01
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.236162 s, 178 MB/s
Verifying current sanity of lvm after the failure
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-4: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
  [...]
  Couldn't find device with uuid 134Doj-Xnsg-XQj2-BMpV-G0A5-nZap-sNZvpb.
Verifying FAILED device /dev/sdh1 is *NOT* in the volume(s)
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-4: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
  [...]
  Couldn't find device with uuid 134Doj-Xnsg-XQj2-BMpV-G0A5-nZap-sNZvpb.
Verifying LOG device(s) /dev/sdb1 are *NOT* in the linear(s)
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-4: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
  [...]
  Couldn't find device with uuid 134Doj-Xnsg-XQj2-BMpV-G0A5-nZap-sNZvpb.
log device /dev/sdb1 should no longer be present on taft-01


[root@taft-01 mnt]# lvs -a -o +devices
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-4: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
  [...]
  Couldn't find device with uuid 134Doj-Xnsg-XQj2-BMpV-G0A5-nZap-sNZvpb.
  LV                             VG             Attr   LSize   Copy% Devices
  syncd_primary_2legs_1          helter_skelter -wi-ao 600.00m       /dev/sdc1(0)
  syncd_primary_2legs_1_mimage_0 helter_skelter -wi-a- 600.00m       unknown device(0)
  syncd_primary_2legs_1_mimage_1 helter_skelter vwi-a- 600.00m
  syncd_primary_2legs_1_mlog     helter_skelter -wi-a-   4.00m       /dev/sdb1(0)
Comment 2 Petr Rockai 2010-06-28 16:12:41 EDT
Hi, we have the following script in our upstream tests, and I have verified that it is working as expected in CVS.

prepare_vg 5
prepare_dmeventd

which mkfs.ext2 || exit 200

lvcreate -m 3 --ig -L 1 -n 4way $vg
lvchange --monitor y $vg/4way
disable_dev $dev2 $dev4
mkfs.ext2 $DM_DEV_DIR/$vg/4way
sleep 3 # FIXME: need a "poll" utility, akin to "check"
enable_dev $dev2 $dev4
check mirror $vg 4way
check mirror_legs $vg 4way 2

in other words: we create a 4-way mirror, fail 2 of its devices and run mkfs to trigger downconversion, which indeed happens and the resulting mirror is consistent and has 2 legs and a disk log (since we use the default policy of log replacement, but leg removal). If I insert lvs -a right after the sleep, I get:

  LV              VG             Attr   LSize Origin Snap%  Move Log       Copy%  Convert
  4way            LVMTEST11604vg mwi-a- 4.00m                    4way_mlog 100.00        
  [4way_mimage_0] LVMTEST11604vg iwi-ao 4.00m                                            
  [4way_mimage_2] LVMTEST11604vg iwi-ao 4.00m                                            
  [4way_mlog]     LVMTEST11604vg lwi-ao 4.00m                                            

Anyway, to gather more insight into the issue, I would need dmeventd logs: these should contain messages from lvconvert. I think recent upstream versions enable logging of warnings and errors from lvconvert through dmeventd -- this should be sufficient. These should be somewhere in syslog.

Thanks!

PS: I have also tried the simple case of 2-way mirror with a single failed leg. Haven't encountered any problems there either. This is all with current CVS though.
Comment 3 Corey Marthaler 2010-06-29 12:20:58 EDT
This may be a timing issue, because after turning debugging on, this scenario ran 16 times before failing again. However, w/o debugging turned on, it appears to fail every time. I'll attach the log from taft-01 during the latest failure, however I don't really seen any debugging messages in the log.
Comment 4 Corey Marthaler 2010-06-29 12:25:54 EDT
Created attachment 427730 [details]
log from taft-01
Comment 5 Petr Rockai 2010-06-29 14:47:17 EDT
Hm, I see this:

Jun 28 23:18:29 taft-01 lvm[2266]: Repair of mirrored LV helter_skelter/syncd_primary_2legs_1 finished successfully.

and then

Jun 28 23:18:31 taft-01 lvm[2266]: LV helter_skelter/syncd_primary_2legs_2_mimage_0 in use: not deactivating
Jun 28 23:18:31 taft-01 lvm[2266]: Repair of mirrored LV helter_skelter/syncd_primary_2legs_2 failed.
Jun 28 23:18:31 taft-01 lvm[2266]: Failed to remove faulty devices in helter_skelter-syncd_primary_2legs_2.

So something is holding your mimage open. If you say this is timing-dependent, this could mean two things. Either your build is missing Milan's patch to avoid scanning mimages and similar devices, although I find this unlikely, or alternatively something is holding the mirror image open. This could be a kernel bug, or some userspace interference.

In the other outputs you pasted, I don't see any mention of syncd_primary_2legs_2 -- is your latest log from testing parallel failure of two mirrors that share a device? Maybe a log of a simpler case would be available? I don't necessarily need debug: what you have provided above is a good start.

Ta.
Comment 6 Corey Marthaler 2010-06-30 12:35:55 EDT
In comment #4, syncd_primary_2legs_2 was another mirror entirely. Both of these mirrors were laid out so that they had the same devices for legs/logs. In the original report, there was only one mirror. Let me know if you'd like me to turn off the multiple mirror cases and gather info with just one.
Comment 7 Peter Rajnoha 2010-07-02 06:42:30 EDT
Should be fixed in upstream (2.02.70).
Comment 8 Corey Marthaler 2010-07-02 13:37:36 EDT
FYI - This bug still exists in the lastest build.

lvm2-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
lvm2-libs-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
lvm2-cluster-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
udev-147-2.18.el6    BUILT: Fri Jun 11 07:47:21 CDT 2010
device-mapper-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
device-mapper-libs-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
device-mapper-event-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
device-mapper-event-libs-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
cmirror-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
Comment 9 Petr Rockai 2010-07-21 10:51:44 EDT
This might be the same bug as #596453, but let's not get overhappy with marking this as dup. Dependency will do for now. Peter, based on that bug and the log of the failure in this, I suspect this is not actually fixed upstream, either.
Comment 10 Corey Marthaler 2010-08-02 11:54:44 EDT
This bug can be produced with the latest rpms. 

[root@taft-02 ~]# lvs -a -o +devices
  /dev/dm-4: read failed after 0 of 4096 at 629080064: Input/output error
  [...]
  /dev/dm-5: read failed after 0 of 4096 at 629080064: Input/output error
  [...]
  /dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
  [...]
  Couldn't find device with uuid meYcRo-0yqZ-hFYf-8DuQ-FghJ-otCR-GkO4Fd.
  LV                             VG             Attr   LSize   Log Copy% Devices
  syncd_primary_2legs_1          helter_skelter -wi-ao 600.00m           /dev/sdb1(0)
  syncd_primary_2legs_1_mimage_0 helter_skelter -wi-a- 600.00m           unknown device(0)
  syncd_primary_2legs_1_mimage_1 helter_skelter vwi-a- 600.00m
  syncd_primary_2legs_1_mlog     helter_skelter -wi-a-   4.00m           /dev/sdf1(0)
  syncd_primary_2legs_2          helter_skelter -wi-ao 600.00m           /dev/sdb1(150)


2.6.32-52.el6.x86_64

lvm2-2.02.72-3.el6    BUILT: Wed Jul 28 15:39:43 CDT 2010
lvm2-libs-2.02.72-3.el6    BUILT: Wed Jul 28 15:39:43 CDT 2010
lvm2-cluster-2.02.72-3.el6    BUILT: Wed Jul 28 15:39:43 CDT 2010
udev-147-2.21.el6    BUILT: Mon Jul 12 04:55:00 CDT 2010
device-mapper-1.02.53-3.el6    BUILT: Wed Jul 28 15:39:43 CDT 2010
device-mapper-libs-1.02.53-3.el6    BUILT: Wed Jul 28 15:39:43 CDT 2010
device-mapper-event-1.02.53-3.el6    BUILT: Wed Jul 28 15:39:43 CDT 2010
device-mapper-event-libs-1.02.53-3.el6    BUILT: Wed Jul 28 15:39:43 CDT 2010
cmirror-2.02.72-3.el6    BUILT: Wed Jul 28 15:39:43 CDT 2010
Comment 11 Jonathan Earl Brassow 2010-08-02 16:01:28 EDT
I agree with comment #9.  I think this is the same issue as 596453.
Comment 12 Petr Rockai 2010-08-09 11:05:45 EDT
Relevant patch checked in upstream.
Comment 14 Corey Marthaler 2010-08-12 11:07:39 EDT
Helter_skelter in single machine mode ran all night. Marking verified in the following build:

2.6.32-59.1.el6.x86_64

lvm2-2.02.72-6.el6    BUILT: Wed Aug 11 10:18:33 CDT 2010
lvm2-libs-2.02.72-6.el6    BUILT: Wed Aug 11 10:18:33 CDT 2010
lvm2-cluster-2.02.72-6.el6    BUILT: Wed Aug 11 10:18:33 CDT 2010
udev-147-2.22.el6    BUILT: Fri Jul 23 07:21:33 CDT 2010
device-mapper-1.02.53-6.el6    BUILT: Wed Aug 11 10:18:33 CDT 2010
device-mapper-libs-1.02.53-6.el6    BUILT: Wed Aug 11 10:18:33 CDT 2010
device-mapper-event-1.02.53-6.el6    BUILT: Wed Aug 11 10:18:33 CDT 2010
device-mapper-event-libs-1.02.53-6.el6    BUILT: Wed Aug 11 10:18:33 CDT 2010
cmirror-2.02.72-6.el6    BUILT: Wed Aug 11 10:18:33 CDT 2010
Comment 15 releng-rhel@redhat.com 2010-11-10 16:08:08 EST
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.

Note You need to log in before you can comment on or make changes to this bug.