Bug 430590

Summary: RHEL5 cmirror tracker: failed leg recovery is broken
Product: Red Hat Enterprise Linux 5
Component: cmirror
Version: 5.2
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Status: CLOSED WORKSFORME
Reporter: Corey Marthaler <cmarthal>
Assignee: Jonathan Earl Brassow <jbrassow>
QA Contact: Cluster QE <mspqa-list>
CC: agk, ccaulfie, dwysocha, heinzm, mbroz
Target Milestone: rc
Keywords: TestBlocker
Doc Type: Bug Fix
Last Closed: 2008-07-15 16:32:36 UTC
Bug Blocks: 430797

Description Corey Marthaler 2008-01-28 22:00:35 UTC
Description of problem:
I had I/O running to a gfs filesystem on top of a cmirror and then failed
/dev/sdd (the primary leg). All I/O and lvm operations deadlocked and the
console filled with the familiar clogd/device-mapper errors.
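For reference, the stack under test corresponds roughly to the following setup
(a sketch only; the devices and sizes come from the lvs output below, but the
cluster name, journal count, and mount point are assumptions):

[root@grant-01 ~]# lvcreate -m 1 -L 25G -n mirror grant /dev/sdd1 /dev/sdb1 /dev/sdb3
[root@grant-01 ~]# gfs_mkfs -p lock_dlm -t GRANT:mirrorfs -j 3 /dev/grant/mirror
[root@grant-01 ~]# mount -t gfs /dev/grant/mirror /mnt/mirror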

[root@grant-01 ~]# lvs -a -o +devices
  LV                VG         Attr   LSize  Origin Snap% Move Log         Copy%  Convert Devices
  LogVol00          VolGroup00 -wi-ao 72.44G                                              /dev/sda2(0)
  LogVol01          VolGroup00 -wi-ao  1.94G                                              /dev/sda2(2318)
  mirror            grant      Mwi-ao 25.00G                   mirror_mlog 100.00         mirror_mimage_0(0),mirror_mimage_1(0)
  [mirror_mimage_0] grant      iwi-ao 25.00G                                              /dev/sdd1(0)
  [mirror_mimage_1] grant      iwi-ao 25.00G                                              /dev/sdb1(0)
  [mirror_mlog]     grant      lwi-ao  4.00M                                              /dev/sdb3(0)

# FAILED /dev/sdd
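(The report doesn't say how the leg was failed; a common way to simulate this
kind of failure on SCSI devices, assumed here purely for illustration, is to
offline the device on every node:

[root@grant-01 ~]# echo offline > /sys/block/sdd/device/state)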
   
[root@grant-01 ~]# lvs -a -o +devices
  /dev/mapper/grant-mirror_mimage_0: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdd1: read failed after 0 of 2048 at 0: Input/output error
[deadlock]

[...]
Jan 28 15:35:55 grant-02 kernel: device-mapper: dm-log-clustered: Timed out waiting for cluster log server [DM_CLOG_GET_SYNC_COUNT]
Jan 28 15:35:55 grant-02 kernel: device-mapper: dm-log-clustered: Retrying request [DM_CLOG_GET_SYNC_COUNT]
Jan 28 15:35:55 grant-02 kernel: device-mapper: dm-log-clustered: [3813][0][0][0][0][0] [97]
Jan 28 15:35:55 grant-02 clogd[4669]: Unable to send request to cluster: Invalid exchange
Jan 28 15:35:55 grant-02 clogd[4669]: Bad callback on local/4
Jan 28 15:36:00 grant-02 kernel: device-mapper: dm-log-clustered: Timed out waiting for cluster log server [DM_CLOG_GET_SYNC_COUNT]
[...]


Version-Release number of selected component (if applicable):
2.6.18-71.el5
lvm2-2.02.31-1.el5
lvm2-cluster-2.02.31-1.el5
cmirror-1.1.7-1.el5
kmod-cmirror-0.1.4-1.el5

Comment 3 Corey Marthaler 2008-02-01 21:44:00 UTC
Although this operation didn't fail every time I attempted it, I was still able
to reproduce it with the latest packages.

kmod-cmirror-0.1.5-2.el5
cmirror-1.1.8-1.el5



Comment 4 Corey Marthaler 2008-02-04 21:55:20 UTC
Here's the output from the failed lvconvert after the device was disabled and
then re-enabled.

[root@taft-04 ~]# lvs -a -o +devices
  LV                               VG             Attr   LSize   Origin Snap% Move Log                        Copy%  Convert Devices
  LogVol00                         VolGroup00     -wi-ao  66.19G                                                             /dev/sda2(0)
  LogVol01                         VolGroup00     -wi-ao   1.94G                                                             /dev/sda2(2118)
  syncd_primary_2legs_1            helter_skelter mwi-ao 800.00M                  syncd_primary_2legs_1_mlog 100.00          syncd_primary_2legs_1_mimage_0(0),syncd_primary_2legs_1_mimage_1(0)
  [syncd_primary_2legs_0_mimage_0] helter_skelter iwi-ao 800.00M                                                             /dev/sdd1(0)
  [syncd_primary_2legs_1_mimage_1] helter_skelter iwi-ao 800.00M                                                             /dev/sdf1(0)
  [syncd_primary_2legs_1_mlog]     helter_skelter lwi-ao   4.00M                                                             /dev/sdc1(0)



[...]
Disabling device sdd on taft-01
Disabling device sdd on taft-02
Disabling device sdd on taft-03
Disabling device sdd on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-03
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.080716 seconds, 520 MB/s
Verifying the down conversion of the failed mirror(s)
  /dev/sdd1: open failed: No such device or address
Verifying FAILED device /dev/sdd1 is *NOT* in the volume(s)
  /dev/sdd1: open failed: No such device or address
Verifying LEG device /dev/sdb1 *IS* in the volume(s)
  /dev/sdd1: open failed: No such device or address
Verifying LEG device /dev/sdc1 *IS* in the volume(s)
  /dev/sdd1: open failed: No such device or address
Verifying LOG device /dev/sde1 *IS* in the mirror(s)
  /dev/sdd1: open failed: No such device or address

Verifying files (checkit) on mirror(s) on...

Enabling device sdd on taft-01
Enabling device sdd on taft-02
Enabling device sdd on taft-03
Enabling device sdd on taft-04

Recreating PV /dev/sdd1
  WARNING: Volume group helter_skelter is not consistent
  WARNING: Volume Group helter_skelter is not consistent
  WARNING: Volume group helter_skelter is not consistent
Extending the recreated PV back into VG helter_skelter
Since we can't yet up convert existing mirrors, down converting to linear(s)
on taft-03 before re-converting back to original mirror(s)
Up converting linear(s) back to mirror(s) on taft-03...
taft-03: lvconvert -m 2 helter_skelter/syncd_primary_3legs_1 /dev/sdd1:0-1000
/dev/sdb1:0-1000 /dev/sdc1:0-1000 /dev/sde1:0-150
  Error locking on node taft-04: Command timed out
  Problem reactivating syncd_primary_3legs_1
  Error locking on node taft-04: Command timed out
couldn't up convert mirror syncd_primary_3legs_1 on taft-03
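For reference, the recovery sequence the harness describes above corresponds
roughly to the following LVM commands (a sketch; only the final lvconvert is
quoted verbatim in the log, the preceding steps are assumed):

taft-03: pvcreate /dev/sdd1
taft-03: vgextend helter_skelter /dev/sdd1
taft-03: lvconvert -m 0 helter_skelter/syncd_primary_3legs_1
taft-03: lvconvert -m 2 helter_skelter/syncd_primary_3legs_1 /dev/sdd1:0-1000 /dev/sdb1:0-1000 /dev/sdc1:0-1000 /dev/sde1:0-150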


Comment 6 Jonathan Earl Brassow 2008-02-15 16:08:38 UTC
The last portion of the above is interesting:
"taft-03: lvconvert -m 2 helter_skelter/syncd_primary_3legs_1 /dev/sdd1:0-1000
/dev/sdb1:0-1000 /dev/sdc1:0-1000 /dev/sde1:0-150
  Error locking on node taft-04: Command timed out
  Problem reactivating syncd_primary_3legs_1
  Error locking on node taft-04: Command timed out
couldn't up convert mirror syncd_primary_3legs_1 on taft-03"

Why did it time out?  Are there "Timed out waiting for cluster log server"
messages in the log, or did clvmd simply time out because of all the disk traffic?
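(One way to check, assuming the default RHEL 5 syslog destination, is to grep
each node's messages file around the time of the failed lvconvert:

[root@taft-04 ~]# grep "Timed out waiting for cluster log server" /var/log/messages
[root@taft-04 ~]# grep clvmd /var/log/messages | tail)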

Comment 9 Corey Marthaler 2008-07-15 16:32:36 UTC
This bz doesn't appear to have been reproduced in the past 5 months, so closing.
Will reopen if seen again.