Description of problem:
Our cmirror device failure tests failed during our 4.7 regression runs due to known issues. That caused us to overlook the fact that cmirror device failure handling appears to have regressed to the point of not working at all. The simplest test case, failing the primary leg of a fully synced cmirror, fails to down-convert to a linear volume. I've now reproduced this quite a few times.

[root@taft-01 ~]# lvs -a -o +devices
  /dev/sde1: read failed after 0 of 2048 at 0: Input/output error
  LV                               VG             Attr   LSize   Origin Snap%  Move Log                        Copy%  Convert Devices
  LogVol00                         VolGroup00     -wi-ao  58.34G                                                              /dev/sda2(0)
  LogVol01                         VolGroup00     -wi-ao   9.75G                                                              /dev/sda2(1867)
  syncd_primary_2legs_1            helter_skelter mwi-ao 800.00M                   syncd_primary_2legs_1_mlog 100.00         syncd_primary_2legs_1_mimage_0(0),syncd_primary_2legs_1_mimage_1(0)
  [syncd_primary_2legs_1_mimage_0] helter_skelter iwi-so 800.00M
  [syncd_primary_2legs_1_mimage_1] helter_skelter iwi-ao 800.00M                                                              /dev/sdh1(0)
  [syncd_primary_2legs_1_mlog]     helter_skelter lwi-ao   4.00M                                                              /dev/sdg1(0)

Version-Release number of selected component (if applicable):
lvm2-2.02.36-1.el4
lvm2-cluster-2.02.36-1.el4
cmirror-1.0.1-1            Build Date: Tue 30 Jan 2007 05:28:02 PM CST
cmirror-kernel-2.6.9-41.3  Build Date: Mon 19 May 2008 02:00:31 PM CDT
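For anyone scripting this check, the failure can be spotted from the first character of the lvs Attr field: "m" (or "M") means the LV is still a mirror, while "-" means it is an ordinary linear or striped LV. The sketch below is only an illustration; lv_layout and check_downconvert are hypothetical helper names, not part of LVM or the test suite.

```shell
#!/bin/sh
# Classify an LV from the first character of its lvs Attr string.
# "m"/"M" = still mirrored; "-" = plain LV (reported here as "linear",
# although a striped LV would show the same character).
lv_layout() {
    case "$1" in
        m*|M*) echo mirrored ;;
        -*)    echo linear ;;
        *)     echo unknown ;;
    esac
}

# check_downconvert VG LV
# Hypothetical helper: succeeds only if the LV no longer reports as a mirror.
# Needs the LVM tools and root; it is defined here but not called.
check_downconvert() {
    attr=$(lvs --noheadings -o lv_attr "$1/$2" 2>/dev/null | tr -d ' ')
    [ "$(lv_layout "$attr")" = "linear" ]
}

# Attr values taken from the listing in this report:
lv_layout "mwi-ao"   # prints: mirrored
lv_layout "-wi-ao"   # prints: linear
```

After a device failure, something like `check_downconvert helter_skelter syncd_primary_2legs_1` would be expected to succeed once the repair has completed; in this report it would keep failing because the LV stays "mwi-ao".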
Single machine mirror device failures work just fine.
What if you kill dmeventd and run 'vgreduce --removemissing <vg>' by hand? That would tell us if the problem is in dmeventd.
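A rough sketch of that manual check, assuming the volume group from the report (helter_skelter). The run and manual_repair helpers and the DRY_RUN switch are made up for illustration; by default the script only prints what it would do, since the real commands need root and modify the VG.

```shell
#!/bin/sh
# Sketch: take dmeventd out of the picture, then repair the VG by hand.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
VG="${VG:-helter_skelter}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

manual_repair() {
    # Stop the monitoring daemon so it cannot attempt the repair itself.
    run killall dmeventd
    # Remove the missing (failed) PV from the volume group by hand.
    run vgreduce --removemissing "$1"
    # Show the resulting layout to see whether the mirror was down-converted.
    run lvs -a -o +devices "$1"
}

manual_repair "$VG"
```

Running it with DRY_RUN=0 on one node after failing the device would show whether the manual vgreduce succeeds where the automatic repair did not.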
first try ok.
second time ok... I think you may be omitting some information on how to reproduce?
Try testing with an increased timeout for clvmd. It seems to work for me with the command timeout set to 600 (instead of 90).
Trying to reduce logging in the cmirror module to cut response time, perhaps bringing it under the clvmd timeout.
It appears that this bug has mysteriously been fixed with the latest rpms:
device-mapper-1.02.25-2.el4
lvm2-cluster-2.02.37-2.el4
lvm2-2.02.37-2.el4

The clvmd locking timeout made no difference when I downgraded; I reproduced this every time regardless. Also, once I upgraded, I could no longer reproduce this, even with the locking timeout set to the default. Marking this verified.
Closing this bug as it has been released in 4.7.