Description of problem:
On the two-node x86_64 (kool/salem) cluster, I had three 5G cmirrors, each with GFS on top and each running I/O. None of them were fully synced yet.

[root@salem ~]# df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00   72G  1.7G   66G   3% /
/dev/sda1                         99M   19M   76M  20% /boot
none                            1004M     0 1004M   0% /dev/shm
/dev/mapper/vg-mirror1           4.8G   20K  4.8G   1% /mnt/gfs1
/dev/mapper/vg-mirror2           4.8G   20K  4.8G   1% /mnt/gfs2
/dev/mapper/vg-mirror3           4.8G   20K  4.8G   1% /mnt/gfs3

[root@salem ~]# lvs -a -o +devices
  LV                 VG   Attr   LSize Origin Snap%  Move Log          Copy%  Devices
  mirror1            vg   mwi-ao 5.00G                    mirror1_mlog  20.08 mirror1_mimage_0(0),mirror1_mimage_1(0)
  [mirror1_mimage_0] vg   iwi-ao 5.00G                                        /dev/sdc(0)
  [mirror1_mimage_1] vg   iwi-ao 5.00G                                        /dev/sde(0)
  [mirror1_mlog]     vg   lwi-ao 4.00M                                        /dev/sdd(0)
  mirror2            vg   mwi-ao 5.00G                    mirror2_mlog  16.72 mirror2_mimage_0(0),mirror2_mimage_1(0)
  [mirror2_mimage_0] vg   iwi-ao 5.00G                                        /dev/sdc(1280)
  [mirror2_mimage_1] vg   iwi-ao 5.00G                                        /dev/sde(1280)
  [mirror2_mlog]     vg   lwi-ao 4.00M                                        /dev/sdd(1)
  mirror3            vg   mwi-ao 5.00G                    mirror3_mlog  24.45 mirror3_mimage_0(0),mirror3_mimage_1(0)
  [mirror3_mimage_0] vg   iwi-ao 5.00G                                        /dev/sdc(2560)
  [mirror3_mimage_1] vg   iwi-ao 5.00G                                        /dev/sde(2560)
  [mirror3_mlog]     vg   lwi-ao 4.00M                                        /dev/sdd(2)

I then attempted to create two more mirrors, and while the 5th one was being created I failed /dev/sdb and /dev/sdc (which was the primary leg for the original 3 mirrors and, I'm sure, a part of the two new cmirrors). At that point the I/O stopped and I waited for the mirrors to down-convert. According to the log messages, these conversions did appear to take place; however, none of the I/O restarted, and now all the filesystems are inaccessible.

[root@salem ~]# df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00   72G  1.7G   66G   3% /
/dev/sda1                         99M   19M   76M  20% /boot
none                            1004M     0 1004M   0% /dev/shm
df: `/mnt/gfs1': Input/output error
df: `/mnt/gfs2': Input/output error
/dev/mapper/vg-mirror3           4.8G  228K  4.8G   1% /mnt/gfs3

[root@salem ~]# ls -lrt /mnt/gfs3
[HANG]

All my lvs commands hang as well.

Here's what's in the logs:
[...]
Feb 1 12:24:03 kool lvm[4270]: WARNING: Bad device removed from mirror volume, vg/mirror1
Feb 1 12:24:03 kool lvm[4270]: WARNING: Mirror volume, vg/mirror1 converted to linear due to device failure.
[...]
Feb 1 12:24:07 kool lvm[4270]: WARNING: Bad device removed from mirror volume, vg/mirror2
Feb 1 12:24:08 kool lvm[4270]: WARNING: Mirror volume, vg/mirror2 converted to linear due to device failure.
[...]

[...]
Feb 1 12:23:41 salem lvm[4320]: No longer monitoring mirror device vg-mirror1 for events
Feb 1 12:23:45 salem lvm[4320]: No longer monitoring mirror device vg-mirror2 for events
Feb 1 12:23:50 salem lvm[4320]: No longer monitoring mirror device vg-mirror3 for events
Feb 1 12:33:54 salem kernel: device-mapper: A read failure occurred on a mirror device.
Feb 1 12:33:54 salem kernel: device-mapper: Unable to retry read.
[...]

Version-Release number of selected component (if applicable):
[root@salem ~]# uname -ar
Linux salem 2.6.9-43.ELsmp #1 SMP Wed Jan 10 19:57:37 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
[root@salem ~]# rpm -qa | grep lvm2
lvm2-cluster-2.02.20-1.el4
lvm2-cluster-debuginfo-2.02.06-6.0.RHEL4
lvm2-2.02.20-1.el4
[root@salem ~]# rpm -qa | grep cmirror
cmirror-kernel-largesmp-2.6.9-18.6
cmirror-debuginfo-1.0.1-1
cmirror-kernel-debuginfo-2.6.9-18.6
cmirror-kernel-smp-2.6.9-18.7
cmirror-kernel-2.6.9-18.6
cmirror-1.0.1-1
I can hit what also appears to be this bug by pulling the FC cable from one of the machines in the two-node cluster while there is I/O going to the cmirrors, and then plugging it back in later. The cmirror filesystems end up deadlocked, as do any other filesystem stat commands like 'df'. This also caused the sync percents to get messed up.

Before pulling the cable:

[root@salem ~]# lvs -a -o +devices
  LV                  VG    Attr   LSize Origin Snap%  Move Log           Copy%  Devices
  cmirror1            corey Mwi-ao 5.00G                    cmirror1_mlog 100.00 cmirror1_mimage_0(0),cmirror1_mimage_1(0)
  [cmirror1_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(0)
  [cmirror1_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(0)
  [cmirror1_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(0)
  cmirror2            corey Mwi-ao 5.00G                    cmirror2_mlog 100.00 cmirror2_mimage_0(0),cmirror2_mimage_1(0)
  [cmirror2_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(1280)
  [cmirror2_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(1280)
  [cmirror2_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(1)
  cmirror3            corey Mwi-ao 5.00G                    cmirror3_mlog 100.00 cmirror3_mimage_0(0),cmirror3_mimage_1(0)
  [cmirror3_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(2560)
  [cmirror3_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(2560)
  [cmirror3_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(2)

[root@kool ~]# lvs -a -o +devices
  LV                  VG    Attr   LSize Origin Snap%  Move Log           Copy%  Devices
  cmirror1            corey Mwi-ao 5.00G                    cmirror1_mlog 100.00 cmirror1_mimage_0(0),cmirror1_mimage_1(0)
  [cmirror1_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(0)
  [cmirror1_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(0)
  [cmirror1_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(0)
  cmirror2            corey Mwi-ao 5.00G                    cmirror2_mlog 100.00 cmirror2_mimage_0(0),cmirror2_mimage_1(0)
  [cmirror2_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(1280)
  [cmirror2_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(1280)
  [cmirror2_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(1)
  cmirror3            corey Mwi-ao 5.00G                    cmirror3_mlog 100.00 cmirror3_mimage_0(0),cmirror3_mimage_1(0)
  [cmirror3_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(2560)
  [cmirror3_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(2560)
  [cmirror3_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(2)

Here I pulled the cable from salem:

[root@salem ~]# lvs -a -o +devices
  /dev/dm-2: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-4: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-6: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-7: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-8: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-10: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-11: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-12: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-13: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdb: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdc: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdd: read failed after 0 of 4096 at 0: Input/output error
  /dev/sde: read failed after 0 of 4096 at 0: Input/output error
  No volume groups found

Saw the scsi errors on salem and this on kool:
dm-cmirror: Error while listening for server response: -110
dm-cmirror: Error while listening for server response: -110
dm-cmirror: Error while listening for server response: -110
dm-cmirror: Error while listening for server response: -110
dm-cmirror: Error while listening for server response: -110

Then I plugged the cable back in:

[root@salem ~]# lvs -a -o +devices
  /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-13: read failed after 0 of 4096 at 0: Input/output error
  LV                  VG    Attr   LSize Origin Snap%  Move Log           Copy%  Devices
  cmirror1            corey Mwi-ao 5.00G                    cmirror1_mlog   0.00 cmirror1_mimage_0(0),cmirror1_mimage_1(0)
  [cmirror1_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(0)
  [cmirror1_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(0)
  [cmirror1_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(0)
  cmirror2            corey Mwi-ao 5.00G                    cmirror2_mlog 100.00 cmirror2_mimage_0(0),cmirror2_mimage_1(0)
  [cmirror2_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(1280)
  [cmirror2_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(1280)
  [cmirror2_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(1)
  cmirror3            corey Mwi-ao 5.00G                    cmirror3_mlog 100.00 cmirror3_mimage_0(0),cmirror3_mimage_1(0)
  [cmirror3_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(2560)
  [cmirror3_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(2560)
  [cmirror3_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(2)

[root@kool ~]# lvs -a -o +devices
  LV                  VG    Attr   LSize Origin Snap%  Move Log           Copy%  Devices
  cmirror1            corey Mwi-ao 5.00G                    cmirror1_mlog   0.00 cmirror1_mimage_0(0),cmirror1_mimage_1(0)
  [cmirror1_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(0)
  [cmirror1_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(0)
  [cmirror1_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(0)
  cmirror2            corey Mwi-ao 5.00G                    cmirror2_mlog 100.00 cmirror2_mimage_0(0),cmirror2_mimage_1(0)
  [cmirror2_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(1280)
  [cmirror2_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(1280)
  [cmirror2_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(1)
  cmirror3            corey Mwi-ao 5.00G                    cmirror3_mlog 100.00 cmirror3_mimage_0(0),cmirror3_mimage_1(0)
  [cmirror3_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(2560)
  [cmirror3_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(2560)
  [cmirror3_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(2)

[root@kool ~]# lvs -a -o +devices
  LV                  VG    Attr   LSize Origin Snap%  Move Log           Copy%  Devices
  cmirror1            corey Mwi-ao 5.00G                    cmirror1_mlog   0.00 cmirror1_mimage_0(0),cmirror1_mimage_1(0)
  [cmirror1_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(0)
  [cmirror1_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(0)
  [cmirror1_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(0)
  cmirror2            corey Mwi-ao 5.00G                    cmirror2_mlog 100.00 cmirror2_mimage_0(0),cmirror2_mimage_1(0)
  [cmirror2_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(1280)
  [cmirror2_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(1280)
  [cmirror2_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(1)
  cmirror3            corey Mwi-ao 5.00G                    cmirror3_mlog 100.00 cmirror3_mimage_0(0),cmirror3_mimage_1(0)
  [cmirror3_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(2560)
  [cmirror3_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(2560)
  [cmirror3_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(2)

[root@kool ~]# df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00   72G  1.8G   66G   3% /
/dev/sda1                         99M   23M   71M  25% /boot
none                            1004M     0 1004M   0% /dev/shm
[HANG]
I'm still trying to weed through this to find what could be addressed better... Of course you're going to have problems when you fail your primary device before the mirror is in sync. That's not a bug... I would expect file system commands to have problems as well.

The important issues here are:
  dm-cmirror: Error while listening for server response: -110
and:
  are we properly swapping out a mirror for an error target when the primary device goes out on a non-sync'ed mirror?
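
Since the outcome depends on whether the mirrors were in sync when the primary leg was failed, a reproduction attempt could record that first. The following is only a sketch: the function name unsynced_mirrors is made up for illustration, and it assumes lvs is run with --noheadings -o lv_name,copy_percent so that each mirror is reported as a name followed by its Copy% value.

```shell
# Print the names of mirrors whose Copy% is below 100.
# Reads "<lv_name> <copy_percent>" pairs on stdin; lines with no
# copy percentage (for example linear LVs) are skipped.
unsynced_mirrors() {
    awk 'NF >= 2 && $2 + 0 < 100 { print $1 }'
}

# Assumed usage on a cluster node, with the volume group "vg" from the report:
#   lvs --noheadings -o lv_name,copy_percent vg | unsynced_mirrors
```

An empty result would mean every mirror reported 100.00; with the Copy% values from the original report (20.08, 16.72, 24.45) all three mirrors would be listed.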
Please help me reproduce this with the latest cmirror-kernel package (>= 2/21/2007).
Marking modified, as I believe this has been fixed in the process of fixing other bugs.
It sounds like this is another case where the primary leg is being failed before it's synced, so this is a dup of bz 232711 and 233031.

*** This bug has been marked as a duplicate of 232711 ***