Description of problem:
There appears to be an issue when dealing with multiple cmirrors from different volume groups all experiencing leg device failures at the same time. The down conversion to linear appears to work; however, the up conversion back to a cmirror can time out and lead to future clvmd command hangs, as well as I/O to those cmirrors deadlocking.

I started with 3 mirrors on 3 VGs:

[root@link-08 ~]# lvs -a -o +devices
  LV                  VG         Attr   LSize   Origin Snap%  Move Log           Copy%  Devices
  LogVol00            VolGroup00 -wi-ao  72.38G                                         /dev/hda2(0)
  LogVol01            VolGroup00 -wi-ao   1.94G                                         /dev/hda2(2316)
  cmirror1            feist1     Mwi-ao 500.00M                    cmirror1_mlog 100.00 cmirror1_mimage_0(0),cmirror1_mimage_1(0)
  [cmirror1_mimage_0] feist1     iwi-ao 500.00M                                         /dev/sda1(0)
  [cmirror1_mimage_1] feist1     iwi-ao 500.00M                                         /dev/sdb1(0)
  [cmirror1_mlog]     feist1     lwi-ao   4.00M                                         /dev/sdc1(0)
  cmirror2            feist2     Mwi-ao 500.00M                    cmirror2_mlog 100.00 cmirror2_mimage_0(0),cmirror2_mimage_1(0)
  [cmirror2_mimage_0] feist2     iwi-ao 500.00M                                         /dev/sda2(0)
  [cmirror2_mimage_1] feist2     iwi-ao 500.00M                                         /dev/sdd1(0)
  [cmirror2_mlog]     feist2     lwi-ao   4.00M                                         /dev/sde1(0)
  cmirror3            feist3     Mwi-ao 500.00M                    cmirror3_mlog 100.00 cmirror3_mimage_0(0),cmirror3_mimage_1(0)
  [cmirror3_mimage_0] feist3     iwi-ao 500.00M                                         /dev/sda3(0)
  [cmirror3_mimage_1] feist3     iwi-ao 500.00M                                         /dev/sdf1(0)
  [cmirror3_mlog]     feist3     lwi-ao   4.00M                                         /dev/sdh1(0)

I then failed /dev/sda (the primary leg of each cmirror). The down conversion worked:

[root@link-08 ~]# lvs -a -o +devices
  /dev/sda: read failed after 0 of 4096 at 0: Input/output error
  /dev/sda1: read failed after 0 of 2048 at 0: Input/output error
  /dev/sda2: read failed after 0 of 2048 at 0: Input/output error
  /dev/sda3: read failed after 0 of 2048 at 0: Input/output error
  LV       VG         Attr   LSize   Origin Snap%  Move Log Copy%  Devices
  LogVol00 VolGroup00 -wi-ao  72.38G                               /dev/hda2(0)
  LogVol01 VolGroup00 -wi-ao   1.94G                               /dev/hda2(2316)
  cmirror1 feist1     -wi-ao 500.00M                               /dev/sdb1(0)
  cmirror2 feist2     -wi-ao 500.00M                               /dev/sdd1(0)
  cmirror3 feist3     -wi-ao 500.00M                               /dev/sdf1(0)

I then brought /dev/sda back up, extended the 3 volume groups, and attempted to up convert the three (now linear) volumes back to mirrors:

# WORKED:
[root@link-08 ~]# lvconvert -m 2 --corelog feist1/cmirror1
  Logical volume cmirror1 converted.

# FAILED:
[root@link-08 ~]# lvconvert -m 2 --corelog feist2/cmirror2
  Error locking on node link-02: Command timed out
  Problem reactivating cmirror2

# FAILED:
[root@link-08 ~]# lvconvert -m 2 feist3/cmirror3
  Error locking on node link-02: Command timed out
  Aborting. Failed to activate mirror log. Remove new LVs and retry.
  Failed to create mirror log.

From here I attempted to remove the zombie mirror log created by the failed mirror up conversion:

[root@link-08 ~]# lvremove -f feist3/cmirror3_mlog
[HANG]

An lvs hung as well, as did I/O to those two failed mirrors. From here I then attempted to create another mirror and then failed that disk again. I'll try to reproduce this better and include more information.

Version-Release number of selected component (if applicable):
2.6.9-50.ELsmp
cmirror-kernel-2.6.9-30.0
lvm2-cluster-2.02.21-3.el4
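For completeness: the "extended the 3 volume groups" step above is only summarized. A hypothetical reconstruction of what it would have looked like (VG and partition names taken from the listing above; the exact commands were not captured, and pvcreate may be unnecessary, or may need -ff, if the old labels survived the outage):

  # hypothetical re-extension of the three VGs after /dev/sda came back
  pvcreate /dev/sda1 /dev/sda2 /dev/sda3
  vgextend feist1 /dev/sda1
  vgextend feist2 /dev/sda2
  vgextend feist3 /dev/sda3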
Current state according to dm:

[root@link-02 ~]# dmsetup ls --tree
feist1-cmirror1 (253:5)
 ├─feist1-cmirror1_mimage_2 (253:4)
 │  └─ (8:33)
 ├─feist1-cmirror1_mimage_1 (253:3)
 │  └─ (8:1)
 └─feist1-cmirror1_mimage_0 (253:2)
    └─ (8:17)
feist3-cmirror3 (253:13)
 └─ (8:81)
VolGroup00-LogVol01 (253:1)
 └─ (3:2)
VolGroup00-LogVol00 (253:0)
 └─ (3:2)
feist2-cmirror2 (253:9)
 ├─feist2-cmirror2_mimage_2 (253:8)
 │  └─ (8:65)
 ├─feist2-cmirror2_mimage_1 (253:7)
 │  └─ (8:2)
 └─feist2-cmirror2_mimage_0 (253:6)
    └─ (8:49)

[root@link-02 ~]# dmsetup info
Name:              feist1-cmirror1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      96
Major, minor:      253, 5
Number of targets: 1
UUID: LVM-kfyqMdtVGsEVpN1cvtX3NDTsV2FeSKU85DIlXwM9ob2sZDz2DutZPhEt1LJwGwLA

Name:              feist2-cmirror2_mimage_2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 8
Number of targets: 1
UUID: LVM-b3Ca4y43lc3RIehcmyUBu81nNPs8YPLjkjbHKZ9o4vZjOVwjKIQIq5ChSsHd4ywX

Name:              feist3-cmirror3
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      11
Major, minor:      253, 13
Number of targets: 1
UUID: LVM-sfkiHiFe0zVVWyVd3khoI0zxEh9FpgMcpME61KCpsdYeLPlfZNGOo2zbgMgD8oyB

Name:              VolGroup00-LogVol01
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 1
Number of targets: 1
UUID: LVM-i9Sdf1SXACVF330Sb7WdqECxIr1LdJtDxCyO7AGloRnyFBWnRK03Ir4tIJVh1hdX

Name:              feist1-cmirror1_mimage_2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 4
Number of targets: 1
UUID: LVM-kfyqMdtVGsEVpN1cvtX3NDTsV2FeSKU8GeoQCiJmu8ffFpftl0UZmjbU148URkSK

Name:              feist2-cmirror2_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 7
Number of targets: 1
UUID: LVM-b3Ca4y43lc3RIehcmyUBu81nNPs8YPLjpoXz0hDcRA4im0JrngCd5I7enUcXHjxO

Name:              VolGroup00-LogVol00
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 0
Number of targets: 1
UUID: LVM-i9Sdf1SXACVF330Sb7WdqECxIr1LdJtDiNgxxFoLGSZ82yY0VHP39K4fbTxLg5Xn

Name:              feist1-cmirror1_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 3
Number of targets: 1
UUID: LVM-kfyqMdtVGsEVpN1cvtX3NDTsV2FeSKU8Bp1l0PRAoCGH2Nt4M1oFhIC5lESFi5c8

Name:              feist2-cmirror2_mimage_0
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 6
Number of targets: 1
UUID: LVM-b3Ca4y43lc3RIehcmyUBu81nNPs8YPLjGHn8dzAFodD83lBnnenfpZSI0YLllsoA

Name:              feist2-cmirror2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      114
Major, minor:      253, 9
Number of targets: 1
UUID: LVM-b3Ca4y43lc3RIehcmyUBu81nNPs8YPLj24BUZX1wO88bvfBjAXfRM9RhWMaAOjuT

Name:              feist1-cmirror1_mimage_0
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 2
Number of targets: 1
UUID: LVM-kfyqMdtVGsEVpN1cvtX3NDTsV2FeSKU8376uhV36hkbITljHBGjMmNoHWrTGseDX
Likely similar to bug 236432.
Created attachment 153366 [details]
verbose output from failed convert

Steps to get this (you may find a simpler way):
1) a VG containing a mirror is reduced by one device (sda) due to its failure
2) turn the device back on and do a pvcreate on it
3) remove the VG (and all its volumes)
4) create all new PVs
5) create a new volume group of the same name
6) create a linear device
7) try to up convert (the attachment is a log of this failed step)
Created attachment 153368 [details] corresponding clvmd output when attempting failed lock operation
Created attachment 153369 [details]
dmsetup info after the failed command

Note that the log device (vg1-lv1-mlog) and the mirror device (vg1-lv1) are still suspended... It's likely that the failed lock op was supposed to resume these. The lock op UUID corresponds to the top-level mirror device.
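For anyone chasing this, the stuck state is easy to confirm directly from dmsetup (illustrative commands, using the device names as reported above):

  # both should normally report "State: ACTIVE"; here they stay SUSPENDED
  dmsetup info vg1-lv1 | grep State
  dmsetup info vg1-lv1-mlog | grep State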
at 0x9e == LCK_LV_RESUME
When I reproduce this (or something similar, but the steps I use are the same), the clvmd process is stuck in ioctl(11, DM_TABLE_LOAD...). The sysrq-T output seems to indicate it's waiting for a mempool:

<ffffffff8015ca93>{mempool_resize+317}
<ffffffffa009fe5b>{:dm_mod:resize_pool
<ffffffff8030ad52>{__down_write+134}
<ffffffffa00e49c2>{:dm_mirror:mirror_ctr+118
<ffffffffa009d0bb>{:dm_mod:dm_table_add_target+328}
<ffffffffa009f29c>{:dm_mod:table_load+225}
<ffffffffa009f1bb>{:dm_mod:table_load+
<ffffffffa009fc34>{:dm_mod:ctl_ioctl+600}
I recently checked in a fix for mempools in cluster mirror, but the trace seems to end in dm_mod, so I don't think it's related. The fix was to clean up (mempool_destroy) when mirrors were shut down.
Here are my backtraces when it happens:

Apr 25 09:44:55 neo-04 kernel: clvmd         S C2301E98  2204  3375      1  3402  3363 (NOTLB)
Apr 25 09:44:55 neo-04 kernel: c2301ec4 00000082 0000001e c2301e98 c011cd42 c2396230 00000019 c2024de0
Apr 25 09:44:55 neo-04 kernel: c2396230 00000000 c201d740 c201cde0 00000000 00000000 d69f6800 000f4239
Apr 25 09:44:55 neo-04 kernel: c2396230 f6244730 f624489c 00000002 00000246 00042c80 00042c80 00000000
Apr 25 09:44:55 neo-04 kernel: Call Trace:
Apr 25 09:44:55 neo-04 kernel:  [<c011cd42>] recalc_task_prio+0x128/0x133
Apr 25 09:44:55 neo-04 kernel:  [<c02d4616>] schedule_timeout+0x139/0x154
Apr 25 09:44:55 neo-04 kernel:  [<c012a73a>] process_timeout+0x0/0x5
Apr 25 09:44:55 neo-04 kernel:  [<c016bdbc>] do_select+0x291/0x2c6
Apr 25 09:44:55 neo-04 kernel:  [<c016b9f1>] __pollwait+0x0/0x95
Apr 25 09:44:55 neo-04 kernel:  [<c016c0e4>] sys_select+0x2e0/0x43a
Apr 25 09:44:55 neo-04 kernel:  [<c02799e1>] sock_ioctl+0x2a7/0x2b4
Apr 25 09:44:55 neo-04 kernel:  [<c015a9f0>] sys_close+0x86/0x92
Apr 25 09:44:55 neo-04 kernel:  [<c02d5e97>] syscall_call+0x7/0xb
Apr 25 09:44:55 neo-04 kernel: clvmd         S 00004480  3380  3402      1  3403  3375 (NOTLB)
Apr 25 09:44:55 neo-04 kernel: f61e2f14 00000082 00000031 00004480 c2396230 f706d930 00000019 c2024de0
Apr 25 09:44:55 neo-04 kernel: f706d930 00000000 c201d740 c201cde0 00000000 00000000 d69025c0 000f4239
Apr 25 09:44:55 neo-04 kernel: f706d930 f62713b0 f627151c 00000002 f708f2a0 f708f298 f7049e80 f61e2f58
Apr 25 09:44:55 neo-04 kernel: Call Trace:
Apr 25 09:44:55 neo-04 kernel:  [<f89a2c91>] dlm_read+0xd5/0x3fc [dlm]
Apr 25 09:44:55 neo-04 kernel:  [<c011e851>] __wake_up+0x29/0x3c
Apr 25 09:44:55 neo-04 kernel:  [<c011e7cb>] default_wake_function+0x0/0xc
Apr 25 09:44:55 neo-04 kernel:  [<c0134d87>] wake_futex+0x3a/0x44
Apr 25 09:44:55 neo-04 kernel:  [<c011e7cb>] default_wake_function+0x0/0xc
Apr 25 09:44:55 neo-04 kernel:  [<c015b082>] vfs_read+0xb6/0xe2
Apr 25 09:44:55 neo-04 kernel:  [<c015b297>] sys_read+0x3c/0x62
Apr 25 09:44:55 neo-04 kernel:  [<c02d5e97>] syscall_call+0x7/0xb
Apr 25 09:44:55 neo-04 kernel: clvmd         S 00000003  2372  3403      1  3439  3402 (NOTLB)
Apr 25 09:44:55 neo-04 kernel: f5aa5eb4 00000082 00000006 00000003 f6244730 f6244730 00000019 c2024de0
Apr 25 09:44:55 neo-04 kernel: f6244730 00000000 c201d740 c201cde0 00000000 00000000 d69f6800 000f4239
Apr 25 09:44:55 neo-04 kernel: f6244730 f62708b0 f6270a1c 00000002 c0150845 00000000 7fffffff f5aa5000
Apr 25 09:44:55 neo-04 kernel: Call Trace:
Apr 25 09:44:55 neo-04 kernel:  [<c0150845>] find_extend_vma+0x12/0x4f
Apr 25 09:44:55 neo-04 kernel:  [<c02d4596>] schedule_timeout+0xb9/0x154
Apr 25 09:44:55 neo-04 kernel:  [<c0150845>] find_extend_vma+0x12/0x4f
Apr 25 09:44:55 neo-04 kernel:  [<c01203c5>] add_wait_queue+0x12/0x30
Apr 25 09:44:55 neo-04 kernel:  [<c01352dd>] futex_wait+0x133/0x194
Apr 25 09:44:55 neo-04 kernel:  [<c011e7cb>] default_wake_function+0x0/0xc
Apr 25 09:44:55 neo-04 kernel:  [<c011e7cb>] default_wake_function+0x0/0xc
Apr 25 09:44:55 neo-04 kernel:  [<c0162255>] blkdev_put+0x184/0x18f
Apr 25 09:44:55 neo-04 kernel:  [<c013558e>] do_futex+0x29/0x5a
Apr 25 09:44:55 neo-04 kernel:  [<c01356c0>] sys_futex+0x101/0x10c
Apr 25 09:44:55 neo-04 kernel:  [<c02d5e97>] syscall_call+0x7/0xb
Apr 25 09:44:55 neo-04 kernel:  [<c02d007b>] unix_dgram_recvmsg+0x5d/0x1da
If I do the steps in comment #3 *except* I use a different vgname in step 5, I don't hit the problem.
Created attachment 153525 [details]
clvmd patch to fix return code

I'm having difficulty reproducing this when memory pressure is discounted. However, I did spot a potential problem that could leave locks lying around.
With extreme care I can now reproduce this without putting the machines under memory pressure (using touch rather than find|cpio). However, the traceback I get is always identical to the one I posted in comment #7. If I rename the VGs as Jon did in comment #10, the problem still occurs just the same.
You don't need 3 VGs to do this, and you don't need a filesystem or I/O. Relevant history:

 1045  pvcreate /dev/sd*1
 1046  vgcreate vg /dev/sd*1
 1047  lvcreate -L 1G -n lv vg
 1049  lvconvert -m1 vg/lv                              # wait for sync
 1051  ./off.sh sda                                     # on all machines (see sketch below)
 1052  vgreduce --removemissing vg
 1054  ./on.sh sda                                      # on all machines
 1055  pvcreate /dev/sda1
 1056  vgchange -an vg
 1057  lvremove -ff vg
 1059  vgremove vg
 1060  pvcreate /dev/sd*1
 1061  vgcreate vg /dev/sd*1
 1062  lvcreate -L 1G -n lv vg /dev/sda1
 1063  lvconvert -m1 vg/lv -vvvv /dev/sd[bc]1 >& output  # BAD!
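off.sh and on.sh are not attached to this report; a minimal sketch of what such helpers typically do, assuming the leg is failed by offlining the SCSI device through sysfs (the real scripts may use a different mechanism, e.g. disabling a switch port):

  # off.sh <disk> - hypothetical: mark the disk offline so I/O to it starts failing
  echo offline > /sys/block/$1/device/state

  # on.sh <disk> - hypothetical: bring the disk back
  echo running > /sys/block/$1/device/state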
Killing and restarting clvmd between steps 1062 and 1063 will allow the command to succeed.
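A minimal sketch of that workaround (run on every node in the cluster between steps 1062 and 1063), exactly as described above:

  # kill and restart clvmd on each node
  killall clvmd && clvmd
  # or, if the lvm2-cluster init script is installed:
  service clvmd restart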
I get a problem in _vg_read_by_vgid():

        /* Is corresponding vgname already cached? */
        if ((vginfo = vginfo_from_vgid(vgid)) &&
            vginfo->vgname && *vginfo->vgname) {
                if ((vg = _vg_read(cmd, vginfo->vgname, vgid,
                                   &consistent, precommitted)) &&
                    !strncmp((char *)vg->id.uuid, vgid, ID_LEN)) {
                        if (!consistent) {
                                log_error("Volume group %s metadata is "
                                          "inconsistent", vginfo->vgname);
                                if (!partial_mode())
                                        return NULL;    /* ****> HERE */
                        }
                        return vg;
                }
        }
Continuing from the code in comment #15 and looking into _vg_read():

        /* Find the vgname in the cache */
        /* If it's not there we must do full scan to be completely sure */
        if (!(fmt = fmt_from_vgname(vgname, vgid))) {

The contents of that if statement are not run... I think they should be, especially since the resulting VG is missing a PV after it is done.
It seems that the cache is populated with a VG that is missing one of the PVs.
Another thing that puzzles me is that the mirror appears to be resumed before the log sub-LV... This will certainly cause things to hang: when the mirror resumes, it tries to read the log device, and if the log device is still suspended, that read will obviously hang. That would likely explain your hang in comment #7.
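For illustration only (hypothetical device names, not a recommended recovery procedure), the ordering constraint in plain dmsetup terms is:

  # the log sub-LV must be resumed before the mirror that reads it;
  # resuming in the opposite order leaves the mirror blocked on log I/O
  dmsetup resume vg-lv_mlog    # log device first
  dmsetup resume vg-lv         # then the mirror itself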
You know... I may be chasing a different bug. I can't reproduce this issue when I simply pvcreate/vgextend the failed volumes before doing an up convert. The timeouts in the original comment could be caused by the low timeout in clvmd (which has been raised in the rpm via a command-line option to clvmd). I'm not sure about what Patrick is seeing... Could we all update to the latest rpms and retry? Corey, I'd especially like to hear that you recreated it.
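For reference, a quick way to confirm which builds each node is actually running before retrying (package names taken from the versions quoted in this report; exact package names may differ per kernel flavour):

  # run on every node before retrying
  rpm -q lvm2-cluster cmirror-kernel
  uname -r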
Those steps don't reproduce it for me; it works perfectly every time. Obviously my device names are different, but I only need to substitute sdm for sda.
I too was unable to re-create this issue with the latest build (cmirror-kernel-2.6.9-32.0); this may have been fixed.

# After the failure and re-extension of the vg...
[root@link-08 ~]# lvconvert -m 2 --corelog vg1/mirror1
  Logical volume mirror1 converted.

[root@link-08 ~]# lvconvert -m 2 --corelog vg2/mirror2
  Logical volume mirror2 converted.

[root@link-08 ~]# lvconvert -m 2 vg3/mirror3
  Logical volume mirror3 converted.
Moving to NEEDINFO until it can be recreated again.
Moving defect to 4.6 list to track for the next release if more information is uncovered.
I appear to have reproduced this, but it looks like that may just be because I hit bz 227155 first. For now I'm going to close this bz and make a note in bz 227155 about it possibly being related to this bug.