Description of problem:

I was running a testcase where I extended a full mirror, and that somehow resulted in my mirror being bad or corrupted. Now whenever I attempt to start the clvmd service (init script), which is what a user will do, the VG activation attempt hangs. If a user only knows to use the init scripts, then getting out of this situation becomes painful. Obviously I can clear this out either by force deleting the PVs or by starting clvmd by hand so that the volume activation isn't attempted. I'll try to figure out how I got into this situation in the first place, but shouldn't clvmd eventually give up and return instead of deadlocking on startup?

full_extend            mirror_sanity Mwi--- 8.00M full_extend_mlog full_extend_mimage_0(0),full_extend_mimage_1(0)
[full_extend_mimage_0] mirror_sanity Iwi--- 8.00M                  /dev/sdb1(0)
[full_extend_mimage_1] mirror_sanity Iwi--- 8.00M                  /dev/sdc1(0)
[full_extend_mlog]     mirror_sanity lwi--- 4.00M                  /dev/sdh1(0)

[root@taft-04 ~]# vgchange -ay mirror_sanity
Error locking on node taft-04: Command timed out
Error locking on node taft-03: Command timed out
Error locking on node taft-01: Command timed out
Error locking on node taft-04: Command timed out
Error locking on node taft-03: Command timed out
Error locking on node taft-01: Command timed out
Error locking on node taft-04: Command timed out
Error locking on node taft-03: Command timed out
Error locking on node taft-01: Command timed out
Error locking on node taft-04: Command timed out
Error locking on node taft-03: Command timed out
Error locking on node taft-01: Command timed out
1 logical volume(s) in volume group "mirror_sanity" now active
Error locking on node taft-04: Command timed out
Error locking on node taft-03: Command timed out
Error locking on node taft-01: Command timed out
[...]

[...]
Mar 7 10:51:57 taft-02 clogd[6689]: [R5GGAijX] Failed to open checkpoint for 3
Mar 7 10:51:57 taft-02 clogd[6689]: Failed to export checkpoint
Mar 7 10:51:57 taft-02 clogd[6689]: [R5GGAijX] Failed to open checkpoint for 3
Mar 7 10:51:57 taft-02 clogd[6689]: Failed to export checkpoint
Mar 7 10:51:57 taft-02 clogd[6689]: [R5GGAijX] Failed to open checkpoint for 3
Mar 7 10:51:57 taft-02 clogd[6689]: Failed to export checkpoint
Mar 7 10:51:57 taft-02 clogd[6689]: [R5GGAijX] Failed to open checkpoint for 3
Mar 7 10:51:57 taft-02 clogd[6689]: Failed to export checkpoint
Mar 7 10:51:57 taft-02 clogd[6689]: [R5GGAijX] Failed to open checkpoint for 3
[...]
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_RESYNC_WORK:82 - retrying
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_SYNC_COUNT:83 - retrying
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_RESYNC_WORK:84 - retrying
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_SYNC_COUNT:85 - retrying
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_RESYNC_WORK:86 - retrying
[...]

Version-Release number of selected component (if applicable):
cmirror-1.1.15-1.el5
kmod-cmirror-0.1.8-1.el5
openais-0.80.3-13.el5

This is reproducible:

SCENARIO - [extend_mirror_into_non_contig_space_on_primary_leg]
Create a mirror with non contiguous space on the primary leg only for expansion, do the expansion, and verify the expansion didn't spoil mirror redundancy

lvcreate -m 1 -n non_contig_prim_expand -L 20M mirror_sanity /dev/sdd1:0-500 /dev/sdb1:0-500 /dev/sdf1:0-50
Error locking on node taft-02: Command timed out
Aborting. Failed to activate new LV to wipe the start of it.
Error locking on node taft-02: Command timed out
couldn't create mirror non_contig_prim_expand

[...]
Mar 7 13:34:53 taft-03 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_SET_REGION_SYNC:158788 - retrying
Mar 7 13:35:08 taft-03 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_SET_REGION_SYNC:158789 - retrying
Mar 7 13:35:23 taft-03 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_SET_REGION_SYNC:158790 - retrying
Mar 7 13:35:23 taft-03 clogd[6706]: kernel_recv: Preallocated transfer structs exhausted
Mar 7 13:35:23 taft-03 clogd[6706]: cpg_message_callback: Preallocated transfer structs exhausted
Mar 7 13:35:38 taft-03 last message repeated 2 times
Mar 7 13:35:38 taft-03 clogd[6706]: kernel_recv: Preallocated transfer structs exhausted
Mar 7 13:35:38 taft-03 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_SET_REGION_SYNC:158791 - retrying
Mar 7 13:35:38 taft-03 clogd[6706]: cpg_message_callback: Preallocated transfer structs exhausted
Mar 7 13:35:53 taft-03 last message repeated 2 times
Mar 7 13:35:53 taft-03 clogd[6706]: kernel_recv: Preallocated transfer structs exhausted
Mar 7 13:35:53 taft-03 kernel: device-mapper: dm-log-clustered: Request timed out on DM_CLOG_SET_REGION_SYNC:158792 - retrying
Mar 7 13:35:53 taft-03 clogd[6706]: cpg_message_callback: Preallocated transfer structs exhausted
Mar 7 13:36:08 taft-03 last message repeated 2 times

I believe this to be a dupe of bugzilla #400941.

*** This bug has been marked as a duplicate of 400941 ***

This is still reproducible with openais-0.80.3-17.el5.

SCENARIO - [extend_mirror_into_non_contig_space_on_primary_leg]
Create a mirror with non contiguous space on the primary leg only for expansion, do the expansion, and verify the expansion didn't spoil mirror redundancy

lvcreate -m 1 -n non_contig_prim_expand -L 20M mirror_sanity /dev/etherd/e1.1p4:0-500 /dev/etherd/e1.1p3:0-500 /dev/etherd/e1.1p2:0-50
Error locking on node hayes-03: Command timed out
Error locking on node hayes-01: Command timed out
Aborting. Failed to activate new LV to wipe the start of it.
Error locking on node hayes-03: Command timed out
Error locking on node hayes-01: Command timed out
couldn't create mirror non_contig_prim_expand

I'm seeing tons of the following messages in the logs:

Jul 9 10:46:39 hayes-01 kernel: device-mapper: dm-log-clustered: [V21KYIHt] Request timed out: [DM_CLOG_RESUME/34996] - retrying
Jul 9 10:46:39 hayes-01 clogd[3162]: kernel_recv: Preallocated transfer structs exhausted
Jul 9 10:46:39 hayes-01 clogd[3162]: cpg_message_callback: Preallocated transfer structs exhausted
Jul 9 10:46:53 hayes-01 clogd[3162]: cpg_message_callback: Preallocated transfer structs exhausted
Jul 9 10:46:54 hayes-01 clogd[3162]: kernel_recv: Preallocated transfer structs exhausted
Jul 9 10:46:54 hayes-01 kernel: device-mapper: dm-log-clustered: [V21KYIHt] Request timed out: [DM_CLOG_RESUME/34997] - retrying

The messages:

Mar 7 10:51:57 taft-02 clogd[6689]: [R5GGAijX] Failed to open checkpoint for 3
Mar 7 10:51:57 taft-02 clogd[6689]: Failed to export checkpoint

are exactly what I would expect from bug 455453.

The later messages can happen due to a massive backlog of requests that need processing... something that will occur because the nodes cannot get their checkpoints.

I'm assuming this is fixed by the following check-in:

commit 6c8d7408095782bb00b5361a7df5973f3dcda183
Author: Jonathan Brassow <jbrassow>
Date: Tue Jul 15 11:58:26 2008 -0500

    clogd: Fix for bug 455453: small mirror creation fails

    Was setting the checkpoint attribute 'attr.maxSectionSize' with
    the size of the bitmap. However, when mirrors are really small
    (<= 30M) other sections may have a larger size and need to considered.
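
For anyone following along, here is a rough Python sketch of the sizing problem the commit message above describes. It is not the clogd code: the section names and byte counts are made up for illustration, and the helper `required_max_section_size` is hypothetical. It only shows why a limit taken from the bitmap alone can be too small once the bitmap shrinks below the size of some other section.

```python
def required_max_section_size(section_sizes):
    """Return the largest section size in bytes across all sections."""
    if not section_sizes:
        raise ValueError("at least one section is required")
    return max(section_sizes.values())


# Hypothetical section sizes for a very small mirror. The names and numbers
# are invented for illustration and are not taken from clogd.
small_mirror_sections = {
    "bitmap": 4,
    "other_section": 64,
}

# Behaviour described as the bug: only the bitmap size was considered.
bitmap_only_limit = small_mirror_sections["bitmap"]

# Behaviour described as the fix: every section is considered.
all_sections_limit = required_max_section_size(small_mirror_sections)

for name, size in small_mirror_sections.items():
    fits = size <= bitmap_only_limit
    print(f"{name}: {size} bytes, fits bitmap-only limit: {fits}")
print(f"limit covering all sections: {all_sections_limit} bytes")
```

With these invented numbers the non-bitmap section does not fit under the bitmap-only limit, which is consistent with the checkpoint failures only showing up on small mirrors.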

No longer seeing this issue with the latest code. Marking verified.

2.6.18-98.el5

lvm2-2.02.32-4.el5           BUILT: Fri Apr 4 06:15:19 CDT 2008
lvm2-cluster-2.02.32-4.el5   BUILT: Wed Apr 2 03:56:50 CDT 2008
device-mapper-1.02.24-1.el5  BUILT: Thu Jan 17 16:46:05 CST 2008
cmirror-1.1.22-1.el5         BUILT: Thu Jul 24 15:59:03 CDT 2008

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0158.html