Bug 436507 - RHEL5 cmirror tracker: failed checkpoint issues can cause volume activation hang
Summary: RHEL5 cmirror tracker: failed checkpoint issues can cause volume activation hang
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cmirror
Version: 5.2
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Jonathan Earl Brassow
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-03-07 17:08 UTC by Corey Marthaler
Modified: 2010-01-12 02:07 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 21:25:39 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHEA-2009:0158 (SHIPPED_LIVE): new package: cmirror, last updated 2009-01-20 16:05:16 UTC

Description Corey Marthaler 2008-03-07 17:08:38 UTC
Description of problem:
I was running a test case where I extended a full mirror, and that somehow
left my mirror bad or corrupted. So now, whenever I attempt to start the
clvmd service (init script), which is what a user will do, the VG activation
attempt hangs. If a user only knows to use the init scripts, then getting out
of this situation becomes painful. Obviously I can clear this out by either
force-deleting the PVs or by starting clvmd by hand so that the volume
activation isn't attempted. I'll try to figure out how I got into this
situation in the first place, but shouldn't clvmd eventually give up and
return instead of deadlocking on startup?

  full_extend            mirror_sanity Mwi---  8.00M                   
full_extend_mlog                full_extend_mimage_0(0),full_extend_mimage_1(0)
  [full_extend_mimage_0] mirror_sanity Iwi---  8.00M                           
                        /dev/sdb1(0)                 
  [full_extend_mimage_1] mirror_sanity Iwi---  8.00M                           
                        /dev/sdc1(0)                 
  [full_extend_mlog]     mirror_sanity lwi---  4.00M                           
                        /dev/sdh1(0)                 


[root@taft-04 ~]# vgchange -ay mirror_sanity
   Error locking on node taft-04: Command timed out
  Error locking on node taft-03: Command timed out
  Error locking on node taft-01: Command timed out
  Error locking on node taft-04: Command timed out
  Error locking on node taft-03: Command timed out
  Error locking on node taft-01: Command timed out
  Error locking on node taft-04: Command timed out
  Error locking on node taft-03: Command timed out
  Error locking on node taft-01: Command timed out
  Error locking on node taft-04: Command timed out
  Error locking on node taft-03: Command timed out
  Error locking on node taft-01: Command timed out
  1 logical volume(s) in volume group "mirror_sanity" now active
  Error locking on node taft-04: Command timed out
  Error locking on node taft-03: Command timed out
  Error locking on node taft-01: Command timed out
  [...]

[...]
Mar  7 10:51:57 taft-02 clogd[6689]: [R5GGAijX] Failed to open checkpoint for 3
Mar  7 10:51:57 taft-02 clogd[6689]: Failed to export checkpoint
Mar  7 10:51:57 taft-02 clogd[6689]: [R5GGAijX] Failed to open checkpoint for 3
Mar  7 10:51:57 taft-02 clogd[6689]: Failed to export checkpoint
Mar  7 10:51:57 taft-02 clogd[6689]: [R5GGAijX] Failed to open checkpoint for 3
Mar  7 10:51:57 taft-02 clogd[6689]: Failed to export checkpoint
Mar  7 10:51:57 taft-02 clogd[6689]: [R5GGAijX] Failed to open checkpoint for 3
Mar  7 10:51:57 taft-02 clogd[6689]: Failed to export checkpoint
Mar  7 10:51:57 taft-02 clogd[6689]: [R5GGAijX] Failed to open checkpoint for 3
[...]
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_RESYNC_WORK:82
- retrying
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_SYNC_COUNT:83
- retrying
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_RESYNC_WORK:84
- retrying
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_SYNC_COUNT:85
- retrying
device-mapper: dm-log-clustered: Request timed out on DM_CLOG_GET_RESYNC_WORK:86
- retrying
[...]


Version-Release number of selected component (if applicable):
cmirror-1.1.15-1.el5
kmod-cmirror-0.1.8-1.el5
openais-0.80.3-13.el5

Comment 1 Corey Marthaler 2008-03-07 20:22:08 UTC
This is reproducible:

SCENARIO - [extend_mirror_into_non_contig_space_on_primary_leg]
Create a mirror with non contiguous space on the primary leg only for expansion,
do the expansion, and verify the expansion didn't spoil mirror redundancy

lvcreate -m 1 -n non_contig_prim_expand -L 20M mirror_sanity /dev/sdd1:0-500
/dev/sdb1:0-500 /dev/sdf1:0-50
  Error locking on node taft-02: Command timed out
  Aborting. Failed to activate new LV to wipe the start of it.
  Error locking on node taft-02: Command timed out
couldn't create mirror non_contig_prim_expand

[...]
Mar  7 13:34:53 taft-03 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_SET_REGION_SYNC:158788 - retrying
Mar  7 13:35:08 taft-03 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_SET_REGION_SYNC:158789 - retrying
Mar  7 13:35:23 taft-03 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_SET_REGION_SYNC:158790 - retrying
Mar  7 13:35:23 taft-03 clogd[6706]: kernel_recv:  Preallocated transfer structs
exhausted
Mar  7 13:35:23 taft-03 clogd[6706]: cpg_message_callback:  Preallocated
transfer structs exhausted
Mar  7 13:35:38 taft-03 last message repeated 2 times
Mar  7 13:35:38 taft-03 clogd[6706]: kernel_recv:  Preallocated transfer structs
exhausted
Mar  7 13:35:38 taft-03 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_SET_REGION_SYNC:158791 - retrying
Mar  7 13:35:38 taft-03 clogd[6706]: cpg_message_callback:  Preallocated
transfer structs exhausted
Mar  7 13:35:53 taft-03 last message repeated 2 times
Mar  7 13:35:53 taft-03 clogd[6706]: kernel_recv:  Preallocated transfer structs
exhausted
Mar  7 13:35:53 taft-03 kernel: device-mapper: dm-log-clustered: Request timed
out on DM_CLOG_SET_REGION_SYNC:158792 - retrying
Mar  7 13:35:53 taft-03 clogd[6706]: cpg_message_callback:  Preallocated
transfer structs exhausted
Mar  7 13:36:08 taft-03 last message repeated 2 times





Comment 2 Steven Dake 2008-05-18 21:05:52 UTC
I believe this to be a dupe of bug 400941.

*** This bug has been marked as a duplicate of 400941 ***

Comment 3 Corey Marthaler 2008-07-09 15:50:42 UTC
This is still reproducible with openais-0.80.3-17.el5.

SCENARIO - [extend_mirror_into_non_contig_space_on_primary_leg]
Create a mirror with non contiguous space on the primary leg only for expansion,
do the expansion, and verify the expansion didn't spoil mirror redundancy
lvcreate -m 1 -n non_contig_prim_expand -L 20M mirror_sanity
/dev/etherd/e1.1p4:0-500 /dev/etherd/e1.1p3:0-500 /dev/etherd/e1.1p2:0-50
  Error locking on node hayes-03: Command timed out
  Error locking on node hayes-01: Command timed out
  Aborting. Failed to activate new LV to wipe the start of it.
  Error locking on node hayes-03: Command timed out
  Error locking on node hayes-01: Command timed out
couldn't create mirror non_contig_prim_expand


I'm seeing tons of the following messages in the logs:

Jul  9 10:46:39 hayes-01 kernel: device-mapper: dm-log-clustered: [V21KYIHt]
Request timed out: [DM_CLOG_RESUME/34996] - retrying
Jul  9 10:46:39 hayes-01 clogd[3162]: kernel_recv:  Preallocated transfer
structs exhausted
Jul  9 10:46:39 hayes-01 clogd[3162]: cpg_message_callback:  Preallocated
transfer structs exhausted
Jul  9 10:46:53 hayes-01 clogd[3162]: cpg_message_callback:  Preallocated
transfer structs exhausted
Jul  9 10:46:54 hayes-01 clogd[3162]: kernel_recv:  Preallocated transfer
structs exhausted
Jul  9 10:46:54 hayes-01 kernel: device-mapper: dm-log-clustered: [V21KYIHt]
Request timed out: [DM_CLOG_RESUME/34997] - retrying



Comment 5 Jonathan Earl Brassow 2008-07-15 18:39:41 UTC
The messages:
Mar  7 10:51:57 taft-02 clogd[6689]: [R5GGAijX] Failed to open checkpoint for 3
Mar  7 10:51:57 taft-02 clogd[6689]: Failed to export checkpoint

are exactly what I would expect from bug 455453.  The later messages can happen
due to a massive backlog of requests that need processing... something that will
occur because the nodes cannot get their checkpoints.

I'm assuming this is fixed by the following check-in:
commit 6c8d7408095782bb00b5361a7df5973f3dcda183
Author: Jonathan Brassow <jbrassow>
Date:   Tue Jul 15 11:58:26 2008 -0500

    clogd:  Fix for bug 455453: small mirror creation fails

    Was setting the checkpoint attribute 'attr.maxSectionSize'
    with the size of the bitmap.  However, when mirrors are
    really small (<= 30M) other sections may have a larger
    size and need to be considered.
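
For illustration only, here is a minimal C sketch of the sizing logic that
commit message describes, written against the AIS checkpoint API (saCkpt.h)
shipped with openais. It is not the actual clogd source: the function name,
section count, non-bitmap section size, retention time, and section-id size
are all assumptions. The point it shows is that attr.maxSectionSize must
cover the largest of all checkpoint sections, not just the bitmap; on a tiny
mirror the bitmap can be the smallest section, so sizing the limit from the
bitmap alone makes the export fail, matching the "Failed to open checkpoint"
/ "Failed to export checkpoint" messages above.

#include <stdio.h>
#include <string.h>
#include <saCkpt.h>   /* AIS Checkpoint service API header from openais */

/* Assumed size of the largest non-bitmap section (e.g. the recovering-region
 * string); the real value in clogd may differ. */
#define OTHER_SECTION_MAX 64

static SaSizeT max_size(SaSizeT a, SaSizeT b)
{
        return (a > b) ? a : b;
}

/* Open (creating if necessary) the log checkpoint for a mirror whose bitmap
 * sections are 'bitmap_size' bytes each.  Hypothetical helper, not clogd's. */
int open_log_checkpoint(SaCkptHandleT ckpt_handle, const char *log_uuid,
                        SaSizeT bitmap_size,
                        SaCkptCheckpointHandleT *ckpt_out)
{
        SaCkptCheckpointCreationAttributesT attr;
        SaNameT name;
        SaAisErrorT rv;

        memset(&name, 0, sizeof(name));
        name.length = snprintf((char *)name.value, SA_MAX_NAME_LENGTH,
                               "%s", log_uuid);

        attr.creationFlags     = SA_CKPT_WR_ALL_REPLICAS;
        /* Two bitmap sections (clean bits, sync bits) plus one extra section. */
        attr.checkpointSize    = 2 * bitmap_size + OTHER_SECTION_MAX;
        attr.retentionDuration = 10LL * 1000000000LL;  /* 10 s, in ns (assumed) */
        attr.maxSections       = 3;
        /*
         * The fix described above: the largest-section limit must account for
         * every section.  On mirrors <= ~30M the bitmap can be smaller than
         * the other sections, and sizing maxSectionSize from the bitmap alone
         * makes section creation fail, so the checkpoint is never exported to
         * joining nodes.
         */
        attr.maxSectionSize    = max_size(bitmap_size, OTHER_SECTION_MAX);
        attr.maxSectionIdSize  = 32;   /* assumed */

        rv = saCkptCheckpointOpen(ckpt_handle, &name, &attr,
                                  SA_CKPT_CHECKPOINT_CREATE |
                                  SA_CKPT_CHECKPOINT_READ |
                                  SA_CKPT_CHECKPOINT_WRITE,
                                  5LL * 1000000000LL /* 5 s timeout, in ns */,
                                  ckpt_out);
        if (rv != SA_AIS_OK)
                return -1;

        /* saCkptSectionCreate() calls for each section would follow here. */
        return 0;
}

If the export keeps failing this way, joining nodes never receive the log
state, which is consistent with the request backlog and the repeated
DM_CLOG_* retry messages in the kernel logs above.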


Comment 6 Corey Marthaler 2008-08-25 22:11:26 UTC
No longer seeing this issue with the latest code. Marking verified.

2.6.18-98.el5

lvm2-2.02.32-4.el5    BUILT: Fri Apr  4 06:15:19 CDT 2008
lvm2-cluster-2.02.32-4.el5    BUILT: Wed Apr  2 03:56:50 CDT 2008
device-mapper-1.02.24-1.el5    BUILT: Thu Jan 17 16:46:05 CST 2008
cmirror-1.1.22-1.el5    BUILT: Thu Jul 24 15:59:03 CDT 2008

Comment 8 errata-xmlrpc 2009-01-20 21:25:39 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0158.html

