Bug 518665 - RHEL5 cmirror tracker: need to handle case where cluster is stopped and restarted with cmirror still running
Summary: RHEL5 cmirror tracker: need to handle case where cluster is stopped and restarted with cmirror still running
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cmirror
Version: 5.4
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Jonathan Earl Brassow
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-08-21 15:01 UTC by Corey Marthaler
Modified: 2010-03-30 09:05 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-03-30 09:05:18 UTC




Links
System ID: Red Hat Product Errata RHBA-2010:0307
Priority: normal
Status: SHIPPED_LIVE
Summary: cmirror bug fix update
Last Updated: 2010-03-29 14:35:04 UTC

Description Corey Marthaler 2009-08-21 15:01:28 UTC
Description of problem:
If you have a valid cluster with a valid cluster mirror, then delete the mirror, stop clvmd as well as cman (but leave cmirror running), and then restart cman and clvmd, cmirror will not be able to handle the next cluster mirror create attempt.
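
For reference, a minimal sequence that sets up this state (a sketch only, assuming the stock RHEL5 cman, clvmd, and cmirror init scripts; the VG/LV names are placeholders from this report, and the stop/start would be done across the cluster nodes):

# Tear down the existing cluster mirror and stop the cluster stack,
# but deliberately leave the cmirror service (clogd) running.
lvremove -f revolution_9/mirror_1
service clvmd stop
service cman stop
# note: no "service cmirror stop" here

# Restart the cluster stack underneath the still-running clogd.
service cman start
service clvmd start

# The next cluster mirror create then fails as shown below.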

lvcreate -m 2 -n mirror_1 -L 3G revolution_9  /dev/sde1:0-10000 /dev/sdf1:0-10000 /dev/sdg1:0-10000 /dev/sdh1:0-150
  Error locking on node taft-04-bond: Command timed out
  Error locking on node taft-02-bond: Command timed out
  Error locking on node taft-01-bond: Command timed out
  Aborting. Failed to activate new LV to wipe the start of it.
  Error locking on node taft-04-bond: Command timed out
  Error locking on node taft-02-bond: Command timed out
  Error locking on node taft-01-bond: Command timed out
  Unable to deactivate failed new LV. Manual intervention required.


Aug 21 09:49:09 taft-03 clogd[7005]: [3cSnHZpb] Failed to open checkpoint for 2: SA_AIS_ERR_LIBRARY 
Aug 21 09:49:09 taft-03 clogd[7005]: [3cSnHZpb] Failed to export checkpoint for 2 
Aug 21 09:49:09 taft-03 clogd[7005]: [3cSnHZpb] Failed to open checkpoint for 2: SA_AIS_ERR_LIBRARY 
Aug 21 09:49:09 taft-03 clogd[7005]: [3cSnHZpb] Failed to export checkpoint for 2 
Aug 21 09:49:09 taft-03 clogd[7005]: [3cSnHZpb] Failed to open checkpoint for 2: SA_AIS_ERR_LIBRARY 
Aug 21 09:49:09 taft-03 clogd[7005]: [3cSnHZpb] Failed to export checkpoint for 2 


Aug 21 09:59:08 taft-01 kernel: device-mapper: dm-log-clustered: [3cSnHZpb] Request timed out: [DM_CLOG_RESUME/64716] - retrying
Aug 21 09:59:23 taft-01 kernel: device-mapper: dm-log-clustered: [3cSnHZpb] Request timed out: [DM_CLOG_RESUME/64717] - retrying
device-mapper: dm-log-clustered: [3cSnHZpb] Request timed out: [DM_CLOG_RESUME/64718] - retrying


Version-Release number of selected component (if applicable):
2.6.18-160.el5

lvm2-2.02.46-8.el5    BUILT: Thu Jun 18 08:06:12 CDT 2009
lvm2-cluster-2.02.46-8.el5    BUILT: Thu Jun 18 08:05:27 CDT 2009
cmirror-1.1.39-2.el5    BUILT: Mon Jul 27 15:39:05 CDT 2009
kmod-cmirror-0.1.22-1.el5    BUILT: Mon Jul 27 15:28:46 CDT 2009

How reproducible:
Every time

Comment 1 Jonathan Earl Brassow 2009-10-14 15:08:13 UTC
Creative testing.  :)

What were the first error messages?  I'm wondering if there is a way to distinguish between "normal" SA_AIS_ERR_LIBRARY errors that can be retried, and those where the file descriptor has been invalidated by a stop/start of the aisexec daemon.

Comment 2 Corey Marthaler 2009-10-14 20:47:32 UTC
The first error messages on each node, just after the failed create attempt was made from taft-01, are as follows:

[root@taft-01 ~]# lvcreate -n test -L 100M -m 1 taft
  Error locking on node taft-03-bond: Command timed out
  Error locking on node taft-02-bond: Command timed out
  Error locking on node taft-01-bond: Command timed out
  Aborting. Failed to activate new LV to wipe the start of it.
  Error locking on node taft-03-bond: Command timed out
  Error locking on node taft-02-bond: Command timed out
  Error locking on node taft-01-bond: Command timed out
  Unable to deactivate failed new LV. Manual intervention required.


taft-01:
Oct 14 15:40:30 taft-01 syslogd 1.4.1: restart.
Oct 14 15:40:30 taft-01 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Oct 14 15:41:14 taft-01 kernel: device-mapper: dm-log-clustered: [KxDnQk7J] Request timed out: [DM_CLOG_RESUME/596] - retrying
Oct 14 15:41:29 taft-01 kernel: device-mapper: dm-log-clustered: [KxDnQk7J] Request timed out: [DM_CLOG_RESUME/599] - retrying
Oct 14 15:41:44 taft-01 kernel: device-mapper: dm-log-clustered: [KxDnQk7J] Request timed out: [DM_CLOG_RESUME/600] - retrying
Oct 14 15:41:59 taft-01 kernel: device-mapper: dm-log-clustered: [KxDnQk7J] Request timed out: [DM_CLOG_RESUME/601] - retrying
Oct 14 15:42:14 taft-01 kernel: device-mapper: dm-log-clustered: [KxDnQk7J] Request timed out: [DM_CLOG_RESUME/602] - retrying


taft-02:
Oct 14 15:39:56 taft-02 syslogd 1.4.1: restart.
Oct 14 15:39:56 taft-02 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Oct 14 15:40:39 taft-02 kernel: device-mapper: dm-log-clustered: [KxDnQk7J] Request timed out: [DM_CLOG_RESUME/54] - retrying
Oct 14 15:40:39 taft-02 kernel: device-mapper: dm-log-clustered: [KxDnQk7J] Request timed out: [DM_CLOG_GET_SYNC_COUNT/55] - retrying
Oct 14 15:40:54 taft-02 kernel: device-mapper: dm-log-clustered: [KxDnQk7J] Request timed out: [DM_CLOG_RESUME/56] - retrying
Oct 14 15:40:54 taft-02 kernel: device-mapper: dm-log-clustered: [KxDnQk7J] Request timed out: [DM_CLOG_GET_SYNC_COUNT/57] - retrying
Oct 14 15:41:09 taft-02 kernel: device-mapper: dm-log-clustered: [KxDnQk7J] Request timed out: [DM_CLOG_RESUME/58] - retrying
Oct 14 15:41:09 taft-02 kernel: device-mapper: dm-log-clustered: [KxDnQk7J] Request timed out: [DM_CLOG_GET_SYNC_COUNT/59] - retrying
Oct 14 15:41:24 taft-02 kernel: device-mapper: dm-log-clustered: [KxDnQk7J] Request timed out: [DM_CLOG_RESUME/60] - retrying
Oct 14 15:41:24 taft-02 kernel: device-mapper: dm-log-clustered: [KxDnQk7J] Request timed out: [DM_CLOG_GET_SYNC_COUNT/61] - retrying
Oct 14 15:41:39 taft-02 kernel: device-mapper: dm-log-clustered: [KxDnQk7J] Request timed out: [DM_CLOG_RESUME/62] - retrying
Oct 14 15:41:39 taft-02 kernel: device-mapper: dm-log-clustered: [KxDnQk7J] Request timed out: [DM_CLOG_GET_SYNC_COUNT/63] - retrying
Oct 14 15:41:54 taft-02 kernel: device-mapper: dm-log-clustered: [KxDnQk7J] Request timed out: [DM_CLOG_RESUME/64] - retrying


taft-03:
Oct 14 15:40:16 taft-03 syslogd 1.4.1: restart.
Oct 14 15:40:16 taft-03 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Oct 14 15:41:00 taft-03 kernel: device-mapper: dm-log-clustered: [KxDnQk7J] Request timed out: [DM_CLOG_RESUME/654] - retrying
Oct 14 15:41:15 taft-03 kernel: device-mapper: dm-log-clustered: [KxDnQk7J] Request timed out: [DM_CLOG_RESUME/657] - retrying
Oct 14 15:41:30 taft-03 kernel: device-mapper: dm-log-clustered: [KxDnQk7J] Request timed out: [DM_CLOG_RESUME/658] - retrying
Oct 14 15:41:45 taft-03 kernel: device-mapper: dm-log-clustered: [KxDnQk7J] Request timed out: [DM_CLOG_RESUME/659] - retrying
Oct 14 15:42:00 taft-03 kernel: device-mapper: dm-log-clustered: [KxDnQk7J] Request timed out: [DM_CLOG_RESUME/660] - retrying
Oct 14 15:42:15 taft-03 kernel: device-mapper: dm-log-clustered: [KxDnQk7J] Request timed out: [DM_CLOG_RESUME/661] - retrying


taft-04:
Oct 14 15:40:33 taft-04 syslogd 1.4.1: restart.
Oct 14 15:40:33 taft-04 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Oct 14 15:41:01 taft-04 clogd[7075]: [KxDnQk7J] Failed to open checkpoint for 1: SA_AIS_ERR_LIBRARY
Oct 14 15:41:01 taft-04 clogd[7075]: [KxDnQk7J] Failed to export checkpoint for 1
Oct 14 15:41:01 taft-04 clogd[7075]: [KxDnQk7J] Failed to open checkpoint for 1: SA_AIS_ERR_LIBRARY
Oct 14 15:41:01 taft-04 clogd[7075]: [KxDnQk7J] Failed to export checkpoint for 1
Oct 14 15:41:01 taft-04 clogd[7075]: [KxDnQk7J] Failed to open checkpoint for 1: SA_AIS_ERR_LIBRARY
Oct 14 15:41:01 taft-04 clogd[7075]: [KxDnQk7J] Failed to export checkpoint for 1
Oct 14 15:41:01 taft-04 clogd[7075]: [KxDnQk7J] Failed to open checkpoint for 1: SA_AIS_ERR_LIBRARY
Oct 14 15:41:01 taft-04 clogd[7075]: [KxDnQk7J] Failed to export checkpoint for 1
Oct 14 15:41:01 taft-04 clogd[7075]: [KxDnQk7J] Failed to open checkpoint for 1: SA_AIS_ERR_LIBRARY
Oct 14 15:41:01 taft-04 clogd[7075]: [KxDnQk7J] Failed to export checkpoint for 1
Oct 14 15:41:01 taft-04 clogd[7075]: [KxDnQk7J] Failed to open checkpoint for 1: SA_AIS_ERR_LIBRARY
Oct 14 15:41:01 taft-04 clogd[7075]: [KxDnQk7J] Failed to export checkpoint for 1
Oct 14 15:41:01 taft-04 clogd[7075]: [KxDnQk7J] Failed to open checkpoint for 1: SA_AIS_ERR_LIBRARY
Oct 14 15:41:01 taft-04 [7160]: Monitoring mirror device taft-test for events
Oct 14 15:41:01 taft-04 clogd[7075]: [KxDnQk7J] Failed to export checkpoint for 1
Oct 14 15:41:01 taft-04 clogd[7075]: [KxDnQk7J] Failed to open checkpoint for 1: SA_AIS_ERR_LIBRARY
Oct 14 15:41:01 taft-04 clogd[7075]: [KxDnQk7J] Failed to export checkpoint for 1

Comment 3 Jonathan Earl Brassow 2009-12-22 08:23:47 UTC
commit ab4161cbc2c824fc32ae46e78f2342092ccbf406
Author: Jonathan Brassow <jbrassow@redhat.com>
Date:   Tue Dec 22 02:17:38 2009 -0600

    clogd (cmirror): Reinit ckpt handle if openAIS goes away (bz518665)

    It is possible to pull the rug out from under the cluster log
    daemon by stopping the checkpoint service (openAIS) for which
    the log daemon has a handle.

    If the handle is later used, various errors are given.  It doesn't
    hurt to acquire a new handle though, so that is what is done.
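
In outline, the fix treats SA_AIS_ERR_LIBRARY from a checkpoint call as a sign that the saCkpt library handle has gone stale (e.g. because aisexec was restarted underneath the daemon), finalizes it, initializes a fresh handle, and retries the call. A minimal sketch of that pattern against the SA Forum AIS checkpoint API is below; it is an illustration of the approach, not the actual clogd code, and the helper name open_checkpoint_retry() is hypothetical:

#include <saCkpt.h>  /* SA Forum AIS checkpoint API (openais-devel); include path may vary */

static SaCkptHandleT ckpt_handle;
static SaVersionT ckpt_version = { 'B', 1, 1 };

/* Hypothetical helper: open a checkpoint, re-acquiring the library
 * handle once if the old one was invalidated by an aisexec restart. */
static SaAisErrorT open_checkpoint_retry(const SaNameT *name,
                                         SaCkptCheckpointHandleT *h)
{
        SaAisErrorT rv;
        SaTimeT timeout = 5000000000LL;  /* 5 seconds, in nanoseconds */

        rv = saCkptCheckpointOpen(ckpt_handle, name, NULL,
                                  SA_CKPT_CHECKPOINT_READ, timeout, h);
        if (rv != SA_AIS_ERR_LIBRARY)
                return rv;

        /* The old handle points at a checkpoint service instance that no
         * longer exists.  It does no harm to acquire a new handle, so do
         * that and retry the open. */
        saCkptFinalize(ckpt_handle);
        rv = saCkptInitialize(&ckpt_handle, NULL, &ckpt_version);
        if (rv != SA_AIS_OK)
                return rv;

        return saCkptCheckpointOpen(ckpt_handle, name, NULL,
                                    SA_CKPT_CHECKPOINT_READ, timeout, h);
}

This is the behavior verified in comment 6, where clogd logs "Failed to open checkpoint: SA_AIS_ERR_LIBRARY" followed by "Reinitializing checkpoint library handle" instead of looping on the error.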

Comment 6 Corey Marthaler 2010-02-24 19:45:30 UTC
Fix verified in cmirror-1.1.39-5.el5.

Feb 24 13:43:52 taft-01 clogd[7054]: [pjomOT0v] Failed to open checkpoint: SA_AIS_ERR_LIBRARY 
Feb 24 13:43:52 taft-01 clogd[7054]: [pjomOT0v] Reinitializing checkpoint library handle

Comment 8 errata-xmlrpc 2010-03-30 09:05:18 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0307.html

