Bug 566799 - clogd: Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cmirror
Version: 5.5
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Assignee: Jonathan Earl Brassow
QA Contact: Cluster QE
Duplicates: 639114
Blocks: 806919
 
Reported: 2010-02-19 18:35 UTC by Nate Straz
Modified: 2018-11-28 20:59 UTC
CC List: 15 users

Fixed In Version: cmirror-1.1.39-14.el5
Doc Type: Bug Fix
Last Closed: 2012-02-21 06:21:41 UTC




Links
Red Hat Product Errata RHBA-2012:0256 (SHIPPED_LIVE): cmirror bug fix update, last updated 2012-02-20 15:06:56 UTC

Description Nate Straz 2010-02-19 18:35:53 UTC
Description of problem:

While running the segmented_pvmove test in mirror_sanity on 8 nodes, I consistently get into a state where one node is spewing the following messages:

Feb 18 23:25:19 west-01 clogd[11502]: [ISeytYih] Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST
Feb 18 23:25:19 west-01 clogd[11502]: [ISeytYih] Reinitializing checkpoint library handle
Feb 18 23:25:19 west-01 clogd[11502]: [ISeytYih] Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST
Feb 18 23:25:19 west-01 clogd[11502]: [ISeytYih] Reinitializing checkpoint library handle

Subsequent LVM commands hang, as does the pvmove itself.

INFO: task lvs:29922 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lvs           D ffff81000100caa0     0 29922  29921                     (NOTLB)
 ffff81020dbb5cb8 0000000000000086 0000000000000400 ffffffff8001c211
 ffff8100da6eced0 0000000000000009 ffff81021fa94860 ffff8101239620c0
 0000135c36174b5d 00000000000156e4 ffff81021fa94a48 0000000200000000
Call Trace:
 [<ffffffff8001c211>] generic_make_request+0x211/0x228
 [<ffffffff8006f1f5>] do_gettimeofday+0x40/0x90
 [<ffffffff800647ea>] io_schedule+0x3f/0x67
 [<ffffffff800f5668>] __blockdev_direct_IO+0x8da/0xa80
 [<ffffffff800e669d>] blkdev_direct_IO+0x32/0x37
 [<ffffffff800e65d5>] blkdev_get_blocks+0x0/0x96
 [<ffffffff8000c514>] __generic_file_aio_read+0xb8/0x198
 [<ffffffff8012b392>] inode_has_perm+0x56/0x63
 [<ffffffff800c773d>] generic_file_read+0xac/0xc5
 [<ffffffff8012b392>] inode_has_perm+0x56/0x63
 [<ffffffff800a19e6>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8004248c>] do_ioctl+0x21/0x6b
 [<ffffffff8012de9b>] selinux_file_permission+0x9f/0xb6
 [<ffffffff8000b6b0>] vfs_read+0xcb/0x171
 [<ffffffff80011c01>] sys_read+0x45/0x6e
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0
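
For context: clogd distributes mirror-log state between nodes through the OpenAIS checkpoint service (the SA Forum AIS Checkpoint API). SA_AIS_ERR_NOT_EXIST is the status saCkptCheckpointOpen() returns when the named checkpoint does not exist and the open flags do not include SA_CKPT_CHECKPOINT_CREATE. A minimal sketch of such an open call (the checkpoint name is made up for illustration; this is not clogd's actual code):

#include <stdio.h>
#include <saAis.h>
#include <saCkpt.h>

int main(void)
{
    SaVersionT version = { 'B', 1, 1 };          /* AIS B.01.01 */
    SaCkptHandleT ckpt_handle;
    SaCkptCheckpointHandleT h;
    SaNameT name;
    SaAisErrorT rv;

    rv = saCkptInitialize(&ckpt_handle, NULL, &version);
    if (rv != SA_AIS_OK)
        return 1;

    /* Hypothetical checkpoint name, for illustration only. */
    name.length = snprintf((char *)name.value, SA_MAX_NAME_LENGTH,
                           "example_checkpoint");

    /* Open read-only, *without* SA_CKPT_CHECKPOINT_CREATE: if no node
     * has exported a checkpoint under this name, the call fails with
     * SA_AIS_ERR_NOT_EXIST -- the status seen in the clogd logs. */
    rv = saCkptCheckpointOpen(ckpt_handle, &name, NULL,
                              SA_CKPT_CHECKPOINT_READ,
                              SA_TIME_END, &h);
    if (rv == SA_AIS_ERR_NOT_EXIST)
        fprintf(stderr, "Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST\n");
    else if (rv == SA_AIS_OK)
        saCkptCheckpointClose(h);

    saCkptFinalize(ckpt_handle);
    return 0;
}

In the failing state above, clogd keeps hitting this error and reinitializing its library handle, which is the message pair repeating in the log.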


Version-Release number of selected component (if applicable):
cmirror-1.1.39-5.el5
kmod-cmirror-0.1.22-3.el5
lvm2-2.02.56-7.el5
lvm2-cluster-2.02.56-7.el5
kernel-2.6.18-187.el5
kernel-2.6.18-189.el5
openais-0.80.6-14.el5
cman-2.0.115-30.el5


How reproducible:
Fairly easily

Steps to Reproduce:
1. On an 8-node cluster, run the segmented_pvmove scenario from the mirror_sanity test suite (create a couple of small segmented LVs, then pvmove them).
2. Watch syslog on the other nodes for the clogd checkpoint messages above.

Actual results:

mirror_sanity output:

SCENARIO - [segmented_pvmove]
Create a couple small segmented lvs and then pvmove them
Running lv_seg on west-07 to create the segmented volumes
west-07: /usr/tests/sts-rhel5.5/lvm2/bin/lv_seg -v mirror_sanity -n segment
Deactivating segment0 mirror
Moving data from pv /dev/sdb1
west-07: pvmove -v /dev/sdb1
    Finding volume group "mirror_sanity"
    Executing: /sbin/modprobe dm-log-clustered
    Archiving volume group "mirror_sanity" metadata (seqno 195).
    Creating logical volume pvmove0
    Moving 781 extents of logical volume mirror_sanity/segment0
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/mirror_sanity" (seqno 196).
    Checking progress every 15 seconds
    Updating volume group metadata
  Error locking on node west-06: Command timed out
  Failed to suspend segment0
Quick verification that pvmove is finished on other nodes as well
  Error locking on node west-06: Command timed out
  Attempt to drop cached metadata failed after reverted update for VG mirror_sanity.
  Error locking on node west-06: Command timed out
  Error locking on node west-06: Command timed out
    Creating volume group backup "/etc/lvm/backup/mirror_sanity" (seqno 197).
  ABORTING: Segment progression failed.
  Error locking on node west-06: Command timed out
  Unable to drop cached metadata for VG mirror_sanity.
  ABORTING: Failed to write new data locations to disk.
  /dev/sdb1: Moved: 16.0%
  /dev/sdb1: Moved: 30.6%
  /dev/sdb1: Moved: 49.8%
  /dev/sdb1: Moved: 63.9%
  /dev/sdb1: Moved: 80.0%


Expected results:

The pvmove completes on all nodes with no clogd checkpoint errors and no hung LVM commands.


Additional info:

Comment 2 Ludek Smid 2010-03-11 12:18:00 UTC
Since it is too late to address this issue in RHEL 5.5, it has been proposed for RHEL 5.6.  Contact your support representative if you need to escalate this issue.

Comment 5 Nate Straz 2010-11-12 17:00:46 UTC
*** Bug 639114 has been marked as a duplicate of this bug. ***

Comment 6 Nate Straz 2010-11-12 19:41:20 UTC
Yes, I'm still hitting this in 5.6.

[root@dash-03 ~]# /usr/tests/sts-rhel5.6/lvm2/bin/lvm_rpms
kernel-2.6.18-229.el5

lvm2-2.02.74-1.el5    BUILT: Fri Oct 15 10:26:21 CDT 2010
lvm2-cluster-2.02.74-2.el5    BUILT: Fri Oct 29 07:48:11 CDT 2010
device-mapper-1.02.55-1.el5    BUILT: Fri Oct 15 06:15:55 CDT 2010
cmirror-1.1.39-10.el5    BUILT: Wed Sep  8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5    BUILT: Tue Dec 22 13:39:47 CST 2009

Nov 12 13:17:19 dash-03 clogd[11372]: [hKkn2Kc7] Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST
Nov 12 13:17:19 dash-03 clogd[11372]: [hKkn2Kc7] Reinitializing checkpoint library handle
Nov 12 13:17:19 dash-03 clogd[11372]: [hKkn2Kc7] Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST
Nov 12 13:17:19 dash-03 clogd[11372]: [hKkn2Kc7] Reinitializing checkpoint library handle

Comment 7 Corey Marthaler 2010-11-18 19:32:39 UTC
Based on comment #6, bumping the priority and marking this as a test blocker.

Comment 13 RHEL Program Management 2011-02-01 16:25:35 UTC
Quality Engineering Management has reviewed and declined this request.  You may
appeal this decision by reopening this request.

Comment 15 Jonathan Earl Brassow 2011-02-09 19:03:51 UTC
Please, can you reproduce this?

I think Nate has the easiest time reproducing this.  Corey was not able to (comment #7).  I am not able to.  I'd like it if Corey could reproduce (because he is at my location), but I'll take any reproduction as long as I can access the machines.

Comment 16 Corey Marthaler 2011-04-28 15:48:07 UTC
Now that we're testing 5.7 again, I hit this issue.

=== Iteration 8 of 10 started on taft-04 at Thu Apr 28 05:02:09 CDT 2011 ===
FREE PVS: /dev/sdc2 /dev/sde2 /dev/sdg1 /dev/sdh2
INUSE PVS:
      pv_shuffle_A=/dev/sdb2 /dev/sdc1 /dev/sde1
      pv_shuffle_B=/dev/sdd1 /dev/sdd2 /dev/sdg2
NOT INUSE PVS:
      pv_shuffle_A=/dev/sdb1 /dev/sdh1
      pv_shuffle_B=/dev/sdf1 /dev/sdf2

Adding /dev/sdh2 to pv_shuffle_A
Adding /dev/sdc2 to pv_shuffle_B

Moving data from /dev/sdc1 to /dev/sdh2 on taft-04
Moving data from /dev/sdd1 to /dev/sdc2 on taft-04
taft-04: pvmove -v /dev/sdc1 /dev/sdh2
taft-04: pvmove -v /dev/sdd1 /dev/sdc2
    Finding volume group "pv_shuffle_A"
    Executing: /sbin/modprobe dm-log-clustered 
    Archiving volume group "pv_shuffle_A" metadata (seqno 41).
    Finding volume group "pv_shuffle_B"
    Executing: /sbin/modprobe dm-log-clustered 

Apr 28 05:02:19 taft-03 clogd[11198]: [TBl5MhMf] Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST
Apr 28 05:02:19 taft-03 clogd[11198]: [TBl5MhMf] Reinitializing checkpoint library handle
Apr 28 05:02:19 taft-03 clogd[11198]: import_checkpoint: 1 checkpoint sections found
Apr 28 05:02:19 taft-03 clogd[11198]: [TBl5MhMf] Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST
Apr 28 05:02:19 taft-03 clogd[11198]: [TBl5MhMf] Reinitializing checkpoint library handle
Apr 28 05:02:19 taft-03 clogd[11198]: [TBl5MhMf] Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST
Apr 28 05:02:19 taft-03 clogd[11198]: [TBl5MhMf] Reinitializing checkpoint library handle

Comment 17 Corey Marthaler 2011-05-19 18:41:23 UTC
Reproduced this again on the latest rpms.

2.6.18-261.el5

lvm2-2.02.84-3.el5    BUILT: Wed Apr 27 03:42:24 CDT 2011
lvm2-cluster-2.02.84-3.el5    BUILT: Wed Apr 27 03:42:43 CDT 2011
device-mapper-1.02.63-3.el5    BUILT: Thu May 19 08:09:22 CDT 2011
device-mapper-event-1.02.63-3.el5    BUILT: Thu May 19 08:09:22 CDT 2011
cmirror-1.1.39-10.el5    BUILT: Wed Sep  8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5    BUILT: Tue Dec 22 13:39:47 CST 2009


=== Iteration 6 of 10 started on taft-03 at Thu May 19 10:27:46 CDT 2011 ===
FREE PVS: /dev/sde1 /dev/sdf2 /dev/sdg1 /dev/sdg2
INUSE PVS:
      pv_shuffle_A=/dev/sdc1 /dev/sdh1 /dev/sdh2
      pv_shuffle_B=/dev/sdb1 /dev/sdd2 /dev/sde2
NOT INUSE PVS:                    
      pv_shuffle_A=/dev/sdb2 /dev/sdd1
      pv_shuffle_B=/dev/sdc2 /dev/sdf1

Adding /dev/sdf2 to pv_shuffle_A
Adding /dev/sdg2 to pv_shuffle_B

Moving data from /dev/sdc1 to /dev/sdf2 on taft-03
Moving data from /dev/sdd2 to /dev/sdg2 on taft-03
taft-03: pvmove -v /dev/sdc1 /dev/sdf2
taft-03: pvmove -v /dev/sdd2 /dev/sdg2
    Finding volume group "pv_shuffle_A"
    Executing: /sbin/modprobe dm-log-clustered 
    Archiving volume group "pv_shuffle_A" metadata (seqno 29).
    Creating logical volume pvmove0
    Moving 299 extents of logical volume pv_shuffle_A/stripe
    Updating volume group metadata
    Finding volume group "pv_shuffle_B"
    Executing: /sbin/modprobe dm-log-clustered 
    Archiving volume group "pv_shuffle_B" metadata (seqno 33).
    Creating logical volume pvmove0
    Moving 896 extents of logical volume pv_shuffle_B/linear
    Moving 299 extents of logical volume pv_shuffle_B/stripe
    Creating volume group backup "/etc/lvm/backup/pv_shuffle_A" (seqno 30).
    Updating volume group metadata
    Checking progress before waiting every 15 seconds
    Creating volume group backup "/etc/lvm/backup/pv_shuffle_B" (seqno 34).
    Checking progress before waiting every 15 seconds


May 19 10:28:00 taft-03 clogd[8898]: [yttZ9ake] Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST
May 19 10:28:00 taft-03 clogd[8898]: [yttZ9ake] Reinitializing checkpoint library handle
May 19 10:28:00 taft-03 clogd[8898]: import_checkpoint: 2 checkpoint sections found
May 19 10:28:00 taft-03 clogd[8898]: [yttZ9ake] Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST
May 19 10:28:00 taft-03 clogd[8898]: [yttZ9ake] Reinitializing checkpoint library handle
May 19 10:28:00 taft-03 clogd[8898]: [yttZ9ake] Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST
May 19 10:28:00 taft-03 clogd[8898]: [yttZ9ake] Reinitializing checkpoint library handle
May 19 10:28:00 taft-03 clogd[8898]: [yttZ9ake] Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST

Comment 18 Jonathan Earl Brassow 2011-06-14 20:24:02 UTC
The fix did not solve the problem (it made things worse); deferring to 5.8.

Comment 21 Jonathan Earl Brassow 2011-11-02 22:15:01 UTC
The current code can get into an infinite loop of retries if it is unable to connect to OpenAIS. It is better in this case to retry a few times and then fail if necessary.
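
A minimal sketch of that bounded-retry shape, assuming the OpenAIS checkpoint API (this illustrates the approach, not the actual cmirror patch; the function name, retry cap, and back-off delay are invented for the example):

#include <unistd.h>
#include <saAis.h>
#include <saCkpt.h>

#define MAX_CKPT_OPEN_RETRIES 5        /* arbitrary cap for the example */

/* Try to open an existing checkpoint, reinitializing the library
 * handle between attempts, but give up after a bounded number of
 * retries instead of looping forever. */
static SaAisErrorT open_checkpoint_bounded(const SaNameT *name,
                                           SaCkptCheckpointHandleT *ckpt,
                                           SaCkptHandleT *handle_out)
{
    SaAisErrorT rv = SA_AIS_ERR_NOT_EXIST;
    int attempt;

    for (attempt = 0; attempt < MAX_CKPT_OPEN_RETRIES; attempt++) {
        SaVersionT version = { 'B', 1, 1 };    /* AIS B.01.01 */
        SaCkptHandleT handle;

        if (saCkptInitialize(&handle, NULL, &version) != SA_AIS_OK)
            continue;                  /* could not reach openais; retry */

        rv = saCkptCheckpointOpen(handle, name, NULL,
                                  SA_CKPT_CHECKPOINT_READ,
                                  SA_TIME_END, ckpt);
        if (rv == SA_AIS_OK) {
            *handle_out = handle;      /* caller closes and finalizes */
            return SA_AIS_OK;
        }

        /* The "Reinitializing checkpoint library handle" step from the
         * logs: drop the handle and try again, a bounded number of times. */
        saCkptFinalize(handle);
        usleep(100000);                /* arbitrary 100 ms back-off */
    }

    return rv;                         /* give up; propagate the error */
}

The important difference from the looping behavior in the logs is the fixed attempt cap: after MAX_CKPT_OPEN_RETRIES failures the error is propagated to the caller instead of being retried forever.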

Comment 23 Corey Marthaler 2011-12-14 23:59:24 UTC
Fix verified in the latest rpms.

2.6.18-300.el5
lvm2-2.02.88-5.el5    BUILT: Fri Dec  2 12:25:45 CST 2011
lvm2-cluster-2.02.88-5.el5    BUILT: Fri Dec  2 12:48:37 CST 2011
device-mapper-1.02.67-2.el5    BUILT: Mon Oct 17 08:31:56 CDT 2011
device-mapper-event-1.02.67-2.el5    BUILT: Mon Oct 17 08:31:56 CDT 2011
cmirror-1.1.39-14.el5    BUILT: Wed Nov  2 17:25:33 CDT 2011
kmod-cmirror-0.1.22-3.el5    BUILT: Tue Dec 22 13:39:47 CST 2009


============================================================
Iteration 20 of 20 started at Wed Dec 14 17:11:07 CST 2011
============================================================
SCENARIO - [segmented_pvmove]
Create a couple small segmented lvs and then pvmove them
Running lv_seg on taft-03 to create the segmented volumes
taft-03: /usr/tests/sts-rhel5.8/lvm2/bin/lv_seg -v mirror_sanity -n segment
Deactivating segment0 mirror
Moving data from pv /dev/sdh1 on taft-03
    Finding volume group "mirror_sanity"
    Executing: /sbin/modprobe dm-log-clustered 
    Archiving volume group "mirror_sanity" metadata (seqno 635).
    Creating logical volume pvmove0
    Moving 182 extents of logical volume mirror_sanity/segment0
    Moving 182 extents of logical volume mirror_sanity/segment1
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/mirror_sanity" (seqno 636).
    Checking progress before waiting every 15 seconds
  /dev/sdh1: Moved: 0.0%
  /dev/sdh1: Moved: 15.9%
  /dev/sdh1: Moved: 31.3%
  /dev/sdh1: Moved: 46.7%
  /dev/sdh1: Moved: 50.0%
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/mirror_sanity" (seqno 637).
  /dev/sdh1: Moved: 65.7%
  /dev/sdh1: Moved: 81.0%
  /dev/sdh1: Moved: 96.4%
  /dev/sdh1: Moved: 100.0%
    Removing temporary pvmove LV
    Writing out final volume group after pvmove
    Creating volume group backup "/etc/lvm/backup/mirror_sanity" (seqno 639).
Quick verification that pvmove is finished on other nodes as well
Deactivating mirror segment0... and removing
Deactivating mirror segment1... and removing

Comment 24 Benjamin Cleyet-Marrel 2012-01-03 17:34:10 UTC
Hello, 

It seems we are facing the exact same problem. 

pvmove hangs the cluster and some nodes start logging an infinite loop of:

...
Reinitializing checkpoint library handle
Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST
Reinitializing checkpoint library handle
...

The patch does not seem to be available in the Fedora git or anywhere else:

http://git.fedorahosted.org/git/?p=cluster.git;a=history;f=cmirror;h=5f0d6fe0c90c8c2dfc52e174dd7c5d8e73344791;hb=refs/heads/RHEL58

Nor can we find the RPMs for cmirror-1.1.39-14.el5.

Where can we get the patch? We could only find the -13 build.

Also, do you know of any "clean" way to get out of this situation, since it causes the cluster to hang?


Thanks a lot.

Comment 25 errata-xmlrpc 2012-02-21 06:21:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0256.html

