Bug 767002 - cmirror create deadlock - 'clogd: cpg_initialize failed: Cannot join cluster'
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: lvm2-cluster
Version: 5.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Jonathan Earl Brassow
QA Contact: Cluster QE
URL:
Whiteboard:
Duplicates: 782156
Depends On:
Blocks: 782156 807971 928849
Reported: 2011-12-12 23:43 UTC by Corey Marthaler
Modified: 2013-04-10 20:18 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-04-10 20:18:03 UTC
Target Upstream Version:



Description Corey Marthaler 2011-12-12 23:43:32 UTC
Description of problem:
Attempting to create many cluster mirrors (cmirror) eventually results in a deadlock.


SCENARIO - [many_mirrors]
Recreating VG and PVs to increase metadata size
  Writing physical volume data to disk "/dev/sdd1"
  Writing physical volume data to disk "/dev/sdd2"
  Writing physical volume data to disk "/dev/sde1"
  Writing physical volume data to disk "/dev/sde2"
  Writing physical volume data to disk "/dev/sdf1"
  Writing physical volume data to disk "/dev/sdf2"
  Writing physical volume data to disk "/dev/sdg1"
  Writing physical volume data to disk "/dev/sdg2"
  Writing physical volume data to disk "/dev/sdh1"
  Writing physical volume data to disk "/dev/sdh2"
Making 200 mirrors...
1 taft-04: lvcreate -m 1 -n 200_1 -L 25M --nosync mirror_sanity
  WARNING: New mirror won't be synchronised. Don't read what you didn't write!
2 taft-04: lvcreate -m 1 -n 200_2 -L 25M --nosync mirror_sanity
  WARNING: New mirror won't be synchronised. Don't read what you didn't write!

[...]

129 taft-02: lvcreate -m 1 -n 200_129 -L 25M --nosync mirror_sanity
  WARNING: New mirror won't be synchronised. Don't read what you didn't write!
130 taft-03: lvcreate -m 1 -n 200_130 -L 25M --nosync mirror_sanity
  WARNING: New mirror won't be synchronised. Don't read what you didn't write!

[DEADLOCK]
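
The command sequence in the scenario above can be sketched as a small generator. This is a minimal sketch, not the actual test harness: the function name, the node list, and the round-robin node selection are assumptions (the real test appears to pick nodes in a different order). Only the lvcreate arguments are taken from the output above.

```python
from itertools import cycle


def mirror_create_commands(count, prefix="200", vg="mirror_sanity",
                           nodes=("taft-01", "taft-02", "taft-03", "taft-04")):
    """Yield (node, command) pairs for creating `count` small mirrors.

    Round-robin node selection is an assumption made for illustration;
    the lvcreate arguments match those shown in the scenario output.
    """
    node_iter = cycle(nodes)
    for i in range(1, count + 1):
        cmd = f"lvcreate -m 1 -n {prefix}_{i} -L 25M --nosync {vg}"
        yield next(node_iter), cmd


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for n, (node, cmd) in enumerate(mirror_create_commands(5), start=1):
        print(f"{n} {node}: {cmd}")
```

Running the commands for real would additionally need remote execution on each node, which is left out here on purpose.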

Dec 12 13:38:38 taft-01 qarshd[18232]: Running cmdline: lvcreate -m 1 -n 500_122 -L 25M --nosync mirror_sanity
Dec 12 13:38:43 taft-01 clogd[6351]: cpg_initialize failed:  Cannot join cluster
Dec 12 13:38:43 taft-01 clogd[6351]: clog_resume:  Failed to create cluster CPG
Dec 12 13:38:43 taft-01 lvm[6597]: Monitoring mirror device mirror_sanity-500_122 for events.
Dec 12 13:38:48 taft-01 clogd[6351]: cpg_initialize failed:  Cannot join cluster
Dec 12 13:38:48 taft-01 clogd[6351]: clog_resume:  Failed to create cluster CPG
Dec 12 13:38:48 taft-01 lvm[6597]: Monitoring mirror device mirror_sanity-500_123 for events.
Dec 12 13:38:48 taft-01 qarshd[18367]: Running cmdline: lvcreate -m 1 -n 500_124 -L 25M --nosync mirror_sanity
Dec 12 13:38:53 taft-01 clogd[6351]: cpg_initialize failed:  Cannot join cluster
Dec 12 13:38:53 taft-01 clogd[6351]: clog_resume:  Failed to create cluster CPG
Dec 12 13:38:53 taft-01 lvm[6597]: Monitoring mirror device mirror_sanity-500_124 for events.
Dec 12 13:38:53 taft-01 qarshd[18435]: Running cmdline: lvcreate -m 1 -n 500_125 -L 25M --nosync mirror_sanity
Dec 12 13:38:59 taft-01 clogd[6351]: cpg_initialize failed:  Cannot join cluster
Dec 12 13:38:59 taft-01 clogd[6351]: clog_resume:  Failed to create cluster CPG
Dec 12 13:38:59 taft-01 lvm[6597]: Monitoring mirror device mirror_sanity-500_125 for events.
Dec 12 13:39:04 taft-01 clogd[6351]: cpg_initialize failed:  Cannot join cluster
Dec 12 13:39:04 taft-01 clogd[6351]: clog_resume:  Failed to create cluster CPG
Dec 12 13:39:04 taft-01 lvm[6597]: Monitoring mirror device mirror_sanity-500_126 for events.
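
For triage, log excerpts like the one above can be summarized with a short script. This is a sketch under assumptions: the regular expressions are written against the syslog lines shown in this report only, and the names `LINE_RE`, `MONITOR_RE`, and `summarize` are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   Dec 12 13:38:43 taft-01 clogd[6351]: cpg_initialize failed:  Cannot join cluster
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+"
    r"(?P<proc>[\w-]+)\[(?P<pid>\d+)\]:\s+(?P<msg>.*)$"
)
MONITOR_RE = re.compile(r"Monitoring mirror device (\S+) for events")


def summarize(lines):
    """Return (cpg_initialize failures per host, monitored mirror devices)."""
    failures = Counter()
    devices = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines without a "proc[pid]:" prefix
        if m["proc"] == "clogd" and "cpg_initialize failed" in m["msg"]:
            failures[m["host"]] += 1
        mon = MONITOR_RE.search(m["msg"])
        if m["proc"] == "lvm" and mon:
            devices.append(mon.group(1))
    return failures, devices


if __name__ == "__main__":
    import sys
    failures, devices = summarize(sys.stdin)
    for host, n in failures.items():
        print(f"{host}: {n} cpg_initialize failures")
    print(f"{len(devices)} mirror devices registered for monitoring")
```

It can be fed a log on standard input, for example from /var/log/messages.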


Version-Release number of selected component (if applicable):
2.6.18-274.el5

lvm2-2.02.88-5.el5    BUILT: Fri Dec  2 12:25:45 CST 2011
lvm2-cluster-2.02.88-5.el5    BUILT: Fri Dec  2 12:48:37 CST 2011
device-mapper-1.02.67-2.el5    BUILT: Mon Oct 17 08:31:56 CDT 2011
device-mapper-event-1.02.67-2.el5    BUILT: Mon Oct 17 08:31:56 CDT 2011
cmirror-1.1.39-14.el5    BUILT: Wed Nov  2 17:25:33 CDT 2011
kmod-cmirror-0.1.22-3.el5    BUILT: Tue Dec 22 13:39:47 CST 2009

Comment 1 RHEL Program Management 2012-01-09 14:51:23 UTC
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.8, and Red Hat does not plan to fix this issue in the currently developed update.

Contact your manager or support representative in case you need to escalate this bug.

Comment 2 Nenad Peric 2012-01-16 17:00:05 UTC
*** Bug 782156 has been marked as a duplicate of this bug. ***

Comment 3 Nenad Peric 2012-01-16 17:09:35 UTC
Encountered the same issue while testing mirrors in a cluster.

reserved_memory was set to 32768.
The error first appears around the 220th mirror.

Test output at the point of failure:

[lvm_cluster_mirror] [lvm_cluster_mirror_sanity] 230 a1: lvcreate -m 1 -n 500_230 -L 25M --nosync mirror_sanity
[lvm_cluster_mirror] [lvm_cluster_mirror_sanity]   WARNING: New mirror won't be synchronised. Don't read what you didn't write!

Errors in /var/log/messages:

(08:25:09) [root@a1:/var/log]$ tail /var/log/messages
Jan 16 08:26:23 a1 kernel: clogd(26041): unaligned access to 0x600000000001160c, ip=0x4000000000005ef0
Jan 16 08:26:23 a1 kernel: clogd(26041): unaligned access to 0x6000000000011614, ip=0x4000000000005f10
Jan 16 08:26:23 a1 clogd[26041]: cpg_mcast_joined error: SA_AIS_ERR_BAD_HANDLE
Jan 16 08:26:28 a1 last message repeated 36 times
Jan 16 08:26:28 a1 kernel: kernel unaligned access to 0xe0000001f1a60394, ip=0xa00000020371e4d0
Jan 16 08:26:28 a1 kernel: kernel unaligned access to 0xe0000001f1a6043c, ip=0xa00000020371e560
Jan 16 08:26:28 a1 kernel: clogd(26041): unaligned access to 0x600000000001160c, ip=0x400000000006f9f0
Jan 16 08:26:28 a1 kernel: clogd(26041): unaligned access to 0x600000000001160c, ip=0x4000000000005ef0
Jan 16 08:26:28 a1 kernel: clogd(26041): unaligned access to 0x6000000000011614, ip=0x4000000000005f10
Jan 16 08:26:28 a1 clogd[26041]: cpg_mcast_joined error: SA_AIS_ERR_BAD_HANDLE



The operation can be unblocked by running vgs or vgscan on the active node.
That command can itself get stuck; in that case, running the same command on
another node unblocks both. Mirror creation then continues for another 10 to
15 mirrors before it deadlocks again.
As far as I can tell, this cycle can be repeated indefinitely.


The errors in /var/log/messages at that point are:


Jan 16 11:04:40 a1 kernel: kernel unaligned access to 0xe0000001f3490714, ip=0xa000000202f424d0
Jan 16 11:04:40 a1 kernel: kernel unaligned access to 0xe0000001f34907bc, ip=0xa000000202f42560
Jan 16 11:04:40 a1 kernel: clogd(4059): unaligned access to 0x600000000001160c, ip=0x400000000006f9f0
Jan 16 11:04:40 a1 kernel: clogd(4059): unaligned access to 0x60000000000116b4, ip=0x4000000000057f60
Jan 16 11:04:40 a1 kernel: clogd(4059): unaligned access to 0x60000000000116b4, ip=0x4000000000058260
Jan 16 11:04:40 a1 clogd[4059]: cpg_initialize failed:  Cannot join cluster
Jan 16 11:04:40 a1 clogd[4059]: clog_resume:  Failed to create cluster CPG
Jan 16 11:04:40 a1 lvm[4524]: Monitoring mirror device mirror_sanity-500_322 for events

Comment 5 RHEL Program Management 2012-04-02 10:47:39 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 6 Jonathan Earl Brassow 2012-04-10 16:12:01 UTC
There simply may need to be limits placed on the number of cluster mirrors that are allowed.  It doesn't look like checkpointing/CPG can handle the load of all the mirrors.
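
One way such a limit could be checked from the administrative side is sketched below. This is only an illustration, not how lvm2 or cmirror implements anything: `count_mirrors`, `may_create_mirror`, and `max_mirrors` are hypothetical names, and no specific limit value is implied by this report. The input is assumed to be the text printed by `lvs --noheadings -o lv_name,segtype <vg>`.

```python
def count_mirrors(lvs_output):
    """Count distinct LVs with segment type 'mirror'.

    `lvs_output` is assumed to hold two whitespace-separated columns per
    row: LV name and segment type. An LV with several segments may be
    listed more than once, so names are de-duplicated.
    """
    names = set()
    for line in lvs_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1] == "mirror":
            names.add(fields[0])
    return len(names)


def may_create_mirror(lvs_output, max_mirrors):
    """Return True if one more mirror stays within the hypothetical limit."""
    return count_mirrors(lvs_output) < max_mirrors
```

A wrapper script could call `may_create_mirror` before each lvcreate and refuse to proceed once the chosen limit is reached.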

