Bug 767002

Summary: cmirror create deadlock - 'clogd: cpg_initialize failed: Cannot join cluster'
Product: Red Hat Enterprise Linux 5
Reporter: Corey Marthaler <cmarthal>
Component: lvm2-cluster
Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED WONTFIX
QA Contact: Cluster QE <mspqa-list>
Severity: high
Docs Contact:
Priority: high
Version: 5.3
CC: agk, ccaulfie, dwysocha, heinzm, jbrassow, lmiksik, nperic, prajnoha, prockai, thornber, zkabelac
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Duplicates: 782156
Environment:
Last Closed: 2013-04-10 20:18:03 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 782156, 807971, 928849

Description Corey Marthaler 2011-12-12 23:43:32 UTC
Description of problem:
Attempting to create many cmirrors eventually results in a deadlock.


SCENARIO - [many_mirrors]
Recreating VG and PVs to increase metadata size
  Writing physical volume data to disk "/dev/sdd1"
  Writing physical volume data to disk "/dev/sdd2"
  Writing physical volume data to disk "/dev/sde1"
  Writing physical volume data to disk "/dev/sde2"
  Writing physical volume data to disk "/dev/sdf1"
  Writing physical volume data to disk "/dev/sdf2"
  Writing physical volume data to disk "/dev/sdg1"
  Writing physical volume data to disk "/dev/sdg2"
  Writing physical volume data to disk "/dev/sdh1"
  Writing physical volume data to disk "/dev/sdh2"
Making 200 mirrors...
1 taft-04: lvcreate -m 1 -n 200_1 -L 25M --nosync mirror_sanity
  WARNING: New mirror won't be synchronised. Don't read what you didn't write!
2 taft-04: lvcreate -m 1 -n 200_2 -L 25M --nosync mirror_sanity
  WARNING: New mirror won't be synchronised. Don't read what you didn't write!

[...]

129 taft-02: lvcreate -m 1 -n 200_129 -L 25M --nosync mirror_sanity
  WARNING: New mirror won't be synchronised. Don't read what you didn't write!
130 taft-03: lvcreate -m 1 -n 200_130 -L 25M --nosync mirror_sanity
  WARNING: New mirror won't be synchronised. Don't read what you didn't write!

[DEADLOCK]

Dec 12 13:38:38 taft-01 qarshd[18232]: Running cmdline: lvcreate -m 1 -n 500_122 -L 25M --nosync mirror_sanity
Dec 12 13:38:43 taft-01 clogd[6351]: cpg_initialize failed:  Cannot join cluster
Dec 12 13:38:43 taft-01 clogd[6351]: clog_resume:  Failed to create cluster CPG
Dec 12 13:38:43 taft-01 lvm[6597]: Monitoring mirror device mirror_sanity-500_122 for events.
Dec 12 13:38:48 taft-01 clogd[6351]: cpg_initialize failed:  Cannot join cluster
Dec 12 13:38:48 taft-01 clogd[6351]: clog_resume:  Failed to create cluster CPG
Dec 12 13:38:48 taft-01 lvm[6597]: Monitoring mirror device mirror_sanity-500_123 for events.
Dec 12 13:38:48 taft-01 qarshd[18367]: Running cmdline: lvcreate -m 1 -n 500_124 -L 25M --nosync mirror_sanity
Dec 12 13:38:53 taft-01 clogd[6351]: cpg_initialize failed:  Cannot join cluster
Dec 12 13:38:53 taft-01 clogd[6351]: clog_resume:  Failed to create cluster CPG
Dec 12 13:38:53 taft-01 lvm[6597]: Monitoring mirror device mirror_sanity-500_124 for events.
Dec 12 13:38:53 taft-01 qarshd[18435]: Running cmdline: lvcreate -m 1 -n 500_125 -L 25M --nosync mirror_sanity
Dec 12 13:38:59 taft-01 clogd[6351]: cpg_initialize failed:  Cannot join cluster
Dec 12 13:38:59 taft-01 clogd[6351]: clog_resume:  Failed to create cluster CPG
Dec 12 13:38:59 taft-01 lvm[6597]: Monitoring mirror device mirror_sanity-500_125 for events.
Dec 12 13:39:04 taft-01 clogd[6351]: cpg_initialize failed:  Cannot join cluster
Dec 12 13:39:04 taft-01 clogd[6351]: clog_resume:  Failed to create cluster CPG
Dec 12 13:39:04 taft-01 lvm[6597]: Monitoring mirror device mirror_sanity-500_126 for events.
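
For triage, the failure signature in the log above can be counted per node. The following is only a sketch: the regular expression assumes the syslog layout shown in this report, and the function name count_cpg_failures is made up for illustration.

```python
import re
from collections import Counter

# Matches syslog lines shaped like the ones in this report, e.g.:
#   Dec 12 13:38:43 taft-01 clogd[6351]: cpg_initialize failed:  Cannot join cluster
# The layout (month, day, time, host, process[pid]) is an assumption
# taken from the excerpt above, not from any syslog specification.
CLOGD_FAILURE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(?P<host>\S+)\s+clogd\[\d+\]:\s+"
    r"cpg_initialize failed:\s+Cannot join cluster"
)


def count_cpg_failures(lines):
    """Return a Counter mapping host name to its number of cpg_initialize failures."""
    counts = Counter()
    for line in lines:
        match = CLOGD_FAILURE.match(line)
        if match:
            counts[match.group("host")] += 1
    return counts


if __name__ == "__main__":
    # Reads /var/log/messages, the file quoted in this report.
    with open("/var/log/messages", errors="replace") as log:
        for host, failures in count_cpg_failures(log).items():
            print(host, failures)
```

A count that grows by one per lvcreate, as in the excerpt above, suggests that every new mirror is failing to join its CPG rather than only an occasional one.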


Version-Release number of selected component (if applicable):
2.6.18-274.el5

lvm2-2.02.88-5.el5    BUILT: Fri Dec  2 12:25:45 CST 2011
lvm2-cluster-2.02.88-5.el5    BUILT: Fri Dec  2 12:48:37 CST 2011
device-mapper-1.02.67-2.el5    BUILT: Mon Oct 17 08:31:56 CDT 2011
device-mapper-event-1.02.67-2.el5    BUILT: Mon Oct 17 08:31:56 CDT 2011
cmirror-1.1.39-14.el5    BUILT: Wed Nov  2 17:25:33 CDT 2011
kmod-cmirror-0.1.22-3.el5    BUILT: Tue Dec 22 13:39:47 CST 2009

Comment 1 RHEL Program Management 2012-01-09 14:51:23 UTC
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.8, and Red Hat does not plan to fix this issue in the currently developed update.

Contact your manager or support representative in case you need to escalate this bug.

Comment 2 Nenad Peric 2012-01-16 17:00:05 UTC
*** Bug 782156 has been marked as a duplicate of this bug. ***

Comment 3 Nenad Peric 2012-01-16 17:09:35 UTC
Encountered the same issue while testing mirrors in a cluster.

reserved_memory was set to 32768.
The error appears around the 220th mirror.

Test output at the point of failure:

[lvm_cluster_mirror] [lvm_cluster_mirror_sanity] 230 a1: lvcreate -m 1 -n 500_230 -L 25M --nosync mirror_sanity
[lvm_cluster_mirror] [lvm_cluster_mirror_sanity]   WARNING: New mirror won't be synchronised. Don't read what you didn't write!

Errors in /var/log/messages:

(08:25:09) [root@a1:/var/log]$ tail /var/log/messages
Jan 16 08:26:23 a1 kernel: clogd(26041): unaligned access to 0x600000000001160c, ip=0x4000000000005ef0
Jan 16 08:26:23 a1 kernel: clogd(26041): unaligned access to 0x6000000000011614, ip=0x4000000000005f10
Jan 16 08:26:23 a1 clogd[26041]: cpg_mcast_joined error: SA_AIS_ERR_BAD_HANDLE
Jan 16 08:26:28 a1 last message repeated 36 times
Jan 16 08:26:28 a1 kernel: kernel unaligned access to 0xe0000001f1a60394, ip=0xa00000020371e4d0
Jan 16 08:26:28 a1 kernel: kernel unaligned access to 0xe0000001f1a6043c, ip=0xa00000020371e560
Jan 16 08:26:28 a1 kernel: clogd(26041): unaligned access to 0x600000000001160c, ip=0x400000000006f9f0
Jan 16 08:26:28 a1 kernel: clogd(26041): unaligned access to 0x600000000001160c, ip=0x4000000000005ef0
Jan 16 08:26:28 a1 kernel: clogd(26041): unaligned access to 0x6000000000011614, ip=0x4000000000005f10
Jan 16 08:26:28 a1 clogd[26041]: cpg_mcast_joined error: SA_AIS_ERR_BAD_HANDLE



The operation can be unblocked by running vgs or vgscan on the active node.
That command can itself get stuck; in that case, running the same command on
one other node unblocks both. Mirror creation then continues for another 10 to
15 mirrors before it deadlocks again.
As far as I can tell, this cycle can be repeated indefinitely.
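
The manual workaround above could be scripted roughly as follows. This is only a sketch under assumptions: the 60-second stall threshold and the function name create_mirror_with_nudge are invented for illustration, and it does not handle the case where vgs itself gets stuck and has to be run on another node.

```python
import subprocess


def create_mirror_with_nudge(name, vg="mirror_sanity", stall_timeout=60,
                             popen=subprocess.Popen, run=subprocess.run):
    """Run lvcreate; if it appears stuck, run vgs once and keep waiting.

    stall_timeout is a hypothetical threshold, not a value from this report.
    popen and run are injectable so the logic can be exercised without LVM.
    """
    # Same lvcreate invocation as the test scenario in this report.
    cmd = ["lvcreate", "-m", "1", "-n", name, "-L", "25M", "--nosync", vg]
    proc = popen(cmd)
    try:
        return proc.wait(timeout=stall_timeout)
    except subprocess.TimeoutExpired:
        # lvcreate has not finished in time. Per the comment above, running
        # vgs on the same node has been observed to unblock it.
        run(["vgs"], check=False)
        # Wait for the original lvcreate; it is never killed here.
        return proc.wait()


if __name__ == "__main__":
    print(create_mirror_with_nudge("500_230"))
```

The script waits on the original lvcreate rather than killing it, which matches the observation that the stuck command completes once vgs has run.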


The errors showing in /var/log/messages then are:


Jan 16 11:04:40 a1 kernel: kernel unaligned access to 0xe0000001f3490714, ip=0xa000000202f424d0
Jan 16 11:04:40 a1 kernel: kernel unaligned access to 0xe0000001f34907bc, ip=0xa000000202f42560
Jan 16 11:04:40 a1 kernel: clogd(4059): unaligned access to 0x600000000001160c, ip=0x400000000006f9f0
Jan 16 11:04:40 a1 kernel: clogd(4059): unaligned access to 0x60000000000116b4, ip=0x4000000000057f60
Jan 16 11:04:40 a1 kernel: clogd(4059): unaligned access to 0x60000000000116b4, ip=0x4000000000058260
Jan 16 11:04:40 a1 clogd[4059]: cpg_initialize failed:  Cannot join cluster
Jan 16 11:04:40 a1 clogd[4059]: clog_resume:  Failed to create cluster CPG
Jan 16 11:04:40 a1 lvm[4524]: Monitoring mirror device mirror_sanity-500_322 for events

Comment 5 RHEL Program Management 2012-04-02 10:47:39 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 6 Jonathan Earl Brassow 2012-04-10 16:12:01 UTC
There simply may need to be limits placed on the number of cluster mirrors that are allowed.  It doesn't look like checkpointing/CPG can handle the load of all the mirrors.
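
Until such a limit exists in the tools, a test harness could enforce one itself before calling lvcreate. The sketch below is an illustration only: the value 100 is a placeholder (this report does not establish a safe number), and count_mirrors and may_create_mirror are hypothetical helper names.

```python
import subprocess

# Placeholder value. The report only shows failures starting somewhere
# between roughly 120 and 230 mirrors; it does not name a safe limit.
MAX_CLUSTER_MIRRORS = 100


def count_mirrors(lvs_output):
    """Count distinct LVs with segment type 'mirror'.

    Expects the text printed by: lvs --noheadings -o lv_name,segtype VG
    """
    names = set()
    for line in lvs_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1] == "mirror":
            names.add(fields[0])
    return len(names)


def may_create_mirror(vg, limit=MAX_CLUSTER_MIRRORS):
    """Return True if the volume group holds fewer than `limit` mirrors."""
    out = subprocess.run(
        ["lvs", "--noheadings", "-o", "lv_name,segtype", vg],
        check=True, capture_output=True, text=True,
    ).stdout
    return count_mirrors(out) < limit


if __name__ == "__main__":
    print(may_create_mirror("mirror_sanity"))
```

Parsing is kept separate from the lvs call so the counting logic can be checked against captured output on a machine without LVM.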