Bug 514814

Summary: RHEL5 cmirror tracker: openais ipc limits can lead to large cmirror problems
Product: Red Hat Enterprise Linux 5
Component: cmirror
Version: 5.4
Status: CLOSED WONTFIX
Severity: urgent
Priority: urgent
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Reporter: Corey Marthaler <cmarthal>
Assignee: Jonathan Earl Brassow <jbrassow>
QA Contact: Cluster QE <mspqa-list>
CC: agk, ccaulfie, coughlan, dwysocha, edamato, heinzm, janne.peltonen, mbroz, rlerch
Doc Type: Bug Fix
Doc Text:
  Due to limitations in the cluster infrastructure, cluster mirrors greater than 1.5TB cannot be created with the default region size. If larger mirrors are required, the region size should be increased from its default (512kB), for example:

    # -R <region_size_in_MiB>
    lvcreate -m1 -L 2T -R 2 -n mirror vol_group

  Failure to increase the region size will result in the LVM creation process hanging and may cause other LVM commands to hang.
Clone Of:
: 515742 (view as bug list)
Last Closed: 2010-01-26 21:09:07 UTC
Bug Blocks: 513501, 515742

Description Corey Marthaler 2009-07-30 21:37:50 UTC
Description of problem:
Every so often my hayes cluster (which uses AOE storage) is unable to complete the simple task of creating/activating cmirrors. I'm not sure how I get into this state, but once it's there, it's hard to get back out of it.

[root@hayes-01 ~]# service clvmd start
Starting clvmd:                                            [  OK  ]
Activating VGs: 
[HANG DUE TO EXISTING MIRROR]

Jul 30 16:28:38 hayes-03 clogd[4395]: Invalid log request received, ignoring. 
device-mapper: dm-log-clustered: [U9LjlBOE] Request timed out: [DM_CLOG_CTR/20] - retrying
Jul 30 16:28:53 hayes-03 kernel: device-mapper: dm-log-clustered: [U9LjlBOE] Request timed out: [DM_CLOG_CTR/20] - retrying
Jul 30 16:28:53 hayes-03 clogd[4395]: Invalid log request received, ignoring. 
device-mapper: dm-log-clustered: [U9LjlBOE] Request timed out: [DM_CLOG_CTR/21] - retrying
Jul 30 16:29:08 hayes-03 kernel: device-mapper: dm-log-clustered: [U9LjlBOE] Request timed out: [DM_CLOG_CTR/21] - retrying
Jul 30 16:29:08 hayes-03 clogd[4395]: Invalid log request received, ignoring. 
device-mapper: dm-log-clustered: [U9LjlBOE] Request timed out: [DM_CLOG_CTR/22] - retrying
Jul 30 16:29:23 hayes-03 kernel: device-mapper: dm-log-clustered: [U9LjlBOE] Request timed out: [DM_CLOG_CTR/22] - retrying
Jul 30 16:29:23 hayes-03 clogd[4395]: Invalid log request received, ignoring. 
device-mapper: dm-log-clustered: [U9LjlBOE] Request timed out: [DM_CLOG_CTR/23] - retrying
Jul 30 16:29:38 hayes-03 kernel: device-mapper: dm-log-clustered: [U9LjlBOE] Request timed out: [DM_CLOG_CTR/23] - retrying
Jul 30 16:29:38 hayes-03 clogd[4395]: Invalid log request received, ignoring. 
device-mapper: dm-log-clustered: [U9LjlBOE] Request timed out: [DM_CLOG_CTR/24] - retrying
Jul 30 16:29:53 hayes-03 kernel: device-mapper: dm-log-clustered: [U9LjlBOE] Request timed out: [DM_CLOG_CTR/24] - retrying
Jul 30 16:29:53 hayes-03 clogd[4395]: Invalid log request received, ignoring. 
device-mapper: dm-log-clustered: [U9LjlBOE] Request timed out: [DM_CLOG_CTR/25] - retrying
Jul 30 16:30:08 hayes-03 kernel: device-mapper: dm-log-clustered: [U9LjlBOE] Request timed out: [DM_CLOG_CTR/25] - retrying
Jul 30 16:30:08 hayes-03 clogd[4395]: Invalid log request received, ignoring. 


Nothing shows up as being wrong in the debugging output:

Jul 30 16:30:39 hayes-02 clogd[4346]:
Jul 30 16:30:39 hayes-02 clogd[4346]: LOG COMPONENT DEBUGGING::
Jul 30 16:30:39 hayes-02 clogd[4346]: Official log list:
Jul 30 16:30:39 hayes-02 clogd[4346]: Pending log list:
Jul 30 16:30:39 hayes-02 clogd[4346]: Resync request history:
Jul 30 16:30:39 hayes-02 clogd[4346]:
Jul 30 16:30:39 hayes-02 clogd[4346]: CLUSTER COMPONENT DEBUGGING::
Jul 30 16:30:39 hayes-02 clogd[4346]: Command History:
Jul 30 16:30:42 hayes-02 clogd[4346]:
Jul 30 16:30:42 hayes-02 clogd[4346]: LOG COMPONENT DEBUGGING::
Jul 30 16:30:42 hayes-02 clogd[4346]: Official log list:
Jul 30 16:30:42 hayes-02 clogd[4346]: Pending log list:
Jul 30 16:30:42 hayes-02 clogd[4346]: Resync request history:
Jul 30 16:30:42 hayes-02 clogd[4346]:
Jul 30 16:30:42 hayes-02 clogd[4346]: CLUSTER COMPONENT DEBUGGING::
Jul 30 16:30:42 hayes-02 clogd[4346]: Command History:

This has been reproduced on this cluster quite a few times.

Version-Release number of selected component (if applicable):
2.6.18-158.el5                                                                                                                                        
lvm2-2.02.46-8.el5    BUILT: Thu Jun 18 08:06:12 CDT 2009
lvm2-cluster-2.02.46-8.el5    BUILT: Thu Jun 18 08:05:27 CDT 2009
device-mapper-1.02.32-1.el5    BUILT: Thu May 21 02:18:23 CDT 2009
cmirror-1.1.39-2.el5    BUILT: Mon Jul 27 15:39:05 CDT 2009
kmod-cmirror-0.1.21-14.el5    BUILT: Thu May 21 08:28:17 CDT 2009

Comment 1 Corey Marthaler 2009-07-31 22:35:17 UTC
This doesn't appear to be a network device issue, as I just hit this on the grant cluster as well (FC storage). Right after clvmd segfaulted (due to bz 506986), I restarted the cluster and it hung attempting to start clvmd. Now, even after cleaning up the volumes and restarting everything, any cmirror create attempt hangs the cluster.

Comment 2 Corey Marthaler 2009-07-31 22:45:19 UTC
Looks like the log daemon is failing to do something.

After dd'ing to really wipe the storage clean, I was still unable to create anything.

[root@hayes-01 ~]# lvcreate -m 1 -n mirror -L 10G  VG
  Aborting. Failed to activate mirror log.
  Failed to create mirror log.

Comment 3 Jonathan Earl Brassow 2009-08-04 20:53:10 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cluster mirrors larger than 1.5TB require larger region sizes, or will not work.

Due to limitations in the cluster infrastructure, cluster mirrors greater than 1.5TB cannot be created with the default region size.  Users that require larger mirrors should increase the region size from its default (512k) to something larger.

Example:
# -R <region_size_in_MiB>
lvcreate -m1 -L 2T -R 2 -n mirror vol_group

Failure to increase the region size will result in hung LVM creation and possibly hanging other LVM commands as well.
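[Editorial note, not part of the original release note: a rough back-of-the-envelope check of the example above, assuming the 1.5TB-at-512kB figure scales linearly with region size, i.e. a ceiling of roughly 3.1 million regions:

  1.5TiB / 512KiB ~= 3,145,728 regions   (assumed ceiling)
  2TiB   / 512KiB  = 4,194,304 regions   (default region size: over the ceiling, hangs)
  2TiB   / 2MiB    = 1,048,576 regions   (with -R 2: well under the ceiling)]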

Comment 7 Ryan Lerch 2009-08-19 01:06:39 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,9 +1,6 @@
-Cluster mirrors larger than 1.5TB require larger region sizes, or will not work.
-
-Due to limitations in the cluster infrastructure, cluster mirrors greater than 1.5TB cannot be created with the default region size.  Users that require larger mirrors should increase the region size from its default (512k) to something larger.
-
-Example:
-# -R <region_size_in_MiB>
+Due to limitations in the cluster infrastructure, cluster mirrors greater than 1.5TB cannot be created with the default region size. If larger mirrors are required, the region size should be increased from its default (512kB), for example:
+<screen>
+# -R &lt;region_size_in_MiB&gt;
 lvcreate -m1 -L 2T -R 2 -n mirror vol_group
-
+</screen>
-Failure to increase the region size will result in hung LVM creation and possibly hanging other LVM commands as well.
+Failure to increase the region size will result in the LVM creation process hanging and may cause other LVM commands to hang.

Comment 9 Jonathan Earl Brassow 2010-01-26 21:09:07 UTC
The given workaround will have to suffice for RHEL 5.

The openAIS checkpoint limits will not be increased in RHEL 5; therefore, a solution would involve automatically picking a region_size based on these limits. This would be different behavior from single-machine mirroring.

I am closing this bug with no intent to fix it in RHEL 5. It is possible to make up for the limitations in openAIS checkpoints within LVM, but I don't see any demand for this, especially with such a simple workaround.
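[Editorial note: to illustrate the kind of automatic region_size selection described above, here is a minimal shell sketch. The ~3.1 million region ceiling is an assumption inferred from the 1.5TB / 512kB figure in comment 3, not a documented openais checkpoint limit, and the helper name is hypothetical:

#!/bin/sh
# Hypothetical helper (not part of LVM or cmirror): pick a power-of-two region
# size, in MiB, that keeps the region count of a cluster mirror under an
# assumed ceiling of 3,145,728 regions (1.5TiB / 512KiB).
#
# usage: pick_region_size <mirror_size_in_GiB>
pick_region_size() {
    size_kib=$(( $1 * 1024 * 1024 ))    # mirror size in KiB
    limit=3145728                       # assumed maximum region count
    region_kib=512                      # cmirror default region size (512KiB)
    while [ $(( size_kib / region_kib )) -gt "$limit" ]; do
        region_kib=$(( region_kib * 2 ))
    done
    echo $(( (region_kib + 1023) / 1024 ))   # MiB value for 'lvcreate -R'
}

# Example: under this assumption a 2TiB mirror needs at least 1MiB regions,
# so "lvcreate -m1 -L 2T -R 1 ..." (or the -R 2 from comment 3) would avoid the hang.
pick_region_size 2048    # prints 1]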

Comment 11 Janne Peltonen 2013-03-19 08:19:18 UTC
Hi, I seem to be running into similar problems when trying to pvmove stuff. Weirdly enough, the pvmove mirrors are much smaller than 1.5 TB. Is there a way to increase the mirror region size for the pvmove mirrors?