Bug 1440160 - clvmd hanging for timeout after kernel upgrade to 2.6.32-696
Summary: clvmd hanging for timeout after kernel upgrade to 2.6.32-696
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.9
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Christine Caulfield
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1422991 1425546
 
Reported: 2017-04-07 12:29 UTC by Andrea Costantino
Modified: 2021-09-09 12:14 UTC (History)
22 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-09 14:40:33 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
ZIP containing both straces. (112.40 KB, application/zip)
2017-04-07 12:29 UTC, Andrea Costantino
no flags Details

Description Andrea Costantino 2017-04-07 12:29:59 UTC
Created attachment 1269735 [details]
ZIP containing both straces.

Description of problem:

After upgrading CentOS 6.8 to 6.9 on a cluster pair, clvmd never starts correctly.

It hangs with CPU stuck messages and never completes the DLM negotiation.

The issue occurs only with the new kernel 2.6.32-696 (the kernel released with the version bump to 6.9).

The same cluster/lvm2 packages with the former kernel 2.6.32-642.15.1 work as expected.


Version-Release number of selected component (if applicable):
kernel-2.6.32-696.el6.x86_64
lvm2-cluster-2.02.143-12.el6.x86_64


How reproducible:
Create a cluster on 6.8 or 6.9 with kernel-2.6.32-642.15.1.el6.x86_64 and shared storage; start clvmd and, optionally, a GFS2 FS to check that it works.
Boot with kernel-2.6.32-696.el6.x86_64.
Start clvmd.


Steps to Reproduce:
1. Create a cluster on 6.8 or 6.9 with kernel-2.6.32-642.15.1.el6.x86_64 and shared storage; start clvmd and, optionally, a GFS2 FS to check that it works.

2. Boot with kernel-2.6.32-696.el6.x86_64

3. Start clvmd
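
For reference, the per-node commands behind these steps would look roughly like the following. This is only a sketch assuming the stock RHEL 6 cman/clvmd init scripts; the exact service set depends on the cluster configuration.

  service cman start      # join the cluster (corosync membership, fencing, dlm_controld)
  service clvmd start     # this is the step that hangs on kernel 2.6.32-696
  vgs                     # clustered VGs never become available while clvmd is stuck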


Actual results:
clvmd hangs and never registers with the DLM correctly. CLVM volumes are unavailable and GFS2 will not work.

Expected results:
clvmd completes the initialization and volumes are available. GFS2 works.

Additional info:
Attached are straces for the working and non-working cases.

Comment 2 Marian Csontos 2017-04-07 15:17:47 UTC
We have a similar RHEL7 issue. Is corosync process at 100% CPU?

Comment 3 Andrea Costantino 2017-04-07 15:32:29 UTC
(In reply to Marian Csontos from comment #2)
> We have a similar RHEL7 issue. Is corosync process at 100% CPU?

Apparently not. The process hogging the CPU is clvmd, not corosync, at least on one of the CPUs.

That said, I did not check the other CPUs nor corosync's CPU usage, so I cannot give a 100% guarantee.


What kernel are you using?

Comment 4 Marian Csontos 2017-04-07 15:42:48 UTC
The issue is a race due to a misunderstanding about libqb thread safety.

The kernel version is not important; it just slightly changes the timing, and the issue becomes easier to reproduce.

Please check all machines in the cluster for a corosync process using 100% CPU. The node will eventually get fenced.

Are you able to test with version 2.02.169 or the provided 6.9 test build?
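
A quick way to do that per-node check with standard tools would be something like this (a sketch, nothing cluster-specific):

  ps -C corosync -o pid,pcpu,stat,cmd    # is corosync pegged near 100% CPU?
  top -b -n 1 | head -n 20               # snapshot of the current top CPU consumers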

Comment 5 Andrea Costantino 2017-04-07 16:19:13 UTC
(In reply to Marian Csontos from comment #4)
> The issue is a race due to a misunderstanding about libqb thread safety.
> 
> Kernel version is not important, it is just slightly changing timing and the
> issue becomes easier to reproduce. 
> 
> Check all machines in the cluster please if there is a corosync using 100%
> CPU. The node will eventually get fenced.
> 
> Are you able to test with version 2.02.169 or provided 6.9 test build?


I remember that the first time I waited quite a while, and the node that had just been upgraded was fenced after a few minutes.

I'm not able to reproduce it right now since it's Friday evening here in Italy and I won't touch production boxes until Monday.

The version I'm using is lvm2-cluster-2.02.143-12.el6.x86_64, as per my initial post.

I don't see any version 2.02.169 in the official CentOS repositories; maybe it is being ported from RH, but I don't see it in the RHN errata at the moment either.

Comment 6 Marian Csontos 2017-04-07 16:20:06 UTC
The test packages: https://mcsontos.fedorapeople.org/clvmd-libqb-race-rhel-6.9.x86_64.tar.gz

This is lvm2-2.02.143-12.el6.x86_64 with additional patch:

commit dae4f53acb269219e876c229c8f034fcdaf3ff5a
Author: Zdenek Kabelac <zkabelac>
Date:   Sat Feb 4 14:47:27 2017 +0100

    clvmd: add mutex protection for cpg_ call

    The library for corosync multicasting is not supporting multithread
    usage - add local mutex to avoid parallel call of cpg_mcast_joined().
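
If you have one of these builds installed, a simple way to check whether it carries the patch is to search the package changelog (a sketch; this only works if the packager recorded the patch in the changelog):

  rpm -q --changelog lvm2-cluster | grep -i "cpg\|mutex"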

Comment 7 Marian Csontos 2017-04-12 12:32:50 UTC
I failed to reproduce on (virtualized) RHEL-6.9 cluster.

Andrea, is the issue reproducible - does it happen (almost) every time, or just sporadically?

Chrissie, clvmd stops with a timeout on a write to /dev/misc/dlm-control after reading from it. Could that be an indication of the same clvmd+libqb issue we have seen in Bug 1361331, or is it definitely a different thing in need of investigation?

Andrea, sosreports from the cluster nodes would be helpful if it turns out to be a different issue.

Comment 8 Andrea Costantino 2017-04-13 11:46:53 UTC
(In reply to Marian Csontos from comment #7)
> I failed to reproduce on (virtualized) RHEL-6.9 cluster.
> 
> Andrea, is the issue reproducible - does it happen (almost) every time, or
> just sporadically?
> 
> Chrissie, the clvmd stops on timeout on write to /dev/misc/dlm-control after
> reading from that. Could that be an indication of the same clvmd+libqb issue
> we have seen in Bug 1361331, or is it definitely a different thing in need
> of investigation?
> 
> Andrea, sosreports from the cluster nodes would be helpful if it turns out
> to be a different issue.

Hello Marian,

I did not try the packages you suggested; it has been a very busy period, and the downtime risk prevented me from testing.

I can provide the sosreport from one node only (I need the second one to stay alive and working), but I think it is enough.

Allow me a couple of hours and I'll get back with it.

Comment 9 Andrea Costantino 2017-04-13 12:09:20 UTC
(In reply to Andrea Costantino from comment #8)

Now the node is being fenced. Two tries and two fencings.

I got a sosreport after cman start and before clvmd start, but I guess this does not show anything useful.

I'll try to wait a little after cman starts to allow things to settle. Moreover, we also have an I/O spike these days due to application over-usage, so this might be worsening the stuck condition.

Comment 10 Andrea Costantino 2017-04-13 13:07:41 UTC
(In reply to Andrea Costantino from comment #9)

Ok,

I got it.

If I start cman and clvmd without starting rgmanager, immediate fencing happens.

If rgmanager is started when clvmd is started, the machine stays up with no fencing, but it is painfully slow and keeps logging:

Message from syslogd@newnemo1 at Apr 13 15:03:01 ...
 kernel:BUG: soft lockup - CPU#25 stuck for 67s! [clvmd:7678]

Going by the priorities, rgmanager starts last (run priority 99), so the normal use case, without mangling things, ends in fencing.

This was not the case on the first try (after the initial upgrade and reboot), but this might be a matter of timing.

I'm still waiting for the sosreport to finish; it has been stuck for hours on collecting "parted -s /dev/dm-0 unit s print".

Comment 11 Andrea Costantino 2017-04-14 13:20:30 UTC
(In reply to Andrea Costantino from comment #10)

There is no way to collect the sosreport. It hangs completely and keeps one CPU and some I/O stuck.

I need to reboot to recover; killing it is not possible, and the system is at times unworkable.

Comment 12 erikj 2017-04-19 23:47:31 UTC
We have hit this problem at HPE/SGI as well. Our workaround is to use the RHEL 6.8 GA kernel on the RHEL 6.9 installation. If there is some information we can provide, let us know.

Comment 13 Marian Csontos 2017-04-20 09:09:43 UTC
Erik, Andrea, it was pointed out to me that libqb is not used by the RHEL-6 clvmd, so the build from Comment #6 would not help.

Erik, if the problem persists, please open a support case and work with GSS to gather more information. For now, it would help if you could at least SysRq the stuck cluster so we can see the running processes and their stacks.

Comment 14 Christine Caulfield 2017-04-20 09:37:33 UTC
This is a totally different thing from the RHEL7 bug. As Marian pointed out, libqb is not used on RHEL-6, though it's true I haven't validated the socket IPC for thread safety.


If clvmd is hanging on a write to /dev/misc/dlm-control, then we should also be looking at the DLM to see why it's not responding quickly to that write. It might be in recovery.

Comment 15 Andrea Costantino 2017-04-20 11:12:23 UTC
OK, so what is the way forward on this?

I just want to point out that the cluster is not really stuck; only clvmd and sosreport hang.

SysRq might help, but I've already provided the strace and the hang is happening on dlm-control; I did not strace the dlm side, though. I can do that next week if it's of any value.

The current workaround is using the old kernel 2.6.32-642.15.1.el6.x86_64, and I'm still puzzling over what in the kernel is making the substantial difference.

I tried to review the kernel changelogs and there is a plethora of changes that might be relevant to LVM and *might* impact cLVM.

Comment 16 Christine Caulfield 2017-04-20 12:46:10 UTC
The next things I would look at are the dlm_tool debug logs. Those should tell you what state the DLM kernel and dlm_controld are in. Also (and more obviously) look in syslog for DLM messages.
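
Concretely, on RHEL 6 that usually means something along these lines (a sketch; dlm_tool ships with the cluster packages):

  dlm_tool ls                      # list lockspaces and their current state (e.g. whether they are in recovery)
  dlm_tool dump                    # dump the dlm_controld debug buffer
  grep -i dlm /var/log/messages    # kernel-side DLM messages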

Comment 17 erikj 2017-04-20 14:35:04 UTC
(I see the request for information. I've requested access to the platform to collect the details. It may be a couple of hours before I get access. It's a shared resource being used for other testing as well.)

Comment 19 Andrea Costantino 2017-04-20 15:06:24 UTC
(In reply to Christine Caulfield from comment #16)
> The next thing I would look at are the dlm_tool debug logs. Those should
> tell you what state the DLM kernel & dlm_controld are in. Also (and more
> obviously) look in syslog for DLM messages

Not much happening in the logs:
Apr 20 16:14:18 newnemo2 kernel: dlm: Using SCTP for communications
Apr 20 16:14:18 newnemo2 kernel: SCTP: Hash tables configured (established 65536 bind 65536)

Then fencing occurs.

I suspect it's in the kernel: a CPU gets stuck, and changing the kernel resolves the issue...

Comment 20 Christine Caulfield 2017-04-20 15:15:06 UTC
Any reason you're using SCTP for the DLM? I don't think it's well tested. 

It might be worth setting it back to TCP (which, I accept, might also require setting corosync to single-ring mode) to see if that fixes the problem.
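
To see which lowcomms protocol DLM actually ended up with on a node, something like the following can be used (a sketch; the configfs entry is populated once dlm_controld has started, and the 0/1 value mapping to TCP/SCTP is my assumption):

  dmesg | grep "dlm: Using"                      # "Using TCP" vs "Using SCTP for communications"
  cat /sys/kernel/config/dlm/cluster/protocol    # believed to be 0 for TCP, 1 for SCTP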

Comment 21 erikj 2017-04-20 15:27:16 UTC
(I will have the platform in 2.5 hours. I will try to collect what was suggested here and attach. Feel free to throw me more things to try during my time)

Comment 22 Andrea Costantino 2017-04-20 15:33:22 UTC
(In reply to Christine Caulfield from comment #20)
> Any reason you're using SCTP for the DLM? I don't think it's well tested. 
> 
> It might be worth setting it back to TCP (which might require also setting
> corosync to single-ring mode I accept) to see if it fixes the problem.

SCTP is the only way for redundant ring config (which I have).

I can disable it and it should revert to TCP, but a full restart of the cluster is needed and I cannot afford that right now.

Anyway, aside from the kernel upgrade for RHEL 6.9, it has never given me a single issue.

@Erik, are you using a redundant ring or forcing the use of SCTP?

Comment 23 Christine Caulfield 2017-04-20 16:06:29 UTC
Looking through our support documents, DLM over SCTP is not supported in RHEL-6 because the SCTP stack in the kernel was not found to be reliable for the job we wanted it to do.

If this happened after a kernel upgrade, there's a good possibility that it's something in SCTP that is causing this hang.

I appreciate that it's hard to dismantle a cluster, but channel bonding is the supported and reliable way to approach this in RHEL-6.

Comment 24 erikj 2017-04-20 16:52:59 UTC
I can confirm this output in our dmesg. We were not aware that this is not supported. I wonder how we'll fix deployed solutions if we have to undo this. As far as I know, our configuration hasn't changed in many RHEL releases.

I am investigating how to not use SCTP. 

DLM (built Apr 13 2016 00:52:04) installed
dlm: Using SCTP for communications
dlm: connecting to 2
GFS2: fsid=: Trying to join cluster "lock_dlm", "hacluster:images"
dlm: node 2 already connected.

Comment 25 Andrea Costantino 2017-04-20 17:19:33 UTC
(In reply to Erik Jacobson from comment #24)
> I can confirm this output in our dmesg. We were not aware this is not
> supported. I wonder how we'll fix deployed solutions if we have to undo
> this. As far as I know our configuration hasn't changed in many RHEL
> releases here.
> 
> I am investigating how to not use SCTP. 
> 
> DLM (built Apr 13 2016 00:52:04) installed
> dlm: Using SCTP for communications
> dlm: connecting to 2
> GFS2: fsid=: Trying to join cluster "lock_dlm", "hacluster:images"
> dlm: node 2 already connected.

If using conga to configure, it's easy:

Log in to conga, register the cluster, and go to expert mode (upper right, preferences, enable expert).
Then go to the Configure tab, DLM subtab -> DLM Lowcomms Protocol -> TCP.

Reboot all the cluster nodes. You probably need to shut everything down first and then start the whole cluster at almost the same time.

To achieve this, start all the nodes with the nocluster flag appended to the kernel boot line, like this:
 kernel /vmlinuz-2.6.32-696.1.1.el6.x86_64 ro root=/dev/mapper/root rd_NO_LUKS LANG=en_US.UTF-8 rd_LVM_LV=vg_newnemo2/root rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto  KEYBOARDTYPE=pc KEYTABLE=us rd_LVM_LV=vg_newnemo2/swap rd_NO_DM console=tty0 console=ttyS0,19200n8 nocluster

Then, after everything has booted, start cman on all nodes simultaneously.

An alternative way is to disable all cluster services at boot time and start cman simultaneously on all nodes afterwards (see the sketch below).

If nothing gets fenced and a quorum is formed, you're ready to go, and DLM will be using TCP.
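
That "disable at boot, start manually" variant would look roughly like this on each node (a sketch assuming the standard RHEL 6 init scripts; adjust the service list to what is actually installed):

  chkconfig cman off; chkconfig clvmd off; chkconfig gfs2 off; chkconfig rgmanager off
  reboot
  # then, on all nodes at roughly the same time:
  service cman start
  service clvmd start
  service rgmanager start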

Comment 26 Andrea Costantino 2017-04-20 17:22:11 UTC
(In reply to Christine Caulfield from comment #23)
> Looking through our support documents, DLM over SCTP is not supported in
> RHEL-6 because the SCTP stack in the kernel was not found to be reliable for
> the job we wanted it to do.
> 
> If this happened after a kernel upgrade there's a good possibility that it's
> something in SCTP that is causing this hang.
> 
> I appreciate it's hard to dismantle a cluster but channel bonding is the
> supported and reliable way to approach this in RHEL-6

Thanks Christine, I understand. I'm puzzled that it is not supported and yet is chosen automatically when an RRP is available.

But, as we know, QA is not always possible under all conditions.

As soon as I have downtime approved, I'll try to disable SCTP.

Comment 27 Andrea Costantino 2017-04-20 17:35:43 UTC
List of commits taken from changelogs regarding either DLM or SCTP between 642 and 696 kernel subversions:

* Thu Jan 12 2017 Phillip Lougher <plougher> [2.6.32-683.el6]
- [fs] dlm: Fix saving of NULL callbacks (Robert S Peterson) [1264492]
* Sat Dec 10 2016 Phillip Lougher <plougher> [2.6.32-678.el6]
- [net] sctp: validate chunk len before actually using it (Hangbin Liu) [1399457] {CVE-2016-9555}
* Fri Nov 25 2016 Phillip Lougher <plougher> [2.6.32-675.el6]
- [fs] dlm: Don't save callbacks after accept (Robert S Peterson) [1264492]
- [fs] dlm: Save and restore socket callbacks properly (Robert S Peterson) [1264492]
- [fs] dlm: Replace nodeid_to_addr with kernel_getpeername (Robert S Peterson) [1264492]
- [fs] dlm: print kernel message when we get an error from kernel_sendpage (Robert S Peterson) [1264492]
* Fri Nov 04 2016 Phillip Lougher <plougher> [2.6.32-668.el6]
- [net] sctp: use the same clock as if sock source timestamps were on (Xin Long) [1334561]
- [net] sctp: update the netstamp_needed counter when copying sockets (Xin Long) [1334561]
- [net] sctp: fix the transports round robin issue when init is retransmitted (Xin Long) [1312728]
* Fri Oct 21 2016 Phillip Lougher <plougher> [2.6.32-664.el6]
- [fs] dlm: free workqueues after the connections (Marcelo Leitner) [1365204]

Comment 28 erikj 2017-04-20 18:16:10 UTC
We do no special configuration of this item. We have an integrated setup that uses the RHEL HA stack but the rules are all custom to be deployed with our own cluster manager. For this reason we are not using configuration tools.

We're using mostly default values for the HA setup and it looks like this
produces a default but undesired configuration.

I'm trying to take the information you provided and apply it to the way we
set up rules for the integrated solution we ship.

I'm playing with the machine now.

Comment 29 erikj 2017-04-20 18:27:54 UTC
It appears the dlm_controld man page suggests I need the dlm protocol set to
tcp in /etc/cluster/cluster.conf. I am attempting this change.

Comment 30 erikj 2017-04-20 19:02:50 UTC
Adding this to the <cluster/> section of /etc/cluster/cluster.conf on both of our nodes in our two-node HA environment avoids the problem.

  <dlm protocol="tcp"/>

DLM (built Mar 21 2017 12:20:07) installed
dlm: Using TCP for communications
dlm: got connection from 1
GFS2: fsid=: Trying to join cluster "lock_dlm", "hacluster:images"


The HA cluster seems to be functioning normally.

So it sounds like what you are saying is that...

The default behavior (when not using tools to configure the HA services) produces a DLM setup that is known to have issues in RHEL6 generally and is also causing this showstopper.

So the action SGI needs to take is:

Generate a release note and/or bulletin telling customers with existing RHEL 6.8 deployments to change the cluster.conf file prior to upgrading to RHEL 6.9 (for the product making use of this HA feature).

And secondly, change our configuration scripts and tools to add the above-mentioned dlm configuration option to cluster.conf by default.

Are there any similar worries with RHEL 7.3?

I was hoping to get a quick confirmation before I go forward with the changes outlined.

Thanks so much for the suggestion.

Comment 31 erikj 2017-04-20 20:33:39 UTC
Looks like maybe having our tools change
/etc/sysconfig/cman
#DLM_CONTROLD_OPTS=""
to
DLM_CONTROLD_OPTS="-r 0"

is easier for us, since hacking /etc/cluster/cluster.conf is probably not the
best choice; the pcs commands don't allow you to modify this... and we're
not currently running ricci.
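
A sketch of rolling that change out and verifying it (assuming a maintenance window and the standard init scripts; the "-r 0" value is taken from the comment above as the TCP setting):

  # on every node, after editing /etc/sysconfig/cman:
  service rgmanager stop; service clvmd stop; service cman stop
  service cman start; service clvmd start; service rgmanager start
  dmesg | grep "dlm: Using"    # should now report "Using TCP for communications"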

Comment 32 Marian Csontos 2017-04-26 13:28:22 UTC
Assigning to cluster so this gets addressed there.

SCTP must not be used by default.

Comment 33 Andrea Costantino 2017-04-26 14:58:22 UTC
(In reply to Marian Csontos from comment #32)
> Assigning to cluster so this gets addressed there.
> 
> SCTP must not be used by default.

SCTP is the only way when there's a redundant ring.

If you try to start a DLM-aware application (like clvmd) when forced to TCP with a redundant ring configured, the application fails, since DLM complains about not being able to use TCP with multi-homing.

Apr 26 16:12:35 new1 kernel: dlm: TCP protocol can't handle multi-homed hosts, try SCTP
Apr 26 16:12:36 new1 kernel: dlm: cannot start dlm lowcomms -22
Apr 26 16:12:36 new1 clvmd: Unable to create DLM lockspace for CLVM: Invalid argument
Apr 26 16:12:52 new1 kernel: dlm: closing connection to node 2
Apr 26 16:12:52 new1 kernel: dlm: closing connection to node 1

This is getting really weird. No redundant ring then.
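
A quick way to tell whether a given cluster.conf actually defines a redundant ring (cman configures RRP via <altname> entries under each <clusternode>) is simply:

  grep -c altname /etc/cluster/cluster.conf    # non-zero means RRP is configured, so DLM will want SCTP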

Comment 36 Chris Feist 2017-11-09 14:40:33 UTC
Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production
3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent
Priority Bug Fix Advisories (RHBAs) may be released as they become available.

The official life cycle policy can be reviewed here:

http://redhat.com/rhel/lifecycle

This issue does not meet the inclusion criteria for the Production 3 Phase and
will be marked as CLOSED/WONTFIX. If this remains a critical requirement,
please contact Red Hat Customer Support to request a re-evaluation of the
issue, citing a clear business justification. Note that a strong business
justification will be required for re-evaluation. Red Hat Customer Support can
be contacted via the Red Hat Customer Portal at the following URL:

https://access.redhat.com/

Comment 37 Andrea Costantino 2017-12-18 15:11:48 UTC
(In reply to Chris Feist from comment #36)


Chris,

this ticket was opened and discussed BEFORE RHEL 6 entered Phase 3.

You can easily WONTFIX it for a number of reasons, but saying it's no longer the right time sounds like a parody.

Comment 38 Chris Feist 2017-12-19 14:30:44 UTC
We evaluate our bugs during every release cycle. Since we are now in Production Phase 3 for RHEL 6, all bugs (even bugs filed before the start of Production Phase 3) are evaluated against the current production phase criteria.

