Bug 1440160
Summary: clvmd hanging for timeout after kernel upgrade to 2.6.32-696

Product: Red Hat Enterprise Linux 6
Reporter: Andrea Costantino <costan>
Component: cluster
Assignee: Christine Caulfield <ccaulfie>
Status: CLOSED WONTFIX
QA Contact: cluster-qe <cluster-qe>
Severity: urgent
Priority: urgent
Version: 6.9
CC: agk, ccaulfie, cfeist, cluster-maint, costan, dan131riley, erikj, heinzm, jbrassow, jkachuck, jruemker, m.c.dixon, mcsontos, mjuricek, mprobierz, msnitzer, prajnoha, prockai, rpeterso, teigland, zkabelac, zren
Target Milestone: rc
Target Release: ---
Keywords: Regression
Hardware: x86_64
OS: Linux
Last Closed: 2017-11-09 14:40:33 UTC
Type: Bug
Bug Blocks: 1422991, 1425546
We have a similar RHEL7 issue. Is the corosync process at 100% CPU?

(In reply to Marian Csontos from comment #2)

Apparently no. The process hogging the CPU is clvmd, not corosync — at least on one of the CPUs. I did not check the other CPUs nor corosync's CPU usage, so I cannot give a 100% guarantee. What kernel are you using?

The issue is a race due to a misunderstanding about libqb thread safety. The kernel version is not important; it just slightly changes timing, and the issue becomes easier to reproduce. Please check all machines in the cluster for a corosync process using 100% CPU. The node will eventually get fenced. Are you able to test with version 2.02.169 or the provided 6.9 test build?

(In reply to Marian Csontos from comment #4)

I remember that the first time I waited a while, and the node that had just been upgraded was fenced after some minutes. I'm not able to reproduce it right now since it's Friday evening here in Italy and I won't touch production boxes until Monday. The version I'm using is lvm2-cluster-2.02.143-12.el6.x86_64, as per my initial post. I don't see any version 2.02.169 on the official CentOS repositories; maybe it is being ported from RH, but I don't see it on RHN Errata at the moment either.
The test packages: https://mcsontos.fedorapeople.org/clvmd-libqb-race-rhel-6.9.x86_64.tar.gz

This is lvm2-2.02.143-12.el6.x86_64 with an additional patch:

commit dae4f53acb269219e876c229c8f034fcdaf3ff5a
Author: Zdenek Kabelac <zkabelac>
Date: Sat Feb 4 14:47:27 2017 +0100

    clvmd: add mutex protection for cpg_ call

    The library for corosync multicasting is not supporting multithread
    usage - add local mutex to avoid parallel call of cpg_mcast_joined().

I failed to reproduce on a (virtualized) RHEL-6.9 cluster.

Andrea, is the issue reproducible — does it happen (almost) every time, or just sporadically?

Chrissie, clvmd stops on a timeout on a write to /dev/misc/dlm-control after reading from it. Could that be an indication of the same clvmd+libqb issue we have seen in Bug 1361331, or is it definitely a different thing in need of investigation?

Andrea, sosreports from the cluster nodes would be helpful if it turns out to be a different issue.

(In reply to Marian Csontos from comment #7)

Hello Marian,

I did not try the packages you suggested; it is a very busy period and the downtime risk prevented me from testing. I can provide the sosreport from one node only — I need the second to stay alive and working — but I think that is enough. Allow me a couple of hours and I'll get back with it.

(In reply to Andrea Costantino from comment #8)

Now the node is being fenced. Two tries and two fencings. I got a sosreport after cman start and before clvmd start, but this is not showing anything useful, I guess. I'll try to wait a little after cman start to allow things to settle. Moreover, we also have an I/O spike these days due to application overusage, so this might be worsening the stuck condition.

(In reply to Andrea Costantino from comment #9)

Ok, I got it. If I start cman and clvmd without starting rgmanager, immediate fencing happens. If rgmanager is started when clvmd is started, the machine stays up — no fencing — but it's painfully slow, and logging:

Message from syslogd@newnemo1 at Apr 13 15:03:01 ...
kernel:BUG: soft lockup - CPU#25 stuck for 67s! [clvmd:7678]

From the priorities, rgmanager starts last (99 run priority), so the use case without mangling things will be a fencing. This was not the case on the first try (after the initial upgrade and reboot), but that might be a matter of timing. I'm still waiting for the sosreport to collect; it has been stuck for hours on collecting "parted -s /dev/dm-0 unit s print".

(In reply to Andrea Costantino from comment #10)

No way to collect a sosreport. It hangs completely and keeps one CPU and some I/O stuck. I need to reboot to recover, no kill is possible, and the system is now and then unworkable.

We have hit this problem at HPE/SGI as well. Our workaround is to use the RHEL6.8 GA kernel on the RHEL6.9 installation. If there is some information we can provide, let us know.

Erik, Andrea, it was pointed out to me that libqb is not used by the RHEL-6 clvmd, so the build from Comment #6 would not help. Erik, if the problem persists, open a support case and work with GSS to gather more information, please. For now, if you could at least SysRq the stuck cluster, so we see the processes running and their stacks.

This is a totally different thing to the RHEL7 bug. As Marian pointed out, libqb is not used on RHEL-6 — though I haven't validated the socket IPC for thread-safety, it's true. If clvmd is hanging writing to /dev/misc/dlm-control then we should also be looking at the DLM to see why it's not responding quickly to that write. It might be in recovery.

Ok, so what is the way forward on this? I just want to point out that the cluster is not really stuck — only clvmd and sosreport hang. SysRq might help, but I've already provided the strace, and the hang is happening on dlm-control; I did not strace the dlm side. I can do it next week if it's of any value. The current workaround is using the old kernel 2.6.32-642.15.1.el6.x86_64, and I'm currently puzzling over what in the kernel makes the substantial difference. I tried to review the kernel changelogs and there is a plethora of changes that might be relevant to LVM and *might* impact cLVM.

The next thing I would look at is the dlm_tool debug logs. Those should tell you what state the DLM kernel and dlm_controld are in. Also (and more obviously) look in syslog for DLM messages.

(I see the request for information. I've requested access to the platform to collect the details. It may be a couple of hours before I get access. It's a shared resource being used for other testing as well.)

(In reply to Christine Caulfield from comment #16)

No big deal happening in the logs:

Apr 20 16:14:18 newnemo2 kernel: dlm: Using SCTP for communications
Apr 20 16:14:18 newnemo2 kernel: SCTP: Hash tables configured (established 65536 bind 65536)

Then the fence occurs. I suspect it's in the kernel: a CPU is stuck, and the kernel changes resolve the issue...

Any reason you're using SCTP for the DLM? I don't think it's well tested. It might be worth setting it back to TCP (which might require also setting corosync to single-ring mode, I accept) to see if it fixes the problem.

(I will have the platform in 2.5 hours. I will try to collect what was suggested here and attach. Feel free to throw me more things to try during my time.)

(In reply to Christine Caulfield from comment #20)

SCTP is the only way for a redundant ring config (which I have). I can disable it and it should revert to TCP, but a full restart of the cluster is needed and I cannot cope with that. Anyway, apart from the kernel upgrade for RH6.9, it never gave me a single issue.

@Erik, are you using a redundant ring or forcing use of SCTP?

Looking through our support documents, DLM over SCTP is not supported in RHEL-6 because the SCTP stack in the kernel was not found to be reliable for the job we wanted it to do. If this happened after a kernel upgrade, there's a good possibility that it's something in SCTP that is causing this hang. I appreciate it's hard to dismantle a cluster, but channel bonding is the supported and reliable way to approach this in RHEL-6.

I can confirm this output in our dmesg. We were not aware this is not supported. I wonder how we'll fix deployed solutions if we have to undo this. As far as I know, our configuration hasn't changed in many RHEL releases here. I am investigating how to not use SCTP.

DLM (built Apr 13 2016 00:52:04) installed
dlm: Using SCTP for communications
dlm: connecting to 2
GFS2: fsid=: Trying to join cluster "lock_dlm", "hacluster:images"
dlm: node 2 already connected.

(In reply to Erik Jacobson from comment #24)

If using conga to configure, it's easy: log in to conga, register the cluster, go to expert mode (upper right, preferences, enable expert). Then Configure tab, DLM subtab -> DLM Lowcomms Protocol -> TCP. Reboot all the cluster nodes. Probably you need to shut down everything first, then start the whole cluster at almost the same time. To achieve this, start all the nodes with the nocluster flag appended to the kernel boot line, like this:

kernel /vmlinuz-2.6.32-696.1.1.el6.x86_64 ro root=/dev/mapper/root rd_NO_LUKS LANG=en_US.UTF-8 rd_LVM_LV=vg_newnemo2/root rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto KEYBOARDTYPE=pc KEYTABLE=us rd_LVM_LV=vg_newnemo2/swap rd_NO_DM console=tty0 console=ttyS0,19200n8 nocluster

Then, after everything has booted, start cman on all nodes simultaneously. An alternative way is to disable all cluster services at boot time and start cman simultaneously on all nodes afterwards. If nothing gets fenced and a quorum is formed, you're ready to go, and DLM will be using TCP.

(In reply to Christine Caulfield from comment #23)

Thanks Christine, I understand. I'm puzzled about SCTP not being supported yet being automatically chosen when RRP is available... but we know, QA is not always possible in all conditions. As soon as I have downtime approved, I'll try to disable SCTP.

List of commits taken from the changelogs regarding either DLM or SCTP between the 642 and 696 kernel subversions:

* Thu Jan 12 2017 Phillip Lougher <plougher> [2.6.32-683.el6]
- [fs] dlm: Fix saving of NULL callbacks (Robert S Peterson) [1264492]

* Sat Dec 10 2016 Phillip Lougher <plougher> [2.6.32-678.el6]
- [net] sctp: validate chunk len before actually using it (Hangbin Liu) [1399457] {CVE-2016-9555}

* Fri Nov 25 2016 Phillip Lougher <plougher> [2.6.32-675.el6]
- [fs] dlm: Don't save callbacks after accept (Robert S Peterson) [1264492]
- [fs] dlm: Save and restore socket callbacks properly (Robert S Peterson) [1264492]
- [fs] dlm: Replace nodeid_to_addr with kernel_getpeername (Robert S Peterson) [1264492]
- [fs] dlm: print kernel message when we get an error from kernel_sendpage (Robert S Peterson) [1264492]

* Fri Nov 04 2016 Phillip Lougher <plougher> [2.6.32-668.el6]
- [net] sctp: use the same clock as if sock source timestamps were on (Xin Long) [1334561]
- [net] sctp: update the netstamp_needed counter when copying sockets (Xin Long) [1334561]
- [net] sctp: fix the transports round robin issue when init is retransmitted (Xin Long) [1312728]

* Fri Oct 21 2016 Phillip Lougher <plougher> [2.6.32-664.el6]
- [fs] dlm: free workqueues after the connections (Marcelo Leitner) [1365204]

We do no special configuration of this item. We have an integrated setup that uses the RHEL HA stack, but the rules are all custom, to be deployed with our own cluster manager. For this reason we are not using configuration tools. We're using mostly default values for the HA setup, and it looks like this produces a default but undesired configuration. I'm trying to take the information you provided and apply it to the way we set up rules for the integrated solution we ship.

I'm playing with the machine now. It appears, from the dlm_controld man page, that I need the dlm protocol set to tcp in /etc/cluster/cluster.conf. I am attempting this change.

Adding this to the <cluster/> section of /etc/cluster/cluster.conf on both of the nodes in our two-node HA environment avoids the problem:

<dlm protocol="tcp"/>

DLM (built Mar 21 2017 12:20:07) installed
dlm: Using TCP for communications
dlm: got connection from 1
GFS2: fsid=: Trying to join cluster "lock_dlm", "hacluster:images"

The HA cluster seems to be functioning normally. So it sounds like what you are saying is that the default behavior (when not using tools to configure the HA services) produces a DLM setup that is known to have issues in RHEL6 generally and is also causing this show-stopper. So the action SGI needs to take is: generate a release note and/or bulletin to customers with existing RHEL6.8 deployments to change the cluster.conf file prior to upgrading to RHEL6.9 (for the product making use of this HA feature); and secondly, change our configuration scripts and tools to add the above-mentioned dlm configuration option to cluster.conf by default. Are there any similar worries with RHEL 7.3? I was hoping to get a quick confirmation before I go forward with the changes outlined. Thanks so much for the suggestion.

Looks like having our tools change /etc/sysconfig/cman from #DLM_CONTROLD_OPTS="" to DLM_CONTROLD_OPTS="-r 0" is easier for us, since hacking /etc/cluster/cluster.conf is probably not the best choice, the pcs commands don't allow you to modify this, and we're not currently running ricci.

Assigning to cluster so this gets addressed there. SCTP must not be used by default.
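For reference, the workaround discussed above places the dlm element directly under the top-level cluster element of /etc/cluster/cluster.conf. A minimal sketch follows; the cluster name, node names, and config_version are placeholders, and only the `<dlm protocol="tcp"/>` line is taken from this report (note the caveat from later comments: TCP does not work with a redundant-ring/multi-homed configuration):

```
<cluster name="hacluster" config_version="2">
  <!-- Force DLM lowcomms onto TCP instead of the unsupported SCTP.
       Only valid for a single-ring (non-RRP) setup. -->
  <dlm protocol="tcp"/>
  <clusternodes>
    <clusternode name="node1" nodeid="1"/>
    <clusternode name="node2" nodeid="2"/>
  </clusternodes>
</cluster>
```

After changing the file on every node, the cluster stack needs a full restart for the new protocol to take effect, as described in comment #25.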
(In reply to Marian Csontos from comment #32)

SCTP is the only way when there's a redundant ring. If you try to start a DLM-aware application (like clvmd) when forced to TCP with a redundant ring configured, the application fails, since DLM complains about not being able to use TCP with multi-homing:

Apr 26 16:12:35 new1 kernel: dlm: TCP protocol can't handle multi-homed hosts, try SCTP
Apr 26 16:12:36 new1 kernel: dlm: cannot start dlm lowcomms -22
Apr 26 16:12:36 new1 clvmd: Unable to create DLM lockspace for CLVM: Invalid argument
Apr 26 16:12:52 new1 kernel: dlm: closing connection to node 2
Apr 26 16:12:52 new1 kernel: dlm: closing connection to node 1

This is getting really weird. No redundant ring then.

Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available. The official life cycle policy can be reviewed here: http://redhat.com/rhel/lifecycle

This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL: https://access.redhat.com/

(In reply to Chris Feist from comment #36)

Chris, this ticket was opened and discussed BEFORE RHEL 6 entered Phase 3. You can easily WONTFIX it for a number of reasons, but saying it's no longer the right time sounds like a parody.

We evaluate our bugs during every release cycle; since we are now in Production Phase 3 for RHEL 6, all bugs (even bugs filed before the start of Production Phase 3) are evaluated against the current production phase criteria.
Created attachment 1269735 [details]
ZIP containing both straces.

Description of problem:
After upgrading CentOS 6.8 to 6.9 on a cluster pair, clvmd never starts correctly. It hangs with CPU-stuck messages and never completes the DLM negotiation. The issue occurs only with the new kernel 2.6.32-696 (the first kernel released after the version bump to 6.9). The same cluster/lvm2 packages with the former kernel 2.6.32-642.15.1 work as expected.

Version-Release number of selected component (if applicable):
kernel-2.6.32-696.el6.x86_64
lvm2-cluster-2.02.143-12.el6.x86_64

Steps to Reproduce:
1. Create a cluster on 6.8 or 6.9 with kernel-2.6.32-642.15.1.el6.x86_64 and shared storage; start clvmd and eventually a GFS2 filesystem to check that it works.
2. Boot with kernel-2.6.32-696.el6.x86_64.
3. Start clvmd.

Actual results:
clvmd hangs and never registers with DLM correctly. CLVM volumes are unavailable. GFS2 will not work.

Expected results:
clvmd completes the initialization and the volumes are available. GFS2 works.

Additional info:
Attached straces for the working and non-working cases.