Fencing now occurs when DLM requires it, even when the cluster itself does not
Previously, DLM could require fencing due to quorum issues, even when the cluster itself did not require fencing, but was unable to initiate it. As a consequence, DLM and DLM-based services could hang waiting for fencing that never happened. With this fix, the `ocf:pacemaker:controld` resource agent now checks whether DLM is in this state and requests fencing if so. Fencing now occurs in this situation, allowing DLM to recover.
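Below is a minimal sketch of the kind of check the fixed agent can run from its monitor action. It is not the literal ocf:pacemaker:controld code: the function names are invented, and issuing the request with stonith_admin --reboot against the local node is an assumption about how the agent asks for fencing.

#!/bin/sh
# Sketch only: detect a DLM that is blocked waiting for fencing and ask the
# cluster fencer to act. Helper names are hypothetical; this is not the real
# ocf:pacemaker:controld implementation.

dlm_waiting_for_fencing() {
    # "wait fencing" in the lockspace status means DLM is blocked until a
    # node is fenced (see the dlm_tool ls output further down in this bug).
    dlm_tool ls 2>/dev/null | grep -q "wait fencing"
}

monitor_extra_check() {
    if dlm_waiting_for_fencing; then
        # Request fencing of the local node so DLM can recover; assumes
        # stonith is configured and enabled in the cluster.
        stonith_admin --reboot "$(crm_node -n)"
        return 1        # OCF_ERR_GENERIC: report the monitor as failed
    fi
    return 0            # OCF_SUCCESS
}

# Example invocation, e.g. from the agent's monitor action:
monitor_extra_check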
Description of problem:
This is a bug for the same problem as bz1241511, but filed against the resource-agents component.
Summary of bz1241511: all nodes lose and then resume quorum without fencing (caused by
network connection loss). This leaves dlm waiting for fencing, which makes clvmd
appear hung, and all access to LVs waits forever.
The resulting situation is a quorate cluster with clvmd-clone and dlm-clone reported
as running correctly, while both are in fact hung. As concluded in bz1241511, the
problem is neither pacemaker's nor dlm's.
I believe the correct resolution of this situation is for clvmd to fail its monitor
action, leading to a fence.
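For a failed clvmd monitor to actually escalate to a fence, the monitor operation would need on-fail=fence. A hypothetical pcs invocation, assuming the resource is simply named clvmd:

# Hypothetical: make a failed monitor on the clvmd resource escalate to
# fencing instead of a plain recovery (the resource name "clvmd" and the
# 30s interval are assumptions about this cluster's configuration).
pcs resource update clvmd op monitor interval=30s on-fail=fence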
Version-Release number of selected component (if applicable):
dlm-4.0.2-5.el7.x86_64
lvm2-cluster-2.02.115-3.el7.x86_64
pacemaker-1.1.12-22.el7.x86_64
corosync-2.3.4-4.el7.x86_64
How reproducible: very frequent
Steps to Reproduce:
1. Have a quorate pacemaker cluster.
2. Check node uptimes.
3. Disable network communication between all nodes with iptables and wait for
   all nodes to turn inquorate (see the example iptables commands after this list).
4. Re-enable network communication between the nodes at the same time.
5. Check whether fencing occurred; if it has not, check dlm status and logs.
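One way to perform step 3 is to drop packets from the peer nodes on every node at once; the addresses below are placeholders for the cluster interconnect IPs:

# On each node, drop all traffic from the other cluster nodes
# (placeholder peer addresses; substitute the real interconnect IPs).
for peer in 192.168.122.11 192.168.122.12; do
    iptables -A INPUT -s "$peer" -j DROP
done

# Later, restore connectivity on all nodes at roughly the same time:
for peer in 192.168.122.11 192.168.122.12; do
    iptables -D INPUT -s "$peer" -j DROP
done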
Actual results: dlm hanging
Expected results: dlm happily working
Additional info:
# tail /var/log/messages
...
Jul 9 12:18:33 virt-020 pengine[2287]: warning: custom_action: Action dlm:2_stop_0 on virt-019 is unrunnable (offline)
Jul 9 12:18:33 virt-020 pengine[2287]: warning: custom_action: Action dlm:2_stop_0 on virt-019 is unrunnable (offline)
Jul 9 12:18:33 virt-020 pengine[2287]: notice: LogActions: Stop dlm:1 (virt-018 - blocked)
Jul 9 12:18:33 virt-020 pengine[2287]: notice: LogActions: Stop dlm:2 (virt-019 - blocked)
Jul 9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon joined 2 needs fencing
Jul 9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon joined 1 needs fencing
Jul 9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 1 stateful merge
Jul 9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 1 stateful merge
Jul 9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 2 stateful merge
Jul 9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 2 stateful merge
Jul 9 12:19:12 virt-020 dlm_controld[2438]: 183 fence work wait to clear merge 2 clean 1 part 0 gone 0
Jul 9 12:19:39 virt-020 dlm_controld[2438]: 210 clvmd wait for fencing
I'm not sure about the exact command.
We should somehow find out that controld is in a non-operational state that can be resolved only by fencing.
'dlm_tool ls' at the moment of the problem shows:
dlm lockspaces
name clvmd
id 0x4104eefa
flags 0x00000004 kern_stop
change member 3 joined 1 remove 0 failed 0 seq 1,1
members 1 2 3
new change member 3 joined 1 remove 0 failed 0 seq 2,4
new status wait fencing
new members 1 2 3
Maybe the 'wait fencing' status gives us this information?
controld is a shell script that calls into the dlm/clvmd, so
> We should somehow find out that controld is in a non-operational state that
> can be resolved only by fencing.
has no meaning.
We can have it look for 'wait fencing' though.
Is there an equivalent command when controld is managing clvmd_controld?
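A sketch of what such a check could look like, keyed on the clvmd lockspace specifically; the awk filter assumes the dlm_tool ls output format shown in the comment above stays stable:

# Sketch: does the clvmd lockspace report "wait fencing"?
# Assumes the dlm_tool ls output format shown earlier in this bug.
dlm_tool ls | awk '
    /^name /       { ls = $2 }
    /wait fencing/ { if (ls == "clvmd") found = 1 }
    END            { exit !found }
' && echo "clvmd lockspace is waiting for fencing"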
Oyvind: BTW, the controld agent is in the pacemaker tree, if you need to find it.
Comment 6 Oyvind Albrigtsen 2015-11-17 09:58:29 UTC
Is there an easy way to reproduce this issue?
I only got it to go into "wait fencing" once, by random timing; the other times it goes into "wait fencing" when it gets network connectivity back, which causes fencing.
Comment 7 Oyvind Albrigtsen 2015-11-19 13:53:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHSA-2016-2578.html