Bug 1306387 - fencing cycle starts before dlm is loaded [NEEDINFO]
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: dlm
Version: 7.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: David Teigland
QA Contact: cluster-qe@redhat.com
Docs Contact:
Depends On:
Blocks:
Reported: 2016-02-10 12:14 EST by Nate Straz
Modified: 2017-07-27 16:47 EDT (History)
CC List: 8 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-07-27 16:47:51 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
kgaillot: needinfo? (nstraz)


Attachments
Systems logs from cluster nodes (115.13 KB, application/x-gzip)
2016-02-10 12:14 EST, Nate Straz
cluster cib (28.91 KB, text/plain)
2016-02-10 12:27 EST, Nate Straz

Description Nate Straz 2016-02-10 12:14:24 EST
Created attachment 1122847 [details]
Systems logs from cluster nodes

Description of problem:

During a long run of our recovery test, revolver, we ended up with a three node cluster that would continuously fence two of its nodes.

The cluster has three nodes, west-02, west-03, and west-04.

1. west-02 and west-03 are chosen by revolver to be killed.
2. west-03 is told to bring down its network interfaces.
3. west-02 is rebooted.
4. west-02 leaves the cluster and west-03 fences it before its network dies.
5. west-02's dlm lockspaces and GFS2 file systems are recovered.
6. west-03 leaves the cluster while west-02 is rebooting and west-04 notices, but can't do anything until quorum is restored.
7. west-02 rejoins the cluster, quorum is restored, and west-02 fences west-03.
8. west-02 starts dlm and stops, waiting for a fence action.
9. west-03 rejoins the cluster.
10. clvmd startup on west-02 times out and is fenced because of a resource failure.
11. west-02 rejoins the cluster.
12. clvmd startup on west-03 times out and is fenced because of a resource failure.
13. goto 9


Version-Release number of selected component (if applicable):
pacemaker-1.1.13-10.el7.x86_64
dlm-4.0.2-6.el7.x86_64
kernel-4.4.0+

How reproducible:
Unknown

Steps to Reproduce:
1. revolver -i 100 -L CITY

Actual results:


Expected results:


Additional info:
Comment 2 Nate Straz 2016-02-10 12:27 EST
Created attachment 1122853 [details]
cluster cib
Comment 3 Andrew Beekhof 2016-02-16 20:57:33 EST
Step 8 - the fencing action it's waiting for is the one at step 7?
How long is clvmd prepared to wait and how does this compare to how long the fencing operation might take?
Comment 4 Nate Straz 2016-02-17 09:54:33 EST
Yes, the fence action in step 7 is the one dlm is waiting for in step 8.  But DLM isn't aware of it because it hasn't started on that node yet.

clvmd waits for about 90 seconds before giving up and exiting with an error.  Fencing should only take a few seconds.  However, if quorum is lost, it'll take until a node reboots and rejoins the cluster.
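
For reference, that 90-second window is just the start timeout configured on the clvmd resource in pacemaker; a rough sketch of inspecting and raising it with pcs (assuming the resource is named clvmd as in the comments here, and the 180s value is only illustrative):

  # show the clvmd resource configuration, including operation timeouts
  pcs resource show clvmd
  # raise the start timeout if fencing can take longer than 90s
  pcs resource update clvmd op start interval=0s timeout=180s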
Comment 5 Ken Gaillot 2016-02-17 20:34:50 EST
I'm still examining the logs, but some observations so far:

* west-04's CPU load is high for the entire time covered by the logs, peaking at 268.049988. It's hard to predict what effect that might have on cluster operation. Any idea what caused the load?

* Corosync doesn't support (or work well with) DHCP (see https://access.redhat.com/solutions/1529133). The nodes should be configured with static IPs.

* Are the clocks on these nodes out of sync? west-04's logs show west-02 was lost at 15:15:29, but west-02's logs don't show the reboot initiated until 15:15:34. That shouldn't cause problems, but it does complicate log analysis.

* We see a message "Merging stonith action reboot for node west-02 originating from client crmd.1852.82f58672 with identical request from stonith-api.4408@west-03.c469c351", but I believe the behavior here is unrelated to the issue affecting merges described in https://bugzilla.redhat.com/show_bug.cgi?id=1077404 , because dlm's request is first in this case, and dlm doesn't hang on west-04. Still, I'd be curious whether the fix made for that bug changes the behavior here at all.
Comment 6 Andrew Beekhof 2016-02-18 22:01:35 EST
> 8. west-02 starts dlm and stops, waiting for a fence action.

Stupid question: why is it waiting?

There's nothing on the pacemaker side to suggest that the dlm on west-02 asked for any fencing to occur, so why does it expect that waiting will achieve anything?

I thought it was supposed to look up when the last fencing event was and initiate a new one if it wasn't recent enough.  I jumped through several hoops (way back when) to make both steps as easy as possible for the dlm.

How reproducible is this? 
Can we get better logs from the dlm(_controld)?

In their absence, it's hard to point the finger at pacemaker.
If the dlm wants fencing to occur, it needs to ask for it. 
All the tools necessary are available.
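
For what it's worth, both of those steps are reachable from the command line through pacemaker's fencing tooling, which is what the dlm's stonith helper is expected to drive; a hand-run sketch of the idea (node name taken from the report, commands only illustrate the lookup-then-request pattern):

  # when was west-03 last fenced, according to pacemaker?
  stonith_admin --history west-03 --verbose
  # if that isn't recent enough, request a new fencing action
  stonith_admin --reboot west-03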
Comment 7 Nate Straz 2016-02-19 09:56:40 EST
(In reply to Ken Gaillot from comment #5)
> I'm still examining the logs, but some observations so far:
> 
> * west-04's CPU load is high for the entire time covered by the logs,
> peaking at 268.049988. It's hard to predict what effect that might have on
> cluster operation. Any idea what caused the load?

When fencing is delayed by loss of quorum, GFS2-related kworker threads can drive the load average up.  We have not found the cause of this yet.  All of those threads are blocked on glocks and not using CPU time.

> * Corosync doesn't support (or work well with) DHCP (see
> https://access.redhat.com/solutions/1529133). The nodes should be configured
> with static IPs.

All hosts are given specific (static) IPs based on their MAC address.  QE has worked this way for 10+ years without issues.  We also have our DHCP server as part of an HA cluster to avoid a single point of failure.

> * Are the clocks on these nodes out of sync? west-04's logs show west-02 was
> lost at 15:15:29, but west-02's logs don't show the reboot initiated until
> 15:15:34. That shouldn't cause problems, but it does complicate log analysis.

All nodes are running ntpd and are kept in sync as much as ntpd will allow.  I do find they can become out of sync from time to time.
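
A quick way to capture how far apart the clocks actually are when logs are gathered, assuming ntpd is the time source on all nodes:

  # per-peer offset and jitter (in ms); run on each node
  ntpq -p
  # one-line summary of sync state
  ntpstat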
Comment 8 Andrew Beekhof 2016-02-21 21:13:03 EST
(In reply to Nate Straz from comment #7)
> (In reply to Ken Gaillot from comment #5)
> > * Corosync doesn't support (or work well with) DHCP (see
> > https://access.redhat.com/solutions/1529133). The nodes should be configured
> > with static IPs.
> 
> All hosts are given specific (static) IPs based on their MAC address.  QE
> has worked this way for 10+ years without issues. 

Really?  

I was certainly under the impression that there is a history of NetworkManager causing network issues when refreshing DHCP leases, even with fixed assignment tables.
Comment 9 Fabio Massimo Di Nitto 2016-02-21 23:38:07 EST
(In reply to Andrew Beekhof from comment #8)
> (In reply to Nate Straz from comment #7)
> > (In reply to Ken Gaillot from comment #5)
> > > * Corosync doesn't support (or work well with) DHCP (see
> > > https://access.redhat.com/solutions/1529133). The nodes should be configured
> > > with static IPs.
> > 
> > All hosts are given specific (static) IPs based on their MAC address.  QE
> > has worked this way for 10+ years without issues. 
> 
> Really?  
> 
> I was certainly under the impression that there is a history of
> NetworkManager causing network issues when refreshing DHCP leases, even with
> fixed assignment tables.

We have seen this issue specifically around RHEL7.0/7.1 and I remember spending a great deal of time debugging it. We can use NM now, but no dhcp. That's still a no go. Though depending on the lease time, you might or might not see the issue at all. If the lease is very short, it's more likely to trigger the problem. A very long lease might not be noticed if the test time is shorter than the lease.
Comment 10 Nate Straz 2016-02-22 10:12:03 EST
(In reply to Fabio Massimo Di Nitto from comment #9)
> We have seen this issue specifically around RHEL7.0/7.1 and I remember
> spending a great deal of time debugging it. We can use NM now, but no dhcp.
> That's still a no go. Though depending on the lease time, you might or might
> not see the issue at all. If the lease is very short, it's more likely to
> trigger the problem. A very long lease might not be noticed if the test time
> is shorter than the lease.

Our default lease time is 5 days.

Regardless, if this had happened you would see NetworkManager messages in the logs, and there are none.
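
For completeness, a sketch of the kind of check meant here, assuming NetworkManager/dhclient handle the leases on these hosts (the date is illustrative):

  # look for lease renewals or address changes around the failure window
  journalctl -u NetworkManager --since "2016-02-09" | grep -iE 'dhcp|lease'
  grep -iE 'dhclient|DHCPREQUEST|DHCPACK' /var/log/messages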
Comment 11 Ken Gaillot 2016-05-09 10:19:38 EDT
Reassigning to dlm to get another perspective.

From what I see, Pacemaker and the resource agent are reacting properly when the stop fails during recovery after the start timeout. The problem seems to be that DLM neither asks for fencing nor recognizes that fencing has already happened.
Comment 12 David Teigland 2016-05-09 11:17:16 EDT
> 7. west-02 rejoins the cluster, quorum is restored, and west-02 fences west-03.

If power fencing is used, then 3 should be rebooting after this step.

> 8. west-02 starts dlm and stops, waiting for a fence action.

The fencing operation in step 7 should not take long, so this shouldn't wait long.

> 9. west-03 rejoins the cluster.

Is 2 still waiting for step 7 to finish when 3 rejoins, or did the fencing operation fail in step 7?  Did 3 restart dlm_controld after starting the cluster?  If so, then 2 would cancel/clear the fencing operation against it.
The dlm_controld logs would explain.

> 10. clvmd startup on west-02 times out and is fenced because of a resource failure.

If clvmd is configured to time out, then you get what you asked for. The same if you've asked to fence the machine if clvmd times out.  This step is always going to cause unreliable/unstable behavior and is fairly uninteresting.

As far as I'm concerned, if step 10 is simply taken away, things would probably have worked out ok after step 9, in spite of the unknown problems in steps 7-9 (which should still be answered from log files.)
Comment 13 Ken Gaillot 2016-05-16 14:03:57 EDT
(In reply to David Teigland from comment #12)
> > 7. west-02 rejoins the cluster, quorum is restored, and west-02 fences west-03.
> 
> If power fencing is used, then 3 should be rebooting after this step.

Correct, you can see in the attached logs that west-02 reports the fencing successful at 15:18:41, and west-03 shows OS boot messages starting at 15:21:00. (Note, I am not sure how closely the clocks are synchronized between nodes.)

> > 8. west-02 starts dlm and stops, waiting for a fence action.
> 
> The fencing operation in step 7 should not take long, so this shouldn't wait
> long.

But it seems to wait forever. Note that west-03 starts DLM after the cluster had already requested and executed the fencing; DLM appears unable to tell that this has just happened, and also does not issue a fencing request of its own.

> > 9. west-03 rejoins the cluster.
> 
> Is 2 still waiting for step 7 to finish when 3 rejoins, or did the fencing
> operation fail in step 7?  Did 3 restart dlm_controld after starting the
> cluster? 

dlm_controld is a resource managed by the cluster, so it is not started at boot; the cluster will start it after the node joins the cluster and quorum is available. You can see it start on west-03 at 15:21:58 in the attached logs.

> If so, then 2 would cancel/clear the fencing operation against it.
> The dlm_controld logs would explain.

Are the attached logs enough? 

> > 10. clvmd startup on west-02 times out and is fenced because of a resource failure.
> 
> If clvmd is configured to time out, then you get what you asked for. The
> same if you've asked to fence the machine if clvmd times out.  This step is
> always going to cause unreliable/unstable behavior and is fairly
> uninteresting.

Once DLM is stuck waiting for fencing, it is expected that clvmd will time out. Can clvmd do anything but time out if it is unable to use DLM?

> As far as I'm concerned, if step 10 is simply taken away, things would
> probably have worked out ok after step 9, in spite of the unknown problems
> in steps 7-9 (which should still be answered from log files.)

I assume you mean here that fencing is not necessary. In this particular configuration, the user has configured on-fail=fence for dlm_controld and clvmd. I am not sure why that would be desired here, but it is a supported use case, and customers do sometimes use it.
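
For context, a configuration along the lines described here (dlm_controld and clvmd as interleaved clones with on-fail=fence) is usually built with pcs roughly as below; this is only a sketch of the pattern, the attached cib reflects the actual test configuration:

  pcs resource create dlm ocf:pacemaker:controld \
      op monitor interval=30s on-fail=fence clone interleave=true ordered=true
  pcs resource create clvmd ocf:heartbeat:clvm \
      op monitor interval=30s on-fail=fence clone interleave=true ordered=true
  pcs constraint order start dlm-clone then clvmd-clone
  pcs constraint colocation add clvmd-clone with dlm-clone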
Comment 14 David Teigland 2016-05-16 14:41:33 EDT
> > The fencing operation in step 7 should not take long, so this shouldn't wait
> > long.
> 
> But it seems to wait forever. 

We need specifics to investigate that, i.e. dlm_controld logs, dlm_tool output.  What's the status of the dlm's stonith helper/agent on west-02 while it's waiting?  Was it never run?  Is it running?  If so, what times is it comparing?  Or does dlm_controld indicate it failed?

Does the version include this patch?
https://git.fedorahosted.org/cgit/dlm.git/commit/?id=fb61984c9388cbbcc02c6a09c09948b21320412d

> > If so, then 2 would cancel/clear the fencing operation against it.
> > The dlm_controld logs would explain.
> 
> Are the attached logs enough?

We need dlm_controld logs (and dlm_tool output might be nice too)... have I missed those somewhere?
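
For the record, that information can be gathered with the dlm tooling on the node that is stuck waiting (west-02), ideally with daemon debugging enabled beforehand, e.g. log_debug=1 in /etc/dlm/dlm.conf (a sketch; see dlm_controld(8) for the exact options):

  dlm_tool status   # daemon/node state, including nodes it thinks need fencing
  dlm_tool ls       # lockspaces and their change/recovery state
  dlm_tool dump     # dlm_controld debug buffer, including its fencing decisions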
Comment 15 Ken Gaillot 2016-05-16 15:23:15 EDT
Nate,

Do you have the additional logging mentioned in Comment 14? Or can you try to reproduce it and collect the info?

David, the Pacemaker version used is the latest 7.2 package, so I would guess the DLM packages are the same -- would they have that commit?
Comment 16 David Teigland 2016-05-16 15:34:12 EDT
> David, the Pacemaker version used is the latest 7.2 package, so I would
> guess the DLM packages are the same -- would they have that commit?

That's bug 1115143 which wasn't approved for 7.2, but it'll be in 7.3 due to the rebase.
Comment 17 Nate Straz 2016-05-16 15:51:44 EDT
I don't have those logs from that run.  I am trying to reproduce it but I haven't had success yet.
Comment 18 Nate Straz 2016-06-01 17:28:24 EDT
I was able to reproduce something similar and added some information to bug 1321079 which may be the root of the problem.  dlm_controld could get stuck on dlm if it tries to free gfs2 structures while allocating memory for itself.
Comment 19 Ken Gaillot 2016-06-23 15:59:15 EDT
Nate, the latest 7.3 builds of dlm (4.0.6-1) and pacemaker (1.1.15-3) have fixes that may be relevant here, particularly the fixes for Bug 1077404, Bug 1115143, and Bug 1268313. If you have the chance to try reproducing with those packages, that would help narrow down whether/where this is still an issue.
Comment 20 Jonathan Earl Brassow 2017-07-27 16:47:51 EDT
Nate, we believe this to be fixed.  Since it has been a year w/o comment, I'm going to close this bug.  Feel free to reopen if I'm wrong.
