| Field | Value |
|---|---|
| Summary | fencing cycle starts before dlm is loaded |
| Product | Red Hat Enterprise Linux 7 |
| Component | dlm |
| Version | 7.1 |
| Status | CLOSED CURRENTRELEASE |
| Reporter | Nate Straz <nstraz> |
| Assignee | David Teigland <teigland> |
| QA Contact | cluster-qe <cluster-qe> |
| Severity | unspecified |
| Priority | unspecified |
| CC | abeekhof, cluster-maint, fdinitto, jbrassow, kgaillot, nstraz, rpeterso, teigland |
| Target Milestone | rc |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Regression | --- |
| Last Closed | 2017-07-27 20:47:51 UTC |
Created attachment 1122853 [details]
cluster cib
> Step 8 - the fencing action it's waiting for is the one at step 7? How long
> is clvmd prepared to wait, and how does this compare to how long the fencing
> operation might take?

Yes, the fence action in step 7 is the one dlm is waiting for in step 8. But DLM isn't aware of it because it hasn't started on that node yet. clvmd waits for about 90 seconds before giving up and exiting with an error. Fencing should only take a few seconds; however, if quorum is lost, it will take until a node reboots and rejoins the cluster.

I'm still examining the logs, but some observations so far:

* west-04's CPU load is high for the entire time covered by the logs, peaking at 268.049988. It's hard to predict what effect that might have on cluster operation. Any idea what caused the load?
* Corosync doesn't support (or work well with) DHCP (see https://access.redhat.com/solutions/1529133). The nodes should be configured with static IPs.
* Are the clocks on these nodes out of sync? west-04's logs show west-02 was lost at 15:15:29, but west-02's logs don't show the reboot initiated until 15:15:34. That shouldn't cause problems, but it does complicate log analysis.
* We see a message "Merging stonith action reboot for node west-02 originating from client crmd.1852.82f58672 with identical request from stonith-api.4408", but I believe the behavior here is unrelated to the issue affecting merges described in https://bugzilla.redhat.com/show_bug.cgi?id=1077404, because dlm's request is first in this case, and dlm doesn't hang on west-04. Still, I'd be curious whether the fix made for that bug changes the behavior here at all.

> 8. west-02 starts dlm and stops, waiting for a fence action.
Stupid question: why is it waiting?
There's nothing on the pacemaker side to suggest that the dlm on west-02 asked for any fencing to occur, so why does it expect that waiting will achieve anything?
I thought it was supposed to look up when the last fencing event was and initiate a new one if it wasn't recent enough. I jumped through several hoops (way back when) to make both steps as easy as possible for the dlm.
How reproducible is this?
Can we get better logs from the dlm(_controld)?
In their absence, it's hard to point the finger at pacemaker.
If the dlm wants fencing to occur, it needs to ask for it.
All the tools necessary are available.
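The check Andrew describes above -- look up the last fencing event for the node and only initiate a new one if it wasn't recent enough -- can be sketched as follows. This is a hypothetical illustration of the decision logic, not dlm_controld's actual code; the function name and parameters are made up.

```python
def fencing_needed(node_fail_time, last_fence_time):
    """Decide whether a failed node still needs a new fencing request.

    A recorded fence event only covers the failure if it happened at or
    after the moment the node was declared failed; an older event predates
    the failure and cannot have cleaned up the node's state.
    """
    if last_fence_time is None:
        return True  # no fencing on record at all: must request one
    # Fenced before the failure -> stale, request again; otherwise the
    # existing (or in-flight) fence action is sufficient and waiting is fine.
    return last_fence_time < node_fail_time

# Node declared failed at t=100; last fence at t=95 predates the failure.
assert fencing_needed(node_fail_time=100, last_fence_time=95) is True
# A fence at t=105 happened after the failure, so no new request is needed.
assert fencing_needed(node_fail_time=100, last_fence_time=105) is False
```

The interesting failure mode in this bug is the third case: a fence action that was requested and executed by the cluster before dlm_controld started on the node, which dlm apparently neither sees in the history nor re-requests.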
(In reply to Ken Gaillot from comment #5)
> * west-04's CPU load is high for the entire time covered by the logs,
> peaking at 268.049988. It's hard to predict what effect that might have on
> cluster operation. Any idea what caused the load?

When fencing is delayed by loss of quorum, GFS2-related kworker threads can spike. We have not found the cause of this yet. All of those threads are blocked on glocks and not using CPU time.

> * Corosync doesn't support (or work well with) DHCP (see
> https://access.redhat.com/solutions/1529133). The nodes should be configured
> with static IPs.

All hosts are given specific (static) IPs based on their MAC address. QE has worked this way for 10+ years without issues. We also have our DHCP server as part of an HA cluster to avoid a single point of failure.

> * Are the clocks on these nodes out of sync? west-04's logs show west-02 was
> lost at 15:15:29, but west-02's logs don't show the reboot initiated until
> 15:15:34. That shouldn't cause problems, but it does complicate log analysis.

All nodes are running ntpd and are kept in sync as much as ntpd will allow. I do find they can become out of sync from time to time.

(In reply to Nate Straz from comment #7)
> (In reply to Ken Gaillot from comment #5)
> > * Corosync doesn't support (or work well with) DHCP (see
> > https://access.redhat.com/solutions/1529133). The nodes should be configured
> > with static IPs.
>
> All hosts are given specific (static) IPs based on their MAC address. QE
> has worked this way for 10+ years without issues.

Really? I was certainly under the impression that there is a history of NetworkManager causing network issues when refreshing DHCP leases, even with fixed assignment tables.
(In reply to Andrew Beekhof from comment #8)
> Really?
>
> I was certainly under the impression that there is a history of
> NetworkManager causing network issues when refreshing DHCP leases, even with
> fixed assignment tables.

We have seen this issue specifically around RHEL 7.0/7.1, and I remember spending a great deal of time debugging it. We can use NM now, but no DHCP; that's still a no-go. Though depending on the lease time, you might or might not see the issue at all: if the lease is very short, it's more likely to trigger the problem, while a very long lease might not be noticed if the test time is shorter than the lease.

(In reply to Fabio Massimo Di Nitto from comment #9)
> We have seen this issue specifically around RHEL 7.0/7.1, and I remember
> spending a great deal of time debugging it. We can use NM now, but no DHCP;
> that's still a no-go. Though depending on the lease time, you might or might
> not see the issue at all: if the lease is very short, it's more likely to
> trigger the problem, while a very long lease might not be noticed if the
> test time is shorter than the lease.

Our default lease time is 5 days. Regardless, you would see the NetworkManager messages in the logs if this did happen, and it didn't.

Reassigning to dlm to get another perspective. From what I see, Pacemaker and the resource agent are reacting properly when the stop fails during recovery after the start timeout. The problem seems to be that DLM neither asks for fencing nor recognizes that fencing has already happened.

> 7. west-02 rejoins the cluster, quorum is restored, and west-02 fences west-03.
If power fencing is used, then 3 should be rebooting after this step.

> 8. west-02 starts dlm and stops, waiting for a fence action.

The fencing operation in step 7 should not take long, so this shouldn't wait long.

> 9. west-03 rejoins the cluster.

Is 2 still waiting for step 7 to finish when 3 rejoins, or did the fencing operation fail in step 7? Did 3 restart dlm_controld after starting the cluster? If so, then 2 would cancel/clear the fencing operation against it. The dlm_controld logs would explain.

> 10. clvmd startup on west-02 times out and is fenced because of a resource failure.

If clvmd is configured to time out, then you get what you asked for. The same if you've asked to fence the machine if clvmd times out. This step is always going to cause unreliable/unstable behavior and is fairly uninteresting. As far as I'm concerned, if step 10 were simply taken away, things would probably have worked out OK after step 9, in spite of the unknown problems in steps 7-9 (which should still be answered from log files).

(In reply to David Teigland from comment #12)
> > 7. west-02 rejoins the cluster, quorum is restored, and west-02 fences west-03.
>
> If power fencing is used, then 3 should be rebooting after this step.

Correct; you can see in the attached logs that west-02 reports the fencing successful at 15:18:41, and west-03 shows OS boot messages starting at 15:21:00. (Note, I am not sure how closely the clocks are synchronized between nodes.)

> > 8. west-02 starts dlm and stops, waiting for a fence action.
>
> The fencing operation in step 7 should not take long, so this shouldn't
> wait long.

But it seems to wait forever. Note that west-03 starts DLM after the cluster had already requested and executed the fencing; DLM appears unable to tell that this has just happened, and also does not issue a fencing request of its own.

> > 9. west-03 rejoins the cluster.
>
> Is 2 still waiting for step 7 to finish when 3 rejoins, or did the fencing
> operation fail in step 7? Did 3 restart dlm_controld after starting the
> cluster?

dlm_controld is a resource managed by the cluster, so it is not started at boot; the cluster will start it after the node joins the cluster and quorum is available. You can see it start on west-03 at 15:21:58 in the attached logs.

> If so, then 2 would cancel/clear the fencing operation against it.
> The dlm_controld logs would explain.

Are the attached logs enough?

> > 10. clvmd startup on west-02 times out and is fenced because of a resource failure.
>
> If clvmd is configured to time out, then you get what you asked for. The
> same if you've asked to fence the machine if clvmd times out. This step is
> always going to cause unreliable/unstable behavior and is fairly
> uninteresting.

Once DLM is stuck waiting for fencing, it is expected that clvmd will time out. Can clvmd do anything but time out if it is unable to use DLM?

> As far as I'm concerned, if step 10 were simply taken away, things would
> probably have worked out OK after step 9, in spite of the unknown problems
> in steps 7-9 (which should still be answered from log files).

I assume you mean here that fencing is not necessary. In this particular configuration, the user has configured on-fail=fence for dlm_controld and clvmd. I am not sure why that would be desired here, but it is a supported use case, and customers do sometimes use it.

> > The fencing operation in step 7 should not take long, so this shouldn't
> > wait long.
>
> But it seems to wait forever.

We need specifics to investigate that, i.e. dlm_controld logs and dlm_tool output. What's the status of the dlm's stonith helper/agent on west-02 while it's waiting? Was it never run? Is it running? If so, what times is it comparing? Or does dlm_controld indicate it failed? Does the version include this patch?
https://git.fedorahosted.org/cgit/dlm.git/commit/?id=fb61984c9388cbbcc02c6a09c09948b21320412d

> > If so, then 2 would cancel/clear the fencing operation against it.
> > The dlm_controld logs would explain.
>
> Are the attached logs enough?

We need dlm_controld logs (and dlm_tool output might be nice too)... have I missed those somewhere?

Nate, do you have the additional logging mentioned in Comment 14? Or can you try to reproduce it and collect the info?

David, the Pacemaker version used is the latest 7.2 package, so I would guess the DLM packages are the same -- would they have that commit?

> David, the Pacemaker version used is the latest 7.2 package, so I would
> guess the DLM packages are the same -- would they have that commit?

That's bug 1115143, which wasn't approved for 7.2, but it'll be in 7.3 due to the rebase.

I don't have those logs from that run. I am trying to reproduce it but I haven't had success yet.

I was able to reproduce something similar and added some information to bug 1321079, which may be the root of the problem. dlm_controld could get stuck on dlm if it tries to free gfs2 structures while allocating memory for itself.

Nate, the latest 7.3 builds of dlm (4.0.6-1) and pacemaker (1.1.15-3) have fixes that may be relevant here, particularly the fixes for Bug 1077404, Bug 1115143, and Bug 1268313. If you have the chance to try reproducing with those packages, that would help narrow down whether/where this is still an issue.

Nate, we believe this to be fixed. Since it has been a year without comment, I'm going to close this bug. Feel free to reopen if I'm wrong.
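For future reproductions, the dlm_controld debug output requested above can be captured roughly as follows. This is a diagnostic/config sketch based on dlm 4.x defaults; verify option names against dlm.conf(5) on the installed version.

```shell
# Enable persistent dlm_controld debug logging in /etc/dlm/dlm.conf,
# then restart the dlm resource so dlm_controld picks it up:
#   log_debug=1

# While a node is stuck "waiting for fencing", collect runtime state:
dlm_tool ls        # lockspaces and their change/recovery state
dlm_tool status    # daemon view of cluster members and fencing status
dlm_tool dump      # dlm_controld's in-memory debug buffer
```

Collecting this on the stuck node (west-02 in step 8) would show whether the stonith helper ran, and what fence times it compared.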
Created attachment 1122847 [details]
Systems logs from cluster nodes

Description of problem:
During a long run of our recovery test, revolver, we ended up with a three-node cluster that would continuously fence two of its nodes. The cluster has three nodes: west-02, west-03, and west-04.

1. west-02 and west-03 are chosen by revolver to be killed.
2. west-03 is told to bring down its network interfaces.
3. west-02 is rebooted.
4. west-02 leaves the cluster and west-03 fences it before its network dies.
5. Recovery runs for west-02's dlm lockspaces and GFS2 file systems.
6. west-03 leaves the cluster while west-02 is rebooting; west-04 notices, but can't do anything until quorum is restored.
7. west-02 rejoins the cluster, quorum is restored, and west-02 fences west-03.
8. west-02 starts dlm and stops, waiting for a fence action.
9. west-03 rejoins the cluster.
10. clvmd startup on west-02 times out and is fenced because of a resource failure.
11. west-02 rejoins the cluster.
12. clvmd startup on west-03 times out and is fenced because of a resource failure.
13. goto 9

Version-Release number of selected component (if applicable):
pacemaker-1.1.13-10.el7.x86_64
dlm-4.0.2-6.el7.x86_64
kernel-4.4.0+

How reproducible:
Unknown

Steps to Reproduce:
1. revolver -i 100 -L CITY

Actual results:

Expected results:

Additional info:
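For context, the on-fail=fence configuration discussed in this bug corresponds roughly to the standard RHEL 7 dlm/clvmd stack set up with pcs. The sketch below is an illustrative reconstruction, not the reporter's actual cib; resource names and intervals are assumptions.

```shell
# Hypothetical reconstruction of a dlm/clvmd clone stack where a failed
# monitor or stop on either resource fences the node (on-fail=fence):
pcs resource create dlm ocf:pacemaker:controld \
    op monitor interval=30s on-fail=fence \
    clone interleave=true ordered=true
pcs resource create clvmd ocf:heartbeat:clvm \
    op monitor interval=30s on-fail=fence \
    clone interleave=true ordered=true

# clvmd depends on dlm: start order and colocation.
pcs constraint order start dlm-clone then clvmd-clone
pcs constraint colocation add clvmd-clone with dlm-clone
```

With this configuration, a clvmd start timeout (steps 10 and 12 above) triggers fencing, which is what turns a single stuck dlm wait into the continuous fencing cycle described.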