Bug 1394068
Summary: Pacemaker's lrmd crashes after certain systemd errors
Product: Red Hat Enterprise Linux 7
Component: pacemaker
Version: 7.2
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: urgent
Reporter: Ken Gaillot <kgaillot>
Assignee: Ken Gaillot <kgaillot>
QA Contact: cluster-qe <cluster-qe>
CC: 252150241, abeekhof, cfeist, cluster-maint, cluster-qe, jkurik, kgaillot, mjuricek, mnavrati, royoung, tlavigne, vlad.socaciu
Keywords: ZStream
Target Milestone: rc
Target Release: 7.3
Hardware: Unspecified
OS: Unspecified
Fixed In Version: pacemaker-1.1.15-11.el7_3.2
Doc Type: If docs needed, set a value
Clone Of: 1299339
Bug Depends On: 1284069, 1299339
Last Closed: 2017-01-16 17:23:17 UTC
Description (Ken Gaillot, 2016-11-10 23:57:57 UTC)
The core backtrace (uploaded to the original bug) shows that an invalid memory address is being passed to a string-comparison function. Further investigation will be needed to identify the ultimate cause; my first suspicion is a use-after-free issue.

Created attachment 1219592 [details]
Another Set of Crash Data
This is the crash data from the other cluster node, in case it helps. The crash happened around the same time as the other one, so it may have the same or a similar root cause (or maybe not).
The crash occurred immediately after:

Nov 08 21:32:01 uis1ccg2.uismag.com lrmd[1563]: warning: Cannot execute '/opt/unisys_ccg_gid/cluster_scripts/ccg_gid_resource_agent_1.sh': No such file or directory (2)

so correcting that issue may avoid the crash. Of course, I will try to locate and correct the problem in pacemaker's handling of such an error.

I am unsure how a resource agent under /opt can be executed. Can you attach the tar file created by the following command, so I can examine the full configuration and logs?

pcs cluster report --from "2016-11-08 20:00:00" --to "2016-11-08 22:00:00"

(In reply to Ken Gaillot from comment #4)
> I am unsure how a resource agent under /opt can be executed. Can you attach
> the tar file created by the following command, so I can examine the full
> configuration and logs?
>
> pcs cluster report --from "2016-11-08 20:00:00" --to "2016-11-08 22:00:00"

Using the --force argument to "pcs resource create", e.g.:

pcs resource create resource_ccg_gid1 lsb:/opt/unisys_ccg_gid/cluster_scripts/ccg_gid_resource_agent_1.sh --force
[... etc.]

The report is still running. I struggled with the report creation (see below for the output displayed at the end of the command). It seems that the report ends somewhat abnormally, no matter what. I am not sure whether I had to give the .tar extension to the report file specified on the command line, but without .tar, the program created a report only a few hundred bytes long. With the .tar extension, "pcs cluster report" appends another .tar before compressing. Nonetheless, I cannot submit the report; Bugzilla limits uploads to about 20 MB and the report is about 167 MB. I will see if I can make it available some other way.
uis1ccg1: WARN: Could not find the program path for core /var/lib/pacemaker/cores/core.10259
[... the same warning repeated for cores 10352, 10406, 10430, 10477, 10524, 10601, 10632, 10732, 10756, 10926, 11323, 11381, 11557, 11818, and 12093]
uis1ccg1: WARN: Unknown time format used by:
[... the same warning repeated many times]
uis1ccg1: WARN: Skipping bad logfile '/var/log/cluster/corosync.log-20161109.gz': Could not determine log dates
uis1ccg1: Including segment [7518-51881] from /var/log/messages-20161113
uis1ccg1: WARN: Unknown time format used by:
[... the same warning repeated many times]
uis1ccg1: WARN: Skipping bad logfile '/var/log/boot.log': Could not determine log dates
uis1ccg1: Including segment [1478663990-1478671210] from journald
tar: This does not look like a tar archive
tar: Exiting with failure status due to previous errors
Connection closed by 10.0.1.46
tar: This does not look like a tar archive
tar: Exiting with failure status due to previous errors
uis1ccg1: Collected results are available in
/root/cluster_rep_2016-11-08_from_20_to_22.tar.tar.bz2
uis1ccg1: Please create a bug entry at
uis1ccg1: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
uis1ccg1: Include a description of your problem and attach this tarball
uis1ccg1: Thank you for taking time to create this report.

The good news is I was able to reproduce this with the pacemaker packages in RHEL 7.2, but not with the just-released packages from 7.3. So I believe this issue is already fixed. If you can test the 7.3 packages, let me know if you see any issues. I will continue to investigate to see if I can find the exact cause and fix.

On a side note, the fencing topology is unnecessary in your configuration, though it doesn't hurt. Topology is needed when a single host requires two fence devices (for example, redundant power supplies can require two separate power switches, or someone might want to try IPMI to fence first and, if that fails, try a power switch). Topology is per-target, so your configuration has two topologies (targets), each with a single fence device.

I believe the issue was fixed by upstream commits 67d68df1 and 786ebc41, which corrected potential use-after-free issues in the handling of in-flight operations.

(In reply to Ken Gaillot from comment #7)
> The good news is I was able to reproduce this with the pacemaker packages in
> RHEL 7.2 but not with the just-released packages from 7.3. So, I believe
> this issue is already fixed. If you can test the 7.3 packages, let me know
> if you see any issues.
>
> I will continue to investigate to see if I can find the exact cause and fix.
>
> On a side note, the fencing topology is unnecessary in your configuration,
> though it doesn't hurt. Topology is needed when a single host requires two
> fence devices (for example, redundant power supplies can require two
> separate power switches, or someone might want to try IPMI to fence first,
> and if that fails, try a power switch).
> Topology is per-target, so your configuration has two topologies (targets),
> each with a single fence device.

May my ignorance be forgiven, but I wanted to understand the commentary about the "fencing topology", and right now I do not. What is a "fencing topology" in plain words: is it the fencing device that my colleague created?

# pcs stonith show --full
Resource: vmfence (class=stonith type=fence_vmware_soap)
 Attributes: pcmk_host_map=uis1ccg1-app:uis1ccg1-vm;uis1ccg2-app:uis1ccg2-vm ipaddr=10.0.0.30 login=administrator passwd=2Securepw pcmk_monitor_timeout=120s pcmk_host_list=uis1ccg1-app,uis1ccg2-app ipport=443 inet4_only=1 ssl_insecure=1 power_wait=3 pcmk_host_check=static-list
 Operations: monitor interval=60s (vmfence-monitor-interval-60s)
Node: uis1ccg1-app
 Level 1 - vmfence
Node: uis1ccg2-app
 Level 2 - vmfence

Or is it something else? Did you mean to say that we did not need fencing at all? My understanding (from the manual) is that "fencing" is a mechanism which disallows a node from accessing the resources when some piece of software (the fencing software, I guess) decides that the node is unresponsive. The immediate objection to this "definition" is: if the node is unresponsive, how could it be forced not to access the resources? Perhaps the other node (or nodes) could have access to some overriding mechanism which could ensure that the unresponsive node is really disabled (such as automatically turning the power off on that node via a software mechanism). Let alone the "deathmatch" (I found the term on the internet: http://ourobengr.com/ha/) which could occur if there are only two nodes in the configuration. I am sorry to admit that the following paragraph is beyond me:

"Topology is needed when a single host requires two fence devices (for example, redundant power supplies can require two separate power switches, or someone might want to try IPMI to fence first, and if that fails, try a power switch).
Topology is per-target, so your configuration has two topologies (targets), each with a single fence device."

I would appreciate it if you could recommend a bibliography for this topic, if there is one. Thanks a lot.

Sorry, I wasn't clear. Your fencing *devices* are necessary, and are configured correctly. These are created with "pcs stonith create". In addition to the devices, you also have fencing topologies, which are created by "pcs stonith level add". In your configuration, they don't hurt, but they don't do anything useful, either.

A fencing device tells the cluster how to kill a node using a specific method (in your case, vmware's SOAP interface). A fencing topology tells the cluster that more than one device is needed to kill a node, which isn't true in your case. For example, a common situation for physical servers is to try on-board IPMI to kill the node first, but if the IPMI doesn't respond, try an intelligent power switch. In Pacemaker, the IPMI and the power switch would each be configured as a device, and a topology would tell Pacemaker to try the IPMI first and the power switch second.

The reference is: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/High_Availability_Add-On_Reference/ch-fencing-HAAR.html

I deleted the resource agents accidentally, while the resources were running. The node where they were running was brought down (by the fence, I assume) after lrmd crashed. After that, I was able to stop and delete the resources from the other node. But as soon as I create new resources, the node where the incident occurred initially keeps rebooting every three minutes or so. I don't know how to escape this quagmire. Any advice would be appreciated!

(In reply to vlad.socaciu from comment #12)
> I deleted the resource agents accidentally, while the resources were
> running. The node where they were running was brought down (by the fence, I
> assume), after lrmd crashed.
> After that, I was able to stop and delete the
> resources from the other node. But as soon as I create new resources, the
> node where the incident occurred initially keeps rebooting every three
> minutes, or so.
>
> Don't know how to escape this quagmire.
>
> Any advice would be appreciated!

I got the reboot loop to stop. The circumstances of the incident were the following:

- node A didn't have the resource agents
- resources had been created on node B but not started

If these two conditions were met, node A was rebooting continuously. I deleted the resources and copied the resource agents to node A. After that, I could create resources again and start them successfully.

As a side note: the web interfaces on both nodes take a long time (about 5 minutes or more) before they display both nodes as "error free". In the first 5 minutes, both web interfaces showed pcsd as not running on node B.

(In reply to vlad.socaciu from comment #13)
> I got the reboot loop to stop. The circumstances of the incident were the
> following:
>
> - node A didn't have the resource agents
> - resources had been created on node B but not started
>
> If these two conditions were met, node A was rebooting continuously. I
> deleted the resources and copied the resource agents to node A. After that,
> I could create resources again and start them successfully.

This makes sense. Most likely, the cluster tried to probe the status of the resources on A and couldn't, so it then tried to execute the stop action to be sure they weren't running. When the stop failed too, the cluster fenced the node. If the node rejoined, the same thing would happen again (until the agents existed).

This issue is believed to be resolved in the 7.3 release packages. If it recurs with those packages, we can re-open for further investigation.