Bug 1394068

Summary: Pacemaker's lrmd crashes after certain systemd errors
Product: Red Hat Enterprise Linux 7
Component: pacemaker
Version: 7.2
Reporter: Ken Gaillot <kgaillot>
Assignee: Ken Gaillot <kgaillot>
QA Contact: cluster-qe <cluster-qe>
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: urgent
Keywords: ZStream
Target Milestone: rc
Target Release: 7.3
Hardware: Unspecified
OS: Unspecified
Fixed In Version: pacemaker-1.1.15-11.el7_3.2
Clone Of: 1299339
Bug Depends On: 1284069, 1299339
Last Closed: 2017-01-16 17:23:17 UTC
CC: 252150241, abeekhof, cfeist, cluster-maint, cluster-qe, jkurik, kgaillot, mjuricek, mnavrati, royoung, tlavigne, vlad.socaciu

Attachments: Another Set of Crash Data

Description Ken Gaillot 2016-11-10 23:57:57 UTC
+++ This bug was initially created as a clone of Bug #1299339 +++

--- Additional comment from  on 2016-11-10 18:46:30 EST ---

lrmd still crashes quite consistently in pacemaker-1.1.13-10.el7_2.4.x86_64. Here are just two examples:

/opt/unisys_ccg_gid/cluster_scripts #abrt-cli list --since 1478651531
id b10052d2f8f15628c3343c336a446ed40078a16b
reason:         lrmd killed by SIGSEGV
time:           Tue 08 Nov 2016 09:33:42 PM PST
cmdline:        /usr/libexec/pacemaker/lrmd
package:        pacemaker-1.1.13-10.el7_2.4
uid:            0 (root)
count:          1
Directory:      /var/spool/abrt/ccpp-2016-11-08-21:33:42-1563

/var/log/cluster #abrt-cli list
id b01b01d4708d3664f696ae9bea577f5c73dab35d
reason:         lrmd killed by SIGSEGV
time:           Tue 08 Nov 2016 09:36:28 PM PST
cmdline:        /usr/libexec/pacemaker/lrmd
package:        pacemaker-1.1.13-10.el7_2.4
uid:            0 (root)
count:          2
Directory:      /var/spool/abrt/ccpp-2016-11-08-21:36:28-1714

The content of /var/spool/abrt/ccpp-2016-11-08-21:36:28-1714 will be uploaded if the upload works.

--- Additional comment from  on 2016-11-10 18:48 EST ---

The content of /var/spool/abrt/ccpp-2016-11-08-21:36:28-1714 is posted

Comment 1 Ken Gaillot 2016-11-11 00:04:14 UTC
The core backtrace (uploaded to the original bug) shows that an invalid memory address is being passed to a string comparison function. Further investigation will be needed to identify the ultimate cause; my first suspicion is a use-after-free issue.

Comment 2 vlad.socaciu 2016-11-11 00:21:53 UTC
Created attachment 1219592 [details]
Another Set of Crash Data

This is the crash data from the other cluster node, in case it helps. The crash happened around the same time as the other one, so it may have the same or a similar root cause (or maybe not).

Comment 3 Ken Gaillot 2016-11-11 15:13:43 UTC
The crash occurred immediately after:

Nov 08 21:32:01 uis1ccg2.uismag.com lrmd[1563]:  warning: Cannot execute '/opt/unisys_ccg_gid/cluster_scripts/ccg_gid_resource_agent_1.sh': No such file or directory (2)

so correcting that issue may avoid the crash. Of course, I will try to locate and correct the problem in pacemaker's handling of such an error.
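
A quick way to confirm the missing file is to check the path from that message on the affected node, for example (a sketch; the path is the one from the warning above):

# run on uis1ccg2, the node that logged the warning
ls -l /opt/unisys_ccg_gid/cluster_scripts/ccg_gid_resource_agent_1.sh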

Comment 4 Ken Gaillot 2016-11-15 15:56:41 UTC
I am unsure how a resource agent under /opt can be executed. Can you attach the tar file created by the following command, so I can examine the full configuration and logs?

pcs cluster report --from "2016-11-08 20:00:00" --to "2016-11-08 22:00:00"

Comment 5 vlad.socaciu 2016-11-15 22:44:40 UTC
(In reply to Ken Gaillot from comment #4)
> I am unsure how a resource agent under /opt can be executed. Can you attach
> the tar file created by the following command, so I can examine the full
> configuration and logs?
> 
> pcs cluster report --from "2016-11-08 20:00:00" --to "2016-11-08 22:00:00"

Using the --force argument to "pcs resource create", e.g.:

pcs resource create resource_ccg_gid1 lsb:/opt/unisys_ccg_gid/cluster_scripts/ccg_gid_resource_agent_1.sh --force [... etc.] 

The report is still being generated.

Comment 6 vlad.socaciu 2016-11-15 23:00:57 UTC
I struggled with the report creation (see the output displayed at the end of the command, below). It seems that the report generation ends somewhat abnormally, no matter what. I am not sure whether I had to give the .tar extension to the report file specified on the command line, but without .tar the program created a report only a few hundred bytes long, and with the .tar extension "pcs cluster report" appends another .tar before compressing.

Nonetheless, I cannot submit the report; Bugzilla limits the upload size to ~20 MB and the report is about 167 MB. I will see if I can make it available some other way.
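
For what it's worth, a minimal sketch of an invocation that should avoid the double extension, assuming the destination name is given with no extension at all (the report name in the output below suggests the tool appends the archive suffix itself):

# destination name given without ".tar"; the archive suffix is assumed to be added by the tool
pcs cluster report --from "2016-11-08 20:00:00" --to "2016-11-08 22:00:00" cluster_rep_2016-11-08_from_20_to_22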

uis1ccg1:   WARN: Could not find the program path for core /var/lib/pacemaker/cores/core.10259
uis1ccg1:   WARN: Could not find the program path for core /var/lib/pacemaker/cores/core.10352
uis1ccg1:   WARN: Could not find the program path for core /var/lib/pacemaker/cores/core.10406
uis1ccg1:   WARN: Could not find the program path for core /var/lib/pacemaker/cores/core.10430
uis1ccg1:   WARN: Could not find the program path for core /var/lib/pacemaker/cores/core.10477
uis1ccg1:   WARN: Could not find the program path for core /var/lib/pacemaker/cores/core.10524
uis1ccg1:   WARN: Could not find the program path for core /var/lib/pacemaker/cores/core.10601
uis1ccg1:   WARN: Could not find the program path for core /var/lib/pacemaker/cores/core.10632
uis1ccg1:   WARN: Could not find the program path for core /var/lib/pacemaker/cores/core.10732
uis1ccg1:   WARN: Could not find the program path for core /var/lib/pacemaker/cores/core.10756
uis1ccg1:   WARN: Could not find the program path for core /var/lib/pacemaker/cores/core.10926
uis1ccg1:   WARN: Could not find the program path for core /var/lib/pacemaker/cores/core.11323
uis1ccg1:   WARN: Could not find the program path for core /var/lib/pacemaker/cores/core.11381
uis1ccg1:   WARN: Could not find the program path for core /var/lib/pacemaker/cores/core.11557
uis1ccg1:   WARN: Could not find the program path for core /var/lib/pacemaker/cores/core.11818
uis1ccg1:   WARN: Could not find the program path for core /var/lib/pacemaker/cores/core.12093
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Skipping bad logfile '/var/log/cluster/corosync.log-20161109.gz': Could not determine log dates
uis1ccg1:   Including segment [7518-51881] from /var/log/messages-20161113
uis1ccg1:   Including segment [7518-51881] from /var/log/messages-20161113
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Unknown time format used by: 
uis1ccg1:   WARN: Skipping bad logfile '/var/log/boot.log': Could not determine log dates
uis1ccg1:   Including segment [1478663990-1478671210] from journald
tar: This does not look like a tar archive
tar: Exiting with failure status due to previous errors
Connection closed by 10.0.1.46
tar: This does not look like a tar archive
tar: Exiting with failure status due to previous errors
uis1ccg1:    
uis1ccg1:   Collected results are available in /root/cluster_rep_2016-11-08_from_20_to_22.tar.tar.bz2
uis1ccg1:    
uis1ccg1:   Please create a bug entry at
uis1ccg1:       http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
uis1ccg1:   Include a description of your problem and attach this tarball
uis1ccg1:    
uis1ccg1:   Thank you for taking time to create this report.
uis1ccg1:

Comment 7 Ken Gaillot 2016-11-16 16:00:18 UTC
The good news is I was able to reproduce this with the pacemaker packages in RHEL 7.2 but not with the just-released packages from 7.3. So, I believe this issue is already fixed. If you can test the 7.3 packages, let me know if you see any issues.

I will continue to investigate to see if I can find the exact cause and fix.

On a side note, the fencing topology is unnecessary in your configuration, though it doesn't hurt. Topology is needed when a single host requires two fence devices (for example, redundant power supplies can require two separate power switches, or someone might want to try IPMI to fence first, and if that fails, try a power switch). Topology is per-target, so your configuration has two topologies (targets), each with a single fence device.

Comment 8 Ken Gaillot 2016-11-16 16:35:35 UTC
I believe the issue was fixed by upstream commits 67d68df1 and 786ebc41, which corrected potential use-after-free issues in the handling of in-flight operations.

Comment 10 vlad.socaciu 2016-11-23 04:12:43 UTC
(In reply to Ken Gaillot from comment #7)
> The good news is I was able to reproduce this with the pacemaker packages in
> RHEL 7.2 but not with the just-released packages from 7.3. So, I believe
> this issue is already fixed. If you can test the 7.3 packages, let me know
> if you see any issues.
> 
> I will continue to investigate to see if I can find the exact cause and fix.
> 
> On a side note, the fencing topology is unnecessary in your configuration,
> though it doesn't hurt. Topology is needed when a single host requires two
> fence devices (for example, redundant power supplies can require two
> separate power switches, or someone might want to try IPMI to fence first,
> and if that fails, try a power switch). Topology is per-target, so your
> configuration has two topologies (targets), each with a single fence device.

Forgive my ignorance, but I want to understand the commentary about the "fencing topology", and right now I do not.

What is a "fencing topology" in plain words: is it the fencing device that my colleague created?

#pcs stonith show --full
 Resource: vmfence (class=stonith type=fence_vmware_soap)
  Attributes: pcmk_host_map=uis1ccg1-app:uis1ccg1-vm;uis1ccg2-app:uis1ccg2-vm ipaddr=10.0.0.30 login=administrator passwd=2Securepw pcmk_monitor_timeout=120s pcmk_host_list=uis1ccg1-app,uis1ccg2-app ipport=443 inet4_only=1 ssl_insecure=1 power_wait=3 pcmk_host_check=static-list 
  Operations: monitor interval=60s (vmfence-monitor-interval-60s)
 Node: uis1ccg1-app
  Level 1 - vmfence
 Node: uis1ccg2-app
  Level 2 - vmfence

Or is it something else?

Did you mean to say that we did not need fencing at all?

My understanding (from the manual) is that "fencing" is a mechanism which prevents a node from accessing the resources when some piece of software (the fencing software, I guess) decides that the node is unresponsive. The immediate objection to this "definition" is: if the node is unresponsive, how could it be forced not to access the resources? Perhaps the other node (or nodes) could have access to some overriding mechanism which could ensure that the unresponsive node is really disabled (such as automatically turning the power off on that node via a software mechanism).

Not to mention the "deathmatch" (I found the term on the internet: http://ourobengr.com/ha/) that could occur if there are only two nodes in the configuration.

I am sorry to admit that the following paragraph is beyond me:

"Topology is needed when a single host requires two fence devices (for example, redundant power supplies can require two separate power switches, or someone might want to try IPMI to fence first, and if that fails, try a power switch). Topology is per-target, so your configuration has two topologies (targets), each with a single fence device".

I would appreciate it if you could recommend some reading material on this topic, if there is any.

Thanks a lot.

Comment 11 Ken Gaillot 2016-11-23 15:33:27 UTC
Sorry, I wasn't clear. Your fencing *devices* are necessary, and are configured correctly. These are created with "pcs stonith create".

In addition to the devices, you also have fencing topologies, which are created by "pcs stonith level add". In your configuration, they don't hurt, but they don't do anything useful, either.
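
If you ever want to drop them, something along these lines should do it (a sketch based on the levels shown in comment #10; since they are harmless, this is entirely optional):

# remove the single-device levels listed under "pcs stonith show --full" in comment #10
pcs stonith level remove 1 uis1ccg1-app vmfence
pcs stonith level remove 2 uis1ccg2-app vmfence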

A fencing device tells the cluster how to kill a node using a specific method (in your case, vmware's SOAP interface). A fencing topology tells the cluster that more than one device is needed to kill a node, which isn't true in your case.

For example, a common situation for physical servers is to try on-board IPMI to kill the node first, but if the IPMI doesn't respond, try an intelligent power switch. In Pacemaker, the IPMI and the power switch would each be configured as a device, and a topology would tell Pacemaker to try the IPMI first and the power switch second.
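
As a rough sketch of that setup (the device names, the node name "node1", and the agent parameters below are made up for illustration, not taken from your cluster):

# hypothetical devices for a node "node1": on-board IPMI plus an APC power switch
pcs stonith create node1-ipmi fence_ipmilan ipaddr=10.0.0.101 login=admin passwd=secret pcmk_host_list=node1
pcs stonith create rack-pdu fence_apc_snmp ipaddr=10.0.0.102 pcmk_host_map="node1:3"
# topology: try the IPMI first (level 1), fall back to the power switch (level 2)
pcs stonith level add 1 node1 node1-ipmi
pcs stonith level add 2 node1 rack-pdu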

The reference is:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/High_Availability_Add-On_Reference/ch-fencing-HAAR.html

Comment 12 vlad.socaciu 2016-11-25 02:40:58 UTC
I accidentally deleted the resource agents while the resources were running. The node where they were running was brought down (by the fence, I assume) after lrmd crashed. After that, I was able to stop and delete the resources from the other node. But as soon as I create new resources, the node where the incident initially occurred keeps rebooting every three minutes or so.

Don't know how to escape this quagmire.

Any advice would be appreciated!

Comment 13 vlad.socaciu 2016-11-25 03:37:57 UTC
(In reply to vlad.socaciu from comment #12)
> I deleted the resource agents accidentally, while the resources were
> running. The node where they were running was brought down (by the fence, I
> assume), after lrmd crashed. After that, I was able to stop and delete the
> resources from the other node. But as soon as I create new resources, the
> node where the incident occurred initially keeps rebooting every three
> minutes, or so.
> 
> Don't know how to escape this quagmire.
> 
> Any advice would be appreciated!

I got the reboot loop to stop. The circumstances of the incident were the following: 

    - node A didn't have the resource agents
    - resources had been created on node B but not started

If these two conditions were met, node A was rebooting continuously. I deleted the resources and copied the resource agents to node A. After that, I could create resources again and start them successfully.

As a side note: the web interfaces on both nodes take a long time (~5 minutes or more) before they display both nodes as "error free". In the first 5 minutes, both web interfaces showed pcsd as not running on node B.

Comment 14 Ken Gaillot 2016-11-28 21:30:00 UTC
(In reply to vlad.socaciu from comment #13)
> I got the reboot loop to stop. The circumstances of the incident were the
> following: 
> 
>     - node A didn't have the resource agents
>     - resources had been created on node B but not started
> 
> If these two conditions were met, node A was rebooting continuously. I
> deleted the resources and copied the resource agents to node A. After that,
> I could create resources again and start them successfully.

This makes sense. Most likely, what happened is that the cluster tried to probe the status of the resources on A and couldn't, so it then tried to execute the stop action to be sure they weren't running. When the stop failed too, the cluster fenced the node. If the node rejoined, the same thing would happen again (until the agents existed).
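
One way to guard against a repeat is to make sure the agent script exists and is executable on every node before creating the resource, along these lines (a sketch; it assumes the pacemaker node names from this report are also reachable as hostnames):

for node in uis1ccg1-app uis1ccg2-app; do
    # the agent must exist and be executable on every node that may run or probe the resource
    ssh "$node" test -x /opt/unisys_ccg_gid/cluster_scripts/ccg_gid_resource_agent_1.sh || echo "agent missing on $node"
done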

Comment 15 Ken Gaillot 2017-01-16 17:23:17 UTC
This issue is believed to be resolved in the 7.3 release packages. If it recurs with those packages, we can re-open for further investigation.