Bug 1804259
| Summary: | When using Instance HA and doing a serial reboot node by node a pacemaker remote service can get stuck | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Michele Baldessari <michele> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED NOTABUG | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 8.2 | CC: | cluster-maint, lmiccini |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | --- | Flags: | pm-rhel: mirror+ |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Last Closed: | 2021-08-18 07:27:07 UTC | Type: | Bug |
Description
Michele Baldessari
2020-02-18 14:50:59 UTC
(In reply to Michele Baldessari from comment #0)

> Description of problem:
> On an Instance HA environment we rebooted all nodes in series, one by one, and
> compute-0 got stuck rebooting and had to wait for the 30-minute timeout of the
> pacemaker_remote service on the node. In the end it was systemd that rebooted
> the box forcibly:
>
> Feb 18 13:35:00 compute-0.redhat.local systemd[1]: Forcibly rebooting: job timed out
>
> Timeline of events:
>
> 0) The DC messaging-0 notices that compute-0 is being shut down:
>
> Feb 18 13:05:00 messaging-0.redhat.local pacemaker-attrd [2224] (crm_reap_dead_member) info: Removing node with name compute-0 and id 0 from membership cache
> Feb 18 13:05:00 messaging-0.redhat.local pacemaker-attrd [2224] (reap_crm_member) notice: Purged 1 peer with id=0 and/or uname=compute-0 from the membership cache
> Feb 18 13:05:00 messaging-0.redhat.local pacemaker-attrd [2224] (attrd_peer_update) info: Setting shutdown[compute-0]: (null) -> 1582031100 from controller-1
> ...
> Feb 18 13:05:00 messaging-0.redhat.local pacemaker-schedulerd[2225] (unpack_handle_remote_attrs) info: Node compute-0 is shutting down
> ...
>
> 1) The DC stops compute-unfence-trigger:
>
> Feb 18 13:05:00 messaging-0.redhat.local pacemaker-controld [2226] (do_te_invoke) info: Processing graph 10 (ref=pe_calc-dc-1582031100-296) derived from /var/lib/pacemaker/pengine/pe-input-27.bz2
> Feb 18 13:05:00 messaging-0.redhat.local pacemaker-controld [2226] (te_rsc_command) notice: Initiating stop operation compute-unfence-trigger_stop_0 on compute-0 | action 297
>
> 2) Seemingly the cluster told lrmd on compute-0 to shut down at 13:05:00 (?)
> Feb 18 13:05:00 compute-0.redhat.local pacemaker-remoted [2516] (crm_signal_dispatch) notice: Caught 'Terminated' signal | 15 (invoking handler)
> Feb 18 13:05:00 compute-0.redhat.local pacemaker-remoted [2516] (lrmd_shutdown) info: Sending shutdown request to cluster
> Feb 18 13:05:00 compute-0.redhat.local pacemaker-remoted [2516] (lrmd_remote_connection_destroy) notice: TLS server session ended
> Feb 18 13:05:00 compute-0.redhat.local pacemaker-remoted [2516] (handle_shutdown_ack) info: Received shutdown ack

This is actually the first thing that happens. The 'Terminated' signal comes from the local host (in this case, systemd preparing for reboot). The remote node notifies the DC that it wants to leave, and the "Received shutdown ack" message indicates that the DC acknowledged this; at that point the remote node waits for the DC to tell it that everything has safely stopped before actually exiting.

However, the problem here is the "TLS server session ended" message. It means the remote connection dropped at this very moment, which is unexpected and problematic. The remote node received the ack (obviously before the connection dropped), so it will wait for a DC message that can never arrive, since the connection has been dropped.

I see two problems here:

(1) Why did the connection drop? Logs from the DC and the cluster node hosting the connection may give more information.

(2) If the connection drops while pacemaker-remoted is waiting for the DC's "all clear" message, it should stop waiting and immediately exit with error status. (This will require a code change.)
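The shutdown handshake described above can be sketched as a small state machine. This is a hedged illustration of the logic, not pacemaker's actual code; the function name `wait_for_all_clear` and the queue-based message model are hypothetical:

```python
import queue

# Hypothetical model: messages from the cluster connection arrive on a
# queue; None models the TLS session dropping mid-shutdown.

def wait_for_all_clear(conn: "queue.Queue", timeout: float = 5.0) -> str:
    """After the DC acks the shutdown request, wait for the final
    'all clear' before exiting. Problem (2) above: if the connection
    drops first, we must exit with an error instead of waiting forever."""
    while True:
        try:
            msg = conn.get(timeout=timeout)
        except queue.Empty:
            return "timed-out"        # systemd's unit timeout would fire here
        if msg is None:               # connection dropped mid-shutdown
            return "exit-with-error"  # the proposed fix: stop waiting
        if msg == "all-clear":
            return "clean-exit"       # DC confirmed everything stopped safely

# Simulate the failure seen in this bug: the ack arrives, then the
# connection drops before any "all clear" can be delivered.
conn = queue.Queue()
conn.put("shutdown-ack")
conn.put(None)                # the TLS session ending
conn.get()                    # the ack was already consumed in the real flow
print(wait_for_all_clear(conn))  # → exit-with-error
```

The buggy behavior corresponds to ignoring the dropped-connection case and looping until an external timeout kills the process; the proposed code change makes the drop a terminal state.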
> 3) Then pacemaker-remoted invokes the stop of the dummy resource compute-unfence-trigger-clone:
>
> Feb 18 13:05:00 compute-0.redhat.local pacemaker-remoted [2516] (cancel_recurring_action) info: Cancelling ocf operation compute-unfence-trigger_monitor_10000
> Feb 18 13:05:00 compute-0.redhat.local pacemaker-remoted [2516] (log_execute) info: executing - rsc:compute-unfence-trigger action:stop call_id:159

This would indicate that the DC was able to tell it to initiate the stop before the connection dropped. It seems odd that the log messages are out of order, but they come asynchronously from different bits of code, so that's not necessarily a problem.

> 4) But then this stop times out [not entirely sure why; potentially it has to do with point (2)]:
>
> Feb 18 13:05:20 compute-0.redhat.local pacemaker-remoted [2516] (child_timeout_callback) warning: compute-unfence-trigger_stop_0 process (PID 9807) timed out
> Feb 18 13:05:20 compute-0.redhat.local pacemaker-remoted [2516] (operation_finished) warning: compute-unfence-trigger_stop_0:9807 - timed out after 20000ms
> Feb 18 13:05:20 compute-0.redhat.local pacemaker-remoted [2516] (log_finished) info: finished - rsc:compute-unfence-trigger action:stop call_id:159 pid:9807 exit-code:1 exec-time:20003ms queue-time:0ms

This is incredibly surprising for a Dummy resource: its stop should be just a couple of trivial file system operations. Any chance the file system is gone at this point? We can also check for anything unusual in the system logs around this time.

> 5) At this point, about 30 minutes later, systemd forcibly sends SIGTERM to pacemaker-remoted:
>
> Feb 18 13:35:00 compute-0.redhat.local pacemaker-remoted [2516] (crm_signal_dispatch) notice: Caught 'Terminated' signal | 15 (invoking handler)

Of course, this is pacemaker-remoted waiting for an all-clear from the DC that can't arrive, and the systemd unit timeout kicking in.
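The "timed out after 20000ms" lines in step 4 are the executor killing the agent process when its per-operation timeout expires. A minimal sketch of that pattern (hypothetical, not pacemaker's executor; `run_action` is an illustrative name):

```python
import subprocess

def run_action(cmd, timeout_s=20.0):
    """Run a resource-agent action, killing it if it exceeds its timeout,
    loosely modeling the executor's per-operation timeout seen above."""
    try:
        return subprocess.run(cmd, capture_output=True, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child on timeout; this corresponds to
        # "timed out after 20000ms" / exit-code 1 in the logs.
        return 1

print(run_action(["true"]))                       # → 0
print(run_action(["sleep", "5"], timeout_s=0.5))  # → 1
```

For a Dummy resource, the wrapped stop action is nearly a no-op, which is why a 20-second timeout firing is so surprising here.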
> Version-Release number of selected component (if applicable):
> pacemaker-2.0.2-3.el8.x86_64
>
> How reproducible:
> We're not sure yet how often we can reproduce this, but we will keep running
> the job that triggered it and report on how often we see it.
>
> If https://bugzilla.redhat.com/show_bug.cgi?id=1791841 could be relevant here,
> we could test it with 2.0.3-5.el8 (from an initial look it did not seem to be
> the same problem)

This appears unrelated.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

This is still a priority and will be reopened when developer time becomes available.

After all this time, I revisited this and got a little more insight.
Whatever happened to the remote connection was on the remote end; the connection host didn't notice the connection was lost until much later. So, the cluster attempted to stop compute-unfence-trigger, and that stop timed out. This part of the configuration is the culprit for what happens next:
<op id="compute-unfence-trigger-stop-interval-0s" interval="0s" name="stop" on-fail="block" timeout="20"/>
on-fail for the stop operation is set to block, so when the stop failed, pacemaker did exactly that: it blocked all further recovery of the resource, leaving the node's shutdown hanging until systemd's unit timeout fired.
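For comparison, an operation definition that escalates a failed stop instead of blocking might look like the fragment below. This is illustrative only: `on-fail="fence"` is the usual default for stop failures when fencing is configured, and it requires a working fencing setup; whether that is appropriate for this unfence-trigger resource is a design decision, not a recommendation from this bug.

```xml
<!-- Illustrative alternative: escalate a failed stop to fencing rather
     than blocking; valid only when fencing is configured. -->
<op id="compute-unfence-trigger-stop-interval-0s" interval="0s"
    name="stop" on-fail="fence" timeout="20"/>
```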
There's no clear indication of why the connection dropped immediately after pacemaker-remoted got SIGTERM, but I'm guessing the signal just happened to arrive at an unfortunate point in the socket I/O code (most likely in glib, but possibly in pacemaker or GnuTLS), causing it to end the connection.
I'm going to leave this bz closed permanently since pacemaker behaved according to the configuration, though I do still plan to make sure pacemaker-remoted doesn't wait if the connection drops while shutting down.
(In reply to Ken Gaillot from comment #9)
> I'm going to leave this bz closed permanently since pacemaker behaved
> according to the configuration, though I do still plan to make sure
> pacemaker-remoted doesn't wait if the connection drops while shutting down.

FYI, that issue is fixed in the upstream main branch as of commit d61494347.

My analysis in Comment 4 was completely wrong. :( "TLS server session ended" can happen not only when the connection drops unexpectedly, but also when the remote node intentionally stops accepting new connections from the cluster, which is what happened at this point. That's why the DC was able to initiate the stop -- the existing connection was unaffected.

The question that can't be answered from the logs here is why the compute-unfence-trigger stop timed out. But once it did, what followed is the correct behavior: compute-unfence-trigger has on-fail=block, so it blocked, eventually leading to the systemd timeout.

The "fix" in Comment 10 was thus totally misguided and actually introduced a regression. :( At least the regression did not make it into a released RHEL build, though it did make it into upstream release 2.1.5 and will be fixed in 2.1.6.