Bug 1322822

Summary: garbd failure when peer is fenced
Product: Red Hat Enterprise Linux 7
Component: pacemaker
Version: 7.2
Hardware: All
OS: Linux
Status: CLOSED NOTABUG
Severity: urgent
Priority: urgent
Assignee: Andrew Beekhof <abeekhof>
Reporter: Sebastien Aime <saime>
QA Contact: cluster-qe <cluster-qe>
CC: abeekhof, cfeist, cluster-maint, dciabrin, fdinitto, jruemker, mbayer, michele, royoung, saime, srevivo
Target Milestone: pre-dev-freeze
Target Release: 7.3
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-06-16 02:21:37 UTC

Description Sebastien Aime 2016-03-31 11:40:30 UTC
Dear all,

This condition has been identified when testing fencing with Air France in the context of their POC with a lightweight controller.

SFDC: 01539026
RHBZ: 1282406

When the fencing happened, pacemaker restarted garbd, which triggered a loss of quorum. The resource agent identified this condition as an error and stopped Galera.

This service restart has been identified as unnecessary and can be considered a bug.

Regards,

Sebastien.

Comment 1 Fabio Massimo Di Nitto 2016-03-31 11:45:54 UTC
Env:

2 or more node cluster
1 remote node

Configure a resource to run only on the remote node (via pacemaker_remoted).

Check in pcs status which node is managing the connection to pacemaker_remoted, and fence that node (crash it or kill it).

Another node will take over the connection to the remote node, and the resource configured to run on the remote node is restarted.

The restart should not happen, since the resource is already running.
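The setup above can be sketched with pcs; all node and resource names here are illustrative assumptions, not taken from the report:

```shell
# Hypothetical sketch of the reproduction environment; "remote1", "dummy"
# and "controller-0" are assumed names.

# Register the remote node (it runs pacemaker_remoted, not corosync):
pcs resource create remote1 ocf:pacemaker:remote server=remote1.example.com

# Create a resource and pin it to the remote node:
pcs resource create dummy ocf:pacemaker:Dummy
pcs constraint location dummy prefers remote1=INFINITY

# Find which cluster node currently hosts the remote connection, then fence it:
pcs status
pcs stonith fence controller-0
```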

Comment 4 Ken Gaillot 2016-03-31 17:23:02 UTC
I do not think the initial description of the behavior is correct. I believe the restart is required by the configuration, and is not a bug.

From what I can tell, garbd is restarted due to the ordering constraint "promote galera-master then start garbd". If the lost controller node was running the galera master (which I believe it was), then garbd would be restarted because the other controller needed to be promoted.

FYI, some confusion may have arisen because pacemaker does initially schedule restarts for resources on a remote node whose connection host is lost. Pacemaker does not know whether the connection will ultimately be re-established successfully on another node, so it initially schedules the remote node for fencing. However, if the connection is re-established, it cancels that fencing and re-probes the status of all resources on the remote node.

Comment 5 Fabio Massimo Di Nitto 2016-03-31 17:27:51 UTC
(In reply to Ken Gaillot from comment #4)
> I do not think the initial description of the behavior is correct. I believe
> the restart is required by the configuration, and is not a bug.
> 
> From what I can tell, garbd is restarted due to the ordering constraint
> "promote galera-master then start garbd". If the lost controller node was
> running the galera master (which I believe it was), then garbd would be
> restarted because the other controller needed to be promoted.

galera was master on both nodes at the time. So it doesn't explain why garbd was restarted (to me at least).

Comment 9 Ken Gaillot 2016-03-31 18:14:22 UTC
I misunderstood; I now see galera is configured master/master.

However, I still believe the ordering constraint is involved, possibly because interleave is left to default to false on the galera clone. I can reproduce the behavior with a dummy resource only if I add a similar ordering constraint.

I'm still investigating whether my theory is accurate, and whether this should be the desired/expected behavior when interleave=false.
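For reference, interleave is a clone meta-attribute; a hedged example of setting it on the galera clone (the resource name is taken from the constraint discussed above, but the exact command is a sketch):

```shell
# Hypothetical: make instances of the galera clone interleave, so ordered
# dependents only react to the clone copy on their own node rather than
# clone-wide.
pcs resource meta galera-master interleave=true
```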

Comment 10 Andrew Beekhof 2016-03-31 23:35:31 UTC
I think Ken's inability to reproduce is specific to the manner in which we're causing the controller to be isolated.

It now seems likely that controller-0, having lost quorum, is the one causing garbd to be stopped.  A combination of sbd and/or no-quorum-policy=freeze should be effective at providing the desired behaviour.  I will attempt to verify this today.
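The quorum-policy part of that mitigation would be a one-line property change (a sketch; whether it is appropriate depends on the deployment's use of sbd):

```shell
# Hypothetical mitigation sketch: on loss of quorum, freeze resource
# management instead of stopping resources.
pcs property set no-quorum-policy=freeze
```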

Comment 11 Andrew Beekhof 2016-04-01 03:23:04 UTC
Can we get new sos reports from both controllers please?
I'd like to confirm a couple of things before we go too deep down the rabbit hole.

Comment 12 Sebastien Aime 2016-04-01 08:02:43 UTC
Hi Andrew,

I'll call Air France to ask them to provide new sosreports now.

Cheers,

Sebastien.

Comment 13 Sebastien Aime 2016-04-01 08:28:08 UTC
Hello,

New sosreports from controllers 0 and 1 are available in collab-shell:

# ssh your_kerb.redhat.com
# cd /cases/01608978/01Apr2016/

Best regards,

Sebastien.

Comment 14 Fabio Massimo Di Nitto 2016-04-01 08:29:19 UTC
(In reply to Sebastien Aime from comment #13)
> Hello,
> 
> New sosreports from controllers 0 and 1 are available in collab-shell:
> 
> # ssh your_kerb.redhat.com
> # cd /cases/01608978/01Apr2016/
> 
> Best regards,
> 
> Sebastien.

Can we please also get sosreports + garbd.log from the light node?

Comment 15 Fabio Massimo Di Nitto 2016-04-01 08:29:49 UTC
(In reply to Fabio Massimo Di Nitto from comment #14)
> (In reply to Sebastien Aime from comment #13)
> > Hello,
> > 
> > New sosreports from controllers 0 and 1 are available in collab-shell:
> > 
> > # ssh your_kerb.redhat.com
> > # cd /cases/01608978/01Apr2016/
> > 
> > Best regards,
> > 
> > Sebastien.
> 
> Can we please also get sosreports + garbd.log from the light node?

and make sure garbd.log has > 24h of logs (to include yesterday's testing)

Comment 16 Sebastien Aime 2016-04-01 08:52:50 UTC
Hello,

Info added to /cases/01608978/01Apr2016/disaster-2 (with correct permissions)

Best regards,

Sebastien.

Comment 17 Ken Gaillot 2016-04-01 23:19:05 UTC
Looking at the logs around the Mar 31 09:25:19 incident, along with Damien Ciabrini's separate analysis, it is clear that garbd exited on its own, and was not stopped by pacemaker.

While pacemaker does schedule garbd for a restart, it infers the stop and does not do an actual stop. Instead, it re-probes the status of garbd, and the garbd resource agent reports that garbd is not running (because garbd had already started its cluster detach sequence by that point). While I think pacemaker's internal handling of this situation could be made more efficient, it behaves correctly, and would not have restarted garbd had it stayed running.

Comment 19 Andrew Beekhof 2016-04-03 23:26:34 UTC
(In reply to Ken Gaillot from comment #17)
> Looking at the logs around the Mar 31 09:25:19 incident, along with Damien
> Cabrini's separate analysis, it is clear that garbd exited on its own, and
> was not stopped by pacemaker.
> 
> While pacemaker does schedule garbd for a restart, it infers the stop and
> does not do an actual stop. Instead, it re-probes the status of garbd, and
> the garbd resource agent reports that garbd is not running (because garbd
> had already started its cluster detach sequence by that point). While I
> think pacemaker's internal handling of this situation could be made more
> efficient, it behaves correctly, and would not have restarted garbd had it
> stayed running.

Agreed.

We see:

Mar 31 09:26:04 qvi-eng-overcloud-controller-1 crmd[29605]: warning: Action 309 (garbd_monitor_20000) on qvi-eng-overcloud-disaster-2 failed (target: 0 vs. rc: 7): Error

followed by:

Mar 31 09:26:21 qvi-eng-overcloud-controller-1 crmd[29605]:  notice: Initiating action 36: stop garbd_stop_0 on qvi-eng-overcloud-disaster-2 (local)

So the stop is definitely a result of a legitimate garbd failure.
Changing the summary and re-assigning to Damien to investigate further (the cluster team will continue to monitor the bug in case assistance is required).

Comment 24 Michael Bayer 2016-04-04 00:09:23 UTC
restoring needinfo

Comment 25 Andrew Beekhof 2016-04-04 00:49:18 UTC
The server itself wasn't rebooted (based on the pid of os-collect-config in disaster-2/sosreport-qvi-eng-overcloud-disaster-2.localdomain-20160401083747/sos_commands/cluster/crm_report/qvi-eng-overcloud-disaster-2.localdomain/journal.log)

However another thing that further suggests pacemaker wasn't involved is that if we look at the garbd logs matching "Mar 31 09:2" in that file, we see only:

Mar 31 09:26:04 qvi-eng-overcloud-disaster-2.localdomain garbd(garbd)[12879]: ERROR: garbd not running: removing old PID file
Mar 31 09:26:21 qvi-eng-overcloud-disaster-2.localdomain garbd(garbd)[12913]: INFO: garbd is not running

The first corresponds to the failed monitor action and the second to the scheduled stop.

If pacemaker had asked garbd to stop prior to the monitor failure, we should have seen the results of one of these two lines (see full function below):

        ocf_log info "garbd is not running"
        ocf_log info "garbd stopped"

The only exception is if the call to ocf_stop_processes() failed.
However we see nothing prior.

garbd_stop()
{
    local rc
    local pid

    if [ ! -f $OCF_RESKEY_pid ]; then
        ocf_log info "garbd is not running"
        return $OCF_SUCCESS
    fi

    pid=`cat $OCF_RESKEY_pid 2> /dev/null `

    # make sure the process is stopped
    ocf_stop_processes TERM 10 $pid
    rc=$?

    if [ $rc -ne 0 ]; then
        return $OCF_ERR_GENERIC
    else
        rm -f $OCF_RESKEY_pid
        ocf_log info "garbd stopped"
        return $OCF_SUCCESS
    fi
}

The only other activity at the time of the SIGTERM is:

Mar 31 09:25:22 qvi-eng-overcloud-disaster-2.localdomain systemd[1]: Removed slice user-27.slice.
Mar 31 09:25:22 qvi-eng-overcloud-disaster-2.localdomain systemd[1]: Stopping user-27.slice.
Mar 31 09:25:22 qvi-eng-overcloud-disaster-2.localdomain pacemaker_remoted[11665]:   notice: LRMD client disconnecting remote client - name: remote-lrmd-10.60.162.44:3121 id: 5d95c4c5-851a-4fa5-a3a4-ff0b8d25ab03


Which suggests two further avenues for investigation:
1. pacemaker_remoted is auto-stopping services when its peer disconnects (however, this should have produced garbd logging)
2. systemd is involved somehow???

I will try to confirm/deny (1) shortly, however following the user-27 rabbit hole, I see this:

Mar 31 07:36:08 qvi-eng-overcloud-disaster-2.localdomain su[922]: (to mysql) root on none
Mar 31 07:36:08 qvi-eng-overcloud-disaster-2.localdomain kernel: SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
Mar 31 07:36:08 qvi-eng-overcloud-disaster-2.localdomain systemd[1]: Created slice user-27.slice.
Mar 31 07:36:08 qvi-eng-overcloud-disaster-2.localdomain systemd[1]: Starting user-27.slice.
Mar 31 07:36:08 qvi-eng-overcloud-disaster-2.localdomain systemd[1]: Started Session c4 of user mysql.
Mar 31 07:36:08 qvi-eng-overcloud-disaster-2.localdomain systemd[1]: Starting Session c4 of user mysql.
Mar 31 07:36:08 qvi-eng-overcloud-disaster-2.localdomain su[922]: pam_unix(su-l:session): session opened for user mysql by (uid=0)
Mar 31 07:36:08 qvi-eng-overcloud-disaster-2.localdomain su[922]: pam_unix(su-l:session): session closed for user mysql

Which corresponds to the last known garbd start operation (controller-0/sosreport-na.01608978-20160401080531/sos_commands/cluster/crm_report/qvi-eng-overcloud-controller-0.localdomain/corosync.log):

Mar 31 07:36:08 [3026] qvi-eng-overcloud-controller-0.localdomain       crmd:     info: do_lrm_rsc_op:	Performing key=363:25:0:d8ce311b-fc56-460e-8742-e177e10b3dad op=garbd_start_0
Mar 31 07:36:10 [3026] qvi-eng-overcloud-controller-0.localdomain       crmd:   notice: process_lrm_event:	Operation garbd_start_0: ok (node=qvi-eng-overcloud-disaster-2, call=148, rc=0, cib-update=248, confirmed=true)

So it makes sense that slice 27 going away would correspond to garbd stopping.
What is less clear is which is the cause and which the effect.

In the live session today, the first thing we need to do is replicate the test with some additional logging in the garbd agent, so that we can definitively say if SIGTERM is coming from the agent (and if so, when).
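A minimal sketch of the kind of extra trace line being proposed, based on the garbd_stop() function quoted above; ocf_log and the OCF_* variables are stubbed here only to make the fragment self-contained outside the OCF shell environment:

```shell
#!/bin/sh
# Sketch: add an entry trace to garbd_stop() so any cluster-initiated stop
# shows up in the logs even when the early "no pid file" path is taken.
# ocf_log and the OCF_* variables are stubs; the real agent gets them from
# the ocf-shellfuncs library.
ocf_log() { shift; echo "garbd(garbd): $*"; }
OCF_SUCCESS=0
OCF_RESKEY_pid=/tmp/garbd-test.pid

garbd_stop()
{
    # Added trace line: runs before any early return.
    ocf_log info "garbd_stop invoked"

    if [ ! -f "$OCF_RESKEY_pid" ]; then
        ocf_log info "garbd is not running"
        return $OCF_SUCCESS
    fi
    # ... remainder unchanged from the agent quoted above ...
}

rm -f "$OCF_RESKEY_pid"
garbd_stop
```

With this in place, a stop request that takes the pid-file-missing fast path would still leave evidence in syslog of who called stop and when.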

Comment 30 Andrew Beekhof 2016-04-04 08:32:42 UTC
From the new logs:

Mar 31 09:25:22 [11665] qvi-eng-overcloud-disaster-2.localdomain pacemaker_remoted:     info: cancel_recurring_action:	Cancelling ocf operation garbd_monitor_20000
Mar 31 09:25:22 [11665] qvi-eng-overcloud-disaster-2.localdomain pacemaker_remoted:     info: log_execute:	executing - rsc:garbd action:stop call_id:523
Mar 31 09:25:22 [11665] qvi-eng-overcloud-disaster-2.localdomain pacemaker_remoted:     info: log_finished:	finished - rsc:garbd action:stop call_id:523 pid:12699 exit-code:0 exec-time:231ms queue-time:0ms
Mar 31 09:25:22 [11665] qvi-eng-overcloud-disaster-2.localdomain pacemaker_remoted:     info: lrmd_remote_client_msg:	Client disconnect detected in tls msg dispatcher.
Mar 31 09:25:22 [11665] qvi-eng-overcloud-disaster-2.localdomain pacemaker_remoted:   notice: lrmd_remote_client_destroy:	LRMD client disconnecting remote client - name: remote-lrmd-10.60.162.44:3121 id: 5d95c4c5-851a-4fa5-a3a4-ff0b8d25ab03
Mar 31 09:25:29 [11665] qvi-eng-overcloud-disaster-2.localdomain pacemaker_remoted:   notice: lrmd_remote_listen:	LRMD client connection established. 0x1d567a0 id: f0253115-dfb3-4a8f-ac62-d271af1bff0f


Some part of pacemaker definitely requested a stop at that time.
Remaining questions:
1) Why didn't the RA produce any logs?
2) Did the controller initiate it or did pacemaker remote initiate it on its own when it detected the connection drop?

At this point I think it makes sense to move the bug back to pacemaker.

Comment 37 Ken Gaillot 2016-04-04 17:53:02 UTC
(In reply to Andrew Beekhof from comment #30)
> From the new logs:
> 
> Mar 31 09:25:22 [11665] qvi-eng-overcloud-disaster-2.localdomain
> pacemaker_remoted:     info: cancel_recurring_action:	Cancelling ocf
> operation garbd_monitor_20000
> Mar 31 09:25:22 [11665] qvi-eng-overcloud-disaster-2.localdomain
> pacemaker_remoted:     info: log_execute:	executing - rsc:garbd action:stop
> call_id:523
> Mar 31 09:25:22 [11665] qvi-eng-overcloud-disaster-2.localdomain
> pacemaker_remoted:     info: log_finished:	finished - rsc:garbd action:stop
> call_id:523 pid:12699 exit-code:0 exec-time:231ms queue-time:0ms
> Mar 31 09:25:22 [11665] qvi-eng-overcloud-disaster-2.localdomain
> pacemaker_remoted:     info: lrmd_remote_client_msg:	Client disconnect
> detected in tls msg dispatcher.
> Mar 31 09:25:22 [11665] qvi-eng-overcloud-disaster-2.localdomain
> pacemaker_remoted:   notice: lrmd_remote_client_destroy:	LRMD client
> disconnecting remote client - name: remote-lrmd-10.60.162.44:3121 id:
> 5d95c4c5-851a-4fa5-a3a4-ff0b8d25ab03
> Mar 31 09:25:29 [11665] qvi-eng-overcloud-disaster-2.localdomain
> pacemaker_remoted:   notice: lrmd_remote_listen:	LRMD client connection
> established. 0x1d567a0 id: f0253115-dfb3-4a8f-ac62-d271af1bff0f
> 
> 
> Some part of pacemaker definitely requested a stop at that time.
> Remaining questions:
> 1) Why didn't the RA produce any logs?
> 2) Did the controller initiate it or did pacemaker remote initiate it on its
> own when it detected the connection drop?

pacemaker_remote has no concept of what resources it is "running" and cannot initiate any action on its own. It only executes actions relayed to it by the cluster.

Since we have all the logs from controller-1 and it did not initiate a stop before garbd was found to be not running at 09:25:33, the stop must have been initiated by controller-0 before it was fenced. (Unfortunately, we don't have controller-0's final log messages, because its disk cache was lost when it was fenced. Having a central rsyslog server for syslog messages from all nodes could help troubleshooting in such cases.)
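As an illustration of that rsyslog suggestion (a sketch; the log-host name is an assumption, not from this case):

```
# /etc/rsyslog.d/forward.conf -- hypothetical sketch: forward all messages
# to a central log host so a fenced node's final messages survive.
# "@@" selects TCP; a single "@" would use UDP.
*.* @@loghost.example.com:514
```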

In any case, given the new logs, the stop does appear to be cluster-initiated. Further investigation will be needed to determine why.

> At this point I think it makes sense to move the bug back to pacemaker.

Comment 39 Andrew Beekhof 2016-04-06 10:37:57 UTC
Syncing with email thread...

From the logs we saw that pacemaker on controller-0 did initiate a stop for the garbd resource.

Despite the loss of logs and PE files from controller-0 due to the fencing operation, I managed to recreate what would have happened by taking pe-warn-7.bz2 from controller-2 and swapping which node had:
 
   in_ccm="false" crmd="offline"

With that crm_simulate dutifully shows what controller-0 would have executed:

Executing cluster transition:
 * Resource action: galera          cancel=10000 on qvi-eng-overcloud-controller-0
 * Pseudo action:   redis-master_pre_notify_demote_0
 * Resource action: garbd           stop on qvi-eng-overcloud-disaster-2

which is consistent with what we saw from the remote node.
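The replay described here can be sketched as follows (the file name comes from this comment; the edit step is manual):

```shell
# Sketch of the crm_simulate replay: take the saved policy-engine input from
# controller-2, swap which node_state carries in_ccm="false" crmd="offline",
# then show the transition the policy engine would compute.
bunzip2 -k pe-warn-7.bz2
# (manually edit pe-warn-7 to swap the node_state attributes)
crm_simulate -x pe-warn-7 -S
```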

Looking more closely, the reason for this is the empty transient attributes section.
Restore the attributes and the right thing happens, but without them, the cluster lacks the information needed to allow galera to remain promoted.

And without a galera master, garbd needs to be stopped.

You can see controller-1 deleting the attributes here in response to the node being lost:

Mar 31 09:25:19 [29605] qvi-eng-overcloud-controller-1.localdomain       crmd:     info: peer_update_callback:	Peer qvi-eng-overcloud-controller-0 left us
Mar 31 09:25:19 [29605] qvi-eng-overcloud-controller-1.localdomain       crmd:     info: erase_status_tag:	Deleting xpath: //node_state[@uname='qvi-eng-overcloud-controller-0']/transient_attributes


It shouldn't be possible that the node we've just lost the connection to is able to receive the update in which we remove its attributes, but that's the most plausible explanation I have right now.


Possible mitigation strategies include delaying the attribute removal (although I recall that we started doing them immediately due to a different bug report) and changing the galera/garbd ordering constraint to apply only in the startup direction.

Attempting to handle this in the PE does not seem viable at this point.

Comment 40 Ken Gaillot 2016-04-06 16:58:11 UTC
(In reply to Andrew Beekhof from comment #39)
> Possible mitigation strategies include delaying the attribute removal
> (although I recall that we started doing them immediately due to a different
> bug report) and changing the galera/garbd ordering constraint to apply only
> in the startup direction.

@Michael Bayer,

Currently, this cluster uses the constraint "promote galera-master then start garbd", which is symmetrical, so it also requires that garbd be stopped if galera is demoted to slave mode.

Are either of these alternatives feasible:

* "start galera-master then start garbd" -- so that garbd can be started after galera is started, regardless of whether galera is in slave or master mode

* Make the original constraint asymmetrical, so that garbd is still started after galera is promoted to master, but if galera is demoted (or even stopped), garbd can still run

?

If so, that will be a perfectly good workaround for this issue, although we will still want to address the underlying pacemaker catch-22 somehow.

Comment 41 Michael Bayer 2016-04-06 17:09:23 UTC
If the problem we're trying to solve is "we'd like to not unnecessarily restart garbd", then I see no reason why we can't leave it running. It can also be running before galera-master is started/promoted. It would just be up to the resource agent to be more closely familiar with the states of garbd, so it can distinguish "garbd is failing, kill it" from "garbd is just waiting for other nodes to come online".

Comment 42 Ken Gaillot 2016-04-06 17:37:30 UTC
(In reply to Michael Bayer from comment #41)
> If the problem we're trying to solve is, "we'd like to not unnecessarily
> restart garbd", then I see no reason why we can't leave it running.   But
> also it can be running before galera-master is started / promoted as well.  
> It would just be up to the resource agent to be more closely familiar with
> the states of garbd, to distinguish "garbd is failing, kill it" from "garbd
> is just waiting for other nodes to come online".

@Damien Ciabrini,

How would the garbd resource agent currently behave in those situations? I.e. will it report an error if garbd is running with one of the galera nodes in slave mode, or with galera not running at all?

Comment 43 Andrew Beekhof 2016-04-06 22:13:48 UTC
(In reply to Ken Gaillot from comment #40)

> * Make the original constraint asymmetrical, so that garbd is still started
> after galera is promoted to master, but if galera is demoted (or even
> stopped), garbd can still run

Yep, that's what I suggested on IRC late last night :-)

The garbd monitor is written such that it will be considered failed if there are no tcp sockets in the CONNECTED state.  So in a split-brain situation garbd would stay up because it is connected to the other controller.

So:
  
   pcs -f garbd.xml constraint order promote galera-master then start garbd

becomes:

   pcs -f garbd.xml constraint order promote galera-master then start garbd symmetrical=false

Comment 44 Andrew Beekhof 2016-06-16 02:21:37 UTC
Closing.

The AF document was updated, and since this was a custom deployment, no change is required in director or any other component.