Dear all,

This condition was identified when testing fencing with Air France, in the context of their POC with a lightweight controller.

SFDC: 01539026
RHBZ: 1282406

When the fencing happened, pacemaker restarted garbd, which triggered a loss of quorum. resource-agents identified this condition as an error and stopped Galera. This service restart has been identified as unnecessary and can be considered a bug.

Regards,
Sebastien.
Env: 2 or more cluster nodes, 1 remote node

1. Configure a resource to run only on the remote node (via pacemaker_remoted).
2. Check in pcs status which node is managing the connection to pacemaker_remoted, and fence that node (crash it or kill it).
3. Another node will take over the connection to the remote node, and the resource configured to run on the remote node is restarted.

The restart should not happen, since the resource is already running.
I do not think the initial description of the behavior is correct. I believe the restart is required by the configuration, and is not a bug.

From what I can tell, garbd is restarted due to the ordering constraint "promote galera-master then start garbd". If the lost controller node was running the galera master (which I believe it was), then garbd would be restarted because the other controller needed to be promoted.

FYI, some confusion may have arisen because pacemaker does initially schedule restarts for resources on a remote node whose connection host is lost. Pacemaker does not know whether the connection will ultimately be re-established successfully on another node, so it initially schedules the remote node for fencing. However, if the connection is re-established, it cancels that fencing and re-probes the status of all resources on the remote node.
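For reference, a symmetric ordering constraint like the one discussed here is typically created with pcs along these lines (a sketch only; the resource names are taken from this report):

```shell
# "promote galera-master then start garbd": garbd may start only after the
# galera master is promoted; being symmetrical, it also implies
# "stop garbd before demoting galera-master" on the way down
pcs constraint order promote galera-master then start garbd
```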
(In reply to Ken Gaillot from comment #4)
> I do not think the initial description of the behavior is correct. I believe
> the restart is required by the configuration, and is not a bug.
>
> From what I can tell, garbd is restarted due to the ordering constraint
> "promote galera-master then start garbd". If the lost controller node was
> running the galera master (which I believe it was), then garbd would be
> restarted because the other controller needed to be promoted.

galera was master on both nodes at the time, so that doesn't explain why garbd was restarted (to me at least).
I misunderstood; I now see galera is configured master/master. However, I still believe the ordering constraint is involved, possibly because interleave is left to default to false on the galera clone. I can reproduce the behavior with a dummy resource only if I add a similar ordering constraint. I'm still investigating whether my theory is accurate, and whether this should be the desired/expected behavior when interleave=false.
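If interleave does turn out to matter here, it is a clone meta-attribute; a sketch of setting it explicitly, assuming pcs manages the galera master resource:

```shell
# interleave defaults to false, meaning an ordered clone waits for *all*
# instances of the other clone; with true, the ordering is evaluated per-node
pcs resource update galera-master meta interleave=true
```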
I think Ken's inability to reproduce is specific to the manner in which we're causing the controller to be isolated. It now seems likely that controller-0, having lost quorum, is the one causing garbd to be stopped. A combination of sbd and/or no-quorum-policy=freeze should be effective at providing the desired behaviour. I will attempt to verify this today.
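The quorum policy mentioned above is a cluster-wide property; a minimal sketch:

```shell
# With freeze, a partition that loses quorum keeps its already-running
# resources but starts nothing new, instead of stopping them
# (the default policy is "stop")
pcs property set no-quorum-policy=freeze
```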
Can we get new sos reports from both controllers please? I'd like to confirm a couple of things before we go too deep down the rabbit hole.
Hi Andrew, I'll call Air France to ask them to provide new sosreports now. Cheers, Sebastien.
Hello,

New sosreports from controllers 0 and 1 are available in collab-shell:

# ssh your_kerb.redhat.com
# cd /cases/01608978/01Apr2016/

Best regards,
Sebastien.
(In reply to Sebastien Aime from comment #13)
> Hello,
>
> New sosreports from controllers 0 and 1 are available in collab-shell:
>
> # ssh your_kerb.redhat.com
> # cd /cases/01608978/01Apr2016/
>
> Best regards,
>
> Sebastien.

Can we please also get sosreports + garbd.log from the light node?
(In reply to Fabio Massimo Di Nitto from comment #14)
> (In reply to Sebastien Aime from comment #13)
> > Hello,
> >
> > New sosreports from controllers 0 and 1 are available in collab-shell:
> >
> > # ssh your_kerb.redhat.com
> > # cd /cases/01608978/01Apr2016/
> >
> > Best regards,
> >
> > Sebastien.
>
> Can we please also get sosreports + garbd.log from the light node?

And make sure garbd.log has > 24h of logs (to include yesterday's testing).
Hello,

Info added to /cases/01608978/01Apr2016/disaster-2 (with correct permissions).

Best regards,
Sebastien.
Looking at the logs around the Mar 31 09:25:19 incident, along with Damien Ciabrini's separate analysis, it is clear that garbd exited on its own, and was not stopped by pacemaker.

While pacemaker does schedule garbd for a restart, it infers the stop and does not perform an actual stop. Instead, it re-probes the status of garbd, and the garbd resource agent reports that garbd is not running (because garbd had already started its cluster detach sequence by that point). While I think pacemaker's internal handling of this situation could be made more efficient, it behaves correctly, and would not have restarted garbd had it stayed running.
(In reply to Ken Gaillot from comment #17)
> Looking at the logs around the Mar 31 09:25:19 incident, along with Damien
> Ciabrini's separate analysis, it is clear that garbd exited on its own, and
> was not stopped by pacemaker.
>
> While pacemaker does schedule garbd for a restart, it infers the stop and
> does not do an actual stop. Instead, it re-probes the status of garbd, and
> the garbd resource agent reports that garbd is not running (because garbd
> had already started its cluster detach sequence by that point). While I
> think pacemaker's internal handling of this situation could be made more
> efficient, it behaves correctly, and would not have restarted garbd had it
> stayed running.

Agreed. We see:

Mar 31 09:26:04 qvi-eng-overcloud-controller-1 crmd[29605]: warning: Action 309 (garbd_monitor_20000) on qvi-eng-overcloud-disaster-2 failed (target: 0 vs. rc: 7): Error

followed by:

Mar 31 09:26:21 qvi-eng-overcloud-controller-1 crmd[29605]: notice: Initiating action 36: stop garbd_stop_0 on qvi-eng-overcloud-disaster-2 (local)

So the stop is definitely a result of a legitimate garbd failure. Changing the summary and re-assigning to Damien to investigate further (the cluster team will continue to monitor the bug in case assistance is required).
restoring needinfo
The server itself wasn't rebooted (based on the pid of os-collect-config in disaster-2/sosreport-qvi-eng-overcloud-disaster-2.localdomain-20160401083747/sos_commands/cluster/crm_report/qvi-eng-overcloud-disaster-2.localdomain/journal.log).

However, another thing that further suggests pacemaker wasn't involved is that if we look at the garbd logs matching "Mar 31 09:2" in that file, we see only:

Mar 31 09:26:04 qvi-eng-overcloud-disaster-2.localdomain garbd(garbd)[12879]: ERROR: garbd not running: removing old PID file
Mar 31 09:26:21 qvi-eng-overcloud-disaster-2.localdomain garbd(garbd)[12913]: INFO: garbd is not running

The first corresponds to the failed monitor action and the second to the scheduled stop.

If pacemaker had asked garbd to stop prior to the monitor failure, we should have seen the output of one of these two lines (see full function below):

    ocf_log info "garbd is not running"
    ocf_log info "garbd stopped"

The only exception is if the call to ocf_stop_processes() failed; however, we see nothing prior to that.

garbd_stop()
{
    local rc
    local pid

    if [ ! -f $OCF_RESKEY_pid ]; then
        ocf_log info "garbd is not running"
        return $OCF_SUCCESS
    fi

    pid=`cat $OCF_RESKEY_pid 2> /dev/null`

    # make sure the process is stopped
    ocf_stop_processes TERM 10 $pid
    rc=$?
    if [ $rc -ne 0 ]; then
        return $OCF_ERR_GENERIC
    else
        rm -f $OCF_RESKEY_pid
        ocf_log info "garbd stopped"
        return $OCF_SUCCESS
    fi
}

The only other activity at the time of the SIGTERM is:

Mar 31 09:25:22 qvi-eng-overcloud-disaster-2.localdomain systemd[1]: Removed slice user-27.slice.
Mar 31 09:25:22 qvi-eng-overcloud-disaster-2.localdomain systemd[1]: Stopping user-27.slice.
Mar 31 09:25:22 qvi-eng-overcloud-disaster-2.localdomain pacemaker_remoted[11665]: notice: LRMD client disconnecting remote client - name: remote-lrmd-10.60.162.44:3121 id: 5d95c4c5-851a-4fa5-a3a4-ff0b8d25ab03

Which suggests two further avenues for investigation:

1. pacemaker_remoted is auto-stopping services when its peer disconnects (however, that should have produced garbd logging)
2. systemd is involved somehow???

I will try to confirm/deny (1) shortly. However, following the user-27 rabbit hole, I see this:

Mar 31 07:36:08 qvi-eng-overcloud-disaster-2.localdomain su[922]: (to mysql) root on none
Mar 31 07:36:08 qvi-eng-overcloud-disaster-2.localdomain kernel: SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
Mar 31 07:36:08 qvi-eng-overcloud-disaster-2.localdomain systemd[1]: Created slice user-27.slice.
Mar 31 07:36:08 qvi-eng-overcloud-disaster-2.localdomain systemd[1]: Starting user-27.slice.
Mar 31 07:36:08 qvi-eng-overcloud-disaster-2.localdomain systemd[1]: Started Session c4 of user mysql.
Mar 31 07:36:08 qvi-eng-overcloud-disaster-2.localdomain systemd[1]: Starting Session c4 of user mysql.
Mar 31 07:36:08 qvi-eng-overcloud-disaster-2.localdomain su[922]: pam_unix(su-l:session): session opened for user mysql by (uid=0)
Mar 31 07:36:08 qvi-eng-overcloud-disaster-2.localdomain su[922]: pam_unix(su-l:session): session closed for user mysql

Which corresponds to the last known garbd start operation (controller-0/sosreport-na.01608978-20160401080531/sos_commands/cluster/crm_report/qvi-eng-overcloud-controller-0.localdomain/corosync.log):

Mar 31 07:36:08 [3026] qvi-eng-overcloud-controller-0.localdomain crmd: info: do_lrm_rsc_op: Performing key=363:25:0:d8ce311b-fc56-460e-8742-e177e10b3dad op=garbd_start_0
Mar 31 07:36:10 [3026] qvi-eng-overcloud-controller-0.localdomain crmd: notice: process_lrm_event: Operation garbd_start_0: ok (node=qvi-eng-overcloud-disaster-2, call=148, rc=0, cib-update=248, confirmed=true)

So it makes sense that slice user-27 going away would correspond to garbd stopping. What is less clear is which is the cause and which the effect.
In the live session today, the first thing we need to do is replicate the test with some additional logging in the garbd agent, so that we can definitively say if SIGTERM is coming from the agent (and if so, when).
From the new logs:

Mar 31 09:25:22 [11665] qvi-eng-overcloud-disaster-2.localdomain pacemaker_remoted: info: cancel_recurring_action: Cancelling ocf operation garbd_monitor_20000
Mar 31 09:25:22 [11665] qvi-eng-overcloud-disaster-2.localdomain pacemaker_remoted: info: log_execute: executing - rsc:garbd action:stop call_id:523
Mar 31 09:25:22 [11665] qvi-eng-overcloud-disaster-2.localdomain pacemaker_remoted: info: log_finished: finished - rsc:garbd action:stop call_id:523 pid:12699 exit-code:0 exec-time:231ms queue-time:0ms
Mar 31 09:25:22 [11665] qvi-eng-overcloud-disaster-2.localdomain pacemaker_remoted: info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
Mar 31 09:25:22 [11665] qvi-eng-overcloud-disaster-2.localdomain pacemaker_remoted: notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-10.60.162.44:3121 id: 5d95c4c5-851a-4fa5-a3a4-ff0b8d25ab03
Mar 31 09:25:29 [11665] qvi-eng-overcloud-disaster-2.localdomain pacemaker_remoted: notice: lrmd_remote_listen: LRMD client connection established. 0x1d567a0 id: f0253115-dfb3-4a8f-ac62-d271af1bff0f

Some part of pacemaker definitely requested a stop at that time. Remaining questions:

1) Why didn't the RA produce any logs?
2) Did the controller initiate it, or did pacemaker remote initiate it on its own when it detected the connection drop?

At this point I think it makes sense to move the bug back to pacemaker.
(In reply to Andrew Beekhof from comment #30)
> From the new logs:
>
> Mar 31 09:25:22 [11665] qvi-eng-overcloud-disaster-2.localdomain
> pacemaker_remoted: info: cancel_recurring_action: Cancelling ocf
> operation garbd_monitor_20000
> Mar 31 09:25:22 [11665] qvi-eng-overcloud-disaster-2.localdomain
> pacemaker_remoted: info: log_execute: executing - rsc:garbd action:stop
> call_id:523
> Mar 31 09:25:22 [11665] qvi-eng-overcloud-disaster-2.localdomain
> pacemaker_remoted: info: log_finished: finished - rsc:garbd action:stop
> call_id:523 pid:12699 exit-code:0 exec-time:231ms queue-time:0ms
> Mar 31 09:25:22 [11665] qvi-eng-overcloud-disaster-2.localdomain
> pacemaker_remoted: info: lrmd_remote_client_msg: Client disconnect
> detected in tls msg dispatcher.
> Mar 31 09:25:22 [11665] qvi-eng-overcloud-disaster-2.localdomain
> pacemaker_remoted: notice: lrmd_remote_client_destroy: LRMD client
> disconnecting remote client - name: remote-lrmd-10.60.162.44:3121 id:
> 5d95c4c5-851a-4fa5-a3a4-ff0b8d25ab03
> Mar 31 09:25:29 [11665] qvi-eng-overcloud-disaster-2.localdomain
> pacemaker_remoted: notice: lrmd_remote_listen: LRMD client connection
> established. 0x1d567a0 id: f0253115-dfb3-4a8f-ac62-d271af1bff0f
>
> Some part of pacemaker definitely requested a stop at that time.
> Remaining questions:
> 1) Why didn't the RA produce any logs?
> 2) Did the controller initiate it or did pacemaker remote initiate it on its
> own when it detected the connection drop?

pacemaker_remote has no concept of what resources it is "running" and cannot initiate any action on its own. It only executes actions relayed to it by the cluster.

Since we have all the logs from controller-1, and it did not initiate a stop before garbd was found to be not running at 09:25:33, the stop must have been initiated by controller-0 before it was fenced. (Unfortunately, we don't have controller-0's final log messages, because its disk cache was lost when it was fenced. Having a central rsyslog server for syslog messages from all nodes could help troubleshooting in such cases.)

In any case, given the new logs, the stop does appear to be cluster-initiated. Further investigation will be needed to determine why.

> At this point I think it makes sense to move the bug back to pacemaker.
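The central-rsyslog suggestion above could look roughly like this (hostnames and file names are placeholders, not taken from this deployment):

```shell
# On every cluster node: forward all syslog traffic over TCP ("@@") to a
# central collector, so a fenced node's last messages survive its lost
# disk cache
echo '*.* @@logserver.example.com:514' > /etc/rsyslog.d/forward.conf

# On the collector: load rsyslog's imtcp module and listen on the same port
cat > /etc/rsyslog.d/listen.conf <<'EOF'
module(load="imtcp")
input(type="imtcp" port="514")
EOF

systemctl restart rsyslog
```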
Syncing with email thread...

From the logs we saw that pacemaker on controller-0 did initiate a stop for the garbd resource.

Despite the loss of logs and PE files from controller-0 due to the fencing operation, I managed to recreate what would have happened by taking pe-warn-7.bz2 from controller-2 and swapping which node had:

in_ccm="false" crmd="offline"

With that, crm_simulate dutifully shows what controller-0 would have executed:

Executing cluster transition:
 * Resource action: galera cancel=10000 on qvi-eng-overcloud-controller-0
 * Pseudo action: redis-master_pre_notify_demote_0
 * Resource action: garbd stop on qvi-eng-overcloud-disaster-2

which is consistent with what we saw from the remote node.

Looking more closely, the reason for this is the empty transient attributes section. Restore the attributes and the right thing happens; without them, the cluster lacks the information needed to allow galera to remain promoted. And without a galera master, garbd needs to be stopped.

You can see controller-1 deleting the attributes here, in response to the node being lost:

Mar 31 09:25:19 [29605] qvi-eng-overcloud-controller-1.localdomain crmd: info: peer_update_callback: Peer qvi-eng-overcloud-controller-0 left us
Mar 31 09:25:19 [29605] qvi-eng-overcloud-controller-1.localdomain crmd: info: erase_status_tag: Deleting xpath: //node_state[@uname='qvi-eng-overcloud-controller-0']/transient_attributes

It shouldn't be possible that the node we've just lost connection to is able to receive the update in which we remove its attributes, but that's the most plausible explanation I have right now.

Possible mitigation strategies include delaying the attribute removal (although I recall that we started doing the removals immediately due to a different bug report) and changing the galera/garbd ordering constraint to apply only in the startup direction. Attempting to handle this in the PE does not seem viable at this point.
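Lacking controller-0's own PE files, the replay described above can be sketched as follows (the file name and the manual edit are illustrative; crm_simulate replays a saved PE input and prints the transition the DC would have executed):

```shell
# Take the PE input from a surviving node and unpack it (it is bzipped XML)
cp /var/lib/pacemaker/pengine/pe-warn-7.bz2 .
bunzip2 pe-warn-7.bz2

# Edit pe-warn-7 so that the other controller's <node_state> carries
#   in_ccm="false" crmd="offline"
# and empty its <transient_attributes/> section, then replay the transition:
crm_simulate --simulate --xml-file pe-warn-7
```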
(In reply to Andrew Beekhof from comment #39)
> Possible mitigation strategies include delaying the attribute removal
> (although I recall that we started doing them immediately due to a different
> bug report) and changing the galera/garbd ordering constraint to apply only
> in the startup direction.

@Michael Bayer,

Currently, this cluster uses the constraint "promote galera-master then start garbd", which is symmetrical, so it also requires that garbd be stopped if galera is demoted to slave mode. Would either of these alternatives be feasible?

* "start galera-master then start garbd" -- so that garbd can be started after galera is started, regardless of whether galera is in slave or master mode
* Make the original constraint asymmetrical, so that garbd is still started after galera is promoted to master, but if galera is demoted (or even stopped), garbd can still run

If so, that would be a perfectly good workaround for this issue, although we will still want to address the underlying pacemaker catch-22 somehow.
If the problem we're trying to solve is "we'd like to not restart garbd unnecessarily", then I see no reason why we can't leave it running. It can even be running before galera-master is started / promoted. It would just be up to the resource agent to be more closely familiar with the states of garbd, so it can distinguish "garbd is failing, kill it" from "garbd is just waiting for other nodes to come online".
(In reply to Michael Bayer from comment #41)
> If the problem we're trying to solve is, "we'd like to not unnecessarily
> restart garbd", then I see no reason why we can't leave it running. But
> also it can be running before galera-master is started / promoted as well.
> It would just be up to the resource agent to be more closely familiar with
> the states of garbd, to distinguish "garbd is failing, kill it" from "garbd
> is just waiting for other nodes to come online".

@Damien Ciabrini,

How would the garbd resource agent currently behave in those situations? I.e. will it report an error if garbd is running with one of the galera nodes in slave mode, or with galera not running at all?
(In reply to Ken Gaillot from comment #40)
> * Make the original constraint asymmetrical, so that garbd is still started
> after galera is promoted to master, but if galera is demoted (or even
> stopped), garbd can still run

Yep, that's what I suggested on IRC late last night :-)

The garbd monitor is written such that it will be considered failed if there are no tcp sockets in the CONNECTED state. So in a split-brain situation garbd would stay up, because it is connected to the other controller.

So:

pcs -f garbd.xml constraint order promote galera-master then start garbd

becomes:

pcs -f garbd.xml constraint order promote galera-master then start garbd symmetrical=false
Closing. The AF document was updated, and since this was a custom deployment, no change is required in director or any other component.