Bug 1284526

Summary: HA | Galera fails to start after rebooting one of the controllers.

Product: Red Hat Enterprise Linux 7
Reporter: Leonid Natapov <lnatapov>
Component: resource-agents
Assignee: Damien Ciabrini <dciabrin>
Status: CLOSED ERRATA
QA Contact: Leonid Natapov <lnatapov>
Severity: high
Priority: high
Docs Contact:
Version: 7.0
CC: abeekhof, agk, bschmaus, cluster-maint, dciabrin, dmaley, ebarrera, fdinitto, jherrman, jruemker, jschluet, kgaillot, lnatapov, mbayer, mburns, mcornea, mschuppe, oblaut, pablo.iranzo, phagara, rfreire, royoung, sknauss, srevivo, ushkalim
Target Milestone: pre-dev-freeze
Keywords: Regression, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: resource-agents-3.9.5-68.el7
Doc Type: Bug Fix
Doc Text:
Previously, if a Galera node was unexpectedly terminated while a replication process was in progress, the MariaDB database could not be restarted on the node afterwards. With this update, the resource agent passes proper recovery arguments to MariaDB, which ensures that in the described circumstances, MariaDB can restart and synchronize to the Galera cluster as expected.
Story Points: ---
Clone Of:
: 1318744 (view as bug list)
Environment:
Last Closed: 2016-11-03 23:59:33 UTC
Type: Bug
Bug Blocks: 1318744    
Attachments:
  logs (no flags)
  mysql log files (no flags)
  gdb test script to trigger the condition (no flags)

Description Leonid Natapov 2015-11-23 14:22:31 UTC
Created attachment 1097683 [details]
logs

HA deployment with 3 controllers and 2 computes. Sometimes, when one of the controllers is rebooted, Galera fails to start after the controller comes back up.


How reproducible:
80%

Steps to Reproduce:
1. Deploy HA with 3 controllers and 1 or 2 computes.
2. Check that everything is up and running.
3. Reboot one of the controllers.

Actual result:
Galera fails to start.

Expected result:
Galera starts successfully.

Failed Actions:
* galera_start_0 on overcloud-controller-0 'unknown error' (1): call=210, status=complete, exitreason='Unable to detect last known write sequence number',
    last-rc-change='Mon Nov 23 08:52:58 2015', queued=0ms, exec=7162ms


Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 ]


mariadb.log
-----------

151123  8:50:41  InnoDB: Starting shutdown...
151123  8:50:42  InnoDB: Shutdown completed; log sequence number 701494371
151123  8:50:42 [Note] /usr/libexec/mysqld: Shutdown complete'
151123 08:52:59 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
151123 08:52:59 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.YEorJG' --pid-file='/var/lib/mysql/overcloud-controller-0.localdomain-recover.pid'
151123  8:52:59 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
151123  8:52:59 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
151123  8:52:59 [Warning] Could not increase number of max_open_files to more than 1024 (request: 4907)
151123 08:53:05 mysqld_safe WSREP: Failed to recover position: 
'151123  8:52:59 InnoDB: The InnoDB memory heap is disabled
151123  8:52:59 InnoDB: Mutexes and rw_locks use GCC atomic builtins
151123  8:52:59 InnoDB: Compressed tables use zlib 1.2.7
151123  8:52:59 InnoDB: Using Linux native AIO
151123  8:52:59 InnoDB: Initializing buffer pool, size = 128.0M
151123  8:52:59 InnoDB: Completed initialization of buffer pool
151123  8:52:59 InnoDB: highest supported file format is Barracuda.
InnoDB: Transaction 20DE95 was in the XA prepared state.
InnoDB: 1 transaction(s) which must be rolled back or cleaned up
InnoDB: in total 0 row operations to undo
InnoDB: Trx id counter is 20E000
InnoDB: Starting in background the rollback of uncommitted transactions
151123  8:53:02  InnoDB: Rollback of non-prepared transactions completed
151123  8:53:02  InnoDB: Waiting for the background threads to start
151123  8:53:03 Percona XtraDB (http://www.percona.com) 5.5.41-MariaDB-37.0 started; log sequence number 701494371
151123  8:53:03 [Note] Plugin 'FEEDBACK' is disabled.
151123  8:53:03  InnoDB: Starting recovery for XA transactions...
151123  8:53:03  InnoDB: Transaction 20DE95 in prepared state after recovery
151123  8:53:03  InnoDB: Transaction contains changes to 1 rows
151123  8:53:03  InnoDB: 1 transactions in prepared state after recovery
151123  8:53:03 [Note] Found 1 prepared transaction(s) in InnoDB
151123  8:53:03 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.
151123  8:53:03 [ERROR] Aborting

151123  8:53:04  InnoDB: Starting shutdown...
151123  8:53:05  InnoDB: Shutdown completed; log sequence number 701494371
151123  8:53:05 [Note] /usr/libexec/mysqld: Shutdown complete'

logs attached:
/etc/my.cnf
/etc/my.cnf.d/*
/var/log/mariadb/*
/var/log/mysqld.log
/var/log/pacemaker.log
/var/log/nova/*

Comment 1 Michael Bayer 2015-11-23 14:28:24 UTC
The attached logs show this particular node being started successfully. Pacemaker starts MySQL with logging to mysqld.log, and the log ends with the node being SYNCED, hence it is started.

Have you tried simply running "pcs cleanup galera" when this occurs, just to get the resource agent to check again?
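
For reference, a quick way to confirm the sync state from the MySQL side (a sketch, assuming the local mysql client can connect to the running server):

    mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"
    # a fully joined node reports:
    #   Variable_name              Value
    #   wsrep_local_state_comment  Synced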

Comment 2 Leonid Natapov 2015-11-23 14:39:58 UTC
Yes, I have tried. No success.

Here is the output of /var/log/messages while performing "pcs resource cleanup galera":

Nov 23 09:36:04 overcloud-controller-0 su: (to rabbitmq) root on none
Nov 23 09:36:04 overcloud-controller-0 systemd: Started Session c29 of user rabbitmq.
Nov 23 09:36:04 overcloud-controller-0 systemd: Starting Session c29 of user rabbitmq.
Nov 23 09:36:05 overcloud-controller-0 su: (to rabbitmq) root on none
Nov 23 09:36:05 overcloud-controller-0 systemd: Started Session c30 of user rabbitmq.
Nov 23 09:36:05 overcloud-controller-0 systemd: Starting Session c30 of user rabbitmq.
Nov 23 09:36:05 overcloud-controller-0 crmd[3509]:  notice: Notifications disabled
Nov 23 09:36:05 overcloud-controller-0 crmd[3509]:  notice: Notifications disabled
Nov 23 09:36:06 overcloud-controller-0 galera(galera)[9443]: INFO: MySQL is not running
Nov 23 09:36:06 overcloud-controller-0 crmd[3509]:  notice: Operation galera_monitor_0: not running (node=overcloud-controller-0, call=234, rc=7, cib-update=81, confirmed=true)
Nov 23 09:36:07 overcloud-controller-0 galera(galera)[9501]: ERROR: MySQL is not running
Nov 23 09:36:07 overcloud-controller-0 su: (to mysql) root on none
Nov 23 09:36:07 overcloud-controller-0 systemd: Created slice user-27.slice.
Nov 23 09:36:07 overcloud-controller-0 systemd: Starting user-27.slice.
Nov 23 09:36:07 overcloud-controller-0 systemd: Started Session c31 of user mysql.
Nov 23 09:36:07 overcloud-controller-0 systemd: Starting Session c31 of user mysql.
Nov 23 09:36:07 overcloud-controller-0 su: (to mysql) root on none
Nov 23 09:36:07 overcloud-controller-0 systemd: Removed slice user-27.slice.
Nov 23 09:36:07 overcloud-controller-0 systemd: Stopping user-27.slice.
Nov 23 09:36:07 overcloud-controller-0 systemd: Created slice user-27.slice.
Nov 23 09:36:07 overcloud-controller-0 systemd: Starting user-27.slice.
Nov 23 09:36:07 overcloud-controller-0 systemd: Started Session c32 of user mysql.
Nov 23 09:36:07 overcloud-controller-0 systemd: Starting Session c32 of user mysql.
Nov 23 09:36:07 overcloud-controller-0 galera(galera)[9501]: INFO: attempting to detect last commit version by reading /var/lib/mysql/grastate.dat
Nov 23 09:36:07 overcloud-controller-0 galera(galera)[9501]: INFO: now attempting to detect last commit version using 'mysqld_safe --wsrep-recover'
Nov 23 09:36:09 overcloud-controller-0 galera(galera)[9501]: ERROR: Unable to detect last known write sequence number
Nov 23 09:36:09 overcloud-controller-0 lrmd[3506]:  notice: galera_start_0:9501:stderr [ ocf-exit-reason:Unable to detect last known write sequence number ]
Nov 23 09:36:09 overcloud-controller-0 crmd[3509]:  notice: Operation galera_start_0: unknown error (node=overcloud-controller-0, call=235, rc=1, cib-update=82, confirmed=true)
Nov 23 09:36:09 overcloud-controller-0 crmd[3509]:  notice: overcloud-controller-0-galera_start_0:235 [ ocf-exit-reason:Unable to detect last known write sequence number\n ]
Nov 23 09:36:10 overcloud-controller-0 galera(galera)[10508]: INFO: MySQL is not running
Nov 23 09:36:10 overcloud-controller-0 crmd[3509]:  notice: Operation galera_stop_0: ok (node=overcloud-controller-0, call=236, rc=0, cib-update=83, confirmed=true)
Nov 23 09:36:12 overcloud-controller-0 os-collect-config: 2015-11-23 09:36:12.395 3594 WARNING os-collect-config [-] Source [request] Unavailable.
Nov 23 09:36:12 overcloud-controller-0 os-collect-config: 2015-11-23 09:36:12.397 3594 WARNING os_collect_config.local [-] /var/lib/os-collect-config/local-data not found. Skipping
Nov 23 09:36:12 overcloud-controller-0 os-collect-config: 2015-11-23 09:36:12.397 3594 WARNING os_collect_config.local [-] No local metadata found (['/var/lib/os-collect-config/local-data'])
Nov 23 09:36:12 overcloud-controller-0 os-collect-config: 2015-11-23 09:36:12.397 3594 WARNING os_collect_config.zaqar [-] No auth_url configured.
Nov 23 09:36:15 overcloud-controller-0 su: (to rabbitmq) root on none
Nov 23 09:36:15 overcloud-controller-0 systemd: Started Session c33 of user rabbitmq.
Nov 23 09:36:15 overcloud-controller-0 systemd: Starting Session c33 of user rabbitmq.
Nov 23 09:36:15 overcloud-controller-0 su: (to rabbitmq) root on none
Nov 23 09:36:15 overcloud-controller-0 systemd: Started Session c34 of user rabbitmq.

Comment 3 Michael Bayer 2015-11-23 14:42:20 UTC
OK, can you please attach /var/log/mysqld.log from all three nodes?

Comment 4 Leonid Natapov 2015-11-24 06:28:04 UTC
Created attachment 1098007 [details]
mysql log files

MySQL log files from the 3 controllers.

Comment 5 Damien Ciabrini 2015-11-24 09:06:12 UTC
It looks like controller0 was rebooted in the middle of a transaction. This has the following effects:

 . Since mysqld wasn't shut down properly, the file /var/lib/mysql/grastate.dat (which normally holds the local node's replication state) couldn't be written (see the sample contents after this list).

 . When the resource agent tries to restart MySQL after controller0 has rebooted, it first tries unconditionally to retrieve the last commit on the node. grastate.dat doesn't contain that info, so the resource agent starts mysqld_safe in "recovery mode" to find it.

 . mysqld has a pending transaction and refuses to go ahead until that transaction is either committed or rolled back manually.

 . The resource agent cannot retrieve the last commit, ends up in error, and this prevents the Galera node from coming back up on controller0.
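
For illustration, a grastate.dat left behind by an unclean shutdown typically carries a seqno of -1, which is why the agent cannot read the last commit from it (sample contents only; the uuid here is made up and the exact layout varies with the Galera version):

    # GALERA saved state
    version: 2.1
    uuid:    6b7864a2-91b2-11e5-0000-000000000000
    seqno:   -1
    cert_index: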


To fix the issue, you have to force-commit the pending transaction manually, and then clean up the resource state in Pacemaker.

 controller0# mysqld_safe --user=mysql --datadir=/var/lib/mysql --tc-heuristic-recover commit --wsrep-recover

 controller0# pcs cleanup galera

Can you apply these steps and tell me if they fix the issue?

Comment 6 Leonid Natapov 2015-11-24 09:45:11 UTC
[root@overcloud-controller-0 ~]# mysqld_safe --user=mysql --datadir=/var/lib/mysql --tc-heuristic-recover commit --wsrep-recover


151124 04:21:56 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
/bin/mysqld_safe: line 728: ulimit: -1: invalid option
ulimit: usage: ulimit [-SHacdefilmnpqrstuvx] [limit]
151124 04:21:56 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
151124 04:21:56 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.yksri9' --pid-file='/var/lib/mysql/overcloud-controller-0.localdomain-recover.pid'
151124 04:21:58 mysqld_safe WSREP: Failed to recover position: 
'151124  4:21:56 InnoDB: The InnoDB memory heap is disabled
151124  4:21:56 InnoDB: Mutexes and rw_locks use GCC atomic builtins
151124  4:21:56 InnoDB: Compressed tables use zlib 1.2.7
151124  4:21:56 InnoDB: Using Linux native AIO
151124  4:21:56 InnoDB: Initializing buffer pool, size = 128.0M
151124  4:21:56 InnoDB: Completed initialization of buffer pool
151124  4:21:56 InnoDB: highest supported file format is Barracuda.
InnoDB: Transaction 20DE95 was in the XA prepared state.
InnoDB: 1 transaction(s) which must be rolled back or cleaned up
InnoDB: in total 0 row operations to undo
InnoDB: Trx id counter is 20E000
InnoDB: Starting in background the rollback of uncommitted transactions
151124  4:21:56  InnoDB: Rollback of non-prepared transactions completed
151124  4:21:56  InnoDB: Waiting for the background threads to start
151124  4:21:57 Percona XtraDB (http://www.percona.com) 5.5.41-MariaDB-37.0 started; log sequence number 701494371
151124  4:21:57 [Note] Plugin 'FEEDBACK' is disabled.
151124  4:21:57  InnoDB: Starting recovery for XA transactions...
151124  4:21:57  InnoDB: Transaction 20DE95 in prepared state after recovery
151124  4:21:57  InnoDB: Transaction contains changes to 1 rows
151124  4:21:57  InnoDB: 1 transactions in prepared state after recovery
151124  4:21:57 [Note] Found 1 prepared transaction(s) in InnoDB
151124  4:21:57 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.
151124  4:21:57 [ERROR] Aborting

151124  4:21:57  InnoDB: Starting shutdown...
151124  4:21:58  InnoDB: Shutdown completed; log sequence number 701494371
151124  4:21:58 [Note] /usr/libexec/mysqld: Shutdown complete'

Comment 7 Damien Ciabrini 2015-11-24 10:52:13 UTC
After a second try, the following commands unblocked the cluster as expected:

# mysqld_safe --user=mysql --datadir=/var/lib/mysql --wsrep-recover --tc-heuristic-recover rollback
# pcs cleanup galera
# pcs status resources | grep -A1 galera
 Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Comment 8 Michael Bayer 2015-11-24 14:50:29 UTC
OK, but a graceful reboot should not cause this issue; controller0 was forcibly killed via kill -9 or similar. Can we please confirm that pacemaker is configured to not fence this node with a kill?
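
One way to inspect the fencing setup from any cluster node (a sketch; adjust to the pcs version in use):

    # check cluster-wide fencing properties
    pcs property show | grep -i stonith
    # list configured fence devices and their parameters
    pcs stonith show --full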

Comment 9 Michael Bayer 2015-11-24 14:55:39 UTC
My browser cached a stale page here, so I've put together an analysis that doesn't take Damien's information into account. Nevertheless, it sums up what we need to ask about the config on this node if we are in fact looking for a "bug", because the behavior "node was killed forcibly (e.g. kill -9, power cycle) and goes into tc-heuristic-recover" is *not* a bug; it is a known behavior.

So far here's what we can see:

1. controller0 is the one with the problem

2. controller0 was under pacemaker control, then stopped normally under pacemaker, at 151123 6:11:20.

3. controller0 was then started manually, either via systemctl or mysqld_safe, without using pacemaker at all, at 151123 06:17:49.

4. controller0 was shut down normally at 151123 06:17:59.

5. controller0 was started under pacemaker control at 151123 06:19:49.

6. the node was forcibly killed at some unknown time after 151123 6:20:03 and before 151123 08:21:51. Reasons for this generally include:
 

  a. the server was kill -9'ed by the user.

  b. the server was power-cycled or otherwise subject to some sudden system-wide crash.

  c. Pacemaker fenced the node but was not configured for "on-fail=block". The pacemaker.log we have here does not illustrate this; it appears that it was logging to corosync.log.


7. after this point, pacemaker no longer attempts to do anything with this node according to the log, because there are no more log messages in this mysqld.log (unless, very unusually, the resource agent was somehow reconfigured to use mariadb.log on *just* this node and not the others, as they keep logging to mysqld.log). I don't know pacemaker that well, but it seems like this might be because it considers the node to be fenced (need confirmation on that from the pacemaker people).

8. we do see attempts to start mysqld using systemctl; here we pick up again in mariadb.log, where the node has crashed due to the tc-heuristic-recover issue, which was caused by one of the events in #6. The original bug description shows this as the symptom, but it is not a bug: it is known mariadb/galera behavior when a node is killed ungracefully, and it is not caused by normal reboots or by pacemaker when properly configured.

9. It's not clear why the original description says "reboot the node" when we then see all these systemctl starts, so can we ensure that the mariadb service is **not** enabled under systemctl:

    $ systemctl status mariadb
      ● mariadb.service - MariaDB 10.0 database server
      Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset: disabled)
      Active: inactive (dead)


i.e. in the output above we want to see "disabled;".

Immediate resolution of the cluster issue:

1. go to controller0

2. start manually with --tc-heuristic-recover=rollback:

    mysqld_safe --tc-heuristic-recover=rollback

3. shutdown:

    mysqladmin shutdown

4. get pacemaker to unfence/unblock the node with "pcs cleanup galera", though I'm not sure about the fencing part.



Remaining questions:

1. what happened to the node in step 6 above between 151123 6:20:03 and 151123 08:21:51? What kind of "shutdown" are we talking about?

2. why didn't pacemaker attempt to look at this node at all after 6:20? 

   a. is it fenced?

   b. was it unmanaged or otherwise taken out of pacemaker control?

3. is the pacemaker configuration appropriately configured with on-fail=block?  Need to see pacemaker configuration files.

4. Confirm that all the remaining mariadb activity here on controller0 was manual attempts to restart the node using systemctl and/or mysqld_safe.

5. What is the "bug" being reported here: that pacemaker inappropriately force-killed the node, or the tc-heuristic-recover issue itself, which is not a bug by itself?

Next steps:

1. if the issue is only the tc-heuristic-recover state after the node was known to be forcibly killed, there is no bug. Run the steps above to recover it.

2. if the issue is that the galera node was put into this forcibly-killed state as a result of a graceful shutdown, we need to see all pacemaker activity and configuration. If we don't have access to these servers, we'd need SOS reports.
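
A sketch of what to gather (tools shipped with pacemaker/sos; exact options may vary by version):

    # full pacemaker resource/constraint configuration
    pcs config > /tmp/pcs-config.txt
    # cluster-wide log collection around the incident window
    crm_report -f "2015-11-23 06:00:00" -t "2015-11-23 09:00:00"
    # or a full sosreport from each controller
    sosreport --batch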

Comment 10 Leonid Natapov 2015-11-24 20:25:37 UTC
1. The initial shutdown of the controller was done by executing the "reboot" command.
2. pacemaker is configured as follows:
Master: galera-master
  Resource: galera (class=ocf provider=heartbeat type=galera)
   Operations: start interval=0s timeout=120 (galera-start-interval-0s)
               stop interval=0s timeout=120 (galera-stop-interval-0s)
               monitor interval=20 timeout=30 (galera-monitor-interval-20)
               monitor interval=10 role=Master timeout=30 (galera-monitor-interval-10)
               monitor interval=30 role=Slave timeout=30 (galera-monitor-interval-30)
               demote interval=0s timeout=120 (galera-demote-interval-0s)
               promote interval=0s timeout=300s on-fail=block (galera-promote-interval-0s)
  promote galera-master then start openstack-keystone-clone (kind:Mandatory) (id:order-galera-master-openstack-keystone-clone-mandatory)
3. Confirming that the remaining mariadb activity was manual attempts to restart the node using systemctl.
4. At the same time as the galera issue, there was also an issue with redis due to another bug (redis can't start after rebooting the host or stopping the cluster).

Here is the pacemaker configuration for redis:
[root@overcloud-controller-0 heat-admin]# pcs config | grep redis
 Master: redis-master
  Resource: redis (class=ocf provider=heartbeat type=redis)
   Operations: start interval=0s timeout=120 (redis-start-interval-0s)
               stop interval=0s timeout=120 (redis-stop-interval-0s)
               monitor interval=45 timeout=60 (redis-monitor-interval-45)
               monitor interval=20 role=Master timeout=60 (redis-monitor-interval-20)
               monitor interval=60 role=Slave timeout=60 (redis-monitor-interval-60)
               promote interval=0s timeout=120 (redis-promote-interval-0s)
               demote interval=0s timeout=120 (redis-demote-interval-0s)
  promote redis-master then start openstack-ceilometer-central-clone (kind:Mandatory) (Options: require-all=false) (id:order-redis-master-openstack-ceilometer-central-clone-mandatory)
 redis_REPL_INFO: overcloud-controller-2

controller-0 was consistently fenced every several minutes, maybe because redis was unable to start?

In any case, and Damien agreed with me, we probably have to make the galera recovery process automatic in case of a reboot, power-off, or power failure of a cluster node. I expect us to be able to recover all of the cluster's services without manual intervention.

Comment 11 Michael Bayer 2015-11-24 21:29:15 UTC
OK, but still not answered:

1. what happened to the node after 6:20 such that it just ended and/or was kill -9'ed? An SOS report might show us better what exactly happened when the node was rebooted. What time was the node rebooted? Did this mariadb crash occur before the reboot?

2. When the node came up again, why didn't pacemaker even *attempt* to start the node? Even after pcs cleanup, usually we would see logs in mysqld.log showing mariadb failing to start due to the tc-heuristic-recover issue. We would *not* see anything in mariadb/mariadb.log at all.

Basically, the tc-heuristic-recover message should *never occur* unless a node is power-cycled, crashes, or the service is kill -9'ed. There are lots of other unrecoverable things that can happen from an ungraceful kill as well, so just working around this one condition won't fix a bug that is ungracefully killing galera nodes.

Comment 12 Leonid Natapov 2015-11-25 04:49:47 UTC
1. The answer to the first question may be that the node was fenced and stayed shut down until manually started. See https://bugzilla.redhat.com/show_bug.cgi?id=1194301

In any case I will collect and attach an SOS report ASAP.

Comment 15 Ken Gaillot 2015-11-30 18:19:47 UTC
(In reply to Michael Bayer from comment #8)
> OK but a graceful reboot should not cause this issue - controller0 was
> forcibly killed via kill -9 or similar.  Can we please confirm that
> pacemaker is configured to not fence this node with a kill?

FYI, fencing is always non-graceful. Usually it's implemented as an immediate power shutoff (in this case, via IPMI). That's because it's designed to handle cases where the node is unreachable or unresponsive. What we want in this case is a graceful shutdown of cluster services, not fencing.

I'm still looking through the logs to see what happened here, but the proper procedure for rebooting a cluster node is documented at https://access.redhat.com/articles/2059253 (put the node into standby mode, then stop cluster services, before rebooting).
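
In rough terms (a sketch; exact pcs subcommands vary with the pcs version):

    # before rebooting overcloud-controller-0
    pcs cluster standby overcloud-controller-0
    pcs cluster stop overcloud-controller-0
    reboot
    # once the node is back up
    pcs cluster start overcloud-controller-0
    pcs cluster unstandby overcloud-controller-0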

Comment 16 Ken Gaillot 2015-11-30 19:14:05 UTC
On Nov 23, Pacemaker repeatedly fenced overcloud-controller-0 (most immediately around the reported issue at 08:40:45, 08:48:29, and 08:51:10).

The sequence is: someone presses the power button on overcloud-controller-0, everything begins to shut down, but neutron-ovs-cleanup fails to stop, so the node is fenced. Since neutron-openvswitch-agent depends (indirectly) on both neutron-ovs-cleanup and galera-master, galera is not stopped before the fence.

As an example, the 08:40 event begins with:

Nov 23 08:40:41 overcloud-controller-0 systemd-logind: Power key pressed.

Pacemaker initiates a stop of neutron-ovs-cleanup at 08:40:44, which returns a "generic error" (exit status 1) one second later, so Pacemaker power-cycles the node. Unfortunately, we don't have logs from this time on overcloud-controller-0, likely due to write caches lost at the fence event.

So the cause, and the unanswered question, is why the attempt to stop neutron-ovs-cleanup immediately fails on overcloud-controller-0. One way to test would be to turn on maintenance mode, then run the resource agent manually to stop neutron-ovs-cleanup. Ideally these should be stopped first, in this order: neutron-metadata-agent, neutron-l3-agent, neutron-dhcp-agent, neutron-openvswitch-agent, neutron-netns-cleanup.
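
For example (a sketch; the resource id is assumed, so adjust it to whatever "pcs status" shows, e.g. neutron-ovs-cleanup-clone):

    # tell Pacemaker to leave all resources alone while we poke at them
    pcs property set maintenance-mode=true
    # run the stop action by hand and show the agent's output and exit code
    pcs resource debug-stop neutron-ovs-cleanup
    # return the cluster to normal management afterwards
    pcs property set maintenance-mode=false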

BTW, hardware design can mitigate the effects of events like this, i.e. use hardware RAID with power backup (BBU or supercapacitor), and disable write caching at all levels other than the RAID card. That can help with both the logs and the mariadb state files.

Comment 17 Fabio Massimo Di Nitto 2016-01-18 08:07:57 UTC
Andrew, this sounds like the same problem we are seeing now with the undercloud trying to shut down an overcloud node. Do you mind checking the logs to see whether you see the same behavior as in the scale lab?

Comment 18 Andrew Beekhof 2016-01-18 12:13:43 UTC
I see a lot of the oslo messaging bug in those logs, but I'd need the undercloud's logs to see if it matches the reboot cycle we've seen elsewhere.

Comment 19 Leonid Natapov 2016-01-20 06:35:01 UTC
Andrew,what logs do you need from the undercloud machine ?
I will try to reproduce it and attach the logs.

Comment 20 Fabio Massimo Di Nitto 2016-01-20 09:10:37 UTC
(In reply to Leonid Natapov from comment #19)
> Andrew,what logs do you need from the undercloud machine ?
> I will try to reproduce it and attach the logs.

Just grab an sosreport. It's the easiest.

Comment 21 Leonid Natapov 2016-01-20 17:09:42 UTC
(In reply to Andrew Beekhof from comment #18)
> I see a lot of the oslo messaging bug in those logs, but i'd need the
> underclouds logs to see if it matches the reboot cycle we've seen elsewhere

I don't have a working environment at the moment. I will try to reproduce and attach SOS reports from the undercloud machines as soon as possible.

Comment 22 Jon Schlueter 2016-02-09 14:55:22 UTC
Moving out to the 8.0.z release, since this is not a blocker for GA and is not likely to land by then.

Comment 27 Damien Ciabrini 2016-03-18 10:24:48 UTC
Created attachment 1137738 [details]
gdb test script to trigger the condition

Comment 28 Damien Ciabrini 2016-03-18 10:26:14 UTC
Instructions for testing:

Test #1: node restarts when it is killed in the middle of a replication

1- Start the galera resource on the cluster

    pcs resource enable galera

2- On any node, create a temporary table for the test. It will get
replicated by galera to all the nodes.

    mysql -e 'drop database if exists racts; drop table if exists racts.break; create database racts; create table racts.break (i int) engine=innodb;'

3- Prepare to trigger the initial bug condition. Go to controller#3
and gdb the running galera server with the attached script:

    gdb -x /tmp/kill-during-txn.gdb -p $(cat /var/run/mysql/mysqld.pid)

4- On any node, write something to the cluster. The write will get
replicated to controller#3, and the gdb script will kill the server in
the middle of integrating the write set.

    mysql -e 'insert into racts.break values (42);'

Pacemaker monitoring detects that the node is not running, logs an error,
and tries to restart it. The restart won't succeed, and the logs should
exhibit the error message reported in the bug.

5- Load the new resource agent and clean up the pacemaker resource state
(see the sketch after the command below for one way to load the agent).

    pcs resource cleanup galera
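
Loading the agent itself (done before the cleanup above) is not spelled out here; one way, assuming the fix ships in the resource-agents package (the patched-file path is hypothetical):

    # either update to the build that contains the fix...
    yum update resource-agents
    # ...or drop a patched agent script over the packaged one
    cp /path/to/patched/galera /usr/lib/ocf/resource.d/heartbeat/galera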

Pacemaker will try to restart galera on controller#3 and will succeed.
In the journal, one can see the resource agent logs that trigger the recovery:

    Mar 18 11:04:27 db3 galera(galera)[11322]: WARNING: local node <db3> was not shutdown properly. Rollback stuck transaction with --tc-heuristic-recover
    Mar 18 11:04:32 db3 galera(galera)[11932]: WARNING: State recovered. force SST at next restart for full resynchronization

After the restart, the write test is correctly replicated on controller#3:

    [root@db3 ~]# mysql -e 'select * from racts.break;'
    +------+
    | i    |
    +------+
    |   42 |
    +------+

Test #2: cluster can be bootstrapped when a node is blocked by a pending replication.

With the old resource-agent code, repeat steps 1- to 4-

5- Stop all the galera nodes, and clean up the resource status in pacemaker:

    pcs resource disable galera
    pcs resource cleanup galera

6- Install the new resource-agents and restart the galera cluster

    pcs resource enable galera

The entire cluster should come up, and recovery logs should appear
in the journal on controller#3.


Housekeeping after the tests, when all galera nodes are up

    mysql -e 'drop database if exists racts;'

Comment 30 Leonid Natapov 2016-04-06 08:32:53 UTC
resource-agents-3.9.5-68.el7
ospd8

Was unable to reproduce the issue.

Comment 39 errata-xmlrpc 2016-11-03 23:59:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2174.html