Bug 1284526
Summary: | HA | Galera fails to start after rebooting one of the controllers. | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Leonid Natapov <lnatapov> | ||||||||
Component: | resource-agents | Assignee: | Damien Ciabrini <dciabrin> | ||||||||
Status: | CLOSED ERRATA | QA Contact: | Leonid Natapov <lnatapov> | ||||||||
Severity: | high | Docs Contact: | |||||||||
Priority: | high | ||||||||||
Version: | 7.0 | CC: | abeekhof, agk, bschmaus, cluster-maint, dciabrin, dmaley, ebarrera, fdinitto, jherrman, jruemker, jschluet, kgaillot, lnatapov, mbayer, mburns, mcornea, mschuppe, oblaut, pablo.iranzo, phagara, rfreire, royoung, sknauss, srevivo, ushkalim | ||||||||
Target Milestone: | pre-dev-freeze | Keywords: | Regression, ZStream | ||||||||
Target Release: | --- | ||||||||||
Hardware: | Unspecified | ||||||||||
OS: | Unspecified | ||||||||||
Whiteboard: | |||||||||||
Fixed In Version: | resource-agents-3.9.5-68.el7 | Doc Type: | Bug Fix | ||||||||
Doc Text: |
Previously, if a Galera node was unexpectedly terminated while a replication process was in progress, the MariaDB database could not be restarted on the node afterwards. With this update, the resource agent passes proper recovery arguments to MariaDB, which ensures that in the described circumstances, MariaDB can restart and synchronize to the Galera cluster as expected.
|
Story Points: | --- | ||||||||
Clone Of: | |||||||||||
: | 1318744 (view as bug list) | Environment: | |||||||||
Last Closed: | 2016-11-03 23:59:33 UTC | Type: | Bug | ||||||||
Regression: | --- | Mount Type: | --- | ||||||||
Documentation: | --- | CRM: | |||||||||
Verified Versions: | Category: | --- | |||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||
Embargoed: | |||||||||||
Bug Depends On: | |||||||||||
Bug Blocks: | 1318744 | ||||||||||
Attachments: |
|
Description
Leonid Natapov
2015-11-23 14:22:31 UTC
The attached logs show this particular node being started successfully. Pacemaker starts MySQL with logging to mysqld.log, and the log ends with the node being SYNCED, hence it's started. Have you tried just emitting a simple "pcs cleanup galera" when this occurs, just to get the resource agent to check again?

Yes, I have tried. No success. Here is the output of /var/log/messages while performing pcs resource cleanup galera:

Nov 23 09:36:04 overcloud-controller-0 su: (to rabbitmq) root on none
Nov 23 09:36:04 overcloud-controller-0 systemd: Started Session c29 of user rabbitmq.
Nov 23 09:36:04 overcloud-controller-0 systemd: Starting Session c29 of user rabbitmq.
Nov 23 09:36:05 overcloud-controller-0 su: (to rabbitmq) root on none
Nov 23 09:36:05 overcloud-controller-0 systemd: Started Session c30 of user rabbitmq.
Nov 23 09:36:05 overcloud-controller-0 systemd: Starting Session c30 of user rabbitmq.
Nov 23 09:36:05 overcloud-controller-0 crmd[3509]: notice: Notifications disabled
Nov 23 09:36:05 overcloud-controller-0 crmd[3509]: notice: Notifications disabled
Nov 23 09:36:06 overcloud-controller-0 galera(galera)[9443]: INFO: MySQL is not running
Nov 23 09:36:06 overcloud-controller-0 crmd[3509]: notice: Operation galera_monitor_0: not running (node=overcloud-controller-0, call=234, rc=7, cib-update=81, confirmed=true)
Nov 23 09:36:07 overcloud-controller-0 galera(galera)[9501]: ERROR: MySQL is not running
Nov 23 09:36:07 overcloud-controller-0 su: (to mysql) root on none
Nov 23 09:36:07 overcloud-controller-0 systemd: Created slice user-27.slice.
Nov 23 09:36:07 overcloud-controller-0 systemd: Starting user-27.slice.
Nov 23 09:36:07 overcloud-controller-0 systemd: Started Session c31 of user mysql.
Nov 23 09:36:07 overcloud-controller-0 systemd: Starting Session c31 of user mysql.
Nov 23 09:36:07 overcloud-controller-0 su: (to mysql) root on none
Nov 23 09:36:07 overcloud-controller-0 systemd: Removed slice user-27.slice.
Nov 23 09:36:07 overcloud-controller-0 systemd: Stopping user-27.slice.
Nov 23 09:36:07 overcloud-controller-0 systemd: Created slice user-27.slice.
Nov 23 09:36:07 overcloud-controller-0 systemd: Starting user-27.slice.
Nov 23 09:36:07 overcloud-controller-0 systemd: Started Session c32 of user mysql.
Nov 23 09:36:07 overcloud-controller-0 systemd: Starting Session c32 of user mysql.
Nov 23 09:36:07 overcloud-controller-0 galera(galera)[9501]: INFO: attempting to detect last commit version by reading /var/lib/mysql/grastate.dat
Nov 23 09:36:07 overcloud-controller-0 galera(galera)[9501]: INFO: now attempting to detect last commit version using 'mysqld_safe --wsrep-recover'
Nov 23 09:36:09 overcloud-controller-0 galera(galera)[9501]: ERROR: Unable to detect last known write sequence number
Nov 23 09:36:09 overcloud-controller-0 lrmd[3506]: notice: galera_start_0:9501:stderr [ ocf-exit-reason:Unable to detect last known write sequence number ]
Nov 23 09:36:09 overcloud-controller-0 crmd[3509]: notice: Operation galera_start_0: unknown error (node=overcloud-controller-0, call=235, rc=1, cib-update=82, confirmed=true)
Nov 23 09:36:09 overcloud-controller-0 crmd[3509]: notice: overcloud-controller-0-galera_start_0:235 [ ocf-exit-reason:Unable to detect last known write sequence number\n ]
Nov 23 09:36:10 overcloud-controller-0 galera(galera)[10508]: INFO: MySQL is not running
Nov 23 09:36:10 overcloud-controller-0 crmd[3509]: notice: Operation galera_stop_0: ok (node=overcloud-controller-0, call=236, rc=0, cib-update=83, confirmed=true)
Nov 23 09:36:12 overcloud-controller-0 os-collect-config: 2015-11-23 09:36:12.395 3594 WARNING os-collect-config [-] Source [request] Unavailable.
Nov 23 09:36:12 overcloud-controller-0 os-collect-config: 2015-11-23 09:36:12.397 3594 WARNING os_collect_config.local [-] /var/lib/os-collect-config/local-data not found. Skipping
Nov 23 09:36:12 overcloud-controller-0 os-collect-config: 2015-11-23 09:36:12.397 3594 WARNING os_collect_config.local [-] No local metadata found (['/var/lib/os-collect-config/local-data'])
Nov 23 09:36:12 overcloud-controller-0 os-collect-config: 2015-11-23 09:36:12.397 3594 WARNING os_collect_config.zaqar [-] No auth_url configured.
Nov 23 09:36:15 overcloud-controller-0 su: (to rabbitmq) root on none
Nov 23 09:36:15 overcloud-controller-0 systemd: Started Session c33 of user rabbitmq.
Nov 23 09:36:15 overcloud-controller-0 systemd: Starting Session c33 of user rabbitmq.
Nov 23 09:36:15 overcloud-controller-0 su: (to rabbitmq) root on none
Nov 23 09:36:15 overcloud-controller-0 systemd: Started Session c34 of user rabbitmq.

OK, can you please attach /var/log/mysqld.log from all three nodes?

Created attachment 1098007 [details]
mysql log files
MySQL log files from the 3 controllers
It looks like controller0 has been rebooted in the middle of a transaction. This has the following effects:

- Since mysqld wasn't shut down properly, the file /var/lib/mysql/grastate.dat (which normally holds the local node's replication state) couldn't be written.
- When the resource agent tries to restart mysql after controller0 has rebooted, it first tries unconditionally to retrieve the last commit on the node. The file grastate.dat doesn't contain the info, so the resource agent starts mysqld_safe in "recovery mode" to find out.
- mysqld has a pending transaction and refuses to go ahead before it is either committed or rolled back manually.
- The resource agent cannot retrieve the last commit and ends up in error, and this prevents the galera node from coming back up on controller0.

To fix the issue, you have to force-commit the pending transaction manually, and then clean up the resource state in pacemaker:

controller0# mysqld_safe --user=mysql --datadir=/var/lib/mysql --tc-heuristic-recover commit --wsrep-recover
controller0# pcs cleanup galera

Can you apply the steps and tell me if it fixes the issue?

[root@overcloud-controller-0 ~]# mysqld_safe --user=mysql --datadir=/var/lib/mysql --tc-heuristic-recover commit --wsrep-recover
151124 04:21:56 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
/bin/mysqld_safe: line 728: ulimit: -1: invalid option
ulimit: usage: ulimit [-SHacdefilmnpqrstuvx] [limit]
151124 04:21:56 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
151124 04:21:56 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.yksri9' --pid-file='/var/lib/mysql/overcloud-controller-0.localdomain-recover.pid'
151124 04:21:58 mysqld_safe WSREP: Failed to recover position: '151124  4:21:56 InnoDB: The InnoDB memory heap is disabled
151124  4:21:56 InnoDB: Mutexes and rw_locks use GCC atomic builtins
151124  4:21:56 InnoDB: Compressed tables use zlib 1.2.7
151124  4:21:56 InnoDB: Using Linux native AIO
151124  4:21:56 InnoDB: Initializing buffer pool, size = 128.0M
151124  4:21:56 InnoDB: Completed initialization of buffer pool
151124  4:21:56 InnoDB: highest supported file format is Barracuda.
InnoDB: Transaction 20DE95 was in the XA prepared state.
InnoDB: 1 transaction(s) which must be rolled back or cleaned up
InnoDB: in total 0 row operations to undo
InnoDB: Trx id counter is 20E000
InnoDB: Starting in background the rollback of uncommitted transactions
151124  4:21:56 InnoDB: Rollback of non-prepared transactions completed
151124  4:21:56 InnoDB: Waiting for the background threads to start
151124  4:21:57 Percona XtraDB (http://www.percona.com) 5.5.41-MariaDB-37.0 started; log sequence number 701494371
151124  4:21:57 [Note] Plugin 'FEEDBACK' is disabled.
151124  4:21:57 InnoDB: Starting recovery for XA transactions...
151124  4:21:57 InnoDB: Transaction 20DE95 in prepared state after recovery
151124  4:21:57 InnoDB: Transaction contains changes to 1 rows
151124  4:21:57 InnoDB: 1 transactions in prepared state after recovery
151124  4:21:57 [Note] Found 1 prepared transaction(s) in InnoDB
151124  4:21:57 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash.
You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.
151124  4:21:57 [ERROR] Aborting
151124  4:21:57 InnoDB: Starting shutdown...
151124  4:21:58 InnoDB: Shutdown completed; log sequence number 701494371
151124  4:21:58 [Note] /usr/libexec/mysqld: Shutdown complete'

After a second try, the following commands unblocked the cluster as expected:

# mysqld_safe --user=mysql --datadir=/var/lib/mysql --wsrep-recover --tc-heuristic-recover rollback
# pcs cleanup galera
# pcs status resources | grep -A1 galera
 Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

OK, but a graceful reboot should not cause this issue - controller0 was forcibly killed via kill -9 or similar. Can we please confirm that pacemaker is configured to not fence this node with a kill?

My browser cached a stale page here inappropriately, so I've constructed an analysis that is not based on Damien's information; nevertheless it sums up what we need to ask about the config on this node if we are in fact looking for a "bug", because the behavior of "node was killed forcibly (e.g. kill -9, power cycle) and goes into tc-heuristic-recover" is *not* a bug, it is a known behavior. So far here's what we can see:

1. controller0 is the one with the problem.
2. controller0 was under pacemaker control, then stopped normally under pacemaker, at 151123 6:11:20.
3. controller0 was then started manually, either via systemctl or mysqld_safe, without using pacemaker at all, at 151123 06:17:49.
4. controller0 was shut down normally at 151123 06:17:59.
5. controller0 was started under pacemaker control at 151123 06:19:49.
6. The node was forcibly killed at some unknown time after 151123 6:20:03 and before 151123 08:21:51. Reasons for this generally include:
   a. the server was kill -9'ed by the user.
   b. the server was power-cycled or otherwise subject to some sudden systemwide crash.
   c. Pacemaker fenced the node, but was not configured for "on-fail=block". The pacemaker.log we have here is not illustrating this; it appears that it was logging to corosync.log.
7. After this point, pacemaker no longer attempts to do anything with this node according to the log, because there are no more log messages in this mysqld.log (unless, extremely unusually, the resource agent were somehow reconfigured to use mariadb.log on *just* this node and not the others, as they remain logging to mysqld.log). I don't know pacemaker that well, but it seems like this might be because it considers the node to be fenced (need confirmation on that from pacemaker people).
8. We do see attempts to start mysqld using systemctl; here we pick up again in mariadb.log, where the node has crashed due to the tc-heuristic-recover issue, which was caused by one of the events in #6. The original bug description shows this happening as the symptom, but this is not a bug; this is a known mariadb/galera behavior when a node is killed ungracefully, and it is not caused by normal reboots or by pacemaker when properly configured.
9. It's not clear why the original description says "reboot the node" but then we see all these systemctl starts, so can we ensure that the mariadb service is **not** enabled under systemctl:

$ systemctl status mariadb
● mariadb.service - MariaDB 10.0 database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

e.g. above we want to see "disabled;".

Resolution of the cluster issue immediately:
1. go to controller0
2. start manually with --tc-heuristic-recover=rollback: mysqld_safe --tc-heuristic-recover=rollback
3. shutdown: mysqladmin shutdown
4. get pacemaker to unfence / unblock the node: "pcs cleanup galera" - but then I'm not sure about the fencing part.

Remaining questions:
1. What happened to the node in step 6 above between 151123 6:20:03 and 151123 08:21:51? What kind of "shutdown" are we talking about?
2. Why didn't pacemaker attempt to look at this node at all after 6:20?
   a. is it fenced?
   b. was it unmanaged or otherwise taken out of pacemaker control?
3. Is the pacemaker configuration appropriately configured with on-fail=block? Need to see pacemaker configuration files.
4. Confirm that all the remaining mariadb activity here on controller0 consisted of manual attempts to restart the node using systemctl and/or mysqld_safe.
5. What is the "bug" being reported here: that pacemaker inappropriately force-killed the node, or the tc-heuristic-recover issue itself, which is not a bug by itself?

Next steps:
1. If the issue is only the tc-heuristic-recover state after the node was known to be forcibly killed: no bug. Run the steps above to recover it.
2. If the issue is that the galera node was put into this forcibly-killed state as a result of a graceful shutdown, we need to see all pacemaker activity and configuration. If we don't have access to these servers then we'd need SOS reports.

1. The initial shutdown of the controller was done by executing the "reboot" command.
2. Pacemaker is configured as follows:

 Master: galera-master
  Resource: galera (class=ocf provider=heartbeat type=galera)
   Operations: start interval=0s timeout=120 (galera-start-interval-0s)
               stop interval=0s timeout=120 (galera-stop-interval-0s)
               monitor interval=20 timeout=30 (galera-monitor-interval-20)
               monitor interval=10 role=Master timeout=30 (galera-monitor-interval-10)
               monitor interval=30 role=Slave timeout=30 (galera-monitor-interval-30)
               demote interval=0s timeout=120 (galera-demote-interval-0s)
               promote interval=0s timeout=300s on-fail=block (galera-promote-interval-0s)
  promote galera-master then start openstack-keystone-clone (kind:Mandatory) (id:order-galera-master-openstack-keystone-clone-mandatory)

3. Confirming that the remaining mariadb activity was manual attempts to restart the node using systemctl.
4. At the same time as the galera issue, there was also an issue with redis due to another bug (redis can't start after rebooting the host or stopping the cluster). Here is the pacemaker configuration for redis:

[root@overcloud-controller-0 heat-admin]# pcs config | grep redis
 Master: redis-master
  Resource: redis (class=ocf provider=heartbeat type=redis)
   Operations: start interval=0s timeout=120 (redis-start-interval-0s)
               stop interval=0s timeout=120 (redis-stop-interval-0s)
               monitor interval=45 timeout=60 (redis-monitor-interval-45)
               monitor interval=20 role=Master timeout=60 (redis-monitor-interval-20)
               monitor interval=60 role=Slave timeout=60 (redis-monitor-interval-60)
               promote interval=0s timeout=120 (redis-promote-interval-0s)
               demote interval=0s timeout=120 (redis-demote-interval-0s)
  promote redis-master then start openstack-ceilometer-central-clone (kind:Mandatory) (Options: require-all=false) (id:order-redis-master-openstack-ceilometer-central-clone-mandatory)
 redis_REPL_INFO: overcloud-controller-2

controller-0 was consistently fenced every several minutes, maybe because of redis being unable to start? In any case, and Damien agreed with me, we probably have to make the galera recovery process automatic in case of a reboot, power-off, or electricity failure of the cluster node. I expect us to be able to recover all the cluster's services without manual intervention.

OK, but still not answered is:
1. What happened to the node after 6:20 that it just ended and/or was kill -9'ed? An SOS report might show us better what exactly happened when the node was rebooted. What time was the node rebooted? Did this mariadb crash occur before the reboot?
2. When the node came up again, why didn't pacemaker even *attempt* to start the node? Even after pcs cleanup, usually we'd see logs in mysqld.log showing mariadb failing to start due to the tc-heuristic-recover issue. We would *not* see anything in mariadb/mariadb.log at all.
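As an aside, the agent logs earlier in this bug show a two-step detection ("attempting to detect last commit version by reading /var/lib/mysql/grastate.dat", then "mysqld_safe --wsrep-recover"). A simplified sketch of that logic follows; this is an illustration only, NOT the shipped resource agent code, and the grastate.dat layout shown is just the seqno line relevant here:

```shell
# Sketch of the agent's "last commit" detection. A seqno of -1 (or a
# missing/unwritable grastate.dat) means the node was not shut down
# cleanly, so a full mysqld_safe --wsrep-recover pass is needed - and
# with an XA transaction still in the prepared state, that recovery
# pass aborts, which is exactly the failure reported in this bug.
detect_last_commit() {
    grastate="$1/grastate.dat"
    seqno=""
    [ -f "$grastate" ] && seqno=$(awk '/^seqno:/ { print $2 }' "$grastate")
    if [ -z "$seqno" ] || [ "$seqno" = "-1" ]; then
        echo "need-wsrep-recover"
    else
        echo "$seqno"
    fi
}

# Demo with a grastate.dat as left behind by an unclean shutdown:
demo=$(mktemp -d)
printf 'version: 2.1\nseqno: -1\n' > "$demo/grastate.dat"
detect_last_commit "$demo"    # prints: need-wsrep-recover
rm -rf "$demo"
```

A cleanly stopped node would instead carry a non-negative seqno in grastate.dat, and the agent could skip the recovery pass entirely.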
Basically the tc-heuristic-recover message should *never occur* unless a node is power-cycled, crashes, or the service is kill -9'ed. There are lots of other unrecoverable things that can happen from an ungraceful kill as well, so just working around this one condition won't fix a bug that is ungracefully killing galera nodes.

1. The answer to the first question may be that the node was fenced and stayed shut down until manually started. See https://bugzilla.redhat.com/show_bug.cgi?id=1194301. In any case I will collect and attach an sos report ASAP.

(In reply to Michael Bayer from comment #8)
> OK but a graceful reboot should not cause this issue - controller0 was
> forcibly killed via kill -9 or similar. Can we please confirm that
> pacemaker is configured to not fence this node with a kill?

FYI, fencing is always non-graceful. Usually it's implemented as an immediate power shutoff (in this case, via IPMI). That's because it's designed to handle cases where the node is unreachable or unresponsive. What we want in this case is a graceful shutdown of cluster services, and not fencing. I'm still looking through the logs to see what happened here, but the proper procedure for rebooting a cluster node is documented at https://access.redhat.com/articles/2059253 (put the node into standby mode, then stop cluster services, before rebooting).

On Nov 23, Pacemaker repeatedly fenced overcloud-controller-0 (most immediately around the reported issue at 08:40:45, 08:48:29, and 08:51:10). The sequence is: someone presses the power button on overcloud-controller-0, everything begins to shut down, but neutron-ovs-cleanup fails to stop, so the node is fenced. Since neutron-openvswitch-agent depends (indirectly) on both neutron-ovs-cleanup and galera-master, galera is not stopped before the fence. As an example, the 08:40 event begins with:

Nov 23 08:40:41 overcloud-controller-0 systemd-logind: Power key pressed.

Pacemaker initiates a stop of neutron-ovs-cleanup at 08:40:44, which returns a "generic error" (exit status 1) one second later, so Pacemaker power-cycles the node. Unfortunately, we don't have logs from this time on overcloud-controller-0, likely due to write caches lost at the fence event. So the cause, and the unanswered question, is why the attempt to stop neutron-ovs-cleanup immediately fails on overcloud-controller-0. One way to test would be to turn on maintenance mode, then run the resource agent manually to stop neutron-ovs-cleanup. Ideally these should be stopped in this order first: neutron-metadata-agent, neutron-l3-agent, neutron-dhcp-agent, neutron-openvswitch-agent, neutron-netns-cleanup.

BTW, hardware design can mitigate the effects of events like this, i.e. use hardware RAID with power backup (BBU or supercapacitor), and disable write caching at all levels other than the RAID card. That can help with both the logs and the mariadb state files.

Andrew, this sounds like the same problem we are seeing now with the undercloud trying to shut down an overcloud node. Do you mind checking the logs to see whether you see the same behavior as in the scale lab?

I see a lot of the oslo messaging bug in those logs, but I'd need the undercloud's logs to see if it matches the reboot cycle we've seen elsewhere.

Andrew, what logs do you need from the undercloud machine? I will try to reproduce it and attach the logs.

(In reply to Leonid Natapov from comment #19)
> Andrew, what logs do you need from the undercloud machine?
> I will try to reproduce it and attach the logs.

Just grab an sosreport. It's the easiest.

(In reply to Andrew Beekhof from comment #18)
> I see a lot of the oslo messaging bug in those logs, but i'd need the
> underclouds logs to see if it matches the reboot cycle we've seen elsewhere

I don't have a working environment at the moment. Will try to reproduce and attach sos reports from the undercloud machines as soon as possible.
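For reference on the recovery mechanics discussed above: when mysqld_safe --wsrep-recover succeeds (i.e. no prepared XA transaction is blocking it), it reports the recovered wsrep position in its log, and an agent can parse the seqno back out. A hedged sketch of that extraction; the "Recovered position <uuid>:<seqno>" wording is an assumption based on the mysqld_safe output in this bug's logs, so check the exact format of your version:

```shell
# Pull the last recovered wsrep seqno out of a --wsrep-recover log.
# The seqno is the number after the final colon; tail -n 1 keeps the
# most recent occurrence if the log holds several recovery runs.
recovered_seqno() {
    sed -n 's/.*Recovered position.*:\(-\{0,1\}[0-9][0-9]*\)$/\1/p' "$1" | tail -n 1
}

# Demo against a fabricated log line (UUID is a made-up example):
log=$(mktemp)
echo '151124  4:30:00 mysqld_safe WSREP: Recovered position e2c9a15e-5385-11e5-9f58-7e35c2f8f203:1932' > "$log"
recovered_seqno "$log"    # prints: 1932
rm -f "$log"
```

A seqno of -1 in that output would again mean "position unknown", i.e. the node needs a full state transfer.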
Moving out to the 8.0.z release, since this is not a blocker for GA and is not likely to land by then.

Created attachment 1137738 [details]
gdb test script to trigger the condition
Instructions for test

Test #1: node restarts when it is killed in the middle of a replication

1. Start the galera resource on the cluster:
   pcs resource enable galera

2. On any node, create a temporary table for the test. It will get replicated by galera on all the nodes:
   mysql -e 'drop database if exists racts; drop table if exists racts.break; create database racts; create table racts.break (i int) engine=innodb;'

3. Prepare to trigger the initial bug condition. Go on controller#3, and gdb the running galera server with the attached script:
   gdb -x /tmp/kill-during-txn.gdb -p $(cat /var/run/mysql/mysqld.pid)

4. On any node, write something to the cluster. The write will get replicated to controller#3, and the gdb script will kill the server in the middle of integrating the write set:
   mysql -e 'insert into racts.break values (42);'

   Pacemaker monitoring detects that the node is not running, logs an error, and tries to restart it. The restart won't succeed, and the logs should exhibit the error message reported in the bug.

5. Load the new resource agent and clean up the pacemaker resource state:
   pcs resource cleanup galera

   Pacemaker will retry starting galera on controller#3 and will succeed. In the journal, one can see the resource agent logs that trigger the recovery:

   Mar 18 11:04:27 db3 galera(galera)[11322]: WARNING: local node <db3> was not shutdown properly. Rollback stuck transaction with --tc-heuristic-recover
   Mar 18 11:04:32 db3 galera(galera)[11932]: WARNING: State recovered. force SST at next restart for full resynchronization

   After the restart, the test write is correctly replicated on controller#3:

   [root@db3 ~]# mysql -e 'select * from racts.break;'
   +------+
   | i    |
   +------+
   |   42 |
   +------+

Test #2: cluster can be bootstrapped when a node is blocked by a pending replication

With the old resource-agent code, repeat steps 1 to 4.

5. Stop all the galera nodes, and clean up the resource status in pacemaker:
   pcs resource disable galera
   pcs resource cleanup galera

6. Install the new resource-agents and restart the galera cluster:
   pcs resource enable galera

   The entire cluster should come up, and recovery logs should appear in the journal on controller#3.

Housekeeping after the tests, when all galera nodes are up:
   mysql -e 'drop database if exists racts;'

resource-agents-3.9.5-68.el7

ospd8: was unable to reproduce.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2174.html
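As a closing side note, the verification at the end of Test #1 (the write made while controller#3 was down must be visible there after recovery) can be scripted. A minimal sketch; the mysql invocation is passed in as a command string so the check can be pointed at any node, or stubbed as in the demo (`replicated_ok` is a hypothetical helper, not part of the test attachments):

```shell
# Check that the test row from Test #1 step 4 replicated to this node.
# $1 is a command that prints the value of racts.break.i, e.g.
#   mysql -N -e 'select i from racts.break'
# "42" is the value inserted in step 4 of the test instructions.
replicated_ok() {
    [ "$($1)" = "42" ]
}

# On controller#3 this would be:
#   replicated_ok "mysql -N -e 'select i from racts.break'" && echo replicated
# Stubbed demo (no cluster needed):
replicated_ok "echo 42" && echo "replicated"    # prints: replicated
```

Parameterizing the query command keeps the check testable without a running cluster, which is handy when rehearsing the procedure.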