| Summary: | rhel-osp-director: 9.0 After successful minor update + reboot of OC nodes, galera starts on 2 out of 3 controllers. | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Alexander Chuzhoy <sasha> |
| Component: | rhosp-director | Assignee: | Michael Bayer <mbayer> |
| Status: | CLOSED NOTABUG | QA Contact: | Omri Hochman <ohochman> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 9.0 (Mitaka) | CC: | dbecker, dciabrin, mbayer, mburns, morazi, rhel-osp-director-maint, sasha |
| Target Milestone: | async | ||
| Target Release: | 9.0 (Mitaka) | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-10-21 15:37:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Alexander Chuzhoy
2016-10-20 18:28:19 UTC
The bootstrap of the galera cluster has not been performed because on controller-2, when the resource agent calls mysqld_safe to retrieve the last sequence number from the database files, mysqld segfaults. From /var/log/mariadb/mariadb.log:

```
161019 20:49:15 InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percents: 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
InnoDB: Apply batch completed
InnoDB: Cleaning up trx with id B8C9
161019 20:49:16 InnoDB: Waiting for the background threads to start
161019 20:49:16 InnoDB: Assertion failure in thread 139899414460160 in file fsp0fsp.c line 3667
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
161019 20:49:16 [ERROR] mysqld got signal 6 ;
```

I see the following version of mariadb-galera-server:

```
rpm -qa mariadb-galera-server
mariadb-galera-server-5.5.42-5.el7ost.x86_64
```

Alexander, can you tell if there was a previous version prior to the minor update?

Yes:

```
[root@overcloud-controller-0 ~]# grep mariadb-galera /var/log/yum.log
Oct 19 19:40:08 Updated: 1:mariadb-galera-common-5.5.42-5.el7ost.x86_64
Oct 19 19:40:32 Updated: 1:mariadb-galera-server-5.5.42-5.el7ost.x86_64
```

Never mind: on the other two nodes, the command used to retrieve the last seqno from the database works as expected:

```
[root@overcloud-controller-0 mysql]# mysqld_safe --wsrep-recover
161020 18:54:39 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
/bin/mysqld_safe: line 728: ulimit: -1: invalid option
ulimit: usage: ulimit [-SHacdefilmnpqrstuvx] [limit]
161020 18:54:39 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
161020 18:54:39 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.rBnGVp' --pid-file='/var/lib/mysql/overcloud-controller-0.localdomain-recover.pid'
161020 18:54:41 mysqld_safe WSREP: Recovered position 464f2096-9629-11e6-aff7-93a8c75df3db:19683
161020 18:54:43 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
```

So it is more likely a database corruption on controller-2.

The DB is corrupt because the database was not shut down normally; a kill -9, for example, can do this. Just trying to start the server, the first few lines that give it away are:

```
161020 19:06:32 InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
```

To fix that, you need to rebuild the data directory; for customers, that's all we do here. But we're saying here that this corruption occurred as part of the upgrade process. From the Galera POV, we just need to see when/how/if the DB was not cleanly shut down somewhere. I will try to poke around a bit, but we may need some pacemaker / tripleo people to look here.

OK, the database corruption seems to have been caused by a reboot without the galera node having been stopped beforehand.
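For reference, the last-seqno detection the resource agent performs (the grastate.dat read and the `mysqld_safe --wsrep-recover` fallback visible in the logs above and in the next comment) can be approximated with the sketch below. This is not the agent's actual code; the grastate.dat parsing and the way the recovered position is extracted are assumptions based on the log messages quoted here.

```bash
#!/bin/bash
# Rough approximation of how the galera resource agent detects the last
# committed seqno on a stopped node. Not the agent's real implementation.

DATADIR=/var/lib/mysql
GRASTATE="$DATADIR/grastate.dat"

# 1. Fast path: after a clean shutdown, grastate.dat contains a line like
#    "seqno:   19683".
seqno=""
if [ -f "$GRASTATE" ]; then
    seqno=$(awk '/^seqno:/ { print $2 }' "$GRASTATE")
fi

# 2. If the seqno is missing or -1 (unclean shutdown), fall back to
#    mysqld_safe --wsrep-recover, which runs InnoDB crash recovery and
#    prints "WSREP: Recovered position <uuid>:<seqno>".
if [ -z "$seqno" ] || [ "$seqno" = "-1" ]; then
    seqno=$(mysqld_safe --wsrep-recover 2>&1 \
            | grep -o 'WSREP: Recovered position .*' \
            | awk -F: '{ print $NF }')
fi

if [ -z "$seqno" ]; then
    echo "ERROR: Unable to detect last known write sequence number" >&2
    exit 1
fi
echo "Last commit version found: $seqno"
```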
I see that on controller-2 the mariadb package upgrade happened around 19:52:

```
[root@overcloud-controller-2 lib]# grep mariadb-galera /var/log/yum.log
Oct 19 19:52:49 Updated: 1:mariadb-galera-common-5.5.42-5.el7ost.x86_64
Oct 19 19:53:12 Updated: 1:mariadb-galera-server-5.5.42-5.el7ost.x86_64
```

Before the upgrade, the galera server running on the node had been properly shut down (as seen with journalctl):

```
Oct 19 19:50:12 overcloud-controller-2.localdomain galera(galera)[30974]: INFO: MySQL not running: removing old PID file
Oct 19 19:50:12 overcloud-controller-2.localdomain galera(galera)[30981]: INFO: MySQL not running: removing old PID file
Oct 19 19:50:12 overcloud-controller-2.localdomain galera(galera)[30988]: INFO: MySQL stopped
Oct 19 19:50:12 overcloud-controller-2.localdomain galera(galera)[30999]: INFO: attempting to detect last commit version by reading /var/lib/mysql/grastate.dat
Oct 19 19:50:12 overcloud-controller-2.localdomain galera(galera)[31008]: INFO: Last commit version found: 12153
Oct 19 19:50:19 overcloud-controller-2.localdomain galera(galera)[31584]: INFO: MySQL is not running
```

Then, after the upgrade, the node restarted properly and was able to fetch the last seqno as expected:

```
Oct 19 19:59:02 overcloud-controller-2.localdomain galera(galera)[5092]: INFO: attempting to detect last commit version by reading /var/lib/mysql/grastate.dat
Oct 19 19:59:02 overcloud-controller-2.localdomain galera(galera)[5103]: INFO: Last commit version found: 12153
```

Then a reboot happened without galera being stopped properly beforehand, and right after it the last seqno could no longer be retrieved (data corruption):

```
# journalctl -t 'galera(galera)'
[...]
Oct 19 19:59:21 overcloud-controller-2.localdomain galera(galera)[8004]: INFO: Waiting for database to sync with the cluster.
Oct 19 19:59:21 overcloud-controller-2.localdomain galera(galera)[8010]: INFO: Database synced.
Oct 19 19:59:21 overcloud-controller-2.localdomain galera(galera)[8017]: INFO: Galera started
-- Reboot --
Oct 19 20:48:48 overcloud-controller-2.localdomain galera(galera)[6532]: INFO: MySQL is not running
Oct 19 20:49:13 overcloud-controller-2.localdomain galera(galera)[13229]: ERROR: MySQL is not running
Oct 19 20:49:13 overcloud-controller-2.localdomain galera(galera)[13275]: INFO: Creating PID dir: /var/run/mysql
Oct 19 20:49:13 overcloud-controller-2.localdomain galera(galera)[13339]: INFO: attempting to detect last commit version by reading /var/lib/mysql/grastate.dat
Oct 19 20:49:13 overcloud-controller-2.localdomain galera(galera)[13350]: INFO: now attempting to detect last commit version using 'mysqld_safe --wsrep-recover'
Oct 19 20:49:16 overcloud-controller-2.localdomain galera(galera)[14489]: ERROR: Unable to detect last known write sequence number
```

At this point, the only solution is to rebuild the datadir for galera to restart properly.

Closing this as it was not caused by the minor update, but by a bad reboot (no shutdown of the cluster before the VM reboot).
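For completeness, rebuilding the datadir so the corrupted node rejoins the cluster via a full state transfer (SST) could look roughly like the following. This is a minimal sketch, not a validated procedure: the pacemaker resource name (`galera`), the paths, and the ownership are assumptions and may differ on a director-deployed overcloud (the resource is often managed as a master/slave clone, e.g. `galera-master`).

```bash
#!/bin/bash
# Sketch: rebuild the galera datadir on the corrupted controller so the
# node rejoins via SST from a healthy donor. Adjust resource names/paths
# to the actual deployment before using anything like this.

NODE=overcloud-controller-2
DATADIR=/var/lib/mysql

# Keep pacemaker from trying to restart galera on this node while we work.
pcs resource ban galera $NODE

# Make sure no mysqld is left running, then move the corrupted datadir aside.
pkill -f mysqld || true
mv $DATADIR ${DATADIR}.corrupt.$(date +%Y%m%d%H%M%S)

# Recreate an empty datadir owned by the mysql user; the state transfer
# from a donor node repopulates it on the next start.
install -d -o mysql -g mysql -m 0755 $DATADIR

# Allow pacemaker to start galera on this node again; it should join via SST.
pcs resource clear galera $NODE
```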
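And to avoid this in the first place, the cluster should be stopped on a controller before rebooting it, so pacemaker shuts galera down cleanly. A minimal sketch of the intended sequence, assuming a standard pcs-managed pacemaker cluster:

```bash
# On the controller about to be rebooted: stop cluster services cleanly so
# pacemaker stops galera (and the other resources) before the OS goes away.
pcs cluster stop
reboot

# After the node is back up, rejoin it to the cluster and verify:
pcs cluster start
pcs status
```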