rhel-osp-director 9.0: After a successful minor update + reboot of the overcloud (OC) nodes, galera starts on only 2 out of 3 controllers.

Environment:
openstack-tripleo-heat-templates-2.0.0-35.el7ost.noarch
instack-undercloud-4.0.0-14.el7ost.noarch
openstack-puppet-modules-8.1.8-2.el7ost.noarch
openstack-tripleo-heat-templates-liberty-2.0.0-35.el7ost.noarch

Steps to reproduce:
1. Deploy the overcloud.
2. Run the minor update with:
   openstack overcloud update stack overcloud -i --templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml
   Note: the update includes a switch from RHEL 7.2 to RHEL 7.3.
3. Reboot the setup.
4. If needed, restore the network connection to the controllers according to https://bugzilla.redhat.com/show_bug.cgi?id=1386299#c10
5. Run pcs cluster stop --all
6. Run pcs cluster start --all

Check the pcs resources:

[root@overcloud-controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-1 (version 1.1.15-11.el7-e174ec8) - partition with quorum
Last updated: Thu Oct 20 18:26:53 2016
Last change: Thu Oct 20 17:51:07 2016 by root via crm_attribute on overcloud-controller-1

3 nodes and 127 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

 ip-192.168.200.10 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 ip-10.19.94.10 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1
 ip-10.19.95.10 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-192.168.0.6 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 Master/Slave Set: galera-master [galera]
     Slaves: [ overcloud-controller-0 overcloud-controller-1 ]
     Stopped: [ overcloud-controller-2 ]
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-10.19.94.11 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1
 ip-10.19.184.180 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-core-clone [openstack-core]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-1 ]
     Slaves: [ overcloud-controller-0 overcloud-controller-2 ]
 Clone Set: mongod-clone [mongod]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-aodh-evaluator-clone [openstack-aodh-evaluator]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 openstack-cinder-volume (systemd:openstack-cinder-volume): Stopped
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-api-clone [openstack-ceilometer-api]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-aodh-listener-clone [openstack-aodh-listener]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-gnocchi-metricd-clone [openstack-gnocchi-metricd]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-aodh-notifier-clone [openstack-aodh-notifier]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-collector-clone [openstack-ceilometer-collector]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-api-clone [openstack-glance-api]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-api-clone [openstack-nova-api]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-sahara-api-clone [openstack-sahara-api]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-sahara-engine-clone [openstack-sahara-engine]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-registry-clone [openstack-glance-registry]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-gnocchi-statsd-clone [openstack-gnocchi-statsd]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-notification-clone [openstack-ceilometer-notification]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-api-clone [openstack-cinder-api]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: delay-clone [delay]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-server-clone [neutron-server]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: httpd-clone [httpd]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Failed Actions:
* galera_start_0 on overcloud-controller-2 'unknown error' (1): call=242, status=complete, exitreason='Unable to detect last known write sequence number',
    last-rc-change='Thu Oct 20 18:07:55 2016', queued=1ms, exec=2012ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Expected result:
The galera resource should start on all controllers.
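The important part of this output is the failed start of galera on overcloud-controller-2 with exitreason 'Unable to detect last known write sequence number'. A possible first triage on that node, before the root-cause analysis below (the commands are illustrative; the log path and resource name are the ones used in this deployment):

  # on overcloud-controller-2
  pcs resource failcount show galera            # how many times the start has failed
  journalctl -t 'galera(galera)' | tail -n 50   # messages from the galera resource agent
  tail -n 100 /var/log/mariadb/mariadb.log      # mysqld output from the failed recovery attempt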
The bootstrap of the galera cluster has not been performed because on controller-2, when the resource agent calls mysqld_safe to retrieve the last sequence number from the database files, mysqld crashes. From /var/log/mariadb/mariadb.log:

161019 20:49:15 InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percents: 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
InnoDB: Apply batch completed
InnoDB: Cleaning up trx with id B8C9
161019 20:49:16 InnoDB: Waiting for the background threads to start
161019 20:49:16 InnoDB: Assertion failure in thread 139899414460160 in file fsp0fsp.c line 3667
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
161019 20:49:16 [ERROR] mysqld got signal 6 ;

I see the following version of mariadb-galera-server:

rpm -qa mariadb-galera-server
mariadb-galera-server-5.5.42-5.el7ost.x86_64

Alexander, can you tell if there was a previous version prior to the minor update?
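The assertion failure and the pointer to the "forcing InnoDB recovery" page suggest the usual salvage path before rebuilding: start mysqld in a crippled, read-only mode and dump whatever is still readable. This is a hedged sketch, not a step that was actually taken here; the recovery level (4) and the idea of disabling the wsrep provider for the salvage run are assumptions:

  # Illustrative only: try a crippled start to salvage data before rebuilding the datadir.
  # Higher innodb_force_recovery levels are increasingly destructive; start low.
  mysqld_safe --wsrep-provider=none --innodb-force-recovery=4 &
  # wait for the server to come up, then dump whatever is readable
  mysqldump --all-databases > /root/salvage-controller-2.sql
  mysqladmin shutdown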
Yes:

[root@overcloud-controller-0 ~]# grep mariadb-galera /var/log/yum.log
Oct 19 19:40:08 Updated: 1:mariadb-galera-common-5.5.42-5.el7ost.x86_64
Oct 19 19:40:32 Updated: 1:mariadb-galera-server-5.5.42-5.el7ost.x86_64
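Note that yum.log only records the version each package was updated to. If the previously installed version matters, it can be recovered from the yum transaction history, for example:

  yum history list mariadb-galera-server   # find the transaction id of the update
  yum history info <transaction-id>        # shows the old and new package versions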
Nevermind, on the other two nodes the command used to retrieve the last seqno from the database works as expected:

[root@overcloud-controller-0 mysql]# mysqld_safe --wsrep-recover
161020 18:54:39 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
/bin/mysqld_safe: line 728: ulimit: -1: invalid option
ulimit: usage: ulimit [-SHacdefilmnpqrstuvx] [limit]
161020 18:54:39 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
161020 18:54:39 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.rBnGVp' --pid-file='/var/lib/mysql/overcloud-controller-0.localdomain-recover.pid'
161020 18:54:41 mysqld_safe WSREP: Recovered position 464f2096-9629-11e6-aff7-93a8c75df3db:19683
161020 18:54:43 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended

So it is more likely a database corruption on controller-2.
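For reference, the galera resource agent detects the last committed seqno in two steps: it first reads /var/lib/mysql/grastate.dat and, if no valid seqno is recorded there (unclean shutdown), it falls back to 'mysqld_safe --wsrep-recover'; the fallback is what crashes on controller-2. The same check can be replayed by hand on each controller; a minimal sketch, using the paths from this deployment:

  # a seqno of -1 in grastate.dat means the node was not shut down cleanly
  cat /var/lib/mysql/grastate.dat

  # the fallback used by the resource agent; the result is logged as
  # "WSREP: Recovered position <uuid>:<seqno>"
  mysqld_safe --wsrep-recover
  grep 'Recovered position' /var/log/mariadb/mariadb.log | tail -n 1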
So the DB is corrupt because the database was not shut down normally; a kill -9, for example, can do this. Just trying to start mysqld, the first few lines are the tell:

161020 19:06:32 InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...

To fix that, you need to rebuild the data directory; for customers, that is all we do here. But we are saying here that this corruption occurred as part of the upgrade process. From the Galera point of view we just need to see when/how/if the DB was not cleanly shut down somewhere. I will try to poke around a bit, but we may need some pacemaker / tripleo people to look here.
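Rebuilding the data directory here means discarding the local copy on the corrupted node and letting it rejoin via a full state transfer (SST) from a healthy controller. A rough sketch of that procedure, assuming the other two nodes hold a good copy of the data and galera stays under pacemaker control as in this deployment; the supported, detailed steps are in the product recovery documentation:

  # on overcloud-controller-2 only, with mysqld not running
  mv /var/lib/mysql /var/lib/mysql.corrupt.$(date +%F)
  mkdir /var/lib/mysql
  chown mysql:mysql /var/lib/mysql
  restorecon -R /var/lib/mysql
  # depending on the SST method, the system tables may need to be re-created first:
  # mysql_install_db --datadir=/var/lib/mysql --user=mysql

  # clear the failed action so pacemaker retries the start; the resource agent
  # should bring the node back into the cluster via a full SST
  pcs resource cleanup galera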
OK, the database corruption seems to have been caused by a reboot without the galera node having been stopped beforehand.

I see that on controller-2 the mariadb package upgrade happened around 19:52:

[root@overcloud-controller-2 lib]# grep mariadb-galera /var/log/yum.log
Oct 19 19:52:49 Updated: 1:mariadb-galera-common-5.5.42-5.el7ost.x86_64
Oct 19 19:53:12 Updated: 1:mariadb-galera-server-5.5.42-5.el7ost.x86_64

Before the upgrade, the galera server running on the node had been properly shut down (from journalctl):

Oct 19 19:50:12 overcloud-controller-2.localdomain galera(galera)[30974]: INFO: MySQL not running: removing old PID file
Oct 19 19:50:12 overcloud-controller-2.localdomain galera(galera)[30981]: INFO: MySQL not running: removing old PID file
Oct 19 19:50:12 overcloud-controller-2.localdomain galera(galera)[30988]: INFO: MySQL stopped
Oct 19 19:50:12 overcloud-controller-2.localdomain galera(galera)[30999]: INFO: attempting to detect last commit version by reading /var/lib/mysql/grastate.dat
Oct 19 19:50:12 overcloud-controller-2.localdomain galera(galera)[31008]: INFO: Last commit version found: 12153
Oct 19 19:50:19 overcloud-controller-2.localdomain galera(galera)[31584]: INFO: MySQL is not running

Then, after the upgrade, the node restarted properly and was able to fetch the last seqno as expected:

Oct 19 19:59:02 overcloud-controller-2.localdomain galera(galera)[5092]: INFO: attempting to detect last commit version by reading /var/lib/mysql/grastate.dat
Oct 19 19:59:02 overcloud-controller-2.localdomain galera(galera)[5103]: INFO: Last commit version found: 12153

Then a reboot happened without galera being stopped properly beforehand, and right after it the last seqno could not be retrieved anymore (data corruption):

# journalctl -t 'galera(galera)'
[...]
Oct 19 19:59:21 overcloud-controller-2.localdomain galera(galera)[8004]: INFO: Waiting for database to sync with the cluster.
Oct 19 19:59:21 overcloud-controller-2.localdomain galera(galera)[8010]: INFO: Database synced.
Oct 19 19:59:21 overcloud-controller-2.localdomain galera(galera)[8017]: INFO: Galera started
-- Reboot --
Oct 19 20:48:48 overcloud-controller-2.localdomain galera(galera)[6532]: INFO: MySQL is not running
Oct 19 20:49:13 overcloud-controller-2.localdomain galera(galera)[13229]: ERROR: MySQL is not running
Oct 19 20:49:13 overcloud-controller-2.localdomain galera(galera)[13275]: INFO: Creating PID dir: /var/run/mysql
Oct 19 20:49:13 overcloud-controller-2.localdomain galera(galera)[13339]: INFO: attempting to detect last commit version by reading /var/lib/mysql/grastate.dat
Oct 19 20:49:13 overcloud-controller-2.localdomain galera(galera)[13350]: INFO: now attempting to detect last commit version using 'mysqld_safe --wsrep-recover'
Oct 19 20:49:16 overcloud-controller-2.localdomain galera(galera)[14489]: ERROR: Unable to detect last known write sequence number

At this point, the only solution is to rebuild the datadir for galera to restart properly.
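To avoid this in the future, the node (or the whole cluster) has to be stopped cleanly before the OS reboot, so that galera writes its final seqno to grastate.dat. A minimal sketch of the clean-reboot sequence, using the same pcs commands as in the reproduction steps:

  # reboot a single controller
  pcs cluster stop     # stops the pacemaker-managed resources, including galera, on this node
  reboot

  # or, for a reboot of all controllers
  pcs cluster stop --all
  # ... reboot the nodes ...
  pcs cluster start --all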
Closing this one, as it was not caused by the minor update but by a bad reboot (no shutdown of the cluster before the VM reboot).