Bug 1499677

Summary: OSP11->12 upgrade and clean deployment: /var/lib/mysql/gvwstate.dat gets corrupted on one of the controller nodes after rebooting them post upgrade
Product: Red Hat Enterprise Linux 7
Reporter: Marius Cornea <mcornea>
Component: resource-agents
Assignee: Damien Ciabrini <dciabrin>
Status: CLOSED ERRATA
QA Contact: Udi Shkalim <ushkalim>
Severity: urgent
Priority: urgent
Version: 7.4
CC: agk, aherr, cfeist, chjones, cluster-maint, dbecker, fdinitto, mbayer, mburns, mkrcmari, morazi, oalbrigt, rhel-osp-director-maint, sasha, srevivo
Target Milestone: pre-dev-freeze
Keywords: AutomationBlocker, Triaged, ZStream
Target Release: 7.4
Hardware: Unspecified
OS: Unspecified
Fixed In Version: resource-agents-3.9.5-113.el7
Doc Type: If docs needed, set a value
Doc Text:
When a Galera cluster node is running, it keeps track of the last known state of the cluster in the gvwstate.dat temporary file. This file is deleted after the node shuts down. Previously, an ungraceful node shutdown sometimes left an empty gvwstate.dat file on the disk. Consequently, the node failed to join the cluster on recovery. With this update, the resource-agents scripts delete this empty file, and as a result, the described problem no longer occurs.
Last Closed: 2018-04-10 12:09:28 UTC
Type: Bug
Bug Blocks: 1512586    

Description Marius Cornea 2017-10-09 08:49:08 UTC
Description of problem:
/var/lib/mysql/gvwstate.dat gets corrupted on one of the controller nodes after rebooting them post upgrade.

Version-Release number of selected component (if applicable):
mariadb-galera-server-5.5.42-6.el7ost.x86_64
mariadb-galera-common-5.5.42-6.el7ost.x86_64
galera-25.3.16-3.el7ost.x86_64

How reproducible:
This is not always reproducible; I've observed it in 1 of 3 attempts.

Steps to Reproduce:
1. Deploy OSP11 with 3 controllers, 2 computes, 3 ceph nodes
2. Upgrade to OSP12
3. Force reboot controller-2, wait for all PCS services to come back up
4. Force reboot controller-0, wait for all PCS services to come back up
5. Force reboot controller-1 (a rough sketch of automating this reboot-and-wait sequence follows below)
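
A minimal sketch of automating steps 3-5 from the undercloud, assuming ssh access as heat-admin; the helper name, the addresses, and the "all PCS services back up" heuristic (quorum present, no FAILED/Stopped resources in pcs status) are illustrative, not part of this report:

# Force-reboot a controller and poll pcs status until pacemaker looks healthy again.
reboot_and_wait() {
    local node=$1
    ssh -q heat-admin@"$node" 'sudo reboot -f' || true   # the connection drops when the node goes down
    sleep 120                                            # give the node time to boot
    until out=$(ssh -q heat-admin@"$node" 'sudo pcs status' 2>/dev/null) \
          && grep -q 'partition with quorum' <<<"$out" \
          && ! grep -Eq 'FAILED|Stopped' <<<"$out"; do
        sleep 30
    done
}

for node in 192.168.24.21 192.168.24.10 192.168.24.7; do   # example addresses, in the order rebooted above
    reboot_and_wait "$node"
done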

Actual results:
Galera cannot start on controller-1:

 Docker container set: galera-bundle [192.168.24.1:8787/rhosp12/openstack-mariadb-docker:pcmklatest]
   galera-bundle-0	(ocf::heartbeat:galera):	FAILED Master controller-1 (blocked)
   galera-bundle-1	(ocf::heartbeat:galera):	Master controller-2
   galera-bundle-2	(ocf::heartbeat:galera):	Master controller-0

Checking the logs inside the containers we can see:

171009 00:23:30 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
171009 00:23:30 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.P1xlGa' --pid-file='/var/lib/mysql/controller-1-recover.pid'
171009  0:23:30 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
171009 00:23:32 mysqld_safe WSREP: Recovered position 894400c0-ac6d-11e7-9bee-86888296944d:25272
171009  0:23:32 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
171009  0:23:32 [Note] WSREP: wsrep_start_position var submitted: '894400c0-ac6d-11e7-9bee-86888296944d:25272'
171009  0:23:32 [Note] WSREP: Read nil XID from storage engines, skipping position init
171009  0:23:32 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
171009  0:23:32 [Note] WSREP: wsrep_load(): Galera 3.16(r5c765eb) by Codership Oy <info> loaded successfully.
171009  0:23:32 [Note] WSREP: CRC-32C: using hardware acceleration.
171009  0:23:32 [Note] WSREP: Found saved state: 894400c0-ac6d-11e7-9bee-86888296944d:-1
171009  0:23:32 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = controller-1; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.listen_addr = tcp://172.17.1.14:4567; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; p
171009  0:23:32 [Note] WSREP: Service thread queue flushed.
171009  0:23:32 [Note] WSREP: Assign initial position for certification: 25272, protocol version: -1
171009  0:23:32 [Note] WSREP: wsrep_sst_grab()
171009  0:23:32 [Note] WSREP: Start replication
171009  0:23:32 [Note] WSREP: Setting initial position to 894400c0-ac6d-11e7-9bee-86888296944d:25272
171009  0:23:32 [Note] WSREP: protonet asio version 0
171009  0:23:32 [Note] WSREP: Using CRC-32C for message checksums.
171009  0:23:32 [Note] WSREP: backend: asio
171009  0:23:32 [Note] WSREP: gcomm thread scheduling priority set to other:0 
171009  0:23:32 [Note] WSREP: restore pc from disk successfully
171009  0:23:32 [Note] WSREP: GMCast version 0
171009  0:23:32 [Note] WSREP: (00000000, 'tcp://172.17.1.14:4567') listening at tcp://172.17.1.14:4567
171009  0:23:32 [Note] WSREP: (00000000, 'tcp://172.17.1.14:4567') multicast: , ttl: 1
171009  0:23:32 [ERROR] WSREP: failed to open gcomm backend connection: 131: invalid UUID: 00000000 (FATAL)
	 at gcomm/src/pc.cpp:PC():271
171009  0:23:32 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -131 (State not recoverable)
171009  0:23:32 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1380: Failed to open channel 'galera_cluster' at 'gcomm://controller-0.internalapi.localdomain,controller-1.internalapi.localdomain,controller-2.internalapi.localdomain': -131 (State not recoverable)
171009  0:23:32 [ERROR] WSREP: gcs connect failed: State not recoverable
171009  0:23:32 [ERROR] WSREP: wsrep::connect() failed: 7
171009  0:23:32 [ERROR] Aborting

171009  0:23:32 [Note] WSREP: Service disconnected.
171009  0:23:33 [Note] WSREP: Some threads may fail to exit.
171009  0:23:33 [Note] /usr/libexec/mysqld: Shutdown complete

Checking /var/lib/mysql/gvwstate.dat on the controller nodes we can see that on controller-1 it's empty:

(undercloud) [stack@undercloud-0 ~]$ ssh -q heat-admin@192.168.24.10 'cat /etc/hostname; sudo wc -c /var/lib/mysql/gvwstate.dat'
controller-0
217 /var/lib/mysql/gvwstate.dat
(undercloud) [stack@undercloud-0 ~]$ ssh -q heat-admin@192.168.24.7 'cat /etc/hostname; sudo wc -c /var/lib/mysql/gvwstate.dat'
controller-1
0 /var/lib/mysql/gvwstate.dat
(undercloud) [stack@undercloud-0 ~]$ ssh -q heat-admin@192.168.24.21 'cat /etc/hostname; sudo wc -c /var/lib/mysql/gvwstate.dat'
controller-2
217 /var/lib/mysql/gvwstate.dat

Expected results:
Reboots can be performed cleanly without Galera getting blocked.

Additional info:

Workaround: rm -f /var/lib/mysql/gvwstate.dat on the affected node, then reboot it.
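
To spot an affected node before rebooting, one can check the gvwstate.dat size on each controller from the undercloud, the same way the sizes were collected above, and remove the file only where it is reported as 0 bytes (a sketch; the addresses are the ones used in this report and will differ per environment):

for node in 192.168.24.10 192.168.24.7 192.168.24.21; do
    echo "== $node"
    ssh -q heat-admin@"$node" 'cat /etc/hostname; sudo wc -c /var/lib/mysql/gvwstate.dat'
    # if the size is reported as 0, remove the file before the next reboot:
    # ssh -q heat-admin@"$node" 'sudo rm -f /var/lib/mysql/gvwstate.dat'
done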

Comment 2 Marius Cornea 2017-10-09 08:57:04 UTC
resource-agents and pacemaker versions:

[root@controller-2 heat-admin]# rpm -qa | grep resource-agents
resource-agents-3.9.5-105.el7_4.2.x86_64
[root@controller-2 heat-admin]# rpm -qa | grep pacemaker
puppet-pacemaker-0.6.1-0.20170927162722.d0584c5.el7ost.noarch
pacemaker-libs-devel-1.1.16-12.el7_4.4.x86_64
pacemaker-cli-1.1.16-12.el7_4.4.x86_64
pacemaker-cluster-libs-1.1.16-12.el7_4.4.x86_64
pacemaker-cts-1.1.16-12.el7_4.4.x86_64
pacemaker-doc-1.1.16-12.el7_4.4.x86_64
pacemaker-libs-1.1.16-12.el7_4.4.x86_64
pacemaker-debuginfo-1.1.16-12.el7_4.4.x86_64
pacemaker-remote-1.1.16-12.el7_4.4.x86_64
pacemaker-nagios-plugins-metadata-1.1.16-12.el7_4.4.x86_64
pacemaker-1.1.16-12.el7_4.4.x86_64
ansible-pacemaker-1.0.3-0.20170907130253.1279294.el7ost.noarch

Comment 4 Michael Bayer 2017-10-09 13:53:01 UTC
hey Damien-

this is a known behavior with XFS and non-graceful shutdowns: a file that was being cleanly rewritten can end up zeroed out. I had the thought that maybe the RA could do a quick check on the gvwstate.dat file and remove it if it's zeroed out?

Comment 5 Damien Ciabrini 2017-10-09 14:39:57 UTC
Hey Mike,

Thanks for pointing that out. I'll work on handling this specific case of state recovery in the resource agent.
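
The check being discussed amounts to a few lines run by the galera resource agent before mysqld is started. This is a sketch of the idea rather than the actual patch; OCF_RESKEY_datadir and ocf_log are standard resource-agents facilities:

# An ungraceful shutdown on XFS can leave a zero-length gvwstate.dat, which
# makes Galera's primary-component recovery fail with "invalid UUID: 00000000".
# Removing the empty file lets the node rejoin the cluster normally.
gvwstate="${OCF_RESKEY_datadir}/gvwstate.dat"
if [ -f "$gvwstate" ] && [ ! -s "$gvwstate" ]; then
    ocf_log warn "empty $gvwstate detected, removing it to prevent PC recovery failure at next restart"
    rm -f "$gvwstate"
fi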

Comment 7 Alexander Chuzhoy 2017-11-01 20:32:32 UTC
*** Bug 1508632 has been marked as a duplicate of this bug. ***

Comment 8 Alexander Chuzhoy 2017-11-01 20:44:14 UTC
The issue reproduces on a clean deployment too.

Comment 9 Damien Ciabrini 2017-11-08 14:47:42 UTC
Fix under review in ClusterLabs/resource-agents/pull/1052

Comment 13 Alexander Chuzhoy 2017-11-15 03:56:50 UTC
Environment:
resource-agents-3.9.5-105.el7_4.3.x86_64

The issue did not reproduce upon reboot on a clean deployment.

Comment 14 Damien Ciabrini 2017-11-15 13:45:11 UTC
The issue is intermittent; that's probably why it did not reoccur with the old resource-agents package.

If it occurs again, there is now a log message in the journal on the affected node:

WARNING: empty /var/lib/mysql/gvwstate.dat detected, removing it to prevent PC recovery failure at next restart
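
If the fixed agent hits this case, the warning can be confirmed on the affected node with something along these lines (a usage sketch):

sudo journalctl --since today | grep 'empty /var/lib/mysql/gvwstate.dat detected'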

Comment 15 Udi Shkalim 2018-02-12 16:37:28 UTC
Verified on: resource-agents-3.9.5-113.el7 (clean deployment)

[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-2 (version 1.1.16-12.el7_4.7-94ff4df) - partition with quorum
Last updated: Mon Feb 12 16:35:46 2018
Last change: Mon Feb 12 15:45:29 2018 by redis-bundle-0 via crm_attribute on controller-0

12 nodes configured
37 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Started controller-0
   rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Started controller-1
   rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Started controller-2
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0	(ocf::heartbeat:galera):	Master controller-0
   galera-bundle-1	(ocf::heartbeat:galera):	Master controller-1
   galera-bundle-2	(ocf::heartbeat:galera):	Master controller-2
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp13/openstack-redis:pcmklatest]
   redis-bundle-0	(ocf::heartbeat:redis):	Master controller-0
   redis-bundle-1	(ocf::heartbeat:redis):	Slave controller-1
   redis-bundle-2	(ocf::heartbeat:redis):	Slave controller-2
 ip-192.168.24.11	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-10.0.0.109	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-172.17.1.14	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-172.17.1.18	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-172.17.3.14	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-172.17.4.10	(ocf::heartbeat:IPaddr2):	Started controller-2
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0	(ocf::heartbeat:docker):	Started controller-0
   haproxy-bundle-docker-1	(ocf::heartbeat:docker):	Started controller-1
   haproxy-bundle-docker-2	(ocf::heartbeat:docker):	Started controller-2
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp13/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0	(ocf::heartbeat:docker):	Started controller-0

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Comment 18 errata-xmlrpc 2018-04-10 12:09:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0757