Bug 1499677 - OSP11->12 upgrade and clean deployment: /var/lib/mysql/gvwstate.dat gets corrupted on one of the controller nodes after rebooting them post upgrade
Summary: OSP11->12 upgrade and clean deployment: /var/lib/mysql/gvwstate.dat gets corrupted on one of the controller nodes after rebooting them post upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: resource-agents
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: pre-dev-freeze
Target Release: 7.4
Assignee: Damien Ciabrini
QA Contact: Udi Shkalim
URL:
Whiteboard:
Duplicates: 1508632
Depends On:
Blocks: 1512586
 
Reported: 2017-10-09 08:49 UTC by Marius Cornea
Modified: 2018-04-10 12:10 UTC
CC List: 15 users

Fixed In Version: resource-agents-3.9.5-113.el7
Doc Type: If docs needed, set a value
Doc Text:
When a Galera cluster node is running, it keeps track of the last known state of the cluster in the gvwstate.dat temporary file. This file is deleted after the node shuts down. Previously, an ungraceful node shutdown sometimes left an empty gvwstate.dat file on the disk. Consequently, the node failed to join the cluster on recovery. With this update, the resource-agents scripts delete this empty file, and as a result, the described problem no longer occurs.
Clone Of:
Clones: 1512586
Environment:
Last Closed: 2018-04-10 12:09:28 UTC
Target Upstream Version:
Embargoed:


Attachments: (none)


Links:
  Github: ClusterLabs resource-agents pull 1052 (closed) - "galera: recover from empty gvwstate.dat" - last updated 2020-11-18 06:40:30 UTC
  Red Hat Product Errata: RHBA-2018:0757 - last updated 2018-04-10 12:10:40 UTC

Description Marius Cornea 2017-10-09 08:49:08 UTC
Description of problem:
/var/lib/mysql/gvwstate.dat gets corrupted on one of the controller nodes after rebooting them post upgrade.

Version-Release number of selected component (if applicable):
mariadb-galera-server-5.5.42-6.el7ost.x86_64
mariadb-galera-common-5.5.42-6.el7ost.x86_64
galera-25.3.16-3.el7ost.x86_64

How reproducible:
This is not always reproducible - I've observed it on 1/3 attempts

Steps to Reproduce:
1. Deploy OSP11 with 3 controllers, 2 computes, 3 ceph nodes
2. Upgrade to OSP12
3. Force reboot controller-2, wait for all PCS services to come back up
4. Force reboot controller-0, wait for all PCS services to come back up
5. Force reboot controller-1

Actual results:
Galera cannot start on controller-1:

 Docker container set: galera-bundle [192.168.24.1:8787/rhosp12/openstack-mariadb-docker:pcmklatest]
   galera-bundle-0	(ocf::heartbeat:galera):	FAILED Master controller-1 (blocked)
   galera-bundle-1	(ocf::heartbeat:galera):	Master controller-2
   galera-bundle-2	(ocf::heartbeat:galera):	Master controller-0

Checking the logs inside the containers we can see:

171009 00:23:30 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
171009 00:23:30 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.P1xlGa' --pid-file='/var/lib/mysql/controller-1-recover.pid'
171009  0:23:30 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
171009 00:23:32 mysqld_safe WSREP: Recovered position 894400c0-ac6d-11e7-9bee-86888296944d:25272
171009  0:23:32 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295
171009  0:23:32 [Note] WSREP: wsrep_start_position var submitted: '894400c0-ac6d-11e7-9bee-86888296944d:25272'
171009  0:23:32 [Note] WSREP: Read nil XID from storage engines, skipping position init
171009  0:23:32 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
171009  0:23:32 [Note] WSREP: wsrep_load(): Galera 3.16(r5c765eb) by Codership Oy <info> loaded successfully.
171009  0:23:32 [Note] WSREP: CRC-32C: using hardware acceleration.
171009  0:23:32 [Note] WSREP: Found saved state: 894400c0-ac6d-11e7-9bee-86888296944d:-1
171009  0:23:32 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = controller-1; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.listen_addr = tcp://172.17.1.14:4567; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; p
171009  0:23:32 [Note] WSREP: Service thread queue flushed.
171009  0:23:32 [Note] WSREP: Assign initial position for certification: 25272, protocol version: -1
171009  0:23:32 [Note] WSREP: wsrep_sst_grab()
171009  0:23:32 [Note] WSREP: Start replication
171009  0:23:32 [Note] WSREP: Setting initial position to 894400c0-ac6d-11e7-9bee-86888296944d:25272
171009  0:23:32 [Note] WSREP: protonet asio version 0
171009  0:23:32 [Note] WSREP: Using CRC-32C for message checksums.
171009  0:23:32 [Note] WSREP: backend: asio
171009  0:23:32 [Note] WSREP: gcomm thread scheduling priority set to other:0 
171009  0:23:32 [Note] WSREP: restore pc from disk successfully
171009  0:23:32 [Note] WSREP: GMCast version 0
171009  0:23:32 [Note] WSREP: (00000000, 'tcp://172.17.1.14:4567') listening at tcp://172.17.1.14:4567
171009  0:23:32 [Note] WSREP: (00000000, 'tcp://172.17.1.14:4567') multicast: , ttl: 1
171009  0:23:32 [ERROR] WSREP: failed to open gcomm backend connection: 131: invalid UUID: 00000000 (FATAL)
	 at gcomm/src/pc.cpp:PC():271
171009  0:23:32 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -131 (State not recoverable)
171009  0:23:32 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1380: Failed to open channel 'galera_cluster' at 'gcomm://controller-0.internalapi.localdomain,controller-1.internalapi.localdomain,controller-2.internalapi.localdomain': -131 (State not recoverable)
171009  0:23:32 [ERROR] WSREP: gcs connect failed: State not recoverable
171009  0:23:32 [ERROR] WSREP: wsrep::connect() failed: 7
171009  0:23:32 [ERROR] Aborting

171009  0:23:32 [Note] WSREP: Service disconnected.
171009  0:23:33 [Note] WSREP: Some threads may fail to exit.
171009  0:23:33 [Note] /usr/libexec/mysqld: Shutdown complete

Checking /var/lib/mysql/gvwstate.dat on the controller nodes we can see that on controller-1 it's empty:

(undercloud) [stack@undercloud-0 ~]$ ssh -q heat-admin@192.168.24.10 'cat /etc/hostname; sudo wc -c /var/lib/mysql/gvwstate.dat'
controller-0
217 /var/lib/mysql/gvwstate.dat
(undercloud) [stack@undercloud-0 ~]$ ssh -q heat-admin@192.168.24.7 'cat /etc/hostname; sudo wc -c /var/lib/mysql/gvwstate.dat'
controller-1
0 /var/lib/mysql/gvwstate.dat
(undercloud) [stack@undercloud-0 ~]$ ssh -q heat-admin@192.168.24.21 'cat /etc/hostname; sudo wc -c /var/lib/mysql/gvwstate.dat'
controller-2
217 /var/lib/mysql/gvwstate.dat

Expected results:
Reboots can be performed cleanly without Galera getting blocked.

Additional info:

Workaround: rm -f /var/lib/mysql/gvwstate.dat; then reboot the node

Comment 2 Marius Cornea 2017-10-09 08:57:04 UTC
resource-agents and pacemaker versions:

[root@controller-2 heat-admin]# rpm -qa | grep resource-agents
resource-agents-3.9.5-105.el7_4.2.x86_64
[root@controller-2 heat-admin]# rpm -qa | grep pacemaker
puppet-pacemaker-0.6.1-0.20170927162722.d0584c5.el7ost.noarch
pacemaker-libs-devel-1.1.16-12.el7_4.4.x86_64
pacemaker-cli-1.1.16-12.el7_4.4.x86_64
pacemaker-cluster-libs-1.1.16-12.el7_4.4.x86_64
pacemaker-cts-1.1.16-12.el7_4.4.x86_64
pacemaker-doc-1.1.16-12.el7_4.4.x86_64
pacemaker-libs-1.1.16-12.el7_4.4.x86_64
pacemaker-debuginfo-1.1.16-12.el7_4.4.x86_64
pacemaker-remote-1.1.16-12.el7_4.4.x86_64
pacemaker-nagios-plugins-metadata-1.1.16-12.el7_4.4.x86_64
pacemaker-1.1.16-12.el7_4.4.x86_64
ansible-pacemaker-1.0.3-0.20170907130253.1279294.el7ost.noarch

Comment 4 Michael Bayer 2017-10-09 13:53:01 UTC
hey Damien-

this is a known behavior (a file that gets cleanly rewritten ends up zeroed out) with XFS + non-graceful shutdown. I had the thought that maybe the RA could do a quick check on the gvwstate.dat file and remove it if it's zeroed out?
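[Editor's note: for illustration only, a minimal shell sketch of the kind of check being proposed here, i.e. drop gvwstate.dat if a crash left it behind empty, before Galera tries to restore the primary component from it. This is not the change that was eventually merged upstream (see the pull request referenced below in comment 9); it assumes it runs inside the galera resource agent, where the OCF shell helpers such as ocf_log are already sourced.]

GVWSTATE="/var/lib/mysql/gvwstate.dat"
# An ungraceful shutdown on XFS can leave gvwstate.dat as a zero-byte file.
# Remove it so Galera falls back to normal state recovery instead of aborting
# with "invalid UUID: 00000000" while restoring the primary component.
if [ -f "$GVWSTATE" ] && [ ! -s "$GVWSTATE" ]; then
    ocf_log warn "empty $GVWSTATE detected, removing it to prevent PC recovery failure at next restart"
    rm -f "$GVWSTATE"
fi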

Comment 5 Damien Ciabrini 2017-10-09 14:39:57 UTC
Hey Mike,

Thanks for pointing that out. I'll work on tackling this specific case of state recovery in the resource agent.

Comment 7 Alexander Chuzhoy 2017-11-01 20:32:32 UTC
*** Bug 1508632 has been marked as a duplicate of this bug. ***

Comment 8 Alexander Chuzhoy 2017-11-01 20:44:14 UTC
The issue reproduces on a clean deployment too.

Comment 9 Damien Ciabrini 2017-11-08 14:47:42 UTC
Fix under review in ClusterLabs/resource-agents/pull/1052

Comment 13 Alexander Chuzhoy 2017-11-15 03:56:50 UTC
Environment:
resource-agents-3.9.5-105.el7_4.3.x86_64

The issue didn't reproduce upon reboot on clean deployment.

Comment 14 Damien Ciabrini 2017-11-15 13:45:11 UTC
The issue is intermittent; that's probably why it did not reoccur with the old resource-agents package.

In case of re-occurrence, there is now a log in the journal on the affected node:

WARNING: empty /var/lib/mysql/gvwstate.dat detected, removing it to prevent PC recovery failure at next restart
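[Editor's note: since the message above goes to the system journal on the affected node, one quick way to check whether this fired on a given node is simply to grep the journal for it:]

sudo journalctl | grep 'gvwstate.dat'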

Comment 15 Udi Shkalim 2018-02-12 16:37:28 UTC
Verified on: resource-agents-3.9.5-113.el7 clean deployment

[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-2 (version 1.1.16-12.el7_4.7-94ff4df) - partition with quorum
Last updated: Mon Feb 12 16:35:46 2018
Last change: Mon Feb 12 15:45:29 2018 by redis-bundle-0 via crm_attribute on controller-0

12 nodes configured
37 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Started controller-0
   rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Started controller-1
   rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Started controller-2
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0	(ocf::heartbeat:galera):	Master controller-0
   galera-bundle-1	(ocf::heartbeat:galera):	Master controller-1
   galera-bundle-2	(ocf::heartbeat:galera):	Master controller-2
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp13/openstack-redis:pcmklatest]
   redis-bundle-0	(ocf::heartbeat:redis):	Master controller-0
   redis-bundle-1	(ocf::heartbeat:redis):	Slave controller-1
   redis-bundle-2	(ocf::heartbeat:redis):	Slave controller-2
 ip-192.168.24.11	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-10.0.0.109	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-172.17.1.14	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-172.17.1.18	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-172.17.3.14	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-172.17.4.10	(ocf::heartbeat:IPaddr2):	Started controller-2
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0	(ocf::heartbeat:docker):	Started controller-0
   haproxy-bundle-docker-1	(ocf::heartbeat:docker):	Started controller-1
   haproxy-bundle-docker-2	(ocf::heartbeat:docker):	Started controller-2
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp13/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0	(ocf::heartbeat:docker):	Started controller-0

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Comment 18 errata-xmlrpc 2018-04-10 12:09:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0757

