Bug 1499677
| Summary: | OSP11->12 upgrade and clean deployment: /var/lib/mysql/gvwstate.dat gets corrupted on one of the controller nodes after rebooting them post upgrade | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Marius Cornea <mcornea> |
| Component: | resource-agents | Assignee: | Damien Ciabrini <dciabrin> |
| Status: | CLOSED ERRATA | QA Contact: | Udi Shkalim <ushkalim> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 7.4 | CC: | agk, aherr, cfeist, chjones, cluster-maint, dbecker, fdinitto, mbayer, mburns, mkrcmari, morazi, oalbrigt, rhel-osp-director-maint, sasha, srevivo |
| Target Milestone: | pre-dev-freeze | Keywords: | AutomationBlocker, Triaged, ZStream |
| Target Release: | 7.4 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | resource-agents-3.9.5-113.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | When a Galera cluster node is running, it keeps track of the last known state of the cluster in the gvwstate.dat temporary file. This file is deleted after the node shuts down. Previously, an ungraceful node shutdown sometimes left an empty gvwstate.dat file on the disk. Consequently, the node failed to join the cluster on recovery. With this update, the resource-agents scripts delete this empty file, and as a result, the described problem no longer occurs. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| Clones: | 1512586 (view as bug list) | Environment: | |
| Last Closed: | 2018-04-10 12:09:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1512586 | | |
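As context for the Doc Text above, here is a minimal sketch of the kind of empty-file guard it describes, assuming the standard /var/lib/mysql/gvwstate.dat path. The function name is invented for illustration; this is not the actual code merged into resource-agents (see the upstream pull request referenced in the comments below).

```bash
#!/bin/bash
# Illustrative sketch only: not the actual resource-agents implementation.
# Path and warning text follow this bug report; the function name is hypothetical.

GVWSTATE=/var/lib/mysql/gvwstate.dat

clean_empty_gvwstate() {
    # After an ungraceful shutdown on XFS, gvwstate.dat can be left behind
    # as a zero-byte file, which prevents the node from rejoining the
    # primary component (PC) on recovery.
    if [ -f "$GVWSTATE" ] && [ ! -s "$GVWSTATE" ]; then
        echo "WARNING: empty $GVWSTATE detected, removing it to prevent PC recovery failure at next restart" >&2
        rm -f "$GVWSTATE"
    fi
}

clean_empty_gvwstate
```

The guard only removes a zero-byte file; a valid, non-empty gvwstate.dat is left in place so Galera can still use it for primary-component recovery.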
Description
Marius Cornea
2017-10-09 08:49:08 UTC
resource-agents and pacemaker versions:

```
[root@controller-2 heat-admin]# rpm -qa | grep resource-agents
resource-agents-3.9.5-105.el7_4.2.x86_64
[root@controller-2 heat-admin]# rpm -qa | grep pacemaker
puppet-pacemaker-0.6.1-0.20170927162722.d0584c5.el7ost.noarch
pacemaker-libs-devel-1.1.16-12.el7_4.4.x86_64
pacemaker-cli-1.1.16-12.el7_4.4.x86_64
pacemaker-cluster-libs-1.1.16-12.el7_4.4.x86_64
pacemaker-cts-1.1.16-12.el7_4.4.x86_64
pacemaker-doc-1.1.16-12.el7_4.4.x86_64
pacemaker-libs-1.1.16-12.el7_4.4.x86_64
pacemaker-debuginfo-1.1.16-12.el7_4.4.x86_64
pacemaker-remote-1.1.16-12.el7_4.4.x86_64
pacemaker-nagios-plugins-metadata-1.1.16-12.el7_4.4.x86_64
pacemaker-1.1.16-12.el7_4.4.x86_64
ansible-pacemaker-1.0.3-0.20170907130253.1279294.el7ost.noarch
```

---

hey Damien - this is a known behavior with XFS + non-graceful shutdown (a file that gets cleanly rewritten can end up zeroed out). I had the thought that maybe the RA could do a quick check on the gvwstate.dat file and remove it if it's zeroed out?

---

Hey Mike, thanks for pointing that out. I'll work on tackling this specific case of state recovery in the resource agent.

---

*** Bug 1508632 has been marked as a duplicate of this bug. ***

---

The issue reproduces on clean deployment too.

---

Fix under review in ClusterLabs/resource-agents/pull/1052

---

Environment: resource-agents-3.9.5-105.el7_4.3.x86_64

The issue didn't reproduce upon reboot on a clean deployment.

---

The issue is intermittent, which is probably why it did not reoccur with the old resource-agents package. In case of re-occurrence, there is now a log message in the journal on the affected node:

```
WARNING: empty /var/lib/mysql/gvwstate.dat detected, removing it to prevent PC recovery failure at next restart
```

---

Verified on: resource-agents-3.9.5-113.el7, clean deployment:

```
[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-2 (version 1.1.16-12.el7_4.7-94ff4df) - partition with quorum
Last updated: Mon Feb 12 16:35:46 2018
Last change: Mon Feb 12 15:45:29 2018 by redis-bundle-0 via crm_attribute on controller-0

12 nodes configured
37 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp13/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0   (ocf::heartbeat:rabbitmq-cluster):   Started controller-0
   rabbitmq-bundle-1   (ocf::heartbeat:rabbitmq-cluster):   Started controller-1
   rabbitmq-bundle-2   (ocf::heartbeat:rabbitmq-cluster):   Started controller-2
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp13/openstack-mariadb:pcmklatest]
   galera-bundle-0     (ocf::heartbeat:galera):   Master controller-0
   galera-bundle-1     (ocf::heartbeat:galera):   Master controller-1
   galera-bundle-2     (ocf::heartbeat:galera):   Master controller-2
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp13/openstack-redis:pcmklatest]
   redis-bundle-0      (ocf::heartbeat:redis):    Master controller-0
   redis-bundle-1      (ocf::heartbeat:redis):    Slave controller-1
   redis-bundle-2      (ocf::heartbeat:redis):    Slave controller-2
 ip-192.168.24.11      (ocf::heartbeat:IPaddr2):  Started controller-2
 ip-10.0.0.109         (ocf::heartbeat:IPaddr2):  Started controller-0
 ip-172.17.1.14        (ocf::heartbeat:IPaddr2):  Started controller-0
 ip-172.17.1.18        (ocf::heartbeat:IPaddr2):  Started controller-2
 ip-172.17.3.14        (ocf::heartbeat:IPaddr2):  Started controller-0
 ip-172.17.4.10        (ocf::heartbeat:IPaddr2):  Started controller-2
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0   (ocf::heartbeat:docker):   Started controller-0
   haproxy-bundle-docker-1   (ocf::heartbeat:docker):   Started controller-1
   haproxy-bundle-docker-2   (ocf::heartbeat:docker):   Started controller-2
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp13/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0   (ocf::heartbeat:docker):   Started controller-0

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
```

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0757
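For QA or operators re-checking a controller node after a reboot, a minimal sketch of how to confirm the fixed agent's behavior, assuming the agent's warning reaches the system journal (as the comments above indicate) and the default /var/lib/mysql/gvwstate.dat path:

```bash
# Look for the agent's warning since the current boot (assumes journald
# collects the resource agent's log output on this node)
journalctl -b | grep "empty /var/lib/mysql/gvwstate.dat"

# The file should be either absent or non-empty; a zero-byte file would
# indicate the pre-fix failure mode
ls -l /var/lib/mysql/gvwstate.dat 2>/dev/null || echo "gvwstate.dat not present"
```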