Bug 1512586
Summary: | OSP11->12 upgrade and clean deployment: /var/lib/mysql/gvwstate.dat gets corrupted on one of the controller nodes after rebooting them post upgrade [rhel-7.4.z] | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Oneata Mircea Teodor <toneata> |
Component: | resource-agents | Assignee: | Oyvind Albrigtsen <oalbrigt> |
Status: | CLOSED ERRATA | QA Contact: | Udi Shkalim <ushkalim> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 7.4 | CC: | agk, aherr, cfeist, chjones, cluster-maint, dbecker, dciabrin, fdinitto, mbayer, mburns, mcornea, mjuricek, mkrcmari, morazi, oalbrigt, rhel-osp-director-maint, sasha, srevivo |
Target Milestone: | rc | Keywords: | AutomationBlocker, Triaged, ZStream |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | resource-agents-3.9.5-105.el7_4.3 | Doc Type: | If docs needed, set a value |
Doc Text: |
When a Galera cluster node is running, it keeps track of the last known state of the cluster in the gvwstate.dat temporary file. This file is deleted after the node shuts down. Previously, an ungraceful node shutdown sometimes left an empty gvwstate.dat file on the disk. Consequently, the node failed to join the cluster on recovery. With this update, the resource-agents scripts delete this empty file, and as a result, the described problem no longer occurs.
|
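The cleanup described in the Doc Text can be sketched as a small shell check. This is a minimal illustration, not the shipped code: the real logic lives inside the ocf:heartbeat:galera agent from resource-agents, and the function name and demo directory below are made up for the example. The warning string matches the one quoted in the verification log.

```shell
# Illustrative sketch of the empty-gvwstate.dat cleanup added in
# resource-agents-3.9.5-105.el7_4.3. The function name is hypothetical;
# the actual check is performed inside the ocf:heartbeat:galera agent.
remove_empty_gvwstate() {
    gvwstate="$1/gvwstate.dat"
    # An ungraceful shutdown can leave a zero-byte gvwstate.dat behind;
    # Galera then fails to recover the primary component (PC) on restart.
    # -f: file exists; ! -s: file has zero size.
    if [ -f "$gvwstate" ] && [ ! -s "$gvwstate" ]; then
        echo "WARNING: empty $gvwstate detected, removing it to prevent PC recovery failure at next restart"
        rm -f "$gvwstate"
    fi
}

# Demo against a throwaway directory instead of the live /var/lib/mysql:
demo_dir=$(mktemp -d)
touch "$demo_dir/gvwstate.dat"        # simulate the "bad stop" artifact
remove_empty_gvwstate "$demo_dir"     # logs the warning and removes the file
rm -rf "$demo_dir"
```

A non-empty gvwstate.dat (a genuine saved view state) fails the `! -s` test and is left untouched, so the check only discards files that Galera could never parse anyway.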
Story Points: | --- |
Clone Of: | 1499677 | Environment: | |
Last Closed: | 2017-11-30 16:09:46 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1499677 | ||
Bug Blocks: |
Description
Oneata Mircea Teodor
2017-11-13 15:23:53 UTC
Instructions for testing:

1. Start a 3-node galera cluster and let it synchronize fully.
2. Once the cluster is ready, stop it via pacemaker:

       pcs resource disable galera-master

3. On one of the nodes, simulate a "bad stop" event that would leave an empty view state on disk:

       sudo touch /var/lib/mysql/gvwstate.dat
       sudo chown mysql. /var/lib/mysql/gvwstate.dat

4. Restart the galera cluster and watch it restart properly on the node with the empty view state:

       pcs resource enable galera-master

5. On the recovered node, look in the journal log for the recovery string:

       WARNING: empty /var/lib/mysql/gvwstate.dat detected, removing it to prevent PC recovery failure at next restart

Verified on: resource-agents-3.9.5-105.el7_4.3

Followed the steps in comment #4 and the resource was recovered successfully.

/var/log/messages:

    Nov 20 14:14:08 localhost galera(galera)[92]: WARNING: empty /var/lib/mysql/gvwstate.dat detected, removing it to prevent PC recovery failure at next restart

    [root@controller-0 ~]# pcs status
    Cluster name: tripleo_cluster
    Stack: corosync
    Current DC: controller-2 (version 1.1.16-12.el7_4.4-94ff4df) - partition with quorum
    Last updated: Mon Nov 20 19:15:36 2017
    Last change: Mon Nov 20 19:13:55 2017 by root via cibadmin on controller-0

    12 nodes configured
    37 resources configured

    Online: [ controller-0 controller-1 controller-2 ]
    GuestOnline: [ galera-bundle-0@controller-1 galera-bundle-1@controller-2 galera-bundle-2@controller-0 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

    Full list of resources:

     ip-192.168.24.8    (ocf::heartbeat:IPaddr2):  Started controller-0
     ip-10.0.0.101      (ocf::heartbeat:IPaddr2):  Started controller-1
     ip-172.17.1.13     (ocf::heartbeat:IPaddr2):  Started controller-2
     ip-172.17.1.16     (ocf::heartbeat:IPaddr2):  Started controller-0
     ip-172.17.3.10     (ocf::heartbeat:IPaddr2):  Started controller-1
     ip-172.17.4.17     (ocf::heartbeat:IPaddr2):  Started controller-2
     openstack-cinder-volume    (systemd:openstack-cinder-volume):  Started controller-0
     Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp12/openstack-rabbitmq-docker:pcmklatest]
       rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):  Started controller-0
       rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):  Started controller-1
       rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):  Started controller-2
     Docker container set: galera-bundle [192.168.24.1:8787/rhosp12/openstack-mariadb-docker:pcmklatest]
       galera-bundle-0    (ocf::heartbeat:galera):  Master controller-1
       galera-bundle-1    (ocf::heartbeat:galera):  Master controller-2
       galera-bundle-2    (ocf::heartbeat:galera):  Master controller-0
     Docker container set: redis-bundle [192.168.24.1:8787/rhosp12/openstack-redis-docker:pcmklatest]
       redis-bundle-0    (ocf::heartbeat:redis):  Master controller-0
       redis-bundle-1    (ocf::heartbeat:redis):  Slave controller-1
       redis-bundle-2    (ocf::heartbeat:redis):  Slave controller-2
     Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp12/openstack-haproxy-docker:pcmklatest]
       haproxy-bundle-docker-0    (ocf::heartbeat:docker):  Started controller-0
       haproxy-bundle-docker-1    (ocf::heartbeat:docker):  Started controller-1
       haproxy-bundle-docker-2    (ocf::heartbeat:docker):  Started controller-2

    Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled

    [root@controller-0 ~]# rpm -qa | grep resource-agents-3.9.5-105
    resource-agents-3.9.5-105.el7_4.3.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3327