Bug 1382127 - Upgrade from 8.0 to 9.0 fails on step -major-upgrade-pacemaker-converge.yaml
Summary: Upgrade from 8.0 to 9.0 fails on step -major-upgrade-pacemaker-converge.yaml
Keywords:
Status: CLOSED DUPLICATE of bug 1413686
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 9.0 (Mitaka)
Assignee: Michele Baldessari
QA Contact: Omri Hochman
URL:
Whiteboard:
Depends On:
Blocks: 1305654
 
Reported: 2016-10-05 19:44 UTC by Randy Perryman
Modified: 2017-02-24 18:04 UTC
15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-24 18:04:21 UTC
Target Upstream Version:


Attachments (Terms of Use)
heat deployment show (36.69 KB, text/plain)
2016-10-05 19:44 UTC, Randy Perryman
Steps to Upgrade (10.39 KB, text/plain)
2016-10-06 15:46 UTC, Randy Perryman
New SOS report (19.00 MB, application/x-xz)
2017-02-21 21:24 UTC, Randy Perryman

Description Randy Perryman 2016-10-05 19:44:38 UTC
Created attachment 1207669 [details]
heat deployment show

We are doing an upgrade from 8.0 to 9.0, and the last step, which has us running
-e ~/pilot/templates/overcloud/environments/major-upgrade-pacemaker-converge.yaml
is failing.

Command to start the upgrade:

 openstack overcloud deploy --log-file /home/rlp/pilot/finalzie_upgrade_deployment.log -t 120 \
   --templates /home/rlp/pilot/templates/overcloud \
   -e ~/pilot/templates/overcloud/environments/network-isolation.yaml \
   -e ~/pilot/templates/overcloud/environments/storage-environment.yaml \
   -e ~/pilot/templates/overcloud/environments/puppet-pacemaker.yaml \
   -e ~/pilot/templates/overcloud/environments/major-upgrade-pacemaker-converge.yaml \
   -e ~/pilot/templates/dell-environment.yaml \
   -e ~/pilot/templates/network-environment.yaml \
   --control-flavor control --compute-flavor compute \
   --ceph-storage-flavor ceph-storage --swift-storage-flavor swift-storage \
   --block-storage-flavor block-storage \
   --neutron-public-interface bond1 --neutron-network-type vlan \
   --neutron-disable-tunneling \
   --control-scale 3 --compute-scale 3 --ceph-storage-scale 3 \
   --ntp-server 10.127.1.3 \
   --neutron-network-vlan-ranges physint:201:220,physext \
   --neutron-bridge-mappings physint:br-tenant,physext:br-ex

Comment 1 Zane Bitter 2016-10-05 19:48:04 UTC
Looks like galera failed to start.

Comment 2 Randy Perryman 2016-10-05 20:23:37 UTC
Agreed. pcs status shows them all in an unmanaged state, and cntl2 shows as down.

Comment 3 Randy Perryman 2016-10-06 13:21:28 UTC
Looking at the MySQL processes, I see all three starting at the same start position, and they all seem to be in sync:

[root@overcloud-controller-1 ~]# ps axx | grep mysql
  9010 ?        S      0:00 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --datadir=/var/lib/mysql --log-error=/var/log/mysqld.log --user=mysql --open-files-limit=16384 --wsrep-cluster-address=gcomm://overcloud-controller-0,overcloud-controller-1,overcloud-controller-2
 10179 ?        Sl     1:04 /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep-provider=/usr/lib64/galera/libgalera_smm.so --wsrep-cluster-address=gcomm://overcloud-controller-0,overcloud-controller-1,overcloud-controller-2 --log-error=/var/log/mysqld.log --open-files-limit=16384 --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --port=3306 --wsrep_start_position=3ddd74fe-8a5d-11e6-96f2-e73e219312b5:138172

Though PCS shows:

 Master/Slave Set: galera-master [galera]
     galera     (ocf::heartbeat:galera):        FAILED Master overcloud-controller-1 (unmanaged)
     galera     (ocf::heartbeat:galera):        FAILED Master overcloud-controller-0 (unmanaged)
     Masters: [ overcloud-controller-2 ]

and 
* galera_promote_0 on overcloud-controller-1 'unknown error' (1): call=3377, status=complete, exitreason='Failed initial monitor action',
    last-rc-change='Wed Oct  5 22:05:13 2016', queued=0ms, exec=8469ms
* galera_promote_0 on overcloud-controller-0 'unknown error' (1): call=3259, status=complete, exitreason='Failed initial monitor action',
    last-rc-change='Wed Oct  5 22:05:22 2016', queued=0ms, exec=8441ms
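
Randy's observation that all the mysqld processes report the same wsrep_start_position can also be cross-checked from each node's galera state file. A minimal sketch of that comparison; the sample file contents are written locally purely for illustration (on a real controller the file is /var/lib/mysql/grastate.dat):

```shell
#!/bin/sh
# Sketch: compare the galera state UUID:seqno across nodes using
# grastate.dat-style content (samples created locally for illustration).
tmp=$(mktemp -d)
cat > "$tmp/ctl0" <<'EOF'
# GALERA saved state
version: 2.1
uuid:    3ddd74fe-8a5d-11e6-96f2-e73e219312b5
seqno:   138172
EOF
cp "$tmp/ctl0" "$tmp/ctl1"
cp "$tmp/ctl0" "$tmp/ctl2"

state() {
    # Print "uuid:seqno" for one grastate.dat-style file.
    awk '/^uuid:/ {u=$2} /^seqno:/ {s=$2} END {print u ":" s}' "$1"
}

ref=$(state "$tmp/ctl0")
for f in "$tmp"/ctl1 "$tmp"/ctl2; do
    [ "$(state "$f")" = "$ref" ] && echo "in sync: $f" || echo "DIVERGED: $f"
done
rm -rf "$tmp"
```

If all nodes print the same UUID:seqno pair, the data itself is consistent and the failure is more likely in the resource agent's monitoring, as turned out to be the case here.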

Comment 4 Mike Burns 2016-10-06 13:26:43 UTC
Randy, can you provide sos_reports from the controllers?

Comment 5 Randy Perryman 2016-10-06 13:35:14 UTC
Where do I put them, as they are greater than 25M?

Comment 6 Fabio Massimo Di Nitto 2016-10-06 13:36:42 UTC
(In reply to Randy Perryman from comment #5)
> Where do I put them, as they are greater than 25M?

Anywhere we can download them; otherwise, just use tar to split them into 9 MB files and attach them across a few emails.

Comment 8 Randy Perryman 2016-10-06 14:11:55 UTC
Mike Burns has them now.

Comment 10 Michele Baldessari 2016-10-06 15:16:07 UTC
Thanks Randy, we got the files.

So the reason galera did not start on ctrl-0 and 1 seems to be the following:

Oct  5 22:02:13 overcloud-controller-1 crmd[5194]:  notice: overcloud-controller-1-galera_monitor_0:3355 [ ERROR 1045 (28000): Access denied for user 'clustercheck'@'localhost' (using password: NO)\nocf-exit-reason:Unable to retrieve wsrep_cluster_status, verify check_user 'clustercheck' has permissions to view status\nocf-exit-reason:local node <overcloud-controller-1> is started, but not in primary mode. Unknown state.\n ]

So it seems that /etc/sysconfig/clustercheck has not been correctly populated on ctrl-0 and ctrl-1 (ctrl-2 seems to be fine). Now, depending on which version of the mitaka/rhos9 templates you are using, there is some code in place to actually set the root password for the mysql user (which previously was empty).

So a couple of questions from our side:
1) What versions/where did the tripleo-heat-templates from /home/rlp/pilot/templates/overcloud come from? 
2) Can we get the exact full upgrade steps you are using up until this major-upgrade-pacemaker step? I ask because, depending on the previous steps, the state we are in might be different.
3) Can you paste the content of /etc/sysconfig/clustercheck from all three controllers? (It is not captured by sosreport; feel free to obfuscate the password, should there be one.) This would help us confirm our theory.

Thanks

Comment 11 Randy Perryman 2016-10-06 15:42:44 UTC
1. Versions came from CDN and are:
openstack-tripleo-heat-templates-liberty-2.0.0-34.el7ost.noarch
openstack-tripleo-heat-templates-kilo-0.8.14-18.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-34.el7ost.noarch

2. Steps Taken are attached
3. clustercheck for cntl0/cntl1/cntl2:

cntl0:
MYSQL_USERNAME=clustercheck
MYSQL_PASSWORD=''
MYSQL_HOST=localhost

cntl1:
MYSQL_USERNAME=clustercheck
MYSQL_PASSWORD=''
MYSQL_HOST=localhost

cntl2:
MYSQL_USERNAME=clustercheck
MYSQL_PASSWORD='mMmkVqm9DcyjT9bzmxUXe9Htn'
MYSQL_HOST=localhost
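
The empty MYSQL_PASSWORD on two controllers versus the populated one on the third is exactly the condition Michele's theory predicts, and it can be checked mechanically. A minimal sketch, assuming the standard sysconfig key=value format; the file path and sample contents below are created locally for illustration (on a controller you would point the function at /etc/sysconfig/clustercheck):

```shell
#!/bin/sh
# Sketch: flag a clustercheck config whose MYSQL_PASSWORD is empty.
# A sample file is written to a temp dir purely for illustration.
tmp=$(mktemp -d)
cat > "$tmp/clustercheck" <<'EOF'
MYSQL_USERNAME=clustercheck
MYSQL_PASSWORD=''
MYSQL_HOST=localhost
EOF

check_clustercheck() {
    # Extract the password value, stripping optional surrounding quotes.
    pw=$(sed -n "s/^MYSQL_PASSWORD='\{0,1\}\([^']*\)'\{0,1\}$/\1/p" "$1")
    if [ -z "$pw" ]; then
        echo "EMPTY: $1"
    else
        echo "OK: $1"
    fi
}

check_clustercheck "$tmp/clustercheck"
rm -rf "$tmp"
```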

Comment 12 Randy Perryman 2016-10-06 15:46:59 UTC
Created attachment 1207975 [details]
Steps to Upgrade

Comment 13 Randy Perryman 2016-10-06 15:50:06 UTC
I am adding the password to ctl0 and ctl1.
PCS is now reporting success for Galera.

Still see some services stopped.

Comment 14 Randy Perryman 2016-10-07 19:40:03 UTC
I noticed that my tripleo-overcloud-passwords file now has the following in it:

[rlp@paisley-dir ~]$ cat tripleo-overcloud-passwords
NEUTRON_METADATA_PROXY_SHARED_SECRET=pvaPKDbHzkA7wvd9AyHW8GMgr
OVERCLOUD_GLANCE_PASSWORD=Yrw94JDkyCFzwfj7Jxr2m9vPE
OVERCLOUD_NOVA_PASSWORD=6XDXvEp8zWqUzPfYWKwvWtk8M
OVERCLOUD_GNOCCHI_PASSWORD=CGJWjFRdcE3BvdpcdtfWRqjYC
OVERCLOUD_HEAT_PASSWORD=nHpJEm934xswK6G8HnMxnDKsm
OVERCLOUD_RABBITMQ_PASSWORD=86KjeW9QxMtHtWqtrBUryf9zN
OVERCLOUD_REDIS_PASSWORD=VtYsW2Mz2qdKeK89XexThysR9
OVERCLOUD_ADMIN_TOKEN=vdY47CD6QYU2MGsPr9g9jNx6j
OVERCLOUD_CINDER_PASSWORD=GptrjcGhBRjv98YdUs7Fz2AZy
OVERCLOUD_SWIFT_PASSWORD=QJbPsg7YtARaWtWXPHEnzFZnW
OVERCLOUD_SWIFT_HASH=rqJCX6Ud7ndu2XYnXjNGgkkxf
OVERCLOUD_HAPROXY_STATS_PASSWORD=jPpWhWpBm4UqgRrJyf7tdrtd8
OVERCLOUD_SAHARA_PASSWORD=bPMdxTWnjPEmZedzQf8h3UfQx
OVERCLOUD_CEILOMETER_SECRET=fchdktBTdxvJBxWJddN9RWJDE
OVERCLOUD_AODH_PASSWORD=w6CtsvvkqFcBm92UfVq2QHGE8
OVERCLOUD_NEUTRON_PASSWORD=axeRRhJNrRCuWUaeE9jeWRth2
OVERCLOUD_DEMO_PASSWORD=NPX932j3sbp2Wzw3Rf3NPNDvt
OVERCLOUD_CEILOMETER_PASSWORD=meQgaMbqFUxP3RV3regAV69RT
OVERCLOUD_ADMIN_PASSWORD=Um72CKgsafZfd2W4aHFwyKksg
OVERCLOUD_MYSQL_CLUSTERCHECK_PASSWORD=mMmkVqm9DcyjT9bzmxUXe9Htn
MYSQL_CLUSTERCHECK_PASSWORD=MysqlMQv3HZDmCrE8Ug33xxyDw
OVERCLOUD_HEAT_STACK_DOMAIN_PASSWORD=nACeAFzWfNAsWBGcEGnAMsPb7



-------------------------
We added the MYSQL_CLUSTERCHECK_PASSWORD to it for the OSP 8.0 minor update. Is this a problem for the OSP 8 -> OSP 9 major upgrade?
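
Note that the file above now carries two clustercheck-related keys: the hand-added MYSQL_CLUSTERCHECK_PASSWORD and the generated OVERCLOUD_MYSQL_CLUSTERCHECK_PASSWORD, with different values. A sketch of how one might spot such a collision; the sample file contents are reproduced from the comment purely for illustration:

```shell
#!/bin/sh
# Sketch: count clustercheck-related keys in a tripleo passwords file and
# warn when more than one variant is present (e.g. a hand-added key next
# to the generated OVERCLOUD_ one).
pwfile=$(mktemp)
cat > "$pwfile" <<'EOF'
OVERCLOUD_MYSQL_CLUSTERCHECK_PASSWORD=mMmkVqm9DcyjT9bzmxUXe9Htn
MYSQL_CLUSTERCHECK_PASSWORD=MysqlMQv3HZDmCrE8Ug33xxyDw
EOF

variants=$(grep -c 'CLUSTERCHECK_PASSWORD=' "$pwfile")
if [ "$variants" -gt 1 ]; then
    echo "WARNING: $variants clustercheck password entries found"
    grep 'CLUSTERCHECK_PASSWORD=' "$pwfile"
fi
rm -f "$pwfile"
```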

Comment 15 Michele Baldessari 2016-10-11 14:18:32 UTC
Thanks for the data, Randy.

So python-tripleoclient-0.3.4-6.el7ost, when running the deploy commands, should generate an OVERCLOUD_MYSQL_CLUSTERCHECK_PASSWORD, which is then fed to the deploy commands (we need at least openstack-tripleo-heat-templates-0.8.14-14.el7ost for that to work). With those two, we should get the appropriate clustercheck password in the hiera tree on the nodes.

Can we make sure that /home/rlp/pilot/templates/overcloud has at least the 0.8.14-14.el7ost version of the templates and that python-tripleoclient has been updated to 0.3.4-6.el7ost?

Comment 16 Randy Perryman 2016-10-12 14:56:51 UTC
(In reply to Michele Baldessari from comment #15)
> Thanks for the data, Randy.
> 
> So with python-tripleoclient-0.3.4-6.el7ost when doing the deploy commands
> as that should generate a OVERCLOUD_MYSQL_CLUSTERCHECK_PASSWORD password
> which will then be fed to the deploy commands (we need at least
> openstack-tripleo-heat-templates-0.8.14-14.el7ost for that to work). With
> those two we should get the appropriate clustercheck_password in the hiera
> tree on the nodes.
> 
> Can we make sure that /home/rlp/pilot/templates/overcloud is at least
> 0.8.14-14.el7ost version of the templates and that the python-tripleoclient
> has been updated to  0.3.4-6.el7ost?

The installation has been wiped out and a new install/update/upgrade is in progress.

Comment 17 Randy Perryman 2016-10-25 08:29:15 UTC
Here are the templates installed:


[rlp@paisley-dir ~]$ rpm -qa | grep tripleo
openstack-tripleo-heat-templates-liberty-2.0.0-34.el7ost.noarch
openstack-tripleo-puppet-elements-2.0.0-4.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-34.el7ost.noarch
openstack-tripleo-common-2.0.0-8.el7ost.noarch
python-tripleoclient-2.0.0-3.el7ost.noarch
openstack-tripleo-image-elements-0.9.9-6.el7ost.noarch
openstack-tripleo-0.0.8-0.2.d81bd6dgit.el7ost.noarch
openstack-tripleo-heat-templates-kilo-0.8.14-18.el7ost.noarch

Comment 18 Michele Baldessari 2017-01-25 15:41:55 UTC
Hi Randy,

so Sofer and I discussed this issue and https://bugzilla.redhat.com/show_bug.cgi?id=1413686, and we are a bit at a loss as to why we see certain phenomena. I'd propose a screen-sharing session with an environment where we can reproduce the two bugs. Can you propose a couple of time slots for next week or so? (If possible not on Mondays; both Sofer and I are in the CET timezone.)

Thanks,
Michele

Comment 19 Randy Perryman 2017-01-30 19:05:40 UTC
We now have a test bed in a state ready to debug.

Comment 20 Randy Perryman 2017-02-01 14:03:23 UTC
It has not been recreated in the last 90 days.

Comment 21 Michele Baldessari 2017-02-01 16:33:36 UTC
Ack, thanks for your time today Randy. Let's focus on 1413686 and leave this open a bit longer in case we reproduce it.

Comment 22 Randy Perryman 2017-02-21 21:24:16 UTC
Created attachment 1256256 [details]
New SOS report

Comment 23 Randy Perryman 2017-02-21 21:26:32 UTC
So I think we hit this again today, but I am not sure. After running the final step, the DB on one node did not come back up. Looking at the clustercheck password on all three nodes, it is ''. Two of the nodes came up and worked okay.

Does the SOS report show the same issue?

Comment 25 Randy Perryman 2017-02-22 22:13:31 UTC
We have hit this issue two times in the last two days.
 
170222 21:56:46 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
170222 21:56:46 [Note] WSREP: Waiting for SST to complete.
170222 21:56:46 [Note] WSREP: STATE EXCHANGE: sent state msg: cd98eac3-f949-11e6-be0b-bfcb19c69342
170222 21:56:46 [Note] WSREP: STATE EXCHANGE: got state msg: cd98eac3-f949-11e6-be0b-bfcb19c69342 from 0 (r8-controller-2.localdomain)
170222 21:56:46 [Note] WSREP: STATE EXCHANGE: got state msg: cd98eac3-f949-11e6-be0b-bfcb19c69342 from 1 (r8-controller-0.localdomain)
170222 21:56:46 [Note] WSREP: STATE EXCHANGE: got state msg: cd98eac3-f949-11e6-be0b-bfcb19c69342 from 2 (r8-controller-1.localdomain)
170222 21:56:46 [Note] WSREP: Quorum results:
        version    = 3,
        component  = PRIMARY,
        conf_id    = 12,
        members    = 2/3 (joined/total),
        act_id     = 183702,
        last_appl. = -1,
        protocols  = 0/5/3 (gcs/repl/appl),
        group UUID = 82406f37-f8c1-11e6-8736-cbf6be2f4ddc
170222 21:56:46 [Note] WSREP: Flow-control interval: [28, 28]
170222 21:56:46 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 183702)
170222 21:56:46 [Note] WSREP: State transfer required:
        Group state: 82406f37-f8c1-11e6-8736-cbf6be2f4ddc:183702
        Local state: 00000000-0000-0000-0000-000000000000:-1
170222 21:56:46 [Note] WSREP: New cluster view: global state: 82406f37-f8c1-11e6-8736-cbf6be2f4ddc:183702, view# 13: Primary, number of nodes: 3, my index: 2, protocol version 3
170222 21:56:46 [Warning] WSREP: Gap in state sequence. Need state transfer.
170222 21:56:46 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '127.0.0.1' --auth '' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '22287''
170222 21:56:46 [Note] WSREP: Prepared SST request: rsync|127.0.0.1:4444/rsync_sst
170222 21:56:46 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
170222 21:56:46 [Note] WSREP: REPL Protocols: 5 (3, 1)
170222 21:56:46 [Note] WSREP: Service thread queue flushed.
170222 21:56:46 [Note] WSREP: Assign initial position for certification: 183702, protocol version: 3
170222 21:56:46 [Note] WSREP: Service thread queue flushed.
170222 21:56:46 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (82406f37-f8c1-11e6-8736-cbf6be2f4ddc): 1 (Operation not permitted)
         at galera/src/replicator_str.cpp:prepare_for_IST():447. IST will be unavailable.
170222 21:56:46 [Note] WSREP: Member 2.0 (r8-controller-1.localdomain) requested state transfer from '*any*'. Selected 0.0 (r8-controller-2.localdomain)(SYNCED) as donor.
170222 21:56:46 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 183702)
170222 21:56:46 [Note] WSREP: Requesting state transfer: success, donor: 0
170222 21:56:46 [Warning] WSREP: 0.0 (r8-controller-2.localdomain): State transfer to 2.0 (r8-controller-1.localdomain) failed: -255 (Unknown error 255)
170222 21:56:46 [ERROR] WSREP: gcs/src/gcs_group.c:gcs_group_handle_join_msg():723: Will never receive state. Need to abort.
170222 21:56:46 [Note] WSREP: gcomm: terminating thread
170222 21:56:46 [Note] WSREP: gcomm: joining thread
170222 21:56:46 [Note] WSREP: gcomm: closing backend
170222 21:56:46 [Note] WSREP: view(view_id(NON_PRIM,084f5e88-f932-11e6-a8df-56a62a7602e8,13) memb {
        cd4bf88b-f949-11e6-b8ae-2ae85efbe96d,0
} joined {
} left {
} partitioned {
        084f5e88-f932-11e6-a8df-56a62a7602e8,0
        0c45a386-f932-11e6-b884-42e3a6cacc77,0
})
170222 21:56:46 [Note] WSREP: view((empty))
170222 21:56:46 [Note] WSREP: gcomm: closed
170222 21:56:46 [Note] WSREP: /usr/libexec/mysqld: Terminated.
170222 21:56:46 mysqld_safe mysqld from pid file /var/run/mysql/mysqld.pid ended
WSREP_SST: [ERROR] Parent mysqld process (PID:22287) terminated unexpectedly. (20170222 21:56:46.605)
WSREP_SST: [INFO] Joiner cleanup. (20170222 21:56:46.608)
WSREP_SST: [INFO] Joiner cleanup done. (20170222 21:56:47.121)
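
For triage, the lines that matter in a joiner log like the one above are the state-transfer failure and the subsequent abort; the donor node's own log usually holds the root cause. A grep sketch; a few of the log lines above are embedded here only so the snippet is self-contained:

```shell
#!/bin/sh
# Sketch: pull the SST-relevant lines out of a mysqld log.
log=$(mktemp)
cat > "$log" <<'EOF'
170222 21:56:46 [Note] WSREP: Requesting state transfer: success, donor: 0
170222 21:56:46 [Warning] WSREP: 0.0 (r8-controller-2.localdomain): State transfer to 2.0 (r8-controller-1.localdomain) failed: -255 (Unknown error 255)
170222 21:56:46 [ERROR] WSREP: gcs/src/gcs_group.c:gcs_group_handle_join_msg():723: Will never receive state. Need to abort.
EOF

# IST was refused (zeroed local UUID) and the full SST then failed, so
# surface the transfer failure and the fatal join error.
hits=$(grep -E 'State transfer .* failed|Will never receive state' "$log")
echo "$hits"
rm -f "$log"
```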

Comment 27 Michele Baldessari 2017-02-23 21:18:35 UTC
So, from a first cursory look, I suspect that this and https://bugzilla.redhat.com/show_bug.cgi?id=1426253 are at least partially related, in the sense that in the last uploaded sosreport I also see:
$ grep -ir bind etc/my.cnf.d/server.cnf
bind-address = 127.0.0.1

This effectively makes galera not listen on the cluster address, and gives us this failure. The missing password might be okay, because it would have been populated a bit later in the puppet convergence run.
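
A loopback bind-address like the one grepped above is easy to detect mechanically. A sketch; the config path and sample contents are created locally for illustration (on a controller the file is etc/my.cnf.d/server.cnf in the sosreport, /etc/my.cnf.d/server.cnf live):

```shell
#!/bin/sh
# Sketch: warn when galera's bind-address is loopback, which prevents
# other cluster members from reaching it for replication/SST.
cnf=$(mktemp)
cat > "$cnf" <<'EOF'
[mysqld]
bind-address = 127.0.0.1
EOF

addr=$(sed -n 's/^bind-address *= *//p' "$cnf")
case "$addr" in
    127.0.0.1|localhost|::1)
        echo "BAD: galera bound to loopback ($addr); peers cannot connect"
        ;;
    *)
        echo "OK: bind-address = $addr"
        ;;
esac
rm -f "$cnf"
```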

Comment 28 Michele Baldessari 2017-02-23 22:10:39 UTC
Randy, could you upload the full template directory that you are using (after all the additional patches have been applied)? Thanks

Comment 29 Randy Perryman 2017-02-23 22:31:05 UTC
The templates are from openstack-tripleo-heat-templates as of yesterday.

Comment 30 Michele Baldessari 2017-02-23 22:46:26 UTC
Randy, did you forget to attach them? I am just not 100% sure what the starting point here is; the theory is that some older versions of the three patches were applied. Thanks, Michele

Comment 31 Randy Perryman 2017-02-23 22:50:24 UTC
It looks like one of the patches may not have been properly applied. We are going to validate that they are all applied correctly, then rerun the upgrade.

Thank You for the information.

Comment 32 Michele Baldessari 2017-02-23 22:57:28 UTC
Thanks, Randy. Just so we are all 100% sure the needed changes are described here: https://bugzilla.redhat.com/show_bug.cgi?id=1413686#c33


So to have this working you need to apply those patches. Assuming the templates are in /usr/share/openstack-tripleo-heat-templates, the necessary commands are:

curl https://review.openstack.org/changes/408669/revisions/current/patch?download | \
    base64 -d | \
    sudo patch -d /usr/share/openstack-tripleo-heat-templates -p1

curl https://review.openstack.org/changes/422837/revisions/current/patch?download | \
    base64 -d | \
    sudo patch -d /usr/share/openstack-tripleo-heat-templates -p1

curl https://review.openstack.org/changes/428093/revisions/current/patch?download | \
    base64 -d | \
    sudo patch -d /usr/share/openstack-tripleo-heat-templates -p1

The reviews are merged upstream and won't change anymore.
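
The three commands differ only in the change number, so they can be wrapped in a small loop with error checking. A sketch; the DRY_RUN guard is an addition of this sketch (so the snippet does not touch a live system by default), while the change IDs and the patch pipeline are the ones given above:

```shell
#!/bin/sh
# Sketch: apply the three upstream reviews to the templates directory,
# stopping at the first failure. DRY_RUN=1 (the default here) only
# prints the commands instead of running them.
THT=${THT:-/usr/share/openstack-tripleo-heat-templates}
DRY_RUN=${DRY_RUN:-1}

for change in 408669 422837 428093; do
    url="https://review.openstack.org/changes/$change/revisions/current/patch?download"
    if [ "$DRY_RUN" = 1 ]; then
        echo "curl $url | base64 -d | sudo patch -d $THT -p1"
    else
        curl -s "$url" | base64 -d | sudo patch -d "$THT" -p1 || {
            echo "patch $change failed" >&2
            exit 1
        }
    fi
done
```

Stopping at the first failed patch matters here: as comment 33 shows, a silently missing third patch is exactly what kept the upgrade failing.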

Sofer and I are in Atlanta until tomorrow, so we are in your timezone. Let us know if there are any questions.

Comment 33 Audra Cooper 2017-02-24 18:01:12 UTC
(In reply to Michele Baldessari from comment #32)
> Thanks, Randy. Just so we are all 100% sure the needed changes are described
> here: https://bugzilla.redhat.com/show_bug.cgi?id=1413686#c33
> [...]

We didn't have the last one applied. After applying all 3, the upgrade is now successful.

Comment 34 Michele Baldessari 2017-02-24 18:04:21 UTC
Thanks Audra, I am closing this as a duplicate of 1413686.

*** This bug has been marked as a duplicate of bug 1413686 ***
