Bug 1382127 - Upgrade from 8.0 to 9.0 fails on step -major-upgrade-pacemaker-converge.yaml
Summary: Upgrade from 8.0 to 9.0 fails on step -major-upgrade-pacemaker-converge.yaml
Keywords:
Status: CLOSED DUPLICATE of bug 1413686
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 9.0 (Mitaka)
Assignee: Michele Baldessari
QA Contact: Omri Hochman
URL:
Whiteboard:
Depends On:
Blocks: 1305654
 
Reported: 2016-10-05 19:44 UTC by Randy Perryman
Modified: 2017-02-24 18:04 UTC
15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-24 18:04:21 UTC
Target Upstream Version:


Attachments (Terms of Use)
heat deployment show (36.69 KB, text/plain)
2016-10-05 19:44 UTC, Randy Perryman
Steps to Upgrade (10.39 KB, text/plain)
2016-10-06 15:46 UTC, Randy Perryman
New SOS report (19.00 MB, application/x-xz)
2017-02-21 21:24 UTC, Randy Perryman

Description Randy Perryman 2016-10-05 19:44:38 UTC
Created attachment 1207669 [details]
heat deployment show

We are doing an upgrade from 8.0 to 9.0, and the last step, which has us running
-e ~/pilot/templates/overcloud/environments/major-upgrade-pacemaker-converge.yaml
is failing.

Command to start the upgrade:

 openstack overcloud deploy --log-file /home/rlp/pilot/finalzie_upgrade_deployment.log -t 120 \
   --templates /home/rlp/pilot/templates/overcloud \
   -e ~/pilot/templates/overcloud/environments/network-isolation.yaml \
   -e ~/pilot/templates/overcloud/environments/storage-environment.yaml \
   -e ~/pilot/templates/overcloud/environments/puppet-pacemaker.yaml \
   -e ~/pilot/templates/overcloud/environments/major-upgrade-pacemaker-converge.yaml \
   -e ~/pilot/templates/dell-environment.yaml \
   -e ~/pilot/templates/network-environment.yaml \
   --control-flavor control --compute-flavor compute \
   --ceph-storage-flavor ceph-storage --swift-storage-flavor swift-storage \
   --block-storage-flavor block-storage \
   --neutron-public-interface bond1 --neutron-network-type vlan \
   --neutron-disable-tunneling \
   --control-scale 3 --compute-scale 3 --ceph-storage-scale 3 \
   --ntp-server 10.127.1.3 \
   --neutron-network-vlan-ranges physint:201:220,physext \
   --neutron-bridge-mappings physint:br-tenant,physext:br-ex

Comment 1 Zane Bitter 2016-10-05 19:48:04 UTC
Looks like galera failed to start.

Comment 2 Randy Perryman 2016-10-05 20:23:37 UTC
Agreed. pcs status shows them all in an unmanaged state, and cntl2 shows as down.

Comment 3 Randy Perryman 2016-10-06 13:21:28 UTC
Looking at the MySQL processes, I see all three starting at the same start position, and they all seem to be in sync:

[root@overcloud-controller-1 ~]# ps axx | grep mysql
  9010 ?        S      0:00 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --datadir=/var/lib/mysql --log-error=/var/log/mysqld.log --user=mysql --open-files-limit=16384 --wsrep-cluster-address=gcomm://overcloud-controller-0,overcloud-controller-1,overcloud-controller-2
 10179 ?        Sl     1:04 /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep-provider=/usr/lib64/galera/libgalera_smm.so --wsrep-cluster-address=gcomm://overcloud-controller-0,overcloud-controller-1,overcloud-controller-2 --log-error=/var/log/mysqld.log --open-files-limit=16384 --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --port=3306 --wsrep_start_position=3ddd74fe-8a5d-11e6-96f2-e73e219312b5:138172

Though PCS shows:

 Master/Slave Set: galera-master [galera]
     galera     (ocf::heartbeat:galera):        FAILED Master overcloud-controller-1 (unmanaged)
     galera     (ocf::heartbeat:galera):        FAILED Master overcloud-controller-0 (unmanaged)
     Masters: [ overcloud-controller-2 ]

and 
* galera_promote_0 on overcloud-controller-1 'unknown error' (1): call=3377, status=complete, exitreason='Failed initial monitor action',
    last-rc-change='Wed Oct  5 22:05:13 2016', queued=0ms, exec=8469ms
* galera_promote_0 on overcloud-controller-0 'unknown error' (1): call=3259, status=complete, exitreason='Failed initial monitor action',
    last-rc-change='Wed Oct  5 22:05:22 2016', queued=0ms, exec=8441ms
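
Randy's observation that all the mysqld processes report the same wsrep_start_position can also be cross-checked from each node's galera state file. A minimal sketch of that comparison; the sample file contents are written locally purely for illustration (on a real controller the file is /var/lib/mysql/grastate.dat):

```shell
#!/bin/sh
# Sketch: compare the galera state UUID:seqno across nodes using
# grastate.dat-style content (samples created locally for illustration).
tmp=$(mktemp -d)
cat > "$tmp/ctl0" <<'EOF'
# GALERA saved state
version: 2.1
uuid:    3ddd74fe-8a5d-11e6-96f2-e73e219312b5
seqno:   138172
EOF
cp "$tmp/ctl0" "$tmp/ctl1"
cp "$tmp/ctl0" "$tmp/ctl2"

state() {
    # Print "uuid:seqno" for one grastate.dat-style file.
    awk '/^uuid:/ {u=$2} /^seqno:/ {s=$2} END {print u ":" s}' "$1"
}

ref=$(state "$tmp/ctl0")
for f in "$tmp"/ctl1 "$tmp"/ctl2; do
    [ "$(state "$f")" = "$ref" ] && echo "in sync: $f" || echo "DIVERGED: $f"
done
rm -rf "$tmp"
```

If all nodes print the same UUID:seqno pair, the data itself is consistent and the failure is more likely in the resource agent's monitoring, as turned out to be the case here.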

Comment 4 Mike Burns 2016-10-06 13:26:43 UTC
Randy, can you provide sos_reports from the controllers?

Comment 5 Randy Perryman 2016-10-06 13:35:14 UTC
Where do I put them, as they are greater than 25M?

Comment 6 Fabio Massimo Di Nitto 2016-10-06 13:36:42 UTC
(In reply to Randy Perryman from comment #5)
> Where do I put them, as they are greater than 25M?

Anywhere we can download them; otherwise, just use tar to split them into 9 MB files and attach them across a few emails.

Comment 8 Randy Perryman 2016-10-06 14:11:55 UTC
Mike Burns has them now.

Comment 10 Michele Baldessari 2016-10-06 15:16:07 UTC
Thanks Randy, we got the files.

So the reason galera did not start on ctrl-0 and 1 seems to be the following:

Oct  5 22:02:13 overcloud-controller-1 crmd[5194]:  notice: overcloud-controller-1-galera_monitor_0:3355 [ ERROR 1045 (28000): Access denied for user 'clustercheck'@'localhost' (using password: NO)\nocf-exit-reason:Unable to retrieve wsrep_cluster_status, verify check_user 'clustercheck' has permissions to view status\nocf-exit-reason:local node <overcloud-controller-1> is started, but not in primary mode. Unknown state.\n ]

So it seems that /etc/sysconfig/clustercheck has not been correctly populated on ctrl-0 and ctrl-1 (ctrl-2 seems to be fine). Now, depending on which version of the mitaka/rhos9 templates you are using, there is some code in place to actually set the root password for the mysql user (which previously was empty).

So a couple of questions from our side:
1) What versions/where did the tripleo-heat-templates from /home/rlp/pilot/templates/overcloud come from? 
2) Can we get the exact full upgrade steps you are using up until this major-upgrade-pacemaker step? I ask because, depending on the previous steps, the state we are in might be different.
3) Can you paste the content of /etc/sysconfig/clustercheck from all three controllers? (It is not captured by sosreport; feel free to obfuscate the password, should there be one.) This would help us confirm our theory.

Thanks

Comment 11 Randy Perryman 2016-10-06 15:42:44 UTC
1. Versions came from CDN and are:
openstack-tripleo-heat-templates-liberty-2.0.0-34.el7ost.noarch
openstack-tripleo-heat-templates-kilo-0.8.14-18.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-34.el7ost.noarch

2. Steps Taken are attached
3. clustercheck for cntl0/cntl1/cntl2:

cntl0:
MYSQL_USERNAME=clustercheck
MYSQL_PASSWORD=''
MYSQL_HOST=localhost

cntl1:
MYSQL_USERNAME=clustercheck
MYSQL_PASSWORD=''
MYSQL_HOST=localhost

cntl2:
MYSQL_USERNAME=clustercheck
MYSQL_PASSWORD='mMmkVqm9DcyjT9bzmxUXe9Htn'
MYSQL_HOST=localhost
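
The empty MYSQL_PASSWORD on two controllers versus the populated one on the third is exactly the condition Michele's theory predicts, and it can be checked mechanically. A minimal sketch, assuming the standard sysconfig key=value format; the file path and sample contents below are created locally for illustration (on a controller you would point the function at /etc/sysconfig/clustercheck):

```shell
#!/bin/sh
# Sketch: flag a clustercheck config whose MYSQL_PASSWORD is empty.
# A sample file is written to a temp dir purely for illustration.
tmp=$(mktemp -d)
cat > "$tmp/clustercheck" <<'EOF'
MYSQL_USERNAME=clustercheck
MYSQL_PASSWORD=''
MYSQL_HOST=localhost
EOF

check_clustercheck() {
    # Extract the password value, stripping optional surrounding quotes.
    pw=$(sed -n "s/^MYSQL_PASSWORD='\{0,1\}\([^']*\)'\{0,1\}$/\1/p" "$1")
    if [ -z "$pw" ]; then
        echo "EMPTY: $1"
    else
        echo "OK: $1"
    fi
}

check_clustercheck "$tmp/clustercheck"
rm -rf "$tmp"
```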

Comment 12 Randy Perryman 2016-10-06 15:46:59 UTC
Created attachment 1207975 [details]
Steps to Upgrade

Comment 13 Randy Perryman 2016-10-06 15:50:06 UTC
I am adding the password to ctl0 and ctl1.
PCS is now reporting success for Galera.

Still see some services stopped.

Comment 14 Randy Perryman 2016-10-07 19:40:03 UTC
I noticed that my tripleo-overcloud-passwords file now has the following in it:

[rlp@paisley-dir ~]$ cat tripleo-overcloud-passwords
NEUTRON_METADATA_PROXY_SHARED_SECRET=pvaPKDbHzkA7wvd9AyHW8GMgr
OVERCLOUD_GLANCE_PASSWORD=Yrw94JDkyCFzwfj7Jxr2m9vPE
OVERCLOUD_NOVA_PASSWORD=6XDXvEp8zWqUzPfYWKwvWtk8M
OVERCLOUD_GNOCCHI_PASSWORD=CGJWjFRdcE3BvdpcdtfWRqjYC
OVERCLOUD_HEAT_PASSWORD=nHpJEm934xswK6G8HnMxnDKsm
OVERCLOUD_RABBITMQ_PASSWORD=86KjeW9QxMtHtWqtrBUryf9zN
OVERCLOUD_REDIS_PASSWORD=VtYsW2Mz2qdKeK89XexThysR9
OVERCLOUD_ADMIN_TOKEN=vdY47CD6QYU2MGsPr9g9jNx6j
OVERCLOUD_CINDER_PASSWORD=GptrjcGhBRjv98YdUs7Fz2AZy
OVERCLOUD_SWIFT_PASSWORD=QJbPsg7YtARaWtWXPHEnzFZnW
OVERCLOUD_SWIFT_HASH=rqJCX6Ud7ndu2XYnXjNGgkkxf
OVERCLOUD_HAPROXY_STATS_PASSWORD=jPpWhWpBm4UqgRrJyf7tdrtd8
OVERCLOUD_SAHARA_PASSWORD=bPMdxTWnjPEmZedzQf8h3UfQx
OVERCLOUD_CEILOMETER_SECRET=fchdktBTdxvJBxWJddN9RWJDE
OVERCLOUD_AODH_PASSWORD=w6CtsvvkqFcBm92UfVq2QHGE8
OVERCLOUD_NEUTRON_PASSWORD=axeRRhJNrRCuWUaeE9jeWRth2
OVERCLOUD_DEMO_PASSWORD=NPX932j3sbp2Wzw3Rf3NPNDvt
OVERCLOUD_CEILOMETER_PASSWORD=meQgaMbqFUxP3RV3regAV69RT
OVERCLOUD_ADMIN_PASSWORD=Um72CKgsafZfd2W4aHFwyKksg
OVERCLOUD_MYSQL_CLUSTERCHECK_PASSWORD=mMmkVqm9DcyjT9bzmxUXe9Htn
MYSQL_CLUSTERCHECK_PASSWORD=MysqlMQv3HZDmCrE8Ug33xxyDw
OVERCLOUD_HEAT_STACK_DOMAIN_PASSWORD=nACeAFzWfNAsWBGcEGnAMsPb7



-------------------------
We added the MYSQL_CLUSTERCHECK_PASSWORD to it for the OSP 8.0 minor update. Is this a problem for the OSP 8 -> OSP 9 major upgrade?
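
Note that the file above now carries two clustercheck-related keys: the hand-added MYSQL_CLUSTERCHECK_PASSWORD and the generated OVERCLOUD_MYSQL_CLUSTERCHECK_PASSWORD, with different values. A sketch of how one might spot such a collision; the sample file contents are reproduced from the comment purely for illustration:

```shell
#!/bin/sh
# Sketch: count clustercheck-related keys in a tripleo passwords file and
# warn when more than one variant is present (e.g. a hand-added key next
# to the generated OVERCLOUD_ one).
pwfile=$(mktemp)
cat > "$pwfile" <<'EOF'
OVERCLOUD_MYSQL_CLUSTERCHECK_PASSWORD=mMmkVqm9DcyjT9bzmxUXe9Htn
MYSQL_CLUSTERCHECK_PASSWORD=MysqlMQv3HZDmCrE8Ug33xxyDw
EOF

variants=$(grep -c 'CLUSTERCHECK_PASSWORD=' "$pwfile")
if [ "$variants" -gt 1 ]; then
    echo "WARNING: $variants clustercheck password entries found"
    grep 'CLUSTERCHECK_PASSWORD=' "$pwfile"
fi
rm -f "$pwfile"
```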

Comment 15 Michele Baldessari 2016-10-11 14:18:32 UTC
Thanks for the data, Randy.

So python-tripleoclient-0.3.4-6.el7ost, when running the deploy commands, should generate an OVERCLOUD_MYSQL_CLUSTERCHECK_PASSWORD, which is then fed to the deploy commands (we need at least openstack-tripleo-heat-templates-0.8.14-14.el7ost for that to work). With those two, we should get the appropriate clustercheck password in the hiera tree on the nodes.

Can we make sure that /home/rlp/pilot/templates/overcloud has at least the 0.8.14-14.el7ost version of the templates and that python-tripleoclient has been updated to 0.3.4-6.el7ost?

Comment 16 Randy Perryman 2016-10-12 14:56:51 UTC
(In reply to Michele Baldessari from comment #15)
> Thanks for the data, Randy.
> 
> So with python-tripleoclient-0.3.4-6.el7ost when doing the deploy commands
> as that should generate a OVERCLOUD_MYSQL_CLUSTERCHECK_PASSWORD password
> which will then be fed to the deploy commands (we need at least
> openstack-tripleo-heat-templates-0.8.14-14.el7ost for that to work). With
> those two we should get the appropriate clustercheck_password in the hiera
> tree on the nodes.
> 
> Can we make sure that /home/rlp/pilot/templates/overcloud is at least
> 0.8.14-14.el7ost version of the templates and that the python-tripleoclient
> has been updated to  0.3.4-6.el7ost?

The installation has been wiped out and a new install/update/upgrade is in progress.

Comment 17 Randy Perryman 2016-10-25 08:29:15 UTC
Here are the templates installed:


[rlp@paisley-dir ~]$ rpm -qa | grep tripleo
openstack-tripleo-heat-templates-liberty-2.0.0-34.el7ost.noarch
openstack-tripleo-puppet-elements-2.0.0-4.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-34.el7ost.noarch
openstack-tripleo-common-2.0.0-8.el7ost.noarch
python-tripleoclient-2.0.0-3.el7ost.noarch
openstack-tripleo-image-elements-0.9.9-6.el7ost.noarch
openstack-tripleo-0.0.8-0.2.d81bd6dgit.el7ost.noarch
openstack-tripleo-heat-templates-kilo-0.8.14-18.el7ost.noarch

Comment 18 Michele Baldessari 2017-01-25 15:41:55 UTC
Hi Randy,

so Sofer and I discussed this issue and https://bugzilla.redhat.com/show_bug.cgi?id=1413686, and we are a bit at a loss as to why we see certain phenomena. I'd propose a screen-sharing session with an environment where we can reproduce the two bugs. Can you propose a couple of time slots for next week or so? (If possible not on Mondays; both Sofer and I are in the CET timezone.)

Thanks,
Michele

Comment 19 Randy Perryman 2017-01-30 19:05:40 UTC
We now have a test bed in a state ready to debug.

Comment 20 Randy Perryman 2017-02-01 14:03:23 UTC
It has not been recreated in the last 90 days.

Comment 21 Michele Baldessari 2017-02-01 16:33:36 UTC
Ack, thanks for your time today Randy. Let's focus on 1413686 and leave this open a bit longer in case we reproduce it.

Comment 22 Randy Perryman 2017-02-21 21:24:16 UTC
Created attachment 1256256 [details]
New SOS report

Comment 23 Randy Perryman 2017-02-21 21:26:32 UTC
So I think we hit this again today, but I am not sure. After running the final step, the DB on one node did not come back up. Looking at the clustercheck password on all three nodes, it is ''. Two of the nodes came up and worked okay.

Does the SOS report show the same issue?

Comment 25 Randy Perryman 2017-02-22 22:13:31 UTC
We have hit this issue two times in the last two days.
 
170222 21:56:46 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
170222 21:56:46 [Note] WSREP: Waiting for SST to complete.
170222 21:56:46 [Note] WSREP: STATE EXCHANGE: sent state msg: cd98eac3-f949-11e6-be0b-bfcb19c69342
170222 21:56:46 [Note] WSREP: STATE EXCHANGE: got state msg: cd98eac3-f949-11e6-be0b-bfcb19c69342 from 0 (r8-controller-2.localdomain)
170222 21:56:46 [Note] WSREP: STATE EXCHANGE: got state msg: cd98eac3-f949-11e6-be0b-bfcb19c69342 from 1 (r8-controller-0.localdomain)
170222 21:56:46 [Note] WSREP: STATE EXCHANGE: got state msg: cd98eac3-f949-11e6-be0b-bfcb19c69342 from 2 (r8-controller-1.localdomain)
170222 21:56:46 [Note] WSREP: Quorum results:
        version    = 3,
        component  = PRIMARY,
        conf_id    = 12,
        members    = 2/3 (joined/total),
        act_id     = 183702,
        last_appl. = -1,
        protocols  = 0/5/3 (gcs/repl/appl),
        group UUID = 82406f37-f8c1-11e6-8736-cbf6be2f4ddc
170222 21:56:46 [Note] WSREP: Flow-control interval: [28, 28]
170222 21:56:46 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 183702)
170222 21:56:46 [Note] WSREP: State transfer required:
        Group state: 82406f37-f8c1-11e6-8736-cbf6be2f4ddc:183702
        Local state: 00000000-0000-0000-0000-000000000000:-1
170222 21:56:46 [Note] WSREP: New cluster view: global state: 82406f37-f8c1-11e6-8736-cbf6be2f4ddc:183702, view# 13: Primary, number of nodes: 3, my index: 2, protocol version 3
170222 21:56:46 [Warning] WSREP: Gap in state sequence. Need state transfer.
170222 21:56:46 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '127.0.0.1' --auth '' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '22287''
170222 21:56:46 [Note] WSREP: Prepared SST request: rsync|127.0.0.1:4444/rsync_sst
170222 21:56:46 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
170222 21:56:46 [Note] WSREP: REPL Protocols: 5 (3, 1)
170222 21:56:46 [Note] WSREP: Service thread queue flushed.
170222 21:56:46 [Note] WSREP: Assign initial position for certification: 183702, protocol version: 3
170222 21:56:46 [Note] WSREP: Service thread queue flushed.
170222 21:56:46 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (82406f37-f8c1-11e6-8736-cbf6be2f4ddc): 1 (Operation not permitted)
         at galera/src/replicator_str.cpp:prepare_for_IST():447. IST will be unavailable.
170222 21:56:46 [Note] WSREP: Member 2.0 (r8-controller-1.localdomain) requested state transfer from '*any*'. Selected 0.0 (r8-controller-2.localdomain)(SYNCED) as donor.
170222 21:56:46 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 183702)
170222 21:56:46 [Note] WSREP: Requesting state transfer: success, donor: 0
170222 21:56:46 [Warning] WSREP: 0.0 (r8-controller-2.localdomain): State transfer to 2.0 (r8-controller-1.localdomain) failed: -255 (Unknown error 255)
170222 21:56:46 [ERROR] WSREP: gcs/src/gcs_group.c:gcs_group_handle_join_msg():723: Will never receive state. Need to abort.
170222 21:56:46 [Note] WSREP: gcomm: terminating thread
170222 21:56:46 [Note] WSREP: gcomm: joining thread
170222 21:56:46 [Note] WSREP: gcomm: closing backend
170222 21:56:46 [Note] WSREP: view(view_id(NON_PRIM,084f5e88-f932-11e6-a8df-56a62a7602e8,13) memb {
        cd4bf88b-f949-11e6-b8ae-2ae85efbe96d,0
} joined {
} left {
} partitioned {
        084f5e88-f932-11e6-a8df-56a62a7602e8,0
        0c45a386-f932-11e6-b884-42e3a6cacc77,0
})
170222 21:56:46 [Note] WSREP: view((empty))
170222 21:56:46 [Note] WSREP: gcomm: closed
170222 21:56:46 [Note] WSREP: /usr/libexec/mysqld: Terminated.
170222 21:56:46 mysqld_safe mysqld from pid file /var/run/mysql/mysqld.pid ended
WSREP_SST: [ERROR] Parent mysqld process (PID:22287) terminated unexpectedly. (20170222 21:56:46.605)
WSREP_SST: [INFO] Joiner cleanup. (20170222 21:56:46.608)
WSREP_SST: [INFO] Joiner cleanup done. (20170222 21:56:47.121)
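
For triage, the lines that matter in a joiner log like the one above are the state-transfer failure and the subsequent abort; the donor node's own log usually holds the root cause. A grep sketch; a few of the log lines above are embedded here only so the snippet is self-contained:

```shell
#!/bin/sh
# Sketch: pull the SST-relevant lines out of a mysqld log.
log=$(mktemp)
cat > "$log" <<'EOF'
170222 21:56:46 [Note] WSREP: Requesting state transfer: success, donor: 0
170222 21:56:46 [Warning] WSREP: 0.0 (r8-controller-2.localdomain): State transfer to 2.0 (r8-controller-1.localdomain) failed: -255 (Unknown error 255)
170222 21:56:46 [ERROR] WSREP: gcs/src/gcs_group.c:gcs_group_handle_join_msg():723: Will never receive state. Need to abort.
EOF

# IST was refused (zeroed local UUID) and the full SST then failed, so
# surface the transfer failure and the fatal join error.
hits=$(grep -E 'State transfer .* failed|Will never receive state' "$log")
echo "$hits"
rm -f "$log"
```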

Comment 27 Michele Baldessari 2017-02-23 21:18:35 UTC
So, from a first cursory look, I suspect that this and https://bugzilla.redhat.com/show_bug.cgi?id=1426253 are at least partially related, in the sense that in the last uploaded sosreport I also see:
$ grep -ir bind etc/my.cnf.d/server.cnf
bind-address = 127.0.0.1

This effectively makes galera not listen on the cluster address, and gives us this failure. The missing password might be okay, because it would have been populated a bit later in the puppet convergence run.
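
A loopback bind-address like the one grepped above is easy to detect mechanically. A sketch; the config path and sample contents are created locally for illustration (on a controller the file is etc/my.cnf.d/server.cnf in the sosreport, /etc/my.cnf.d/server.cnf live):

```shell
#!/bin/sh
# Sketch: warn when galera's bind-address is loopback, which prevents
# other cluster members from reaching it for replication/SST.
cnf=$(mktemp)
cat > "$cnf" <<'EOF'
[mysqld]
bind-address = 127.0.0.1
EOF

addr=$(sed -n 's/^bind-address *= *//p' "$cnf")
case "$addr" in
    127.0.0.1|localhost|::1)
        echo "BAD: galera bound to loopback ($addr); peers cannot connect"
        ;;
    *)
        echo "OK: bind-address = $addr"
        ;;
esac
rm -f "$cnf"
```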

Comment 28 Michele Baldessari 2017-02-23 22:10:39 UTC
Randy, could you upload the full template directory that you are using (after all the additional patches have been applied)? Thanks

Comment 29 Randy Perryman 2017-02-23 22:31:05 UTC
The templates are from openstack-tripleo-heat-templates as of yesterday.

Comment 30 Michele Baldessari 2017-02-23 22:46:26 UTC
Randy, did you forget to attach them? I am just not 100% sure what the starting point here is; the theory is that some older versions of the three patches were applied. Thanks, Michele

Comment 31 Randy Perryman 2017-02-23 22:50:24 UTC
It looks like one of the patches may not have been properly applied. We are going to validate that they are all applied correctly, then rerun the upgrade.

Thank You for the information.

Comment 32 Michele Baldessari 2017-02-23 22:57:28 UTC
Thanks, Randy. Just so we are all 100% sure the needed changes are described here: https://bugzilla.redhat.com/show_bug.cgi?id=1413686#c33


So to have this working you need to apply those patches. Assuming the templates are in /usr/share/openstack-tripleo-heat-templates, the necessary commands are:

curl https://review.openstack.org/changes/408669/revisions/current/patch?download | \
    base64 -d | \
    sudo patch -d /usr/share/openstack-tripleo-heat-templates -p1

curl https://review.openstack.org/changes/422837/revisions/current/patch?download | \
    base64 -d | \
    sudo patch -d /usr/share/openstack-tripleo-heat-templates -p1

curl https://review.openstack.org/changes/428093/revisions/current/patch?download | \
    base64 -d | \
    sudo patch -d /usr/share/openstack-tripleo-heat-templates -p1

The reviews are merged upstream and won't change anymore.
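
The three commands differ only in the change number, so they can be wrapped in a small loop with error checking. A sketch; the DRY_RUN guard is an addition of this sketch (so the snippet does not touch a live system by default), while the change IDs and the patch pipeline are the ones given above:

```shell
#!/bin/sh
# Sketch: apply the three upstream reviews to the templates directory,
# stopping at the first failure. DRY_RUN=1 (the default here) only
# prints the commands instead of running them.
THT=${THT:-/usr/share/openstack-tripleo-heat-templates}
DRY_RUN=${DRY_RUN:-1}

for change in 408669 422837 428093; do
    url="https://review.openstack.org/changes/$change/revisions/current/patch?download"
    if [ "$DRY_RUN" = 1 ]; then
        echo "curl $url | base64 -d | sudo patch -d $THT -p1"
    else
        curl -s "$url" | base64 -d | sudo patch -d "$THT" -p1 || {
            echo "patch $change failed" >&2
            exit 1
        }
    fi
done
```

Stopping at the first failed patch matters here: as comment 33 shows, a silently missing third patch is exactly what kept the upgrade failing.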

Sofer and I are in Atlanta until tomorrow, so we are in your timezone. Let us know if there are any questions.

Comment 33 Audra Cooper 2017-02-24 18:01:12 UTC
(In reply to Michele Baldessari from comment #32)
> Thanks, Randy. Just so we are all 100% sure the needed changes are described
> here: https://bugzilla.redhat.com/show_bug.cgi?id=1413686#c33
> [...]

We didn't have the last one applied. After applying all 3, the upgrade is now successful.

Comment 34 Michele Baldessari 2017-02-24 18:04:21 UTC
Thanks Audra, I am closing this as a duplicate of 1413686.

*** This bug has been marked as a duplicate of bug 1413686 ***
