Bug 1306510 - rhel-osp-director: update from 7.2->7.3 stuck on nova-scheduler restart in ControllerOvercloudServicesDeployment Step4
Status: CLOSED NOTABUG
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: x86_64 Linux
Priority: urgent  Severity: urgent
Target Milestone: y3
Target Release: 7.0 (Kilo)
Assigned To: Emilien Macchi
QA Contact: yeylon@redhat.com
Depends On:
Blocks:

Reported: 2016-02-11 00:27 EST by Omri Hochman
Modified: 2016-04-18 03:01 EDT (History)
17 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-15 23:10:04 EST
Type: Bug


Attachments (Terms of Use)
/var/log/messages from controller-0 (293.60 KB, application/x-gzip), 2016-02-11 09:28 EST, Zane Bitter
resource-list (79.36 KB, text/plain), 2016-02-13 18:05 EST, Omri Hochman
yamls (30.00 KB, application/x-tar), 2016-02-14 09:14 EST, Omri Hochman

Description Omri Hochman 2016-02-11 00:27:10 EST
rhel-osp-director: update from 7.2->7.3 failed over DBAPIError  (on: ControllerOvercloudServicesDeployment Step4)

Environment:
------------
instack-undercloud-2.1.2-39.el7ost.noarch
openstack-puppet-modules-2015.1.8-49.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-117.el7ost.noarch
openstack-heat-templates-0-0.8.20150605git.el7ost.noarch
python-heatclient-0.6.0-1.el7ost.noarch
openstack-heat-api-cfn-2015.1.2-8.el7ost.noarch
openstack-heat-engine-2015.1.2-8.el7ost.noarch
openstack-heat-api-2015.1.2-8.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.1-5.el7ost.noarch
puppet-3.6.2-2.el7.noarch

Scenario:
---------
(1) Deploy the overcloud using OSPd 7.2 GA
(2) Attempt to update to 7.3

Results:
--------
The overcloud update failed.


[stack@undercloud72 ~]$ heat resource-list -n 5 overcloud | grep -v COMPLETE
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+-----------------------------------------------+
| resource_name                                 | physical_resource_id                          | resource_type                                     | resource_status | updated_time         | parent_resource                               |
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+-----------------------------------------------+
| ControllerNodesPostDeployment                 | 1ece9e2b-6fe1-46c1-bd2d-fae30fb59246          | OS::TripleO::ControllerPostDeployment             | UPDATE_FAILED   | 2016-02-10T21:28:03Z |                                               |
| ControllerOvercloudServicesDeployment_Step4   | 39b2a9c3-7804-4e4c-9137-99474876543d          | OS::Heat::StructuredDeployments                   | UPDATE_FAILED   | 2016-02-10T21:31:54Z | ControllerNodesPostDeployment                 |
| 1                                             | f51be98f-2b7d-403a-9ced-cc3bab757eb5          | OS::Heat::StructuredDeployment                    | UPDATE_FAILED   | 2016-02-10T21:31:57Z | ControllerOvercloudServicesDeployment_Step4   |
| 2                                             | a17b8422-2ecf-4d82-96ca-29c8e825f38b          | OS::Heat::StructuredDeployment                    | UPDATE_FAILED   | 2016-02-10T21:31:58Z | ControllerOvercloudServicesDeployment_Step4   |
| 0                                             | d2db29b9-a2d6-464d-b72b-0b642c373b0f          | OS::Heat::StructuredDeployment                    | UPDATE_FAILED   | 2016-02-10T21:31:59Z | ControllerOvercloudServicesDeployment_Step4   |
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+-----------------------------------------------+


heat.log (on the director):
---------------------------
2016-02-10 14:55:35.918 3803 ERROR oslo_db.sqlalchemy.exc_filters [-] DBAPIError exception wrapped from (_mysql_exceptions.DataError) (1406, "Data too long for column 'resource_properties' at row 1") [SQL: u'INSERT INTO event (created_at, updated_at, stack_id, uuid, resource_action, resource_status, resource_name, physical_resource_id, resource_status_reason, resource_type, resource_properties) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'] [parameters: (datetime.datetime(2016, 2, 10, 19, 55, 35, 911623), None, 'f8244e8f-46b7-4118-ae64-f5c97afc099d', 'ce439229-37d7-4626-bd75-88b78f73ba96', 'CREATE', 'IN_PROGRESS', 'ControllerPuppetConfigImpl', None, 'state changed', 'OS::Heat::SoftwareConfig', '\x80\x02}q\x01(U\x06inputsq\x02NU\x06configq\x03X\x7f\x0c\x01\x00# Copyright 2015 Red Hat, Inc.\n# All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the "License"); you may\n# not use this file except in compliance with the License. You may obtain\n# a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT\n# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
See the\n# License for the specific language governing permissions and limitations\n# under the License.\n\nPcmk_resource <| |> {\n  tries     => 10,\n  try_sleep => 3,\n}\n\ninclude tripleo::packages\n\nif $::hostname == downcase(hiera(\'bootstrap_nodeid\')) {\n  $pacemaker_master = true\n  $sync_db = true\n} else {\n  $pacemaker_master = false\n  $sync_db = false\n}\n\n$enable_load_balancer = hiera(\'enable_load_balancer\', true)\n\n# When to start and enable services which haven\'t been Pacemakerized\n# FIXME: remove when we start all OpenStack services using Pacemaker\n# (occurences of this variable will be gradually replaced with false)\n$non_pcmk_start = hiera(\'step\') >= 4\n\nif hiera(\'step\') >= 1 {\n\n  create_resources
....
...
....
....

2016-02-10 16:27:20.697 23757 ERROR oslo_db.api [req-f266b7c3-8fff-4249-b062-37e4d53710c2 01ac795bbf3d4a1997ab98fbd546ccab 896498ead29b441aaada8dca280df4fc] D
B error.
2016-02-10 16:27:20.697 23757 TRACE oslo_db.api Traceback (most recent call last):
2016-02-10 16:27:20.697 23757 TRACE oslo_db.api   File "/usr/lib/python2.7/site-packages/oslo_db/api.py", line 131, in wrapper
2016-02-10 16:27:20.697 23757 TRACE oslo_db.api     return f(*args, **kwargs)
2016-02-10 16:27:20.697 23757 TRACE oslo_db.api   File "/usr/lib/python2.7/site-packages/heat/engine/service_software_config.py", line 89, in _push_metadata_software_deployments
2016-02-10 16:27:20.697 23757 TRACE oslo_db.api     exception.DeploymentConcurrentTransaction(server=server_id))
2016-02-10 16:27:20.697 23757 TRACE oslo_db.api RetryRequest
2016-02-10 16:27:20.697 23757 TRACE oslo_db.api 
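For context on the DataError at the start of this excerpt: Heat pickles the full resource properties into the event table, and MySQL raises error 1406 when the resulting blob exceeds the column size. A minimal illustration of the failing condition (a hypothetical sketch, not Heat code; the 64 KB TEXT limit is an assumption about the column type):

```python
import pickle

# Hypothetical sketch of the failing condition, not Heat code.
# Assumption: the event table's resource_properties column is TEXT-sized,
# so pickled blobs over 65535 bytes trigger MySQL error 1406
# ("Data too long for column").

MAX_TEXT_BYTES = 65535

def fits_in_text_column(obj):
    """Would this object's pickle fit in a TEXT-sized column?"""
    return len(pickle.dumps(obj)) <= MAX_TEXT_BYTES

print(fits_in_text_column({"inputs": None}))          # True: tiny payload
print(fits_in_text_column({"config": "#" * 100000}))  # False: too long
```

As the comments on this bug note, Heat catches this error internally, so the log line is noise rather than the actual failure.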



Controller-0:
-------------

It looks like puppet got stuck:

    1  5730  5730  5730 ?           -1 Ss       0   0:45 /usr/bin/python /usr/bin/os-collect-config
 5730 30948  5730  5730 ?           -1 S        0   0:00  \_ /usr/bin/python /usr/bin/os-refresh-config
30948 31145  5730  5730 ?           -1 S        0   0:00      \_ /bin/bash /usr/local/bin/dib-run-parts /usr/libexec/os-refresh-config/configure.d
31145 35150  5730  5730 ?           -1 S        0   0:00          \_ python /usr/libexec/os-refresh-config/configure.d/55-heat-config
35150 35155  5730  5730 ?           -1 S        0   0:00              \_ python /var/lib/heat-config/hooks/puppet
35155 35157  5730  5730 ?           -1 Sl       0   0:37                  \_ /usr/bin/ruby /usr/bin/puppet apply --detailed-exitcodes /var/lib/heat-config/heat-config-puppet/7e81b8d3-8bf8-4
35157 45130 45130 45130 ?           -1 Ss       0   0:05                      \_ /usr/bin/python /usr/bin/nova-manage db sync
35157 48325 48325 48325 ?           -1 Ss       0   0:00                      \_ /usr/bin/systemctl restart openstack-nova-scheduler



/var/log/messages shows a problem connecting to the DB:
--------------------------------------------------------

[root@overcloud-controller-0 ~]# journalctl | grep os-collect-config | grep ERROR
Feb 10 14:58:35 overcloud-controller-0.localdomain os-collect-config[5730]: haproxy/haproxy.cfg]/returns: executed successfully\u001b[0m\n\u001b[mNotice: /Stage[main]/Haproxy::Config/Concat[/etc/haproxy/haproxy.cfg]/Exec[concat_/etc/haproxy/haproxy.cfg]: Triggered 'refresh' from 41 events\u001b[0m\n\u001b[mNotice: /Stage[main]/Haproxy::Config/Concat[/etc/haproxy/haproxy.cfg]/File[/etc/haproxy/haproxy.cfg]/content: content changed '{md5}1f337186b0e1ba5ee82760cb437fb810' to '{md5}aa6d80f5d912fe10a33c377b07d7c90f'\u001b[0m\n\u001b[mNotice: /File[/etc/haproxy/haproxy.cfg]/seluser: seluser changed 'unconfined_u' to 'system_u'\u001b[0m\n\u001b[mNotice: Finished catalog run in 57.86 seconds\u001b[0m\n", "deploy_stderr": "Device \"br_ex\" does not exist.\nDevice \"br_nic2\" does not exist.\nDevice \"br_nic4\" does not exist.\nDevice \"ovs_system\" does not exist.\n\u001b[1;31mError: Could not prefetch mysql_user provider 'mysql': Execution of '/usr/bin/mysql -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)\u001b[0m\n\u001b[1;31mError: Could not prefetch mysql_database provider 'mysql': Execution of '/usr/bin/mysql -NBe show databases' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)\u001b[0m\n", "deploy_status_code": 0}
Feb 10 14:58:36 overcloud-controller-0.localdomain os-collect-config[5730]: Error: Could not prefetch mysql_user provider 'mysql': Execution of '/usr/bin/mysql -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
Feb 10 14:58:36 overcloud-controller-0.localdomain os-collect-config[5730]: Error: Could not prefetch mysql_database provider 'mysql': Execution of '/usr/bin/mysql -NBe show databases' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
Feb 10 15:01:29 overcloud-controller-0.localdomain os-collect-config[5730]: der constraint storage_vip-then-haproxy]/returns: executed successfully\u001b[0m\n\u001b[mNotice: Finished catalog run in 90.44 seconds\u001b[0m\n", "deploy_stderr": "Device \"br_ex\" does not exist.\nDevice \"br_nic2\" does not exist.\nDevice \"br_nic4\" does not exist.\nDevice \"ovs_system\" does not exist.\n\u001b[1;31mError: Could not prefetch mysql_user provider 'mysql': Execution of '/usr/bin/mysql -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)\u001b[0m\n\u001b[1;31mError: Could not prefetch mysql_database provider 'mysql': Execution of '/usr/bin/mysql -NBe show databases' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)\u001b[0m\n", "deploy_status_code": 0}
Feb 10 15:01:29 overcloud-controller-0.localdomain os-collect-config[5730]: Error: Could not prefetch mysql_user provider 'mysql': Execution of '/usr/bin/mysql -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
Feb 10 15:01:29 overcloud-controller-0.localdomain os-collect-config[5730]: Error: Could not prefetch mysql_database provider 'mysql': Execution of '/usr/bin/mysql -NBe show databases' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)


Messages:
----------
NCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)\u001b[0m\n\u001b[1;31mError: Could not prefetch mysql_database provider 'mysql': Execution of '/usr/bin/mysql -NBe show databases' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)\u001b[0m\n", "deploy_status_code": 0}



Services are down on the controller:
-------------------------------------
== Support services ==
mysqld:                                 inactive  (disabled on boot)
libvirtd:                               active
openvswitch:                            active
dbus:                                   active
rabbitmq-server:                        inactive  (disabled on boot)
memcached:                              active    (disabled on boot)
Comment 1 Zane Bitter 2016-02-11 00:31:53 EST
The Heat log message is normal. It's puppet being stuck trying to restart nova-scheduler that is the proximate cause; MySQL being down is a likely cause of that.
Comment 2 Marios Andreou 2016-02-11 06:27:09 EST
Hi Omri, as Zane points out, the 'DBAPIError' isn't the cause of the issue you're seeing; see https://bugzilla.redhat.com/show_bug.cgi?id=1261512#c19. (@Zane, this has been a source of confusion before; do we really have to live with that message?)

Having said that, there isn't enough in this BZ for me to see what actually went wrong. Ideally I'd like to see /var/log/messages from the controller. Regarding the services, can you also check the state of the pacemaker cluster (pcs status) to see what has been started?
Comment 3 Zane Bitter 2016-02-11 09:23:10 EST
(In reply to marios from comment #2)
> (@Zane this has been the source of confusion before do we really have to
> always live with that message?)

Not forever, Crag has a patch up to make the log message go away at https://review.openstack.org/#/c/259274 and he's working on a proper fix to eliminate the problem completely in Mitaka (requires a database migration though, so not backportable).
Comment 4 Zane Bitter 2016-02-11 09:28 EST
Created attachment 1123164 [details]
/var/log/messages from controller-0

Here is the log file you requested. It's from controller-0, but they were all stuck in the same place.
Comment 5 Angus Thomas 2016-02-11 10:06:47 EST
Hi Fabio,

Can you please ask someone from your team to investigate what's happening around the upgrade of galera?


Angus
Comment 6 Fabio Massimo Di Nitto 2016-02-11 10:12:29 EST
Damien / Mike, can you please take a look at this bz? it's a blocker for OSPd delivery.

thanks
Fabio
Comment 7 Michael Bayer 2016-02-11 10:24:18 EST
OK, as mentioned above, the data error is a known quirk in Heat that is internally caught. The problem is that it can't connect to the database.

So can we get database logs? A systemd start of galera would log to /var/log/mariadb/mariadb.log, and a pacemaker start to /var/log/mysqld.log. /etc/my.cnf* would be nice also.
Comment 8 Damien Ciabrini 2016-02-11 10:32:09 EST
Can we also get sosreports of the overcloud nodes, to have the pacemaker logs? We need those to understand how the galera resource agent got called during the scenario.
Comment 9 Michael Bayer 2016-02-11 11:00:51 EST
So I've been given a connection to a host that calls itself "undercloud72" and MySQL is running just fine.  However these logs refer to a machine that calls itself "overcloud-controller-0", yet it's attempting to make a local unix socket connection?  That won't work if that's not where MySQL is running.  Shouldn't that be connecting to a hostname / port ?  (note that the hostname of "localhost" will tell MySQL client to connect to a local unix socket as well).
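As an aside on that localhost quirk, the client's transport selection can be modeled roughly like this (a simplified sketch, not the actual libmysqlclient logic, which also honors --protocol and config-file settings):

```python
# Simplified model of the MySQL client's transport selection (assumption:
# not the real libmysqlclient code). A host of "localhost", or no host at
# all, selects the local unix socket; any other hostname or IP selects TCP.

def mysql_transport(host=None):
    """Return which transport a MySQL client would pick for this host."""
    if host is None or host == "localhost":
        return "unix-socket"   # e.g. /var/lib/mysql/mysql.sock
    return "tcp"

print(mysql_transport("localhost"))   # unix-socket
print(mysql_transport("127.0.0.1"))   # tcp
```

This is why a client on a node where mysqld isn't listening locally must be pointed at an IP (or pass --protocol=tcp) rather than at "localhost".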
Comment 10 Michael Bayer 2016-02-11 11:04:41 EST
OK, apparently this is not the host I'm supposed to be looking at, but still, Heat looks like it's not connecting over a host / port nevertheless.
Comment 11 Michael Bayer 2016-02-11 11:09:39 EST
OK, disregard my spam; I understand now.
Comment 12 Michael Bayer 2016-02-11 11:34:27 EST
Well, MySQL was up and running when that last error happened. Per /var/log/mysqld.log on overcloud-controller-0, galera started up from a process perspective at 15:00:32 and was fully synced by 15:00:39. So the connect error at 14:58:35 makes sense, but not the one at 15:01:29.
Comment 13 Angus Thomas 2016-02-11 15:07:44 EST
Can we reproduce this bug for additional diagnostics, and to establish that it wasn't a transitory issue in the test environment?
Comment 14 Omri Hochman 2016-02-11 18:56:28 EST
(In reply to Angus Thomas from comment #13)
> Can we reproduce this bug for additional diagnostics, and to establish that
> it wasn't a transitory issue in the test environment?

Angus, the environment is still up and running, available for dev to debug. Regarding the request for sosreports from all the overcloud nodes:


 (In reply to Damien Ciabrini from comment #8)
> Can we also get sosreport of the overcloud nodes to have pacemaker logs? We
> need that to understand how the galera resource agent got called during the
> scenario


The sosreport should be available here:
 http://rhos-release.virt.bos.redhat.com/mburns/log/bz1306510
Comment 15 Dan Yasny 2016-02-12 10:41:42 EST
Reproduced on another setup (rhos-compute-12)

Same db errors, same heat errors, services on controllers down as follows:
[root@overcloud-controller-0 ~]# systemctl |grep failed
● cloud-final.service                                                                      loaded failed     failed          Execute cloud user/final scripts
● dhcp-interface@br-ex.service                                                             loaded failed     failed          DHCP interface br/ex
● dhcp-interface@br-int.service                                                            loaded failed     failed          DHCP interface br/int
● dhcp-interface@br-tun.service                                                            loaded failed     failed          DHCP interface br/tun
● dhcp-interface@ovs-system.service                                                        loaded failed     failed          DHCP interface ovs/system
● neutron-server.service                                                                   loaded failed     failed          Cluster Controlled neutron-server
● openstack-cinder-api.service                                                             loaded failed     failed          Cluster Controlled openstack-cinder-api
● openstack-cinder-scheduler.service                                                       loaded failed     failed          Cluster Controlled openstack-cinder-scheduler

[root@overcloud-controller-0 ~]# openstack-status
== Nova services ==
openstack-nova-api:                     active    (disabled on boot)
openstack-nova-cert:                    inactive  (disabled on boot)
openstack-nova-compute:                 inactive  (disabled on boot)
openstack-nova-network:                 inactive  (disabled on boot)
openstack-nova-scheduler:               activating(disabled on boot)
openstack-nova-conductor:               active    (disabled on boot)
== Glance services ==
openstack-glance-api:                   active    (disabled on boot)
openstack-glance-registry:              active    (disabled on boot)
== Keystone service ==
openstack-keystone:                     active    (disabled on boot)
== Horizon service ==
openstack-dashboard:                    uncontactable
== neutron services ==
neutron-server:                         failed    (disabled on boot)
neutron-dhcp-agent:                     active    (disabled on boot)
neutron-l3-agent:                       active    (disabled on boot)
neutron-metadata-agent:                 active    (disabled on boot)
neutron-lbaas-agent:                    inactive  (disabled on boot)
neutron-openvswitch-agent:              active    (disabled on boot)
neutron-metering-agent:                 inactive  (disabled on boot)
== Swift services ==
openstack-swift-proxy:                  active    (disabled on boot)
openstack-swift-account:                active    (disabled on boot)
openstack-swift-container:              active    (disabled on boot)
openstack-swift-object:                 active    (disabled on boot)
== Cinder services ==
openstack-cinder-api:                   failed    (disabled on boot)
openstack-cinder-scheduler:             failed    (disabled on boot)
openstack-cinder-volume:                inactive  (disabled on boot)
openstack-cinder-backup:                inactive  (disabled on boot)
== Ceilometer services ==
openstack-ceilometer-api:               active    (disabled on boot)
openstack-ceilometer-central:           active    (disabled on boot)
openstack-ceilometer-compute:           inactive  (disabled on boot)
openstack-ceilometer-collector:         active    (disabled on boot)
openstack-ceilometer-alarm-notifier:    active    (disabled on boot)
openstack-ceilometer-alarm-evaluator:   active    (disabled on boot)
openstack-ceilometer-notification:      active    (disabled on boot)
== Heat services ==
openstack-heat-api:                     active    (disabled on boot)
openstack-heat-api-cfn:                 active    (disabled on boot)
openstack-heat-api-cloudwatch:          active    (disabled on boot)
openstack-heat-engine:                  active    (disabled on boot)
== Support services ==
mysqld:                                 inactive  (disabled on boot)
libvirtd:                               active
openvswitch:                            active
dbus:                                   active
rabbitmq-server:                        inactive  (disabled on boot)
memcached:                              active    (disabled on boot)
Comment 16 Zane Bitter 2016-02-12 10:56:07 EST
nova-scheduler log is full of stuff like:

2016-02-12 10:50:04.505 48403 WARNING oslo_db.sqlalchemy.session [req-635b84b2-bd1b-40f2-a06c-9925d2ace206 - - - - -] SQL connection failed. -11910 attempts left.

It started at -1 attempts left and has counted down from there. Clearly there's a bug in oslo_db where it's checking the equivalent of "while --retry_attempts != 0" when it should check for <= 0. That may be the whole problem, but it seems surprising that none of these attempts to connect to the DB is succeeding if the DB is indeed up.
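The suspected off-by-one can be sketched like this (a hypothetical reconstruction of the pattern described above, not the actual oslo_db source):

```python
# Hypothetical reconstruction of the suspected bug: the retry counter is
# decremented and compared for exact equality with zero, so a counter that
# starts at or below zero skips past it and counts down forever; this
# matches the "-1, -2, ... -11910 attempts left" log lines above.

def attempts_logged(retry_attempts, max_iterations=5):
    """Simulate the retry loop; return the counter values it would log."""
    logged = []
    remaining = retry_attempts
    for _ in range(max_iterations):
        remaining -= 1
        if remaining == 0:        # buggy: should be `remaining <= 0`
            break
        logged.append(remaining)  # "SQL connection failed. %d attempts left."
    return logged

print(attempts_logged(3))   # [2, 1]: terminates normally
print(attempts_logged(0))   # [-1, -2, -3, -4, -5]: never reaches 0
```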
Comment 17 Damien Ciabrini 2016-02-12 11:04:32 EST
I've checked the sosreport and the recurrence of the same bug on two other deployments today.

  . galera is running in all deployments, on all controllers
  . when connecting locally on a unix socket (ctrl0), I can see that the galera cluster is Synced and all 3 nodes are up
  . pacemaker sees the three galera nodes as "Master" as well

So clearly the DB is there and responding.

Concerning the "HY000" logs, we see the same pattern on all deployments: a first attempt to connect while galera wasn't started yet.
What's interesting is that these logs happen at the very beginning of the deployment, i.e. probably well before the update started.
Comment 18 Zane Bitter 2016-02-12 11:34:41 EST
Here's a snippet from the end of the useful part of the nova-scheduler log:

2016-02-10 16:21:16.417 34585 INFO oslo_messaging._drivers.impl_rabbit [req-c22aaae4-9478-46a0-8beb-7b37ef5ab975 - - - - -] Connected to AMQP server on 10.19.104.16:5672
2016-02-10 16:24:49.003 34585 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 10.19.104.16:5672 is unreachable: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'. Trying again in 1 seconds.
2016-02-10 16:24:50.029 34585 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 10.19.104.16:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2016-02-10 16:24:51.081 34585 INFO oslo_messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 10.19.104.13:5672
2016-02-10 16:50:35.591 34585 ERROR oslo_messaging._drivers.impl_rabbit [-] Failed to consume message from queue: 
2016-02-10 16:50:39.935 48403 WARNING oslo_db.sqlalchemy.session [req-635b84b2-bd1b-40f2-a06c-9925d2ace206 - - - - -] SQL connection failed. -1 attempts left.
2016-02-10 16:50:52.951 48403 WARNING oslo_db.sqlalchemy.session [req-635b84b2-bd1b-40f2-a06c-9925d2ace206 - - - - -] SQL connection failed. -2 attempts left.

(it continues like this forever)

Here's the "journalctl -u openstack-nova-scheduler" output in its entirety:

Feb 10 15:08:21 overcloud-controller-0.localdomain systemd[1]: Starting Cluster Controlled openstack-nova-scheduler...
Feb 10 15:08:24 overcloud-controller-0.localdomain systemd[1]: Started Cluster Controlled openstack-nova-scheduler.
Feb 10 15:08:24 overcloud-controller-0.localdomain nova-scheduler[22087]: /usr/lib64/python2.7/site-packages/sqlalchemy/sql/default_comparator.py:153: SAW
Feb 10 15:08:24 overcloud-controller-0.localdomain nova-scheduler[22087]: 'strategies for improved performance.' % expr)
Feb 10 15:19:35 overcloud-controller-0.localdomain systemd[1]: Stopping Cluster Controlled openstack-nova-scheduler...
Feb 10 15:19:39 overcloud-controller-0.localdomain systemd[1]: Stopped Cluster Controlled openstack-nova-scheduler.
Feb 10 15:23:20 overcloud-controller-0.localdomain systemd[1]: Starting Cluster Controlled openstack-nova-scheduler...
Feb 10 15:23:22 overcloud-controller-0.localdomain systemd[1]: Started Cluster Controlled openstack-nova-scheduler.
Feb 10 16:19:38 overcloud-controller-0.localdomain systemd[1]: Stopping OpenStack Nova Scheduler Server...
Feb 10 16:19:42 overcloud-controller-0.localdomain systemd[1]: Stopped OpenStack Nova Scheduler Server.
Feb 10 16:21:14 overcloud-controller-0.localdomain systemd[1]: Starting Cluster Controlled openstack-nova-scheduler...
Feb 10 16:21:16 overcloud-controller-0.localdomain systemd[1]: Started Cluster Controlled openstack-nova-scheduler.
Feb 10 16:50:27 overcloud-controller-0.localdomain systemd[1]: Stopping Cluster Controlled openstack-nova-scheduler...
Feb 10 16:50:35 overcloud-controller-0.localdomain systemd[1]: Starting Cluster Controlled openstack-nova-scheduler...

And the interesting parts of "ps axjf":

 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
...
    1  5730  5730  5730 ?           -1 Ss       0   0:45 /usr/bin/python /usr/bin/os-collect-config
 5730 30948  5730  5730 ?           -1 S        0   0:00  \_ /usr/bin/python /usr/bin/os-refresh-config
30948 31145  5730  5730 ?           -1 S        0   0:00      \_ /bin/bash /usr/local/bin/dib-run-parts /usr/libexec/os-refresh-config/configure.d
31145 35150  5730  5730 ?           -1 S        0   0:00          \_ python /usr/libexec/os-refresh-config/configure.d/55-heat-config
35150 35155  5730  5730 ?           -1 S        0   0:00              \_ python /var/lib/heat-config/hooks/puppet
35155 35157  5730  5730 ?           -1 Sl       0   0:37                  \_ /usr/bin/ruby /usr/bin/puppet apply --detailed-exitcodes /var/lib/heat-config
/heat-config-puppet/7e81b8d3-8bf8-4ee9-8480-c77d6e1ca24a.pp
35157 45130 45130 45130 ?           -1 Ss       0   0:23                      \_ /usr/bin/python /usr/bin/nova-manage db sync
35157 48325 48325 48325 ?           -1 Ss       0   0:02                      \_ /usr/bin/systemctl restart openstack-nova-scheduler
...
    1 48403 48403 48403 ?           -1 Ss     162   0:22 /usr/bin/python /usr/bin/nova-scheduler

And the systemd status:

● openstack-nova-scheduler.service - Cluster Controlled openstack-nova-scheduler
   Loaded: loaded (/usr/lib/systemd/system/openstack-nova-scheduler.service; disabled; vendor preset: disabled)
  Drop-In: /run/systemd/system/openstack-nova-scheduler.service.d
           └─50-pacemaker.conf
   Active: activating (start) since Wed 2016-02-10 16:50:35 EST; 1 day 18h ago
 Main PID: 48403 (nova-scheduler)
   CGroup: /system.slice/openstack-nova-scheduler.service
           └─48403 /usr/bin/python /usr/bin/nova-scheduler

So although we never saw "Stopped Cluster Controlled openstack-nova-scheduler" in the journal, it looks very much like the newly-started nova-scheduler (pid 48403), launched at 2016-02-10 16:50:35, is the one that cannot connect to the DB. It will never give up and fail, since the retry counter will never count down from -Infinity to 0; but nor will it ever work unless the "SQL connection failed" errors stop.
Comment 19 Zane Bitter 2016-02-12 11:50:56 EST
Is the pcs output about the state of Galera still accurate even when pcs is in maintenance mode? Because "pcs status" is also showing me:

 Clone Set: openstack-cinder-api-clone [openstack-cinder-api] (unmanaged)
     openstack-cinder-api	(systemd:openstack-cinder-api):	Started overcloud-controller-0 (unmanaged)
     openstack-cinder-api	(systemd:openstack-cinder-api):	Started overcloud-controller-1 (unmanaged)
     openstack-cinder-api	(systemd:openstack-cinder-api):	Started overcloud-controller-2 (unmanaged)

yet:

● openstack-cinder-api.service - Cluster Controlled openstack-cinder-api
   Loaded: loaded (/usr/lib/systemd/system/openstack-cinder-api.service; disabled; vendor preset: disabled)
  Drop-In: /run/systemd/system/openstack-cinder-api.service.d
           └─50-pacemaker.conf
   Active: failed (Result: exit-code) since Wed 2016-02-10 16:38:25 EST; 1 day 19h ago
 Main PID: 38791 (code=exited, status=1/FAILURE)

(presumably this failed as a consequence of nova-scheduler not coming up)

systemd does report that mysqld is running at least though (managed by pacemaker):

● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2016-02-10 16:20:24 EST; 1 day 19h ago
 Main PID: 29614 (pacemakerd)
   CGroup: /system.slice/pacemaker.service
           ├─29614 /usr/sbin/pacemakerd -f
           ├─29621 /usr/libexec/pacemaker/cib
           ├─29622 /usr/libexec/pacemaker/stonithd
           ├─29623 /usr/libexec/pacemaker/lrmd
           ├─29624 /usr/libexec/pacemaker/attrd
           ├─29625 /usr/libexec/pacemaker/pengine
           ├─29626 /usr/libexec/pacemaker/crmd
           ├─30829 sh -c /usr/sbin/rabbitmq-server > /var/log/rabbitmq/startup_log 2> /var/log/rabbitmq/startup_err
           ├─30835 /bin/sh /usr/sbin/rabbitmq-server
           ├─32091 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --d...
           └─33020 /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=...
Comment 20 Zane Bitter 2016-02-12 11:56:22 EST
I tried killing nova-scheduler manually and then starting it myself. That unblocked the rest of puppet, of course, but the attempt to start nova-scheduler again had the same result, just starting back at a retry count of -1.
Comment 21 Damien Ciabrini 2016-02-12 12:10:52 EST
(In reply to Zane Bitter from comment #19)

> systemd does report that mysqld is running at least though (managed by
> pacemaker):
> 

It's up and running; plus, haproxy seems to have no problem proxying mysqld through the VIP on controller-0.

[root@overcloud-controller-0 nova]# systemctl status haproxy
● haproxy.service - Cluster Controlled haproxy
   Loaded: loaded (/usr/lib/systemd/system/haproxy.service; disabled; vendor preset: disabled)
  Drop-In: /run/systemd/system/haproxy.service.d
           └─50-pacemaker.conf
   Active: active (running) since Thu 2016-02-11 14:31:36 EST; 21h ago
 Main PID: 18088 (haproxy-systemd)
   CGroup: /system.slice/haproxy.service
           ├─18088 /usr/sbin/haproxy-systemd-wrapper -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
           ├─18091 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds
           └─18095 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds

Feb 12 11:39:22 overcloud-controller-0.localdomain haproxy[18095]: Connect from 192.168.100.13:52111 to 192.168.100.11:3306 (mysql/TCP)
Feb 12 11:39:22 overcloud-controller-0.localdomain haproxy[18095]: Connect from 192.168.100.13:52112 to 192.168.100.11:3306 (mysql/TCP)
Feb 12 11:39:44 overcloud-controller-0.localdomain haproxy[18095]: Connect from 192.168.100.13:52695 to 192.168.100.11:3306 (mysql/TCP)
Feb 12 11:39:44 overcloud-controller-0.localdomain haproxy[18095]: Connect from 192.168.100.13:52696 to 192.168.100.11:3306 (mysql/TCP)
Feb 12 11:40:22 overcloud-controller-0.localdomain haproxy[18095]: Connect from 192.168.100.13:53703 to 192.168.100.11:3306 (mysql/TCP)
Feb 12 11:40:22 overcloud-controller-0.localdomain haproxy[18095]: Connect from 192.168.100.13:53704 to 192.168.100.11:3306 (mysql/TCP)
Feb 12 11:40:22 overcloud-controller-0.localdomain haproxy[18095]: Connect from 192.168.100.13:53705 to 192.168.100.11:3306 (mysql/TCP)
Feb 12 11:40:22 overcloud-controller-0.localdomain haproxy[18095]: Connect from 192.168.100.13:53706 to 192.168.100.11:3306 (mysql/TCP)
Feb 12 11:44:51 overcloud-controller-0.localdomain haproxy[18095]: Connect from 192.168.100.14:60191 to 192.168.100.11:3306 (mysql/TCP)
Feb 12 12:01:41 overcloud-controller-0.localdomain haproxy[18095]: Connect from 192.168.100.13:59317 to 192.168.100.11:3306 (mysql/TCP)

[root@overcloud-controller-0 nova]# netstat -tnlp | grep 3306
tcp        0      0 192.168.100.12:3306     0.0.0.0:*               LISTEN      19989/mysqld
tcp        0      0 192.168.100.11:3306     0.0.0.0:*               LISTEN      18095/haproxy
Comment 22 Zane Bitter 2016-02-12 13:19:52 EST
With help from Dan Smith I discovered that _none_ of the services can connect to the DB (nova-scheduler happens to be the only one that will block systemd as a result). Going through some other log files until we find a more useful one reveals this (from glance-api):

2016-02-11 18:09:41.275 40856 WARNING oslo_db.sqlalchemy.session [-] SQL connection failed. 10 attempts left.
2016-02-11 18:09:53.667 40856 WARNING oslo_db.sqlalchemy.session [-] SQL connection failed. 9 attempts left.
2016-02-11 18:10:06.685 40856 WARNING oslo_db.sqlalchemy.session [-] SQL connection failed. 8 attempts left.
2016-02-11 18:10:19.698 40856 WARNING oslo_db.sqlalchemy.session [-] SQL connection failed. 7 attempts left.
2016-02-11 18:10:32.708 40856 WARNING oslo_db.sqlalchemy.session [-] SQL connection failed. 6 attempts left.
2016-02-11 18:10:45.716 40856 WARNING oslo_db.sqlalchemy.session [-] SQL connection failed. 5 attempts left.
2016-02-11 18:10:58.730 40856 WARNING oslo_db.sqlalchemy.session [-] SQL connection failed. 4 attempts left.
2016-02-11 18:11:11.737 40856 WARNING oslo_db.sqlalchemy.session [-] SQL connection failed. 3 attempts left.
2016-02-11 18:11:24.754 40856 WARNING oslo_db.sqlalchemy.session [-] SQL connection failed. 2 attempts left.
2016-02-11 18:11:37.770 40856 WARNING oslo_db.sqlalchemy.session [-] SQL connection failed. 1 attempts left.
2016-02-11 18:11:47.782 40856 CRITICAL glance [-] DBConnectionError: (_mysql_exceptions.OperationalError) (2003, "Can't connect to MySQL server on '10.19.104.17' (113)")

Predictably, that appears to be either the wrong address or a broken VIP:

# ping 10.19.104.17
PING 10.19.104.17 (10.19.104.17) 56(84) bytes of data.
From 10.19.104.15 icmp_seq=1 Destination Host Unreachable
...


# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

# HEAT_HOSTS_START - Do not edit manually within this section!
10.19.104.14 overcloud-compute-0.localdomain overcloud-compute-0
10.19.104.12 overcloud-compute-1.localdomain overcloud-compute-1
10.19.104.15 overcloud-controller-0.localdomain overcloud-controller-0
10.19.104.13 overcloud-controller-1.localdomain overcloud-controller-1
10.19.104.16 overcloud-controller-2.localdomain overcloud-controller-2


10.19.105.14 overcloud-cephstorage-0.localdomain overcloud-cephstorage-0
# HEAT_HOSTS_END

Same deal on all three controllers.
Comment 23 Zane Bitter 2016-02-12 13:24:48 EST
Definitely a broken VIP (from pcs status):

 ip-10.19.104.17	(ocf::heartbeat:IPaddr2):	Stopped (unmanaged)
 ip-10.19.105.17	(ocf::heartbeat:IPaddr2):	Stopped (unmanaged)
 ip-10.19.184.225	(ocf::heartbeat:IPaddr2):	Stopped (unmanaged)
 ip-192.168.201.15	(ocf::heartbeat:IPaddr2):	Stopped (unmanaged)
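The stopped/unmanaged IPaddr2 resources can be pulled out of the `pcs status` output mechanically before handing them back to pcs. A minimal sketch; the input here is the fragment quoted above inlined as sample data, and on a live controller you would use the real `pcs status` output instead:

```shell
# Extract IPaddr2 resources that pcs reports as Stopped (unmanaged).
# Sample input copied from the pcs status fragment above; on a controller,
# replace the assignment with: pcs_status=$(pcs status)
pcs_status='ip-10.19.104.17 (ocf::heartbeat:IPaddr2): Stopped (unmanaged)
ip-10.19.105.17 (ocf::heartbeat:IPaddr2): Stopped (unmanaged)
ip-10.19.184.225 (ocf::heartbeat:IPaddr2): Stopped (unmanaged)
ip-192.168.201.15 (ocf::heartbeat:IPaddr2): Stopped (unmanaged)'

broken=$(printf '%s\n' "$pcs_status" | awk '/IPaddr2.*Stopped \(unmanaged\)/ {print $1}')
printf '%s\n' "$broken"

# Handing each one back to pacemaker would normally be (requires a live cluster):
#   for res in $broken; do pcs resource manage "$res" && pcs resource cleanup "$res"; done
```

Whether re-managing and cleaning up is enough (as opposed to re-creating the resources with the correct addresses) depends on how they ended up unmanaged, which is what the rest of this thread is chasing.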
Comment 24 James Slagle 2016-02-12 13:57:45 EST
omri, can you provide the initial deployment command you used to deploy 7.2. (please confirm 7.2 is what you initially deployed as well).

and then the full update command.

for both commands, please provide any custom environment files, or any other commands that may have been run between the initial deployment and the update.
Comment 25 Dan Sneddon 2016-02-12 13:59:22 EST
(In reply to Zane Bitter from comment #23)

Here are some commands that would give interesting output for troubleshooting:

ip addr
sudo netstat -anp
cat /etc/puppet/hieradata/vip_data.yaml  (IIRC the filename)
Comment 26 Dan Sneddon 2016-02-12 14:10:07 EST
(In reply to Dan Sneddon from comment #25)

> ip addr

(just the relevant interface)
8: br-nic2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether c8:1f:66:c9:4b:20 brd ff:ff:ff:ff:ff:ff
    inet 10.19.104.13/24 brd 10.19.104.255 scope global br-nic2
       valid_lft forever preferred_lft forever
    inet 10.19.104.11/32 brd 10.19.104.255 scope global br-nic2
       valid_lft forever preferred_lft forever
    inet6 2620:52:0:1368:ca1f:66ff:fec9:4b20/64 scope global mngtmpaddr dynamic 
       valid_lft 2591952sec preferred_lft 604752sec
    inet6 fe80::ca1f:66ff:fec9:4b20/64 scope link 
       valid_lft forever preferred_lft forever

> sudo netstat -anp

$ sudo netstat -anp | grep 3306 | grep LISTEN
tcp        0      0 10.19.104.13:3306       0.0.0.0:*               LISTEN      11790/mysqld        
tcp        0      0 10.19.104.11:3306       0.0.0.0:*               LISTEN      9298/haproxy

> cat /etc/puppet/hieradata/vip_data.yaml  (IIRC the filename)

ceilometer_api_vip: 10.19.104.17
cinder_api_vip: 10.19.104.17
glance_api_vip: 10.19.105.17
glance_registry_vip: 10.19.104.17
heat_api_vip: 10.19.104.17
horizon_vip: 10.19.104.17
keystone_admin_api_vip: 192.168.0.6
keystone_public_api_vip: 10.19.104.17
mysql_vip: 10.19.104.17
neutron_api_vip: 10.19.104.17
nova_api_vip: 10.19.104.17
nova_metadata_vip: 10.19.104.17
redis_vip: 10.19.104.10
swift_proxy_vip: 10.19.105.17
tripleo::loadbalancer::controller_virtual_ip: 192.168.0.6
tripleo::loadbalancer::internal_api_virtual_ip: 10.19.104.17
tripleo::loadbalancer::public_virtual_ip: 10.19.184.225
tripleo::loadbalancer::storage_mgmt_virtual_ip: 192.168.201.15
tripleo::loadbalancer::storage_virtual_ip: 10.19.105.17
tripleo::redis_notification::haproxy_monitor_ip: 192.168.0.6


So based on these commands, it appears that the VIP as far as Puppet is concerned is 10.19.104.17.

However, looking at the relevant stanza in /etc/haproxy/haproxy.cfg, haproxy is configured to be listening on 10.19.104.11:

listen mysql
  bind 10.19.104.11:3306 transparent
  option tcpka
  option httpchk
  stick on dst
  stick-table type ip size 1000
  timeout client 0
  timeout server 0
  server overcloud-controller-0 10.19.104.15:3306 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
  server overcloud-controller-1 10.19.104.13:3306 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
  server overcloud-controller-2 10.19.104.16:3306 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2


So how did these get out of sync? Would running Puppet update the haproxy config to use .17 as the VIP?
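The mismatch can be checked mechanically by comparing the VIP in the Puppet hieradata against the address haproxy actually binds for the mysql listener. A minimal sketch, with the two fragments quoted above inlined as sample files; on a controller you would read /etc/puppet/hieradata/vip_data.yaml and /etc/haproxy/haproxy.cfg directly:

```shell
# Compare Puppet's idea of the mysql VIP with haproxy's actual bind address.
# Inlined sample data copied from the outputs above.
cat > /tmp/vip_data.yaml <<'EOF'
mysql_vip: 10.19.104.17
EOF
cat > /tmp/haproxy.cfg <<'EOF'
listen mysql
  bind 10.19.104.11:3306 transparent
EOF

hiera_vip=$(awk '/^mysql_vip:/ {print $2}' /tmp/vip_data.yaml)
haproxy_vip=$(awk '/^listen mysql/{f=1; next} f && /bind/ {sub(/:.*/, "", $2); print $2; exit}' /tmp/haproxy.cfg)

if [ "$hiera_vip" = "$haproxy_vip" ]; then
  echo "OK: $hiera_vip"
else
  echo "MISMATCH: hiera=$hiera_vip haproxy=$haproxy_vip"
fi
```

With the values from this bug it prints `MISMATCH: hiera=10.19.104.17 haproxy=10.19.104.11`, which is exactly the inconsistency described above.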
Comment 27 Zane Bitter 2016-02-12 14:12:29 EST
From /var/log/messages:

Feb 10 16:33:01 localhost os-collect-config: [2016-02-10 16:33:01,400] (heat-config) [DEBUG] Running /var/lib/heat-config/hooks/puppet < /var/lib/heat-config/deployed/7e81b8d3-8bf8-4ee9-8480-c77d6e1ca24a.json
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/openstack-nova-novncproxy.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/openstack-nova-compute.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/openstack-nova-xvpvncproxy.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/openstack-nova-api.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/dhcp-interface@.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/openstack-nova-scheduler.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/openstack-nova-console.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/openstack-nova-cert.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/openstack-nova-consoleauth.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/wpa_supplicant.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/openstack-nova-metadata-api.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/openstack-nova-conductor.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/ebtables.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:30 localhost crmd[29626]:  notice: Operation ip-10.19.104.17_monitor_0: not running (node=overcloud-controller-0, call=333, rc=7, cib-update=179, confirmed=true)
Feb 10 16:33:31 localhost crmd[29626]:  notice: Operation ip-10.19.105.17_monitor_0: not running (node=overcloud-controller-0, call=337, rc=7, cib-update=180, confirmed=true)
Feb 10 16:33:32 localhost crmd[29626]:  notice: Operation ip-10.19.184.225_monitor_0: not running (node=overcloud-controller-0, call=341, rc=7, cib-update=181, confirmed=true)
Feb 10 16:33:34 localhost crmd[29626]:  notice: Operation ip-192.168.201.15_monitor_0: not running (node=overcloud-controller-0, call=345, rc=7, cib-update=182, confirmed=true)
Feb 10 16:34:44 localhost snmpd[8193]: Cannot statfs net:[4026532436]#012: No such file or directory
Feb 10 16:34:44 localhost snmpd[8193]: Cannot statfs net:[4026532436]#012: No such file or directory
Feb 10 16:34:44 localhost snmpd[8193]: Cannot statfs net:[4026532624]#012: No such file or directory
Feb 10 16:34:44 localhost snmpd[8193]: Cannot statfs net:[4026532624]#012: No such file or directory
Feb 10 16:36:14 localhost systemd: Stopping Cluster Controlled openstack-cinder-api...
Feb 10 16:36:14 localhost systemd: Started Cluster Controlled openstack-cinder-api.
Feb 10 16:36:14 localhost systemd: Starting Cluster Controlled openstack-cinder-api...

So it looks like something in Puppet is disabling the VIPs just before it begins restarting all the services.
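One way to confirm that ordering is to grep the VIP monitor results out of /var/log/messages and compare their timestamps with the service restarts. A small sketch, with two sample lines from the log above inlined in place of the real file:

```shell
# Pull the "VIP not running" monitor results out of syslog; on a controller
# you would grep /var/log/messages itself instead of this inlined sample.
messages='Feb 10 16:33:30 localhost crmd[29626]:  notice: Operation ip-10.19.104.17_monitor_0: not running (node=overcloud-controller-0, call=333, rc=7, cib-update=179, confirmed=true)
Feb 10 16:36:14 localhost systemd: Started Cluster Controlled openstack-cinder-api.'

printf '%s\n' "$messages" | grep -o 'ip-[0-9.]*_monitor_0: not running'
# -> ip-10.19.104.17_monitor_0: not running
```

In the sample above the monitor failure lands at 16:33, ahead of the 16:36 service restarts, matching the conclusion that the VIPs go down first.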
Comment 28 Michele Baldessari 2016-02-12 15:10:17 EST
From the sosreport of overcloud-controller-0 we see the following ip addresses:
$ pcs -f cib.xml resource  |grep IPaddr2    
 ip-10.19.184.221       (ocf::heartbeat:IPaddr2):       Stopped (unmanaged)
 ip-192.168.0.6 (ocf::heartbeat:IPaddr2):       Stopped (unmanaged)
 ip-10.19.105.10        (ocf::heartbeat:IPaddr2):       Stopped (unmanaged)
 ip-10.19.104.11        (ocf::heartbeat:IPaddr2):       Stopped (unmanaged)
 ip-192.168.201.10      (ocf::heartbeat:IPaddr2):       Stopped (unmanaged)
 ip-10.19.104.10        (ocf::heartbeat:IPaddr2):       Stopped (unmanaged)
 ip-10.19.104.17        (ocf::heartbeat:IPaddr2):       Stopped (unmanaged)
 ip-10.19.105.17        (ocf::heartbeat:IPaddr2):       Stopped (unmanaged)
 ip-10.19.184.225       (ocf::heartbeat:IPaddr2):       Stopped (unmanaged)
 ip-192.168.201.15      (ocf::heartbeat:IPaddr2):       Stopped (unmanaged)

From a quick look, that seems like too many IP addresses...

In terms of constraints we have:
$ pcs -f cib.xml constraint order |grep ip-
  start ip-10.19.104.11 then start haproxy-clone (kind:Optional)
  start ip-192.168.0.6 then start haproxy-clone (kind:Optional)
  start ip-192.168.201.10 then start haproxy-clone (kind:Optional)
  start ip-10.19.184.221 then start haproxy-clone (kind:Optional)
  start ip-10.19.104.10 then start haproxy-clone (kind:Optional)
  start ip-10.19.105.10 then start haproxy-clone (kind:Optional)

So my first impression is that at some point (likely after the initial
installation) the main VIP (the Galera one) was 10.19.104.11.

This is reflected in pcs (the IP resources and the constraints point to that
IP) and is also still reflected in haproxy.cfg. That IP is on controller-1
(at the time of sosreport collection).

Then, I guess, the update took place and only part of the configuration was
enforced (i.e. .104.17 was changed in most OpenStack configuration files, but
.104.11 stayed in the pcs resources and haproxy.cfg).

I'm still looking to see whether I can figure out the exact steps that took
place, but I think the request for info in comment #24 is still very relevant.
Comment 29 James Slagle 2016-02-13 10:35:57 EST
(In reply to James Slagle from comment #24)
> omri, can you provide the initial deployment command you used to deploy 7.2.
> (please confirm 7.2 is what you initially deployed as well).
> 
> and then the full update command.
> 
> for both commands, please provide any custom environment files, or any other
> commands that may have been run between the initial deployment and the
> update.

I was able to update from 7.2 (openstack-tripleo-heat-templates-0.8.6-94.el7ost.noarch) to 7.3 latest (openstack-tripleo-heat-templates-0.8.6-120.el7ost) using a virt setup (HA, net-iso, single-nic-with-vlans), which is why I'm thinking it might be something in the specific environment files or commands being used.

In addition to the previous needinfo item from comment #24, it'd also be helpful to see the output of the following before and after a failed update attempt:

neutron port-list
heat resource-list -n 10 overcloud
Comment 30 Omri Hochman 2016-02-13 18:05:05 EST
neutron port-list :

+--------------------------------------+-------------------------------+-------------------+---------------------------------------------------------------------------------------+
| id                                   | name                          | mac_address       | fixed_ips                                                                             |
+--------------------------------------+-------------------------------+-------------------+---------------------------------------------------------------------------------------+
| 05ba55ae-7202-4e53-8bcd-ebc007bea3b8 |                               | fa:16:3e:da:00:4d | {"subnet_id": "0f70ab3a-df02-46dc-a4aa-7a10e033c8ed", "ip_address": "192.168.201.12"} |
| 132ed304-0b53-44f5-9d31-50af5a38e894 | control_virtual_ip            | fa:16:3e:79:33:2d | {"subnet_id": "760323c3-4d31-415b-b3be-5c97229f7388", "ip_address": "192.168.0.6"}    |
| 2681fd72-aae5-470e-9898-17fd3531af88 |                               | 00:0a:f7:7c:29:28 | {"subnet_id": "760323c3-4d31-415b-b3be-5c97229f7388", "ip_address": "192.168.0.7"}    |
| 26d0cdb4-222d-4e76-bf77-1fd94e08a72c | storage_virtual_ip            | fa:16:3e:d2:86:1c | {"subnet_id": "cf9b259d-88db-4dd4-ae61-503d8f305de4", "ip_address": "10.19.105.17"}   |
| 35c4cb79-07de-4b86-b2c2-d291768ac8f5 |                               | fa:16:3e:e9:7c:d2 | {"subnet_id": "0f70ab3a-df02-46dc-a4aa-7a10e033c8ed", "ip_address": "192.168.201.13"} |
| 3b4783db-20aa-4d89-abd4-85516b87e5a5 |                               | fa:16:3e:5f:e1:57 | {"subnet_id": "ef82706a-3a49-4463-8438-ff834749c12c", "ip_address": "192.168.151.14"} |
| 3beb4f55-83f1-4301-85ff-1ea682b48ba2 |                               | fa:16:3e:70:64:77 | {"subnet_id": "cf9b259d-88db-4dd4-ae61-503d8f305de4", "ip_address": "10.19.105.12"}   |
| 44c80b48-e354-437d-8199-1704e0992b94 |                               | 00:0a:f7:79:93:70 | {"subnet_id": "760323c3-4d31-415b-b3be-5c97229f7388", "ip_address": "192.168.0.9"}    |
| 47a83681-5fa4-41a8-b892-6a8edcbf630a |                               | fa:16:3e:f1:24:4e | {"subnet_id": "ef82706a-3a49-4463-8438-ff834749c12c", "ip_address": "192.168.151.11"} |
| 56ca0386-8b90-4aa9-b75f-200d19dd8dbf |                               | fa:16:3e:af:0f:56 | {"subnet_id": "0f70ab3a-df02-46dc-a4aa-7a10e033c8ed", "ip_address": "192.168.201.11"} |
| 5847c202-5b57-43c4-9804-ebb5afa8fc7e |                               | 00:0a:f7:79:93:80 | {"subnet_id": "760323c3-4d31-415b-b3be-5c97229f7388", "ip_address": "192.168.0.8"}    |
| 5a72688e-dc50-4352-9fac-48b3e03ab505 |                               | fa:16:3e:8d:28:e3 | {"subnet_id": "18d5c7c6-319d-4d58-a6e7-6cd995164cf0", "ip_address": "10.19.184.222"}  |
| 657ef5f6-05dd-43d1-9b5d-fe5d612bce57 |                               | 00:0a:f7:79:93:cc | {"subnet_id": "760323c3-4d31-415b-b3be-5c97229f7388", "ip_address": "192.168.0.12"}   |
| 70da84ec-c49f-4f00-8e88-bc05671ee82f |                               | fa:16:3e:ff:e3:14 | {"subnet_id": "cf9b259d-88db-4dd4-ae61-503d8f305de4", "ip_address": "10.19.105.16"}   |
| 77413399-025e-4671-935c-999a768bb1c7 |                               | fa:16:3e:6b:82:9b | {"subnet_id": "cf9b259d-88db-4dd4-ae61-503d8f305de4", "ip_address": "10.19.105.11"}   |
| 83f2f014-2d39-4080-be73-26f9fc22e099 |                               | fa:16:3e:fd:11:13 | {"subnet_id": "2cc7544c-c704-4b1c-8ef0-f11c37ead3bd", "ip_address": "10.19.104.15"}   |
| 888ff481-afd1-459f-9099-d98231b0cf63 |                               | fa:16:3e:d5:cc:21 | {"subnet_id": "2cc7544c-c704-4b1c-8ef0-f11c37ead3bd", "ip_address": "10.19.104.14"}   |
| 98121b60-9474-4895-88e3-5c246fd5e03a |                               | fa:16:3e:dc:f0:b2 | {"subnet_id": "ef82706a-3a49-4463-8438-ff834749c12c", "ip_address": "192.168.151.10"} |
| a4d3094d-d77d-4700-a943-538402879030 |                               | fa:16:3e:cb:6c:22 | {"subnet_id": "18d5c7c6-319d-4d58-a6e7-6cd995164cf0", "ip_address": "10.19.184.224"}  |
| ac647535-c052-4c91-af35-e4edffe71aee |                               | fa:16:3e:51:03:f8 | {"subnet_id": "760323c3-4d31-415b-b3be-5c97229f7388", "ip_address": "192.168.0.5"}    |
| aef54162-b2cb-47e4-afd8-e0c1828dda0e |                               | fa:16:3e:38:93:4a | {"subnet_id": "cf9b259d-88db-4dd4-ae61-503d8f305de4", "ip_address": "10.19.105.13"}   |
| b194b378-f39b-443b-a3c2-c609a4ac2695 |                               | fa:16:3e:64:69:81 | {"subnet_id": "ef82706a-3a49-4463-8438-ff834749c12c", "ip_address": "192.168.151.13"} |
| c830ce86-abe5-4de6-9df9-4d251a6a23da |                               | fa:16:3e:c6:b5:99 | {"subnet_id": "2cc7544c-c704-4b1c-8ef0-f11c37ead3bd", "ip_address": "10.19.104.12"}   |
| cbe15c01-4cd7-4c8a-b640-196419e6157e |                               | 00:0a:f7:79:92:0a | {"subnet_id": "760323c3-4d31-415b-b3be-5c97229f7388", "ip_address": "192.168.0.10"}   |
| d30db48d-b6a7-42e6-80da-20eba1611555 |                               | fa:16:3e:7a:b8:0f | {"subnet_id": "cf9b259d-88db-4dd4-ae61-503d8f305de4", "ip_address": "10.19.105.14"}   |
| d4ce33c3-b1b8-4152-b617-d3904bd9e4a8 |                               | fa:16:3e:e4:71:f1 | {"subnet_id": "ef82706a-3a49-4463-8438-ff834749c12c", "ip_address": "192.168.151.12"} |
| d7e6440c-5f96-4190-a6b6-e1fa0b4b8add | internal_api_virtual_ip       | fa:16:3e:62:54:a7 | {"subnet_id": "2cc7544c-c704-4b1c-8ef0-f11c37ead3bd", "ip_address": "10.19.104.17"}   |
| d958bfd5-8924-4323-be82-d40fb639cb67 |                               | fa:16:3e:3b:e2:c6 | {"subnet_id": "2cc7544c-c704-4b1c-8ef0-f11c37ead3bd", "ip_address": "10.19.104.16"}   |
| e2e7264c-b0e0-457c-941c-c6668eb8c7ff |                               | fa:16:3e:0d:40:ca | {"subnet_id": "cf9b259d-88db-4dd4-ae61-503d8f305de4", "ip_address": "10.19.105.15"}   |
| e82661ec-a76e-4af7-88e4-5f56b0acb81f |                               | fa:16:3e:90:14:e5 | {"subnet_id": "18d5c7c6-319d-4d58-a6e7-6cd995164cf0", "ip_address": "10.19.184.223"}  |
| e992b763-bac3-4494-947e-1f4ed59672db |                               | 00:0a:f7:79:93:de | {"subnet_id": "760323c3-4d31-415b-b3be-5c97229f7388", "ip_address": "192.168.0.11"}   |
| ee9fc026-7682-4cca-ad5c-3a669e12a403 | public_virtual_ip             | fa:16:3e:2d:eb:e6 | {"subnet_id": "18d5c7c6-319d-4d58-a6e7-6cd995164cf0", "ip_address": "10.19.184.225"}  |
| eedbd05e-8570-43f5-89e4-da0068e1b520 | redis_virtual_ip              | fa:16:3e:0d:46:59 | {"subnet_id": "2cc7544c-c704-4b1c-8ef0-f11c37ead3bd", "ip_address": "10.19.104.10"}   |
| f364aba6-18dd-4e04-909e-eee825c9e439 | storage_management_virtual_ip | fa:16:3e:b9:d1:7d | {"subnet_id": "0f70ab3a-df02-46dc-a4aa-7a10e033c8ed", "ip_address": "192.168.201.15"} |
| f8991864-a8bf-444f-aa17-45c8823325a7 |                               | fa:16:3e:e0:9c:d8 | {"subnet_id": "0f70ab3a-df02-46dc-a4aa-7a10e033c8ed", "ip_address": "192.168.201.14"} |
| fd4c9b44-0a65-409e-abc3-928ec76e7039 |                               | fa:16:3e:f5:49:ea | {"subnet_id": "2cc7544c-c704-4b1c-8ef0-f11c37ead3bd", "ip_address": "10.19.104.13"}   |
+--------------------------------------+-------------------------------+-------------------+---------------------------------------------------------------------------------------+




heat resource-list -n 10 overcloud (attached file)
Comment 31 Omri Hochman 2016-02-13 18:05 EST
Created attachment 1126881 [details]
resource-list
Comment 32 James Slagle 2016-02-13 19:49:09 EST
(In reply to James Slagle from comment #24)
> omri, can you provide the initial deployment command you used to deploy 7.2.
> (please confirm 7.2 is what you initially deployed as well).
> 
> and then the full update command.
> 
> for both commands, please provide any custom environment files, or any other
> commands that may have been run between the initial deployment and the
> update.

omri, we still need the above info as well.
Comment 33 Omri Hochman 2016-02-14 09:14:25 EST
(In reply to James Slagle from comment #32)
> (In reply to James Slagle from comment #24)
> > omri, can you provide the initial deployment command you used to deploy 7.2.
> > (please confirm 7.2 is what you initially deployed as well).
> > 
> > and then the full update command.
> > 
> > for both commands, please provide any custom environment files, or any other
> > commands that may have been run between the initial deployment and the
> > update.
> 
> omri, we still need the above info as well.

Confirming versions 7.2 -> 7.3 (from /var/log/yum.log):

Feb 10 12:41:27 Installed: instack-undercloud-2.1.2-36.el7ost.noarch
Feb 10 16:02:52 Updated: instack-undercloud-2.1.2-39.el7ost.noarch
------------------------------------------------------------------------

deployment-command:
 openstack overcloud deploy --templates --control-scale 3 --compute-scale 2 --ceph-storage-scale 1   --neutron-network-type vxlan --neutron-tunnel-types vxlan  --ntp-server 10.5.26.10 --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml

------------------------------------------------------------------------
update-command:
yes "" | openstack overcloud update stack overcloud -i --templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml  -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/updates/update-from-vip.yaml -e network-environment.yaml 


Attaching the relevant .yamls.
Comment 34 Omri Hochman 2016-02-14 09:14 EST
Created attachment 1127041 [details]
yamls
Comment 35 James Slagle 2016-02-15 07:14:41 EST
(In reply to Omri Hochman from comment #33)
> (In reply to James Slagle from comment #32)
> > (In reply to James Slagle from comment #24)
> > > omri, can you provide the initial deployment command you used to deploy 7.2.
> > > (please confirm 7.2 is what you initially deployed as well).
> > > 
> > > and then the full update command.
> > > 
> > > for both commands, please provide any custom environment files, or any other
> > > commands that may have been run between the initial deployment and the
> > > update.
> > 
> > omri, we still need the above info as well.
> 
> confirm-versions 7.2 -> 7.3 : /var/log/yum.log: 
> 
> Feb 10 12:41:27 Installed: instack-undercloud-2.1.2-36.el7ost.noarch
> Feb 10 16:02:52 Updated: instack-undercloud-2.1.2-39.el7ost.noarch
> ------------------------------------------------------------------------
> 
> deployment-command:
>  openstack overcloud deploy --templates --control-scale 3 --compute-scale 2
> --ceph-storage-scale 1   --neutron-network-type vxlan --neutron-tunnel-types
> vxlan  --ntp-server 10.5.26.10 --timeout 90 -e
> /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.
> yaml -e
> /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.
> yaml -e network-environment.yaml
> 
> ------------------------------------------------------------------------
> update-command:
> yes "" | openstack overcloud update stack overcloud -i --templates -e
> /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-
> puppet.yaml  -e
> /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.
> yaml-e
> /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.
> yaml -e
> /usr/share/openstack-tripleo-heat-templates/environments/updates/update-from-
> vip.yaml -e network-environment.yaml 

You should not be passing update-from-vip.yaml when updating from 7.2. That should only be used if you initially deployed 7.1. That's likely the root cause of the issue here.
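Under that diagnosis, the fix on the user side is simply to drop that one environment file from the command in comment #33. A sketch of the corrected 7.2->7.3 update invocation, built as a string here so the difference is visible (it assumes the same template paths as the original command):

```shell
# Corrected update command: identical to the one in comment #33 except that
# environments/updates/update-from-vip.yaml is no longer passed.
THT=/usr/share/openstack-tripleo-heat-templates
cmd="openstack overcloud update stack overcloud -i --templates \
 -e $THT/overcloud-resource-registry-puppet.yaml \
 -e $THT/environments/storage-environment.yaml \
 -e $THT/environments/network-isolation.yaml \
 -e network-environment.yaml"
echo "$cmd"
```

The `yes "" |` prefix from the original command only auto-confirms the interactive prompts and is orthogonal to this fix.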
Comment 36 Omri Hochman 2016-02-15 23:10:04 EST
> You should not be passing update-from-vip.yaml when updating from 7.2. That
> should only be used if you initially deployed 7.1. That's likely the root
> cause of the issue here.

CLOSED as NOTABUG. The misconfiguration was caused by passing
"-e update-from-vip.yaml" in the update command; that option should be used
*only* when updating from 7.1 to 7.3, not when updating from 7.2 to 7.3.
