rhel-osp-director: update from 7.2 -> 7.3 failed with DBAPIError (on ControllerOvercloudServicesDeployment_Step4)

Environment:
------------
instack-undercloud-2.1.2-39.el7ost.noarch
openstack-puppet-modules-2015.1.8-49.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-117.el7ost.noarch
openstack-heat-templates-0-0.8.20150605git.el7ost.noarch
python-heatclient-0.6.0-1.el7ost.noarch
openstack-heat-api-cfn-2015.1.2-8.el7ost.noarch
openstack-heat-engine-2015.1.2-8.el7ost.noarch
openstack-heat-api-2015.1.2-8.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.1-5.el7ost.noarch
puppet-3.6.2-2.el7.noarch

Scenario:
---------
(1) Deploy the overcloud using OSPd 7.2 GA.
(2) Attempt to update to 7.3.

Results:
--------
The overcloud update failed.

[stack@undercloud72 ~]$ heat resource-list -n 5 overcloud | grep -v COMPLETE
+---------------------------------------------+--------------------------------------+---------------------------------------+-----------------+----------------------+---------------------------------------------+
| resource_name                               | physical_resource_id                 | resource_type                         | resource_status | updated_time         | parent_resource                             |
+---------------------------------------------+--------------------------------------+---------------------------------------+-----------------+----------------------+---------------------------------------------+
| ControllerNodesPostDeployment               | 1ece9e2b-6fe1-46c1-bd2d-fae30fb59246 | OS::TripleO::ControllerPostDeployment | UPDATE_FAILED   | 2016-02-10T21:28:03Z |                                             |
| ControllerOvercloudServicesDeployment_Step4 | 39b2a9c3-7804-4e4c-9137-99474876543d | OS::Heat::StructuredDeployments       | UPDATE_FAILED   | 2016-02-10T21:31:54Z | ControllerNodesPostDeployment               |
| 1                                           | f51be98f-2b7d-403a-9ced-cc3bab757eb5 | OS::Heat::StructuredDeployment        | UPDATE_FAILED   | 2016-02-10T21:31:57Z | ControllerOvercloudServicesDeployment_Step4 |
| 2                                           | a17b8422-2ecf-4d82-96ca-29c8e825f38b | OS::Heat::StructuredDeployment        | UPDATE_FAILED   | 2016-02-10T21:31:58Z | ControllerOvercloudServicesDeployment_Step4 |
| 0                                           | d2db29b9-a2d6-464d-b72b-0b642c373b0f | OS::Heat::StructuredDeployment        | UPDATE_FAILED   | 2016-02-10T21:31:59Z | ControllerOvercloudServicesDeployment_Step4 |
+---------------------------------------------+--------------------------------------+---------------------------------------+-----------------+----------------------+---------------------------------------------+

heat.log (on the director):
---------------------------
2016-02-10 14:55:35.918 3803 ERROR oslo_db.sqlalchemy.exc_filters [-] DBAPIError exception wrapped from (_mysql_exceptions.DataError) (1406, "Data too long for column 'resource_properties' at row 1")
[SQL: u'INSERT INTO event (created_at, updated_at, stack_id, uuid, resource_action, resource_status, resource_name, physical_resource_id, resource_status_reason, resource_type, resource_properties) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)']
[parameters: (datetime.datetime(2016, 2, 10, 19, 55, 35, 911623), None, 'f8244e8f-46b7-4118-ae64-f5c97afc099d', 'ce439229-37d7-4626-bd75-88b78f73ba96', 'CREATE', 'IN_PROGRESS', 'ControllerPuppetConfigImpl', None, 'state changed', 'OS::Heat::SoftwareConfig', '\x80\x02}q\x01(U\x06inputsq\x02NU\x06configq\x03X\x7f\x0c\x01\x00# Copyright 2015 Red Hat, Inc.\n# All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the "License"); you may\n# not use this file except in compliance with the License. You may obtain\n# a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT\n# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the\n# License for the specific language governing permissions and limitations\n# under the License.\n\nPcmk_resource <| |> {\n tries => 10,\n try_sleep => 3,\n}\n\ninclude tripleo::packages\n\nif $::hostname == downcase(hiera(\'bootstrap_nodeid\')) {\n $pacemaker_master = true\n $sync_db = true\n} else {\n $pacemaker_master = false\n $sync_db = false\n}\n\n$enable_load_balancer = hiera(\'enable_load_balancer\', true)\n\n# When to start and enable services which haven\'t been Pacemakerized\n# FIXME: remove when we start all OpenStack services using Pacemaker\n# (occurences of this variable will be gradually replaced with false)\n$non_pcmk_start = hiera(\'step\') >= 4\n\nif hiera(\'step\') >= 1 {\n\n create_resources .... ... ....

2016-02-10 16:27:20.697 23757 ERROR oslo_db.api [req-f266b7c3-8fff-4249-b062-37e4d53710c2 01ac795bbf3d4a1997ab98fbd546ccab 896498ead29b441aaada8dca280df4fc] DB error.
2016-02-10 16:27:20.697 23757 TRACE oslo_db.api Traceback (most recent call last):
2016-02-10 16:27:20.697 23757 TRACE oslo_db.api   File "/usr/lib/python2.7/site-packages/oslo_db/api.py", line 131, in wrapper
2016-02-10 16:27:20.697 23757 TRACE oslo_db.api     return f(*args, **kwargs)
2016-02-10 16:27:20.697 23757 TRACE oslo_db.api   File "/usr/lib/python2.7/site-packages/heat/engine/service_software_config.py", line 89, in _push_metadata_software_deployments
2016-02-10 16:27:20.697 23757 TRACE oslo_db.api     exception.DeploymentConcurrentTransaction(server=server_id))
2016-02-10 16:27:20.697 23757 TRACE oslo_db.api RetryRequest

Controller_0:
-------------
It looks like puppet got stuck:

    1  5730  5730  5730 ?  -1 Ss  0  0:45 /usr/bin/python /usr/bin/os-collect-config
 5730 30948  5730  5730 ?  -1 S   0  0:00  \_ /usr/bin/python /usr/bin/os-refresh-config
30948 31145  5730  5730 ?  -1 S   0  0:00      \_ /bin/bash /usr/local/bin/dib-run-parts /usr/libexec/os-refresh-config/configure.d
31145 35150  5730  5730 ?  -1 S   0  0:00          \_ python /usr/libexec/os-refresh-config/configure.d/55-heat-config
35150 35155  5730  5730 ?  -1 S   0  0:00              \_ python /var/lib/heat-config/hooks/puppet
35155 35157  5730  5730 ?  -1 Sl  0  0:37                  \_ /usr/bin/ruby /usr/bin/puppet apply --detailed-exitcodes /var/lib/heat-config/heat-config-puppet/7e81b8d3-8bf8-4
35157 45130 45130 45130 ?  -1 Ss  0  0:05                      \_ /usr/bin/python /usr/bin/nova-manage db sync
35157 48325 48325 48325 ?  -1 Ss  0  0:00                      \_ /usr/bin/systemctl restart openstack-nova-scheduler

Messages show a problem connecting to the DB:
---------------------------------------------
[root@overcloud-controller-0 ~]# journalctl | grep os-collect-config | grep ERROR
Feb 10 14:58:35 overcloud-controller-0.localdomain os-collect-config[5730]: haproxy/haproxy.cfg]/returns: executed successfully\u001b[0m\n\u001b[mNotice: /Stage[main]/Haproxy::Config/Concat[/etc/haproxy/haproxy.cfg]/Exec[concat_/etc/haproxy/haproxy.cfg]: Triggered 'refresh' from 41 events\u001b[0m\n\u001b[mNotice: /Stage[main]/Haproxy::Config/Concat[/etc/haproxy/haproxy.cfg]/File[/etc/haproxy/haproxy.cfg]/content: content changed '{md5}1f337186b0e1ba5ee82760cb437fb810' to '{md5}aa6d80f5d912fe10a33c377b07d7c90f'\u001b[0m\n\u001b[mNotice: /File[/etc/haproxy/haproxy.cfg]/seluser: seluser changed 'unconfined_u' to 'system_u'\u001b[0m\n\u001b[mNotice: Finished catalog run in 57.86 seconds\u001b[0m\n", "deploy_stderr": "Device \"br_ex\" does not exist.\nDevice \"br_nic2\" does not exist.\nDevice \"br_nic4\" does not exist.\nDevice \"ovs_system\" does not exist.\n\u001b[1;31mError: Could not prefetch mysql_user provider 'mysql': Execution of '/usr/bin/mysql -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)\u001b[0m\n\u001b[1;31mError: Could not prefetch mysql_database provider 'mysql': Execution of '/usr/bin/mysql -NBe show databases' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)\u001b[0m\n", "deploy_status_code": 0}
Feb 10 14:58:36 overcloud-controller-0.localdomain os-collect-config[5730]: Error: Could not prefetch mysql_user provider 'mysql': Execution of '/usr/bin/mysql -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
Feb 10 14:58:36 overcloud-controller-0.localdomain os-collect-config[5730]: Error: Could not prefetch mysql_database provider 'mysql': Execution of '/usr/bin/mysql -NBe show databases' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
Feb 10 15:01:29 overcloud-controller-0.localdomain os-collect-config[5730]: der constraint storage_vip-then-haproxy]/returns: executed successfully\u001b[0m\n\u001b[mNotice: Finished catalog run in 90.44 seconds\u001b[0m\n", "deploy_stderr": "Device \"br_ex\" does not exist.\nDevice \"br_nic2\" does not exist.\nDevice \"br_nic4\" does not exist.\nDevice \"ovs_system\" does not exist.\n\u001b[1;31mError: Could not prefetch mysql_user provider 'mysql': Execution of '/usr/bin/mysql -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)\u001b[0m\n\u001b[1;31mError: Could not prefetch mysql_database provider 'mysql': Execution of '/usr/bin/mysql -NBe show databases' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)\u001b[0m\n", "deploy_status_code": 0}
Feb 10 15:01:29 overcloud-controller-0.localdomain os-collect-config[5730]: Error: Could not prefetch mysql_user provider 'mysql': Execution of '/usr/bin/mysql -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
Feb 10 15:01:29 overcloud-controller-0.localdomain os-collect-config[5730]: Error: Could not prefetch mysql_database provider 'mysql': Execution of '/usr/bin/mysql -NBe show databases' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)

Messages:
---------
NCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)\u001b[0m\n\u001b[1;31mError: Could not prefetch mysql_database provider 'mysql': Execution of '/usr/bin/mysql -NBe show databases' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)\u001b[0m\n", "deploy_status_code": 0}

Services are down on the controller:
------------------------------------
== Support services ==
mysqld:          inactive (disabled on boot)
libvirtd:        active
openvswitch:     active
dbus:            active
rabbitmq-server: inactive (disabled on boot)
memcached:       active   (disabled on boot)
The Heat log message is normal. The proximate cause is puppet being stuck trying to restart nova-scheduler; MySQL being down is a likely cause of that.
Hi Omri, as Zane points out, the 'DBAPIError' isn't the cause of the issue you're seeing; see https://bugzilla.redhat.com/show_bug.cgi?id=1261512#c19 (@Zane, this has been a source of confusion before; do we really have to live with that message forever?) That said, there isn't enough in the BZ for me to see what actually went wrong. Ideally I'd like to see /var/log/messages from the controller. Regarding the services, can you also check the state of the pacemaker cluster (pcs status) to see what has been started?
(In reply to marios from comment #2)
> (@Zane this has been the source of confusion before do we really have to
> always live with that message?)

Not forever. Crag has a patch up to make the log message go away at https://review.openstack.org/#/c/259274 and he's working on a proper fix to eliminate the problem completely in Mitaka (it requires a database migration, though, so it's not backportable).
Created attachment 1123164 [details]
/var/log/messages from controller-0

Here is the log file you requested. It's from controller-0, but they were all stuck in the same place.
Hi Fabio, Can you please ask someone from your team to investigate what's happening around the upgrade of galera? Angus
Damien / Mike, can you please take a look at this BZ? It's a blocker for OSPd delivery. Thanks, Fabio
OK, as mentioned above, the data error is a known quirk in Heat that is caught internally. The real problem is that it can't connect to the database. So can we get database logs? A systemd start of galera would log to /var/log/mariadb/mariadb.log, and a pacemaker start to /var/log/mysqld.log. /etc/my.cnf* would be nice as well.
Can we also get sosreports of the overcloud nodes, so we have the pacemaker logs? We need those to understand how the galera resource agent was invoked during the scenario.
So I've been given a connection to a host that calls itself "undercloud72", and MySQL is running just fine there. However, these logs refer to a machine that calls itself "overcloud-controller-0", yet it's attempting to make a local unix-socket connection. That won't work if that's not where MySQL is running. Shouldn't it be connecting to a hostname / port? (Note that a hostname of "localhost" also tells the MySQL client to connect over a local unix socket.)
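The localhost-means-socket rule mentioned above can be sketched in a few lines. This is an illustrative helper only (the function name and return values are invented for this sketch, not actual MySQL client code), but it captures the transport-selection behavior the MySQL client library documents:

```python
def mysql_connect_mode(host):
    """Sketch of how the MySQL client chooses its transport:
    a host of 'localhost' (or an empty host) means a local
    Unix-domain socket (e.g. /var/lib/mysql/mysql.sock);
    any other hostname or IP address means a TCP/IP connection."""
    if host in ("", "localhost"):
        return "unix_socket"
    return "tcp"

# A service configured with host=localhost will therefore never reach
# a MySQL server that only listens on another node or on a VIP:
print(mysql_connect_mode("localhost"))     # unix_socket
print(mysql_connect_mode("10.19.104.17"))  # tcp
```

This is why the "Can't connect to local MySQL server through socket" errors point at a configuration using "localhost" (or no host at all) rather than the database VIP.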
OK, apparently this is not the host I'm supposed to be looking at, but Heat still looks like it's not connecting over a host / port.
OK, disregard my spam; I understand now.
Well, MySQL was up and running when that last error happened. Per /var/log/mysqld.log on overcloud-controller-0, galera started up from a process perspective at 15:00:32 and was fully synced by 15:00:39. So the connect error at 14:58:35 makes sense, but not the one at 15:01:29.
Can we reproduce this bug for additional diagnostics, and to establish that it wasn't a transitory issue in the test environment?
(In reply to Angus Thomas from comment #13) > Can we reproduce this bug for additional diagnostics, and to establish that > it wasn't a transitory issue in the test environment? Angus the environment is still up and running available for dev to debug. I'm answering to the need of sosreport from all the overcloud nodes : (In reply to Damien Ciabrini from comment #8) > Can we also get sosreport of the overcloud nodes to have pacemaker logs? We > need that to understand how the galera resource agent got called during the > scenario The sosreport should be available here: http://rhos-release.virt.bos.redhat.com/mburns/log/bz1306510
Reproduced on another setup (rhos-compute-12). Same DB errors, same heat errors; services on the controllers are down as follows:

[root@overcloud-controller-0 ~]# systemctl | grep failed
● cloud-final.service                 loaded failed failed Execute cloud user/final scripts
● dhcp-interface                      loaded failed failed DHCP interface br/ex
● dhcp-interface                      loaded failed failed DHCP interface br/int
● dhcp-interface                      loaded failed failed DHCP interface br/tun
● dhcp-interface                      loaded failed failed DHCP interface ovs/system
● neutron-server.service              loaded failed failed Cluster Controlled neutron-server
● openstack-cinder-api.service        loaded failed failed Cluster Controlled openstack-cinder-api
● openstack-cinder-scheduler.service  loaded failed failed Cluster Controlled openstack-cinder-scheduler

[root@overcloud-controller-0 ~]# openstack-status
== Nova services ==
openstack-nova-api:        active     (disabled on boot)
openstack-nova-cert:       inactive   (disabled on boot)
openstack-nova-compute:    inactive   (disabled on boot)
openstack-nova-network:    inactive   (disabled on boot)
openstack-nova-scheduler:  activating (disabled on boot)
openstack-nova-conductor:  active     (disabled on boot)
== Glance services ==
openstack-glance-api:      active (disabled on boot)
openstack-glance-registry: active (disabled on boot)
== Keystone service ==
openstack-keystone:        active (disabled on boot)
== Horizon service ==
openstack-dashboard:       uncontactable
== neutron services ==
neutron-server:            failed   (disabled on boot)
neutron-dhcp-agent:        active   (disabled on boot)
neutron-l3-agent:          active   (disabled on boot)
neutron-metadata-agent:    active   (disabled on boot)
neutron-lbaas-agent:       inactive (disabled on boot)
neutron-openvswitch-agent: active   (disabled on boot)
neutron-metering-agent:    inactive (disabled on boot)
== Swift services ==
openstack-swift-proxy:     active (disabled on boot)
openstack-swift-account:   active (disabled on boot)
openstack-swift-container: active (disabled on boot)
openstack-swift-object:    active (disabled on boot)
== Cinder services ==
openstack-cinder-api:       failed   (disabled on boot)
openstack-cinder-scheduler: failed   (disabled on boot)
openstack-cinder-volume:    inactive (disabled on boot)
openstack-cinder-backup:    inactive (disabled on boot)
== Ceilometer services ==
openstack-ceilometer-api:             active   (disabled on boot)
openstack-ceilometer-central:         active   (disabled on boot)
openstack-ceilometer-compute:         inactive (disabled on boot)
openstack-ceilometer-collector:       active   (disabled on boot)
openstack-ceilometer-alarm-notifier:  active   (disabled on boot)
openstack-ceilometer-alarm-evaluator: active   (disabled on boot)
openstack-ceilometer-notification:    active   (disabled on boot)
== Heat services ==
openstack-heat-api:            active (disabled on boot)
openstack-heat-api-cfn:        active (disabled on boot)
openstack-heat-api-cloudwatch: active (disabled on boot)
openstack-heat-engine:         active (disabled on boot)
== Support services ==
mysqld:          inactive (disabled on boot)
libvirtd:        active
openvswitch:     active
dbus:            active
rabbitmq-server: inactive (disabled on boot)
memcached:       active   (disabled on boot)
The nova-scheduler log is full of entries like:

2016-02-12 10:50:04.505 48403 WARNING oslo_db.sqlalchemy.session [req-635b84b2-bd1b-40f2-a06c-9925d2ace206 - - - - -] SQL connection failed. -11910 attempts left.

It started at -1 attempts left and has counted down from there. Clearly there's a bug in oslo_db: it's checking the equivalent of "while --retry_attempts != 0" when it should check for <= 0. That may be the whole problem, but it seems surprising that none of these attempts to connect to the DB succeeds if the DB is indeed up.
I've checked the sosreport and the recurrence of the same bug on two other deployments today:

. galera is running in all deployments, on all controllers
. when connecting locally on a unix socket (ctrl0), I can see that the galera cluster is Synced and all 3 nodes are up
. pacemaker sees the three galera nodes as "Master" as well

So clearly the DB is there and responding. Concerning the "HY000" logs, we see the same pattern on all deployments: first attempts to connect while galera wasn't started yet. What's interesting is that these logs appear at the very beginning of the deployment, i.e. probably well before the update started.
Here's a snippet from the end of the useful part of the nova-scheduler log:

2016-02-10 16:21:16.417 34585 INFO oslo_messaging._drivers.impl_rabbit [req-c22aaae4-9478-46a0-8beb-7b37ef5ab975 - - - - -] Connected to AMQP server on 10.19.104.16:5672
2016-02-10 16:24:49.003 34585 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 10.19.104.16:5672 is unreachable: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'. Trying again in 1 seconds.
2016-02-10 16:24:50.029 34585 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 10.19.104.16:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2016-02-10 16:24:51.081 34585 INFO oslo_messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 10.19.104.13:5672
2016-02-10 16:50:35.591 34585 ERROR oslo_messaging._drivers.impl_rabbit [-] Failed to consume message from queue:
2016-02-10 16:50:39.935 48403 WARNING oslo_db.sqlalchemy.session [req-635b84b2-bd1b-40f2-a06c-9925d2ace206 - - - - -] SQL connection failed. -1 attempts left.
2016-02-10 16:50:52.951 48403 WARNING oslo_db.sqlalchemy.session [req-635b84b2-bd1b-40f2-a06c-9925d2ace206 - - - - -] SQL connection failed. -2 attempts left.
(it continues like this forever)

Here's the "journalctl -u openstack-nova-scheduler" output in its entirety:

Feb 10 15:08:21 overcloud-controller-0.localdomain systemd[1]: Starting Cluster Controlled openstack-nova-scheduler...
Feb 10 15:08:24 overcloud-controller-0.localdomain systemd[1]: Started Cluster Controlled openstack-nova-scheduler.
Feb 10 15:08:24 overcloud-controller-0.localdomain nova-scheduler[22087]: /usr/lib64/python2.7/site-packages/sqlalchemy/sql/default_comparator.py:153: SAW
Feb 10 15:08:24 overcloud-controller-0.localdomain nova-scheduler[22087]: 'strategies for improved performance.' % expr)
Feb 10 15:19:35 overcloud-controller-0.localdomain systemd[1]: Stopping Cluster Controlled openstack-nova-scheduler...
Feb 10 15:19:39 overcloud-controller-0.localdomain systemd[1]: Stopped Cluster Controlled openstack-nova-scheduler.
Feb 10 15:23:20 overcloud-controller-0.localdomain systemd[1]: Starting Cluster Controlled openstack-nova-scheduler...
Feb 10 15:23:22 overcloud-controller-0.localdomain systemd[1]: Started Cluster Controlled openstack-nova-scheduler.
Feb 10 16:19:38 overcloud-controller-0.localdomain systemd[1]: Stopping OpenStack Nova Scheduler Server...
Feb 10 16:19:42 overcloud-controller-0.localdomain systemd[1]: Stopped OpenStack Nova Scheduler Server.
Feb 10 16:21:14 overcloud-controller-0.localdomain systemd[1]: Starting Cluster Controlled openstack-nova-scheduler...
Feb 10 16:21:16 overcloud-controller-0.localdomain systemd[1]: Started Cluster Controlled openstack-nova-scheduler.
Feb 10 16:50:27 overcloud-controller-0.localdomain systemd[1]: Stopping Cluster Controlled openstack-nova-scheduler...
Feb 10 16:50:35 overcloud-controller-0.localdomain systemd[1]: Starting Cluster Controlled openstack-nova-scheduler...

And the interesting parts of "ps axjf":

 PPID   PID  PGID   SID TTY  TPGID STAT  UID  TIME COMMAND
...
    1  5730  5730  5730 ?   -1 Ss    0  0:45 /usr/bin/python /usr/bin/os-collect-config
 5730 30948  5730  5730 ?   -1 S     0  0:00  \_ /usr/bin/python /usr/bin/os-refresh-config
30948 31145  5730  5730 ?   -1 S     0  0:00      \_ /bin/bash /usr/local/bin/dib-run-parts /usr/libexec/os-refresh-config/configure.d
31145 35150  5730  5730 ?   -1 S     0  0:00          \_ python /usr/libexec/os-refresh-config/configure.d/55-heat-config
35150 35155  5730  5730 ?   -1 S     0  0:00              \_ python /var/lib/heat-config/hooks/puppet
35155 35157  5730  5730 ?   -1 Sl    0  0:37                  \_ /usr/bin/ruby /usr/bin/puppet apply --detailed-exitcodes /var/lib/heat-config/heat-config-puppet/7e81b8d3-8bf8-4ee9-8480-c77d6e1ca24a.pp
35157 45130 45130 45130 ?   -1 Ss    0  0:23                      \_ /usr/bin/python /usr/bin/nova-manage db sync
35157 48325 48325 48325 ?   -1 Ss    0  0:02                      \_ /usr/bin/systemctl restart openstack-nova-scheduler
...
    1 48403 48403 48403 ?   -1 Ss  162  0:22 /usr/bin/python /usr/bin/nova-scheduler

And the systemd status:

● openstack-nova-scheduler.service - Cluster Controlled openstack-nova-scheduler
   Loaded: loaded (/usr/lib/systemd/system/openstack-nova-scheduler.service; disabled; vendor preset: disabled)
  Drop-In: /run/systemd/system/openstack-nova-scheduler.service.d
           └─50-pacemaker.conf
   Active: activating (start) since Wed 2016-02-10 16:50:35 EST; 1 day 18h ago
 Main PID: 48403 (nova-scheduler)
   CGroup: /system.slice/openstack-nova-scheduler.service
           └─48403 /usr/bin/python /usr/bin/nova-scheduler

So although we never saw "Stopped Cluster Controlled openstack-nova-scheduler" in the journal, it looks very much like the newly started nova-scheduler (pid 48403), launched at 2016-02-10 16:50:35, is the one that cannot connect to the DB. It will never give up and fail (not until -Infinity == 0), but it will also never succeed unless the "SQL connection failed" errors stop.
Is the pcs output about the state of galera still accurate even when pacemaker is in maintenance mode? Because "pcs status" is also showing me:

Clone Set: openstack-cinder-api-clone [openstack-cinder-api] (unmanaged)
    openstack-cinder-api (systemd:openstack-cinder-api): Started overcloud-controller-0 (unmanaged)
    openstack-cinder-api (systemd:openstack-cinder-api): Started overcloud-controller-1 (unmanaged)
    openstack-cinder-api (systemd:openstack-cinder-api): Started overcloud-controller-2 (unmanaged)

yet:

● openstack-cinder-api.service - Cluster Controlled openstack-cinder-api
   Loaded: loaded (/usr/lib/systemd/system/openstack-cinder-api.service; disabled; vendor preset: disabled)
  Drop-In: /run/systemd/system/openstack-cinder-api.service.d
           └─50-pacemaker.conf
   Active: failed (Result: exit-code) since Wed 2016-02-10 16:38:25 EST; 1 day 19h ago
 Main PID: 38791 (code=exited, status=1/FAILURE)

(presumably this failed as a consequence of nova-scheduler not coming up)

systemd does report that mysqld is running at least, though (managed by pacemaker):

● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2016-02-10 16:20:24 EST; 1 day 19h ago
 Main PID: 29614 (pacemakerd)
   CGroup: /system.slice/pacemaker.service
           ├─29614 /usr/sbin/pacemakerd -f
           ├─29621 /usr/libexec/pacemaker/cib
           ├─29622 /usr/libexec/pacemaker/stonithd
           ├─29623 /usr/libexec/pacemaker/lrmd
           ├─29624 /usr/libexec/pacemaker/attrd
           ├─29625 /usr/libexec/pacemaker/pengine
           ├─29626 /usr/libexec/pacemaker/crmd
           ├─30829 sh -c /usr/sbin/rabbitmq-server > /var/log/rabbitmq/startup_log 2> /var/log/rabbitmq/startup_err
           ├─30835 /bin/sh /usr/sbin/rabbitmq-server
           ├─32091 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock --d...
           └─33020 /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=...
I tried killing nova-scheduler manually and then starting it myself. That unblocked the rest of puppet, of course, but the attempt to start nova-scheduler again had the same result, just starting back at a retry count of -1.
(In reply to Zane Bitter from comment #19)
> systemd does report that mysqld is running at least though (managed by
> pacemaker):

It's up and running, plus haproxy seems to have no problem proxying mysqld through the VIP on controller-0:

[root@overcloud-controller-0 nova]# systemctl status haproxy
● haproxy.service - Cluster Controlled haproxy
   Loaded: loaded (/usr/lib/systemd/system/haproxy.service; disabled; vendor preset: disabled)
  Drop-In: /run/systemd/system/haproxy.service.d
           └─50-pacemaker.conf
   Active: active (running) since Thu 2016-02-11 14:31:36 EST; 21h ago
 Main PID: 18088 (haproxy-systemd)
   CGroup: /system.slice/haproxy.service
           ├─18088 /usr/sbin/haproxy-systemd-wrapper -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
           ├─18091 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds
           └─18095 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds

Feb 12 11:39:22 overcloud-controller-0.localdomain haproxy[18095]: Connect from 192.168.100.13:52111 to 192.168.100.11:3306 (mysql/TCP)
Feb 12 11:39:22 overcloud-controller-0.localdomain haproxy[18095]: Connect from 192.168.100.13:52112 to 192.168.100.11:3306 (mysql/TCP)
Feb 12 11:39:44 overcloud-controller-0.localdomain haproxy[18095]: Connect from 192.168.100.13:52695 to 192.168.100.11:3306 (mysql/TCP)
Feb 12 11:39:44 overcloud-controller-0.localdomain haproxy[18095]: Connect from 192.168.100.13:52696 to 192.168.100.11:3306 (mysql/TCP)
Feb 12 11:40:22 overcloud-controller-0.localdomain haproxy[18095]: Connect from 192.168.100.13:53703 to 192.168.100.11:3306 (mysql/TCP)
Feb 12 11:40:22 overcloud-controller-0.localdomain haproxy[18095]: Connect from 192.168.100.13:53704 to 192.168.100.11:3306 (mysql/TCP)
Feb 12 11:40:22 overcloud-controller-0.localdomain haproxy[18095]: Connect from 192.168.100.13:53705 to 192.168.100.11:3306 (mysql/TCP)
Feb 12 11:40:22 overcloud-controller-0.localdomain haproxy[18095]: Connect from 192.168.100.13:53706 to 192.168.100.11:3306 (mysql/TCP)
Feb 12 11:44:51 overcloud-controller-0.localdomain haproxy[18095]: Connect from 192.168.100.14:60191 to 192.168.100.11:3306 (mysql/TCP)
Feb 12 12:01:41 overcloud-controller-0.localdomain haproxy[18095]: Connect from 192.168.100.13:59317 to 192.168.100.11:3306 (mysql/TCP)

[root@overcloud-controller-0 nova]# netstat -tnlp | grep 3306
tcp   0   0 192.168.100.12:3306   0.0.0.0:*   LISTEN   19989/mysqld
tcp   0   0 192.168.100.11:3306   0.0.0.0:*   LISTEN   18095/haproxy
With help from Dan Smith I discovered that _none_ of the services can connect to the DB (nova-scheduler just happens to be the only one that blocks systemd as a result). Going through some other log files until we find a more useful one reveals this (from glance-api):

2016-02-11 18:09:41.275 40856 WARNING oslo_db.sqlalchemy.session [-] SQL connection failed. 10 attempts left.
2016-02-11 18:09:53.667 40856 WARNING oslo_db.sqlalchemy.session [-] SQL connection failed. 9 attempts left.
2016-02-11 18:10:06.685 40856 WARNING oslo_db.sqlalchemy.session [-] SQL connection failed. 8 attempts left.
2016-02-11 18:10:19.698 40856 WARNING oslo_db.sqlalchemy.session [-] SQL connection failed. 7 attempts left.
2016-02-11 18:10:32.708 40856 WARNING oslo_db.sqlalchemy.session [-] SQL connection failed. 6 attempts left.
2016-02-11 18:10:45.716 40856 WARNING oslo_db.sqlalchemy.session [-] SQL connection failed. 5 attempts left.
2016-02-11 18:10:58.730 40856 WARNING oslo_db.sqlalchemy.session [-] SQL connection failed. 4 attempts left.
2016-02-11 18:11:11.737 40856 WARNING oslo_db.sqlalchemy.session [-] SQL connection failed. 3 attempts left.
2016-02-11 18:11:24.754 40856 WARNING oslo_db.sqlalchemy.session [-] SQL connection failed. 2 attempts left.
2016-02-11 18:11:37.770 40856 WARNING oslo_db.sqlalchemy.session [-] SQL connection failed. 1 attempts left.
2016-02-11 18:11:47.782 40856 CRITICAL glance [-] DBConnectionError: (_mysql_exceptions.OperationalError) (2003, "Can't connect to MySQL server on '10.19.104.17' (113)")

Predictably, that appears to be either the wrong address or a VIP that is broken:

# ping 10.19.104.17
PING 10.19.104.17 (10.19.104.17) 56(84) bytes of data.
From 10.19.104.15 icmp_seq=1 Destination Host Unreachable
...

# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
# HEAT_HOSTS_START - Do not edit manually within this section!
10.19.104.14 overcloud-compute-0.localdomain overcloud-compute-0
10.19.104.12 overcloud-compute-1.localdomain overcloud-compute-1
10.19.104.15 overcloud-controller-0.localdomain overcloud-controller-0
10.19.104.13 overcloud-controller-1.localdomain overcloud-controller-1
10.19.104.16 overcloud-controller-2.localdomain overcloud-controller-2
10.19.105.14 overcloud-cephstorage-0.localdomain overcloud-cephstorage-0
# HEAT_HOSTS_END

Same deal on all three controllers.
Definitely a broken VIP (from pcs status):

ip-10.19.104.17    (ocf::heartbeat:IPaddr2):  Stopped (unmanaged)
ip-10.19.105.17    (ocf::heartbeat:IPaddr2):  Stopped (unmanaged)
ip-10.19.184.225   (ocf::heartbeat:IPaddr2):  Stopped (unmanaged)
ip-192.168.201.15  (ocf::heartbeat:IPaddr2):  Stopped (unmanaged)
Omri, can you provide the initial deployment command you used to deploy 7.2? (Please also confirm that 7.2 is what you initially deployed.) Then the full update command. For both commands, please provide any custom environment files, and note any other commands that may have been run between the initial deployment and the update.
(In reply to Zane Bitter from comment #23)

Here are some commands that would give interesting output for troubleshooting:

ip addr
sudo netstat -anp
cat /etc/puppet/hieradata/vip_data.yaml   (IIRC the filename)
(In reply to Dan Sneddon from comment #25) > ip addr (just the relevant interface) 8: br-nic2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN link/ether c8:1f:66:c9:4b:20 brd ff:ff:ff:ff:ff:ff inet 10.19.104.13/24 brd 10.19.104.255 scope global br-nic2 valid_lft forever preferred_lft forever inet 10.19.104.11/32 brd 10.19.104.255 scope global br-nic2 valid_lft forever preferred_lft forever inet6 2620:52:0:1368:ca1f:66ff:fec9:4b20/64 scope global mngtmpaddr dynamic valid_lft 2591952sec preferred_lft 604752sec inet6 fe80::ca1f:66ff:fec9:4b20/64 scope link valid_lft forever preferred_lft forever > sudo netstat -anp $ sudo netstat -anp | grep 3306 | grep LISTEN tcp 0 0 10.19.104.13:3306 0.0.0.0:* LISTEN 11790/mysqld tcp 0 0 10.19.104.11:3306 0.0.0.0:* LISTEN 9298/haproxy > cat /etc/puppet/hieradata/vip_data.yaml (IIRC the filename) ceilometer_api_vip: 10.19.104.17 cinder_api_vip: 10.19.104.17 glance_api_vip: 10.19.105.17 glance_registry_vip: 10.19.104.17 heat_api_vip: 10.19.104.17 horizon_vip: 10.19.104.17 keystone_admin_api_vip: 192.168.0.6 keystone_public_api_vip: 10.19.104.17 mysql_vip: 10.19.104.17 neutron_api_vip: 10.19.104.17 nova_api_vip: 10.19.104.17 nova_metadata_vip: 10.19.104.17 redis_vip: 10.19.104.10 swift_proxy_vip: 10.19.105.17 tripleo::loadbalancer::controller_virtual_ip: 192.168.0.6 tripleo::loadbalancer::internal_api_virtual_ip: 10.19.104.17 tripleo::loadbalancer::public_virtual_ip: 10.19.184.225 tripleo::loadbalancer::storage_mgmt_virtual_ip: 192.168.201.15 tripleo::loadbalancer::storage_virtual_ip: 10.19.105.17 tripleo::redis_notification::haproxy_monitor_ip: 192.168.0.6 So based on these commands, it appears that the VIP as far as Puppet is concerned is 10.19.104.17. 
However, looking at the relevant stanza in /etc/haproxy/haproxy.cfg, haproxy is configured to be listening on 10.19.104.11:

listen mysql
  bind 10.19.104.11:3306 transparent
  option tcpka
  option httpchk
  stick on dst
  stick-table type ip size 1000
  timeout client 0
  timeout server 0
  server overcloud-controller-0 10.19.104.15:3306 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
  server overcloud-controller-1 10.19.104.13:3306 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
  server overcloud-controller-2 10.19.104.16:3306 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2

So how did these get out of sync? Would running Puppet update the haproxy config to use .17 as the VIP?
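A quick way to spot this kind of drift is to compare the mysql VIP that Puppet hieradata carries with the address haproxy actually binds. The sketch below is a hypothetical helper (not from this thread) that runs against sample copies of the two files; on a real controller you would point it at /etc/puppet/hieradata/vip_data.yaml and /etc/haproxy/haproxy.cfg instead:

```shell
#!/bin/sh
# Sketch: compare the mysql VIP Puppet knows about with the one haproxy binds.
# The sample files below stand in for the real /etc/ paths seen in the thread.

cat > vip_data.yaml <<'EOF'
mysql_vip: 10.19.104.17
EOF

cat > haproxy.cfg <<'EOF'
listen mysql
  bind 10.19.104.11:3306 transparent
EOF

# VIP according to hieradata.
hiera_vip=$(awk -F': ' '/^mysql_vip:/ {print $2}' vip_data.yaml)

# Address haproxy binds in the "listen mysql" stanza (strip the :3306 port).
haproxy_vip=$(awk '/^listen mysql/{f=1} f && /bind/ {split($2, a, ":"); print a[1]; exit}' haproxy.cfg)

if [ "$hiera_vip" != "$haproxy_vip" ]; then
  echo "MISMATCH: hiera says $hiera_vip, haproxy binds $haproxy_vip"
else
  echo "OK: both agree on $hiera_vip"
fi
```

With the sample data above, mirroring the state reported in this bug, the script reports the mismatch between 10.19.104.17 and 10.19.104.11.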
From /var/log/messages:

Feb 10 16:33:01 localhost os-collect-config: [2016-02-10 16:33:01,400] (heat-config) [DEBUG] Running /var/lib/heat-config/hooks/puppet < /var/lib/heat-config/deployed/7e81b8d3-8bf8-4ee9-8480-c77d6e1ca24a.json
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/openstack-nova-novncproxy.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/openstack-nova-compute.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/openstack-nova-xvpvncproxy.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/openstack-nova-api.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/dhcp-interface@.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/openstack-nova-scheduler.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/openstack-nova-console.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/openstack-nova-cert.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/openstack-nova-consoleauth.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/wpa_supplicant.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/openstack-nova-metadata-api.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/openstack-nova-conductor.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:04 localhost systemd: Configuration file /usr/lib/systemd/system/ebtables.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Feb 10 16:33:30 localhost crmd[29626]: notice: Operation ip-10.19.104.17_monitor_0: not running (node=overcloud-controller-0, call=333, rc=7, cib-update=179, confirmed=true)
Feb 10 16:33:31 localhost crmd[29626]: notice: Operation ip-10.19.105.17_monitor_0: not running (node=overcloud-controller-0, call=337, rc=7, cib-update=180, confirmed=true)
Feb 10 16:33:32 localhost crmd[29626]: notice: Operation ip-10.19.184.225_monitor_0: not running (node=overcloud-controller-0, call=341, rc=7, cib-update=181, confirmed=true)
Feb 10 16:33:34 localhost crmd[29626]: notice: Operation ip-192.168.201.15_monitor_0: not running (node=overcloud-controller-0, call=345, rc=7, cib-update=182, confirmed=true)
Feb 10 16:34:44 localhost snmpd[8193]: Cannot statfs net:[4026532436]#012: No such file or directory
Feb 10 16:34:44 localhost snmpd[8193]: Cannot statfs net:[4026532436]#012: No such file or directory
Feb 10 16:34:44 localhost snmpd[8193]: Cannot statfs net:[4026532624]#012: No such file or directory
Feb 10 16:34:44 localhost snmpd[8193]: Cannot statfs net:[4026532624]#012: No such file or directory
Feb 10 16:36:14 localhost systemd: Stopping Cluster Controlled openstack-cinder-api...
Feb 10 16:36:14 localhost systemd: Started Cluster Controlled openstack-cinder-api.
Feb 10 16:36:14 localhost systemd: Starting Cluster Controlled openstack-cinder-api...

So it looks like something in Puppet is disabling the VIPs right before it begins restarting all the services.
From the sosreport of overcloud-controller-0 we see the following IP addresses:

$ pcs -f cib.xml resource | grep IPaddr2
 ip-10.19.184.221  (ocf::heartbeat:IPaddr2):  Stopped (unmanaged)
 ip-192.168.0.6    (ocf::heartbeat:IPaddr2):  Stopped (unmanaged)
 ip-10.19.105.10   (ocf::heartbeat:IPaddr2):  Stopped (unmanaged)
 ip-10.19.104.11   (ocf::heartbeat:IPaddr2):  Stopped (unmanaged)
 ip-192.168.201.10 (ocf::heartbeat:IPaddr2):  Stopped (unmanaged)
 ip-10.19.104.10   (ocf::heartbeat:IPaddr2):  Stopped (unmanaged)
 ip-10.19.104.17   (ocf::heartbeat:IPaddr2):  Stopped (unmanaged)
 ip-10.19.105.17   (ocf::heartbeat:IPaddr2):  Stopped (unmanaged)
 ip-10.19.184.225  (ocf::heartbeat:IPaddr2):  Stopped (unmanaged)
 ip-192.168.201.15 (ocf::heartbeat:IPaddr2):  Stopped (unmanaged)

At a quick look, that is more IP addresses than there should be. In terms of constraints we have:

$ pcs -f cib.xml constraint order | grep ip-
  start ip-10.19.104.11 then start haproxy-clone (kind:Optional)
  start ip-192.168.0.6 then start haproxy-clone (kind:Optional)
  start ip-192.168.201.10 then start haproxy-clone (kind:Optional)
  start ip-10.19.184.221 then start haproxy-clone (kind:Optional)
  start ip-10.19.104.10 then start haproxy-clone (kind:Optional)
  start ip-10.19.105.10 then start haproxy-clone (kind:Optional)

So my first impression here is that at a certain point (likely after the first installation) the main (galera) VIP was 10.19.104.11. This is reflected in pcs (the IP resources and the constraints point to that IP) and is also still reflected in haproxy.cfg. That IP is on controller-1 (at the time of sosreport collection).

Then, I guess, the update took place and only part of the configuration was enforced (i.e. .104.17 was changed in most OpenStack configuration files, but .104.11 stayed in the pcs resources and haproxy.cfg).

I am still looking to see if I can figure out the exact steps that took place, but I think the request for info in comment #24 is still very relevant.
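The "too many IP addresses" observation above can be made mechanical: diff the set of VIPs pacemaker carries as IPaddr2 resources against the set referenced in hieradata, and any leftovers are candidates for stale pre-update resources. This is a hypothetical sketch on sample data mirroring the outputs quoted above; on a controller the inputs would come from `pcs -f cib.xml resource` and /etc/puppet/hieradata/vip_data.yaml:

```shell
#!/bin/sh
# Sketch: find VIPs pacemaker still manages that hieradata no longer references.
# Sample inputs stand in for live `pcs resource` output and vip_data.yaml.

cat > pcs_resources.txt <<'EOF'
 ip-10.19.104.11 (ocf::heartbeat:IPaddr2): Stopped (unmanaged)
 ip-10.19.104.17 (ocf::heartbeat:IPaddr2): Stopped (unmanaged)
EOF

cat > vip_data.yaml <<'EOF'
mysql_vip: 10.19.104.17
EOF

# Pull the address out of each "ip-<addr>" resource name.
sed -n 's/.*ip-\([0-9.]*\).*/\1/p' pcs_resources.txt | sort -u > pcs_vips.txt

# Pull every VIP value out of hieradata.
awk -F': ' '{print $2}' vip_data.yaml | sort -u > hiera_vips.txt

# Lines only in pcs_vips.txt: VIPs pacemaker carries but Puppet no longer uses.
comm -23 pcs_vips.txt hiera_vips.txt
```

With this sample input the script prints 10.19.104.11 — exactly the stale galera VIP suspected above.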
(In reply to James Slagle from comment #24)
> omri, can you provide the initial deployment command you used to deploy 7.2.
> (please confirm 7.2 is what you initially deployed as well).
>
> and then the full update command.
>
> for both commands, please provide any custom environment files, or any other
> commands that may have been run between the initial deployment and the
> update.

I was able to update from 7.2 (openstack-tripleo-heat-templates-0.8.6-94.el7ost.noarch) to 7.3 latest (openstack-tripleo-heat-templates-0.8.6-120.el7ost) using a virt setup, HA, net-iso, single-nic-with-vlans, so that's why I'm thinking it might be something in the specific environment files or commands being used.

In addition to the previous needinfo item from comment #24, it'd also be helpful to see the output of the following before and after a failed update attempt:

neutron port-list
heat resource-list -n 10 overcloud
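Capturing the requested before/after state is easier to compare if each output lands in a labeled file. The helper below is purely illustrative (the `snapshot` function and `state-<label>.txt` naming are invented here, not from the thread); the demonstration uses a stand-in command since `neutron` and `heat` only exist on the undercloud:

```shell
#!/bin/sh
# Hypothetical helper: capture a command's output under a label so the
# before-update and after-update states can be diffed later.

snapshot() {
  label=$1; shift
  "$@" > "state-${label}.txt" 2>&1
}

# On the undercloud, before the update attempt, one would run e.g.:
#   snapshot ports-before neutron port-list
#   snapshot resources-before heat resource-list -n 10 overcloud
# then repeat with *-after labels once the update fails, and compare:
#   diff state-ports-before.txt state-ports-after.txt

# Demonstration with a stand-in command:
snapshot demo echo "neutron port-list output would land here"
cat state-demo.txt
```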
neutron port-list:

+--------------------------------------+-------------------------------+-------------------+---------------------------------------------------------------------------------------+
| id                                   | name                          | mac_address       | fixed_ips                                                                             |
+--------------------------------------+-------------------------------+-------------------+---------------------------------------------------------------------------------------+
| 05ba55ae-7202-4e53-8bcd-ebc007bea3b8 |                               | fa:16:3e:da:00:4d | {"subnet_id": "0f70ab3a-df02-46dc-a4aa-7a10e033c8ed", "ip_address": "192.168.201.12"} |
| 132ed304-0b53-44f5-9d31-50af5a38e894 | control_virtual_ip            | fa:16:3e:79:33:2d | {"subnet_id": "760323c3-4d31-415b-b3be-5c97229f7388", "ip_address": "192.168.0.6"}    |
| 2681fd72-aae5-470e-9898-17fd3531af88 |                               | 00:0a:f7:7c:29:28 | {"subnet_id": "760323c3-4d31-415b-b3be-5c97229f7388", "ip_address": "192.168.0.7"}    |
| 26d0cdb4-222d-4e76-bf77-1fd94e08a72c | storage_virtual_ip            | fa:16:3e:d2:86:1c | {"subnet_id": "cf9b259d-88db-4dd4-ae61-503d8f305de4", "ip_address": "10.19.105.17"}   |
| 35c4cb79-07de-4b86-b2c2-d291768ac8f5 |                               | fa:16:3e:e9:7c:d2 | {"subnet_id": "0f70ab3a-df02-46dc-a4aa-7a10e033c8ed", "ip_address": "192.168.201.13"} |
| 3b4783db-20aa-4d89-abd4-85516b87e5a5 |                               | fa:16:3e:5f:e1:57 | {"subnet_id": "ef82706a-3a49-4463-8438-ff834749c12c", "ip_address": "192.168.151.14"} |
| 3beb4f55-83f1-4301-85ff-1ea682b48ba2 |                               | fa:16:3e:70:64:77 | {"subnet_id": "cf9b259d-88db-4dd4-ae61-503d8f305de4", "ip_address": "10.19.105.12"}   |
| 44c80b48-e354-437d-8199-1704e0992b94 |                               | 00:0a:f7:79:93:70 | {"subnet_id": "760323c3-4d31-415b-b3be-5c97229f7388", "ip_address": "192.168.0.9"}    |
| 47a83681-5fa4-41a8-b892-6a8edcbf630a |                               | fa:16:3e:f1:24:4e | {"subnet_id": "ef82706a-3a49-4463-8438-ff834749c12c", "ip_address": "192.168.151.11"} |
| 56ca0386-8b90-4aa9-b75f-200d19dd8dbf |                               | fa:16:3e:af:0f:56 | {"subnet_id": "0f70ab3a-df02-46dc-a4aa-7a10e033c8ed", "ip_address": "192.168.201.11"} |
| 5847c202-5b57-43c4-9804-ebb5afa8fc7e |                               | 00:0a:f7:79:93:80 | {"subnet_id": "760323c3-4d31-415b-b3be-5c97229f7388", "ip_address": "192.168.0.8"}    |
| 5a72688e-dc50-4352-9fac-48b3e03ab505 |                               | fa:16:3e:8d:28:e3 | {"subnet_id": "18d5c7c6-319d-4d58-a6e7-6cd995164cf0", "ip_address": "10.19.184.222"}  |
| 657ef5f6-05dd-43d1-9b5d-fe5d612bce57 |                               | 00:0a:f7:79:93:cc | {"subnet_id": "760323c3-4d31-415b-b3be-5c97229f7388", "ip_address": "192.168.0.12"}   |
| 70da84ec-c49f-4f00-8e88-bc05671ee82f |                               | fa:16:3e:ff:e3:14 | {"subnet_id": "cf9b259d-88db-4dd4-ae61-503d8f305de4", "ip_address": "10.19.105.16"}   |
| 77413399-025e-4671-935c-999a768bb1c7 |                               | fa:16:3e:6b:82:9b | {"subnet_id": "cf9b259d-88db-4dd4-ae61-503d8f305de4", "ip_address": "10.19.105.11"}   |
| 83f2f014-2d39-4080-be73-26f9fc22e099 |                               | fa:16:3e:fd:11:13 | {"subnet_id": "2cc7544c-c704-4b1c-8ef0-f11c37ead3bd", "ip_address": "10.19.104.15"}   |
| 888ff481-afd1-459f-9099-d98231b0cf63 |                               | fa:16:3e:d5:cc:21 | {"subnet_id": "2cc7544c-c704-4b1c-8ef0-f11c37ead3bd", "ip_address": "10.19.104.14"}   |
| 98121b60-9474-4895-88e3-5c246fd5e03a |                               | fa:16:3e:dc:f0:b2 | {"subnet_id": "ef82706a-3a49-4463-8438-ff834749c12c", "ip_address": "192.168.151.10"} |
| a4d3094d-d77d-4700-a943-538402879030 |                               | fa:16:3e:cb:6c:22 | {"subnet_id": "18d5c7c6-319d-4d58-a6e7-6cd995164cf0", "ip_address": "10.19.184.224"}  |
| ac647535-c052-4c91-af35-e4edffe71aee |                               | fa:16:3e:51:03:f8 | {"subnet_id": "760323c3-4d31-415b-b3be-5c97229f7388", "ip_address": "192.168.0.5"}    |
| aef54162-b2cb-47e4-afd8-e0c1828dda0e |                               | fa:16:3e:38:93:4a | {"subnet_id": "cf9b259d-88db-4dd4-ae61-503d8f305de4", "ip_address": "10.19.105.13"}   |
| b194b378-f39b-443b-a3c2-c609a4ac2695 |                               | fa:16:3e:64:69:81 | {"subnet_id": "ef82706a-3a49-4463-8438-ff834749c12c", "ip_address": "192.168.151.13"} |
| c830ce86-abe5-4de6-9df9-4d251a6a23da |                               | fa:16:3e:c6:b5:99 | {"subnet_id": "2cc7544c-c704-4b1c-8ef0-f11c37ead3bd", "ip_address": "10.19.104.12"}   |
| cbe15c01-4cd7-4c8a-b640-196419e6157e |                               | 00:0a:f7:79:92:0a | {"subnet_id": "760323c3-4d31-415b-b3be-5c97229f7388", "ip_address": "192.168.0.10"}   |
| d30db48d-b6a7-42e6-80da-20eba1611555 |                               | fa:16:3e:7a:b8:0f | {"subnet_id": "cf9b259d-88db-4dd4-ae61-503d8f305de4", "ip_address": "10.19.105.14"}   |
| d4ce33c3-b1b8-4152-b617-d3904bd9e4a8 |                               | fa:16:3e:e4:71:f1 | {"subnet_id": "ef82706a-3a49-4463-8438-ff834749c12c", "ip_address": "192.168.151.12"} |
| d7e6440c-5f96-4190-a6b6-e1fa0b4b8add | internal_api_virtual_ip       | fa:16:3e:62:54:a7 | {"subnet_id": "2cc7544c-c704-4b1c-8ef0-f11c37ead3bd", "ip_address": "10.19.104.17"}   |
| d958bfd5-8924-4323-be82-d40fb639cb67 |                               | fa:16:3e:3b:e2:c6 | {"subnet_id": "2cc7544c-c704-4b1c-8ef0-f11c37ead3bd", "ip_address": "10.19.104.16"}   |
| e2e7264c-b0e0-457c-941c-c6668eb8c7ff |                               | fa:16:3e:0d:40:ca | {"subnet_id": "cf9b259d-88db-4dd4-ae61-503d8f305de4", "ip_address": "10.19.105.15"}   |
| e82661ec-a76e-4af7-88e4-5f56b0acb81f |                               | fa:16:3e:90:14:e5 | {"subnet_id": "18d5c7c6-319d-4d58-a6e7-6cd995164cf0", "ip_address": "10.19.184.223"}  |
| e992b763-bac3-4494-947e-1f4ed59672db |                               | 00:0a:f7:79:93:de | {"subnet_id": "760323c3-4d31-415b-b3be-5c97229f7388", "ip_address": "192.168.0.11"}   |
| ee9fc026-7682-4cca-ad5c-3a669e12a403 | public_virtual_ip             | fa:16:3e:2d:eb:e6 | {"subnet_id": "18d5c7c6-319d-4d58-a6e7-6cd995164cf0", "ip_address": "10.19.184.225"}  |
| eedbd05e-8570-43f5-89e4-da0068e1b520 | redis_virtual_ip              | fa:16:3e:0d:46:59 | {"subnet_id": "2cc7544c-c704-4b1c-8ef0-f11c37ead3bd", "ip_address": "10.19.104.10"}   |
| f364aba6-18dd-4e04-909e-eee825c9e439 | storage_management_virtual_ip | fa:16:3e:b9:d1:7d | {"subnet_id": "0f70ab3a-df02-46dc-a4aa-7a10e033c8ed", "ip_address": "192.168.201.15"} |
| f8991864-a8bf-444f-aa17-45c8823325a7 |                               | fa:16:3e:e0:9c:d8 | {"subnet_id": "0f70ab3a-df02-46dc-a4aa-7a10e033c8ed", "ip_address": "192.168.201.14"} |
| fd4c9b44-0a65-409e-abc3-928ec76e7039 |                               | fa:16:3e:f5:49:ea | {"subnet_id": "2cc7544c-c704-4b1c-8ef0-f11c37ead3bd", "ip_address": "10.19.104.13"}   |
+--------------------------------------+-------------------------------+-------------------+---------------------------------------------------------------------------------------+

heat resource-list -n 10 overcloud (attached file)
Created attachment 1126881 [details] resource-list
(In reply to James Slagle from comment #24)
> omri, can you provide the initial deployment command you used to deploy 7.2.
> (please confirm 7.2 is what you initially deployed as well).
>
> and then the full update command.
>
> for both commands, please provide any custom environment files, or any other
> commands that may have been run between the initial deployment and the
> update.

omri, we still need the above info as well.
(In reply to James Slagle from comment #32)
> (In reply to James Slagle from comment #24)
> > omri, can you provide the initial deployment command you used to deploy 7.2.
> > (please confirm 7.2 is what you initially deployed as well).
> >
> > and then the full update command.
> >
> > for both commands, please provide any custom environment files, or any other
> > commands that may have been run between the initial deployment and the
> > update.
>
> omri, we still need the above info as well.

confirm-versions 7.2 -> 7.3, from /var/log/yum.log:

Feb 10 12:41:27 Installed: instack-undercloud-2.1.2-36.el7ost.noarch
Feb 10 16:02:52 Updated: instack-undercloud-2.1.2-39.el7ost.noarch
------------------------------------------------------------------------
deployment-command:
openstack overcloud deploy --templates --control-scale 3 --compute-scale 2 --ceph-storage-scale 1 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server 10.5.26.10 --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml
------------------------------------------------------------------------
update-command:
yes "" | openstack overcloud update stack overcloud -i --templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/updates/update-from-vip.yaml -e network-environment.yaml

Attaching relevant .yamls
Created attachment 1127041 [details] yamls
(In reply to Omri Hochman from comment #33)
> (In reply to James Slagle from comment #32)
> > (In reply to James Slagle from comment #24)
> > > omri, can you provide the initial deployment command you used to deploy 7.2.
> > > (please confirm 7.2 is what you initially deployed as well).
> > >
> > > and then the full update command.
> > >
> > > for both commands, please provide any custom environment files, or any other
> > > commands that may have been run between the initial deployment and the
> > > update.
> >
> > omri, we still need the above info as well.
>
> confirm-versions 7.2 -> 7.3, from /var/log/yum.log:
>
> Feb 10 12:41:27 Installed: instack-undercloud-2.1.2-36.el7ost.noarch
> Feb 10 16:02:52 Updated: instack-undercloud-2.1.2-39.el7ost.noarch
> ------------------------------------------------------------------------
> deployment-command:
> openstack overcloud deploy --templates --control-scale 3 --compute-scale 2
> --ceph-storage-scale 1 --neutron-network-type vxlan --neutron-tunnel-types
> vxlan --ntp-server 10.5.26.10 --timeout 90
> -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml
> -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml
> -e network-environment.yaml
> ------------------------------------------------------------------------
> update-command:
> yes "" | openstack overcloud update stack overcloud -i --templates
> -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml
> -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml
> -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml
> -e /usr/share/openstack-tripleo-heat-templates/environments/updates/update-from-vip.yaml
> -e network-environment.yaml

You should not be passing update-from-vip.yaml when updating from 7.2. That should only be used if you initially deployed 7.1. That's likely the root cause of the issue here.
> You should not be passing update-from-vip.yaml when updating from 7.2. That
> should only be used if you initially deployed 7.1. That's likely the root
> cause of the issue here.

CLOSED as NOTABUG. The misconfiguration was caused by passing "-e .../environments/updates/update-from-vip.yaml" in the update command. That option should be used *only* when updating from 7.1 to 7.3, and not when updating from 7.2 to 7.3.
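For the 7.2 -> 7.3 case, the fix is simply to drop update-from-vip.yaml from the update command in comment #33. The sketch below shows what that corrected invocation would look like; the paths are taken from the commands quoted in this thread, but verify them against the deployment's own environment files before running:

```shell
# 7.2 -> 7.3 update: pass the same environment files used for the original
# deployment, WITHOUT environments/updates/update-from-vip.yaml (that file
# is only for overclouds originally deployed with 7.1).
yes "" | openstack overcloud update stack overcloud -i --templates \
  -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e network-environment.yaml
```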