Description of problem:
Replacing a controller with a corrupted disk in an OSP14 overcloud with Ceph storage failed:

Stack overcloud/4037cdb7-df1e-4638-9a0b-455d313a9b27 UPDATE_FAILED

overcloud.CephStorageServiceChain.ServiceChain.21:
  resource_type: OS::TripleO::Services::Timezone
  physical_resource_id: e5252096-fc58-49de-b3e3-0bf92e27a8ab
  status: UPDATE_FAILED
  status_reason: |
    DBConnectionError: resources[21]: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
overcloud.ObjectStorageServiceChain.ServiceChain.13:
  resource_type: OS::TripleO::Services::ContainersLogrotateCrond
  physical_resource_id: 8f0796a6-3a7b-4873-ba33-772f1d2c858c
  status: UPDATE_FAILED
  status_reason: |
    DBConnectionError: resources[13]: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
Heat Stack update failed.
Heat Stack update failed.

(undercloud) [stack@undercloud-0 ~]$ heat stack-list
WARNING (shell) "heat stack-list" is deprecated, please use "openstack stack list" instead
/usr/lib/python2.7/site-packages/urllib3/connection.py:344: SubjectAltNameWarning: Certificate for 192.168.24.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
+--------------------------------------+------------+---------------+----------------------+----------------------+----------------------------------+
| id                                   | stack_name | stack_status  | creation_time        | updated_time         | project                          |
+--------------------------------------+------------+---------------+----------------------+----------------------+----------------------------------+
| 4037cdb7-df1e-4638-9a0b-455d313a9b27 | overcloud  | UPDATE_FAILED | 2018-09-27T13:07:34Z | 2018-09-27T15:51:32Z | 4323ccaecaa340cb9d8a64866c614370 |
+--------------------------------------+------------+---------------+----------------------+----------------------+----------------------------------+

(undercloud) [stack@undercloud-0 ~]$ openstack stack failures list overcloud
overcloud.CephStorageServiceChain.ServiceChain.21:
  resource_type: OS::TripleO::Services::Timezone
  physical_resource_id: e5252096-fc58-49de-b3e3-0bf92e27a8ab
  status: UPDATE_FAILED
  status_reason: |
    DBConnectionError: resources[21]: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
overcloud.ControllerServiceChain.ServiceChain.55.MySQLClient:
  resource_type: https://192.168.24.2:13808/v1/AUTH_4323ccaecaa340cb9d8a64866c614370/overcloud/puppet/services/database/mysql-client.yaml
  physical_resource_id: 25155f4c-aa7f-49d1-b802-e384235d5f66
  status: UPDATE_FAILED
  status_reason: |
    resources.MySQLClient: Stack UPDATE cancelled
overcloud.ControllerServiceChain.ServiceChain.115:
  resource_type: OS::TripleO::Services::Ntp
  physical_resource_id: 6ad4700a-4621-4cf4-91b4-92599373acec
  status: UPDATE_FAILED
  status_reason: |
    MessagingTimeout: resources[115]: Timed out waiting for a reply to message ID 07c14b5ac448451a99b96f1a7a8cb23b
overcloud.ControllerServiceChain.ServiceChain.12.CeilometerAgentCentralBase:
  resource_type: https://192.168.24.2:13808/v1/AUTH_4323ccaecaa340cb9d8a64866c614370/overcloud/puppet/services/ceilometer-agent-central.yaml
  physical_resource_id: 7019040c-a495-4785-a51a-457849eb6de9
  status: UPDATE_FAILED
  status_reason: |
    MessagingTimeout: resources.CeilometerAgentCentralBase: Timed out waiting for a reply to message ID 0c15bfd9b3744afab6c744084a631d07
overcloud.ControllerServiceChain.ServiceChain.71.KeystoneBase:
  resource_type: https://192.168.24.2:13808/v1/AUTH_4323ccaecaa340cb9d8a64866c614370/overcloud/puppet/services/keystone.yaml
  physical_resource_id: 7813e146-4877-42ab-ab62-4cd0a23dc559
  status: UPDATE_FAILED
  status_reason: |
    resources.KeystoneBase: Stack UPDATE cancelled
overcloud.ObjectStorageServiceChain.ServiceChain.13:
  resource_type: OS::TripleO::Services::ContainersLogrotateCrond
  physical_resource_id: 8f0796a6-3a7b-4873-ba33-772f1d2c858c
  status: UPDATE_FAILED
  status_reason: |
    DBConnectionError: resources[13]: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)

(undercloud) [stack@undercloud-0 ~]$ nova list
/usr/lib/python2.7/site-packages/urllib3/connection.py:344: SubjectAltNameWarning: Certificate for 192.168.24.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
+--------------------------------------+--------------+---------+------------+-------------+------------------------+
| ID                                   | Name         | Status  | Task State | Power State | Networks               |
+--------------------------------------+--------------+---------+------------+-------------+------------------------+
| f99495be-1de7-4c8f-9437-7a446b08f3cc | ceph-0       | ACTIVE  | -          | Running     | ctlplane=192.168.24.15 |
| 34750643-302c-4fe1-851a-fd1fd68ccf8e | ceph-1       | ACTIVE  | -          | Running     | ctlplane=192.168.24.13 |
| 1b39d687-6cf2-499d-a17e-9a6c6375afbc | ceph-2       | ACTIVE  | -          | Running     | ctlplane=192.168.24.9  |
| b81c5ce5-df19-41da-a885-8373fd1ff2b9 | compute-0    | ACTIVE  | -          | Running     | ctlplane=192.168.24.12 |
| e21fe795-c0f9-4791-86e3-b102f037fbc2 | controller-0 | ACTIVE  | -          | Running     | ctlplane=192.168.24.7  |
| 8104c7b3-f045-4f0c-94c6-13f8d46e99b5 | controller-1 | SHUTOFF | -          | Shutdown    | ctlplane=192.168.24.11 |
| 5853e518-165e-4db3-9d6f-b9228fa58ebe | controller-2 | ACTIVE  | -          | Running     | ctlplane=192.168.24.8  |
+--------------------------------------+--------------+---------+------------+-------------+------------------------+

Version-Release number of selected component (if applicable):
OSP14 puddle - 2018-09-06.1
openstack-tripleo-common-9.3.1-0.20180831204016.bb0582a.el7ost.noarch
openstack-tripleo-puppet-elements-9.0.0-0.20180831205939.0641fdc.el7ost.noarch
openstack-heat-monolith-11.0.1-0.20180901130821.680a515.el7ost.noarch
puppet-openstacklib-13.3.1-0.20180822220049.72521cd.el7ost.noarch
python2-openstackclient-3.16.0-0.20180809175603.f77ca68.el7ost.noarch
openstack-selinux-0.8.15-0.20180823061238.b63283a.el7ost.noarch
openstack-tripleo-validations-9.3.1-0.20180831205305.fbfd253.el7ost.noarch
openstack-heat-api-11.0.1-0.20180901130821.680a515.el7ost.noarch
openstack-tripleo-image-elements-9.0.0-0.20180831210308.2dc678a.el7ost.noarch
python2-openstacksdk-0.17.2-0.20180809182656.3ad9dab.el7ost.noarch
openstack-heat-common-11.0.1-0.20180901130821.680a515.el7ost.noarch
openstack-tripleo-common-containers-9.3.1-0.20180831204016.bb0582a.el7ost.noarch
python-openstackclient-lang-3.16.0-0.20180809175603.f77ca68.el7ost.noarch
openstack-heat-agents-1.7.1-0.20180829044839.24f9e9c.el7ost.noarch
openstack-tripleo-heat-templates-9.0.0-0.20180831204457.17bb71e.0rc1.el7ost.noarch
puppet-openstack_extras-13.3.1-0.20180831173811.9fc5de6.el7ost.noarch
openstack-heat-engine-11.0.1-0.20180901130821.680a515.el7ost.noarch

How reproducible:

Steps to Reproduce:
1. Deploy OSP14 (3ctr+3comp+3ceph) with fencing enabled.
2. Corrupt the disk of a controller node.
3. Check that the overcloud is still operable.
4. Try to replace the controller using a new ironic node:
   https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/director_installation_and_usage/#sect-Replacing_Controller_Nodes-Preliminary_Checks

Actual results:
The stack update failed with many failures caused by timeout errors.

Expected results:
The update fails at the expected stage: an UPDATE_FAILED error at ControllerDeployment_Step1.x.

Additional info:
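The six entries in the `openstack stack failures list overcloud` output from the description fall into a few classes, which is easier to see when they are counted. The following is only a rough sketch for triage, not part of any OpenStack tooling: `classify` and `summarize` are hypothetical helper names, and the sample text is the resource names and status reasons from the description with the other fields and the trailing "Background on this error" note trimmed.

```python
import re
from collections import Counter

# Resource names and status reasons copied from the failure list in the
# description; other fields and the trailing background URL are trimmed.
SAMPLE = """\
overcloud.CephStorageServiceChain.ServiceChain.21:
  status_reason: |
    DBConnectionError: resources[21]: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
overcloud.ControllerServiceChain.ServiceChain.55.MySQLClient:
  status_reason: |
    resources.MySQLClient: Stack UPDATE cancelled
overcloud.ControllerServiceChain.ServiceChain.115:
  status_reason: |
    MessagingTimeout: resources[115]: Timed out waiting for a reply to message ID 07c14b5ac448451a99b96f1a7a8cb23b
overcloud.ControllerServiceChain.ServiceChain.12.CeilometerAgentCentralBase:
  status_reason: |
    MessagingTimeout: resources.CeilometerAgentCentralBase: Timed out waiting for a reply to message ID 0c15bfd9b3744afab6c744084a631d07
overcloud.ControllerServiceChain.ServiceChain.71.KeystoneBase:
  status_reason: |
    resources.KeystoneBase: Stack UPDATE cancelled
overcloud.ObjectStorageServiceChain.ServiceChain.13:
  status_reason: |
    DBConnectionError: resources[13]: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
"""


def classify(reason):
    """Return a coarse label for one status_reason line (hypothetical helper)."""
    if "Stack UPDATE cancelled" in reason:
        return "cancelled"
    # Exception-style reasons start with the exception class name and a colon.
    match = re.match(r"([A-Za-z]+):", reason)
    return match.group(1) if match else "other"


def summarize(text):
    """Count failure classes, reading the line that follows each 'status_reason: |'."""
    counts = Counter()
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "status_reason: |" and i + 1 < len(lines):
            counts[classify(lines[i + 1].strip())] += 1
    return counts


if __name__ == "__main__":
    for label, count in sorted(summarize(SAMPLE).items()):
        print(label, count)
```

On the sample this gives two `DBConnectionError`, two `MessagingTimeout`, and two cancelled resources, so every non-cancelled failure in the list is a lost database connection or a messaging timeout.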
The reports should be available here: http://rhos-release.virt.bos.redhat.com/log/bz1634005
Please provide the heat logs from when the error actually occurred. The sosreport is missing the heat-engine logs from 9/27.
Created attachment 1488875: heat-engine logs
We need the logs from when the error actually occurred, not today's log.
I'm closing this one as a dupe of bug 1635664. If you're able to reproduce it even after increasing the undercloud hardware resources, please reopen it and provide the requested data.

*** This bug has been marked as a duplicate of bug 1635664 ***