Bug 1634005 - [OSP14] replace controller with corrupted disk failed with "overcloud.CephStorageServiceChain.ServiceChain.21:resource_type: OS::TripleO::Services::Timezone DBConnectionError: resources[21]: (pymysql.err.OperationalError) (2013, 'Lost connection to DB)
Summary: [OSP14] replace controller with corrupted disk failed with "overcloud.CephSto...
Keywords:
Status: CLOSED DUPLICATE of bug 1635664
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: beta
Target Release: ---
Assignee: RHOS Maint
QA Contact: Gurenko Alex
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-09-28 12:02 UTC by Artem Hrechanychenko
Modified: 2020-01-08 18:04 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-17 21:12:24 UTC
Target Upstream Version:
Embargoed:


Attachments
heat-engine logs (3.47 MB, application/x-gzip)
2018-10-01 06:24 UTC, Artem Hrechanychenko

Description Artem Hrechanychenko 2018-09-28 12:02:09 UTC
Description of problem:
Replacing a controller with a corrupted disk in an OSP14 overcloud with Ceph storage failed:

Stack overcloud/4037cdb7-df1e-4638-9a0b-455d313a9b27 UPDATE_FAILED

overcloud.CephStorageServiceChain.ServiceChain.21:
  resource_type: OS::TripleO::Services::Timezone
  physical_resource_id: e5252096-fc58-49de-b3e3-0bf92e27a8ab
  status: UPDATE_FAILED
  status_reason: |
    DBConnectionError: resources[21]: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
overcloud.ObjectStorageServiceChain.ServiceChain.13:
  resource_type: OS::TripleO::Services::ContainersLogrotateCrond
  physical_resource_id: 8f0796a6-3a7b-4873-ba33-772f1d2c858c
  status: UPDATE_FAILED
  status_reason: |
    DBConnectionError: resources[13]: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
Heat Stack update failed.
Heat Stack update failed.

(undercloud) [stack@undercloud-0 ~]$ heat stack-list
WARNING (shell) "heat stack-list" is deprecated, please use "openstack stack list" instead
/usr/lib/python2.7/site-packages/urllib3/connection.py:344: SubjectAltNameWarning: Certificate for 192.168.24.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
/usr/lib/python2.7/site-packages/urllib3/connection.py:344: SubjectAltNameWarning: Certificate for 192.168.24.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
+--------------------------------------+------------+---------------+----------------------+----------------------+----------------------------------+
| id                                   | stack_name | stack_status  | creation_time        | updated_time         | project                          |
+--------------------------------------+------------+---------------+----------------------+----------------------+----------------------------------+
| 4037cdb7-df1e-4638-9a0b-455d313a9b27 | overcloud  | UPDATE_FAILED | 2018-09-27T13:07:34Z | 2018-09-27T15:51:32Z | 4323ccaecaa340cb9d8a64866c614370 |
+--------------------------------------+------------+---------------+----------------------+----------------------+----------------------------------+


(undercloud) [stack@undercloud-0 ~]$ openstack stack failures list overcloud
overcloud.CephStorageServiceChain.ServiceChain.21:
  resource_type: OS::TripleO::Services::Timezone
  physical_resource_id: e5252096-fc58-49de-b3e3-0bf92e27a8ab
  status: UPDATE_FAILED
  status_reason: |
    DBConnectionError: resources[21]: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
overcloud.ControllerServiceChain.ServiceChain.55.MySQLClient:
  resource_type: https://192.168.24.2:13808/v1/AUTH_4323ccaecaa340cb9d8a64866c614370/overcloud/puppet/services/database/mysql-client.yaml
  physical_resource_id: 25155f4c-aa7f-49d1-b802-e384235d5f66
  status: UPDATE_FAILED
  status_reason: |
    resources.MySQLClient: Stack UPDATE cancelled
overcloud.ControllerServiceChain.ServiceChain.115:
  resource_type: OS::TripleO::Services::Ntp
  physical_resource_id: 6ad4700a-4621-4cf4-91b4-92599373acec
  status: UPDATE_FAILED
  status_reason: |
    MessagingTimeout: resources[115]: Timed out waiting for a reply to message ID 07c14b5ac448451a99b96f1a7a8cb23b
overcloud.ControllerServiceChain.ServiceChain.12.CeilometerAgentCentralBase:
  resource_type: https://192.168.24.2:13808/v1/AUTH_4323ccaecaa340cb9d8a64866c614370/overcloud/puppet/services/ceilometer-agent-central.yaml
  physical_resource_id: 7019040c-a495-4785-a51a-457849eb6de9
  status: UPDATE_FAILED
  status_reason: |
    MessagingTimeout: resources.CeilometerAgentCentralBase: Timed out waiting for a reply to message ID 0c15bfd9b3744afab6c744084a631d07
overcloud.ControllerServiceChain.ServiceChain.71.KeystoneBase:
  resource_type: https://192.168.24.2:13808/v1/AUTH_4323ccaecaa340cb9d8a64866c614370/overcloud/puppet/services/keystone.yaml
  physical_resource_id: 7813e146-4877-42ab-ab62-4cd0a23dc559
  status: UPDATE_FAILED
  status_reason: |
    resources.KeystoneBase: Stack UPDATE cancelled
overcloud.ObjectStorageServiceChain.ServiceChain.13:
  resource_type: OS::TripleO::Services::ContainersLogrotateCrond
  physical_resource_id: 8f0796a6-3a7b-4873-ba33-772f1d2c858c
  status: UPDATE_FAILED
  status_reason: |
    DBConnectionError: resources[13]: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
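
The failure list above mixes several kinds of error: lost database connections, messaging timeouts, and resources that were only cancelled once the update had already failed. A minimal sketch for grouping such output by error class follows. It assumes the layout shown above (an unindented resource path ending in a colon, with the reason on the line after "status_reason: |"). The function name group_failures and the SAMPLE excerpt are illustrative, not part of any OpenStack tool.

```python
import re
from collections import defaultdict

# Excerpt of the output above, reduced to the status fields (illustrative).
SAMPLE = """\
overcloud.CephStorageServiceChain.ServiceChain.21:
  status: UPDATE_FAILED
  status_reason: |
    DBConnectionError: resources[21]: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
overcloud.ControllerServiceChain.ServiceChain.55.MySQLClient:
  status: UPDATE_FAILED
  status_reason: |
    resources.MySQLClient: Stack UPDATE cancelled
overcloud.ControllerServiceChain.ServiceChain.115:
  status: UPDATE_FAILED
  status_reason: |
    MessagingTimeout: resources[115]: Timed out waiting for a reply to message ID 07c14b5ac448451a99b96f1a7a8cb23b
overcloud.ObjectStorageServiceChain.ServiceChain.13:
  status: UPDATE_FAILED
  status_reason: |
    DBConnectionError: resources[13]: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
"""


def group_failures(text):
    """Return {error class: [resource paths]} for a failures listing."""
    groups = defaultdict(list)
    resource = None
    expect_reason = False
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace() and line.rstrip().endswith(":"):
            # Unindented "path:" line starts a new failed resource.
            resource = line.rstrip()[:-1]
            expect_reason = False
        elif line.strip().startswith("status_reason:"):
            expect_reason = True
        elif expect_reason and resource is not None:
            reason = line.strip()
            if "Stack UPDATE cancelled" in reason:
                kind = "cancelled"  # knock-on effect, not a root cause
            else:
                match = re.match(r"(\w+):", reason)
                kind = match.group(1) if match else "other"
            groups[kind].append(resource)
            expect_reason = False
    return dict(groups)


for kind, resources in group_failures(SAMPLE).items():
    print(kind, len(resources))
```

Treating "Stack UPDATE cancelled" as its own bucket keeps the DBConnectionError and MessagingTimeout entries, which are the likely starting points for investigation, separate from resources that only stopped because the update was aborted.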


(undercloud) [stack@undercloud-0 ~]$ nova list
/usr/lib/python2.7/site-packages/urllib3/connection.py:344: SubjectAltNameWarning: Certificate for 192.168.24.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
/usr/lib/python2.7/site-packages/urllib3/connection.py:344: SubjectAltNameWarning: Certificate for 192.168.24.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
+--------------------------------------+--------------+---------+------------+-------------+------------------------+
| ID                                   | Name         | Status  | Task State | Power State | Networks               |
+--------------------------------------+--------------+---------+------------+-------------+------------------------+
| f99495be-1de7-4c8f-9437-7a446b08f3cc | ceph-0       | ACTIVE  | -          | Running     | ctlplane=192.168.24.15 |
| 34750643-302c-4fe1-851a-fd1fd68ccf8e | ceph-1       | ACTIVE  | -          | Running     | ctlplane=192.168.24.13 |
| 1b39d687-6cf2-499d-a17e-9a6c6375afbc | ceph-2       | ACTIVE  | -          | Running     | ctlplane=192.168.24.9  |
| b81c5ce5-df19-41da-a885-8373fd1ff2b9 | compute-0    | ACTIVE  | -          | Running     | ctlplane=192.168.24.12 |
| e21fe795-c0f9-4791-86e3-b102f037fbc2 | controller-0 | ACTIVE  | -          | Running     | ctlplane=192.168.24.7  |
| 8104c7b3-f045-4f0c-94c6-13f8d46e99b5 | controller-1 | SHUTOFF | -          | Shutdown    | ctlplane=192.168.24.11 |
| 5853e518-165e-4db3-9d6f-b9228fa58ebe | controller-2 | ACTIVE  | -          | Running     | ctlplane=192.168.24.8  |
+--------------------------------------+--------------+---------+------------+-------------+------------------------+


Version-Release number of selected component (if applicable):
OSP14 puddle - 2018-09-06.1
openstack-tripleo-common-9.3.1-0.20180831204016.bb0582a.el7ost.noarch
openstack-tripleo-puppet-elements-9.0.0-0.20180831205939.0641fdc.el7ost.noarch
openstack-heat-monolith-11.0.1-0.20180901130821.680a515.el7ost.noarch
puppet-openstacklib-13.3.1-0.20180822220049.72521cd.el7ost.noarch
python2-openstackclient-3.16.0-0.20180809175603.f77ca68.el7ost.noarch
openstack-selinux-0.8.15-0.20180823061238.b63283a.el7ost.noarch
openstack-tripleo-validations-9.3.1-0.20180831205305.fbfd253.el7ost.noarch
openstack-heat-api-11.0.1-0.20180901130821.680a515.el7ost.noarch
openstack-tripleo-image-elements-9.0.0-0.20180831210308.2dc678a.el7ost.noarch
python2-openstacksdk-0.17.2-0.20180809182656.3ad9dab.el7ost.noarch
openstack-heat-common-11.0.1-0.20180901130821.680a515.el7ost.noarch
openstack-tripleo-common-containers-9.3.1-0.20180831204016.bb0582a.el7ost.noarch
python-openstackclient-lang-3.16.0-0.20180809175603.f77ca68.el7ost.noarch
openstack-heat-agents-1.7.1-0.20180829044839.24f9e9c.el7ost.noarch
openstack-tripleo-heat-templates-9.0.0-0.20180831204457.17bb71e.0rc1.el7ost.noarch
puppet-openstack_extras-13.3.1-0.20180831173811.9fc5de6.el7ost.noarch
openstack-heat-engine-11.0.1-0.20180901130821.680a515.el7ost.noarch


How reproducible:


Steps to Reproduce:
1. Deploy OSP14 (3ctr+3comp+3ceph) with fencing enabled
2. Corrupt the disk of a controller node
3. Check that the overcloud is operable
4. Try to replace the controller using a new Ironic node, following https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/director_installation_and_usage/#sect-Replacing_Controller_Nodes-Preliminary_Checks

Actual results:
The stack update failed with many resource failures caused by timeout errors

Expected results:
The update fails at the expected stage: an UPDATE_FAILED error at ControllerDeployment_Step1.x

Additional info:

Comment 2 Artem Hrechanychenko 2018-09-28 12:54:32 UTC
The reports should be available here: http://rhos-release.virt.bos.redhat.com/log/bz1634005

Comment 3 James Slagle 2018-09-28 18:43:44 UTC
Please provide the heat logs from when the error actually occurred. We are missing the heat-engine logs from 9/27 in the sosreport.

Comment 4 Artem Hrechanychenko 2018-10-01 06:24:39 UTC
Created attachment 1488875
heat-engine logs

Comment 5 James Slagle 2018-10-01 14:56:59 UTC
We need the logs from when the error actually occurred, not today's log.
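
One way to check whether a log file covers the day of the failure (9/27, per the stack's updated_time above) is to filter it by date. The sketch below assumes each log record starts with a "YYYY-MM-DD HH:MM:SS" timestamp, and that lines without one (such as traceback lines) belong to the preceding record. The function name lines_for_date is hypothetical, and the sample lines in the usage example are placeholders, not real heat-engine output.

```python
import re

# A record is assumed to begin with "YYYY-MM-DD HH:MM:SS".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}) \d{2}:\d{2}:\d{2}")


def lines_for_date(lines, date):
    """Yield log lines whose record timestamp falls on the given date."""
    keep = False
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            # A new record starts here; decide whether to keep it.
            keep = match.group(1) == date
        if keep:
            # Also emits untimestamped continuation lines of a kept record.
            yield line


# Placeholder lines for illustration only (not real heat-engine output).
example = [
    "2018-09-27 15:51:32.000 placeholder record A",
    "    continuation of record A",
    "2018-10-01 06:00:00.000 placeholder record B",
]
for line in lines_for_date(example, "2018-09-27"):
    print(line)
```

If the filter yields nothing for 2018-09-27, the file does not contain the records from the time of the failure.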

Comment 6 James Slagle 2018-10-17 21:12:24 UTC
I'm closing this one as a dupe of bug 1635664.

If you're able to reproduce it even after increasing undercloud hw resources, please reopen it and provide the requested data.

*** This bug has been marked as a duplicate of bug 1635664 ***

