Description of problem:

Simulated a disk outage on a controller node by overwriting its disk image on the hypervisor:

[root@seal33 ~ ]# dd if=/dev/zero of=/var/lib/libvirt/images/controller-1-disk1.qcow2 bs=600M count=5

| 36657447-9ca3-482d-9134-e62322e055ba | controller-1 | ACTIVE | - | Running | ctlplane=192.168.24.10 |

(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.10 -vvv
OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 58: Applying options for *
debug2: resolving "192.168.24.10" port 22
debug2: ssh_connect_direct: needpriv 0
debug1: Connecting to 192.168.24.10 [192.168.24.10] port 22.
debug1: Connection established.
debug1: identity file /home/stack/.ssh/id_rsa type 1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_7.4

So the node became unreachable. After that we powered off the node using Ironic, and we were not able to remove the Ceph monitor from controller-1, because the documented procedure assumes the controller is reachable:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/director_installation_and_usage/#sect-Replacing_Controller_Nodes

  11.4.2. Removing a Ceph Monitor Daemon

  This procedure removes a ceph-mon daemon from the storage cluster. If your Controller node is running a Ceph monitor service, complete the following steps to remove the ceph-mon daemon. This procedure assumes the Controller is reachable.

  Note: A new Ceph monitor daemon will be added after a new Controller is added to the cluster.

  Connect to the controller to be replaced and become root:
    # ssh heat-admin.0.47
    # sudo su -
  As root, stop the monitor:
    # systemctl stop ceph-mon@<monitor_hostname>
  For example:
    # systemctl stop ceph-mon@overcloud-controller-2
  Remove the monitor from the cluster:
    # ceph mon remove <mon_id>

As a result, the replacement procedure failed:

(undercloud) [stack@undercloud-0 ~]$ openstack stack failures list --long overcloud
overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution:
  resource_type: OS::TripleO::WorkflowSteps
  physical_resource_id: 80325942-11a8-4919-98a9-7f53ce670674
  status: CREATE_FAILED
  status_reason: |
    resources.WorkflowTasks_Step2_Execution: ERROR

Version-Release number of selected component (if applicable):

core_puddle_version = 2018-07-06.1

openstack-nova-compute-17.0.3-0.20180420001142.el7ost.noarch
openstack-glance-16.0.1-2.el7ost.noarch
openstack-nova-common-17.0.3-0.20180420001142.el7ost.noarch
openstack-neutron-openvswitch-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-heat-api-10.0.1-0.20180411125640.el7ost.noarch
python-openstackclient-lang-3.14.1-1.el7ost.noarch
openstack-ironic-conductor-10.1.2-4.el7ost.noarch
openstack-tripleo-validations-8.4.1-5.el7ost.noarch
openstack-nova-api-17.0.3-0.20180420001142.el7ost.noarch
openstack-nova-conductor-17.0.3-0.20180420001142.el7ost.noarch
openstack-swift-object-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-neutron-ml2-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-mistral-engine-6.0.2-1.el7ost.noarch
python2-openstackclient-3.14.1-1.el7ost.noarch
puppet-openstacklib-12.4.0-0.20180329042555.4b30e6f.el7ost.noarch
openstack-neutron-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-heat-engine-10.0.1-0.20180411125640.el7ost.noarch
openstack-ironic-staging-drivers-0.9.0-4.el7ost.noarch
openstack-tripleo-image-elements-8.0.1-1.el7ost.noarch
openstack-selinux-0.8.14-12.el7ost.noarch
openstack-tripleo-heat-templates-8.0.2-43.el7ost.noarch
openstack-swift-account-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-neutron-common-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-swift-proxy-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-heat-common-10.0.1-0.20180411125640.el7ost.noarch
openstack-ironic-api-10.1.2-4.el7ost.noarch
openstack-ironic-inspector-7.2.1-0.20180409163360.el7ost.noarch
openstack-tempest-18.0.0-2.el7ost.noarch
openstack-mistral-common-6.0.2-1.el7ost.noarch
openstack-tripleo-ui-8.3.1-3.el7ost.noarch
openstack-zaqar-6.0.1-1.el7ost.noarch
openstack-nova-placement-api-17.0.3-0.20180420001142.el7ost.noarch
openstack-keystone-13.0.1-0.20180420194847.7bd6454.el7ost.noarch
puppet-openstack_extras-12.4.1-0.20180413042250.2634296.el7ost.noarch
openstack-tripleo-puppet-elements-8.0.0-2.el7ost.noarch
openstack-mistral-api-6.0.2-1.el7ost.noarch
openstack-tripleo-common-8.6.1-23.el7ost.noarch
openstack-heat-api-cfn-10.0.1-0.20180411125640.el7ost.noarch
openstack-tripleo-common-containers-8.6.1-23.el7ost.noarch
python2-openstacksdk-0.11.3-1.el7ost.noarch
openstack-nova-scheduler-17.0.3-0.20180420001142.el7ost.noarch
openstack-swift-container-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-ironic-common-10.1.2-4.el7ost.noarch
openstack-mistral-executor-6.0.2-1.el7ost.noarch

How reproducible:
always

Steps to Reproduce:
1. Deploy OSP13 with the latest passed_phase2 puddle.
2. On the hypervisor, find the qcow2 disk of a controller and corrupt it.
3. Set the failed node to the powered-off state using Ironic.
4. Try to replace the controller following the official docs: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/director_installation_and_usage/#sect-Replacing_Controller_Nodes

Actual results:
overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution:
  resource_type: OS::TripleO::WorkflowSteps
  physical_resource_id: 80325942-11a8-4919-98a9-7f53ce670674
  status: CREATE_FAILED
  status_reason: |
    resources.WorkflowTasks_Step2_Execution: ERROR

Expected results:
The controller is replaced and the overcloud is operable.

Additional info:
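Since the quoted procedure assumes the failed controller is reachable, one possible workaround is to issue the "ceph mon remove" step from a controller that is still up. The following is only a sketch under stated assumptions, not the documented procedure and not necessarily what the fix tracked in bug 1548026 does. The helper names, the DRY_RUN switch, and the surviving controller's address (192.168.24.20 in the usage note) are hypothetical, and it assumes the monitor ID matches the failed node's short hostname.

```shell
#!/usr/bin/env bash
# Sketch only: remove a dead monitor from the Ceph cluster via a surviving controller.
# Assumptions (not from the report): the surviving controller's address, and that the
# monitor ID equals the failed node's short hostname.

# Build the removal command; "ceph mon remove <mon_id>" is the command quoted from the docs.
build_mon_remove_cmd() {
  local mon_id="$1"
  printf 'sudo ceph mon remove %s' "$mon_id"
}

# Print the command by default; run it over SSH only when DRY_RUN=0.
remove_dead_mon() {
  local surviving_host="$1" mon_id="$2" cmd
  cmd="$(build_mon_remove_cmd "$mon_id")"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    ssh "heat-admin@${surviving_host}" "$cmd"
  else
    printf 'ssh heat-admin@%s %s\n' "$surviving_host" "$cmd"
  fi
}
```

For example, "remove_dead_mon 192.168.24.20 controller-1" prints the command that would be run, and "DRY_RUN=0 remove_dead_mon 192.168.24.20 controller-1" executes it. Running "ceph mon stat" on the surviving controller afterwards should show whether the monitor is gone from the monmap.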
The reports should be available here: http://rhos-release.virt.bos.redhat.com/log/bz1600202
I would rather not close this BZ until we verify that, with bug 1548026 fixed, we can also test replacing a controller with a failed HD. Adding a dependency on #1548026.
Changing the Target Release to osp13-z2, in line with bug 1548026.
The patch that should fix this, tracked in bug 1548026, has merged upstream: https://review.openstack.org/#/c/583229
Giving this a more meaningful subject (removing unnecessary text around it)
This bug is marked for inclusion in the errata but does not currently contain draft documentation text. To ensure the timely release of this advisory please provide draft documentation text for this bug as soon as possible.

If you do not think this bug requires errata documentation, set the requires_doc_text flag to "-".

To add draft documentation text:

* Select the documentation type from the "Doc Type" drop down field.
* A template will be provided in the "Doc Text" field based on the "Doc Type" value selected. Enter draft text in the "Doc Text" field.
Omri, Are you planning to verify this for R13Z2 release? Gal.
(In reply to Gal Amado from comment #17)
> Omri,
> Are you planning to verify this for R13Z2 release?
> Gal.

Yes, we're aiming to test this for osp13.z.
VERIFIED

Used https://docs.google.com/document/d/1738ZeETl3f1-0ieOSBDjVBWqC8xfruI-F4HPa7eixhk/edit for removing the Ceph monitor and https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/director_installation_and_usage/#sect-Replacing_Controller_Nodes for the controller replacement.

Verified with: openstack-tripleo-common-8.6.3-10.el7ost.noarch
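Before retrying the replacement, it may help to confirm that the undercloud has at least the openstack-tripleo-common build this bug was verified with. A minimal sketch, assuming GNU "sort -V" is available and is a good enough approximation of RPM version ordering for these strings (it is not the same algorithm as rpm's own comparison). The function names and the VERIFIED_VR variable are hypothetical.

```shell
#!/usr/bin/env bash
# version_ge A B: succeed when version string A sorts at or after B (GNU "sort -V").
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Version-release this bug was verified with, taken from the comment above.
VERIFIED_VR="8.6.3-10.el7ost"

# Succeeds when the installed openstack-tripleo-common is at least VERIFIED_VR;
# returns 2 when the package is not installed or rpm fails.
check_tripleo_common() {
  local installed
  installed="$(rpm -q --queryformat '%{VERSION}-%{RELEASE}' openstack-tripleo-common)" || return 2
  version_ge "$installed" "$VERIFIED_VR"
}
```

On the undercloud, "check_tripleo_common && echo OK" would report whether the installed build is at or above the verified one. The build from the original report, 8.6.1-23.el7ost, would fail this check.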
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2574