Bug 1600202 - Cannot replace controller after disk outage
Summary: Cannot replace controller after disk outage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 13.0 (Queens)
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: z2
Target Release: 13.0 (Queens)
Assignee: John Fulton
QA Contact: Artem Hrechanychenko
URL:
Whiteboard: DFG:Storage
Depends On: 1548026
Blocks:
 
Reported: 2018-07-11 16:46 UTC by Artem Hrechanychenko
Modified: 2022-03-13 15:13 UTC (History)

Fixed In Version: openstack-tripleo-common-8.6.3-4.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-29 16:37:56 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2018:2574 | 0 | None | None | None | 2018-08-29 16:38:58 UTC

Description Artem Hrechanychenko 2018-07-11 16:46:49 UTC
Description of problem:
Simulated a disk outage on a controller node by overwriting its disk image on the hypervisor:
[root@seal33 ~ ]# dd if=/dev/zero of=/var/lib/libvirt/images/controller-1-disk1.qcow2  bs=600M count=5


| 36657447-9ca3-482d-9134-e62322e055ba | controller-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.10 |


(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.10 -vvv
OpenSSH_7.4p1, OpenSSL 1.0.2k-fips  26 Jan 2017
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 58: Applying options for *
debug2: resolving "192.168.24.10" port 22
debug2: ssh_connect_direct: needpriv 0
debug1: Connecting to 192.168.24.10 [192.168.24.10] port 22.
debug1: Connection established.
debug1: identity file /home/stack/.ssh/id_rsa type 1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_7.4

So the node became unreachable.
After that we powered off the node using Ironic, and we were not able to remove the Ceph monitor from controller-1 by following the documented procedure:
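
The hang shown in the -vvv session can be detected with a bounded probe instead of waiting on the SSH client. This is a minimal sketch, assuming the heat-admin user and the controller-1 address from the output above; the function name node_reachable and the SSH_BIN override are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of TripleO): report whether a node answers SSH
# within a bounded time instead of hanging as in the -vvv session above.
# SSH_BIN can be overridden for dry runs; it defaults to the real ssh client.
node_reachable() {
    host="$1"
    # BatchMode avoids password prompts, ConnectTimeout bounds the connection
    # attempt, and the outer "timeout" caps the whole probe at 15 seconds.
    if timeout 15 "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 \
        "heat-admin@${host}" true >/dev/null 2>&1; then
        echo "reachable"
    else
        echo "unreachable"
    fi
}
```

For example, `node_reachable 192.168.24.10` could be run from the undercloud before deciding whether the documented monitor removal steps can be followed on the node itself.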

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/director_installation_and_usage/#sect-Replacing_Controller_Nodes

11.4.2. Removing a Ceph Monitor Daemon

This procedure removes a ceph-mon daemon from the storage cluster. If your Controller node is running a Ceph monitor service, complete the following steps to remove the ceph-mon daemon. This procedure assumes the Controller is reachable.
Note

A new Ceph monitor daemon will be added after a new Controller is added to the cluster.

    Connect to the controller to be replaced and become root:

    # ssh heat-admin.0.47
    # sudo su -

    As root, stop the monitor:

    # systemctl stop ceph-mon@<monitor_hostname>

    For example:

    # systemctl stop ceph-mon@overcloud-controller-2

    Remove the monitor from the cluster:

    # ceph mon remove <mon_id>


As a result, the replacement procedure failed:


(undercloud) [stack@undercloud-0 ~]$ openstack stack failures list --long overcloud
overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution:
  resource_type: OS::TripleO::WorkflowSteps
  physical_resource_id: 80325942-11a8-4919-98a9-7f53ce670674
  status: CREATE_FAILED
  status_reason: |
    resources.WorkflowTasks_Step2_Execution: ERROR
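
When the failure list is long, the failed resource paths can be pulled out of this output with a small filter. This is a minimal sketch, assuming the output keeps the layout shown above (resource path in column 0 ending with a colon, details indented); the function name failed_resources is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list failed resource paths from the YAML-like output of
# "openstack stack failures list". Top-level keys start in column 0 and end
# with a colon; indented lines are details of that resource and are skipped.
failed_resources() {
    awk '/^[^ \t].*:$/ { sub(/:$/, ""); print }'
}
```

Usage, on the undercloud: `openstack stack failures list --long overcloud | failed_resources`.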



Version-Release number of selected component (if applicable):
core_puddle_version = 2018-07-06.1

openstack-nova-compute-17.0.3-0.20180420001142.el7ost.noarch
openstack-glance-16.0.1-2.el7ost.noarch
openstack-nova-common-17.0.3-0.20180420001142.el7ost.noarch
openstack-neutron-openvswitch-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-heat-api-10.0.1-0.20180411125640.el7ost.noarch
python-openstackclient-lang-3.14.1-1.el7ost.noarch
openstack-ironic-conductor-10.1.2-4.el7ost.noarch
openstack-tripleo-validations-8.4.1-5.el7ost.noarch
openstack-nova-api-17.0.3-0.20180420001142.el7ost.noarch
openstack-nova-conductor-17.0.3-0.20180420001142.el7ost.noarch
openstack-swift-object-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-neutron-ml2-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-mistral-engine-6.0.2-1.el7ost.noarch
python2-openstackclient-3.14.1-1.el7ost.noarch
puppet-openstacklib-12.4.0-0.20180329042555.4b30e6f.el7ost.noarch
openstack-neutron-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-heat-engine-10.0.1-0.20180411125640.el7ost.noarch
openstack-ironic-staging-drivers-0.9.0-4.el7ost.noarch
openstack-tripleo-image-elements-8.0.1-1.el7ost.noarch
openstack-selinux-0.8.14-12.el7ost.noarch
openstack-tripleo-heat-templates-8.0.2-43.el7ost.noarch
openstack-swift-account-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-neutron-common-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-swift-proxy-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-heat-common-10.0.1-0.20180411125640.el7ost.noarch
openstack-ironic-api-10.1.2-4.el7ost.noarch
openstack-ironic-inspector-7.2.1-0.20180409163360.el7ost.noarch
openstack-tempest-18.0.0-2.el7ost.noarch
openstack-mistral-common-6.0.2-1.el7ost.noarch
openstack-tripleo-ui-8.3.1-3.el7ost.noarch
openstack-zaqar-6.0.1-1.el7ost.noarch
openstack-nova-placement-api-17.0.3-0.20180420001142.el7ost.noarch
openstack-keystone-13.0.1-0.20180420194847.7bd6454.el7ost.noarch
puppet-openstack_extras-12.4.1-0.20180413042250.2634296.el7ost.noarch
openstack-tripleo-puppet-elements-8.0.0-2.el7ost.noarch
openstack-mistral-api-6.0.2-1.el7ost.noarch
openstack-tripleo-common-8.6.1-23.el7ost.noarch
openstack-heat-api-cfn-10.0.1-0.20180411125640.el7ost.noarch
openstack-tripleo-common-containers-8.6.1-23.el7ost.noarch
python2-openstacksdk-0.11.3-1.el7ost.noarch
openstack-nova-scheduler-17.0.3-0.20180420001142.el7ost.noarch
openstack-swift-container-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-ironic-common-10.1.2-4.el7ost.noarch
openstack-mistral-executor-6.0.2-1.el7ost.noarch
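
The openstack-tripleo-common build listed above (8.6.1-23.el7ost) predates the build named in the Fixed In Version field (8.6.3-4.el7ost). A minimal sketch of that comparison follows; the function name has_fix is hypothetical, and GNU `sort -V` only approximates RPM's own version ordering, so treat the result as a hint rather than a definitive check.

```shell
#!/bin/sh
# Hypothetical check: is an installed version-release at or above the fixed one?
# Uses GNU "sort -V", which approximates but is not identical to RPM's
# version comparison.
FIXED="8.6.3-4.el7ost"

has_fix() {
    installed="$1"
    # The lower of the two versions sorts first; if that is the fixed
    # version, the installed one is equal to or newer than the fix.
    lowest="$(printf '%s\n' "$FIXED" "$installed" | sort -V | head -n 1)"
    if [ "$lowest" = "$FIXED" ]; then
        echo "yes"
    else
        echo "no"
    fi
}

has_fix "8.6.1-23.el7ost"
```

On a live undercloud the argument could come from `rpm -q --qf '%{VERSION}-%{RELEASE}\n' openstack-tripleo-common`.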




How reproducible:
always

Steps to Reproduce:
1. Deploy OSP13 with the latest passed_phase2 puddle.
2. On the hypervisor, find the qcow2 disk of a controller and corrupt it.
3. Set the failed node to the powered-off state using Ironic.
4. Try to replace the controller using the official docs:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/director_installation_and_usage/#sect-Replacing_Controller_Nodes

Actual results:

overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution:
  resource_type: OS::TripleO::WorkflowSteps
  physical_resource_id: 80325942-11a8-4919-98a9-7f53ce670674
  status: CREATE_FAILED
  status_reason: |
    resources.WorkflowTasks_Step2_Execution: ERROR

Expected results:
The controller is replaced and the overcloud is operable.
 
Additional info:

Comment 4 Artem Hrechanychenko 2018-07-11 17:41:41 UTC
The reports should be available here: http://rhos-release.virt.bos.redhat.com/log/bz1600202

Comment 7 Omri Hochman 2018-07-12 15:38:42 UTC
I'd rather not close this BZ unless we verify that, by fixing bug 1548026, we can also test replacing a controller with a failed HD.

Adding a dependency on #1548026.

Comment 8 Omri Hochman 2018-07-12 15:39:41 UTC
Changing the Target Release to osp13-z2, in line with 1548026.

Comment 11 John Fulton 2018-07-18 15:24:46 UTC
The patch which should fix this, tracked in 1548026, has merged upstream

 https://review.openstack.org/#/c/583229

Comment 12 John Fulton 2018-08-09 13:29:59 UTC
Giving this a more meaningful subject (removing unnecessary text around it)

Comment 16 Joanne O'Flynn 2018-08-14 10:49:03 UTC
This bug is marked for inclusion in the errata but does not currently contain draft documentation text. To ensure the timely release of this advisory please provide draft documentation text for this bug as soon as possible.

If you do not think this bug requires errata documentation, set the requires_doc_text flag to "-".


To add draft documentation text:

* Select the documentation type from the "Doc Type" drop down field.

* A template will be provided in the "Doc Text" field based on the "Doc Type" value selected. Enter draft text in the "Doc Text" field.

Comment 17 Gal Amado 2018-08-14 19:20:14 UTC
Omri,
Are you planning to verify this for R13Z2 release?
Gal.

Comment 18 Omri Hochman 2018-08-15 23:11:53 UTC
(In reply to Gal Amado from comment #17)
> Omri,
> Are you planning to verify this for R13Z2 release?
> Gal.

yes, we're aiming to test this for osp13.z.

Comment 22 errata-xmlrpc 2018-08-29 16:37:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2574

