Bug 1600202

Summary:	Cannot replace controller after disk outage
Product:	Red Hat OpenStack	Reporter:	Artem Hrechanychenko <ahrechan>
Component:	openstack-tripleo-common	Assignee:	John Fulton <johfulto>
Status:	CLOSED ERRATA	QA Contact:	Artem Hrechanychenko <ahrechan>
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	13.0 (Queens)	CC:	gamado, gfidente, jjoyce, joflynn, johfulto, mburns, mcornea, ohochman, slinaber, tshefi, yrabl
Target Milestone:	z2	Keywords:	Reopened, Triaged, ZStream
Target Release:	13.0 (Queens)
Hardware:	All
OS:	Linux
Whiteboard:	DFG:Storage
Fixed In Version:	openstack-tripleo-common-8.6.3-4.el7ost	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-08-29 16:37:56 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1548026
Bug Blocks:

Description Artem Hrechanychenko 2018-07-11 16:46:49 UTC

Description of problem:
Simulated a disk outage on a controller node by overwriting its disk image on the hypervisor:
[root@seal33 ~ ]# dd if=/dev/zero of=/var/lib/libvirt/images/controller-1-disk1.qcow2  bs=600M count=5


| 36657447-9ca3-482d-9134-e62322e055ba | controller-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.10 |


(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.10 -vvv
OpenSSH_7.4p1, OpenSSL 1.0.2k-fips  26 Jan 2017
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 58: Applying options for *
debug2: resolving "192.168.24.10" port 22
debug2: ssh_connect_direct: needpriv 0
debug1: Connecting to 192.168.24.10 [192.168.24.10] port 22.
debug1: Connection established.
debug1: identity file /home/stack/.ssh/id_rsa type 1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_7.4
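
The verbose SSH session above stops after the local version string, so an interactive login simply hangs. A minimal sketch of a non-interactive probe that gives up after a fixed time instead; the function name `ssh_probe_cmd` and the default of 10 seconds are assumptions for illustration, and the function only prints the command rather than running it:

```shell
#!/bin/sh
# Hypothetical helper: build an SSH probe command that fails fast.
# BatchMode disables password prompts, ConnectTimeout bounds the connection
# attempt, and the outer coreutils `timeout` bounds the whole command.
ssh_probe_cmd() {
    host="$1"          # ctlplane IP of the node to probe
    secs="${2:-10}"    # assumed default timeout in seconds
    echo "timeout $((secs + 5)) ssh -o BatchMode=yes -o ConnectTimeout=${secs} heat-admin@${host} true"
}

# Print the probe for the controller from this report.
ssh_probe_cmd 192.168.24.10
```

A non-zero exit status from the printed command would indicate that the node is not usable over SSH, even though Nova still reports it as ACTIVE.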

So the node became unreachable.
After that we powered the node off using Ironic, but we were not able to remove the Ceph monitor from controller-1 as described in:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/director_installation_and_usage/#sect-Replacing_Controller_Nodes

11.4.2. Removing a Ceph Monitor Daemon

This procedure removes a ceph-mon daemon from the storage cluster. If your Controller node is running a Ceph monitor service, complete the following steps to remove the ceph-mon daemon. This procedure assumes the Controller is reachable.
Note

A new Ceph monitor daemon will be added after a new Controller is added to the cluster.

    Connect to the controller to be replaced and become root:

    # ssh heat-admin@192.168.0.47
    # sudo su -

    As root, stop the monitor:

    # systemctl stop ceph-mon@<monitor_hostname>

    For example:

    # systemctl stop ceph-mon@overcloud-controller-2

    Remove the monitor from the cluster:

    # ceph mon remove <mon_id>
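
The quoted procedure assumes the failed controller can still be reached over SSH, which was not the case in this report. A minimal dry-run sketch of removing the dead monitor from a surviving controller instead. The helper name `print_mon_removal_cmds` and the example monitor ID are assumptions, as is the idea that the monitor ID matches the controller's short hostname; in a containerized deployment the `ceph` commands may have to be run inside the monitor container. The script only prints the commands:

```shell
#!/bin/sh
# Hypothetical helper: print (do not execute) the commands for removing an
# unreachable monitor, to be run as root on a controller that is still up.
print_mon_removal_cmds() {
    failed_mon="$1"   # assumed monitor ID of the unreachable controller
    # Show the current monitor map and quorum before changing anything.
    echo "ceph mon stat"
    # Remove the dead monitor from the monitor map.
    echo "ceph mon remove ${failed_mon}"
    # Show the monitor map again to confirm the monitor is gone.
    echo "ceph mon stat"
}

# Example monitor ID only; substitute the real one from `ceph mon stat`.
print_mon_removal_cmds overcloud-controller-1
```

The `systemctl stop` step is skipped in this sketch because the daemon's host is already powered off.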


As a result, the replacement procedure failed:


(undercloud) [stack@undercloud-0 ~]$ openstack stack failures list --long overcloud
overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution:
  resource_type: OS::TripleO::WorkflowSteps
  physical_resource_id: 80325942-11a8-4919-98a9-7f53ce670674
  status: CREATE_FAILED
  status_reason: |
    resources.WorkflowTasks_Step2_Execution: ERROR



Version-Release number of selected component (if applicable):
core_puddle_version = 2018-07-06.1

openstack-nova-compute-17.0.3-0.20180420001142.el7ost.noarch
openstack-glance-16.0.1-2.el7ost.noarch
openstack-nova-common-17.0.3-0.20180420001142.el7ost.noarch
openstack-neutron-openvswitch-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-heat-api-10.0.1-0.20180411125640.el7ost.noarch
python-openstackclient-lang-3.14.1-1.el7ost.noarch
openstack-ironic-conductor-10.1.2-4.el7ost.noarch
openstack-tripleo-validations-8.4.1-5.el7ost.noarch
openstack-nova-api-17.0.3-0.20180420001142.el7ost.noarch
openstack-nova-conductor-17.0.3-0.20180420001142.el7ost.noarch
openstack-swift-object-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-neutron-ml2-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-mistral-engine-6.0.2-1.el7ost.noarch
python2-openstackclient-3.14.1-1.el7ost.noarch
puppet-openstacklib-12.4.0-0.20180329042555.4b30e6f.el7ost.noarch
openstack-neutron-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-heat-engine-10.0.1-0.20180411125640.el7ost.noarch
openstack-ironic-staging-drivers-0.9.0-4.el7ost.noarch
openstack-tripleo-image-elements-8.0.1-1.el7ost.noarch
openstack-selinux-0.8.14-12.el7ost.noarch
openstack-tripleo-heat-templates-8.0.2-43.el7ost.noarch
openstack-swift-account-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-neutron-common-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-swift-proxy-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-heat-common-10.0.1-0.20180411125640.el7ost.noarch
openstack-ironic-api-10.1.2-4.el7ost.noarch
openstack-ironic-inspector-7.2.1-0.20180409163360.el7ost.noarch
openstack-tempest-18.0.0-2.el7ost.noarch
openstack-mistral-common-6.0.2-1.el7ost.noarch
openstack-tripleo-ui-8.3.1-3.el7ost.noarch
openstack-zaqar-6.0.1-1.el7ost.noarch
openstack-nova-placement-api-17.0.3-0.20180420001142.el7ost.noarch
openstack-keystone-13.0.1-0.20180420194847.7bd6454.el7ost.noarch
puppet-openstack_extras-12.4.1-0.20180413042250.2634296.el7ost.noarch
openstack-tripleo-puppet-elements-8.0.0-2.el7ost.noarch
openstack-mistral-api-6.0.2-1.el7ost.noarch
openstack-tripleo-common-8.6.1-23.el7ost.noarch
openstack-heat-api-cfn-10.0.1-0.20180411125640.el7ost.noarch
openstack-tripleo-common-containers-8.6.1-23.el7ost.noarch
python2-openstacksdk-0.11.3-1.el7ost.noarch
openstack-nova-scheduler-17.0.3-0.20180420001142.el7ost.noarch
openstack-swift-container-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-ironic-common-10.1.2-4.el7ost.noarch
openstack-mistral-executor-6.0.2-1.el7ost.noarch




How reproducible:
always

Steps to Reproduce:
1. Deploy OSP13 with the latest passed_phase2 puddle.
2. On the hypervisor, find the qcow2 disk image of a controller and corrupt it.
3. Set the failed node to the power-off state using Ironic.
4. Try to replace the controller by following the official docs:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/director_installation_and_usage/#sect-Replacing_Controller_Nodes
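
Steps 2 and 3 above can be sketched as a dry run that prints the commands instead of executing them. The `dd` parameters and image path are the ones quoted in the description; the function name `print_outage_cmds` is hypothetical, and using `controller-1` as the Ironic node name is an assumption (a node UUID works as well):

```shell
#!/bin/sh
# Hypothetical helper: print the commands that simulate the disk outage.
print_outage_cmds() {
    disk="$1"   # qcow2 image of the controller on the hypervisor
    node="$2"   # Ironic node name or UUID of the failed controller
    # Step 2 (on the hypervisor): overwrite the image with zeros. DESTRUCTIVE.
    echo "dd if=/dev/zero of=${disk} bs=600M count=5"
    # Step 3 (on the undercloud): power the node off through Ironic.
    echo "openstack baremetal node power off ${node}"
}

print_outage_cmds /var/lib/libvirt/images/controller-1-disk1.qcow2 controller-1
```

Printing rather than running the commands keeps the sketch safe to try, since the `dd` step destroys the controller's disk image.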

Actual results:

overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution:
  resource_type: OS::TripleO::WorkflowSteps
  physical_resource_id: 80325942-11a8-4919-98a9-7f53ce670674
  status: CREATE_FAILED
  status_reason: |
    resources.WorkflowTasks_Step2_Execution: ERROR

Expected results:
The controller is replaced and the overcloud is operable.
 
Additional info:

Comment 4 Artem Hrechanychenko 2018-07-11 17:41:41 UTC

The reports should be available here: http://rhos-release.virt.bos.redhat.com/log/bz1600202

Comment 7 Omri Hochman 2018-07-12 15:38:42 UTC

I'd rather not close this BZ until we verify that, with the fix for bug 1548026, we can also replace a controller that has a failed hard disk.

Adding a dependency on bug 1548026.

Comment 8 Omri Hochman 2018-07-12 15:39:41 UTC

Changing the target release to match bug 1548026: osp13-z2.

Comment 11 John Fulton 2018-07-18 15:24:46 UTC

The patch which should fix this, tracked in 1548026, has merged upstream

 https://review.openstack.org/#/c/583229

Comment 12 John Fulton 2018-08-09 13:29:59 UTC

Giving this a more meaningful subject (removing unnecessary text around it)

Comment 16 Joanne O'Flynn 2018-08-14 10:49:03 UTC

This bug is marked for inclusion in the errata but does not currently contain draft documentation text. To ensure the timely release of this advisory please provide draft documentation text for this bug as soon as possible.

If you do not think this bug requires errata documentation, set the requires_doc_text flag to "-".


To add draft documentation text:

* Select the documentation type from the "Doc Type" drop down field.

* A template will be provided in the "Doc Text" field based on the "Doc Type" value selected. Enter draft text in the "Doc Text" field.

Comment 17 Gal Amado 2018-08-14 19:20:14 UTC

Omri,
Are you planning to verify this for R13Z2 release?
Gal.

Comment 18 Omri Hochman 2018-08-15 23:11:53 UTC

(In reply to Gal Amado from comment #17)
> Omri,
> Are you planning to verify this for R13Z2 release?
> Gal.

Yes, we're aiming to test this for osp13.z.

Comment 19 Artem Hrechanychenko 2018-08-22 14:02:08 UTC

VERIFIED

Used https://docs.google.com/document/d/1738ZeETl3f1-0ieOSBDjVBWqC8xfruI-F4HPa7eixhk/edit to remove the Ceph monitor

and https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/director_installation_and_usage/#sect-Replacing_Controller_Nodes to replace the controller.


openstack-tripleo-common-8.6.3-10.el7ost.noarch

Comment 22 errata-xmlrpc 2018-08-29 16:37:56 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2574