Description of problem:
Takeover does not work in the case of PXE boot. The problem happens when you deploy a bare metal node and that node is provisioned by the ironic-conductor on controller1. If controller1 then goes down completely, the ironic-conductor on another controller node should take over the bare metal server, but it does not.

Upstream bug for the same issue: https://bugs.launchpad.net/ironic/+bug/1559138
Duplicate of https://bugs.launchpad.net/ironic/+bug/1516816

Version-Release number of selected component (if applicable):
RHEL OSP 9

How reproducible:
Every time for the customer.

Steps to Reproduce:
1.
2.
3.

Actual results:
Failover does not happen properly.

Expected results:
Failover should happen without any issue.

Additional info:
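For context, any other conductor running the same driver is a takeover candidate. A minimal check, assuming the overcloud credentials are sourced (this only lists conductors; it does not prove takeover works):

~~~
# "Active host(s)" lists the conductors serving each driver; when the one
# currently managing a node dies, one of the others should take it over
ironic driver-list
~~~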
Hi! The patch for the bug has merged upstream. We're looking into the possibility of backporting it. As a workaround, use local boot with bare metal instances.
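For anyone needing the workaround right away, a minimal sketch; $NODE_UUID and the "baremetal" flavor name are placeholders for your environment:

~~~
# ask ironic to install a bootloader so the instance boots from its local
# disk instead of PXE (note: this overwrites any existing capabilities)
ironic node-update $NODE_UUID add properties/capabilities='boot_option:local'

# request local boot on the flavor used for bare metal instances
openstack flavor set baremetal --property capabilities:boot_option=local
~~~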
Backported to Newton, pending OSP 10 rebase. Lucas is looking into the possibility of a backport to Mitaka (OSP 9).
Hi VIKRANT,

Thanks for reporting this. I was looking at the effort required to backport this fix all the way down to Mitaka, and it's not trivial. A couple of methods in the pxe.py module have been rewritten since, so the fix no longer merges cleanly; we would need to rewrite part of it. Therefore, for now we won't consider this backport for OSP-9.

I'm changing the target of this bug to OSP-10, where the fix has already been backported. Please let us know if that's OK with you.

Cheers,
Lucas
Hi Dan,

Here are the steps I got from the customer:

~~~
The problem happens when I do not use local boot in ironic (the node network-boots every time the bare metal instance reboots).

The steps:
1. Boot a bare metal instance (note the instance is not using local boot).
2. Check which ironic-conductor it belongs to, then shut down the physical node running that ironic-conductor (it is one of the controllers).
3. Reboot the bare metal instance.
~~~
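A sketch of those steps as commands, assuming a bare metal flavor, image, and network are already set up; all names are placeholders:

~~~
# 1. boot a bare metal instance that does NOT use local boot
openstack server create --flavor baremetal --image $IMAGE --nic net-id=$NET instance1

# 2. the "reservation" field holds the managing conductor's hostname while it
#    works on the node; catch it during an operation, then power off that controller
ironic node-show $NODE_UUID | grep reservation

# 3. reboot the bare metal instance; without the fix the net boot fails because
#    the surviving conductors never rebuilt the PXE environment for the node
openstack server reboot instance1
~~~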
The BZ failed QA. It looks like takeover doesn't work in any case, not just PXE.

The flow employed:
- have a BM node under OSP 10 managed by ironic
- try to power the ironic node off (or on); the idea is to catch the moment when the node's "reservation" field is populated by the current conductor
- kill the controller node that holds the reservation

Results:
- ironic node-show remains stuck with "reservation | overcloud-controller-2.localdomain"
- any attempt to power the node on/off fails:

[stack@undercloud-0 ~]$ ironic node-set-power-state ironic-1 on
Node f723a2cd-4f8d-4dd7-aad5-cb15db6e932d is locked by host overcloud-controller-2.localdomain, please retry after the current operation is completed. (HTTP 409)

[stack@undercloud-0 ~]$ ironic node-set-power-state ironic-1 off
Node f723a2cd-4f8d-4dd7-aad5-cb15db6e932d is locked by host overcloud-controller-2.localdomain, please retry after the current operation is completed. (HTTP 409)

I'm setting this back to ASSIGNED, as it failed QA.
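For the record, the stale lock can also be inspected directly in the ironic database from a surviving controller; a hedged sketch (database name, credentials, and access method vary per deployment):

~~~
# nodes whose reservation still points at a (possibly dead) conductor
mysql ironic -e "SELECT uuid, name, reservation FROM nodes WHERE reservation IS NOT NULL;"

# conductors and their last heartbeat; a dead conductor stops updating updated_at
mysql ironic -e "SELECT hostname, updated_at FROM conductors;"
~~~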
Dan, how much time did you wait? Cleaning up reservations is certainly not instant.
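For reference, the timers that govern how quickly a dead conductor is noticed and its nodes taken over live in the [conductor] section of ironic.conf; a sketch with what should be the upstream defaults (worth verifying against the deployed release):

~~~
[conductor]
# a conductor is presumed dead after this many seconds without a heartbeat
heartbeat_timeout = 60
# how often each conductor refreshes its heartbeat in the database
heartbeat_interval = 10
# how often each conductor checks for nodes it should take over
sync_local_state_interval = 180
~~~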
(In reply to Dmitry Tantsur from comment #18)
> Dan, how much time did you wait? Cleaning up reservations is certainly not
> instant.

16 hours so far - still stuck.
The bug Dan hit was fixed in OSP 10 by https://github.com/openstack/ironic/commit/d52077f4fe8c668b258702e8298a4beaa19476d8. However, one more change is missing for proper takeover; attaching it.
And one more change to complete the picture.
It looks like all patches are in stable/queens. Moving to POST.
I would like https://review.openstack.org/#/c/546273/ to also get in as part of this work, so moving back to ON_DEV for now. Sorry for not updating earlier.
As https://review.openstack.org/#/c/554202/ has landed, which is the backport for https://review.openstack.org/#/c/546273/, moving to POST.
Install latest OSP 13 puddle: 2018-05-10.3

Step 1)
(overcloud) [stack@undercloud-0 ~]$ ironic node-list
The "ironic" CLI is deprecated and will be removed in the S* release. Please use the "openstack baremetal" CLI instead.
+--------------------------------------+----------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name     | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+----------+--------------------------------------+-------------+--------------------+-------------+
| e5b6f81f-857b-4867-a7f5-729769609d93 | ironic-0 | 82010342-9c04-421f-8cb0-1ab2277786b3 | power on    | active             | False       |
| 4f0ad22a-a246-40b4-8656-04031b3630cb | ironic-1 | None                                 | power off   | available          | False       |
+--------------------------------------+----------+--------------------------------------+-------------+--------------------+-------------+

Step 2)
(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node list
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name         | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| dbec5044-73ad-4301-b5b9-f7a12194216c | ceph-0       | f7a92aa8-f094-4e34-b6c2-5511173472bd | power on    | active             | False       |
| 2047d948-5619-4e44-ab3d-ce62c29f469e | ceph-1       | c25e6d66-f671-43a7-a7f8-e7e79d1ff963 | power on    | active             | False       |
| 47f9dc3c-dd9f-48f0-bd60-611e56ca5d91 | ceph-2       | a3182ebd-b469-4b5f-bf39-298015ff8ae7 | power on    | active             | False       |
| 3c4e1fe8-90a5-4999-be51-c90cf6cbf40a | compute-0    | e18cae11-2f9a-446f-ba55-e708494e0f7d | power on    | active             | False       |
| 58168f45-2080-4a0d-aec2-41a23977840f | controller-0 | f578b76f-3ca4-4566-b2ac-d2d81795aae2 | power on    | active             | False       |
| 8d14cf19-dba8-4194-a711-0454f728d2eb | controller-1 | 8e6a507a-9c4a-464f-8bcc-08ecf9ed059e | power on    | active             | False       |
| b70d02c2-96a0-4de1-a56b-5529baf62f42 | controller-2 | 61eaeb8f-843a-4e29-817e-a6256de5b2dc | power on    | active             | False       |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+

Step 3)
(overcloud) [stack@undercloud-0 ~]$ ironic node-set-power-state ironic-0 off
The "ironic" CLI is deprecated and will be removed in the S* release. Please use the "openstack baremetal" CLI instead.
(overcloud) [stack@undercloud-0 ~]$ ironic node-show ironic-0
The "ironic" CLI is deprecated and will be removed in the S* release. Please use the "openstack baremetal" CLI instead.
+------------------------+--------------------------------------------------------------------------+
| Property               | Value                                                                    |
+------------------------+--------------------------------------------------------------------------+
| boot_interface         | None                                                                     |
| chassis_uuid           | None                                                                     |
| clean_step             | {}                                                                       |
| console_enabled        | False                                                                    |
| console_interface      | None                                                                     |
| created_at             | 2018-05-15T17:13:01+00:00                                                |
| deploy_interface       | None                                                                     |
| driver                 | pxe_ipmitool                                                             |
| driver_info            | {u'ipmi_port': u'6234', u'ipmi_username': u'admin', u'deploy_kernel':    |
|                        | u'90e6217f-0839-4833-ae12-76d7a70d3866', u'ipmi_address': u'172.16.0.1', |
|                        | u'deploy_ramdisk': u'4bc40ea6-33fe-407c-bb54-e485b9a7f0e3',              |
|                        | u'ipmi_password': u'******'}                                             |
| driver_internal_info   | {u'agent_cached_clean_steps_refreshed': u'2018-05-15 17:14:29.562653',   |
|                        | u'agent_cached_clean_steps': {u'deploy': [{u'priority': 99,              |
|                        | u'interface': u'deploy', u'reboot_requested': False, u'abortable': True, |
|                        | u'step': u'erase_devices_metadata'}, {u'priority': 10, u'interface':     |
|                        | u'deploy', u'reboot_requested': False, u'abortable': True, u'step':      |
|                        | u'erase_devices'}]}, u'clean_steps': None, u'hardware_manager_version':  |
|                        | {u'generic_hardware_manager': u'1.1'}, u'is_whole_disk_image': False,    |
|                        | u'agent_continue_if_ata_erase_failed': False,                            |
|                        | u'agent_erase_devices_iterations': 1, u'agent_erase_devices_zeroize':    |
|                        | True, u'root_uuid_or_disk_id': u'0d0b8fbf-db98-4612-b551-81fb39aacaec',  |
|                        | u'agent_version': u'3.2.1.dev2', u'agent_url':                           |
|                        | u'http://192.168.24.44:9999'}                                            |
| extra                  | {}                                                                       |
| inspect_interface      | None                                                                     |
| inspection_finished_at | None                                                                     |
| inspection_started_at  | None                                                                     |
| instance_info          | {u'root_gb': u'20', u'display_name': u'instance2', u'image_source':      |
|                        | u'b852a157-dc53-4e94-9515-2ce4772f04a6', u'memory_mb': u'1024',          |
|                        | u'vcpus': u'1', u'local_gb': u'40', u'configdrive': u'******',           |
|                        | u'swap_mb': u'0', u'nova_host_id': u'overcloud-                          |
|                        | controller-2.localdomain'}                                               |
| instance_uuid          | 82010342-9c04-421f-8cb0-1ab2277786b3                                     |
| last_error             | None                                                                     |
| maintenance            | False                                                                    |
| maintenance_reason     | None                                                                     |
| management_interface   | None                                                                     |
| name                   | ironic-0                                                                 |
| network_interface      | flat                                                                     |
| power_interface        | None                                                                     |
| power_state            | power on                                                                 |
| properties             | {u'memory_mb': u'4096', u'cpu_arch': u'x86_64', u'local_gb': u'40',      |
|                        | u'cpus': u'4', u'capabilities': u'boot_option:local'}                    |
| provision_state        | active                                                                   |
| provision_updated_at   | 2018-05-15T17:24:10+00:00                                                |
| raid_config            | {}                                                                       |
| raid_interface         | None                                                                     |
| reservation            | overcloud-controller-2.localdomain                                       |
| resource_class         | None                                                                     |
| storage_interface      | noop                                                                     |
| target_power_state     | power off                                                                |
| target_provision_state | None                                                                     |
| target_raid_config     | {}                                                                       |
| traits                 |                                                                          |
| updated_at             | 2018-05-15T21:57:25+00:00                                                |
| uuid                   | e5b6f81f-857b-4867-a7f5-729769609d93                                     |
| vendor_interface       | None                                                                     |
+------------------------+--------------------------------------------------------------------------+

As soon as I saw the reservation from the Step 3 power off on ironic-0, I issued the following in Step 4.

Step 4)
(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node reboot b70d02c2-96a0-4de1-a56b-5529baf62f42

(overcloud) [stack@undercloud-0 ~]$ ironic node-show ironic-0 | grep reservation
The "ironic" CLI is deprecated and will be removed in the S* release. Please use the "openstack baremetal" CLI instead.
| reservation            | None

I repeated this a couple of times. I did not see any hangup with the reservation. I was able to power off/on several more times.
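For convenience, the reservation check above can be kept running in a loop while power-cycling the node from another shell; a trivial sketch:

~~~
# the field should return to None shortly after each power operation completes;
# with the bug it would stay pinned to the dead conductor's hostname
watch -n 5 'openstack baremetal node show ironic-0 -f value -c reservation'
~~~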
Will check with dtantsur if this is sufficient for verification.
Yeah, it seems that the problem from comment 17 is gone.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2086