Bug 1403185 - Takeover does not work in case of pxe boot
Summary: Takeover does not work in case of pxe boot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 9.0 (Mitaka)
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: Upstream M2
Target Release: 13.0 (Queens)
Assignee: Dmitry Tantsur
QA Contact: Dan Yasny
URL:
Whiteboard:
Depends On:
Blocks: 1473267
 
Reported: 2016-12-09 11:30 UTC by VIKRANT
Modified: 2018-06-27 13:31 UTC (History)
CC List: 13 users

Fixed In Version: openstack-ironic-10.1.2-2.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-27 13:29:16 UTC
Target Upstream Version:


Attachments


Links
System ID Priority Status Summary Last Updated
OpenStack gerrit 425124 None stable/newton: MERGED ironic: Fix take over for ACTIVE nodes in PXEBoot (I264d5477523f57552aadf2809021a9a63fee2730) 2018-04-18 14:04:47 UTC
OpenStack gerrit 545806 None stable/queens: MERGED ironic: Allow sqalchemy filtering by id and uuid (I4efc0d5cd5d5d6108a334f954e1718203b47da0a) 2018-04-18 14:04:38 UTC
OpenStack gerrit 545893 None stable/queens: MERGED ironic: Clean nodes stuck in CLEANING state when ir-cond restarts (Ia7bce4dff57569707de4fcf3002eac241a5aa85b) 2018-04-18 14:04:28 UTC
OpenStack gerrit 554202 None stable/queens: MERGED ironic: Rework logic handling reserved orphaned nodes in the conductor (I379c1335692046ca9423fda5ea68d2f10c065cb5) 2018-04-18 14:04:23 UTC
Red Hat Product Errata RHEA-2018:2086 None None None 2018-06-27 13:31:13 UTC

Description VIKRANT 2016-12-09 11:30:53 UTC
Description of problem:

Takeover does not work in case of pxe boot.

The problem occurs when a bare metal node has been deployed and is managed by
the ironic-conductor on controller1. If controller1 then goes down completely,
the ironic-conductor on one of the other controller nodes should take over the
bare metal server, but it does not.
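
For background, ironic distributes nodes among conductors using a consistent hash ring: when a conductor goes down, its nodes should map to a surviving conductor, whose take-over logic must then rebuild per-node state such as the PXE configuration. A toy sketch of the remapping only (illustrative, not ironic's actual hash-ring implementation):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash for ring placement (md5 used only for illustration).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent hash ring mapping node UUIDs to conductor hostnames."""

    def __init__(self, conductors, replicas=64):
        self.replicas = replicas
        self.ring = {}   # hash value -> conductor hostname
        self.keys = []   # sorted hash values
        for conductor in conductors:
            self.add(conductor)

    def add(self, conductor):
        for i in range(self.replicas):
            h = _hash(f"{conductor}-{i}")
            self.ring[h] = conductor
            bisect.insort(self.keys, h)

    def remove(self, conductor):
        for i in range(self.replicas):
            h = _hash(f"{conductor}-{i}")
            del self.ring[h]
            self.keys.remove(h)

    def owner(self, node_uuid):
        # The node belongs to the first conductor clockwise from its hash.
        h = _hash(node_uuid)
        idx = bisect.bisect(self.keys, h) % len(self.keys)
        return self.ring[self.keys[idx]]

ring = HashRing(["controller1", "controller2", "controller3"])
node = "f723a2cd-4f8d-4dd7-aad5-cb15db6e932d"
before = ring.owner(node)
ring.remove(before)          # simulate the owning conductor going down
after = ring.owner(node)     # a surviving conductor picks the node up
assert after != before
```

Per the linked gerrit change ("Fix take over for ACTIVE nodes in PXEBoot"), the remapping itself works; what was broken is the second half, rebuilding the PXE environment for ACTIVE nodes on the new conductor.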

Upstream Bug regarding the same issue : 

https://bugs.launchpad.net/ironic/+bug/1559138 Duplicate of
https://bugs.launchpad.net/ironic/+bug/1516816

Version-Release number of selected component (if applicable):
RHEL OSP 9

How reproducible:
Every time, for the customer.

Steps to Reproduce:
1. 
2. 
3.

Actual results:
Failover is not happening properly.

Expected results:
Failover should happen without any issue.

Additional info:

Comment 3 Dmitry Tantsur 2017-01-23 11:12:59 UTC
Hi! The patch for the bug has merged upstream. We're looking into the possibility of backporting it. As a workaround, use local boot with bare metal instances.
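
The workaround corresponds to setting the boot_option:local capability on the node (and matching capabilities:boot_option=local in the flavor). Node capabilities live in node.properties as a comma-separated key:value string; a minimal helper sketch for updating such a string (the helper itself is hypothetical, not part of ironic):

```python
def set_capability(capabilities: str, key: str, value: str) -> str:
    """Return a capabilities string with `key` set to `value`.

    Ironic stores capabilities as 'k1:v1,k2:v2' in node.properties.
    """
    caps = dict(
        item.split(":", 1) for item in capabilities.split(",") if item
    )
    caps[key] = value
    return ",".join(f"{k}:{v}" for k, v in sorted(caps.items()))

print(set_capability("profile:compute", "boot_option", "local"))
# boot_option:local,profile:compute
```

The CLI equivalent is along the lines of `ironic node-update <node> add properties/capabilities='boot_option:local'` (or `openstack baremetal node set <node> --property capabilities=...` with the newer client), plus the matching flavor extra spec.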

Comment 4 Dmitry Tantsur 2017-02-07 14:18:56 UTC
Backported to Newton, pending OSP 10 rebase. Lucas is looking into the possibility of a backport to Mitaka (OSP 9).

Comment 5 Lucas Alvares Gomes 2017-02-07 14:29:20 UTC
Hi VIKRANT,

Thanks for reporting this. I was looking at the effort required to backport this fix all the way down to Mitaka, and it's not trivial. A couple of methods in the pxe.py module have been rewritten since, so the fix no longer merges cleanly; we would need to rewrite part of it. Therefore, for now we won't consider a backport to OSP-9.

I'm changing the target of this bug to OSP-10 where the fix has been backported already.

Please let us know if it's OK with you.

Cheers,
Lucas

Comment 16 VIKRANT 2017-05-04 05:00:12 UTC
Hi Dan,

Here are the steps which I got from Cu. 

~~~
The problem happens when I do not use local boot in ironic (the node will net boot every time the bare metal instance reboots).

The steps:
1. Boot a baremetal instance (note that the instance is not using local boot).
2. Check which ironic-conductor it belongs to, then shut down the physical node running that ironic-conductor (it is one of the controllers).
3. Reboot the baremetal instance.
~~~
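
Between steps 2 and 3, a surviving conductor is expected to take the node over and its "reservation" field to settle. A hedged sketch of a wait loop one could use when reproducing this, where `get_reservation` is a stand-in for reading the reservation field from `ironic node-show`:

```python
import time

def wait_for_takeover(get_reservation, dead_host, timeout=300, interval=5):
    """Poll until the node's reservation is no longer held by dead_host.

    get_reservation() stands in for reading the 'reservation' field via
    `ironic node-show` / `openstack baremetal node show`.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        holder = get_reservation()
        if holder != dead_host:
            return holder  # None, or a surviving conductor mid-operation
        time.sleep(interval)
    raise TimeoutError(f"reservation still held by {dead_host}")

# Simulated: the stale lock clears on the third poll.
answers = iter(["controller1", "controller1", None])
print(wait_for_takeover(lambda: next(answers), "controller1", interval=0))
# None
```

With this bug present, the loop would instead hit the timeout: the dead conductor's reservation is never released (see comment 17 below, where it stayed stuck for 16 hours).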

Comment 17 Dan Yasny 2017-05-04 20:28:19 UTC
The BZ failed QA.

It looks like takeover doesn't work in any case, not just PXE.

The flow employed:
- have a BM node under OSP10 managed by ironic
- try to power the ironic node off (or on); the idea is to catch the moment when the node's "reservation" field is populated by the current conductor
- kill the controller node that holds the reservation

results:
- ironic node-show remains stuck with "reservation | overcloud-controller-2.localdomain"
- any attempt to power the node on/off fails:
  [stack@undercloud-0 ~]$ ironic node-set-power-state ironic-1 on
Node f723a2cd-4f8d-4dd7-aad5-cb15db6e932d is locked by host overcloud-controller-2.localdomain, please retry after the current operation is completed. (HTTP 409)
  [stack@undercloud-0 ~]$ ironic node-set-power-state ironic-1 off
Node f723a2cd-4f8d-4dd7-aad5-cb15db6e932d is locked by host overcloud-controller-2.localdomain, please retry after the current operation is completed. (HTTP 409)

I'm setting this back to ASSIGNED, as it failed QA.
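
For context, a 409 NodeLocked response is normally transient and clients simply retry; it only signals this bug when, as above, the lock holder is dead and the reservation never clears. A generic retry sketch (the `Conflict` exception here is a stand-in for the HTTP 409 error raised by the client library):

```python
import time

class Conflict(Exception):
    """Stand-in for the HTTP 409 (node locked) error from the ironic API."""

def retry_on_conflict(op, attempts=5, delay=2):
    """Call op(), retrying while it raises Conflict (node locked)."""
    for attempt in range(attempts):
        try:
            return op()
        except Conflict:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)

calls = {"n": 0}
def power_off():
    # Simulated API call: locked for the first two attempts.
    calls["n"] += 1
    if calls["n"] < 3:
        raise Conflict("node is locked")
    return "power off requested"

print(retry_on_conflict(power_off, delay=0))
# power off requested
```

Against a dead lock holder, every attempt raises Conflict and the final one propagates, which matches the behavior QA observed.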

Comment 18 Dmitry Tantsur 2017-05-05 08:06:03 UTC
Dan, how much time did you wait? Cleaning up reservations is certainly not instant.
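
For reference, the cleanup mentioned here is a periodic conductor task that releases reservations held by conductors no longer registered as alive; the later patches on this bug rework exactly that logic. A much-simplified sketch (not ironic's actual code):

```python
def clear_orphan_reservations(nodes, alive_conductors):
    """Release reservations held by conductors that are no longer alive.

    nodes: mapping of node name/uuid -> reservation holder (or None).
    Returns the list of nodes whose reservation was cleared.
    """
    cleared = []
    for name, holder in nodes.items():
        if holder is not None and holder not in alive_conductors:
            nodes[name] = None  # release the stale lock
            cleared.append(name)
    return cleared

nodes = {
    "ironic-0": "overcloud-controller-2.localdomain",  # holder is down
    "ironic-1": "overcloud-controller-0.localdomain",
    "ironic-2": None,
}
alive = {"overcloud-controller-0.localdomain",
         "overcloud-controller-1.localdomain"}
print(clear_orphan_reservations(nodes, alive))
# ['ironic-0']
```

The bug QA hit was that this path never ran to completion for the stuck node, so the reservation survived indefinitely (16 hours in comment 19).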

Comment 19 Dan Yasny 2017-05-05 13:40:25 UTC
(In reply to Dmitry Tantsur from comment #18)
> Dan, how much time did you wait? Cleaning up reservations is certainly not
> instant.

16 hours so far - still stuck

Comment 23 Dmitry Tantsur 2018-02-16 11:32:28 UTC
The bug Dan hit was fixed in OSP 10 in https://github.com/openstack/ironic/commit/d52077f4fe8c668b258702e8298a4beaa19476d8. However, there is one missing change for proper take over, attaching it.

Comment 24 Dmitry Tantsur 2018-02-16 12:13:19 UTC
And one more change to complete the picture.

Comment 25 Bob Fournier 2018-03-01 22:00:17 UTC
It looks like all patches are in stable/queens.  Moving to POST.

Comment 26 Dmitry Tantsur 2018-03-05 09:52:33 UTC
I would like https://review.openstack.org/#/c/546273/ to also get in as part of this work, so moving back to ON_DEV for now. Sorry for not updating earlier.

Comment 27 Bob Fournier 2018-04-16 13:26:38 UTC
As https://review.openstack.org/#/c/554202/ has landed, which is the backport for https://review.openstack.org/#/c/546273/, moving to POST.

Comment 29 mlammon 2018-05-15 22:07:57 UTC
Install latest osp 13 puddle:2018-05-10.3


Step 1)
(overcloud) [stack@undercloud-0 ~]$ ironic node-list
The "ironic" CLI is deprecated and will be removed in the S* release. Please use the "openstack baremetal" CLI instead.
+--------------------------------------+----------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name     | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+----------+--------------------------------------+-------------+--------------------+-------------+
| e5b6f81f-857b-4867-a7f5-729769609d93 | ironic-0 | 82010342-9c04-421f-8cb0-1ab2277786b3 | power on    | active             | False       |
| 4f0ad22a-a246-40b4-8656-04031b3630cb | ironic-1 | None                                 | power off   | available          | False       |
+--------------------------------------+----------+--------------------------------------+-------------+--------------------+-------------+



Step 2)
(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node list
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name         | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| dbec5044-73ad-4301-b5b9-f7a12194216c | ceph-0       | f7a92aa8-f094-4e34-b6c2-5511173472bd | power on    | active             | False       |
| 2047d948-5619-4e44-ab3d-ce62c29f469e | ceph-1       | c25e6d66-f671-43a7-a7f8-e7e79d1ff963 | power on    | active             | False       |
| 47f9dc3c-dd9f-48f0-bd60-611e56ca5d91 | ceph-2       | a3182ebd-b469-4b5f-bf39-298015ff8ae7 | power on    | active             | False       |
| 3c4e1fe8-90a5-4999-be51-c90cf6cbf40a | compute-0    | e18cae11-2f9a-446f-ba55-e708494e0f7d | power on    | active             | False       |
| 58168f45-2080-4a0d-aec2-41a23977840f | controller-0 | f578b76f-3ca4-4566-b2ac-d2d81795aae2 | power on    | active             | False       |
| 8d14cf19-dba8-4194-a711-0454f728d2eb | controller-1 | 8e6a507a-9c4a-464f-8bcc-08ecf9ed059e | power on    | active             | False       |
| b70d02c2-96a0-4de1-a56b-5529baf62f42 | controller-2 | 61eaeb8f-843a-4e29-817e-a6256de5b2dc | power on    | active             | False       |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+


Step 3)
(overcloud) [stack@undercloud-0 ~]$ ironic node-set-power-state ironic-0 off
The "ironic" CLI is deprecated and will be removed in the S* release. Please use the "openstack baremetal" CLI instead.
(overcloud) [stack@undercloud-0 ~]$ ironic node-show ironic-0
The "ironic" CLI is deprecated and will be removed in the S* release. Please use the "openstack baremetal" CLI instead.
+------------------------+--------------------------------------------------------------------------+
| Property               | Value                                                                    |
+------------------------+--------------------------------------------------------------------------+
| boot_interface         | None                                                                     |
| chassis_uuid           | None                                                                     |
| clean_step             | {}                                                                       |
| console_enabled        | False                                                                    |
| console_interface      | None                                                                     |
| created_at             | 2018-05-15T17:13:01+00:00                                                |
| deploy_interface       | None                                                                     |
| driver                 | pxe_ipmitool                                                             |
| driver_info            | {u'ipmi_port': u'6234', u'ipmi_username': u'admin', u'deploy_kernel':    |
|                        | u'90e6217f-0839-4833-ae12-76d7a70d3866', u'ipmi_address': u'172.16.0.1', |
|                        | u'deploy_ramdisk': u'4bc40ea6-33fe-407c-bb54-e485b9a7f0e3',              |
|                        | u'ipmi_password': u'******'}                                             |
| driver_internal_info   | {u'agent_cached_clean_steps_refreshed': u'2018-05-15 17:14:29.562653',   |
|                        | u'agent_cached_clean_steps': {u'deploy': [{u'priority': 99,              |
|                        | u'interface': u'deploy', u'reboot_requested': False, u'abortable': True, |
|                        | u'step': u'erase_devices_metadata'}, {u'priority': 10, u'interface':     |
|                        | u'deploy', u'reboot_requested': False, u'abortable': True, u'step':      |
|                        | u'erase_devices'}]}, u'clean_steps': None, u'hardware_manager_version':  |
|                        | {u'generic_hardware_manager': u'1.1'}, u'is_whole_disk_image': False,    |
|                        | u'agent_continue_if_ata_erase_failed': False,                            |
|                        | u'agent_erase_devices_iterations': 1, u'agent_erase_devices_zeroize':    |
|                        | True, u'root_uuid_or_disk_id': u'0d0b8fbf-db98-4612-b551-81fb39aacaec',  |
|                        | u'agent_version': u'3.2.1.dev2', u'agent_url':                           |
|                        | u'http://192.168.24.44:9999'}                                            |
| extra                  | {}                                                                       |
| inspect_interface      | None                                                                     |
| inspection_finished_at | None                                                                     |
| inspection_started_at  | None                                                                     |
| instance_info          | {u'root_gb': u'20', u'display_name': u'instance2', u'image_source':      |
|                        | u'b852a157-dc53-4e94-9515-2ce4772f04a6', u'memory_mb': u'1024',          |
|                        | u'vcpus': u'1', u'local_gb': u'40', u'configdrive': u'******',           |
|                        | u'swap_mb': u'0', u'nova_host_id': u'overcloud-                          |
|                        | controller-2.localdomain'}                                               |
| instance_uuid          | 82010342-9c04-421f-8cb0-1ab2277786b3                                     |
| last_error             | None                                                                     |
| maintenance            | False                                                                    |
| maintenance_reason     | None                                                                     |
| management_interface   | None                                                                     |
| name                   | ironic-0                                                                 |
| network_interface      | flat                                                                     |
| power_interface        | None                                                                     |
| power_state            | power on                                                                 |
| properties             | {u'memory_mb': u'4096', u'cpu_arch': u'x86_64', u'local_gb': u'40',      |
|                        | u'cpus': u'4', u'capabilities': u'boot_option:local'}                    |
| provision_state        | active                                                                   |
| provision_updated_at   | 2018-05-15T17:24:10+00:00                                                |
| raid_config            | {}                                                                       |
| raid_interface         | None                                                                     |
| reservation            | overcloud-controller-2.localdomain                                       |
| resource_class         | None                                                                     |
| storage_interface      | noop                                                                     |
| target_power_state     | power off                                                                |
| target_provision_state | None                                                                     |
| target_raid_config     | {}                                                                       |
| traits                 |                                                                          |
| updated_at             | 2018-05-15T21:57:25+00:00                                                |
| uuid                   | e5b6f81f-857b-4867-a7f5-729769609d93                                     |
| vendor_interface       | None                                                                     |
+------------------------+--------------------------------------------------------------------------+

As soon as I saw the reservation appear in the Step 3 output while powering off ironic-0, I issued the following in Step 4.

Step 4)
(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node reboot b70d02c2-96a0-4de1-a56b-5529baf62f42

(overcloud) [stack@undercloud-0 ~]$ ironic node-show ironic-0 | grep reservation
The "ironic" CLI is deprecated and will be removed in the S* release. Please use the "openstack baremetal" CLI instead.
| reservation            | None


I repeated this a couple of times.
I did not see any hang-up with the reservation. I was able to power off/on several more times.

Will check with dtantsur if this is sufficient for verification.

Comment 30 Dmitry Tantsur 2018-05-16 16:13:24 UTC
Yeah, it seems that the problem from comment 17 is gone.

Comment 32 errata-xmlrpc 2018-06-27 13:29:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086

