Bug 1489430 - Node count validation does not take into account nodes that are active and in maintenance
Summary: Node count validation does not take into account nodes that are active and in maintenance
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-validations
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: beta
Target Release: 14.0 (Rocky)
Assignee: Dmitry Tantsur
QA Contact: Alexander Chuzhoy
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-09-07 12:37 UTC by Robin Cernin
Modified: 2020-12-14 09:55 UTC
CC List: 11 users

Fixed In Version: openstack-tripleo-validations-9.1.1-0.20180618123656.d21e7fa.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-11 11:48:07 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 563152 0 None None None 2018-08-14 12:34:49 UTC
Red Hat Product Errata RHEA-2019:0045 0 None None None 2019-01-11 11:48:38 UTC

Description Robin Cernin 2017-09-07 12:37:31 UTC
Description of problem:

We are hitting a validation bug.

Let me explain with an example case:

0) initial deployment with 2 compute nodes and 3 controllers
  - nova list (shows 2 compute-N and 3 controller-N in ACTIVE)
1) then you set one of the ironic nodes that was already deployed to maintenance
  - you will have 4 nodes + 1 node in maintenance (all deployed, with ACTIVE instances associated)
2) then you add an additional node to ironic; you will have 5 nodes + 1 node in maintenance
3) try to scale out compute with --compute-scale 3
  - this fails the validation because ironic only finds 2 compute nodes that are not in maintenance; it does not account for the one that is already deployed but in maintenance (see the sketch below).
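
For illustration only, here is a rough Python sketch of the kind of counting that produces this failure. It is not the actual tripleo validation code, and the node data below is made up to mirror the scenario above.

# Illustrative only; not the real tripleo-validations logic.
# Nodes in maintenance are skipped entirely, even when they already
# carry a deployed (associated) instance.
def count_usable_nodes(nodes):
    usable = 0
    for node in nodes:
        if node["maintenance"]:   # the deployed-but-in-maintenance compute is dropped here
            continue
        usable += 1
    return usable

nodes = [
    {"name": "compute-0", "maintenance": True,  "instance_uuid": "instance-a"},  # deployed
    {"name": "compute-1", "maintenance": False, "instance_uuid": "instance-b"},  # deployed
    {"name": "compute-2", "maintenance": False, "instance_uuid": None},          # newly added
]
print(count_usable_nodes(nodes))  # prints 2, so a request for --compute-scale 3 fails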

Result:

We cannot scale compute nodes to 3

Expected result:

Scale-out should account for nodes that are already deployed, even though they are in maintenance.

Comment 1 Alex Schultz 2017-09-07 22:16:47 UTC
This seems like NOTABUG: Heat must be able to manage the existing systems, and when they are in 'maintenance' they are technically not manageable, which means the validation is correct. Why is the customer putting them in maintenance mode?

Comment 2 Robin Cernin 2017-09-08 06:55:59 UTC
The validation is correct, but only from the validation's point of view. Why can't we check for nova ACTIVE instances instead? As it stands, it does not allow me to scale out while one of the *already* deployed nodes is in maintenance. A node in ironic maintenance state is disabled in the nova/neutron agents, yet my overcloud is working just fine.
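
A minimal sketch of the alternative I mean, assuming the openstacksdk client on the undercloud; the cloud name "undercloud" and the idea of counting ACTIVE servers are illustrative assumptions, not existing validation code.

# Hypothetical alternative check: count overcloud servers that nova
# reports as ACTIVE instead of relying on ironic's maintenance flag.
import openstack

conn = openstack.connect(cloud="undercloud")  # cloud name is an assumption
active_servers = [s for s in conn.compute.servers() if s.status == "ACTIVE"]
print(len(active_servers), "servers are already deployed and ACTIVE")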

Comment 3 Alex Schultz 2017-09-08 14:44:59 UTC
From a system perspective, it's using ironic's available flag on the hardware nodes themselves as the indicator, not the active instances. Given that scale-out actions may need to touch these systems to update hostnames, network config, etc., they must be manageable. Prior to running the action it may not be known which specific systems need to be touched, so this seems to be more of a pre-validation check. I believe it's currently all a single stack update, so scale-out/scale-down/deploy don't follow different code paths. In the future this might change, but right now this is NOTABUG and is expected. Since I think this error comes from the schedulers, maybe the HardProv DFG has more insight into this requirement. Let's check with them.

Comment 4 Dmitry Tantsur 2017-09-19 17:37:59 UTC
Yeah, we currently skip all nodes in maintenance, while active nodes should be accounted for.

This is a valid bug, but I cannot really promise to backport it all the way to Mitaka.

Comment 5 Dmitry Tantsur 2018-08-14 12:34:50 UTC
Hi!

This code has since been moved to tripleo-validations, and apparently at some point the check was modified so that it no longer requires associated nodes to be out of maintenance: https://github.com/openstack/tripleo-validations/blob/master/validations/lookup_plugins/ironic_nodes.py#L97-L99.
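
Paraphrasing that behaviour as a hedged sketch (the linked ironic_nodes.py is the authoritative code; this is not a quote of it), the change amounts to excluding a maintenance node only when it has no instance associated with it:

# Rough paraphrase, not the actual tripleo-validations code:
# a node in maintenance still counts as long as it is already deployed,
# i.e. it has an instance associated with it.
def counts_toward_scale(node):
    if node["maintenance"] and node["instance_uuid"] is None:
        return False   # in maintenance and not deployed: unusable
    return True        # available, or deployed even if in maintenance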

I assume the bug was fixed by it. Unfortunately, since it's completely new code, backporting is too complicated. Please clone this bug to earlier releases if you *really* need it.

P.S.
You really should not do anything with your stack while any nodes are in maintenance.

Comment 9 Alexander Chuzhoy 2018-09-21 17:33:48 UTC
Verified:
Environment: openstack-tripleo-validations-9.3.1-0.20180831205305.fbfd253.el7ost.noarch


Deployed overcloud with 1 compute.
Then switched the deployed compute to maintenance:
(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node list
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name         | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| ef65dc52-eb7f-4b38-89ca-7e2718c12e2a | ceph-0       | e48c4727-a72b-4db9-b110-729ea96f049f | power on    | active             | False       |
| da80cdec-b2fa-4beb-9589-4520edd7c608 | ceph-1       | a3bcd336-906c-4527-8755-45e8909e27fb | power on    | active             | False       |
| 22d9402d-812b-4231-98bd-cb7d8cf1df04 | ceph-2       | 73b81108-7d44-4772-95e1-1d9a96a80f5e | power on    | active             | False       |
| 5f322891-6b43-4ed8-b664-d8fb0129e27c | compute-0    | 4ef08b13-3fba-49c8-bb92-a89f732c5f27 | power on    | active             | True        |
| 8c296081-1aba-4d99-87d1-b8094d0e6719 | compute-1    | None                                 | power off   | available          | False       |
| 751fa3e0-d72e-4cb5-aa8a-8b174d941655 | controller-0 | f6a3fa9a-10f9-4c5d-ae50-95f0021773d9 | power on    | active             | False       |
| 19252a73-f9fd-470c-a61f-2cc11991e3f3 | controller-1 | bc301185-6dcf-48e4-9e72-deedf1f5dac8 | power on    | active             | False       |
| 1ec67f4a-017c-42ee-a787-22e5510dc62b | controller-2 | 11cd6f31-6750-459f-b0a4-97841da72264 | power on    | active             | False       |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+



Re-ran the overcloud deployment command, only this time with "ComputeCount: 2".

The deployment completed successfully and now there are 2 computes.
(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| aa2a18c3-6be0-480b-897a-7f9e49b63e00 | compute-1    | ACTIVE | ctlplane=192.168.24.20 | overcloud-full | compute    |
| 11cd6f31-6750-459f-b0a4-97841da72264 | controller-2 | ACTIVE | ctlplane=192.168.24.16 | overcloud-full | controller |
| 73b81108-7d44-4772-95e1-1d9a96a80f5e | ceph-0       | ACTIVE | ctlplane=192.168.24.9  | overcloud-full | ceph       |
| bc301185-6dcf-48e4-9e72-deedf1f5dac8 | controller-1 | ACTIVE | ctlplane=192.168.24.8  | overcloud-full | controller |
| f6a3fa9a-10f9-4c5d-ae50-95f0021773d9 | controller-0 | ACTIVE | ctlplane=192.168.24.15 | overcloud-full | controller |
| e48c4727-a72b-4db9-b110-729ea96f049f | ceph-2       | ACTIVE | ctlplane=192.168.24.21 | overcloud-full | ceph       |
| 4ef08b13-3fba-49c8-bb92-a89f732c5f27 | compute-0    | ACTIVE | ctlplane=192.168.24.17 | overcloud-full | compute    |
| a3bcd336-906c-4527-8755-45e8909e27fb | ceph-1       | ACTIVE | ctlplane=192.168.24.10 | overcloud-full | ceph       |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+

Ironic looks as follows:
(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node list
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name         | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| ef65dc52-eb7f-4b38-89ca-7e2718c12e2a | ceph-0       | e48c4727-a72b-4db9-b110-729ea96f049f | power on    | active             | False       |
| da80cdec-b2fa-4beb-9589-4520edd7c608 | ceph-1       | a3bcd336-906c-4527-8755-45e8909e27fb | power on    | active             | False       |
| 22d9402d-812b-4231-98bd-cb7d8cf1df04 | ceph-2       | 73b81108-7d44-4772-95e1-1d9a96a80f5e | power on    | active             | False       |
| 5f322891-6b43-4ed8-b664-d8fb0129e27c | compute-0    | 4ef08b13-3fba-49c8-bb92-a89f732c5f27 | power on    | active             | True        |
| 8c296081-1aba-4d99-87d1-b8094d0e6719 | compute-1    | aa2a18c3-6be0-480b-897a-7f9e49b63e00 | power on    | active             | False       |
| 751fa3e0-d72e-4cb5-aa8a-8b174d941655 | controller-0 | f6a3fa9a-10f9-4c5d-ae50-95f0021773d9 | power on    | active             | False       |
| 19252a73-f9fd-470c-a61f-2cc11991e3f3 | controller-1 | bc301185-6dcf-48e4-9e72-deedf1f5dac8 | power on    | active             | False       |
| 1ec67f4a-017c-42ee-a787-22e5510dc62b | controller-2 | 11cd6f31-6750-459f-b0a4-97841da72264 | power on    | active             | False       |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+

Comment 11 errata-xmlrpc 2019-01-11 11:48:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045

