Bug 1688838

Summary: Ironic should not treat cpu_arch as mandatory
Product: Red Hat OpenStack
Component: openstack-nova
Version: 14.0 (Rocky)
Status: CLOSED EOL
Severity: medium
Priority: medium
Reporter: Bob Fournier <bfournie>
Assignee: melanie witt <mwitt>
QA Contact: OSP DFG:Compute <osp-dfg-compute>
CC: aarapov, bfournie, dasmith, dtantsur, eglynn, jhakimra, kchamart, mariel, mburns, mgarciac, mwitt, nlevinki, rpittau, sbauza, sgordon, tonyb, vromanso
Keywords: Triaged, ZStream
Hardware: Unspecified
OS: Unspecified
Clone Of: 1653788
Bug Depends On: 1653788
Last Closed: 2021-07-06 11:21:39 UTC

Comment 2 Bob Fournier 2019-03-14 14:53:48 UTC
Cloned bug to continue testing.

Comment 4 Dmitry Tantsur 2019-03-18 13:23:46 UTC
Should we move this to Nova now that the Ironic issues seem to be ruled out?

Comment 5 mlammon 2019-03-18 14:03:52 UTC
Hi Dmitry,
I guess it makes sense if Ironic has exhausted its effort. Perhaps you can provide a brief summary and hand off?
It seems some coordination between the two projects is needed.

Mike

Comment 6 Bob Fournier 2019-05-09 17:54:05 UTC
Fix has been released, marking this as TestOnly

Comment 7 Bob Fournier 2019-05-10 17:00:39 UTC
Moving this back to ASSIGNED: it still shows the same nova error, which is expected because the fix was only made in Ironic. I moved this to ON_QA prematurely.

2019-05-10 12:26:18.839 1 WARNING nova.virt.ironic.driver [req-7a1f0eca-fa01-4ab3-8e74-0c0ef4e007c7 - - - - -] cpu_arch not defined for node 'c795c6d0-6877-408d-bd20-73bf575201a6'
2019-05-10 12:26:18.998 1 WARNING nova.virt.ironic.driver [req-7a1f0eca-fa01-4ab3-8e74-0c0ef4e007c7 - - - - -] cpu_arch not defined for node 'c795c6d0-6877-408d-bd20-73bf575201a6'

Including the nova team on this, as it appears we need a fix there.

Comment 8 Bob Fournier 2019-05-29 15:21:32 UTC
Moving this to the nova team to have a look, as it appears that a fix is needed in nova.virt.ironic.driver; see Comment 4 and Comment 7.

Comment 11 melanie witt 2019-10-25 03:01:20 UTC
Hi, based on the comments in comment 0, I understand that nova is returning an empty list of supported nodes [1] when cpu_arch is missing or not specified and this causes an overcloud deployment to fail. I need a bit of advice from someone in the ironic team about what we should return for supported nodes when cpu_arch is not specified. Should it just be something like this? Simply omit cpu_arch as one of the supported node options?

    return [(obj_fields.HVType.BAREMETAL,
             obj_fields.VMMode.HVM)]

[1] https://github.com/openstack/nova/blob/1bfa4626d13d0a73e63745cc4a864ae86d490daf/nova/virt/ironic/driver.py#L103-L109

Comment 13 Dmitry Tantsur 2019-12-09 10:42:44 UTC
Hi Melanie,

I don't really know what the "supported_instances" field does in Nova. Will it work with two-component tuples? Ironic doesn't have any opinion on that.

Comment 14 melanie witt 2019-12-11 20:30:54 UTC
(In reply to Dmitry Tantsur from comment #13)
> Hi Melanie,
> 
> I don't really know what the "supported_instances" field does in Nova. Will
> it work with two-component tuples? Ironic doesn't have any opinion on that.

Hi Dmitry, thanks for responding. Apologies that I don't know much about this area in virt driver land and thus have suggested something that doesn't make sense.

I've dug around in the code some more and found that it would *not* be valid for 'supported_instances' to be a two-component tuple. It's not officially documented anywhere that I could find, but in the libvirt driver it's described as "a list of tuples that describe instances the hypervisor is capable of hosting. Each tuple consists of the triplet (arch, hypervisor_type, vm_mode)":

https://github.com/openstack/nova/blob/d9dc3668f86b57467f4a84133a0c1ed2092ec2fe/nova/virt/libvirt/driver.py#L6484-L6488

So, in the ironic driver case, if cpu_arch is None, we're returning an empty list, else we return (cpu_arch, obj_fields.HVType.BAREMETAL, obj_fields.VMMode.HVM).

I'm trying to figure out what we can return if cpu_arch is None. Would it be (None, obj_fields.HVType.BAREMETAL, obj_fields.VMMode.HVM)? Or something else? I need to do more digging.
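
For reference, the current behavior described above can be sketched as follows. This is only a minimal stand-alone approximation, not the actual nova code: the function name is hypothetical, and the two string constants stand in for obj_fields.HVType.BAREMETAL and obj_fields.VMMode.HVM (their values here are assumed so the sketch runs without nova installed).

```python
# Stand-ins for nova's obj_fields.HVType.BAREMETAL and obj_fields.VMMode.HVM.
# The string values are assumptions made so the sketch runs without nova.
HV_TYPE_BAREMETAL = "baremetal"
VM_MODE_HVM = "hvm"


def supported_instances(cpu_arch=None):
    """Return the (arch, hypervisor_type, vm_mode) triplets for one node."""
    if not cpu_arch:
        # No cpu_arch on the ironic node: nothing is advertised at all.
        return []
    # Otherwise advertise exactly one triplet built from the node's cpu_arch.
    return [(cpu_arch, HV_TYPE_BAREMETAL, VM_MODE_HVM)]
```

The open question in this comment is what the first branch should return instead of the empty list.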

Comment 15 melanie witt 2019-12-12 02:29:52 UTC
I investigated more into the nova side of this and learned about how the cpu_arch is used in the scheduling process. Virt drivers in nova can advertise an 'architecture', 'hypervisor_type', and 'vm_mode' which correspond to glance image properties [1]. The nova ironic driver advertises this according to what is found in the ironic node property for 'cpu_arch'. If 'cpu_arch' is present and not None, the ironic driver will advertise that 'cpu_arch' as part of the host capabilities for scheduling: (<cpu_arch>, obj_fields.HVType.BAREMETAL, obj_fields.VMMode.HVM). If the 'cpu_arch' is absent or None, the ironic driver will not advertise any cpu_arch or hypervisor_type or vm_mode for the ironic node.

Now, if the ImagePropertiesFilter is being used as part of the filter_scheduler configuration (it is by default [2]), a host (an ironic node in this case) will only be selected as a match for scheduling if the 'architecture', 'hypervisor_type', and 'vm_mode' properties of the image specified in the instance create request match what is being advertised by the ironic driver. The behavior in this bug report shows that no host (ironic node) matched ("No valid host was found. There are not enough hosts available."), which implies that the image metadata for the specified image contains 'architecture', 'hypervisor_type', and 'vm_mode' properties. Such an image will *not* match any host (ironic node) because there is no cpu_arch on the ironic nodes [3].
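
A rough sketch of that matching rule may help. It is a simplified approximation of the filter logic linked in [3], not a copy of it: the function name and the plain dict/tuple data shapes are hypothetical.

```python
def image_matches_host(image_props, host_supported_instances):
    """Approximate the ImagePropertiesFilter check for a single host.

    image_props: dict that may contain 'architecture', 'hypervisor_type'
        and 'vm_mode' (hypothetical shape, for illustration only).
    host_supported_instances: list of (arch, hypervisor_type, vm_mode)
        triplets advertised by the virt driver for this host.
    """
    wanted = (
        image_props.get("architecture"),
        image_props.get("hypervisor_type"),
        image_props.get("vm_mode"),
    )
    # An image that requests none of the three properties fits any host.
    if all(value is None for value in wanted):
        return True
    # The image requests something, but the host advertises nothing.
    if not host_supported_instances:
        return False
    # Every requested property must equal the advertised one; properties
    # the image leaves unset are treated as wildcards.
    return any(
        all(w is None or w == h for w, h in zip(wanted, triplet))
        for triplet in host_supported_instances
    )
```

With an empty advertised list, which is what a node without cpu_arch produces, any image that sets even one of these properties is rejected.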

Now, the interesting thing is that this behavior in nova has been the same since the ironic driver code was first added in 2014 [4], which was Liberty (OSP8).

I don't believe the ironic commit mentioned in comment 0 [5] is related to the change in behavior in OSP14 and I don't think the fix that landed in ironic [6] would do anything to change the behavior either.

Is it possible that there was a change in the glance image being used to create instances between OSP13 and OSP14 in the original bug report? I know I'm asking this several months after it was first reported, but do we know what glance image properties are on the images being used in the instance create requests? Do they have 'architecture', 'hypervisor_type', and 'vm_mode' image metadata associated with them? Have they always (prior to OSP14)?

We still could make a change in nova to address this issue (like making a change to the ironic driver to use a default cpu_arch, maybe "nova.virt.arch.ALL" if cpu_arch is not part of the ironic node properties) but I'm trying to understand how this regression in behavior happened first and let that inform our decision about what to do next. Nothing has changed in nova related to this between OSP13 and OSP14, so I'm wondering if it's possible something changed with the images. Please let me know.
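
To make the "default cpu_arch" idea concrete, here is one possible shape it could take. This is purely illustrative and not an agreed design: the function name is hypothetical, the fallback list is made up for the example (nova's real set of architecture names is not reproduced here), and "baremetal" / "hvm" are stand-ins for the obj_fields constants.

```python
# Hypothetical fallback list used only for this illustration.
FALLBACK_ARCHES = ("x86_64", "aarch64", "ppc64le")


def supported_instances_with_default(cpu_arch=None):
    """Advertise the node's cpu_arch, or every fallback arch if it is unset."""
    arches = (cpu_arch,) if cpu_arch else FALLBACK_ARCHES
    # "baremetal" and "hvm" stand in for HVType.BAREMETAL and VMMode.HVM.
    return [(arch, "baremetal", "hvm") for arch in arches]
```

Under this assumption, a node without cpu_arch would match an image requesting any of the fallback architectures, instead of matching none.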

[1] https://docs.openstack.org/glance/rocky/admin/useful-image-properties.html
[2] https://docs.openstack.org/nova/rocky/configuration/config.html#filter_scheduler.enabled_filters
[3] https://github.com/openstack/nova/blob/d9474dde7291d5da77db11d5e390ecdce3305a10/nova/scheduler/filters/image_props_filter.py#L68-L75
[4] https://github.com/openstack/nova/commit/9864a729fa7ec807d8cf084f134c6ef5c9f93d35#diff-1e4547e2c3b36b8f836d8f851f85fde7R109-R113
[5] https://github.com/openstack/ironic/commit/8f89954f9a3edab71db1157578e9d029a395d2d9
[6] https://review.opendev.org/621539

Comment 16 Riccardo Pittau 2019-12-12 08:59:52 UTC
Hey Melanie,

that's quite an interesting analysis, thank you for that.

I just wanted to add that the code to retrieve the cpu architecture is roughly 4 to 5 years old and is based on lscpu output [1], so I think it predates OSP14. I don't see this specific issue as a regression; we probably hit an edge case.
It might be that something has changed in the tool itself (lscpu) in the meantime, and it wouldn't be the first time that I've seen it fail to report a piece of information (architecture, number of cpus, model, etc.) or report it in non-standard output.
Especially for the cpu architecture, we're used to seeing common ones, like x86_64, but a non-standard one could be detected, or even nothing at all.
It would actually be very interesting to see the output of lscpu on the node where cpu_arch is reported as absent.
The change [2] that has landed in ironic just takes into consideration the case where we don't have a value for the cpu architecture and want to avoid issues on the ironic side; it was something that needed to be added.
We could assign a default value on the ironic side, but that would be wrong in my opinion.
In this case, there should be a mechanism on the nova side to handle the absence of cpu_arch; I'm totally for a default generic value, as you suggested.
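
To illustrate the edge case, here is a minimal sketch of pulling the architecture out of lscpu-style "Key: value" text. It does not reproduce the ironic-python-agent code linked in [1]; the function name is hypothetical, and it only assumes that the relevant line starts with "Architecture:".

```python
def parse_cpu_arch(lscpu_output):
    """Return the 'Architecture' value from lscpu-style text, or None."""
    for line in lscpu_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Architecture":
            # An empty value is treated the same as a missing field.
            return value.strip() or None
    # The field is not present in the output at all.
    return None
```

A caller then has to decide what to do with None, which is the case being discussed here.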

I'm curious to see what other people think also, and we can even continue the discussion on a different channel.

[1] https://opendev.org/openstack/ironic-python-agent/src/branch/master/ironic_python_agent/hardware.py#L873
[2] https://review.opendev.org/621539

Comment 17 melanie witt 2019-12-13 01:54:21 UTC
(In reply to Riccardo Pittau from comment #16)
> Hey Melanie,
> 
> that's quite an interesting analysis, thank you for that.
> 
> I just wanted to add that the code to retrieve the cpu architecture is
> roughly between 4 and 5 years old and it's based on lscpu output [1], so I
> think it's prior OSP14, and I don't see this specific issue as a regression,
> but we probably touched an edge case.
> It might be that in the meantime something has changed in the tool itself
> (lscpu), but it wouldn't be the first time that I see it not reporting a bit
> of information (architecture, number of cpus, model, etc.), or reporting it
> with a non standard output.
> Especially for the cpu architecture, we're used to see common ones, like
> x86_64, but it could happen that a non standard one is detected, or even
> nothing.
> It would actually be very interesting to see the output of lscpu on the node
> where we have cpu_arch reported as absent.
> The change [2] that has landed in ironic, just takes into consideration the
> case where we don't have a value for the cpu architecture and we want to
> avoid issues on ironic side, it was something that needed to be added.
> We could assign a default value on ironic side, but that would be wrong in
> my opinion.
> In this case, there should be a mechanism on nova side to handle the absence
> of cpu_arch, I'm totally for a default generic value as you suggested.
> 
> I'm curious to see what other people think also, and we can even continue
> the discussion on a different channel.
> 
> [1]
> https://opendev.org/openstack/ironic-python-agent/src/branch/master/
> ironic_python_agent/hardware.py#L873
> [2] https://review.opendev.org/621539

Thanks for all that info, Riccardo. It's helpful to have that background captured on this rhbz.

Today I've tried asking around in the #openstack-nova channel about this and discovered it's a more complicated issue without an obvious answer, so I've started a thread on the openstack-discuss@ ML upstream to get more thoughts and responses from the larger community:

http://lists.openstack.org/pipermail/openstack-discuss/2019-December/011558.html

Please feel free to add your thoughts on the thread!

Comment 18 Dmitry Tantsur 2019-12-13 12:28:29 UTC
Hi Melanie,

great write-up, thanks!

> I'm trying to understand how this regression in behavior happened first

I don't think there was a regression. We have been treating cpu_arch as mandatory for a long time. This bugzilla was filed because making cpu_arch mandatory seems redundant from a user-experience point of view on a single-architecture cluster. That being said, I can easily believe that the behavior we're seeing has been there forever.

If we need to change Nova to modify it, we should probably re-qualify this bugzilla as an RFE.

Riccardo,

> It might be that in the meantime something has changed in the tool itself (lscpu), but it wouldn't be the first time that I see it not reporting a bit of information (architecture, number of cpus, model, etc.), or reporting it with a non standard output.

It's not about inspection not reporting the architecture (it still does), it's more about not requiring fields we do not need.

Dmitry